{
"cells": [
{
"cell_type": "markdown",
"id": "527a835c-afc5-4df7-a924-d5f61a417cf2",
"metadata": {},
"source": [
"# Creating Evals with synthetic data and measuring hallucinations\n",
"\n",
"When you deploy Llama for your use case, it is a good practice to have Evals for your use case. Though it might be ideal to have human annotated Evals, this notebook shows a strategy fow how one might go about addressing this using synthetic data. However, the Evals generated still requires validation by a human to make sure that your production use case can rely on this. \n",
"The notebook also shows how one could accurately measure hallucinations without using LLM-As-A-Judge methodology using Llama"
]
},
{
"cell_type": "markdown",
"id": "acfdf84f-0f4e-4684-83b0-9d2657441886",
"metadata": {},
"source": [
"## Overall idea\n",
"\n",
"Let's assume we have a use case for generating a summarization report based on a given context, which is a pretty common use case with LLM. Both the context and the report have a lot of factual information and we want to make sure the generated report is not hallucinating.\n",
"\n",
"Since its not trivial to find an open source dataset for this, the idea is to take synthetic tabular data and then use Llama to generate a story(context) for every row of the tabular data using Prompt Engineering. Then we ask Llama to summarize the generated context as a report in a specific format using Prompt Engineering. Finally we check the factual accuracy of the generated report using Llama by converting this into a QA task using the tabular data as the ground truth.\n",
"\n",
"To generate synthetic data for this approach, we use an open source tool like [Synthetic Data Vault](https://github.com/sdv-dev/SDV)\n",
"\n",
"The overall workflow is shown in the below diagram\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "1d1e65c0-e9ae-41eb-b1a2-165d09dfacbf",
"metadata": {},
"source": [
"### Synthetic Data Vault installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff645d5c-b441-40cc-97c2-451a2c103ba2",
"metadata": {},
"outputs": [],
"source": [
"!pip install sdv"
]
},
{
"cell_type": "markdown",
"id": "5e37994a-1683-4c86-b4fa-e3e49563137c",
"metadata": {},
"source": [
"SDV has a number of single table datasets. We choose `student_placements` dataset for this notebook"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4f0281a7-027d-4e4c-bbd8-7bda766a2183",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dataset_name | \n",
" size_MB | \n",
" num_tables | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" KRK_v1 | \n",
" 0.06 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" adult | \n",
" 3.91 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" alarm | \n",
" 4.52 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" asia | \n",
" 1.28 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" census | \n",
" 98.17 | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" census_extended | \n",
" 4.95 | \n",
" 1 | \n",
"
\n",
" \n",
" 6 | \n",
" child | \n",
" 3.20 | \n",
" 1 | \n",
"
\n",
" \n",
" 7 | \n",
" covtype | \n",
" 255.65 | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" credit | \n",
" 68.35 | \n",
" 1 | \n",
"
\n",
" \n",
" 9 | \n",
" expedia_hotel_logs | \n",
" 0.20 | \n",
" 1 | \n",
"
\n",
" \n",
" 10 | \n",
" fake_companies | \n",
" 0.00 | \n",
" 1 | \n",
"
\n",
" \n",
" 11 | \n",
" fake_hotel_guests | \n",
" 0.03 | \n",
" 1 | \n",
"
\n",
" \n",
" 12 | \n",
" grid | \n",
" 0.32 | \n",
" 1 | \n",
"
\n",
" \n",
" 13 | \n",
" gridr | \n",
" 0.32 | \n",
" 1 | \n",
"
\n",
" \n",
" 14 | \n",
" insurance | \n",
" 3.34 | \n",
" 1 | \n",
"
\n",
" \n",
" 15 | \n",
" intrusion | \n",
" 162.04 | \n",
" 1 | \n",
"
\n",
" \n",
" 16 | \n",
" mnist12 | \n",
" 81.20 | \n",
" 1 | \n",
"
\n",
" \n",
" 17 | \n",
" mnist28 | \n",
" 439.60 | \n",
" 1 | \n",
"
\n",
" \n",
" 18 | \n",
" news | \n",
" 18.71 | \n",
" 1 | \n",
"
\n",
" \n",
" 19 | \n",
" ring | \n",
" 0.32 | \n",
" 1 | \n",
"
\n",
" \n",
" 20 | \n",
" student_placements | \n",
" 0.03 | \n",
" 1 | \n",
"
\n",
" \n",
" 21 | \n",
" student_placements_pii | \n",
" 0.03 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dataset_name size_MB num_tables\n",
"0 KRK_v1 0.06 1\n",
"1 adult 3.91 1\n",
"2 alarm 4.52 1\n",
"3 asia 1.28 1\n",
"4 census 98.17 1\n",
"5 census_extended 4.95 1\n",
"6 child 3.20 1\n",
"7 covtype 255.65 1\n",
"8 credit 68.35 1\n",
"9 expedia_hotel_logs 0.20 1\n",
"10 fake_companies 0.00 1\n",
"11 fake_hotel_guests 0.03 1\n",
"12 grid 0.32 1\n",
"13 gridr 0.32 1\n",
"14 insurance 3.34 1\n",
"15 intrusion 162.04 1\n",
"16 mnist12 81.20 1\n",
"17 mnist28 439.60 1\n",
"18 news 18.71 1\n",
"19 ring 0.32 1\n",
"20 student_placements 0.03 1\n",
"21 student_placements_pii 0.03 1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sdv.datasets.demo import get_available_demos\n",
"\n",
"get_available_demos(modality='single_table')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b9e92e78-e89d-4d7e-987f-b7d26d5d8ff2",
"metadata": {},
"outputs": [],
"source": [
"from sdv.datasets.demo import download_demo\n",
"\n",
"real_data, metadata = download_demo(\n",
" modality='single_table',\n",
" dataset_name='student_placements')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "5c3f2ddc-5096-432f-b78f-cb65c6ac2d0a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" student_id | \n",
" gender | \n",
" second_perc | \n",
" high_perc | \n",
" high_spec | \n",
" degree_perc | \n",
" degree_type | \n",
" work_experience | \n",
" experience_years | \n",
" employability_perc | \n",
" mba_spec | \n",
" mba_perc | \n",
" salary | \n",
" placed | \n",
" start_date | \n",
" end_date | \n",
" duration | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 17264 | \n",
" M | \n",
" 67.00 | \n",
" 91.00 | \n",
" Commerce | \n",
" 58.00 | \n",
" Sci&Tech | \n",
" False | \n",
" 0 | \n",
" 55.0 | \n",
" Mkt&HR | \n",
" 58.80 | \n",
" 27000.0 | \n",
" True | \n",
" 2020-07-23 | \n",
" 2020-10-12 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 17265 | \n",
" M | \n",
" 79.33 | \n",
" 78.33 | \n",
" Science | \n",
" 77.48 | \n",
" Sci&Tech | \n",
" True | \n",
" 1 | \n",
" 86.5 | \n",
" Mkt&Fin | \n",
" 66.28 | \n",
" 20000.0 | \n",
" True | \n",
" 2020-01-11 | \n",
" 2020-04-09 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 17266 | \n",
" M | \n",
" 65.00 | \n",
" 68.00 | \n",
" Arts | \n",
" 64.00 | \n",
" Comm&Mgmt | \n",
" False | \n",
" 0 | \n",
" 75.0 | \n",
" Mkt&Fin | \n",
" 57.80 | \n",
" 25000.0 | \n",
" True | \n",
" 2020-01-26 | \n",
" 2020-07-13 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 17267 | \n",
" M | \n",
" 56.00 | \n",
" 52.00 | \n",
" Science | \n",
" 52.00 | \n",
" Sci&Tech | \n",
" False | \n",
" 0 | \n",
" 66.0 | \n",
" Mkt&HR | \n",
" 59.43 | \n",
" NaN | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 17268 | \n",
" M | \n",
" 85.80 | \n",
" 73.60 | \n",
" Commerce | \n",
" 73.30 | \n",
" Comm&Mgmt | \n",
" False | \n",
" 0 | \n",
" 96.8 | \n",
" Mkt&Fin | \n",
" 55.50 | \n",
" 42500.0 | \n",
" True | \n",
" 2020-07-04 | \n",
" 2020-09-27 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" student_id gender second_perc high_perc high_spec degree_perc \\\n",
"0 17264 M 67.00 91.00 Commerce 58.00 \n",
"1 17265 M 79.33 78.33 Science 77.48 \n",
"2 17266 M 65.00 68.00 Arts 64.00 \n",
"3 17267 M 56.00 52.00 Science 52.00 \n",
"4 17268 M 85.80 73.60 Commerce 73.30 \n",
"\n",
" degree_type work_experience experience_years employability_perc mba_spec \\\n",
"0 Sci&Tech False 0 55.0 Mkt&HR \n",
"1 Sci&Tech True 1 86.5 Mkt&Fin \n",
"2 Comm&Mgmt False 0 75.0 Mkt&Fin \n",
"3 Sci&Tech False 0 66.0 Mkt&HR \n",
"4 Comm&Mgmt False 0 96.8 Mkt&Fin \n",
"\n",
" mba_perc salary placed start_date end_date duration \n",
"0 58.80 27000.0 True 2020-07-23 2020-10-12 3.0 \n",
"1 66.28 20000.0 True 2020-01-11 2020-04-09 3.0 \n",
"2 57.80 25000.0 True 2020-01-26 2020-07-13 6.0 \n",
"3 59.43 NaN False NaN NaN NaN \n",
"4 55.50 42500.0 True 2020-07-04 2020-09-27 3.0 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"real_data.head(5)"
]
},
{
"cell_type": "markdown",
"id": "ce7a9171-ee54-425f-88b7-8943464adbca",
"metadata": {},
"source": [
"#### Generate synthetic data from real data"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "91b0c477-ba82-4b77-ab64-8b3abf76fa9d",
"metadata": {},
"outputs": [],
"source": [
"from sdv.single_table import GaussianCopulaSynthesizer\n",
"\n",
"synthesizer = GaussianCopulaSynthesizer(metadata)\n",
"synthesizer.fit(data=real_data)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fb8e6436-07ab-4957-a368-564078aaf92a",
"metadata": {},
"outputs": [],
"source": [
"synthetic_data = synthesizer.sample(num_rows=12)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "59f679f8-dda7-47f5-b193-5d4ddec3f2cb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" start_date | \n",
" end_date | \n",
" salary | \n",
" duration | \n",
" student_id | \n",
" high_perc | \n",
" high_spec | \n",
" mba_spec | \n",
" second_perc | \n",
" gender | \n",
" degree_perc | \n",
" placed | \n",
" experience_years | \n",
" employability_perc | \n",
" mba_perc | \n",
" work_experience | \n",
" degree_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2020-01-10 | \n",
" NaN | \n",
" NaN | \n",
" 3.0 | \n",
" 3040587 | \n",
" 66.62 | \n",
" Science | \n",
" Mkt&Fin | \n",
" 75.01 | \n",
" M | \n",
" 75.76 | \n",
" True | \n",
" 1 | \n",
" 85.98 | \n",
" 58.37 | \n",
" True | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" 2020-11-07 | \n",
" 39320.0 | \n",
" NaN | \n",
" 5940200 | \n",
" 81.61 | \n",
" Commerce | \n",
" Mkt&HR | \n",
" 73.03 | \n",
" M | \n",
" 67.27 | \n",
" True | \n",
" 1 | \n",
" 91.44 | \n",
" 65.12 | \n",
" False | \n",
" Comm&Mgmt | \n",
"
\n",
" \n",
" 2 | \n",
" 2020-02-21 | \n",
" 2020-07-08 | \n",
" 36408.0 | \n",
" 3.0 | \n",
" 13408830 | \n",
" 62.71 | \n",
" Arts | \n",
" Mkt&Fin | \n",
" 82.09 | \n",
" F | \n",
" 71.97 | \n",
" True | \n",
" 1 | \n",
" 62.18 | \n",
" 71.15 | \n",
" True | \n",
" Comm&Mgmt | \n",
"
\n",
" \n",
" 3 | \n",
" 2020-01-30 | \n",
" 2020-09-29 | \n",
" 36591.0 | \n",
" 3.0 | \n",
" 16186310 | \n",
" 51.00 | \n",
" Commerce | \n",
" Mkt&Fin | \n",
" 62.04 | \n",
" M | \n",
" 65.32 | \n",
" True | \n",
" 1 | \n",
" 61.87 | \n",
" 58.90 | \n",
" False | \n",
" Comm&Mgmt | \n",
"
\n",
" \n",
" 4 | \n",
" 2020-01-16 | \n",
" NaN | \n",
" 33032.0 | \n",
" NaN | \n",
" 2086931 | \n",
" 67.04 | \n",
" Commerce | \n",
" Mkt&Fin | \n",
" 53.53 | \n",
" M | \n",
" 51.08 | \n",
" True | \n",
" 1 | \n",
" 58.65 | \n",
" 56.32 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 5 | \n",
" 2020-07-20 | \n",
" NaN | \n",
" 31536.0 | \n",
" 3.0 | \n",
" 6414765 | \n",
" 80.30 | \n",
" Science | \n",
" Mkt&HR | \n",
" 87.34 | \n",
" M | \n",
" 74.10 | \n",
" True | \n",
" 1 | \n",
" 64.24 | \n",
" 68.55 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 6 | \n",
" 2020-02-13 | \n",
" 2020-11-26 | \n",
" 32428.0 | \n",
" 12.0 | \n",
" 6180804 | \n",
" 67.61 | \n",
" Commerce | \n",
" Mkt&HR | \n",
" 49.94 | \n",
" M | \n",
" 72.79 | \n",
" True | \n",
" 1 | \n",
" 86.51 | \n",
" 69.26 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 7 | \n",
" 2020-01-02 | \n",
" 2020-07-14 | \n",
" 36317.0 | \n",
" 6.0 | \n",
" 14357765 | \n",
" 63.09 | \n",
" Commerce | \n",
" Mkt&Fin | \n",
" 86.17 | \n",
" M | \n",
" 83.25 | \n",
" True | \n",
" 1 | \n",
" 71.89 | \n",
" 75.90 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 8 | \n",
" NaN | \n",
" 2020-05-10 | \n",
" 27104.0 | \n",
" 3.0 | \n",
" 9499396 | \n",
" 77.42 | \n",
" Science | \n",
" Mkt&HR | \n",
" 71.74 | \n",
" F | \n",
" 66.19 | \n",
" False | \n",
" 1 | \n",
" 95.38 | \n",
" 59.49 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 9 | \n",
" 2020-01-01 | \n",
" 2020-04-15 | \n",
" NaN | \n",
" 3.0 | \n",
" 10945558 | \n",
" 57.54 | \n",
" Science | \n",
" Mkt&HR | \n",
" 57.63 | \n",
" F | \n",
" 72.51 | \n",
" True | \n",
" 1 | \n",
" 86.40 | \n",
" 60.99 | \n",
" True | \n",
" Comm&Mgmt | \n",
"
\n",
" \n",
" 10 | \n",
" 2020-01-01 | \n",
" 2020-10-27 | \n",
" NaN | \n",
" 6.0 | \n",
" 5714925 | \n",
" 82.43 | \n",
" Science | \n",
" Mkt&Fin | \n",
" 68.14 | \n",
" M | \n",
" 76.55 | \n",
" True | \n",
" 1 | \n",
" 95.86 | \n",
" 67.78 | \n",
" False | \n",
" Sci&Tech | \n",
"
\n",
" \n",
" 11 | \n",
" 2020-01-01 | \n",
" 2020-07-02 | \n",
" NaN | \n",
" 3.0 | \n",
" 12273151 | \n",
" 58.25 | \n",
" Commerce | \n",
" Mkt&Fin | \n",
" 65.04 | \n",
" M | \n",
" 61.35 | \n",
" True | \n",
" 1 | \n",
" 65.73 | \n",
" 55.15 | \n",
" True | \n",
" Comm&Mgmt | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" start_date end_date salary duration student_id high_perc \\\n",
"0 2020-01-10 NaN NaN 3.0 3040587 66.62 \n",
"1 NaN 2020-11-07 39320.0 NaN 5940200 81.61 \n",
"2 2020-02-21 2020-07-08 36408.0 3.0 13408830 62.71 \n",
"3 2020-01-30 2020-09-29 36591.0 3.0 16186310 51.00 \n",
"4 2020-01-16 NaN 33032.0 NaN 2086931 67.04 \n",
"5 2020-07-20 NaN 31536.0 3.0 6414765 80.30 \n",
"6 2020-02-13 2020-11-26 32428.0 12.0 6180804 67.61 \n",
"7 2020-01-02 2020-07-14 36317.0 6.0 14357765 63.09 \n",
"8 NaN 2020-05-10 27104.0 3.0 9499396 77.42 \n",
"9 2020-01-01 2020-04-15 NaN 3.0 10945558 57.54 \n",
"10 2020-01-01 2020-10-27 NaN 6.0 5714925 82.43 \n",
"11 2020-01-01 2020-07-02 NaN 3.0 12273151 58.25 \n",
"\n",
" high_spec mba_spec second_perc gender degree_perc placed \\\n",
"0 Science Mkt&Fin 75.01 M 75.76 True \n",
"1 Commerce Mkt&HR 73.03 M 67.27 True \n",
"2 Arts Mkt&Fin 82.09 F 71.97 True \n",
"3 Commerce Mkt&Fin 62.04 M 65.32 True \n",
"4 Commerce Mkt&Fin 53.53 M 51.08 True \n",
"5 Science Mkt&HR 87.34 M 74.10 True \n",
"6 Commerce Mkt&HR 49.94 M 72.79 True \n",
"7 Commerce Mkt&Fin 86.17 M 83.25 True \n",
"8 Science Mkt&HR 71.74 F 66.19 False \n",
"9 Science Mkt&HR 57.63 F 72.51 True \n",
"10 Science Mkt&Fin 68.14 M 76.55 True \n",
"11 Commerce Mkt&Fin 65.04 M 61.35 True \n",
"\n",
" experience_years employability_perc mba_perc work_experience \\\n",
"0 1 85.98 58.37 True \n",
"1 1 91.44 65.12 False \n",
"2 1 62.18 71.15 True \n",
"3 1 61.87 58.90 False \n",
"4 1 58.65 56.32 False \n",
"5 1 64.24 68.55 False \n",
"6 1 86.51 69.26 False \n",
"7 1 71.89 75.90 False \n",
"8 1 95.38 59.49 False \n",
"9 1 86.40 60.99 True \n",
"10 1 95.86 67.78 False \n",
"11 1 65.73 55.15 True \n",
"\n",
" degree_type \n",
"0 Sci&Tech \n",
"1 Comm&Mgmt \n",
"2 Comm&Mgmt \n",
"3 Comm&Mgmt \n",
"4 Sci&Tech \n",
"5 Sci&Tech \n",
"6 Sci&Tech \n",
"7 Sci&Tech \n",
"8 Sci&Tech \n",
"9 Comm&Mgmt \n",
"10 Sci&Tech \n",
"11 Comm&Mgmt "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"synthetic_data"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "d17f0fcf-4380-459f-b901-52e63be3b8d1",
"metadata": {},
"outputs": [],
"source": [
"# Save the DataFrame to a CSV file\n",
"synthetic_data.to_csv('generated_data/tabular_data.csv')"
]
},
{
"cell_type": "markdown",
"id": "b63773e2-a599-410b-8955-85c9a79242db",
"metadata": {},
"source": [
"## Load pre-generated synthetic tabular data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8fed0fe7-3e34-4dec-850d-2af119bc51ec",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Read the CSV file into a DataFrame\n",
"synthetic_data = pd.read_csv('generated_data/tabular_data.csv')"
]
},
{
"cell_type": "markdown",
"id": "45652011-8bdb-4e03-8ae3-87cd92714b70",
"metadata": {},
"source": [
"## Synthetic Data Generation with Llama-3.3-70B-Instruct\n",
"\n",
"In this section, we use `Llama-3.3-70B-Instruct` to create a story using tabular data and then generate extractive summary report from the generated context.\n",
"You could try using Llama-3.1-8B-Instruct but we have seen better results with the 70B model for generating synthetic data.\n",
"\n",
"### Alternate approach\n",
"\n",
"In the below section, we choose tabular data as the ground truth and generate all the context & reports from the table. Another approach is to use couple of examples as few shot prompting and use Llama to generate the context & story from this and asking it to vary the factual information. We can then use Llama to create the ground truth tabular data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bc314704-caf4-42b0-962d-edbea2fe6b3c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/agunapal/anaconda3/envs/torchtune/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"Loading checkpoint shards: 100%|██████████| 30/30 [19:53<00:00, 39.79s/it]\n"
]
}
],
"source": [
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"model_id: str = \"meta-llama/Llama-3.3-70B-Instruct\"\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"model = AutoModelForCausalLM.from_pretrained(model_id, device_map=\"auto\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "37595ef4-5872-4755-af8a-0bc5bbda3554",
"metadata": {},
"outputs": [],
"source": [
"# Prompt for generating context from synthetic tabular data\n",
"\n",
"story_teller = lambda index: f\"\"\"You are expert Story Teller. \n",
"Look at the following data and tell a story in the form of a progress report. \n",
"This report should have a sub-section for the following:\n",
"- Academic Background\n",
"- Career Aspirations\n",
"- Salary Expectations\n",
"- Placement Status\n",
"- Course Details\n",
"- Story Behind the Numbers\n",
"\n",
"Be creative and make up story with other statistics\n",
"\n",
"\n",
"{synthetic_data.loc[[synthetic_data.index[index]]]}\n",
"\n",
"- DO NOT create another table. \n",
"- DO NOT ask any clarifying questions\n",
"- DO NOT justify your answers\n",
"- Make sure each column has a subheading in the report\n",
"- Each of the sections should have the respective tag. \n",
"- All currency is in tokens\n",
"- Answer within 800 tokens\n",
"\n",
"Example:\n",
"\n",
"Student 17264 has a commendable academic record, which is evident from his second-year and high school percentage scores. His second-year percentage stands at 67%, while his high school percentage is an impressive 91%. He has also shown a keen interest in commerce, with a degree percentage of 58%. His academic background is a testament to his hard work and dedication to his studies.\n",
"\n",
"\n",
"Answer:\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8806dacb-d03b-47f1-8363-36f3c8e88365",
"metadata": {},
"outputs": [],
"source": [
"# Prompt for generating report from the generated context\n",
"\n",
"report_creator = lambda context: f\"\"\" You are an expert report creator.\n",
"Look at the data in context: \n",
"{context}\n",
"\n",
"and generate a shortened report with 1 line with the following subsections:\n",
"- student_id\n",
"- degree_type\n",
"- salary\n",
"- mba_spec\n",
"- duration\n",
"- employability_perc\n",
"\n",
"IMPORTANT:\n",
"- DO NOT ask any clarifying questions\n",
"- DO NOT justify your answers\n",
"- DO NOT show the data \n",
"- DO NOT write any python code\n",
"- Each of the sections should have the respective tag and should be shown ONLY once\n",
"- Make sure to copy the mba_spec & degree_type as is\n",
"\n",
"Example:\n",
"\n",
"Summary Report:\n",
"\n",
"\n",
"Student ID is 17269\n",
"\n",
"\n",
"\n",
"Student has a realistic salary expectation of 27,000 tokens per month\n",
"\n",
"\n",
"\n",
"Student has a degree in Sci&Tech\n",
"\n",
"\n",
"\n",
"Student has a specialization in Mkt&Fin\n",
"\n",
"\n",
"\n",
"Student has a degree duration of 4 years\n",
"\n",
"\n",
"\n",
"Student has a 95.0% employability percentage\n",
"\n",
"\n",
"Answer:\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "f8620c9e-0641-4ca7-9445-71a22de9f494",
"metadata": {},
"source": [
"### Generate 12 examples of synthetic data using this loop\n",
"\n",
"Why 12?: We will use 2 examples for few shot prompting and the rest 10 for Evals.\n",
"\n",
"In practice, you want the number of data points to be much higher for your production application"
]
},
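{
"cell_type": "markdown",
"id": "3f1c2a7e-5b8d-4c1a-9e2f-7a6b5c4d3e21",
"metadata": {},
"source": [
"The split described above can be sketched as follows. This is a minimal illustration rather than part of the original pipeline: the names `NUM_EXAMPLES`, `NUM_FEW_SHOT`, `few_shot_indices` and `eval_indices` are assumptions introduced here, and only the index arithmetic is shown.\n",
"\n",
"```python\n",
"# Hypothetical constants mirroring the 12 = 2 + 10 split described above\n",
"NUM_EXAMPLES = 12\n",
"NUM_FEW_SHOT = 2\n",
"\n",
"indices = list(range(NUM_EXAMPLES))\n",
"few_shot_indices = indices[:NUM_FEW_SHOT]  # rows used as few-shot examples\n",
"eval_indices = indices[NUM_FEW_SHOT:]      # rows held out for Evals\n",
"\n",
"print(few_shot_indices, len(eval_indices))\n",
"```\n",
"\n",
"Keeping the few-shot rows out of the evaluation set avoids scoring the model on examples it has already seen in the prompt."
]
},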
{
"cell_type": "code",
"execution_count": 75,
"id": "44fa3c0b-9d2c-4dcb-bcf8-879d17c71512",
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"import json\n",
"\n",
"for i in range(12):\n",
"\n",
" formatted_prompt = story_teller(i)\n",
" input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
" \n",
" # Generate context from tabular data\n",
" output = model.generate(**input, max_new_tokens=800, pad_token_id=0, temperature=0.8)\n",
" prompt_len = input[\"input_ids\"].shape[-1]\n",
" context = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
"\n",
" formatted_prompt = report_creator(context)\n",
" input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
"\n",
" # Generate report from generated report\n",
" output = model.generate(**input, max_new_tokens=120, pad_token_id=0,)\n",
" prompt_len = input[\"input_ids\"].shape[-1]\n",
" report = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
"\n",
" # Create json output\n",
" result = {}\n",
" result[\"context\"] = context\n",
" result[\"report\"] = report\n",
" with open(f'generated_data/data_{i}.json', 'w') as f:\n",
" json.dump(result, f, indent=4)"
]
},
{
"cell_type": "markdown",
"id": "06e81b1a-e4da-4b67-97c1-0544af2eb104",
"metadata": {},
"source": [
"## Example Context & Report\n",
"\n",
"By manual inspection we see that, Llama has created well structured context and the corresponding report. We also see that all the factual information is correct."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a6fd5a95-8a73-4a0e-bd28-bf6291af3827",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context is -------------------------\n",
"\n",
"### Progress Report for Student 3040587\n",
"#### \n",
"Student 3040587 has a strong academic foundation, with a high school percentage of 66.62% in Science. He also holds a degree in Science and Technology with a percentage of 75.76%. His second-year percentage is 75.01%, demonstrating his consistent academic performance. \n",
"#### \n",
"With a specialization in Marketing and Finance, Student 3040587 aspires to pursue a career in the finance sector, leveraging his skills in market analysis and financial planning. His career goal is to become a financial analyst, with a focus on investment banking. \n",
"#### \n",
"Student 3040587 expects a starting salary of 5000 tokens per annum, considering his one year of work experience and academic achievements. He is confident that his skills and knowledge will enable him to secure a job with a reputable company. \n",
"#### \n",
"Student 3040587 has been successfully placed, with an employability percentage of 85.98%. His placement is a testament to his hard work and dedication to his studies, as well as his relevant work experience. \n",
"#### \n",
"Student 3040587 is currently pursuing an MBA with a specialization in Marketing and Finance, with a course duration of 3 years. He has completed one year of the course, with an MBA percentage of 58.37%. \n",
"#### \n",
"Behind the numbers, Student 3040587's story is one of perseverance and determination. Despite facing challenges in his academic journey, he has consistently worked hard to achieve his goals. His work experience has equipped him with the skills and knowledge required to succeed in the finance sector. With his strong academic background, career aspirations, and relevant work experience, Student 3040587 is poised to achieve great things in his future career. \n",
"### End of Report 3040587\n",
"\n",
"\n",
"Report is -------------------------\n",
"\n",
"Summary Report:\n",
"\n",
"\n",
"Student 3040587\n",
"\n",
"\n",
"\n",
"Student has a realistic salary expectation of 5000 tokens per annum\n",
"\n",
"\n",
"\n",
"Student has a degree in Science and Technology\n",
"\n",
"\n",
"\n",
"Student has a specialization in Marketing and Finance\n",
"\n",
"\n",
"\n",
"Student has a degree duration of 3 years\n",
"\n",
"\n",
"\n",
"Student has a 85.98% employability percentage\n",
"\n"
]
}
],
"source": [
"import json\n",
"def read_json_file(file_path):\n",
" try:\n",
" with open(file_path, 'r') as file:\n",
" data = json.load(file)\n",
" return data\n",
" except FileNotFoundError:\n",
" print(f\"File not found: {file_path}\")\n",
" return None\n",
" except json.JSONDecodeError as e:\n",
" print(f\"Invalid JSON: {e}\")\n",
" return None\n",
"# Example usage:\n",
"file_path = 'generated_data/data_0.json'\n",
"data = read_json_file(file_path)\n",
"print(\"Context is -------------------------\\n\")\n",
"print(data[\"context\"])\n",
"print(\"\\nReport is -------------------------\\n\") \n",
"print(data[\"report\"])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6a90b297-67a2-457e-99a6-658546a96af1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" start_date | \n",
" end_date | \n",
" salary | \n",
" duration | \n",
" student_id | \n",
" high_perc | \n",
" high_spec | \n",
" mba_spec | \n",
" second_perc | \n",
" gender | \n",
" degree_perc | \n",
" placed | \n",
" experience_years | \n",
" employability_perc | \n",
" mba_perc | \n",
" work_experience | \n",
" degree_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 2020-01-10 | \n",
" NaN | \n",
" NaN | \n",
" 3.0 | \n",
" 3040587 | \n",
" 66.62 | \n",
" Science | \n",
" Mkt&Fin | \n",
" 75.01 | \n",
" M | \n",
" 75.76 | \n",
" True | \n",
" 1 | \n",
" 85.98 | \n",
" 58.37 | \n",
" True | \n",
" Sci&Tech | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 start_date end_date salary duration student_id high_perc \\\n",
"0 0 2020-01-10 NaN NaN 3.0 3040587 66.62 \n",
"\n",
" high_spec mba_spec second_perc gender degree_perc placed \\\n",
"0 Science Mkt&Fin 75.01 M 75.76 True \n",
"\n",
" experience_years employability_perc mba_perc work_experience degree_type \n",
"0 1 85.98 58.37 True Sci&Tech "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"synthetic_data.loc[[synthetic_data.index[0]]]"
]
},
{
"cell_type": "markdown",
"id": "84609a64-0842-49f0-a3a5-3ad94c3f9b5d",
"metadata": {},
"source": [
"## Important!!!!! Verification by human!\n",
"\n",
"At this point, ideally you need a human to look at the synthetic data that you have generated and fix any errors in the formatting or factual information or be aware of the number of errors in the dataset"
]
},
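{
"cell_type": "markdown",
"id": "7b2e9d40-1c6f-4a83-b5d1-0e8f3a2c6d57",
"metadata": {},
"source": [
"To help a reviewer prioritize, one option is a naive pre-check that flags fields whose ground-truth value does not appear verbatim in the report. The sketch below is an assumption-laden illustration, not part of the original notebook: `missing_values` is a hypothetical helper, and the `row` and `report` values are condensed from row 0 of the table and its generated report shown earlier.\n",
"\n",
"```python\n",
"# Hypothetical helper: returns the fields whose ground-truth value is not found verbatim\n",
"def missing_values(report, row, fields):\n",
"    missing = []\n",
"    for name in fields:\n",
"        # Plain substring match on the string form of the value; deliberately naive\n",
"        if str(row[name]) not in report:\n",
"            missing.append(name)\n",
"    return missing\n",
"\n",
"# Values condensed from row 0 of the table and its generated report\n",
"row = {'student_id': 3040587, 'degree_type': 'Sci&Tech', 'employability_perc': 85.98}\n",
"report = 'Student 3040587 has a degree in Science and Technology and a 85.98% employability percentage'\n",
"\n",
"print(missing_values(report, row, ['student_id', 'degree_type', 'employability_perc']))\n",
"```\n",
"\n",
"Here `degree_type` is flagged because the report expands `Sci&Tech` to `Science and Technology`. A flag is only a prompt for human review, not a verdict: the reviewer decides whether the difference is an error."
]
},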
{
"cell_type": "markdown",
"id": "6f9c4b0a-0d15-4140-8a18-7c3710ab71f2",
"metadata": {},
"source": [
"## Measuring Hallucinations\n",
"\n",
"The usual method to measure hallucinations uses LLM-As-Judge methodology. An example hallucination metric is using [DeepEval](https://www.deepeval.com/docs/metrics-hallucination).\n",
"This would use a powerful LLM as the ground truth to measure hallucinations.\n",
"\n",
"The below section shows a way to measure hallucinations using the ground truth data that we have (tabular data). The methodology is to make use of the tags that we have added in the report and use Llama to answer simple questions looking at the corresponding sections. Llama compares the answers with the ground truth and generates a list of boolean values. This is then used to measure accuracy of the factual information in the report. If your report has a well defined structure, using QA to measure hallucinations can be highly effective and cost efficient"
]
},
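{
"cell_type": "markdown",
"id": "c4d8a1f6-2e3b-4f90-8a7c-5b1d9e0f4a32",
"metadata": {},
"source": [
"Once each report has been reduced to a list of boolean values, the accuracy can be computed with simple counting. The sketch below assumes one list per report with one entry per question, in the order the questions are asked; `FIELDS`, `hallucination_stats` and the sample `results` are hypothetical and only illustrate the arithmetic.\n",
"\n",
"```python\n",
"# Field order assumed to match the order of the questions in the prompt\n",
"FIELDS = ['student_id', 'degree_type', 'salary', 'mba_spec', 'duration', 'employability_perc']\n",
"\n",
"def hallucination_stats(results):\n",
"    # results: one list of bools per report, ordered like FIELDS\n",
"    total = sum(len(r) for r in results)\n",
"    correct = sum(sum(r) for r in results)\n",
"    accuracy = correct / total if total else None\n",
"    per_field = {}\n",
"    for i, name in enumerate(FIELDS):\n",
"        values = [r[i] for r in results if len(r) > i]\n",
"        per_field[name] = sum(values) / len(values) if values else None\n",
"    return accuracy, per_field\n",
"\n",
"# Made-up sample: two reports, one with a wrong student_id\n",
"results = [\n",
"    [True, True, True, True, True, True],\n",
"    [False, True, True, True, True, True],\n",
"]\n",
"accuracy, per_field = hallucination_stats(results)\n",
"print(accuracy, per_field)\n",
"```\n",
"\n",
"The hallucination rate is then one minus the accuracy, and the per-field breakdown shows which parts of the report are most often wrong."
]
},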
{
"cell_type": "code",
"execution_count": 7,
"id": "cef3d897-5d3a-44d8-a04b-a5b56ab19fee",
"metadata": {},
"outputs": [],
"source": [
"# Use the first 2 data points for few shot prompting\n",
"file_path = 'generated_data_faulty/data_0.json'\n",
"example_data = read_json_file(file_path)\n",
"\n",
"file_path = 'generated_data_faulty/data_1.json'\n",
"example_data_1 = read_json_file(file_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c35d758-3d7b-4512-9bdf-8055b9bff3e2",
"metadata": {},
"outputs": [],
"source": [
"check_hallucinations = lambda data,index: f\"\"\"You are a Helpful Assistant. \n",
"Look at the section called Generated Report below & answer the following questions by only looking\n",
"at the corresponding sections in the report\n",
"- student_id: Question : What is the student id?\n",
"- degree_type : Question : What is the degree_type?\n",
"- salary: Question : What is the salary?\n",
"- mba_spec: Question : What is the mba_spec?\n",
"- duration: Question : What is the duration?\n",
"- employability_perc: Question : What is the employability percentage?\n",
"\n",
"Generated Report:\n",
"{data[\"report\"]}\n",
"\n",
"Compare your answers with the ground truth and return either True or False within the tags & \n",
"Only if an answer is False, explain why in the format shown in the examples below\n",
"\n",
"Ground Truth:\n",
"{synthetic_data.loc[[synthetic_data.index[index]]]}\n",
"\n",
"\n",
"\n",
"Important Notes:\n",
"- Only check for the above mentioned questions\n",
"- Make sure each of the section is shown ONLY once\n",
"- DO NOT reason or explain your process\n",
"- DO NOT code this\n",
"- DO NOT explain why something is True\n",
"- Be lenient when checking decimal points. Ex: 4.0 is the same as 4\n",
"\n",
"Example:\n",
"1)\n",
"With the following report:\n",
"\n",
"{example_data[\"report\"]}\n",
"\n",
"and the ground truth:\n",
"{synthetic_data.loc[[synthetic_data.index[0]]]}\n",
"\n",
"the following output is expected:\n",
"\n",
"\n",
"student_id: [False, report shows 17263 and ground truth says 17264]\n",
"degree_type: [True, None]\n",
"salary: [False, report says 28000 and ground truth says 27000]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"2) \n",
"With the following report:\n",
"\n",
"{example_data_1[\"report\"]}\n",
"\n",
"and the ground truth:\n",
"{synthetic_data.loc[[synthetic_data.index[1]]]}\n",
"\n",
"the following output is expected:\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Answer:\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "bd532640-db12-46e9-aec2-7609dfe1e72c",
"metadata": {},
"outputs": [],
"source": [
"def parse_output(output):\n",
" \"\"\"\n",
" Parse the output and return a list of bool values\n",
" \"\"\"\n",
" lines = output.strip().splitlines()\n",
" bool_values = []\n",
" for line in lines:\n",
" # Skip empty lines, lines with tags, or lines starting with '<'\n",
" if not line or line.startswith('<') or line.endswith('>'):\n",
" continue\n",
" parts = line.split(': ')\n",
" if len(parts) != 2:\n",
" raise ValueError(f\"Invalid line format: {line}\")\n",
" value_str, _ = parts[1].strip('[]').split(', ')\n",
" if value_str == 'True':\n",
" bool_values.append(True)\n",
" elif value_str == 'False':\n",
" bool_values.append(False)\n",
" else:\n",
" raise ValueError(f\"Invalid bool value: {value_str}\")\n",
" if parts[0] == 'employability_perc':\n",
" break\n",
" return bool_values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2ee3a24-bb64-4f3a-a547-5e976bbf817b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Checking accuracy of generated report in generated_data/data_2.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_3.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_4.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_5.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_6.json\n",
"\n",
"\n",
"student_id: [False, report shows Student ID is not mentioned in the data and ground truth says 6180804]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_7.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [False, report says Mkt&Fin and ground truth says Sci&Tech]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_8.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_9.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_10.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [True, None]\n",
"salary: [True, None]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Checking accuracy of generated report in generated_data/data_11.json\n",
"\n",
"\n",
"student_id: [True, None]\n",
"degree_type: [False, report says Commerce and ground truth says Comm&Mgmt]\n",
"salary: [False, report says 500 and ground truth says NaN]\n",
"mba_spec: [True, None]\n",
"duration: [True, None]\n",
"employability_perc: [True, None]\n",
"\n",
"\n",
"Accuracy of factual information generation is : 0.9333\n"
]
}
],
"source": [
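"from sklearn.metrics import accuracy_score\n",
"\n",
"report_ids = range(2, 12)  # data_2.json ... data_11.json\n",
"num_fields = 6  # student_id, degree_type, salary, mba_spec, duration, employability_perc\n",
"\n",
"y_pred = []\n",
"# A faithful report matches the ground truth on every field, so all expected labels are True\n",
"y_true = [True] * (len(report_ids) * num_fields)\n",
"for i in report_ids:\n",
"    fname = f'generated_data/data_{i}.json'\n",
"    print(f\"\\nChecking accuracy of generated report in {fname}\\n\")\n",
"    data = read_json_file(fname)\n",
"\n",
"    formatted_prompt = check_hallucinations(data, i)\n",
"\n",
"    # Named `inputs` to avoid shadowing the built-in `input`\n",
"    inputs = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
"\n",
"    # Greedy decoding so the verdicts are reproducible\n",
"    output = model.generate(**inputs, max_new_tokens=120, pad_token_id=0, do_sample=False, top_p=None, temperature=None)\n",
"    # Decode only the newly generated tokens, not the prompt\n",
"    prompt_len = inputs[\"input_ids\"].shape[-1]\n",
"    results = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
"    print(results)\n",
"    y_pred.extend(parse_output(results))\n",
"accuracy = accuracy_score(y_true, y_pred)\n",
"print(f\"\\nAccuracy of factual information generation is : {accuracy:.4f}\")"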
]
},
{
"cell_type": "markdown",
"id": "fe3e87f5-e04d-472a-873c-675951459334",
"metadata": {},
"source": [
"# Conclusion & Next Steps\n",
"\n",
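"- Evals are important for summarization use cases, where the generated report carries factual information that must not be hallucinated.\n",
"- Llama can be used to create Evals from a few samples of ground truth, although the generated Evals still need to be validated by a human.\n",
"- Converting the factual check into a simple QA task against the ground truth can be an effective way to gain confidence that important factual information is being verified."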
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}