{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "527a835c-afc5-4df7-a924-d5f61a417cf2",
   "metadata": {},
   "source": [
    "# Creating Evals with synthetic data and measuring hallucinations\n",
    "\n",
    "When you deploy Llama for your use case, it is good practice to have Evals built for that use case. Although human-annotated Evals are ideal, this notebook shows a strategy for creating them from synthetic data. The generated Evals still require validation by a human so that your production use case can rely on them.\n",
    "The notebook also shows how one could accurately measure hallucinations with Llama without using the LLM-as-a-Judge methodology."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acfdf84f-0f4e-4684-83b0-9d2657441886",
   "metadata": {},
   "source": [
    "## Overall idea\n",
    "\n",
    "Let's assume we have a use case for generating a summary report from a given context, which is a common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report does not hallucinate.\n",
    "\n",
    "Since it's not trivial to find an open source dataset for this, the idea is to take synthetic tabular data and use Llama, through prompt engineering, to generate a story (the context) for every row. We then ask Llama to summarize the generated context as a report in a specific format, again using prompt engineering. Finally, we use Llama to check the factual accuracy of the generated report by converting the check into a QA task, with the tabular data as the ground truth.\n",
    "\n",
    "To generate the synthetic data for this approach, we use an open source tool, [Synthetic Data Vault](https://github.com/sdv-dev/SDV) (SDV).\n",
    "\n",
    "The overall workflow is shown in the diagram below.\n",
    "\n",
    "![Workflow](./Workflow_Diagram.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d1e65c0-e9ae-41eb-b1a2-165d09dfacbf",
   "metadata": {},
   "source": [
    "### Synthetic Data Vault installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff645d5c-b441-40cc-97c2-451a2c103ba2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install sdv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e37994a-1683-4c86-b4fa-e3e49563137c",
   "metadata": {},
   "source": [
    "SDV ships with a number of single-table demo datasets. We choose the `student_placements` dataset for this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "4f0281a7-027d-4e4c-bbd8-7bda766a2183",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dataset_name</th>\n",
       "      <th>size_MB</th>\n",
       "      <th>num_tables</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>KRK_v1</td>\n",
       "      <td>0.06</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>adult</td>\n",
       "      <td>3.91</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>alarm</td>\n",
       "      <td>4.52</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>asia</td>\n",
       "      <td>1.28</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>census</td>\n",
       "      <td>98.17</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>census_extended</td>\n",
       "      <td>4.95</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>child</td>\n",
       "      <td>3.20</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>covtype</td>\n",
       "      <td>255.65</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>credit</td>\n",
       "      <td>68.35</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>expedia_hotel_logs</td>\n",
       "      <td>0.20</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>fake_companies</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>fake_hotel_guests</td>\n",
       "      <td>0.03</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>grid</td>\n",
       "      <td>0.32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>gridr</td>\n",
       "      <td>0.32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>insurance</td>\n",
       "      <td>3.34</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>intrusion</td>\n",
       "      <td>162.04</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>mnist12</td>\n",
       "      <td>81.20</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>mnist28</td>\n",
       "      <td>439.60</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>news</td>\n",
       "      <td>18.71</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>ring</td>\n",
       "      <td>0.32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>student_placements</td>\n",
       "      <td>0.03</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>student_placements_pii</td>\n",
       "      <td>0.03</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              dataset_name  size_MB  num_tables\n",
       "0                   KRK_v1     0.06           1\n",
       "1                    adult     3.91           1\n",
       "2                    alarm     4.52           1\n",
       "3                     asia     1.28           1\n",
       "4                   census    98.17           1\n",
       "5          census_extended     4.95           1\n",
       "6                    child     3.20           1\n",
       "7                  covtype   255.65           1\n",
       "8                   credit    68.35           1\n",
       "9       expedia_hotel_logs     0.20           1\n",
       "10          fake_companies     0.00           1\n",
       "11       fake_hotel_guests     0.03           1\n",
       "12                    grid     0.32           1\n",
       "13                   gridr     0.32           1\n",
       "14               insurance     3.34           1\n",
       "15               intrusion   162.04           1\n",
       "16                 mnist12    81.20           1\n",
       "17                 mnist28   439.60           1\n",
       "18                    news    18.71           1\n",
       "19                    ring     0.32           1\n",
       "20      student_placements     0.03           1\n",
       "21  student_placements_pii     0.03           1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sdv.datasets.demo import get_available_demos\n",
    "\n",
    "get_available_demos(modality='single_table')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b9e92e78-e89d-4d7e-987f-b7d26d5d8ff2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sdv.datasets.demo import download_demo\n",
    "\n",
    "real_data, metadata = download_demo(\n",
    "    modality='single_table',\n",
    "    dataset_name='student_placements')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "5c3f2ddc-5096-432f-b78f-cb65c6ac2d0a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>student_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>second_perc</th>\n",
       "      <th>high_perc</th>\n",
       "      <th>high_spec</th>\n",
       "      <th>degree_perc</th>\n",
       "      <th>degree_type</th>\n",
       "      <th>work_experience</th>\n",
       "      <th>experience_years</th>\n",
       "      <th>employability_perc</th>\n",
       "      <th>mba_spec</th>\n",
       "      <th>mba_perc</th>\n",
       "      <th>salary</th>\n",
       "      <th>placed</th>\n",
       "      <th>start_date</th>\n",
       "      <th>end_date</th>\n",
       "      <th>duration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>17264</td>\n",
       "      <td>M</td>\n",
       "      <td>67.00</td>\n",
       "      <td>91.00</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>58.00</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>58.80</td>\n",
       "      <td>27000.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2020-07-23</td>\n",
       "      <td>2020-10-12</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>17265</td>\n",
       "      <td>M</td>\n",
       "      <td>79.33</td>\n",
       "      <td>78.33</td>\n",
       "      <td>Science</td>\n",
       "      <td>77.48</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>86.5</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>66.28</td>\n",
       "      <td>20000.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2020-01-11</td>\n",
       "      <td>2020-04-09</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>17266</td>\n",
       "      <td>M</td>\n",
       "      <td>65.00</td>\n",
       "      <td>68.00</td>\n",
       "      <td>Arts</td>\n",
       "      <td>64.00</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>57.80</td>\n",
       "      <td>25000.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2020-01-26</td>\n",
       "      <td>2020-07-13</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>17267</td>\n",
       "      <td>M</td>\n",
       "      <td>56.00</td>\n",
       "      <td>52.00</td>\n",
       "      <td>Science</td>\n",
       "      <td>52.00</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>59.43</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17268</td>\n",
       "      <td>M</td>\n",
       "      <td>85.80</td>\n",
       "      <td>73.60</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>73.30</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>96.8</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>55.50</td>\n",
       "      <td>42500.0</td>\n",
       "      <td>True</td>\n",
       "      <td>2020-07-04</td>\n",
       "      <td>2020-09-27</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   student_id gender  second_perc  high_perc high_spec  degree_perc  \\\n",
       "0       17264      M        67.00      91.00  Commerce        58.00   \n",
       "1       17265      M        79.33      78.33   Science        77.48   \n",
       "2       17266      M        65.00      68.00      Arts        64.00   \n",
       "3       17267      M        56.00      52.00   Science        52.00   \n",
       "4       17268      M        85.80      73.60  Commerce        73.30   \n",
       "\n",
       "  degree_type  work_experience  experience_years  employability_perc mba_spec  \\\n",
       "0    Sci&Tech            False                 0                55.0   Mkt&HR   \n",
       "1    Sci&Tech             True                 1                86.5  Mkt&Fin   \n",
       "2   Comm&Mgmt            False                 0                75.0  Mkt&Fin   \n",
       "3    Sci&Tech            False                 0                66.0   Mkt&HR   \n",
       "4   Comm&Mgmt            False                 0                96.8  Mkt&Fin   \n",
       "\n",
       "   mba_perc   salary  placed  start_date    end_date  duration  \n",
       "0     58.80  27000.0    True  2020-07-23  2020-10-12       3.0  \n",
       "1     66.28  20000.0    True  2020-01-11  2020-04-09       3.0  \n",
       "2     57.80  25000.0    True  2020-01-26  2020-07-13       6.0  \n",
       "3     59.43      NaN   False         NaN         NaN       NaN  \n",
       "4     55.50  42500.0    True  2020-07-04  2020-09-27       3.0  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "real_data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce7a9171-ee54-425f-88b7-8943464adbca",
   "metadata": {},
   "source": [
    "#### Generate synthetic data from real data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "91b0c477-ba82-4b77-ab64-8b3abf76fa9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sdv.single_table import GaussianCopulaSynthesizer\n",
    "\n",
    "synthesizer = GaussianCopulaSynthesizer(metadata)\n",
    "synthesizer.fit(data=real_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "fb8e6436-07ab-4957-a368-564078aaf92a",
   "metadata": {},
   "outputs": [],
   "source": [
    "synthetic_data = synthesizer.sample(num_rows=12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "59f679f8-dda7-47f5-b193-5d4ddec3f2cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>start_date</th>\n",
       "      <th>end_date</th>\n",
       "      <th>salary</th>\n",
       "      <th>duration</th>\n",
       "      <th>student_id</th>\n",
       "      <th>high_perc</th>\n",
       "      <th>high_spec</th>\n",
       "      <th>mba_spec</th>\n",
       "      <th>second_perc</th>\n",
       "      <th>gender</th>\n",
       "      <th>degree_perc</th>\n",
       "      <th>placed</th>\n",
       "      <th>experience_years</th>\n",
       "      <th>employability_perc</th>\n",
       "      <th>mba_perc</th>\n",
       "      <th>work_experience</th>\n",
       "      <th>degree_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2020-01-10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3040587</td>\n",
       "      <td>66.62</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>75.01</td>\n",
       "      <td>M</td>\n",
       "      <td>75.76</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>85.98</td>\n",
       "      <td>58.37</td>\n",
       "      <td>True</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2020-11-07</td>\n",
       "      <td>39320.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5940200</td>\n",
       "      <td>81.61</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>73.03</td>\n",
       "      <td>M</td>\n",
       "      <td>67.27</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>91.44</td>\n",
       "      <td>65.12</td>\n",
       "      <td>False</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2020-02-21</td>\n",
       "      <td>2020-07-08</td>\n",
       "      <td>36408.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>13408830</td>\n",
       "      <td>62.71</td>\n",
       "      <td>Arts</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>82.09</td>\n",
       "      <td>F</td>\n",
       "      <td>71.97</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>62.18</td>\n",
       "      <td>71.15</td>\n",
       "      <td>True</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2020-01-30</td>\n",
       "      <td>2020-09-29</td>\n",
       "      <td>36591.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>16186310</td>\n",
       "      <td>51.00</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>62.04</td>\n",
       "      <td>M</td>\n",
       "      <td>65.32</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>61.87</td>\n",
       "      <td>58.90</td>\n",
       "      <td>False</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2020-01-16</td>\n",
       "      <td>NaN</td>\n",
       "      <td>33032.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2086931</td>\n",
       "      <td>67.04</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>53.53</td>\n",
       "      <td>M</td>\n",
       "      <td>51.08</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>58.65</td>\n",
       "      <td>56.32</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2020-07-20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31536.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6414765</td>\n",
       "      <td>80.30</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>87.34</td>\n",
       "      <td>M</td>\n",
       "      <td>74.10</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>64.24</td>\n",
       "      <td>68.55</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2020-02-13</td>\n",
       "      <td>2020-11-26</td>\n",
       "      <td>32428.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>6180804</td>\n",
       "      <td>67.61</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>49.94</td>\n",
       "      <td>M</td>\n",
       "      <td>72.79</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>86.51</td>\n",
       "      <td>69.26</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>2020-07-14</td>\n",
       "      <td>36317.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>14357765</td>\n",
       "      <td>63.09</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>86.17</td>\n",
       "      <td>M</td>\n",
       "      <td>83.25</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>71.89</td>\n",
       "      <td>75.90</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2020-05-10</td>\n",
       "      <td>27104.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>9499396</td>\n",
       "      <td>77.42</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>71.74</td>\n",
       "      <td>F</td>\n",
       "      <td>66.19</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "      <td>95.38</td>\n",
       "      <td>59.49</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2020-01-01</td>\n",
       "      <td>2020-04-15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>10945558</td>\n",
       "      <td>57.54</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;HR</td>\n",
       "      <td>57.63</td>\n",
       "      <td>F</td>\n",
       "      <td>72.51</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>86.40</td>\n",
       "      <td>60.99</td>\n",
       "      <td>True</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2020-01-01</td>\n",
       "      <td>2020-10-27</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.0</td>\n",
       "      <td>5714925</td>\n",
       "      <td>82.43</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>68.14</td>\n",
       "      <td>M</td>\n",
       "      <td>76.55</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>95.86</td>\n",
       "      <td>67.78</td>\n",
       "      <td>False</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2020-01-01</td>\n",
       "      <td>2020-07-02</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>12273151</td>\n",
       "      <td>58.25</td>\n",
       "      <td>Commerce</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>65.04</td>\n",
       "      <td>M</td>\n",
       "      <td>61.35</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>65.73</td>\n",
       "      <td>55.15</td>\n",
       "      <td>True</td>\n",
       "      <td>Comm&amp;Mgmt</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    start_date    end_date   salary  duration  student_id  high_perc  \\\n",
       "0   2020-01-10         NaN      NaN       3.0     3040587      66.62   \n",
       "1          NaN  2020-11-07  39320.0       NaN     5940200      81.61   \n",
       "2   2020-02-21  2020-07-08  36408.0       3.0    13408830      62.71   \n",
       "3   2020-01-30  2020-09-29  36591.0       3.0    16186310      51.00   \n",
       "4   2020-01-16         NaN  33032.0       NaN     2086931      67.04   \n",
       "5   2020-07-20         NaN  31536.0       3.0     6414765      80.30   \n",
       "6   2020-02-13  2020-11-26  32428.0      12.0     6180804      67.61   \n",
       "7   2020-01-02  2020-07-14  36317.0       6.0    14357765      63.09   \n",
       "8          NaN  2020-05-10  27104.0       3.0     9499396      77.42   \n",
       "9   2020-01-01  2020-04-15      NaN       3.0    10945558      57.54   \n",
       "10  2020-01-01  2020-10-27      NaN       6.0     5714925      82.43   \n",
       "11  2020-01-01  2020-07-02      NaN       3.0    12273151      58.25   \n",
       "\n",
       "   high_spec mba_spec  second_perc gender  degree_perc  placed  \\\n",
       "0    Science  Mkt&Fin        75.01      M        75.76    True   \n",
       "1   Commerce   Mkt&HR        73.03      M        67.27    True   \n",
       "2       Arts  Mkt&Fin        82.09      F        71.97    True   \n",
       "3   Commerce  Mkt&Fin        62.04      M        65.32    True   \n",
       "4   Commerce  Mkt&Fin        53.53      M        51.08    True   \n",
       "5    Science   Mkt&HR        87.34      M        74.10    True   \n",
       "6   Commerce   Mkt&HR        49.94      M        72.79    True   \n",
       "7   Commerce  Mkt&Fin        86.17      M        83.25    True   \n",
       "8    Science   Mkt&HR        71.74      F        66.19   False   \n",
       "9    Science   Mkt&HR        57.63      F        72.51    True   \n",
       "10   Science  Mkt&Fin        68.14      M        76.55    True   \n",
       "11  Commerce  Mkt&Fin        65.04      M        61.35    True   \n",
       "\n",
       "    experience_years  employability_perc  mba_perc  work_experience  \\\n",
       "0                  1               85.98     58.37             True   \n",
       "1                  1               91.44     65.12            False   \n",
       "2                  1               62.18     71.15             True   \n",
       "3                  1               61.87     58.90            False   \n",
       "4                  1               58.65     56.32            False   \n",
       "5                  1               64.24     68.55            False   \n",
       "6                  1               86.51     69.26            False   \n",
       "7                  1               71.89     75.90            False   \n",
       "8                  1               95.38     59.49            False   \n",
       "9                  1               86.40     60.99             True   \n",
       "10                 1               95.86     67.78            False   \n",
       "11                 1               65.73     55.15             True   \n",
       "\n",
       "   degree_type  \n",
       "0     Sci&Tech  \n",
       "1    Comm&Mgmt  \n",
       "2    Comm&Mgmt  \n",
       "3    Comm&Mgmt  \n",
       "4     Sci&Tech  \n",
       "5     Sci&Tech  \n",
       "6     Sci&Tech  \n",
       "7     Sci&Tech  \n",
       "8     Sci&Tech  \n",
       "9    Comm&Mgmt  \n",
       "10    Sci&Tech  \n",
       "11   Comm&Mgmt  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "synthetic_data"
   ]
  },
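  {
   "cell_type": "markdown",
   "id": "c4d2a7e1-3b5f-4a68-9e0d-7f1b2c3d4e5a",
   "metadata": {},
   "source": [
    "The sampled rows above contain missing values (for example, `NaN` in `salary` or `end_date`), and this table later serves as the ground truth. As a minimal sketch, the cell below counts the missing values per column so that these fields can be kept in mind when the generated Evals are validated. The helper name `missing_value_report` is illustrative and is not part of SDV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7b9f0a2-6c1d-4e83-b5a4-0d2c3e4f5a6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical helper (not part of SDV): count missing values per column\n",
    "def missing_value_report(df: pd.DataFrame) -> pd.Series:\n",
    "    counts = df.isna().sum()\n",
    "    # Keep only the columns that actually have missing values, largest count first\n",
    "    return counts[counts > 0].sort_values(ascending=False)\n",
    "\n",
    "\n",
    "missing_value_report(synthetic_data)"
   ]
  },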
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "d17f0fcf-4380-459f-b901-52e63be3b8d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "# Make sure the output directory exists, then save the DataFrame to a CSV file\n",
    "os.makedirs('generated_data', exist_ok=True)\n",
    "synthetic_data.to_csv('generated_data/tabular_data.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b63773e2-a599-410b-8955-85c9a79242db",
   "metadata": {},
   "source": [
    "## Load pre-generated synthetic tabular data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8fed0fe7-3e34-4dec-850d-2af119bc51ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "# Read the CSV file into a DataFrame\n",
    "synthetic_data = pd.read_csv('generated_data/tabular_data.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45652011-8bdb-4e03-8ae3-87cd92714b70",
   "metadata": {},
   "source": [
    "## Synthetic Data Generation with Llama-3.3-70B-Instruct\n",
    "\n",
    "In this section, we use `Llama-3.3-70B-Instruct` to create a story from the tabular data and then generate an extractive summary report from the generated context.\n",
    "You could try `Llama-3.1-8B-Instruct`, but we have seen better results with the 70B model for generating synthetic data.\n",
    "\n",
    "### Alternate approach\n",
    "\n",
    "In the section below, we treat the tabular data as the ground truth and generate all the contexts and reports from the table. Another approach is to provide a couple of examples as few-shot prompts, ask Llama to generate the context and report from them while varying the factual information, and then use Llama to create the ground-truth tabular data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bc314704-caf4-42b0-962d-edbea2fe6b3c",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/agunapal/anaconda3/envs/torchtune/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "Loading checkpoint shards: 100%|██████████| 30/30 [19:53<00:00, 39.79s/it]\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "model_id: str = \"meta-llama/Llama-3.3-70B-Instruct\"\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_id, device_map=\"auto\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "37595ef4-5872-4755-af8a-0bc5bbda3554",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prompt for generating context from synthetic tabular data\n",
    "\n",
    "story_teller = lambda index: f\"\"\"You are expert Story Teller. \n",
    "Look at the following data and tell a story in the form of a progress report. \n",
    "This report should have a sub-section for the following:\n",
    "- Academic Background\n",
    "- Career Aspirations\n",
    "- Salary Expectations\n",
    "- Placement Status\n",
    "- Course Details\n",
    "- Story Behind the Numbers\n",
    "\n",
    "Be creative and make up story with other statistics\n",
    "\n",
    "\n",
    "{synthetic_data.loc[[synthetic_data.index[index]]]}\n",
    "\n",
    "- DO NOT create another table. \n",
    "- DO NOT ask any clarifying questions\n",
    "- DO NOT justify your answers\n",
    "- Make sure each column has a subheading in the report\n",
    "- Each of the sections should have the respective tag. \n",
    "- All currency is in tokens\n",
    "- Answer within 800 tokens\n",
    "\n",
    "Example:\n",
    "<academic_background>\n",
    "Student 17264 has a commendable academic record, which is evident from his second-year and high school percentage scores. His second-year percentage stands at 67%, while his high school percentage is an impressive 91%. He has also shown a keen interest in commerce, with a degree percentage of 58%. His academic background is a testament to his hard work and dedication to his studies.\n",
    "</academic_background>\n",
    "\n",
    "Answer:\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8806dacb-d03b-47f1-8363-36f3c8e88365",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prompt for generating report from the generated context\n",
    "\n",
    "report_creator = lambda context: f\"\"\" You are an expert report creator.\n",
    "Look at the data in context: \n",
    "{context}\n",
    "\n",
    "and generate a shortened report with 1 line with the following subsections:\n",
    "- student_id\n",
    "- degree_type\n",
    "- salary\n",
    "- mba_spec\n",
    "- duration\n",
    "- employability_perc\n",
    "\n",
    "IMPORTANT:\n",
    "- DO NOT ask any clarifying questions\n",
    "- DO NOT justify your answers\n",
    "- DO NOT show the data \n",
    "- DO NOT write any python code\n",
    "- Each of the sections should have the respective tag and should be shown ONLY once\n",
    "- Make sure to copy the mba_spec & degree_type as is\n",
    "\n",
    "Example:\n",
    "\n",
    "Summary Report:\n",
    "\n",
    "<student_id>\n",
    "Student ID is 17269\n",
    "<student_id>\n",
    "\n",
    "<salary>\n",
    "Student has a realistic salary expectation of 27,000 tokens per month\n",
    "<salary>\n",
    "\n",
    "<degree_type>\n",
    "Student has a degree in Sci&Tech\n",
    "<degree_type>\n",
    "\n",
    "<mba_spec>\n",
    "Student has a specialization in Mkt&Fin\n",
    "<mba_spec>\n",
    "\n",
    "<duration>\n",
    "Student has a degree duration of 4 years\n",
    "<duration>\n",
    "\n",
    "<employability_perc>\n",
    "Student has a 95.0% employability percentage\n",
    "<employability_perc>\n",
    "\n",
    "Answer:\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8620c9e-0641-4ca7-9445-71a22de9f494",
   "metadata": {},
   "source": [
    "### Generate 12 examples of synthetic data using this loop\n",
    "\n",
    "Why 12?: We will use 2 examples for few shot prompting and the rest 10 for Evals.\n",
    "\n",
    "In practice, you want the number of data points to be much higher for your production application"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "44fa3c0b-9d2c-4dcb-bcf8-879d17c71512",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import json\n",
    "\n",
    "for i in range(12):\n",
    "\n",
    "    formatted_prompt = story_teller(i)\n",
    "    input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
    "    \n",
    "    # Generate context from tabular data\n",
    "    output = model.generate(**input, max_new_tokens=800, pad_token_id=0, temperature=0.8)\n",
    "    prompt_len = input[\"input_ids\"].shape[-1]\n",
    "    context = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
    "\n",
    "    formatted_prompt = report_creator(context)\n",
    "    input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
    "\n",
    "    # Generate report from generated report\n",
    "    output = model.generate(**input, max_new_tokens=120, pad_token_id=0,)\n",
    "    prompt_len = input[\"input_ids\"].shape[-1]\n",
    "    report = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
    "\n",
    "    # Create json output\n",
    "    result = {}\n",
    "    result[\"context\"] = context\n",
    "    result[\"report\"] = report\n",
    "    with open(f'generated_data/data_{i}.json', 'w') as f:\n",
    "        json.dump(result, f, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06e81b1a-e4da-4b67-97c1-0544af2eb104",
   "metadata": {},
   "source": [
    "## Example Context & Report\n",
    "\n",
    "By manual inspection we see that, Llama has created well structured context and the corresponding report. We also see that all the factual information is correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a6fd5a95-8a73-4a0e-bd28-bf6291af3827",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Context is -------------------------\n",
      "\n",
      "### Progress Report for Student 3040587\n",
      "#### <academic_background>\n",
      "Student 3040587 has a strong academic foundation, with a high school percentage of 66.62% in Science. He also holds a degree in Science and Technology with a percentage of 75.76%. His second-year percentage is 75.01%, demonstrating his consistent academic performance. \n",
      "#### <career_aspirations>\n",
      "With a specialization in Marketing and Finance, Student 3040587 aspires to pursue a career in the finance sector, leveraging his skills in market analysis and financial planning. His career goal is to become a financial analyst, with a focus on investment banking. \n",
      "#### <salary_expectations>\n",
      "Student 3040587 expects a starting salary of 5000 tokens per annum, considering his one year of work experience and academic achievements. He is confident that his skills and knowledge will enable him to secure a job with a reputable company. \n",
      "#### <placement_status>\n",
      "Student 3040587 has been successfully placed, with an employability percentage of 85.98%. His placement is a testament to his hard work and dedication to his studies, as well as his relevant work experience. \n",
      "#### <course_details>\n",
      "Student 3040587 is currently pursuing an MBA with a specialization in Marketing and Finance, with a course duration of 3 years. He has completed one year of the course, with an MBA percentage of 58.37%. \n",
      "#### <story_behind_the_numbers>\n",
      "Behind the numbers, Student 3040587's story is one of perseverance and determination. Despite facing challenges in his academic journey, he has consistently worked hard to achieve his goals. His work experience has equipped him with the skills and knowledge required to succeed in the finance sector. With his strong academic background, career aspirations, and relevant work experience, Student 3040587 is poised to achieve great things in his future career. \n",
      "### End of Report 3040587\n",
      "\n",
      "\n",
      "Report is -------------------------\n",
      "\n",
      "Summary Report:\n",
      "\n",
      "<student_id>\n",
      "Student 3040587\n",
      "<student_id>\n",
      "\n",
      "<salary>\n",
      "Student has a realistic salary expectation of 5000 tokens per annum\n",
      "<salary>\n",
      "\n",
      "<degree_type>\n",
      "Student has a degree in Science and Technology\n",
      "<degree_type>\n",
      "\n",
      "<mba_spec>\n",
      "Student has a specialization in Marketing and Finance\n",
      "<mba_spec>\n",
      "\n",
      "<duration>\n",
      "Student has a degree duration of 3 years\n",
      "<duration>\n",
      "\n",
      "<employability_perc>\n",
      "Student has a 85.98% employability percentage\n",
      "<employability_perc>\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "def read_json_file(file_path):\n",
    "    try:\n",
    "        with open(file_path, 'r') as file:\n",
    "            data = json.load(file)\n",
    "            return data\n",
    "    except FileNotFoundError:\n",
    "        print(f\"File not found: {file_path}\")\n",
    "        return None\n",
    "    except json.JSONDecodeError as e:\n",
    "        print(f\"Invalid JSON: {e}\")\n",
    "        return None\n",
    "# Example usage:\n",
    "file_path = 'generated_data/data_0.json'\n",
    "data = read_json_file(file_path)\n",
    "print(\"Context is -------------------------\\n\")\n",
    "print(data[\"context\"])\n",
    "print(\"\\nReport is -------------------------\\n\")      \n",
    "print(data[\"report\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "6a90b297-67a2-457e-99a6-658546a96af1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>start_date</th>\n",
       "      <th>end_date</th>\n",
       "      <th>salary</th>\n",
       "      <th>duration</th>\n",
       "      <th>student_id</th>\n",
       "      <th>high_perc</th>\n",
       "      <th>high_spec</th>\n",
       "      <th>mba_spec</th>\n",
       "      <th>second_perc</th>\n",
       "      <th>gender</th>\n",
       "      <th>degree_perc</th>\n",
       "      <th>placed</th>\n",
       "      <th>experience_years</th>\n",
       "      <th>employability_perc</th>\n",
       "      <th>mba_perc</th>\n",
       "      <th>work_experience</th>\n",
       "      <th>degree_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2020-01-10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3040587</td>\n",
       "      <td>66.62</td>\n",
       "      <td>Science</td>\n",
       "      <td>Mkt&amp;Fin</td>\n",
       "      <td>75.01</td>\n",
       "      <td>M</td>\n",
       "      <td>75.76</td>\n",
       "      <td>True</td>\n",
       "      <td>1</td>\n",
       "      <td>85.98</td>\n",
       "      <td>58.37</td>\n",
       "      <td>True</td>\n",
       "      <td>Sci&amp;Tech</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0  start_date end_date  salary  duration  student_id  high_perc  \\\n",
       "0           0  2020-01-10      NaN     NaN       3.0     3040587      66.62   \n",
       "\n",
       "  high_spec mba_spec  second_perc gender  degree_perc  placed  \\\n",
       "0   Science  Mkt&Fin        75.01      M        75.76    True   \n",
       "\n",
       "   experience_years  employability_perc  mba_perc  work_experience degree_type  \n",
       "0                 1               85.98     58.37             True    Sci&Tech  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "synthetic_data.loc[[synthetic_data.index[0]]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84609a64-0842-49f0-a3a5-3ad94c3f9b5d",
   "metadata": {},
   "source": [
    "## Important!!!!! Verification by human!\n",
    "\n",
    "At this point, ideally you need a human to look at the synthetic data that you have generated and fix any errors in the formatting or factual information or be aware of the number of errors in the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f9c4b0a-0d15-4140-8a18-7c3710ab71f2",
   "metadata": {},
   "source": [
    "## Measuring Hallucinations\n",
    "\n",
    "The usual method to measure hallucinations uses LLM-As-Judge methodology. An example hallucination metric is using [DeepEval](https://www.deepeval.com/docs/metrics-hallucination).\n",
    "This would use a powerful LLM as the ground truth to measure hallucinations.\n",
    "\n",
    "The below section shows a way to measure hallucinations using the ground truth data that we have (tabular data). The methodology is to make use of the tags that we have added in the report and use Llama to answer simple questions looking at the corresponding sections. Llama compares the answers with the ground truth and generates a list of boolean values. This is then used to measure accuracy of the factual information in the report. If your report has a well defined structure, using QA to measure hallucinations can be highly effective and cost efficient"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cef3d897-5d3a-44d8-a04b-a5b56ab19fee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the first 2 data points for few shot prompting\n",
    "file_path = 'generated_data_faulty/data_0.json'\n",
    "example_data = read_json_file(file_path)\n",
    "\n",
    "file_path = 'generated_data_faulty/data_1.json'\n",
    "example_data_1 = read_json_file(file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c35d758-3d7b-4512-9bdf-8055b9bff3e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "check_hallucinations = lambda data,index: f\"\"\"You are a Helpful Assistant. \n",
    "Look at the section called Generated Report below & answer the following questions by only looking\n",
    "at the corresponding sections in the report\n",
    "- student_id: Question : What is the student id?\n",
    "- degree_type : Question : What is the degree_type?\n",
    "- salary: Question : What is the salary?\n",
    "- mba_spec: Question : What is the mba_spec?\n",
    "- duration: Question : What is the duration?\n",
    "- employability_perc: Question : What is the employability percentage?\n",
    "\n",
    "Generated Report:\n",
    "{data[\"report\"]}\n",
    "\n",
    "Compare your answers with the ground truth and return either True or False within the tags <answer> & </answer>\n",
    "Only if an answer is False, explain why in the format shown in the examples below\n",
    "\n",
    "Ground Truth:\n",
    "{synthetic_data.loc[[synthetic_data.index[index]]]}\n",
    "\n",
    "\n",
    "\n",
    "Important Notes:\n",
    "- Only check for the above mentioned questions\n",
    "- Make sure each of the section is shown ONLY once\n",
    "- DO NOT reason or explain your process\n",
    "- DO NOT code this\n",
    "- DO NOT explain why something is True\n",
    "- Be lenient when checking decimal points. Ex: 4.0 is the same as 4\n",
    "\n",
    "Example:\n",
    "1)\n",
    "With the following report:\n",
    "\n",
    "{example_data[\"report\"]}\n",
    "\n",
    "and the ground truth:\n",
    "{synthetic_data.loc[[synthetic_data.index[0]]]}\n",
    "\n",
    "the following output is expected:\n",
    "\n",
    "<answer>\n",
    "student_id: [False, report shows 17263 and ground truth says 17264]\n",
    "degree_type: [True, None]\n",
    "salary: [False, report says 28000 and ground truth says 27000]\n",
    "mba_spec: [True, None]\n",
    "duration: [True, None]\n",
    "employability_perc: [True, None]\n",
    "</answer>\n",
    "\n",
    "2) \n",
    "With the following report:\n",
    "\n",
    "{example_data_1[\"report\"]}\n",
    "\n",
    "and the ground truth:\n",
    "{synthetic_data.loc[[synthetic_data.index[1]]]}\n",
    "\n",
    "the following output is expected:\n",
    "\n",
    "<answer>\n",
    "student_id: [True, None]\n",
    "degree_type: [True, None]\n",
    "salary: [True, None]\n",
    "mba_spec: [True, None]\n",
    "duration: [True, None]\n",
    "employability_perc: [True, None]\n",
    "</answer>\n",
    "\n",
    "Answer:\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "bd532640-db12-46e9-aec2-7609dfe1e72c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_output(output):\n",
    "    \"\"\"\n",
    "    Parse the output and return a list of bool values\n",
    "    \"\"\"\n",
    "    lines = output.strip().splitlines()\n",
    "    bool_values = []\n",
    "    for line in lines:\n",
    "        # Skip empty lines, lines with tags, or lines starting with '<'\n",
    "        if not line or line.startswith('<') or line.endswith('>'):\n",
    "            continue\n",
    "        parts = line.split(': ')\n",
    "        if len(parts) != 2:\n",
    "            raise ValueError(f\"Invalid line format: {line}\")\n",
    "        value_str, _ = parts[1].strip('[]').split(', ')\n",
    "        if value_str == 'True':\n",
    "            bool_values.append(True)\n",
    "        elif value_str == 'False':\n",
    "            bool_values.append(False)\n",
    "        else:\n",
    "            raise ValueError(f\"Invalid bool value: {value_str}\")\n",
    "        if parts[0] == 'employability_perc':\n",
    "            break\n",
    "    return bool_values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2ee3a24-bb64-4f3a-a547-5e976bbf817b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Checking accuracy of generated report in generated_data/data_2.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_3.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_4.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_5.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_6.json\n",
      "\n",
      "<answer>\n",
      "student_id: [False, report shows Student ID is not mentioned in the data and ground truth says 6180804]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_7.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [False, report says Mkt&Fin and ground truth says Sci&Tech]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_8.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_9.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_10.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [True, None]\n",
      "salary: [True, None]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Checking accuracy of generated report in generated_data/data_11.json\n",
      "\n",
      "<answer>\n",
      "student_id: [True, None]\n",
      "degree_type: [False, report says Commerce and ground truth says Comm&Mgmt]\n",
      "salary: [False, report says 500 and ground truth says NaN]\n",
      "mba_spec: [True, None]\n",
      "duration: [True, None]\n",
      "employability_perc: [True, None]\n",
      "</answer>\n",
      "\n",
      "Accuracy of factual information generation is : 0.9333\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "y_pred = []\n",
    "y_true = [True]*60\n",
    "for i in range(2,12):\n",
    "    fname = f'generated_data/data_{i}.json'\n",
    "    print(f\"\\nChecking accuracy of generated report in {fname}\\n\")\n",
    "    data = read_json_file(fname)\n",
    "    \n",
    "    formatted_prompt = check_hallucinations(data, i)\n",
    "\n",
    "    input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
    "    \n",
    "    output = model.generate(**input, max_new_tokens=120, pad_token_id=0, do_sample=False, top_p=None, temperature=None)\n",
    "    prompt_len = input[\"input_ids\"].shape[-1]\n",
    "    results = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)\n",
    "    print(results)\n",
    "    y_pred.extend(parse_output(results))\n",
    "accuracy = accuracy_score(y_true, y_pred)\n",
    "print(f\"\\nAccuracy of factual information generation is : {accuracy:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe3e87f5-e04d-472a-873c-675951459334",
   "metadata": {},
   "source": [
    "# Conclusion & Next Steps\n",
    "\n",
    "- Creating Evals for summarization is important\n",
    "- Llama can be used to create evals given few samples of ground truth\n",
    "- Using simple QA to measure hallucinations can be an effective strategy to be be confident that important factual information is being verified "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}