# Creating Evals with synthetic data and measuring hallucinations

When you deploy Llama for your use case, it is good practice to have Evals for it. Human-annotated Evals are ideal, but this notebook shows a strategy for building them from synthetic data instead. The generated Evals still require validation by a human before your production use case relies on them.
The notebook also shows how one could accurately measure hallucinations using Llama, without the LLM-As-A-Judge methodology.

## Overall idea

Let's assume we have a use case for generating a summarization report from a given context, which is a common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report is not hallucinating.

Since it's not trivial to find an open source dataset for this, the idea is to take synthetic tabular data and use Llama to generate a story (the context) for every row through prompt engineering. We then ask Llama to summarize the generated context as a report in a specific format, again through prompt engineering. Finally, we check the factual accuracy of the generated report with Llama by converting the check into a QA task that uses the tabular data as the ground truth.

To generate synthetic data for this approach, we use an open source tool, [Synthetic Data Vault](https://github.com/sdv-dev/SDV).

The overall workflow is shown in the diagram below.

![Workflow](./Workflow_Diagram.png)

### Synthetic Data Vault installation

In [None]:
!pip install sdv

SDV ships with a number of single-table demo datasets. We choose the `student_placements` dataset for this notebook.

In [12]:
from sdv.datasets.demo import get_available_demos

get_available_demos(modality='single_table')

Unnamed: 0,dataset_name,size_MB,num_tables
0,KRK_v1,0.06,1
1,adult,3.91,1
2,alarm,4.52,1
3,asia,1.28,1
4,census,98.17,1
5,census_extended,4.95,1
6,child,3.2,1
7,covtype,255.65,1
8,credit,68.35,1
9,expedia_hotel_logs,0.2,1


In [13]:
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='student_placements')

In [28]:
real_data.head(5)

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,,,
4,17268,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


#### Generate synthetic data from real data

In [17]:
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

In [18]:
synthetic_data = synthesizer.sample(num_rows=12)

In [19]:
synthetic_data

Unnamed: 0,start_date,end_date,salary,duration,student_id,high_perc,high_spec,mba_spec,second_perc,gender,degree_perc,placed,experience_years,employability_perc,mba_perc,work_experience,degree_type
0,2020-01-10,,,3.0,3040587,66.62,Science,Mkt&Fin,75.01,M,75.76,True,1,85.98,58.37,True,Sci&Tech
1,,2020-11-07,39320.0,,5940200,81.61,Commerce,Mkt&HR,73.03,M,67.27,True,1,91.44,65.12,False,Comm&Mgmt
2,2020-02-21,2020-07-08,36408.0,3.0,13408830,62.71,Arts,Mkt&Fin,82.09,F,71.97,True,1,62.18,71.15,True,Comm&Mgmt
3,2020-01-30,2020-09-29,36591.0,3.0,16186310,51.0,Commerce,Mkt&Fin,62.04,M,65.32,True,1,61.87,58.9,False,Comm&Mgmt
4,2020-01-16,,33032.0,,2086931,67.04,Commerce,Mkt&Fin,53.53,M,51.08,True,1,58.65,56.32,False,Sci&Tech
5,2020-07-20,,31536.0,3.0,6414765,80.3,Science,Mkt&HR,87.34,M,74.1,True,1,64.24,68.55,False,Sci&Tech
6,2020-02-13,2020-11-26,32428.0,12.0,6180804,67.61,Commerce,Mkt&HR,49.94,M,72.79,True,1,86.51,69.26,False,Sci&Tech
7,2020-01-02,2020-07-14,36317.0,6.0,14357765,63.09,Commerce,Mkt&Fin,86.17,M,83.25,True,1,71.89,75.9,False,Sci&Tech
8,,2020-05-10,27104.0,3.0,9499396,77.42,Science,Mkt&HR,71.74,F,66.19,False,1,95.38,59.49,False,Sci&Tech
9,2020-01-01,2020-04-15,,3.0,10945558,57.54,Science,Mkt&HR,57.63,F,72.51,True,1,86.4,60.99,True,Comm&Mgmt


In [91]:
# Save the DataFrame to a CSV file
synthetic_data.to_csv('generated_data/tabular_data.csv')

## Load pre-generated synthetic tabular data

In [2]:
import pandas as pd
# Read the CSV file into a DataFrame
synthetic_data = pd.read_csv('generated_data/tabular_data.csv')
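
A note on this round trip: `to_csv` writes the row index by default, so reading the file back adds an `Unnamed: 0` column (it shows up in the ground-truth row printed later in this notebook). The sketch below uses a small hypothetical frame and an in-memory buffer instead of the real file, and shows two ways to avoid the extra column. It is an illustration only, not a change to the saved data.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for synthetic_data
df = pd.DataFrame({"student_id": [1, 2], "salary": [100.0, 200.0]})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
default_cols = list(pd.read_csv(buf).columns)

# Option 1: do not write the index at all
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index_cols = list(pd.read_csv(buf).columns)

# Option 2: keep the file as-is and restore the index when reading
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored_cols = list(pd.read_csv(buf, index_col=0).columns)

print(default_cols)
print(no_index_cols)
print(restored_cols)
```

Either option keeps index bookkeeping columns out of the ground truth that is later pasted into prompts.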

## Synthetic Data Generation with Llama-3.3-70B-Instruct

In this section, we use `Llama-3.3-70B-Instruct` to create a story from the tabular data and then generate an extractive summary report from the generated context.
You could try `Llama-3.1-8B-Instruct`, but we have seen better results with the 70B model for generating synthetic data.

### Alternate approach

In the section below, we choose the tabular data as the ground truth and generate all the contexts and reports from the table. Another approach is to give Llama a couple of examples as few-shot prompts, ask it to generate new contexts and reports while varying the factual information, and then use Llama to create the ground truth tabular data from them.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id: str = "meta-llama/Llama-3.3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

Loading checkpoint shards: 100%|██████████| 30/30 [19:53<00:00, 39.79s/it]


In [3]:
# Prompt for generating context from synthetic tabular data

story_teller = lambda index: f"""You are expert Story Teller. 
Look at the following data and tell a story in the form of a progress report. 
This report should have a sub-section for the following:
- Academic Background
- Career Aspirations
- Salary Expectations
- Placement Status
- Course Details
- Story Behind the Numbers

Be creative and make up story with other statistics


{synthetic_data.loc[[synthetic_data.index[index]]]}

- DO NOT create another table. 
- DO NOT ask any clarifying questions
- DO NOT justify your answers
- Make sure each column has a subheading in the report
- Each of the sections should have the respective tag. 
- All currency is in tokens
- Answer within 800 tokens

Example:
<academic_background>
Student 17264 has a commendable academic record, which is evident from his second-year and high school percentage scores. His second-year percentage stands at 67%, while his high school percentage is an impressive 91%. He has also shown a keen interest in commerce, with a degree percentage of 58%. His academic background is a testament to his hard work and dedication to his studies.
</academic_background>

Answer:
"""
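
The f-string above embeds the printed form of a one-row DataFrame, which pandas may wrap over several lines for wide tables. One possible alternative, sketched below under that assumption, is to render the row as `column: value` lines so every value sits next to its column name. The helper `row_to_prompt_block` is hypothetical and is not used elsewhere in this notebook; with pandas, the input dict could come from `synthetic_data.iloc[index].to_dict()`.

```python
def row_to_prompt_block(row: dict) -> str:
    """Render one table row as 'column: value' lines for use inside a prompt."""
    lines = []
    for column, value in row.items():
        # NaN is the only value that is not equal to itself; show it as 'missing'
        is_missing = value is None or value != value
        lines.append(f"{column}: {'missing' if is_missing else value}")
    return "\n".join(lines)


# Hypothetical row with a few of the columns from the table
example_row = {
    "student_id": 3040587,
    "degree_type": "Sci&Tech",
    "salary": float("nan"),
    "duration": 3.0,
}
block = row_to_prompt_block(example_row)
print(block)
```

Marking missing values explicitly is a design choice: it tells the model that a field is absent rather than leaving it to interpret `NaN`.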

In [4]:
# Prompt for generating report from the generated context

report_creator = lambda context: f""" You are an expert report creator.
Look at the data in context: 
{context}

and generate a shortened report with 1 line with the following subsections:
- student_id
- degree_type
- salary
- mba_spec
- duration
- employability_perc

IMPORTANT:
- DO NOT ask any clarifying questions
- DO NOT justify your answers
- DO NOT show the data 
- DO NOT write any python code
- Each of the sections should have the respective tag and should be shown ONLY once
- Make sure to copy the mba_spec & degree_type as is

Example:

Summary Report:

<student_id>
Student ID is 17269
</student_id>

<salary>
Student has a realistic salary expectation of 27,000 tokens per month
</salary>

<degree_type>
Student has a degree in Sci&Tech
</degree_type>

<mba_spec>
Student has a specialization in Mkt&Fin
</mba_spec>

<duration>
Student has a degree duration of 4 years
</duration>

<employability_perc>
Student has a 95.0% employability percentage
</employability_perc>

Answer:

"""

### Generate 12 examples of synthetic data using this loop

Why 12? We use 2 examples for few-shot prompting and the remaining 10 for Evals.

In practice, you want many more data points for a production application.

In [75]:
import json

for i in range(12):

    formatted_prompt = story_teller(i)
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    
    # Generate context from tabular data
    output = model.generate(**inputs, max_new_tokens=800, pad_token_id=0, temperature=0.8)
    # Decode only the newly generated tokens, not the prompt
    prompt_len = inputs["input_ids"].shape[-1]
    context = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    formatted_prompt = report_creator(context)
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

    # Generate report from generated context
    output = model.generate(**inputs, max_new_tokens=120, pad_token_id=0,)
    prompt_len = inputs["input_ids"].shape[-1]
    report = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    # Save the context and report together as JSON
    result = {}
    result["context"] = context
    result["report"] = report
    with open(f'generated_data/data_{i}.json', 'w') as f:
        json.dump(result, f, indent=4)

## Example Context & Report

By manual inspection, we see that Llama has created a well-structured context and the corresponding report. We also see that all the factual information is correct.

In [16]:
import json
def read_json_file(file_path):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
            return data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None
# Example usage:
file_path = 'generated_data/data_0.json'
data = read_json_file(file_path)
print("Context is -------------------------\n")
print(data["context"])
print("\nReport is -------------------------\n")      
print(data["report"])

Context is -------------------------

### Progress Report for Student 3040587
#### <academic_background>
Student 3040587 has a strong academic foundation, with a high school percentage of 66.62% in Science. He also holds a degree in Science and Technology with a percentage of 75.76%. His second-year percentage is 75.01%, demonstrating his consistent academic performance. 
#### <career_aspirations>
With a specialization in Marketing and Finance, Student 3040587 aspires to pursue a career in the finance sector, leveraging his skills in market analysis and financial planning. His career goal is to become a financial analyst, with a focus on investment banking. 
#### <salary_expectations>
Student 3040587 expects a starting salary of 5000 tokens per annum, considering his one year of work experience and academic achievements. He is confident that his skills and knowledge will enable him to secure a job with a reputable company. 
#### <placement_status>
Student 3040587 has been successfully 

In [17]:
synthetic_data.loc[[synthetic_data.index[0]]]

Unnamed: 0.1,Unnamed: 0,start_date,end_date,salary,duration,student_id,high_perc,high_spec,mba_spec,second_perc,gender,degree_perc,placed,experience_years,employability_perc,mba_perc,work_experience,degree_type
0,0,2020-01-10,,,3.0,3040587,66.62,Science,Mkt&Fin,75.01,M,75.76,True,1,85.98,58.37,True,Sci&Tech


## Important: verification by a human

At this point, you ideally need a human to look at the synthetic data you have generated and fix any errors in formatting or factual information. At a minimum, you should be aware of how many errors the dataset contains.
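
A cheap structural pre-check can help a reviewer decide where to look first. The sketch below is an assumption about what such a check could look like, not part of the original workflow: it counts the tag markers for each expected section and flags reports where a section is missing or its markers are unbalanced (for example, because generation was cut off). It accepts both `<tag>` and `</tag>` as markers. The names `EXPECTED_TAGS` and `find_format_issues` are hypothetical.

```python
import re

EXPECTED_TAGS = [
    "student_id", "degree_type", "salary",
    "mba_spec", "duration", "employability_perc",
]


def find_format_issues(report: str) -> list:
    """Return human-readable formatting issues found in one generated report."""
    issues = []
    for tag in EXPECTED_TAGS:
        # Match both <tag> and </tag>; a section shown once has exactly two markers
        markers = re.findall(rf"</?{tag}>", report)
        if not markers:
            issues.append(f"{tag}: section missing")
        elif len(markers) != 2:
            issues.append(f"{tag}: expected 2 tag markers, found {len(markers)}")
    return issues


# Hypothetical reports: one well formed, one with a dropped and a dangling section
good_report = "\n".join(f"<{t}>\nvalue\n</{t}>" for t in EXPECTED_TAGS)
bad_report = good_report.replace("<salary>\nvalue\n</salary>", "") + "\n<duration>\nvalue"

good_issues = find_format_issues(good_report)
bad_issues = find_format_issues(bad_report)
print(good_issues)
print(bad_issues)
```

This only checks structure. Whether the values inside each section are factually right still needs a human or the QA check described in the next section.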

## Measuring Hallucinations

The usual method for measuring hallucinations is the LLM-As-A-Judge methodology. One example is the hallucination metric in [DeepEval](https://www.deepeval.com/docs/metrics-hallucination).
This approach uses a powerful LLM as the ground truth to measure hallucinations.

The section below shows a way to measure hallucinations using the ground truth we already have (the tabular data). The method relies on the tags we added to the report: Llama answers simple questions by looking only at the corresponding sections, compares its answers with the ground truth, and generates a list of boolean values. That list is then used to measure the accuracy of the factual information in the report. If your report has a well-defined structure, using QA to measure hallucinations can be highly effective and cost efficient.
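
To make the role of the tags concrete, here is a minimal sketch of isolating one section of a report by its tag. In this notebook Llama does the lookup itself from the prompt, so this helper is only an illustration. The name `extract_section` is hypothetical, and the pattern assumes the closing marker may be written either as `</tag>` or as a repeated `<tag>`.

```python
import re
from typing import Optional


def extract_section(report: str, tag: str) -> Optional[str]:
    """Return the text between the first pair of markers for `tag`, or None."""
    # Non-greedy match up to the next marker, written as <tag> or </tag>
    pattern = rf"<{tag}>(.*?)</?{tag}>"
    match = re.search(pattern, report, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical report fragment mixing both closing-marker styles
report = """<student_id>
Student ID is 17269
<student_id>

<salary>
Student has a realistic salary expectation of 27,000 tokens per month
</salary>
"""

student_section = extract_section(report, "student_id")
salary_section = extract_section(report, "salary")
missing_section = extract_section(report, "duration")
print(student_section)
print(salary_section)
print(missing_section)
```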

In [7]:
# Use the first 2 data points for few shot prompting
file_path = 'generated_data_faulty/data_0.json'
example_data = read_json_file(file_path)

file_path = 'generated_data_faulty/data_1.json'
example_data_1 = read_json_file(file_path)

In [None]:
check_hallucinations = lambda data,index: f"""You are a Helpful Assistant. 
Look at the section called Generated Report below & answer the following questions by only looking
at the corresponding sections in the report
- student_id: Question : What is the student id?
- degree_type : Question : What is the degree_type?
- salary: Question : What is the salary?
- mba_spec: Question : What is the mba_spec?
- duration: Question : What is the duration?
- employability_perc: Question : What is the employability percentage?

Generated Report:
{data["report"]}

Compare your answers with the ground truth and return either True or False within the tags <answer> & </answer>
Only if an answer is False, explain why in the format shown in the examples below

Ground Truth:
{synthetic_data.loc[[synthetic_data.index[index]]]}



Important Notes:
- Only check for the above mentioned questions
- Make sure each of the section is shown ONLY once
- DO NOT reason or explain your process
- DO NOT code this
- DO NOT explain why something is True
- Be lenient when checking decimal points. Ex: 4.0 is the same as 4

Example:
1)
With the following report:

{example_data["report"]}

and the ground truth:
{synthetic_data.loc[[synthetic_data.index[0]]]}

the following output is expected:

<answer>
student_id: [False, report shows 17263 and ground truth says 17264]
degree_type: [True, None]
salary: [False, report says 28000 and ground truth says 27000]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

2) 
With the following report:

{example_data_1["report"]}

and the ground truth:
{synthetic_data.loc[[synthetic_data.index[1]]]}

the following output is expected:

<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

Answer:

"""

In [9]:
def parse_output(output):
    """
    Parse the model output and return a list of bool values, one per checked field
    """
    lines = output.strip().splitlines()
    bool_values = []
    for line in lines:
        line = line.strip()
        # Skip empty lines and tag lines such as <answer> or </answer>
        if not line or line.startswith('<') or line.endswith('>'):
            continue
        # Split on the first ': ' only, so an explanation containing ': ' does not break parsing
        parts = line.split(': ', 1)
        if len(parts) != 2:
            raise ValueError(f"Invalid line format: {line}")
        # Split on the first ', ' only, so an explanation containing commas does not break parsing
        value_str, _ = parts[1].strip('[]').split(', ', 1)
        if value_str == 'True':
            bool_values.append(True)
        elif value_str == 'False':
            bool_values.append(False)
        else:
            raise ValueError(f"Invalid bool value: {value_str}")
        # employability_perc is the last field; ignore anything the model adds after it
        if parts[0] == 'employability_perc':
            break
    return bool_values
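
The parser above returns a flat list, which is enough for an overall accuracy number. If you also want to know which fields fail most often, a variant that keeps the field names can help. The sketch below is one hypothetical way to do that with a regular expression; `LINE_PATTERN` and `parse_output_by_field` are illustrative names, and lines that do not match the expected `field: [True|False, explanation]` shape are silently skipped rather than raising.

```python
import re

# One line per field, e.g. "salary: [False, report says 28000 and ground truth says 27000]"
LINE_PATTERN = re.compile(r"^\s*(\w+)\s*:\s*\[\s*(True|False)\s*,\s*(.*)\]\s*$")


def parse_output_by_field(output: str) -> dict:
    """Map each field name to (is_correct, explanation) for lines that match."""
    results = {}
    for line in output.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # tag lines, blank lines, or unexpected text
        field, verdict, explanation = match.groups()
        explanation = explanation.strip()
        results[field] = (verdict == "True", None if explanation == "None" else explanation)
    return results


# Hypothetical model response with a comma inside the explanation
sample = """<answer>
student_id: [False, report shows 17263, but ground truth says 17264]
degree_type: [True, None]
</answer>"""

parsed = parse_output_by_field(sample)
print(parsed)
```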

In [None]:
from sklearn.metrics import accuracy_score

y_pred = []
# 10 evaluation reports x 6 checked fields; a fully correct report yields all True
y_true = [True]*60
for i in range(2,12):
    fname = f'generated_data/data_{i}.json'
    print(f"\nChecking accuracy of generated report in {fname}\n")
    data = read_json_file(fname)
    
    formatted_prompt = check_hallucinations(data, i)

    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    
    # Greedy decoding so the check is deterministic
    output = model.generate(**inputs, max_new_tokens=120, pad_token_id=0, do_sample=False, top_p=None, temperature=None)
    prompt_len = inputs["input_ids"].shape[-1]
    results = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
    print(results)
    y_pred.extend(parse_output(results))
accuracy = accuracy_score(y_true, y_pred)
print(f"\nAccuracy of factual information generation is : {accuracy:.4f}")


Checking accuracy of generated report in generated_data/data_2.json

<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

Checking accuracy of generated report in generated_data/data_3.json

<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

Checking accuracy of generated report in generated_data/data_4.json

<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

Checking accuracy of generated report in generated_data/data_5.json

<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>

Checking accuracy of generated report i
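
Because `y_true` is all `True`, the accuracy above is simply the fraction of the 60 checks (10 reports × 6 fields) that came back `True`. `accuracy_score` raises an error when `y_true` and `y_pred` differ in length, which can happen if a response is cut off or malformed. The sketch below shows one possible way to handle that, under the assumption that a missing check should count as a failure. It uses plain Python, and the helper names are hypothetical.

```python
FIELDS_PER_REPORT = 6  # student_id, degree_type, salary, mba_spec, duration, employability_perc


def pad_checks(checks: list, expected: int = FIELDS_PER_REPORT) -> list:
    """Truncate or pad one report's checks so it contributes exactly `expected` values."""
    padded = list(checks[:expected])
    # Missing checks are counted as failures
    padded.extend([False] * (expected - len(padded)))
    return padded


def factual_accuracy(per_report_checks: list) -> float:
    """Fraction of True values across all reports after padding."""
    flat = [value for checks in per_report_checks for value in pad_checks(checks)]
    return sum(flat) / len(flat) if flat else 0.0


# Hypothetical results for three reports: one clean, one with an error, one truncated
example = [
    [True] * 6,
    [True, True, False, True, True, True],
    [True, True, True],  # response cut off after three fields
]
score = factual_accuracy(example)
print(f"{score:.4f}")
```

Counting missing checks as failures is conservative: it lowers the score instead of hiding a formatting problem, so it is worth inspecting those responses by hand.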

## Conclusion & Next Steps

- Creating Evals for summarization is important.
- Llama can be used to create Evals given a few samples of ground truth.
- Using simple QA to measure hallucinations can be an effective strategy for gaining confidence that important factual information is being verified.