# Distillation with Llama 4 and Synthetic Data Kit

*Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama Community License Agreement.*

<a href="https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/getting-started/distillation/distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will walk you through [distilling](https://www.llama.com/docs/how-to-guides/distillation/) model knowledge from [Llama 4](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4) into a smaller [Llama 3.2](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/) model using synthetic training data from [Synthetic Data Kit](https://github.com/meta-llama/synthetic-data-kit). 

### The goal
The goal of this notebook is to distill knowledge from a more powerful model (Llama 4 Scout) into a smaller, less powerful model (Llama 3.2 3B).

Smaller models have several advantages when compared with larger models: they're faster to generate text, have lower time to first token, and cost less to host since they need less hardware. However, larger models tend to be generalists ‚Äì that is, they have the ability to perform a wide variety of tasks well. On specific or specialized tasks, smaller models can be just as good as the generalist, larger models. Distillation allows you to take knowledge present in a larger model and transfer it to a smaller model with a minimal drop in quality for narrow tasks.

### The data
This notebook uses air traffic control data to demonstrate tuning a model towards a specialized field. During distillation, we will fully generate pairs from scratch, because our generalist teacher model has a strong understanding of ATC phraseology. During evaluation, we will evaluate both synthetic pairs as well as actual ATC data.

We will use the [ATCO2 corpus](https://github.com/idiap/atco2-corpus/tree/main) of air traffic data, an MIT-licensed dataset that contains audio, transcriptions, and additional contextual and metadata for each interaction. For this exercise we will only use the text transcripts, and will use the small (1h) sample dataset to demonstrate how only a small amount of data is actually necessary for fine-tuning the model.

### Evaluation
To evaluate our model, we will use standard language evaluation metrics such as [perplexity](https://en.wikipedia.org/wiki/Perplexity) and accuracy. We will also use [BLEU](https://en.wikipedia.org/wiki/BLEU) (bilingual evaluation understudy) to measure similarity without requiring that the model matches exactly every word. While originally designed for machine translation, BLEU compares n-gram similarity, meaning that minor word order differences are not penalized.

## Prerequisites
#### Hardware Requirements:

- NVIDIA GPU with at least 80GB VRAM (H100, A100, or similar)
    - 8x GPU to run Llama 4 Scout and create the dataset
    - 1x GPU to distill and fine-tune the model
- 200GB+ disk space
- 64GB+ system RAM

#### Software Requirements:

- CUDA 12.x
- HuggingFace account and token
- Fast internet connection for downloading models


## Preparing your environment

In [None]:
# Install dependencies
# Some Ubuntu setups may require you to uninstall blinker if it's managed
# by the system package manager. If you see an error about blinker, try
# uninstalling it with `apt remove python3-blinker`.
!apt remove -y python3-blinker
!pip install unsloth_zoo unsloth==2025.8.9 transformers==4.55.4 nltk synthetic-data-kit -q --upgrade

## Generate the synthetic dataset
We will use the synthetic data kit to produce synthetic data to distill our model.

First, set up the VLLM server. You will need to run this in a separate terminal window
since Jupyter doesn't support long running tasks/servers. Make sure to install vLLM with
`pip install vllm`

```shell
HF_HOME=/workspace/huggingface_cache \
HF_TOKEN=$HF_TOKEN \
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8
```

Then check that the server is working properly.

In [4]:
# Test that the server is working
!synthetic-data-kit -c config.yaml system-check

Loading config from: /usr/local/lib/python3.10/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.10/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: config.yaml
Config has LLM provider set to: vllm
[1;34mEnvironment variable check:[0m
API_ENDPOINT_KEY: Not found
get_llm_provider returning: vllm
[?25l[32m vLLM server is running at [0m[4;94mhttp://localhost:8000/v1[0m
[2KAvailable models: [1m{[0m[32m'object'[0m: [32m'list'[0m, [32m'data'[0m: [1m[[0m[1m{[0m[32m'id'[0m: 
[32m'meta-llama/Llama-4-Scout-17B-16E-Instruct'[0m, [32m'object'[0m: [32m'model'[0m, [32m'created'[0m: 
[1;36m1752251909[0m, [32m'owned_by'[0m: [32m'vllm'[0m, [32m'root'[0m: 
[32m'meta-llama/Llama-4-Scout-17B-16E-Instruct'[0m, [32m'parent'[0m: [3;35mNone[0m, [32m'max_model_len'[0m: 
[1;36m8192[0m, [32m'permission'[0m: [1m[[0m[1

If the model is working correctly you should see `VLLM server is running`.

Next, we will set up our configuration file for generating the data. We will use the QA task for our task, giving an example set of data and then asking the model to create call/response pairs similar to the examples. This is slightly different than an actual QA dataset but demonstrates different tasks can fit into the general framework that synthetic data kit provides.

In [7]:
%%bash

cat > config.yaml << 'EOF'
# generation: Content generation parameters
generation:
  temperature: 0.6
  top_p: 0.95
  chunk_size: 4000
  overlap: 200
  max_tokens: 4096
  num_pairs: 25
  batch_size: 2

llm:
  # Provider selection: "vllm" or "api-endpoint"
  provider: "vllm"

# vllm: Configure VLLM server settings
vllm:
  api_base: "http://localhost:8000/v1"
  port: 8000
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct"
  max_retries: 3
  retry_delay: 1.0

# format: Export format parameters
format:
  default: "jsonl"
  include_metadata: true
  pretty_json: true

# prompts: LLM prompts for different tasks, we have
# to include all of them but we modify the QA generation
prompts:
  qa_generation: |
    Create {num_pairs} pairs of simulated ATC call/response transcripts.
    
    Rules:
    1. Use full words instead of numbers, i.e. seven thirty two not 732
    2. Include all phases of flight, first contact/handover, and ground/tower/TRACON
    3. Return JSON format only

    Here are some examples:

    {text}
    
  summary: |
    Summarize this document in 3-5 sentences, focusing on the main topic and key concepts.

  qa_rating: |
    You are a helpful JSON processor that rates question-answer pairs.
    
    Your task is to rate each pair on a scale from 1-10 and return valid JSON with added ratings.
    
    ONLY return a valid JSON array with the original pairs plus ratings. Do not include any explanations or text outside the JSON.
    
    Here are the pairs to rate:
    
    {pairs}
EOF

We also create a dataset of examples to guide the model to producing better synthetic data. We provide 20 examples to produce 500+ training examples from synthetic data kit.

In [8]:
%%bash

cat > examples.txt << 'EOF'
JetBlue Eight Three Two, cleared to Boston via LENDO Seven, maintain five thousand, one two four point eight five, squawk four two one five
Cleared to Boston via LENDO Seven, maintain five thousand, one two four point eight five, squawk four two one five, JetBlue Eight Three Two

Cessna Seven Four Romeo Tango, taxi to Runway Two Four via Alpha, hold short of Runway Two Four
Taxi Runway Two Four via Alpha, hold short Two Four, Seven Four Romeo Tango

Southwest Two Twenty-Nine, Runway One Six Right, cleared for take-off, wind one niner zero at six
Cleared for take-off One Six Right, Southwest Two Twenty-Nine

Delta Four Zero Six, contact Departure one two six point niner five
One two six point niner five, Delta Four Zero Six

FedEx Four Eight Four Heavy, climb and maintain flight level three five zero
Climb and maintain flight level three five zero, FedEx Four Eight Four Heavy

American One Eight, turn right heading zero niner zero, descend and maintain three thousand, expect ILS Runway Two Seven Left
Right heading zero niner zero, descend three thousand, expect ILS Two Seven Left, American One Eight

American One Eight, cleared to land Runway Two Seven Left, wind two five zero at one four
Cleared to land Two Seven Left, American One Eight

American One Eight, cross Runway Two Seven Right at Kilo, then taxi to Gate Alpha Four
Cross Two Seven Right at Kilo, to Alpha Four, American One Eight

Emirates One Seven Four Heavy, cleared Dubai via the LONAM Two Foxtrot departure, initial climb five thousand feet, QNH one zero zero six, squawk five three five one
Cleared Dubai via LONAM Two Foxtrot, climb five thousand feet, QNH one zero zero six, squawk five three five one, Emirates One Seven Four Heavy

Qatar Four One Six, push back and start approved, facing south
Push back and start approved, facing south, Qatar Four One Six

Ryanair Eight Four, taxi to holding point Runway Two Four via Bravo and Delta, hold short
Holding short Two Four via Bravo and Delta, Ryanair Eight Four

KLM Six Zero Three, line up and wait Runway Two Seven
Line up and wait Two Seven, KLM Six Zero Three

British Airways Two Seven, cleared to enter oceanic airspace via Track Alpha, flight level three five zero, Mach decimal eight two
Cleared Track Alpha, flight level three five zero, Mach decimal eight two, British Airways Two Seven

Air France Four Six, climb flight level three eight zero
Climb flight level three eight zero, Air France Four Six

Singapore Three One, descend to altitude six thousand feet, QNH one zero zero nine, cleared ILS approach Runway Zero Four Right via AKOMA One
Descend six thousand feet, QNH one zero zero nine, cleared ILS Zero Four Right via AKOMA One, Singapore Three One

Singapore Three One, vacate left via Alpha Seven, contact Ground one two one decimal seven five
Vacate left Alpha Seven, Ground one two one decimal seven five, Singapore Three One

Speedbird Four Niner, cleared to enter controlled airspace, proceed direct MALBY, climb altitude four thousand feet, QNH one zero one five
Direct MALBY, climb four thousand feet, QNH one zero one five, Speedbird Four Niner

Lufthansa Three Two, descend and maintain two thousand five hundred, cleared visual approach Runway One Six Left, QNH one zero one eight
Descend two thousand five hundred, cleared visual One Six Left, QNH one zero one eight, Lufthansa Three Two

Emirates One Seven Four Heavy, taxi stand Alpha Seven via Mike and Echo, contact Apron on one two two decimal four
Taxi to stand Alpha Seven via Mike and Echo, one two two decimal four, Emirates One Seven Four Heavy

Air Canada Eight Eight, Runway Two Four, cleared to land, wind two six zero degrees at eight knots
Cleared to land Runway Two Four, Air Canada Eight Eight

EOF

We create our synthetic dataset using synthetic-data-kit, running the command in batches in order to create enough examples. This is because weaker models have issues generating large numbers of examples.

In [None]:
%%bash

NUM_BATCHES=10

# Generate synthetic data using `create`
for i in $(seq 1 $NUM_BATCHES); do
  synthetic-data-kit -c config.yaml create -n 50 examples.txt -o data/train/$i
done

# Convert generated data to JSONL format using `save-as`
for i in $(seq 1 $NUM_BATCHES); do
  synthetic-data-kit save-as data/train/$i/examples_qa_pairs.json -f jsonl -o data/train/$i/output.jsonl
done

# Concatenate all output files into one with `cat`
cat $(for i in $(seq 1 $NUM_BATCHES); do echo -n "data/train/$i/outpxut.jsonl "; done) > data/train.jsonl

# Eval doesn't need multiple runs
synthetic-data-kit -c config.yaml create -n 50 examples.txt -o data/eval
synthetic-data-kit save-as data/eval/examples_qa_pairs.json -f jsonl -o data/eval/output.jsonl

In [3]:
!cat data/train.jsonl | wc -l
!cat data/eval/output.jsonl | wc -l

500
50


## Preparing the eval dataset
Our human curated eval dataset contains text annotations in the form of XML files. We want to just produce transcripts of the conversation, and do not need to include any other metadata or audio.

In [None]:
# Download the dataset
!mkdir Datasets && cd Datasets && wget https://www.replaywell.com/atco2/download/ATCO2-ASRdataset-v1_beta.tgz && tar xf ATCO2-ASRdataset-v1_beta.tgz >/dev/null 2>&1

In [5]:
import xml.etree.ElementTree as ET
import os
import glob
import re

def parse_xml_files(directory_path: str):
    """
    Parse all XML files in the specified directory and extract text entries.
    
    Args:
        directory_path: Path to the directory containing XML files
        
    Returns:
        A nested list where each item represents an XML file,
            containing a list of text entries from that file
    """
    xml_files = glob.glob(os.path.join(directory_path, "*.xml"))
    results = []
    
    for xml_file in xml_files:
        try:
            tree = ET.parse(xml_file)
            root = tree.getroot()
            
            file_texts = []
            
            for segment in root.findall('segment'):
                text_element = segment.find('text')
                if text_element is not None and text_element.text:
                    # Remove any part of speech details or metadata included in square brackets
                    raw_text = text_element.text
                    cleaned_text = re.sub(r"\[.*?\]", "", raw_text)
                    # Fix some weirdness with non breaking spaces
                    cleaned_text = cleaned_text.replace('\xa0', '').replace('\n', '')
                    file_texts.append(cleaned_text.strip())
            
            if file_texts and len(file_texts) >= 2:
                results.append(file_texts)
                
        except ET.ParseError as e:
            print(f"Error parsing {xml_file}: {e}")
        except Exception as e:
            print(f"Error processing {xml_file}: {e}")
    
    return results

In [6]:
parsed = parse_xml_files("Datasets/ATCO2-ASRdataset-v1_beta/DATA")
print(f"Parsed {len(parsed)}")

Parsed 244


In [7]:
# Llama 3 prompt template
def format_llama(instruction: str, first_message: str, reply: str):
    instruction = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{instruction}
<|eot_id|><|start_header_id|>user<|end_header_id|>
{first_message}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{reply}"""
    return instruction.format(first_message, reply)

# Format for our saved json format
def format_json(first_message: str, reply: str):
    return {
        "instruction": "You are a helpful controller who responds to air traffic control messages.",
        "input": first_message,
        "output": reply,
    }

# Converts the saved json format to llama format for ingestion
def json_to_llama(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = format_llama(instruction, input, output) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts, }

In [8]:
import json

# Grab 100 of the examples for evaluation
messages_eval = []
for message in parsed[0:100]:
    messages_eval.append(format_json(message[0], message[1]))

# Save the dataset in our custom json format
os.makedirs("Datasets", exist_ok=True)
with open("Datasets/dataset_eval.json", 'w') as f:
    json.dump(messages_eval, f)

In [9]:
from datasets import Dataset

def json_dataset(path: str):
    """Create a dataset from a JSON file, used for the ATC dataset."""
    with open(path, 'r') as f:
        data = json.load(f)

    return Dataset.from_list(data)
    
def jsonl_dataset(path: str):
    """Create a dataset from a JSONL file, used for synthetic data."""
    lines = []
    with open(path, 'r') as f:
        for line in f:
            data = json.loads(line)
            lines.append(format_json(data["atc"], data["response"]))

    return Dataset.from_list(lines)

## Evaluating the baseline model
To evaluate the baseline results of the model we will use the HuggingFace transformers package and Unsloth for inference. We use two metrics here, **perplexity** and **BLEU**. Perplexity captures the "surprise" of the model, and applies on a per-token basis. BLEU is typically used for machine translation, but here is capturing if the response gets the gist of the correct answer, accounting for differences in word order.

In [10]:
# This is where Model weights will be downloaded/used from
cache_dir = "Models"

In [11]:
from unsloth import FastLanguageModel

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!
INFO 07-11 18:16:50 [__init__.py:244] Automatically detected platform cuda.


In [12]:
import torch
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def compute_bleu(reference: str, candidate: str) -> float:
    """
    Compute BLEU score between reference and candidate strings.

    Args:
        reference: Ground-truth text.
        candidate: Generated text to evaluate.

    Returns:
        bleu_score: BLEU score (0 to 1).
    """
    reference_tokens = reference.strip().split()
    candidate_tokens = candidate.strip().split()

    smoothie = SmoothingFunction().method4
    bleu_score = sentence_bleu(
        [reference_tokens],
        candidate_tokens,
        smoothing_function=smoothie
    )
    return bleu_score

def compute_loss(model, tokenizer, prompt: str, target: str) -> float:
    """
    Compute loss for a target response given a prompt.

    Args:
        model: Pretrained language model.
        tokenizer: Tokenizer for the model.
        prompt: Input text prompt.
        target: Ground-truth text continuation.

    Returns:
        loss: Computed loss value.
    """
    # Tokenize separately to keep the prompt boundary
    prompt_ids  = tokenizer(prompt,  return_tensors="pt").input_ids.to(model.device)
    target_ids  = tokenizer(target,  return_tensors="pt").input_ids.to(model.device)

    # Create the combined input
    input_ids = torch.cat((prompt_ids, target_ids), dim=1)

    # Labels are the complete prompt and target response
    labels = input_ids.clone()

    # Set the tokens up to the end of the prompt to -100 to prevent loss computation there
    # This is because we don't care how the model predicts the prompt, just how well it
    # completes the text from the end of the prompt onwards
    prompt_len = prompt_ids.shape[1]
    labels[:, :prompt_len] = -100

    # Use the model to compute the loss
    with torch.no_grad():
        outputs = model(input_ids=input_ids, labels=labels)
        loss = outputs.loss

    # Perplexity is the exponentiated negative log-likelihood
    return loss.item()

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

def generate(model, tokenizer, text: str, max_new_tokens: int = 100) -> str:
    """
    Generate text from model given an input prompt.
    
    Args:
        model: Pretrained language model.
        tokenizer: Corresponding tokenizer.
        text: Prompt text.
        max_new_tokens: Number of tokens to generate.
    
    Returns:
        str: Generated output text.
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"]
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        use_cache=True
    )
    
    # Decode only the newly generated tokens (the part after the prompt)
    return tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)

In [None]:
from tqdm.notebook import tqdm
import numpy as np

def evaluate(model, tokenizer, debug=False):
    """
    This function loads the eval dataset and then loops over it to compute the
    metrics. Enable `debug` to show the text generated and the ground truth.
    """
    # Load the dataset
    dataset = json_dataset("Datasets/dataset_eval.json")
    
    # Compute Perplexity and BLEU scores
    losses, bleus = [], []
    
    for convo in tqdm(dataset, desc="Evaluating"):
        prompt = format_llama(convo["instruction"], convo["input"], "")
        output = generate(model, tokenizer, prompt)
        ground_truth = convo["output"]

        if debug:
            print("Input:\n", prompt)
            print("Output\n", output)
            print("GT\n", ground_truth)
    
        loss = compute_loss(model, tokenizer, output, ground_truth)
        bleu = compute_bleu(output, ground_truth)
    
        losses.append(loss)
        bleus.append(bleu)
    
    # Report metrics
    mean_loss = np.mean(loss)
    mean_bleu = np.mean(bleus)
    mean_ppl = np.exp(mean_loss)
    
    print(f"\n=== Evaluation Results ===")
    print(f"Average Perplexity: {mean_ppl:.2f}")
    print(f"Average BLEU Score: {mean_bleu:.2f}")

    return mean_ppl, mean_bleu

In [15]:
# Load base model and compute the base metrics
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    cache_dir=cache_dir,
)

==((====))==  Unsloth 2025.7.3: Fast Llama patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.209 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

In [16]:
base_ppl, base_bleu = evaluate(model, tokenizer)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]


=== Evaluation Results ===
Average Perplexity: 597.31
Average BLEU Score: 0.04


## Fine-tuning the model

In [17]:
print("üöÄ Starting fine-tuning process...")
cache_dir = "Models/"

# Load base model
tuned_model, tuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    cache_dir=cache_dir,
)

# Format the dataset
dataset = jsonl_dataset("data/train.jsonl")
dataset = dataset.map(json_to_llama, batched=True)

# Add LoRA adapters for efficient fine-tuning
tuned_model = FastLanguageModel.get_peft_model(
    tuned_model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Set up training
trainer = SFTTrainer(
    model=tuned_model,
    tokenizer=tuned_tokenizer,
    dataset_text_field="text",
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=1,
        warmup_steps=5,
        max_steps=250,
        learning_rate=2e-5,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="Results",
    ),
)

print("üèãÔ∏è Training started...")
trainer.train()

# Save the fine-tuned model
tuned_model.save_pretrained("Results")
tuned_tokenizer.save_pretrained("Results")

print("‚úÖ Training complete! Model saved to Results")


üöÄ Starting fine-tuning process...
==((====))==  Unsloth 2025.7.3: Fast Llama patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.209 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.7.3 patched 28 layers with 28 QKV layers, 28 O layers and 0 MLP layers.


Unsloth: Tokenizing ["text"]:   0%|          | 0/500 [00:00<?, ? examples/s]

üèãÔ∏è Training started...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 4 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 9,175,040 of 3,221,924,864 (0.28% trained)


Step,Training Loss
1,4.762
2,4.6861
3,4.8801
4,4.7027
5,4.9649
6,4.5416
7,4.3378
8,4.4336
9,4.5546
10,4.6218


Unsloth: Will smartly offload gradients to save VRAM!
‚úÖ Training complete! Model saved to Results


## Evaluating the fine-tuned model
Once we have a fine-tuned model, we can re-run our evaluation with the new model! We'll look at the metrics for both, as well as a "vibe check" where we manually inspect a few outputs to confirm the model is working how we expect. During evaluation, both metrics as well as manual spot checking are important -- metrics capture broad patterns and spot checking makes up for deficiencies in metrics.

In [18]:
tuned_ppl, tuned_bleu = evaluate(tuned_model, tuned_tokenizer)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]


=== Evaluation Results ===
Average Perplexity: 229.11
Average BLEU Score: 0.20


In [19]:
print(f"Original Perplexity: {base_ppl:.3f}, Tuned Perplexity: {tuned_ppl:.3f}")
print(f"Original BLEU: {base_bleu:.3f}, Tuned BLEU: {tuned_bleu:.3f}")

Original Perplexity: 597.310, Tuned Perplexity: 229.106
Original BLEU: 0.042, Tuned BLEU: 0.203


In [20]:
# Vibe check the model with some examples from both original and fine-tuned model
eval_dataset = json_dataset("Datasets/dataset_eval.json")

max_examples = 5
for idx, convo in enumerate(eval_dataset):
    prompt = format_llama(convo["instruction"], convo["input"], "")
    output_og = generate(model, tokenizer, prompt)
    output_tuned = generate(tuned_model, tokenizer, prompt)

    print(f"ATC Request:\t {convo['input']}")
    print(f"GT:\t\t {convo['output']}")
    print(f"Original:\t {output_og}")
    print(f"Tuned:\t\t {output_tuned}".replace('\n', ''))
    print("")
    
    if idx + 1 >= max_examples:
        break

ATC Request:	 CSA One Delta Zulu descend flight level one hundred no speed restrictions
GT:		 descending flight level one hundred  free speed CSA One Delta Zulu
Original:	 Roger that, One Delta Zulu. Descend and maintain level one hundred.
Tuned:		 Descend flight level one hundred no speed

ATC Request:	 Oscar Kilo Triple Hotel please confirm one more holding
GT:		 Oscar Kilo Hotel Hotel Hotel affirm one holding and then it should be possible to follow ILS runway zero six
Original:	 Roger that, Oscar Kilo Triple Hotel, holding for clearance. What's your planned departure?
Tuned:		 One more holding, Oscar Kilo Triple Hotel

ATC Request:	 Ruzyne Tower hello again Eurowings One Tango Kilo
GT:		 Eurowings One Tango Kilo Ruzyne Tower good afternoon go ahead
Original:	 This is Ruzyne Tower, Eurowings One Tango Kilo, cleared to the runway. Be advised, there is a departing Boeing 737-800 on the adjacent runway, expect a possible taxi to the north. Climb to 30000 feet, contact Ground Control on

## Conclusion
By the end of this guide, you should have:

* ‚úÖ A running vLLM server with a quantized Llama model
* ‚úÖ Infrastructure to create synthetic examples for training
* ‚úÖ A 200+ example synthetic dataset created using Llama 4 Scout
* ‚úÖ A distilled Llama 3.1 8B model
* ‚úÖ Test results showing improved metrics and qualitative results

What's next?

* Use an even more powerful model to generate synthetic examples, for example Llama 4 Maverick
* Develop more comprehensive evaluation strategies, including domain-specific metrics
* Extend the dataset to include more data and thus better transfer knowledge
* Examine your dataset using automated tools to understand what's inside and determine gaps