{ "cells": [ { "cell_type": "markdown", "id": "07c45c6d-d3a2-44c7-8e14-7e57a05e80b6", "metadata": {}, "source": [ "# Chatbot with Conversation History" ] }, { "cell_type": "markdown", "id": "aef74060", "metadata": {}, "source": [ "*Copyright (c) Meta Platforms, Inc. and affiliates.\n", "This software may be used and distributed according to the terms of the Llama Community License Agreement.*" ] }, { "cell_type": "markdown", "id": "ee08f1d9", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "ede00bda-8bfe-450a-9b06-7b0caa4752f8", "metadata": {}, "source": [ "This tutorial shows you how to build a chatbot with conversation history. Using Llama 4, we will create a conversational agent that takes a URL, understands its content, and allows you have an interactive conversation with it, while maintaining conversation history.\n", "\n", "| Component | Choice | Why |\n", "| :----------------- | :----------------------------------------- | :-------------------- |\n", "| **Model** | `Llama-4-Maverick-17B-128E-Instruct-FP8` | A powerful Mixture-of-Experts (MoE) model ideal for complex instruction-following. Llama 4 Maverick offers superior performance and a massive context window (up to 1M tokens). |\n", "| **Pattern** | In-context learning + sliding window memory | We will pass the entire webpage content directly into the model's context. Llama 4's large context window makes this simple approach viable for even very large pages, often removing the need for a complex RAG system. | \n", "| **Infrastructure** | Meta's official [Llama API](https://llama.developer.meta.com/) | Provides serverless, production-ready access to Llama 4 models using the `llama_api_client` SDK. |\n", "---\n", "\n", "**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). The core logic of this tutorial can be adapted to any of these providers.\n", "\n", "## What you will learn\n", "\n", "- **The fundamentals of chat completion:** How to structure conversations using system, user, and assistant roles.\n", "- **How to manage conversation history:** Implement a sliding window to maintain context in long conversations without exceeding token limits.\n", "- **Practical prompt engineering:** How to guide the model to answer questions based *only* on provided text.\n", "- **How to perform meta-tasks:** Leverage the model to summarize the conversation history." ] }, { "cell_type": "markdown", "id": "f23c1096-c3d8-45b4-99cc-ecc741ae7107", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "You will need a few libraries for this project: `requests` to download webpages, `readability-lxml` to extract the core content, `markdownify` to convert HTML to clean Markdown, `tiktoken` for accurate token counting, and the official `llama-api-client`." 
] }, { "cell_type": "code", "execution_count": 22, "id": "33159f01-510a-4196-b438-a015e4e4e4b5", "metadata": {}, "outputs": [], "source": [ "!uv pip install --quiet requests beautifulsoup4 readability-lxml markdownify tiktoken llama-api-client" ] }, { "cell_type": "markdown", "id": "40362350-96d3-429c-9b1d-e3da54889f4c", "metadata": {}, "source": [ "## Imports & Llama API client setup\n", "\n", "In this tutorial, we will use [Llama API](https://llama.developer.meta.com/) as the inference provider. So, you would first need to get an API key from Llama API if you don't have one already. Then set the Llama API key as an environment variable, such as `LLAMA_API_KEY`, as shown in the example.\n", "\n", "Remember, you can adapt this section to use your preferred inference provider." ] }, { "cell_type": "code", "execution_count": 26, "id": "a1d0e9b5-6d93-4cb1-bca5-f4eba103197d", "metadata": {}, "outputs": [], "source": [ "import os, sys, re, html, textwrap\n", "import requests\n", "from typing import List, Dict\n", "from bs4 import BeautifulSoup\n", "import tiktoken\n", "from readability import Document\n", "from markdownify import markdownify\n", "from llama_api_client import LlamaAPIClient" ] }, { "cell_type": "code", "execution_count": 7, "id": "8853bb9a-6fe4-445f-951d-ddc4e63d9f8e", "metadata": {}, "outputs": [], "source": [ "# --- Llama client ---\n", "API_KEY = os.getenv(\"LLAMA_API_KEY\")\n", "if not API_KEY:\n", " sys.exit(\"❌ Please set the LLAMA_API_KEY environment variable.\")\n", "\n", "client = LlamaAPIClient(api_key=API_KEY)" ] }, { "cell_type": "markdown", "id": "24f0e401-768e-4f5f-965c-acc72837aa0e", "metadata": {}, "source": [ "## Fetch and clean a webpage\n", "\n", "To get high-quality responses from the model, you first need to provide it with high-quality data. Raw HTML contains a lot of \"noise\" (like navigation bars, ads, and scripts) that can distract the model. The following function implements a three-step process to transform a messy webpage into clean, structured Markdown that is ideal for the LLM.\n", "\n", "1. **Extract Core Content:** It uses the `readability` library to pull out the main body of the article, discarding common boilerplate like headers, footers, and sidebars.\n", "2. **Final Cleanup:** It uses `BeautifulSoup` to remove any remaining `