# 🌐 Building an Intelligent Browser Agent with Llama 3.2

This notebook provides a step-by-step guide to creating an AI-powered browser agent capable of navigating and interacting with websites autonomously. By combining the power of Llama 3.2 Vision, Playwright, and Together AI, this agent can perform tasks seamlessly while understanding both visual and textual content.

##### Demo
For a detailed explanation of the code and a demo video, visit our blog post: [**Blog Post and Demo Video**](https://miguelg719.github.io/browser-use-blog/)

##### Features
- Visual understanding of web pages through screenshots
- Autonomous navigation and interaction
- Natural language instructions for web tasks
- Persistent browser session management

For example, you can ask the agent to:
- Search for a product on Amazon
- Find the cheapest flight to Tokyo
- Buy tickets for the next Warriors game


##### What's in this Notebook?

This recipe walks you through:
- Setting up the environment and installing dependencies.
- Automating browser interactions using Playwright.
- Defining a structured prompt for the LLM to understand the task and execute the next action.
- Leveraging Llama 3.2 Vision for content comprehension.
- Creating a persistent and intelligent browser agent for real-world applications.

**Please note that the agent is not perfect and may not always behave as expected.**



### 1. Install Required Libraries
This cell installs the Python packages the notebook depends on, such as `together` and `playwright`.
It also ensures that Playwright is properly installed to enable automated browser interactions.

In [None]:
%pip install together playwright python-dotenv
!playwright install

### 2. Import Modules and Set Up Environment
Set your `Together` API key to instantiate the client. Feel free to use a different provider if it's more convenient.

In [2]:
import os
from dotenv import load_dotenv
from together import Together

load_dotenv()

client = Together(api_key=os.getenv("TOGETHER_API_KEY"))

##### Vision Query Example
The `encode_image` function below converts an image file into a Base64-encoded string, the format needed to send a local image to the LLM.

The cell after it passes that string in a chat completion request to the Llama 3.2 Vision model.


In [3]:
import base64
from IPython.display import Markdown
imagePath = "sample_screenshot.png"

def encode_image(image_path):
    """Read an image file and return its contents as a Base64-encoded string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Must have an image on the local path to use it
base64_image = encode_image(imagePath)

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what is this image about?"},
                {
                    "type": "image_url",
                    # Uses a local PNG image, embedded as a Base64 data URL. To use a remote image, replace the url with the image URL.
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    }
                },
            ],
        }
    ]
)

display(Markdown(response.choices[0].message.content))
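
The MIME type in a data URL should match the actual image format. The following is a minimal sketch of a helper that derives it from the file extension. The name `to_data_url` is hypothetical and not part of the original notebook, and the PNG fallback is an assumption based on Playwright screenshots being PNG by default.

```python
import base64
import mimetypes


def to_data_url(image_path):
    """Return a Base64 data URL for a local image (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to PNG,
    # which is an assumption that suits Playwright's default screenshots.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/png"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` message part.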

#### Helper Functions to Parse the Accessibility Tree

The agent uses the accessibility tree to understand the elements on the page and interact with them. The helper function below simplifies the accessibility tree into an indented text outline for the agent.

In [5]:
def parse_accessibility_tree(node, indent=0):
    """
    Recursively parses the accessibility tree and returns it as a readable, indented string.
    Args:
        node (dict): A node in the accessibility tree.
        indent (int): Indentation level for the nested structure.
    """
    # Initialize res as an empty string at the start of each parse
    res = ""
    
    def _parse_node(node, indent, res):
        # Base case: If the node is empty or doesn't have a 'role', skip it
        if not node or 'role' not in node:
            return res

        # Indentation for nested levels
        indented_space = " " * indent
        
        # Add node's name and role to result string
        if 'value' in node:
            res = res + f"{indented_space}Role: {node['role']} - Name: {node.get('name', 'No name')} - Value: {node['value']}\n"
        else:
            res = res + f"{indented_space}Role: {node['role']} - Name: {node.get('name', 'No name')}\n"
        
        # If the node has children, recursively parse them
        if 'children' in node:
            for child in node['children']:
                res = _parse_node(child, indent + 2, res)  # Increase indentation for child nodes
                
        return res

    return _parse_node(node, indent, res)
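
To make the output format concrete, here is a small sketch that renders a hand-written tree. `render_tree` is a compact stand-in written for illustration so the block runs on its own, and `sample_tree` is hypothetical data shaped like the dict the helper expects, not a real page snapshot.

```python
def render_tree(node, indent=0):
    """Compact stand-in for the helper above, for illustration only."""
    # Skip empty nodes and nodes without a role
    if not node or "role" not in node:
        return ""
    line = f"{' ' * indent}Role: {node['role']} - Name: {node.get('name', 'No name')}"
    if "value" in node:
        line += f" - Value: {node['value']}"
    # Children are rendered two spaces deeper than their parent
    children = "".join(render_tree(child, indent + 2) for child in node.get("children", []))
    return line + "\n" + children


# Hypothetical snapshot data
sample_tree = {
    "role": "WebArea",
    "name": "Flights",
    "children": [
        {"role": "textbox", "name": "Flight origin input", "value": "San Francisco (SFO)"},
        {"role": "button", "name": "Search"},
    ],
}

print(render_tree(sample_tree))
# Role: WebArea - Name: Flights
#   Role: textbox - Name: Flight origin input - Value: San Francisco (SFO)
#   Role: button - Name: Search
```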

### 3. Define Prompts
a) **Planning Prompt:**
Instructs the LLM to break the user's request into a short, ordered list of subtasks.

b) **Agent Execution Prompt:**
Instructs the LLM to choose the next action, returned as a single JSON object, based on the task, the page's accessibility tree, and the screenshot.

In [6]:
planning_prompt = """
Given a user request, define a very simple plan of subtasks (actions) to achieve the desired outcome and execute them iteratively using Playwright.

1. Understand the Task:
   - Interpret the user's request and identify the core goal.
   - Break down the task into a few smaller, actionable subtasks to achieve the goal effectively.

2. Planning Actions:
   - Translate the user's request into a high-level plan of actions.
   - Example actions include:
     - Searching for specific information.
     - Navigating to specified URLs.
     - Interacting with website elements (clicking, filling).
     - Extracting or validating data.

Input:
- User Request (Task)

Output from the Agent:
- Step-by-Step Action Plan: Return only an ordered list of actions. Only return the list, no other text.

**Example User Requests and Agent Behavior:**

1. **Input:** "Search for a product on Amazon."
   - **Output:**
     1. Navigate to Amazon's homepage.
     2. Enter the product name in the search bar and perform the search.
     3. Extract and display the top results, including the product title, price, and ratings.

2. **Input:** "Find the cheapest flight to Tokyo."
   - **Output:**
     1. Visit a flight aggregator website (e.g., Kayak).
     2. Enter the departure city.
     3. Enter the destination city.
     4. Enter the start and end dates.
     5. Extract and compare the flight options, highlighting the cheapest option.

3. **Input:** "Buy tickets for the next Warriors game."
   - **Output:**
     1. Navigate to a ticket-selling platform (e.g., Ticketmaster).
     2. Fill the search bar with the team name.
     3. Search for upcoming team games.
     4. Select the next available game and purchase tickets for the specified quantity.

"""


execution_prompt = """
You will be given a task, a website's page accessibility tree, and the page screenshot as context. The screenshot is where you are now, use it to understand the accessibility tree. Based on that information, you need to decide the next step action. ONLY RETURN THE NEXT STEP ACTION IN A SINGLE JSON.

When selecting elements, use elements from the accessibility tree.

Reflect on what you are seeing in the accessibility tree and the screenshot and decide the next step action, elaborate on it in reasoning, and choose the next appropriate action.

Selectors must follow the format:
- For a button with a specific name: "button=ButtonName"
- For a placeholder (e.g., input field): "placeholder=PlaceholderText"
- For text: "text=VisibleText"

Make sure to analyze the accessibility tree and the screenshot to understand the current state, if something is not clear, you can use the previous actions to understand the current state. Explain why you are in the current state in current_state.

You will be given a task and you MUST return the next step action in JSON format:
{
    "current_state": "Where are you now? Analyze the accessibility tree and the screenshot to understand the current state.",
    "reasoning": "What is the next step to accomplish the task?",
    "action": "navigation" or "click" or "fill" or "finished",
    "url": "https://www.example.com", // Only for navigation actions
    "selector": "button=Click me", // For click or fill actions, derived from the accessibility tree
    "value": "Input text", // Only for fill actions
}

### Guidelines:
1. Use **"navigation"** for navigating to a new website through a URL.
2. Use **"click"** for interacting with clickable elements. Examples:
   - Buttons: "button=Click me"
   - Text: "text=VisibleText"
   - Placeholders: "placeholder=Search..."
   - Link: "link=BUY NOW"
3. Use **"fill"** for inputting text into editable fields. Examples:
   - Placeholder: "placeholder=Search..."
   - Textbox: "textbox=Flight destination output"
   - Input: "input=Search..."
4. Use **"finished"** when the task is done. For example:
   - If a task is successfully completed.
   - If navigation confirms you are on the correct page.


### Accessibility Tree Examples:

You will be given an accessibility tree to interact with the webpage. It consists of a nested node structure that represents elements on the page. For example:

Role: generic - Name: 
   Role: text - Name: San Francisco (SFO)
   Role: button - Name: 
   Role: listitem - Name: 
   Role: textbox - Name: Flight origin input
Role: button - Name: Swap departure airport and destination airport
Role: generic - Name: 
   Role: textbox - Name: Flight destination input
Role: button - Name: Start date
Role: button - Name: 
Role: button - Name: 
Role: button - Name: End date
Role: button - Name: 
Role: button - Name: 
Role: button - Name: Search

This section indicates that there is a textbox named "Flight origin input" filled with San Francisco (SFO), and a button named "Swap departure airport and destination airport". There is another textbox named "Flight destination input" that is not filled with any text. There are also buttons named "Start date" and "End date", which are not filled with any dates, and a button named "Search".

Retry actions at most 2 times before trying a different action.

### Examples:
1. To click on a button labeled "Search":
   {
       "current_state": "On the homepage of a search engine.",
       "reasoning": "The accessibility tree shows a button named 'Search'. Clicking it is the appropriate next step to proceed with the task.",
       "action": "click",
       "selector": "button=Search"
   }

2. To fill a search bar with the text "AI tools":
   {
       "current_state": "On the search page with a focused search bar.",
       "reasoning": "The accessibility tree shows an input field with placeholder 'Search...'. Entering the query 'AI tools' fulfills the next step of the task.",
       "action": "fill",
       "selector": "placeholder=Search...",
       "value": "AI tools"
   }

3. To navigate to a specific URL:
   {
       "current_state": "Starting from a blank page.",
       "reasoning": "The task requires visiting a specific website to gather relevant information. Navigating to the URL is the first step.",
       "action": "navigation",
       "url": "https://example.com"
   }

4. To finish the task:
   {
       "current_state": "Completed the search and extracted the necessary data.",
       "reasoning": "The task goal has been achieved, and no further actions are required.",
       "action": "finished"
   }
"""

#### Few Shot Examples

Adding a few few-shot examples to the prompt improves the agent's performance substantially.

In [7]:
few_shot_example_1 = """
User Input: "What are the best tacos in San Francisco?"

Agent Step Sequence:
Step 1: 
{
    "current_state": "On a blank page.",
    "reasoning": "The task is to find the best tacos in San Francisco, so the first step is to navigate to Google to perform a search.",
    "action": "navigation",
    "url": "https://www.google.com"
}

Step 2: 
{
    "current_state": "On the Google homepage.",
    "reasoning": "To search for the best tacos in San Francisco, I need to fill the Google search bar with the query.",
    "action": "fill",
    "selector": "combobox=Search",
    "value": "Best tacos in San Francisco"
}

Step 3:
{
    "current_state": "On Google search results page.",
    "reasoning": "After entering the query, I need to click the search button to retrieve the results.",
    "action": "click",
    "selector": "button=Google Search"
}

Step 4: 
{
    "current_state": "On the search results page with multiple links.",
    "reasoning": "From the search results, I need to click on a reliable food-review or blog website link.",
    "action": "click",
    "selector": "text=Yelp"
}

Step 5:
{
    "current_state": "On Yelp's best taqueria near San Francisco page.",
    "reasoning": "The task is complete as I have found the top taquerias in San Francisco.",
    "action": "finished",
    "summary": "I have successfully found the best tacos in San Francisco."
}
"""

few_shot_example_2 = """
User Input: Can you send an email to reschedule a meeting for Dmitry at gmail.com for tomorrow morning? I'm sick today.

Agent Step Sequence:
Step 1:
{
    "current_state": "On a blank page.",
    "reasoning": "To send an email, the first step is to navigate to Gmail.",
    "action": "navigation",
    "url": "https://mail.google.com"
}

Step 2:
{
    "current_state": "On Gmail's homepage.",
    "reasoning": "Click the 'Compose' button to start drafting a new email.",
    "action": "click",
    "selector": "button=Compose"
}

Step 3:
{
    "current_state": "In the new email draft window.",
    "reasoning": "Enter Dmitry's email address in the recipient field.",
    "action": "fill",
    "selector": "placeholder=Recipients",
    "value": "dmitry@gmail.com"
}

Step 4: 
{
    "current_state": "In the new email draft with the recipient filled.",
    "reasoning": "Set the subject line to indicate the purpose of the email.",
    "action": "fill",
    "selector": "placeholder=Subject",
    "value": "Rescheduling Meeting"
}

Step 5:
{
    "current_state": "In the new email draft with the subject set.",
    "reasoning": "Compose the email body to politely inform Dmitry about rescheduling the meeting.",
    "action": "fill",
    "selector": "placeholder=Email body",
    "value": "Hi Dmitry,\\n\\nI'm feeling unwell today and would like to reschedule our meeting for tomorrow morning. Please let me know if this works for you.\\n\\nBest regards,\\n[Your Name]"
}

Step 6: 
{
    "current_state": "In the new email draft with the body composed.",
    "reasoning": "Click the 'Send' button to deliver the email to Dmitry.",
    "action": "click",
    "selector": "button=Send"
}

Step 7:
{
    "current_state": "On Gmail's homepage after sending the email.",
    "reasoning": "The email has been drafted and sent, fulfilling the task of informing Dmitry about the reschedule.",
    "action": "finished",
    "summary": "Email sent to Dmitry to reschedule the meeting for tomorrow morning."
}
"""

few_shot_examples = [few_shot_example_1, few_shot_example_2]
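
The execution prompt and the few-shot examples fix the JSON shape of each action, so the model's reply can be checked before it is acted on. The following is a minimal sketch of that check. The names `REQUIRED_FIELDS`, `extract_action`, and `is_valid_action` are hypothetical and not part of the original notebook; the required fields are read off the JSON format described in the execution prompt.

```python
import json

# Fields each action needs, per the JSON format in the execution prompt
REQUIRED_FIELDS = {
    "navigation": ["url"],
    "click": ["selector"],
    "fill": ["selector", "value"],
    "finished": [],
}


def is_valid_action(action):
    """Check that the action is known and carries its required fields."""
    required = REQUIRED_FIELDS.get(action.get("action"))
    if required is None:
        return False
    return all(field in action for field in required)


def extract_action(text):
    """Return the first valid action dict found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict) and is_valid_action(candidate):
            return candidate
        start = text.find("{", start + 1)
    return None
```

Unlike a greedy regular expression, this approach tolerates extra prose or stray braces around the JSON object, and it rejects replies that name an action without the fields that action needs.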

### 4. Define a task and generate a plan of actions to execute

You can define your own task or use one of the examples below.

In [8]:
# Define your task here:
# task = 'Find toys to buy for my 10 year old niece this Christmas'
# task = 'Find tickets for the next Warriors game'
task = 'Find the cheapest flight to Madrid'

### Generate a plan of actions to execute

The next cell queries the LLM with the planning prompt to generate a plan of actions. The plan is then split into individual subtasks for the execution agent to complete.

In [None]:
print("Generating plan...")
planning_response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    temperature=0.0,
    messages=[
        {"role": "system", "content": planning_prompt},
        {"role": "user", "content": task},
    ],
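)
plan = planning_response.choices[0].message.content
print(plan)

# Strip the leading list numbering (e.g. "1. ") and skip blank lines
import re
steps = [re.sub(r'^\d+[.)]\s*', '', line.strip()) for line in plan.splitlines() if line.strip()]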


### 5. Create the Browser environment and Run the Agent
This cell imports Playwright's async API and defines the agent loop. On each iteration, the loop captures the page's accessibility tree and a screenshot, then sends both to the LLM together with the plan and the list of previous actions. This context helps the LLM understand its current state and generate the next action required to complete the task.

- At any step, you can press **Enter** to continue or **'q'** to quit the agent loop.

In [None]:
from playwright.async_api import async_playwright
import asyncio 
import json
import re

previous_context = None

async def run_browser():
    async with async_playwright() as playwright:
        # Launch the locally installed Chrome (channel="chrome") in headed mode
        browser = await playwright.chromium.launch(headless=False, channel="chrome")
        page = await browser.new_page()
        await asyncio.sleep(1)
        await page.goto("https://google.com/")
        previous_actions = []
        try:
            while True:  # Infinite loop to keep session alive, press enter to continue or 'q' to quit
                # Get Context from page
                accessibility_tree = await page.accessibility.snapshot()
                accessibility_tree = parse_accessibility_tree(accessibility_tree)
                # Capture the current page and encode that same file for the LLM
                await page.screenshot(path="screenshot.png")
                base64_image = encode_image("screenshot.png")
                previous_context = accessibility_tree
                response = client.chat.completions.create(
                    model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
                    temperature=0.0,
                    messages=[
                        {"role": "system", "content": execution_prompt},
                        {"role": "system", "content": "Few shot examples:\n" + "\n".join(few_shot_examples) + "\nThese are just a few examples. The user will assign you a very wide range of tasks."},
                        {"role": "system", "content": f"Plan to execute: {steps}\n\n Accessibility Tree: {previous_context}\n\n, previous actions: {previous_actions}"},
                        {"role": "user", "content": 
                         [
                            {
                                "type": "text",
                                "text": f'What should be the next action to accomplish the task: {task} based on the current state? Remember to review the plan and select the next action based on the current state. Provide the next action in JSON format strictly as specified above.',
                            },
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/png;base64,{base64_image}",
                                }
                            },
                         ]
                        }
                    ],
                )
                res = response.choices[0].message.content
                print('Agent response:', res)
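                # Reset on every turn so a failed parse never reuses a stale action
                output = None
                try:
                    match = re.search(r'\{.*\}', res, re.DOTALL)
                    if match:
                        output = json.loads(match.group(0))
                except Exception as e:
                    print('Error parsing JSON:', e)
                if output is None:
                    # Record the failure so the model can correct itself on the next turn
                    previous_actions.append("Error: the response did not contain valid JSON")
                    output = {"action": None}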

                if output["action"] == "navigation":
                    try:
                        await page.goto(output["url"])
                        previous_actions.append(f"navigated to {output['url']}, SUCCESS")
                    except Exception as e:
                        previous_actions.append(f"Error navigating to {output['url']}: {e}")

                elif output["action"] == "click":
                    try:
                        # Split on the first "=" only, so names containing "=" stay intact
                        selector_type, _, selector_name = output["selector"].partition("=")
                        res = await page.get_by_role(selector_type, name=selector_name).first.click()
                        previous_actions.append(f"clicked {output['selector']}, SUCCESS")
                    except Exception as e:
                        previous_actions.append(f"Error clicking on {output['selector']}: {e}")
                        
                elif output["action"] == "fill":
                    try:
                        # Split on the first "=" only, so names containing "=" stay intact
                        selector_type, _, selector_name = output["selector"].partition("=")
                        res = await page.get_by_role(selector_type, name=selector_name).fill(output["value"])
                        await asyncio.sleep(1)
                        await page.keyboard.press("Enter")
                        previous_actions.append(f"filled {output['selector']} with {output['value']}, SUCCESS")
                    except Exception as e:
                        previous_actions.append(f"Error filling {output['selector']} with {output['value']}: {e}")

                elif output["action"] == "finished":
                    # "summary" is optional in the prompt's JSON format, so avoid a KeyError
                    print(output.get("summary", "Task finished."))
                    break

                await asyncio.sleep(1) 
                
                # Or wait for user input
                user_input = input("Press 'q' to quit or Enter to continue: ")
                if user_input.lower() == 'q':
                    break
                
        except Exception as e:
            print(f"An error occurred: {e}")
        finally:
            # Always close the browser once the loop ends, whether normally or on error
            await browser.close()

# Run the async function
await run_browser()
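
The execution prompt allows `text=` and `placeholder=` selectors, which are not ARIA roles and so need a different Playwright lookup than `get_by_role`. The following is a minimal sketch of a dispatcher for the `type=name` selector format. `resolve_locator` and `ROLE_ALIASES` are hypothetical names, and mapping `input` to the `textbox` role is an assumption about what the prompt's `input=` selector is meant to target.

```python
# Assumption: "input=" in the prompt is meant to target a textbox role
ROLE_ALIASES = {"input": "textbox"}


def resolve_locator(page, selector):
    """Map a 'type=name' selector to a Playwright locator (hypothetical helper)."""
    # partition splits on the first "=" only, so names containing "=" survive
    selector_type, _, name = selector.partition("=")
    if selector_type == "text":
        return page.get_by_text(name)
    if selector_type == "placeholder":
        return page.get_by_placeholder(name)
    # Everything else (button, link, textbox, combobox, ...) is treated as an ARIA role
    role = ROLE_ALIASES.get(selector_type, selector_type)
    return page.get_by_role(role, name=name)
```

Inside the agent loop, a click could then be written as `await resolve_locator(page, output["selector"]).first.click()`, and a fill as `await resolve_locator(page, output["selector"]).fill(output["value"])`.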

## And that's it! Congratulations! 🎉🎉

You've just created a browser agent that can navigate websites, understand page content through vision, plan and execute actions based on natural language commands, and maintain context across multiple interactions.


**Collaborators**

Feel free to reach out with any questions or feedback!


**Miguel Gonzalez** on [X](https://x.com/miguel_gonzf) or [LinkedIn](https://www.linkedin.com/in/gonzalezfernandezmiguel/)

**Dimitry Khorzov** on [X](https://x.com/korzhov_dm) or [LinkedIn](https://www.linkedin.com/in/korzhovdm)
