{ "cells": [ { "cell_type": "markdown", "source": "# Step 3: Web Content Collection\n\nThis notebook is responsible for searching the web and downloading relevant information based on the queries generated in Step 2. It serves as the \"research gathering\" phase of our workflow.\n\n## What This Notebook Does:\n\n1. **Web Searching**: Uses SerpAPI to perform Google searches for each query\n2. **Content Download**: Retrieves HTML content from search results\n3. **Data Organization**: Creates a structured directory for all downloaded content\n4. **Metadata Tracking**: Records information about each search and download\n\nThe goal is to collect a diverse set of web content that will be processed and analyzed in Step 4. This notebook acts as the bridge between our AI-generated research questions and the actual data collection process.", "metadata": {} }, { "cell_type": "markdown", "source": "## Required Dependencies\n\nFirst, we need to install and import the necessary libraries:\n- **serpapi**: For performing Google searches via an API\n- **requests**: For downloading HTML content\n- **pandas**: For data organization and manipulation\n- **json**: For storing and retrieving structured data\n- **hashlib**: For creating unique identifiers for files\n- **pathlib**: For filesystem operations\n\nNote: You may need to run the pip install command below if SerpAPI is not already installed.", "metadata": {} }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#!pip install google-search-results" ] }, { "cell_type": "markdown", "source": "## Set Up Directory Structure\n\nWe'll create a structured file system to organize all our downloaded data:\n- **base_dir**: Main directory for all research data\n- **src_dir**: Directory for source files\n- **results_dir**: Directory for downloaded search results\n\nThis organization makes it easier to manage the large number of files we'll be working with.", "metadata": {} }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import time\n", "from serpapi import GoogleSearch\n", "import requests\n", "import hashlib\n", "from pathlib import Path\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "base_dir = Path(\"llama_data\")\n", "src_dir = base_dir / \"src\"\n", "results_dir = base_dir / \"results\"" ] }, { "cell_type": "markdown", "source": "## Load Report Outlines\n\nNow we'll load the detailed report outlines generated in Step 2. These outlines contain:\n1. Research report titles and topics\n2. Web search queries for each report\n3. The purpose of each query\n\nThis data will guide our web search process and ensure we're collecting information that's directly relevant to our research goals.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_dir.mkdir(exist_ok=True)\n", "src_dir.mkdir(exist_ok=True)\n", "results_dir.mkdir(exist_ok=True)\n" ] }, { "cell_type": "markdown", "source": "## Verify Data Loading\n\nLet's check that we've successfully loaded the report outlines and display:\n1. The total number of reports loaded\n2. A sample report title\n3. 
Sample queries for one report\n\nThis helps us confirm we're working with the expected data before proceeding.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('generated_outlines.json', 'r') as file:\n", " content = file.read()\n", " data = json.loads(content)" ] }, { "cell_type": "markdown", "source": "## Extract All Queries\n\nNow we'll extract and organize all the queries from the report outlines:\n1. Loop through each report\n2. Extract metadata (report index, title)\n3. Extract all queries and their purposes\n4. Combine everything into a structured format\n\nThis gives us a flat list of all queries across all reports, making it easier to process them systematically.", "metadata": {} }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 5 report outlines\n" ] } ], "source": [ "print(f\"Loaded {len(data)} report outlines\")\n" ] }, { "cell_type": "markdown", "source": "## Convert to DataFrame for Analysis\n\nWe'll convert our query list to a pandas DataFrame for easier:\n- Visualization\n- Filtering\n- Analysis\n\nThe DataFrame gives us a clean tabular view of all queries we'll be researching.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Sample report title: Llama 3.3: A Revolutionary Leap in AI\n", "Sample queries:\n", "- Llama 3.3 new features and enhancements: To gather information on the new features and enhancements in Llama 3.3\n", "- Llama 3.3 vs Llama 3.1 performance comparison: To gather information on the performance comparison between Llama 3.3 and Llama 3.1\n" ] } ], "source": [ "print(\"\\nSample report title:\", data[0].get('original_goal', {}).get('Report Title', 'No title'))\n", "print(\"Sample queries:\")\n", "for query in data[0].get('Web Queries', [])[:2]:\n", " print(f\"- {query.get('query')}: {query.get('purpose')}\")" ] }, { "cell_type": "markdown", "source": "## Set Up SerpAPI Key for Web Searches\n\n**CRITICAL STEP**: You must add your SerpAPI key here.\n\nSerpAPI is a service that allows us to programmatically access Google search results. \nIt requires an API key to function.\n\nTo get your key:\n1. Go to https://serpapi.com/ and sign up (they offer free credits)\n2. Find your API key in your account dashboard\n3. Paste your key in the string below\n\nWithout a valid API key, the web searches will fail and the notebook won't be able to collect data.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_queries = []" ] }, { "cell_type": "markdown", "source": "## Define Search Function\n\nThis function handles the web search part of our process:\n1. Takes a query string and number of results to return\n2. Uses SerpAPI to perform a Google search\n3. Returns the organic search results (excluding ads, etc.)\n4. 
Provides error handling if search fails\n\nThe function returns a list of search results containing:\n- Title\n- URL\n- Snippet of content", "metadata": {} }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "for report_index, report_data in enumerate(data):\n", " report_title = report_data.get('original_goal', {}).get('Report Title', f\"Report {report_index}\")\n", " \n", " for query_index, query_data in enumerate(report_data.get('Web Queries', [])):\n", " query = query_data.get('query', '')\n", " purpose = query_data.get('purpose', '')\n", " \n", " all_queries.append({\n", " 'report_index': report_index,\n", " 'report_title': report_title,\n", " 'query_index': query_index,\n", " 'query': query,\n", " 'purpose': purpose\n", " })" ] }, { "cell_type": "markdown", "source": "## Define HTML Download Function\n\nThis function handles downloading the actual HTML content:\n1. Takes a URL to fetch\n2. Uses requests library with appropriate headers (to avoid blocks)\n3. Sets a timeout to avoid hanging on slow sites\n4. Handles errors gracefully with informative messages\n\nThe function returns the HTML content as text if successful, or None if the download fails.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total queries extracted: 15\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
report_indexreport_titlequery_indexquerypurpose
00Llama 3.3: A Revolutionary Leap in AI0Llama 3.3 new features and enhancementsTo gather information on the new features and ...
10Llama 3.3: A Revolutionary Leap in AI1Llama 3.3 vs Llama 3.1 performance comparisonTo gather information on the performance compa...
20Llama 3.3: A Revolutionary Leap in AI2Cost of running Llama 3.3 on cloud vs local in...To gather information on the cost-effectivenes...
31Llama 3.3 vs Llama 3.1: A Comparative Analysis0Llama 3.3 new features and improvementsTo gather information on new features and impr...
41Llama 3.3 vs Llama 3.1: A Comparative Analysis1Llama 3.1 vs Llama 3.3 performance comparisonTo gather information on performance differenc...
\n", "
" ], "text/plain": [ " report_index report_title query_index \\\n", "0 0 Llama 3.3: A Revolutionary Leap in AI 0 \n", "1 0 Llama 3.3: A Revolutionary Leap in AI 1 \n", "2 0 Llama 3.3: A Revolutionary Leap in AI 2 \n", "3 1 Llama 3.3 vs Llama 3.1: A Comparative Analysis 0 \n", "4 1 Llama 3.3 vs Llama 3.1: A Comparative Analysis 1 \n", "\n", " query \\\n", "0 Llama 3.3 new features and enhancements \n", "1 Llama 3.3 vs Llama 3.1 performance comparison \n", "2 Cost of running Llama 3.3 on cloud vs local in... \n", "3 Llama 3.3 new features and improvements \n", "4 Llama 3.1 vs Llama 3.3 performance comparison \n", "\n", " purpose \n", "0 To gather information on the new features and ... \n", "1 To gather information on the performance compa... \n", "2 To gather information on the cost-effectivenes... \n", "3 To gather information on new features and impr... \n", "4 To gather information on performance differenc... " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "queries_df = pd.DataFrame(all_queries)\n", "print(f\"Total queries extracted: {len(queries_df)}\")\n", "queries_df.head()\n" ] }, { "cell_type": "markdown", "source": "## Define HTML Saving Function\n\nThis function handles the file organization and storage aspects:\n1. Creates a hierarchical directory structure for each report and query\n2. Generates unique filenames using URL hashing to avoid duplicates\n3. Sanitizes titles and filenames to ensure they're filesystem-safe\n4. Saves both the HTML content and metadata about each result\n\nThe function returns the filepath where the content was saved for later reference.", "metadata": {} }, { "cell_type": "code", "metadata": {}, "outputs": [], "source": "# IMPORTANT: Replace with your actual SerpAPI key below\n# You can get a free API key from https://serpapi.com/\nSERPAPI_KEY = \"\" # <--- ADD YOUR API KEY HERE\nSERPAPI_KEY" }, { "cell_type": "markdown", "source": "## Main Processing Function\n\nThis function orchestrates the entire data collection process:\n1. Processes each query in sequence\n2. Performs web searches for each query\n3. Downloads HTML content for each search result\n4. Saves everything with proper organization\n5. Maintains progress by saving intermediate results\n\nThe function builds a comprehensive record of all searches and downloads, which will be used in Step 4.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def search_with_serpapi(query, num_results=5):\n", " print(f\"Searching for: {query}\")\n", " \n", " params = {\n", " \"engine\": \"google\",\n", " \"q\": query,\n", " \"api_key\": SERPAPI_KEY,\n", " \"num\": num_results,\n", " }\n", " \n", " search = GoogleSearch(params)\n", " results = search.get_dict()\n", " \n", " # Check if we have organic results\n", " if \"organic_results\" not in results:\n", " print(f\"Warning: No organic results found for query: {query}\")\n", " return []\n", " \n", " return results[\"organic_results\"]" ] }, { "cell_type": "markdown", "source": "## Run the Complete Process\n\nNow we'll execute the entire data collection process:\n1. Call our main processing function with all queries\n2. This will work through all reports and queries in sequence\n3. Download and save web content for each query\n4. 
Create a comprehensive dataset for Step 4\n\nNote: This process might take some time depending on:\n- Number of queries\n- Number of results per query\n- Web page sizes and download speeds\n- SerpAPI rate limits on your account\n\nThe results will be saved in the directory structure we defined, with full metadata about each search and download.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fetch_html(url):\n", " try:\n", " headers = {\n", " \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"\n", " }\n", " response = requests.get(url, headers=headers, timeout=10)\n", " response.raise_for_status()\n", " return response.text\n", " except Exception as e:\n", " print(f\"Error fetching HTML from {url}: {str(e)}\")\n", " return None" ] }, { "cell_type": "markdown", "source": "## Define Analysis Function (Optional)\n\nThis function provides a summary of what was downloaded:\n1. Total number of queries processed\n2. Total number of search results fetched\n3. Breakdown by report and query\n\nYou can run this after the main process to get statistics about the data collection.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def save_html(html_content, report_index, report_title, query_index, query, result_index, title, url):\n", " if html_content is None:\n", " return None\n", " \n", " sanitized_report = report_title.replace(\" \", \"_\").replace(\":\", \"\").replace(\"/\", \"\")[:30]\n", " sanitized_query = query.replace(\" \", \"_\").replace(\":\", \"\").replace(\"/\", \"\")[:30]\n", " \n", " url_hash = hashlib.md5(url.encode()).hexdigest()[:8]\n", "\n", " report_dir = results_dir / f\"report_{report_index}_{sanitized_report}\"\n", " report_dir.mkdir(exist_ok=True)\n", " \n", " query_dir = report_dir / f\"query_{query_index}_{sanitized_query}\"\n", " query_dir.mkdir(exist_ok=True)\n", " \n", " sanitized_title = ''.join(c if c.isalnum() or c in ['_', '-'] else '_' for c in title)[:30]\n", " filename = f\"result_{result_index}_{url_hash}_{sanitized_title}.html\"\n", " filepath = query_dir / filename\n", "\n", " with open(filepath, \"w\", encoding=\"utf-8\") as f:\n", " f.write(html_content)\n", " \n", " metadata = {\n", " \"report_index\": report_index,\n", " \"report_title\": report_title,\n", " \"query_index\": query_index,\n", " \"query\": query,\n", " \"result_index\": result_index,\n", " \"title\": title,\n", " \"url\": url,\n", " \"timestamp\": time.strftime(\"%Y-%m-%d %H:%M:%S\")\n", " }\n", " \n", " metadata_path = query_dir / f\"result_{result_index}_{url_hash}_metadata.json\"\n", " with open(metadata_path, \"w\") as f:\n", " json.dump(metadata, f, indent=2)\n", " \n", " return str(filepath)" ] }, { "cell_type": "markdown", "source": "## Run Analysis (Optional)\n\nIf desired, run the analysis function to get statistics about the data collection process.\nThis helps verify that everything downloaded as expected before moving to Step 4.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_all_queries(queries_df):\n", " results = []\n", " \n", " for index, row in queries_df.iterrows():\n", " print(f\"\\nProcessing query {index + 1}/{len(queries_df)}\")\n", " print(f\"Report: {row['report_title']}\")\n", " print(f\"Query: {row['query']}\")\n", " \n", " search_results = search_with_serpapi(row['query'])\n", " \n", 
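" # For each organic result returned by SerpAPI: record title/link/snippet, download the page HTML, and save it to disk along with its JSON metadata\n",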
" query_results = []\n", " for result_index, result in enumerate(search_results):\n", " title = result.get('title', 'No Title')\n", " url = result.get('link', '')\n", " snippet = result.get('snippet', '')\n", " \n", " print(f\" Result {result_index + 1}: {title[:50]}...\")\n", " \n", " html_content = fetch_html(url)\n", " filepath = save_html(\n", " html_content, \n", " row['report_index'], \n", " row['report_title'],\n", " row['query_index'], \n", " row['query'], \n", " result_index, \n", " title, \n", " url\n", " )\n", " \n", " result_info = {\n", " \"result_index\": result_index,\n", " \"title\": title,\n", " \"url\": url,\n", " \"snippet\": snippet,\n", " \"filepath\": filepath\n", " }\n", " \n", " query_results.append(result_info)\n", " \n", " # Timeout\n", " time.sleep(1)\n", " \n", " query_result = {\n", " \"report_index\": row['report_index'],\n", " \"report_title\": row['report_title'],\n", " \"query_index\": row['query_index'],\n", " \"query\": row['query'],\n", " \"purpose\": row['purpose'],\n", " \"results\": query_results\n", " }\n", " \n", " results.append(query_result)\n", " \n", " with open(base_dir / \"results_so_far.json\", \"w\") as f:\n", " json.dump(results, f, indent=2)\n", " \n", " return results\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Processing query 1/15\n", "Report: Llama 3.3: A Revolutionary Leap in AI\n", "Query: Llama 3.3 new features and enhancements\n", "Searching for: Llama 3.3 new features and enhancements\n", " Result 1: Introducing the new Llama 3.3: Features and Overvi...\n", " Result 2: What is Meta Llama 3.3 70B? Features, Use Cases & ...\n", " Result 3: Key Features and Improvements in LLaMA 3.3...\n", " Result 4: Everything You Need to Know About Llama 3.3 | by A...\n", "\n", "Processing query 2/15\n", "Report: Llama 3.3: A Revolutionary Leap in AI\n", "Query: Llama 3.3 vs Llama 3.1 performance comparison\n", "Searching for: Llama 3.3 vs Llama 3.1 performance comparison\n", " Result 1: Llama 3 vs 3.1 vs 3.2 : r/LocalLLaMA...\n", " Result 2: Llama 3.3 70B Instruct vs Llama 3.1 405B Instruct...\n", " Result 3: Choosing the Best Llama Model: Llama 3 vs 3.1 vs 3...\n", " Result 4: Llama 3.3 just dropped — is it better than GPT-4 o...\n", " Result 5: Llama 3 vs Llama 3.1 : Which is Better for Your AI...\n", "\n", "Processing query 3/15\n", "Report: Llama 3.3: A Revolutionary Leap in AI\n", "Query: Cost of running Llama 3.3 on cloud vs local infrastructure\n", "Searching for: Cost of running Llama 3.3 on cloud vs local infrastructure\n", " Result 1: What's the cost of running Llama3:8b & 70b in the ...\n", " Result 2: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Llama 3.3 vs. ChatGPT Pro: Key Considerations...\n", " Result 4: Llama 3.3 API Pricing: What You Need to Know...\n", " Result 5: Llama models | Generative AI...\n", "\n", "Processing query 4/15\n", "Report: Llama 3.3 vs Llama 3.1: A Comparative Analysis\n", "Query: Llama 3.3 new features and improvements\n", "Searching for: Llama 3.3 new features and improvements\n", " Result 1: What is Meta Llama 3.3 70B? Features, Use Cases & ...\n", " Result 2: What Is Meta's Llama 3.3 70B? 
How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Efficient, Accessible Generative AI on CPU with Ne...\n", " Result 4: Meta Releases Llama 3.3: a Model with Enhanced Per...\n", "\n", "Processing query 5/15\n", "Report: Llama 3.3 vs Llama 3.1: A Comparative Analysis\n", "Query: Llama 3.1 vs Llama 3.3 performance comparison\n", "Searching for: Llama 3.1 vs Llama 3.3 performance comparison\n", " Result 1: Llama 3 vs 3.1 vs 3.2 : r/LocalLLaMA...\n", " Result 2: Llama 3.3 70B Instruct vs Llama 3.1 405B Instruct...\n", " Result 3: Llama 3.3 just dropped — is it better than GPT-4 o...\n", " Result 4: Llama 3 vs Llama 3.1 : Which is Better for Your AI...\n", "\n", "Processing query 6/15\n", "Report: Llama 3.3 vs Llama 3.1: A Comparative Analysis\n", "Query: Cost of running Llama 3.3 vs Llama 3.1 on cloud and local infrastructure\n", "Searching for: Cost of running Llama 3.3 vs Llama 3.1 on cloud and local infrastructure\n", " Result 1: What's the cost of running Llama3:8b & 70b in the ...\n", " Result 2: The Million-Dollar Trick: LLAMA 3.1 is Free to Own...\n", " Result 3: Decoding Llama 3 vs 3.1: Which One Is Right for Yo...\n", " Result 4: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 5: Llama models | Generative AI...\n", "\n", "Processing query 7/15\n", "Report: The Cost-Benefit Analysis of Llama 3.3\n", "Query: Llama 3.3 new features and improvements\n", "Searching for: Llama 3.3 new features and improvements\n", " Result 1: What is Meta Llama 3.3 70B? Features, Use Cases & ...\n", " Result 2: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Efficient, Accessible Generative AI on CPU with Ne...\n", " Result 4: Meta Releases Llama 3.3: a Model with Enhanced Per...\n", "\n", "Processing query 8/15\n", "Report: The Cost-Benefit Analysis of Llama 3.3\n", "Query: Cost of running Llama 3.3 on cloud vs local\n", "Searching for: Cost of running Llama 3.3 on cloud vs local\n", " Result 1: Costs to run Llama 3.3 on cloud? : r/LocalLLaMA...\n", " Result 2: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Llama models | Generative AI...\n", " Result 4: Meta Llama in the Cloud | Llama Everywhere...\n", "Error fetching HTML from https://www.llama.com/docs/llama-everywhere/running-meta-llama-in-the-cloud/: 400 Client Error: Bad Request for url: https://www.llama.com/docs/llama-everywhere/running-meta-llama-in-the-cloud/\n", " Result 5: Llama 3.3 vs. 
ChatGPT Pro: Key Considerations...\n", "\n", "Processing query 9/15\n", "Report: The Cost-Benefit Analysis of Llama 3.3\n", "Query: Llama 3.3 vs Llama 3.1 performance comparison\n", "Searching for: Llama 3.3 vs Llama 3.1 performance comparison\n", " Result 1: Llama 3 vs 3.1 vs 3.2 : r/LocalLLaMA...\n", " Result 2: Llama 3.3 70B Instruct vs Llama 3.1 405B Instruct...\n", " Result 3: Choosing the Best Llama Model: Llama 3 vs 3.1 vs 3...\n", " Result 4: Llama 3.3 just dropped — is it better than GPT-4 o...\n", " Result 5: Llama 3 vs Llama 3.1 : Which is Better for Your AI...\n", "\n", "Processing query 10/15\n", "Report: Llama 3.3: The Future of AI-Driven Innovation\n", "Query: Llama 3.3 new features and enhancements\n", "Searching for: Llama 3.3 new features and enhancements\n", " Result 1: Introducing the new Llama 3.3: Features and Overvi...\n", " Result 2: What is Meta Llama 3.3 70B? Features, Use Cases & ...\n", " Result 3: Key Features and Improvements in LLaMA 3.3...\n", " Result 4: Everything You Need to Know About Llama 3.3 | by A...\n", "\n", "Processing query 11/15\n", "Report: Llama 3.3: The Future of AI-Driven Innovation\n", "Query: Llama 3.3 vs Llama 3.1 comparison\n", "Searching for: Llama 3.3 vs Llama 3.1 comparison\n", " Result 1: Llama 3 vs 3.1 vs 3.2 : r/LocalLLaMA...\n", " Result 2: Llama 3 vs Llama 3.1 : Which is Better for Your AI...\n", " Result 3: Llama 3.3 70B Instruct vs Llama 3.1 405B Instruct...\n", " Result 4: Llama 3.1 vs Llama 3 Differences - GoPenAI...\n", " Result 5: Decoding Llama 3 vs 3.1: Which One Is Right for Yo...\n", "\n", "Processing query 12/15\n", "Report: Llama 3.3: The Future of AI-Driven Innovation\n", "Query: Cost of running Llama 3.3 on cloud vs local infrastructure\n", "Searching for: Cost of running Llama 3.3 on cloud vs local infrastructure\n", " Result 1: What's the cost of running Llama3:8b & 70b in the ...\n", " Result 2: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Llama 3.3 vs. ChatGPT Pro: Key Considerations...\n", " Result 4: Llama 3.3 API Pricing: What You Need to Know...\n", " Result 5: Llama models | Generative AI...\n", "\n", "Processing query 13/15\n", "Report: Llama 3.3: A Technical Deep Dive\n", "Query: Llama 3.3 architecture and technical specifications\n", "Searching for: Llama 3.3 architecture and technical specifications\n", " Result 1: What Is Meta's Llama 3.3 70B? 
How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 2: Introducing Meta Llama 3: The most capable openly ...\n", "Error fetching HTML from https://ai.meta.com/blog/meta-llama-3/: 400 Client Error: Bad Request for url: https://ai.meta.com/blog/meta-llama-3/\n", " Result 3: meta-llama/Llama-3.3-70B-Instruct...\n", " Result 4: llama-3.3-70b-instruct Model by Meta...\n", " Result 5: Llama-3.3-70B - Documentation & FAQ...\n", "\n", "Processing query 14/15\n", "Report: Llama 3.3: A Technical Deep Dive\n", "Query: Llama 3.3 vs Llama 3.1 comparison\n", "Searching for: Llama 3.3 vs Llama 3.1 comparison\n", " Result 1: Llama 3 vs 3.1 vs 3.2 : r/LocalLLaMA...\n", " Result 2: Llama 3 vs Llama 3.1 : Which is Better for Your AI...\n", " Result 3: Llama 3.3 70B Instruct vs Llama 3.1 405B Instruct...\n", " Result 4: Llama 3.1 vs Llama 3 Differences - GoPenAI...\n", " Result 5: Decoding Llama 3 vs 3.1: Which One Is Right for Yo...\n", "\n", "Processing query 15/15\n", "Report: Llama 3.3: A Technical Deep Dive\n", "Query: Cost of running Llama 3.3 on cloud vs local infrastructure\n", "Searching for: Cost of running Llama 3.3 on cloud vs local infrastructure\n", " Result 1: What's the cost of running Llama3:8b & 70b in the ...\n", " Result 2: What Is Meta's Llama 3.3 70B? How It Works, Use Ca...\n", "Error fetching HTML from https://www.datacamp.com/blog/llama-3-3-70b: 403 Client Error: Forbidden for url: https://www.datacamp.com/blog/llama-3-3-70b\n", " Result 3: Llama 3.3 vs. ChatGPT Pro: Key Considerations...\n", " Result 4: Llama 3.3 API Pricing: What You Need to Know...\n", " Result 5: Llama models | Generative AI...\n" ] } ], "source": [ "results = process_all_queries(queries_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def analyze_results():\n", "\n", " try:\n", " with open(base_dir / \"results_so_far.json\", \"r\") as f:\n", " results = json.load(f)\n", " \n", " total_results = sum(len(query[\"results\"]) for query in results)\n", " print(f\"Total queries processed: {len(results)}\")\n", " print(f\"Total search results fetched: {total_results}\")\n", " \n", " summary_data = []\n", " for query in results:\n", " report_title = query[\"report_title\"]\n", " query_text = query[\"query\"]\n", " results_count = len(query[\"results\"])\n", " \n", " summary_data.append({\n", " \"Report\": report_title,\n", " \"Query\": query_text,\n", " \"Results Count\": results_count\n", " })\n", " \n", " summary_df = pd.DataFrame(summary_data)\n", " return summary_df\n", " except FileNotFoundError:\n", " print(\"No results file found. 
Run the processing first.\")\n", " return None\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total queries processed: 15\n", "Total search results fetched: 70\n" ] } ], "source": [ "summary_df = analyze_results()\n", "# if summary_df is not None:\n", "# summary_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }