{ "cells": [ { "cell_type": "markdown", "source": "# Step 3: Web Content Collection\n\nThis notebook is responsible for searching the web and downloading relevant information based on the queries generated in Step 2. It serves as the \"research gathering\" phase of our workflow.\n\n## What This Notebook Does:\n\n1. **Web Searching**: Uses SerpAPI to perform Google searches for each query\n2. **Content Download**: Retrieves HTML content from search results\n3. **Data Organization**: Creates a structured directory for all downloaded content\n4. **Metadata Tracking**: Records information about each search and download\n\nThe goal is to collect a diverse set of web content that will be processed and analyzed in Step 4. This notebook acts as the bridge between our AI-generated research questions and the actual data collection process.", "metadata": {} }, { "cell_type": "markdown", "source": "## Required Dependencies\n\nFirst, we need to install and import the necessary libraries:\n- **serpapi**: For performing Google searches via an API\n- **requests**: For downloading HTML content\n- **pandas**: For data organization and manipulation\n- **json**: For storing and retrieving structured data\n- **hashlib**: For creating unique identifiers for files\n- **pathlib**: For filesystem operations\n\nNote: You may need to run the pip install command below if SerpAPI is not already installed.", "metadata": {} }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#!pip install google-search-results" ] }, { "cell_type": "markdown", "source": "## Set Up Directory Structure\n\nWe'll create a structured file system to organize all our downloaded data:\n- **base_dir**: Main directory for all research data\n- **src_dir**: Directory for source files\n- **results_dir**: Directory for downloaded search results\n\nThis organization makes it easier to manage the large number of files we'll be working with.", "metadata": {} }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import time\n", "from serpapi import GoogleSearch\n", "import requests\n", "import hashlib\n", "from pathlib import Path\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "base_dir = Path(\"llama_data\")\n", "src_dir = base_dir / \"src\"\n", "results_dir = base_dir / \"results\"" ] }, { "cell_type": "markdown", "source": "## Load Report Outlines\n\nNow we'll load the detailed report outlines generated in Step 2. These outlines contain:\n1. Research report titles and topics\n2. Web search queries for each report\n3. The purpose of each query\n\nThis data will guide our web search process and ensure we're collecting information that's directly relevant to our research goals.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_dir.mkdir(exist_ok=True)\n", "src_dir.mkdir(exist_ok=True)\n", "results_dir.mkdir(exist_ok=True)\n" ] }, { "cell_type": "markdown", "source": "## Verify Data Loading\n\nLet's check that we've successfully loaded the report outlines and display:\n1. The total number of reports loaded\n2. A sample report title\n3. Sample queries for one report\n\nThis helps us confirm we're working with the expected data before proceeding.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('generated_outlines.json', 'r') as file:\n", " content = file.read()\n", " data = json.loads(content)" ] }, { "cell_type": "markdown", "source": "## Extract All Queries\n\nNow we'll extract and organize all the queries from the report outlines:\n1. Loop through each report\n2. Extract metadata (report index, title)\n3. Extract all queries and their purposes\n4. Combine everything into a structured format\n\nThis gives us a flat list of all queries across all reports, making it easier to process them systematically.", "metadata": {} }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 5 report outlines\n" ] } ], "source": [ "print(f\"Loaded {len(data)} report outlines\")\n" ] }, { "cell_type": "markdown", "source": "## Convert to DataFrame for Analysis\n\nWe'll convert our query list to a pandas DataFrame for easier:\n- Visualization\n- Filtering\n- Analysis\n\nThe DataFrame gives us a clean tabular view of all queries we'll be researching.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Sample report title: Llama 3.3: A Revolutionary Leap in AI\n", "Sample queries:\n", "- Llama 3.3 new features and enhancements: To gather information on the new features and enhancements in Llama 3.3\n", "- Llama 3.3 vs Llama 3.1 performance comparison: To gather information on the performance comparison between Llama 3.3 and Llama 3.1\n" ] } ], "source": [ "print(\"\\nSample report title:\", data[0].get('original_goal', {}).get('Report Title', 'No title'))\n", "print(\"Sample queries:\")\n", "for query in data[0].get('Web Queries', [])[:2]:\n", " print(f\"- {query.get('query')}: {query.get('purpose')}\")" ] }, { "cell_type": "markdown", "source": "## Set Up SerpAPI Key for Web Searches\n\n**CRITICAL STEP**: You must add your SerpAPI key here.\n\nSerpAPI is a service that allows us to programmatically access Google search results. \nIt requires an API key to function.\n\nTo get your key:\n1. Go to https://serpapi.com/ and sign up (they offer free credits)\n2. Find your API key in your account dashboard\n3. Paste your key in the string below\n\nWithout a valid API key, the web searches will fail and the notebook won't be able to collect data.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_queries = []" ] }, { "cell_type": "markdown", "source": "## Define Search Function\n\nThis function handles the web search part of our process:\n1. Takes a query string and number of results to return\n2. Uses SerpAPI to perform a Google search\n3. Returns the organic search results (excluding ads, etc.)\n4. Provides error handling if search fails\n\nThe function returns a list of search results containing:\n- Title\n- URL\n- Snippet of content", "metadata": {} }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "for report_index, report_data in enumerate(data):\n", " report_title = report_data.get('original_goal', {}).get('Report Title', f\"Report {report_index}\")\n", " \n", " for query_index, query_data in enumerate(report_data.get('Web Queries', [])):\n", " query = query_data.get('query', '')\n", " purpose = query_data.get('purpose', '')\n", " \n", " all_queries.append({\n", " 'report_index': report_index,\n", " 'report_title': report_title,\n", " 'query_index': query_index,\n", " 'query': query,\n", " 'purpose': purpose\n", " })" ] }, { "cell_type": "markdown", "source": "## Define HTML Download Function\n\nThis function handles downloading the actual HTML content:\n1. Takes a URL to fetch\n2. Uses requests library with appropriate headers (to avoid blocks)\n3. Sets a timeout to avoid hanging on slow sites\n4. Handles errors gracefully with informative messages\n\nThe function returns the HTML content as text if successful, or None if the download fails.", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total queries extracted: 15\n" ] }, { "data": { "text/html": [ "
| \n", " | report_index | \n", "report_title | \n", "query_index | \n", "query | \n", "purpose | \n", "
|---|---|---|---|---|---|
| 0 | \n", "0 | \n", "Llama 3.3: A Revolutionary Leap in AI | \n", "0 | \n", "Llama 3.3 new features and enhancements | \n", "To gather information on the new features and ... | \n", "
| 1 | \n", "0 | \n", "Llama 3.3: A Revolutionary Leap in AI | \n", "1 | \n", "Llama 3.3 vs Llama 3.1 performance comparison | \n", "To gather information on the performance compa... | \n", "
| 2 | \n", "0 | \n", "Llama 3.3: A Revolutionary Leap in AI | \n", "2 | \n", "Cost of running Llama 3.3 on cloud vs local in... | \n", "To gather information on the cost-effectivenes... | \n", "
| 3 | \n", "1 | \n", "Llama 3.3 vs Llama 3.1: A Comparative Analysis | \n", "0 | \n", "Llama 3.3 new features and improvements | \n", "To gather information on new features and impr... | \n", "
| 4 | \n", "1 | \n", "Llama 3.3 vs Llama 3.1: A Comparative Analysis | \n", "1 | \n", "Llama 3.1 vs Llama 3.3 performance comparison | \n", "To gather information on performance differenc... | \n", "