@@ -2,7 +2,7 @@
  "cells": [
  {
  "cell_type": "markdown",
- "id": "placed-inspection",
+ "id": "quality-channel",
  "metadata": {},
  "source": [
  "## Website scrapping\n",
@@ -16,7 +16,7 @@
  "## Learning Objectives\n",
  "The goal of this lab is to obtain raw text data via webscrapping.\n",
  "\n",
- "The raw text data obtained from this notebook will be used for subsequent notebooks for Lab1\n",
+ "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
  "\n",
  "This notebook covers the below steps : \n",
  "\n",
@@ -32,7 +32,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "trained-midwest",
+ "id": "everyday-leonard",
  "metadata": {},
  "source": [
  "1. install python libraries and download 2 python scripts which will be used for website crawling."
@@ -41,7 +41,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "unknown-spiritual",
+ "id": "exotic-grave",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -57,7 +57,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "meaning-dream",
+ "id": "tamil-electric",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -68,7 +68,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "periodic-dispute",
+ "id": "precious-birth",
  "metadata": {},
  "source": [
  "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
@@ -77,7 +77,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "executed-spanish",
+ "id": "dietary-beads",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -87,7 +87,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "owned-alignment",
+ "id": "potential-regard",
  "metadata": {},
  "source": [
  "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
@@ -100,7 +100,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "fallen-dating",
+ "id": "amazing-nickname",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -122,7 +122,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "exceptional-grain",
+ "id": "military-electronics",
  "metadata": {},
  "source": [
  "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
@@ -131,7 +131,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "acquired-afghanistan",
+ "id": "worth-album",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -141,7 +141,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "billion-service",
+ "id": "collective-dimension",
  "metadata": {},
  "source": [
  "Below is an example of expected outputs :\n",
@@ -157,7 +157,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "speaking-basin",
+ "id": "heard-recovery",
  "metadata": {},
  "source": [
  "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
@@ -166,7 +166,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "indoor-bachelor",
+ "id": "suspended-degree",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -202,7 +202,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "indie-fusion",
+ "id": "continued-voice",
  "metadata": {},
  "source": [
  "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
@@ -211,7 +211,7 @@
  {
  "cell_type": "code",
  "execution_count": 12,
- "id": "korean-given",
+ "id": "german-shareware",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -220,7 +220,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "caroline-assault",
+ "id": "willing-charleston",
  "metadata": {},
  "source": [
  "**Note:** Please run below cell to free up space."
@@ -229,7 +229,7 @@
  {
  "cell_type": "code",
  "execution_count": 14,
- "id": "developmental-casino",
+ "id": "square-montana",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -241,7 +241,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "integrated-omega",
+ "id": "brave-ranking",
  "metadata": {},
  "source": [
  "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
@@ -250,7 +250,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "daily-england",
+ "id": "pressed-model",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -259,7 +259,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "comfortable-update",
+ "id": "worse-affairs",
  "metadata": {},
  "source": [
  "Below is an example of expected outputs :\n",
@@ -269,7 +269,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "cutting-template",
+ "id": "convenient-treatment",
  "metadata": {},
  "source": [
  "--- \n",
@@ -280,7 +280,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "minimal-translator",
+ "id": "sorted-federation",
  "metadata": {},
  "source": [
  "-----\n",
@@ -289,7 +289,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "reserved-knife",
+ "id": "exclusive-qualification",
  "metadata": {},
  "source": [
  "--- \n",