@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "placed-inspection",
+   "id": "quality-channel",
    "metadata": {},
    "source": [
     "## Website scrapping\n",
@@ -16,7 +16,7 @@
     "## Learning Objectives\n",
     "The goal of this lab is to obtain raw text data via webscrapping.\n",
     "\n",
-    "The raw text data obtained from this notebook will be used for subsequent notebooks for Lab1\n",
+    "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
     "\n",
     "This notebook covers the below steps : \n",
     "\n",
@@ -32,7 +32,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "trained-midwest",
+   "id": "everyday-leonard",
    "metadata": {},
    "source": [
     "1. install python libraries and download 2 python scripts which will be used for website crawling."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "unknown-spiritual",
+   "id": "exotic-grave",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "meaning-dream",
+   "id": "tamil-electric",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,7 +68,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "periodic-dispute",
+   "id": "precious-birth",
    "metadata": {},
    "source": [
     "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
@@ -77,7 +77,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "executed-spanish",
+   "id": "dietary-beads",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,7 +87,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "owned-alignment",
+   "id": "potential-regard",
    "metadata": {},
    "source": [
     "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
@@ -100,7 +100,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "fallen-dating",
+   "id": "amazing-nickname",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -122,7 +122,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "exceptional-grain",
+   "id": "military-electronics",
    "metadata": {},
    "source": [
     "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "acquired-afghanistan",
+   "id": "worth-album",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +141,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "billion-service",
+   "id": "collective-dimension",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "speaking-basin",
+   "id": "heard-recovery",
    "metadata": {},
    "source": [
     "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
@@ -166,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "indoor-bachelor",
+   "id": "suspended-degree",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -202,7 +202,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "indie-fusion",
+   "id": "continued-voice",
    "metadata": {},
    "source": [
     "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
@@ -211,7 +211,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "korean-given",
+   "id": "german-shareware",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -220,7 +220,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "caroline-assault",
+   "id": "willing-charleston",
    "metadata": {},
    "source": [
     "**Note:** Please run below cell to free up space."
@@ -229,7 +229,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "developmental-casino",
+   "id": "square-montana",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,7 +241,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "integrated-omega",
+   "id": "brave-ranking",
    "metadata": {},
    "source": [
     "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
@@ -250,7 +250,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "daily-england",
+   "id": "pressed-model",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -259,7 +259,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "comfortable-update",
+   "id": "worse-affairs",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -269,7 +269,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cutting-template",
+   "id": "convenient-treatment",
    "metadata": {},
    "source": [
     "--- \n",
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "minimal-translator",
+   "id": "sorted-federation",
    "metadata": {},
    "source": [
     "-----\n",
@@ -289,7 +289,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "reserved-knife",
+   "id": "exclusive-qualification",
    "metadata": {},
    "source": [
     "--- \n",