4 years ago · eee29646a0
--- a/ai/Megatron/English/Python/Start_Here.ipynb
+++ b/ai/Megatron/English/Python/Start_Here.ipynb
@@ -190,7 +190,7 @@
     "\n",
     "- **Outlines of Lab 1**\n",
     "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
-    "    1. [WebCrawling](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
+    "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
     "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
     "    3. [Understanding the core of Megatron - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
     "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "placed-inspection",
+   "id": "quality-channel",
    "metadata": {},
    "source": [
     "## Website scrapping\n",
@@ -16,7 +16,7 @@
     "## Learning Objectives\n",
     "The goal of this lab is to obtain raw text data via webscrapping.\n",
     "\n",
-    "The raw text data obtained from this notebook will be used for subsequent notebooks for Lab1\n",
+    "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
     "\n",
     "This notebook covers the below steps : \n",
     "\n",
@@ -32,7 +32,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "trained-midwest",
+   "id": "everyday-leonard",
    "metadata": {},
    "source": [
     "1. install python libraries and download 2 python scripts which will be used for website crawling."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "unknown-spiritual",
+   "id": "exotic-grave",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "meaning-dream",
+   "id": "tamil-electric",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,7 +68,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "periodic-dispute",
+   "id": "precious-birth",
    "metadata": {},
    "source": [
     "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
@@ -77,7 +77,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "executed-spanish",
+   "id": "dietary-beads",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,7 +87,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "owned-alignment",
+   "id": "potential-regard",
    "metadata": {},
    "source": [
     "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
@@ -100,7 +100,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "fallen-dating",
+   "id": "amazing-nickname",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -122,7 +122,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "exceptional-grain",
+   "id": "military-electronics",
    "metadata": {},
    "source": [
     "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "acquired-afghanistan",
+   "id": "worth-album",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +141,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "billion-service",
+   "id": "collective-dimension",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "speaking-basin",
+   "id": "heard-recovery",
    "metadata": {},
    "source": [
     "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
@@ -166,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "indoor-bachelor",
+   "id": "suspended-degree",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -202,7 +202,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "indie-fusion",
+   "id": "continued-voice",
    "metadata": {},
    "source": [
     "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
@@ -211,7 +211,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "korean-given",
+   "id": "german-shareware",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -220,7 +220,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "caroline-assault",
+   "id": "willing-charleston",
    "metadata": {},
    "source": [
     "**Note:** Please run below cell to free up space."
@@ -229,7 +229,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "developmental-casino",
+   "id": "square-montana",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,7 +241,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "integrated-omega",
+   "id": "brave-ranking",
    "metadata": {},
    "source": [
     "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
@@ -250,7 +250,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "daily-england",
+   "id": "pressed-model",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -259,7 +259,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "comfortable-update",
+   "id": "worse-affairs",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -269,7 +269,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cutting-template",
+   "id": "convenient-treatment",
    "metadata": {},
    "source": [
     "--- \n",
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "minimal-translator",
+   "id": "sorted-federation",
    "metadata": {},
    "source": [
     "-----\n",
@@ -289,7 +289,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "reserved-knife",
+   "id": "exclusive-qualification",
    "metadata": {},
    "source": [
     "--- \n",