4 years ago · eee29646a0
--- a/ai/Megatron/English/Python/Start_Here.ipynb
+++ b/ai/Megatron/English/Python/Start_Here.ipynb
@@ -190,7 +190,7 @@
 
				     "\n",
			
 
				     "- **Outlines of Lab 1**\n",
			
 
				     "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
			
 
				-    "    1. [WebCrawling](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
			
 
				+    "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
			
 
				     "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
			
 
				     "    3. [Understanding the core of Megatron - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
			
 
				     "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb
@@ -2,7 +2,7 @@
 
				  "cells": [
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "placed-inspection",
			
 
				+   "id": "quality-channel",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "## Website scrapping\n",
			
@@ -16,7 +16,7 @@
 
				     "## Learning Objectives\n",
			
 
				     "The goal of this lab is to obtain raw text data via webscrapping.\n",
			
 
				     "\n",
			
 
				-    "The raw text data obtained from this notebook will be used for subsequent notebooks for Lab1\n",
			
 
				+    "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
			
 
				     "\n",
			
 
				     "This notebook covers the below steps : \n",
			
 
				     "\n",
			
@@ -32,7 +32,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "trained-midwest",
			
 
				+   "id": "everyday-leonard",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "1. install python libraries and download 2 python scripts which will be used for website crawling."
			
@@ -41,7 +41,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "unknown-spiritual",
			
 
				+   "id": "exotic-grave",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -57,7 +57,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "meaning-dream",
			
 
				+   "id": "tamil-electric",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -68,7 +68,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "periodic-dispute",
			
 
				+   "id": "precious-birth",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
			
@@ -77,7 +77,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "executed-spanish",
			
 
				+   "id": "dietary-beads",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -87,7 +87,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "owned-alignment",
			
 
				+   "id": "potential-regard",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
			
@@ -100,7 +100,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "fallen-dating",
			
 
				+   "id": "amazing-nickname",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -122,7 +122,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "exceptional-grain",
			
 
				+   "id": "military-electronics",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
			
@@ -131,7 +131,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "acquired-afghanistan",
			
 
				+   "id": "worth-album",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -141,7 +141,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "billion-service",
			
 
				+   "id": "collective-dimension",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Below is an example of expected outputs :\n",
			
@@ -157,7 +157,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "speaking-basin",
			
 
				+   "id": "heard-recovery",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
			
@@ -166,7 +166,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "indoor-bachelor",
			
 
				+   "id": "suspended-degree",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -202,7 +202,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "indie-fusion",
			
 
				+   "id": "continued-voice",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
			
@@ -211,7 +211,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 12,
			
 
				-   "id": "korean-given",
			
 
				+   "id": "german-shareware",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -220,7 +220,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "caroline-assault",
			
 
				+   "id": "willing-charleston",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "**Note:** Please run below cell to free up space."
			
@@ -229,7 +229,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 14,
			
 
				-   "id": "developmental-casino",
			
 
				+   "id": "square-montana",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -241,7 +241,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "integrated-omega",
			
 
				+   "id": "brave-ranking",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
			
@@ -250,7 +250,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "daily-england",
			
 
				+   "id": "pressed-model",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -259,7 +259,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "comfortable-update",
			
 
				+   "id": "worse-affairs",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Below is an example of expected outputs :\n",
			
@@ -269,7 +269,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "cutting-template",
			
 
				+   "id": "convenient-treatment",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "--- \n",
			
@@ -280,7 +280,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "minimal-translator",
			
 
				+   "id": "sorted-federation",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "-----\n",
			
@@ -289,7 +289,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "reserved-knife",
			
 
				+   "id": "exclusive-qualification",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "--- \n",