{ "cells": [ { "cell_type": "markdown", "id": "whole-offset", "metadata": {}, "source": [ "## The bootcamp computer environment: BerzeLiUs (SuperPOD in Sweden)\n", "The [BerzeLiUs](https://blogs.nvidia.com/blog/2021/03/23/ai-supercomputer-sweden/) system consists of 60 NVIDIA DGX A100 systems,\n", "linked by a 200 Gbit/second NVIDIA Mellanox InfiniBand HDR network.\n", "The same network links the processors to 1.5 petabytes of flash memory on four storage servers from DataDirect Networks.\n" ] }, { "cell_type": "markdown", "id": "martial-cathedral", "metadata": {}, "source": [ "## Learning Objectives\n", "Today (Day 3) we will focus on catering to the specifics of local language needs, in this case Swedish. We will give recommendations that you can optionally apply to your workflow, and include some practical, useful scripts to help you kick-start your own journey in training local-language Megatron GPT2/3 models.\n", "\n", "## Dataset\n", "Today we will be fetching and extracting Swedish data from [Språkbanken webbnyheter2013](https://spraakbanken.gu.se/en/resources/webbnyheter2013)\n", "\n", "### Bootcamp Outline (Day 3)\n", "This is day 3 of the bootcamp. We are focusing on familiarizing ourselves with the Megatron default workflow,\n", "given the SuperPOD environment with ? GPUs per attendee.\n", "We will start with data cleansing, using the tools in the [Megatron repo](https://github.com/NVIDIA/Megatron-LM/tools/openwebtext), and aim to understand how to make full use of GPU performance by experimenting with various Megatron GPT training configurations. 
\n", "\n", "- [Fetch and extract Swedish data](./Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb)\n", "- [Find sentence boundaries and deduplicate your data](./Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb)\n", "    - [Mini challenge - approaching ground truth](./Megatron-LM/tools/openwebtext/Day3-1_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)\n", "- [Train your own GPTBPE Tokenizer on your own data](./Day3-3_train_own_GPT2BPETokenizer.ipynb)\n", "- [Customize the data preprocessing Python script and convert to mmap](./Day3-4_customize_process2mmap.ipynb)\n", "- [The Challenge - Go Big or go home!](./Day3-4_run_Megatron_with_varying_config.ipynb)\n", "\n", "\n", "### Tutorial Duration\n", "The lab material will be presented in a 4-hour session. A link to the scripts (without the data) will be available for download at the end of the bootcamp.\n", "\n", "### Content Level\n", "Intermediate, advanced\n", "\n", "### Target Audience and Prerequisites\n", "The target audience for this lab is NLP researchers, data scientists and NLP engineers who are interested in adopting Megatron to train their own GPT2/3 models on their own language.\n", "\n", "Basic experience with Python programming is needed. No GPU programming knowledge is required.\n", "\n" ] }, { "cell_type": "markdown", "id": "nearby-village", "metadata": {}, "source": [ "---\n", "## Up Next\n", "\n", "[Fetch and extract Swedish data](./Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb)\n", "\n", "## Back To Start Menu\n", "[Start menu](../Start_Here.ipynb)" ] }, { "cell_type": "markdown", "id": "lined-participation", "metadata": {}, "source": [ "-----\n", "## Licensing\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }