{ "cells": [ { "cell_type": "markdown", "id": "micro-village", "metadata": {}, "source": [ "## Website scrapping\n", "\n", "It is strongly recommanded to consult with local legal department for compliance before proceeding to scrap content from websites/webpages with permission.\n", "\n", "There is no one-fits-all website scrapping solution, when applying the following to other websites/webpages, please modify accoridngly.\n", "\n", "---\n", "\n", "## Learning Objectives\n", "The goal of this lab is to obtain raw text data via webscrapping.\n", "\n", "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n", "\n", "This notebook covers the below steps : \n", "\n", " 1. Install necessary python libraries and download 2 python scripts which will be used for website crawling.\n", " 2. Crawl links from a seeded url and write to a text file.\n", " 3. Remove incompliant links from the text file in order to ensure legal compliancy.\n", " 4. Fetch the corresponding webpage from each approved url and write it to html format.\n", " 5. Parse the html file and extract raw text and write to disk.\n", " 6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**.\n", "\n", "This notebook did not intend to cover crawling webpages for other websites/webpages.\n" ] }, { "cell_type": "markdown", "id": "listed-stopping", "metadata": {}, "source": [ "1. install python libraries and download 2 python scripts which will be used for website crawling." ] }, { "cell_type": "code", "execution_count": null, "id": "endangered-vessel", "metadata": {}, "outputs": [], "source": [ "# install python libraries\n", "!pip install beautifulsoup4\n", "!pip install html5lib\n", "!pip install PyPDF2\n", "!pip install selenium\n", "!pip install Scrapy\n", "!pip install requests bs4 colorama requests-html" ] }, { "cell_type": "code", "execution_count": null, "id": "designed-insight", "metadata": {}, "outputs": [], "source": [ "# download 2 python scripts which will be used for website crawling\n", "!wget https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/web-scraping/link-extractor/link_extractor.py\n", "!wget https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/web-scraping/link-extractor/link_extractor_js.py" ] }, { "cell_type": "markdown", "id": "knowing-andorra", "metadata": {}, "source": [ "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`" ] }, { "cell_type": "code", "execution_count": null, "id": "natural-commander", "metadata": {}, "outputs": [], "source": [ "# extracting links from a seeded url and write to a raw text file\n", "!python link_extractor_js.py https://blogs.nvidia.com/blog/category/deep-learning/ -m 2" ] }, { "cell_type": "markdown", "id": "incredible-lunch", "metadata": {}, "source": [ "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n", "\n", " Normally, one should check with the legal and remove each incompliant link.\n", "\n", " For this exercise, a pre-filtered `NVdevblog_urls.txt` is provided for your convenience." 
] }, { "cell_type": "code", "execution_count": null, "id": "angry-cattle", "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import urllib.request as request\n", "import re\n", "import scrapy\n", "import random\n", "import os, sys\n", "# create folder to hold the scrapped html pages\n", "os.makedirs('./htmls/', exist_ok=True)\n", "# read the NVdevblog_urls.txt file and print out a sample url to view\n", "f=open('NVdevblog_urls.txt','r')\n", "lines=f.readlines()\n", "rn=random.randint(0,len(lines)-1)\n", "url=str(lines[rn]).strip()\n", "url" ] }, { "cell_type": "markdown", "id": "defined-interim", "metadata": {}, "source": [ "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format." ] }, { "cell_type": "code", "execution_count": null, "id": "competitive-tower", "metadata": {}, "outputs": [], "source": [ "!python -c \"import scrapy\"\n", "!bash fetchURLs_and_write2html.sh" ] }, { "cell_type": "markdown", "id": "guided-certification", "metadata": {}, "source": [ "Below is an example of expected outputs :\n", "\n", " ./htmls/response_64.html\n", " ./htmls/response_65.html\n", " ./htmls/response_66.html\n", " ./htmls/response_67.html\n", " ./htmls/response_68.html\n", " ./htmls/response_69.html\n", " ./htmls/response_70.html" ] }, { "cell_type": "markdown", "id": "neural-method", "metadata": {}, "source": [ "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`." ] }, { "cell_type": "code", "execution_count": null, "id": "affected-albania", "metadata": {}, "outputs": [], "source": [ "\n", "from bs4 import BeautifulSoup\n", "import requests\n", "import pandas as pd\n", "import csv\n", "import html5lib\n", "import codecs\n", "import os,sys\n", "# read the html into python\n", "def covert2txt(html_f ,f_out):\n", " file = codecs.open(html_f, \"r\", \"utf-8\")\n", " html_doc=file.read()\n", " soup = BeautifulSoup(html_doc)\n", " sent_cnt=0\n", " for node in soup.findAll('p'):\n", " #print(type(node.text), node.text)\n", " if node.text not in ['/n','','\\t',' ','\\n\\r'] : \n", " sent_cnt+=1\n", " f_out.write(node.text) \n", " f_out.write('\\n') \n", "html_dir='./htmls/'\n", "htmls=os.listdir('./htmls')\n", "f_out=open('extractedNVblogs.txt' , 'a')\n", "for html in htmls:\n", " outtxt=html.split('.')[0] \n", " covert2txt(html_dir+html ,f_out)\n", "f_out.close()\n", "print(\"finish processing htmls files and convert them to raw txt file\") " ] }, { "cell_type": "markdown", "id": "first-research", "metadata": {}, "source": [ "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1." ] }, { "cell_type": "code", "execution_count": 12, "id": "fourth-certificate", "metadata": {}, "outputs": [], "source": [ "!mv extractedNVblogs.txt ../../../../dataset/EN/" ] }, { "cell_type": "markdown", "id": "cultural-retro", "metadata": {}, "source": [ "**Note:** Please run below cell to free up space." ] }, { "cell_type": "code", "execution_count": 14, "id": "stretch-pattern", "metadata": {}, "outputs": [], "source": [ "!rm -fr htmls*\n", "!rm link_extractor.py\n", "!rm link_extractor_js.py\n", "!rm blogs.nvidia.com_*" ] }, { "cell_type": "markdown", "id": "black-courage", "metadata": {}, "source": [ "Verify `extractedNVblogs.txt` is successfully moved to the correct folder." 
] }, { "cell_type": "code", "execution_count": null, "id": "standing-bridges", "metadata": {}, "outputs": [], "source": [ "!head -1 ../../../../dataset/EN/extractedNVblogs.txt" ] }, { "cell_type": "markdown", "id": "prepared-ballot", "metadata": {}, "source": [ "Below is an example of expected outputs :\n", "\n", " The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.Image segmentation partitions a digital image into multiple segments by changing the representation into something more meaningful and easier to analyze. In the field of medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information. It does this by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and other formats.To achieve state-of-the-art models that deliver the desired accuracy and performance for a use case, you must set up the right environment, train with the ideal hyperparameters, and optimize it to achieve the desired accuracy. All of this can be time-consuming. Data scientists and developers need the right set of tools to quickly overcome tedious tasks." ] }, { "cell_type": "markdown", "id": "cellular-termination", "metadata": {}, "source": [ "--- \n", "\n", "## Links and Resources\n", "Don't forget to check out additional webscraping documents such as [selenium](https://www.selenium.dev/selenium/docs/api/py/index.html), [scrapy](https://docs.scrapy.org/en/latest/) and [beautifulsoup](https://beautiful-soup-4.readthedocs.io/en/latest/).\n" ] }, { "cell_type": "markdown", "id": "complex-valentine", "metadata": {}, "source": [ "-----\n", "##

HOME      NEXT

" ] }, { "cell_type": "markdown", "id": "flush-bruce", "metadata": {}, "source": [ "--- \n", "\n", "## Licensing\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }