{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "exotic-testing",
   "metadata": {},
   "source": [
    "## Acquire Swedish data \n",
    "---\n",
    "\n",
    "For data licensing and privacy concerns, we will not provide training data in this bootcamp.\n",
    "\n",
    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for local language needs ( in this case, it is Swedish ), hence, the first thing we need to do, is acquiring some Swedish raw text data.\n",
    "\n",
    "This notebook is therefore provided to assist acquisition of Swedish raw text data from språkbanken.\n",
    "\n",
    "Gollowing the steps given below:\n",
    "\n",
    "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
    "    \n",
    "    2. Unzip the data using bunzip and move the data to the correct folder under dataset.\n",
    "    \n",
    "    3. A custom function is provided in order to extract the raw txt file from the xml file and move the text file to the correct folder under dataset.\n",
    "\n",
    "\n",
    "\n",
    "**About the data source : språkbank**  :\n",
    "\n",
    "This data belongs to Språkbanken, Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
    "[Learn more about språkbank here](https://spraakbanken.gu.se/om)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "disciplinary-sheet",
   "metadata": {},
   "source": [
    "1. Download data via wget and download the python script which will be used to extract the Swedish text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adverse-degree",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2\n",
    "!wget https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "antique-extra",
   "metadata": {},
   "source": [
    "2. Unzip the data using bunzip and move the data to the correct folder under dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rolled-banking",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bunzip2 -d webbnyheter2013.xml.bz2 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "isolated-mauritius",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv ./webbnyheter2013.xml ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "driving-wagon",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "weighted-hygiene",
   "metadata": {},
   "source": [
    "3. A custom function is provided in order to extract the raw txt file from the xml file and move the text file to the correct folder under dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "encouraging-business",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os, sys\n",
    "import numpy as np\n",
    "import nltk\n",
    "from sb_corpus_reader import SBCorpusReader\n",
    "import random\n",
    "\n",
    "def write2csv(out_path, fname, sents):\n",
    "    f=open(out_path+fname,'a')\n",
    "    for s in sents:\n",
    "        if len(s)>=2:\n",
    "            s_text=' '.join(s)\n",
    "            f.write(s_text+'\\n')\n",
    "    print(\"finish processing \",fname)\n",
    "    f.close()\n",
    "    \n",
    "out_path='../../../../dataset/SV/'\n",
    "xml_f=out_path+'webbnyheter2013.xml'\n",
    "if xml_f.endswith('.xml') :    \n",
    "    corpus = SBCorpusReader(xml_f)\n",
    "    sents=corpus.sents()\n",
    "    print(sents[:2])\n",
    "    #n=len(sents)\n",
    "    #rn=random.randint(0,n-1)\n",
    "    #print(\"a random sample of sentence : \\n\".format(' '.join(sents[rn])))\n",
    "    fname='webnyheter2013.txt'  \n",
    "    print(\"write to : \",fname)\n",
    "    write2csv(out_path,fname,sents)\n",
    "    print('-----'*10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "built-subcommittee",
   "metadata": {},
   "source": [
    "Verify the output `webnyheter2013.txt` exists under `../../../../dataset/SV/`. We need this raw text file to proceed with the subsequent notebooks for Lab2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "conservative-andorra",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "modern-handbook",
   "metadata": {},
   "source": [
    "-----\n",
    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../../../../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab2-2_SentenceBoundary_and_Deduplicate.ipynb>NEXT</a></p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "portable-excuse",
   "metadata": {},
   "source": [
    "-----\n",
    "\n",
    "\n",
    "## Licensing \n",
    "\n",
    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}