{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fabulous-yield",
   "metadata": {},
   "source": [
    "## Acquire Swedish data \n",
    "---\n",
    "\n",
    "For data licensing and privacy concerns, we will not providing training data in this bootcamp.\n",
    "\n",
    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for local language needs ( in this case, it is Swedish ), hence, the first thing we need to do, is to acquire Swedish raw text data.\n",
    "\n",
    "This notebook is therefore provided to assist acquisition of Swedish raw text data from språkbanken.\n",
    "\n",
    "following the steps given below -\n",
    "\n",
    "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
    "    \n",
    "    2. Unzip the data using bunzip and move the data to the correct folder under dataset.\n",
    "    \n",
    "    3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset.\n",
    "\n",
    "\n",
    "\n",
    "**About the data source : språkbank**  :\n",
    "\n",
    "This data belongs to Språkbanken, Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
    "[read more about språkbank](https://spraakbanken.gu.se/om)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "wrapped-fields",
   "metadata": {},
   "source": [
    "1. Download data via wget and download the python script which will be used to extract the Swedish text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "silent-writer",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2\n",
    "!wget https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sunset-moisture",
   "metadata": {},
   "source": [
    "2. unzip the data using bunzip and move the data to the correct folder under dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "satisfied-absolute",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bunzip2 -d webbnyheter2013.xml.bz2 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "detected-volleyball",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv ./webbnyheter2013.xml ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "expired-compact",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "minimal-charm",
   "metadata": {},
   "source": [
    "3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "hungry-fundamental",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os, sys\n",
    "import numpy as np\n",
    "import nltk\n",
    "from sb_corpus_reader import SBCorpusReader\n",
    "import random\n",
    "\n",
    "def write2csv(out_path, fname, sents):\n",
    "    f=open(out_path+fname,'a')\n",
    "    for s in sents:\n",
    "        if len(s)>=2:\n",
    "            s_text=' '.join(s)\n",
    "            f.write(s_text+'\\n')\n",
    "    print(\"finish processing \",fname)\n",
    "    f.close()\n",
    "    \n",
    "out_path='../../../../dataset/SV/'\n",
    "xml_f=out_path+'webbnyheter2013.xml'\n",
    "if xml_f.endswith('.xml') :    \n",
    "    corpus = SBCorpusReader(xml_f)\n",
    "    sents=corpus.sents()\n",
    "    print(sents[:2])\n",
    "    #n=len(sents)\n",
    "    #rn=random.randint(0,n-1)\n",
    "    #print(\"a random sample of sentence : \\n\".format(' '.join(sents[rn])))\n",
    "    fname='webnyheter2013.txt'  \n",
    "    print(\"write to : \",fname)\n",
    "    write2csv(out_path,fname,sents)\n",
    "    print('-----'*10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "thick-consortium",
   "metadata": {},
   "source": [
    "Verify the output `webnyheter2013.txt` exist under `../../../../dataset/SV/`, we need this raw text file to proceed the subsequent notebooks for Lab2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bronze-interpretation",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ../../../../dataset/SV/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pleased-saskatchewan",
   "metadata": {},
   "source": [
    "-----\n",
    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../../../../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab2-2_SentenceBoundary_and_Deduplicate.ipynb>NEXT</a></p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "local-limit",
   "metadata": {},
   "source": [
    "-----\n",
    "\n",
    "\n",
    "## Licensing \n",
    "\n",
    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}