{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pre-Processing the Text Data\n", "\n", "One of the Key Principles to understand during Pre-processing of Data is to have a clear Idea on how the Input data looks and and how we would like the end output to look like. \n", "\n", "This Steps followed for pre-processing are as follows :\n", "\n", "- Understanding the Format of the Data \n", "- Storing The Cyclone in a Dictionary\n", "- Converting the Dictionary to a Dataframe\n", "- Restructuring the Columns and making it readable\n", "- Replacing Sentinel Values and Removing Empty Strings\n", "- Removing Unwanted Spaces and Reindexing the Data frame\n", "- Save this Dataframe to a CSV File\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding the Format of the Data\n", "\n", "Let us Take a look at the Modified CSV Format used by the HURDAT2 Team :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf\", width=950, height=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The File has already been download and now let us read the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let's Open the File First\n", "atlantic = open(\"data/hurdat2-1851-2018-120319.txt\", \"r\")\n", "atlantic_raw = atlantic.read()\n", "\n", "# Running a counter to check first two letter of the Document\n", "import io\n", "from collections import Counter\n", "\n", "c = Counter()\n", "for line in io.StringIO(atlantic_raw):\n", " c[line[:2]] += 1\n", "#Printing Counter Output\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's Take a Moment to Understand what the Counter Output Means : \n", "\n", "* AL : Number of Atlantic Storms from 1851-2018 \n", "* 18 : Number of Entries in 19th Century ( 1851 - 1899)\n", "* 19 : Number of Entries in 20th Century ( 1900 - 1999)\n", "* 20 : Number of Entries in 21st Century ( 2000 - 2018)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing The Cyclone in a Dictionary\n", "\n", "Let us now create a Dictionary to store the Cyclone data according to their name." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Storing Each Cyclone in a Dictionary\n", "\n", "Let us now create a dictionary for each cyclone, storing its header line and its data lines together." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "\n", "# Build a list of dictionaries, one per cyclone, pairing each header with its data\n", "atlantic_storms_r = []\n", "atlantic_storm_r = {'header': None, 'data': []}\n", "\n", "for i, line in enumerate(io.StringIO(atlantic_raw)):\n", "    if line[:2] == 'AL':\n", "        # A new header line: store the previous storm and start a fresh one\n", "        atlantic_storms_r.append(atlantic_storm_r.copy())\n", "        atlantic_storm_r['header'] = line\n", "        atlantic_storm_r['data'] = []\n", "    else:\n", "        atlantic_storm_r['data'].append(line)\n", "# Store the final storm, which is not followed by another header line\n", "atlantic_storms_r.append(atlantic_storm_r.copy())\n", "# Remove the placeholder first element of the list and keep everything else\n", "atlantic_storms_r = atlantic_storms_r[1:]\n", "# Number of Atlantic cyclones\n", "len(atlantic_storms_r)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Converting the Dictionary to a DataFrame" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert each dictionary to a pandas DataFrame, which will be easier to work with later\n", "\n", "import pandas as pd\n", "\n", "atlantic_storm_dfs = []\n", "for storm_dict in atlantic_storms_r:\n", "    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(\",\")[:3]\n", "    # Remove the hanging newline ('\\n') and split the fields\n", "    data = [[entry.strip() for entry in datum[:-1].split(\",\")] for datum in storm_dict['data']]\n", "    frame = pd.DataFrame(data)\n", "    frame['id'] = storm_id\n", "    frame['name'] = storm_name\n", "    atlantic_storm_dfs.append(frame)\n", "\n", "# Print the first cyclone's data to see how it looks\n", "atlantic_storm_dfs[0]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Concatenate all the cyclones' data into one DataFrame\n", "atlantic_storms = pd.concat(atlantic_storm_dfs)\n", "len(atlantic_storms)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Restructuring the Columns and Making Them Readable" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Restructure the columns so that the last two (id, name) come first\n", "atlantic_storms = atlantic_storms.reindex(columns=list(atlantic_storms.columns[-2:]) + list(atlantic_storms.columns[:-2]))\n", "# Print the first 5 rows\n", "atlantic_storms.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the columns of the DataFrame\n", "atlantic_storms.columns" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make the DataFrame's columns readable\n", "atlantic_storms.columns = [\n", "    \"id\",\n", "    \"name\",\n", "    \"date\",\n", "    \"hours_minutes\",\n", "    \"record_identifier\",\n", "    \"status_of_system\",\n", "    \"latitude\",\n", "    \"longitude\",\n", "    \"maximum_sustained_wind_knots\",\n", "    \"maximum_pressure\",\n", "    \"34_kt_ne\",\n", "    \"34_kt_se\",\n", "    \"34_kt_sw\",\n", "    \"34_kt_nw\",\n", "    \"50_kt_ne\",\n", "    \"50_kt_se\",\n", "    \"50_kt_sw\",\n", "    \"50_kt_nw\",\n", "    \"64_kt_ne\",\n", "    \"64_kt_se\",\n", "    \"64_kt_sw\",\n", "    \"64_kt_nw\",\n", "    \"na\"\n", "]\n", "# Drop the trailing empty column produced by the trailing comma in each row\n", "del atlantic_storms['na']\n", "pd.set_option(\"display.max_columns\", None)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's have a look at our DataFrame:\n", "atlantic_storms.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing Sentinel Values and Removing Empty Strings\n", "\n", "Now that we have completed most of the parsing, let us make some final fixes by changing the sentinel value '-999', which HURDAT2 uses to mark missing data, to NaN (Not a Number)." ] },
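{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of what `DataFrame.replace` will do to the real data, here is a minimal sketch on a toy DataFrame; the column name and values below are made up." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of sentinel replacement on made-up data; the real\n", "# replacement on atlantic_storms follows in the next cell.\n", "import numpy as np\n", "import pandas as pd\n", "\n", "toy = pd.DataFrame({\"wind_radius\": [\"-999\", \"120\", \"-999\"]})\n", "print(toy.replace(to_replace=\"-999\", value=np.nan))" ] },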
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace all old sentinels (-999) with NaN\n", "\n", "# Value before the replacement\n", "print(atlantic_storms.iloc[0]['34_kt_sw'])\n", "\n", "# We use NumPy (Numerical Python) to supply the NaN value\n", "import numpy as np\n", "atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)\n", "# Value after the replacement\n", "atlantic_storms.iloc[0]['34_kt_sw']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the data types of the columns\n", "atlantic_storms.dtypes" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms['record_identifier'].value_counts()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us now also replace all the empty strings with NaN:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace all empty strings with NaN values\n", "atlantic_storms = atlantic_storms.replace(to_replace=\"\", value=np.nan)\n", "atlantic_storms['record_identifier'].value_counts(dropna=False)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let us have a look at the DataFrame now\n", "atlantic_storms.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Removing Unwanted Spaces and Reindexing the DataFrame" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Final fixes\n", "\n", "# Strip unwanted spaces from the names\n", "atlantic_storms['name'] = atlantic_storms['name'].map(lambda n: n.strip())\n", "\n", "# Reindex with a simple 0..n-1 range\n", "atlantic_storms.index = range(len(atlantic_storms.index))\n", "atlantic_storms.index.name = \"index\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Saving this DataFrame to a CSV File\n", "\n", "Let us now save the DataFrame into a CSV file, which we will use for annotating the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms.tail()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms.to_csv(\"atlantic.csv\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## License\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }