{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pre-Processing the Text Data\n", "\n", "One of the Key Principles to understand during Pre-processing of Data is to have a clear Idea on how the Input data looks and and how we would like the end output to look like. \n", "\n", "This Steps followed for pre-processing are as follows :\n", "\n", "- Understanding the Format of the Data \n", "- Storing The Cyclone in a Dictionary\n", "- Converting the Dictionary to a Dataframe\n", "- Restructuring the Columns and making it readable\n", "- Replacing Sentinel Values and Removing Empty Strings\n", "- Removing Unwanted Spaces and Reindexing the Data frame\n", "- Save this Dataframe to a CSV File\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding the Format of the Data\n", "\n", "Let us Take a look at the Modified CSV Format used by the HURDAT2 Team :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf\", width=950, height=600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The File has already been download and now let us read the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let's Open the File First\n", "atlantic = open(\"data/hurdat2-1851-2018-120319.txt\", \"r\")\n", "atlantic_raw = atlantic.read()\n", "\n", "# Running a counter to check first two letter of the Document\n", "import io\n", "from collections import Counter\n", "\n", "c = Counter()\n", "for line in io.StringIO(atlantic_raw):\n", " c[line[:2]] += 1\n", "#Printing Counter Output\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's Take a Moment to Understand what the Counter Output Means : \n", "\n", "* AL : Number of Atlantic Storms from 1851-2018 \n", "* 18 : Number of Entries in 19th Century ( 1851 - 1899)\n", "* 19 : Number of Entries in 20th Century ( 1900 - 1999)\n", "* 20 : Number of Entries in 21st Century ( 2000 - 2018)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing The Cyclone in a Dictionary\n", "\n", "Let us now create a Dictionary to store the Cyclone data according to their name." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Storing Each Cyclone in a Dictionary\n", "\n", "Let us now create a dictionary for each cyclone, storing its header line and its data lines together." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "\n", "# Build a list of dictionaries, one per cyclone, pairing each header with its data\n", "atlantic_storms_r = []\n", "atlantic_storm_r = {'header': None, 'data': []}\n", "\n", "for i, line in enumerate(io.StringIO(atlantic_raw)):\n", "    if line[:2] == 'AL':\n", "        # A new header line: store the previous storm and start a fresh one\n", "        atlantic_storms_r.append(atlantic_storm_r.copy())\n", "        atlantic_storm_r['header'] = line\n", "        atlantic_storm_r['data'] = []\n", "    else:\n", "        atlantic_storm_r['data'].append(line)\n", "# Store the final storm, which is not followed by another header line\n", "atlantic_storms_r.append(atlantic_storm_r.copy())\n", "# Remove the placeholder first element of the list and keep everything else\n", "atlantic_storms_r = atlantic_storms_r[1:]\n", "# Number of Atlantic cyclones\n", "len(atlantic_storms_r)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Converting the Dictionary to a DataFrame" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert each dictionary to a pandas DataFrame, which will be easier to work with later\n", "\n", "import pandas as pd\n", "\n", "atlantic_storm_dfs = []\n", "for storm_dict in atlantic_storms_r:\n", "    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(\",\")[:3]\n", "    # Remove the hanging newline ('\\n') and split the fields\n", "    data = [[entry.strip() for entry in datum[:-1].split(\",\")] for datum in storm_dict['data']]\n", "    frame = pd.DataFrame(data)\n", "    frame['id'] = storm_id\n", "    frame['name'] = storm_name\n", "    atlantic_storm_dfs.append(frame)\n", "\n", "# Print the first cyclone's data to see how it looks\n", "atlantic_storm_dfs[0]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Concatenate all the cyclones' data into one DataFrame\n", "atlantic_storms = pd.concat(atlantic_storm_dfs)\n", "len(atlantic_storms)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Restructuring the Columns and Making Them Readable" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Restructure the columns so that the last two (id, name) come first\n", "atlantic_storms = atlantic_storms.reindex(columns=list(atlantic_storms.columns[-2:]) + list(atlantic_storms.columns[:-2]))\n", "# Print the first 5 rows\n", "atlantic_storms.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the columns of the DataFrame\n", "atlantic_storms.columns" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make the DataFrame's columns readable\n", "atlantic_storms.columns = [\n", "    \"id\",\n", "    \"name\",\n", "    \"date\",\n", "    \"hours_minutes\",\n", "    \"record_identifier\",\n", "    \"status_of_system\",\n", "    \"latitude\",\n", "    \"longitude\",\n", "    \"maximum_sustained_wind_knots\",\n", "    \"maximum_pressure\",\n", "    \"34_kt_ne\",\n", "    \"34_kt_se\",\n", "    \"34_kt_sw\",\n", "    \"34_kt_nw\",\n", "    \"50_kt_ne\",\n", "    \"50_kt_se\",\n", "    \"50_kt_sw\",\n", "    \"50_kt_nw\",\n", "    \"64_kt_ne\",\n", "    \"64_kt_se\",\n", "    \"64_kt_sw\",\n", "    \"64_kt_nw\",\n", "    \"na\"\n", "]\n", "# Drop the trailing empty column produced by the trailing comma in each row\n", "del atlantic_storms['na']\n", "pd.set_option(\"display.max_columns\", None)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's have a look at our DataFrame:\n", "atlantic_storms.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing Sentinel Values and Removing Empty Strings\n", "\n", "Now that we have completed most of the parsing, let us make some final fixes by changing the sentinel value '-999', which HURDAT2 uses to mark missing data, to NaN (Not a Number)." ] },
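{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of what `DataFrame.replace` will do to the real data, here is a minimal sketch on a toy DataFrame; the column name and values below are made up." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of sentinel replacement on made-up data; the real\n", "# replacement on atlantic_storms follows in the next cell.\n", "import numpy as np\n", "import pandas as pd\n", "\n", "toy = pd.DataFrame({\"wind_radius\": [\"-999\", \"120\", \"-999\"]})\n", "print(toy.replace(to_replace=\"-999\", value=np.nan))" ] },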
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace all old sentinels (-999) with NaN\n", "\n", "# Value before the replacement\n", "print(atlantic_storms.iloc[0]['34_kt_sw'])\n", "\n", "# We use NumPy (Numerical Python) to supply the NaN value\n", "import numpy as np\n", "atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)\n", "# Value after the replacement\n", "atlantic_storms.iloc[0]['34_kt_sw']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the data types of the columns\n", "atlantic_storms.dtypes" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms['record_identifier'].value_counts()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us now also replace all the empty strings with NaN:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace all empty strings with NaN values\n", "atlantic_storms = atlantic_storms.replace(to_replace=\"\", value=np.nan)\n", "atlantic_storms['record_identifier'].value_counts(dropna=False)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let us have a look at the DataFrame now\n", "atlantic_storms.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Removing Unwanted Spaces and Reindexing the DataFrame" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Final fixes\n", "\n", "# Strip unwanted spaces from the names\n", "atlantic_storms['name'] = atlantic_storms['name'].map(lambda n: n.strip())\n", "\n", "# Reindex with a simple 0..n-1 range\n", "atlantic_storms.index = range(len(atlantic_storms.index))\n", "atlantic_storms.index.name = \"index\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Saving this DataFrame to a CSV File\n", "\n", "Let us now save the DataFrame into a CSV file, which we will use for annotating the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms.tail()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "atlantic_storms.to_csv(\"atlantic.csv\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## License\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }