# Pre-Processing the Text Data

One of the key principles of pre-processing is to have a clear idea of what the input data looks like and what we would like the final output to look like.

The steps followed for pre-processing are as follows:

- Understanding the Format of the Data 
- Storing The Cyclone in a Dictionary
- Converting the Dictionary to a Dataframe
- Restructuring the Columns and making it readable
- Replacing Sentinel Values and Removing Empty Strings
- Removing Unwanted Spaces and Reindexing the Data frame
- Saving this Dataframe to a CSV File


### Understanding the Format of the Data

Let us take a look at the modified CSV format used by the HURDAT2 team:

In [None]:
from IPython.display import IFrame
IFrame("http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf", width=950, height=600)

The file has already been downloaded, so let us now read it.

In [None]:
import io
from collections import Counter

# Open the file and read its whole content; the with-block closes the file afterwards
with open("data/hurdat2-1851-2018-120319.txt", "r") as atlantic:
    atlantic_raw = atlantic.read()

# Count the first two characters of every line in the document
c = Counter()
for line in io.StringIO(atlantic_raw):
    c[line[:2]] += 1
# Print the counter output
print(c)

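Let us take a moment to understand what the counter output means. Header lines start with the basin code `AL`, while data lines start with a date, so their first two characters are the leading digits of the year:

* AL : Number of header lines, one per Atlantic storm from 1851-2018
* 18 : Number of entries in the 19th century (1851 - 1899)
* 19 : Number of entries in the 20th century (1900 - 1999)
* 20 : Number of entries in the 21st century (2000 - 2018)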
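
The same counting idea can be tried on a few lines before running it on the full file. The sketch below uses a made-up sample in the HURDAT2 layout; the storm IDs, the name `EXAMPLE` and all values are invented for illustration and are not real records.

```python
import io
from collections import Counter

# Made-up sample in the HURDAT2 layout (illustrative values, not real records)
sample_raw = (
    "AL991899,            EXAMPLE,      2,\n"
    "18990801, 0000,  , TS, 20.0N,  60.0W,  40, -999,\n"
    "18990801, 0600,  , TS, 20.5N,  61.0W,  45, -999,\n"
    "AL992001,            EXAMPLE,      1,\n"
    "20010915, 1200,  , HU, 25.0N,  70.0W,  70,  985,\n"
)

# Header lines start with the basin code "AL"; data lines start with the
# first two digits of the year in their date field.
sample_counts = Counter(line[:2] for line in io.StringIO(sample_raw))
print(sample_counts)
```

Here the `AL` count equals the number of storms in the sample, and the remaining keys split the data lines by the leading digits of their year.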


### Storing The Cyclone in a Dictionary

Let us now create a list of dictionaries, one per cyclone, each holding the cyclone's header line and its data lines.

In [None]:
import io

# Build a list with one dictionary per cyclone: its header line and its data lines
atlantic_storms_r = []
atlantic_storm_r = None

for line in io.StringIO(atlantic_raw):
    if line[:2] == 'AL':
        # A header line starts a new cyclone; store the previous one first
        if atlantic_storm_r is not None:
            atlantic_storms_r.append(atlantic_storm_r)
        atlantic_storm_r = {'header': line, 'data': []}
    elif atlantic_storm_r is not None and line.strip():
        atlantic_storm_r['data'].append(line)
# Store the last cyclone, which has no header line following it
if atlantic_storm_r is not None:
    atlantic_storms_r.append(atlantic_storm_r)
# Number of Atlantic cyclones
len(atlantic_storms_r)
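
A quick way to check the grouping is to compare the number of entries announced in each header (its third field) with the number of data lines actually collected. The sketch below is one possible check, not part of the original workflow; `mismatched_storms` and `parsed_storms` are hypothetical names, and the sample records are invented.

```python
# Hypothetical parsed output in the same shape as atlantic_storms_r
parsed_storms = [
    {"header": "AL991899,            EXAMPLE,      2,\n",
     "data": ["18990801, 0000,  , TS, 20.0N,  60.0W,  40, -999,\n",
              "18990801, 0600,  , TS, 20.5N,  61.0W,  45, -999,\n"]},
    {"header": "AL992001,            EXAMPLE,      1,\n",
     "data": ["20010915, 1200,  , HU, 25.0N,  70.0W,  70,  985,\n"]},
]

def mismatched_storms(storms):
    """Return the IDs of storms whose header count differs from their number of data lines."""
    bad = []
    for storm in storms:
        storm_id, _name, n_entries = storm["header"].split(",")[:3]
        # int() ignores the padding spaces around the number
        if int(n_entries) != len(storm["data"]):
            bad.append(storm_id.strip())
    return bad

print(mismatched_storms(parsed_storms))  # [] when every storm is complete
```

Calling `mismatched_storms(atlantic_storms_r)` on the real list should return an empty list if every cyclone was stored with all of its entries.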

### Converting the Dictionary to a Dataframe

In [None]:
# Convert each cyclone dictionary to a pandas Dataframe, which will be easier to work with later

import pandas as pd

atlantic_storm_dfs = []
for storm_dict in atlantic_storms_r:
    # The header holds the storm ID, the storm name and the number of entries
    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(",")[:3]
    # Remove the trailing newline ('\n'), split the fields and strip the padding spaces
    data = [[entry.strip() for entry in datum.rstrip("\n").split(",")] for datum in storm_dict['data']]
    frame = pd.DataFrame(data)
    frame['id'] = storm_id
    frame['name'] = storm_name
    atlantic_storm_dfs.append(frame)

# Let's print the first cyclone's data to see how it looks.
atlantic_storm_dfs[0]
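
To see what the splitting step does to a single entry, here is a minimal sketch on one made-up data line (the values are invented). It assumes, as the extra `na` column removed later suggests, that each line ends with a trailing comma.

```python
# Made-up data line in the HURDAT2 layout (illustrative values only)
data_line = "18990801, 0000,  , TS, 20.0N,  60.0W,  40, -999,\n"

# Drop the newline, split on commas and strip the padding spaces of each field
fields = [entry.strip() for entry in data_line.rstrip("\n").split(",")]
print(fields)
```

The blank record identifier becomes an empty string, and the trailing comma produces one more empty string at the end of the list, which is why the Dataframe ends up with one extra, empty column.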

In [None]:
# Concatenate the data of all cyclones into one Dataframe
atlantic_storms = pd.concat(atlantic_storm_dfs)
len(atlantic_storms)

### Restructuring the Columns and making it readable

In [None]:
# Move 'id' and 'name' (the last two columns) to the front, keeping the order of the rest
columns = atlantic_storms.columns
atlantic_storms = atlantic_storms.reindex(columns=list(columns[-2:]) + list(columns[:-2]))
# Print the first 5 rows
atlantic_storms.head()
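
The reordering can be checked on a toy frame. The sketch below builds the new column order as a plain Python list, which keeps the labels in exactly the order given; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: integer-labelled data columns followed by 'id' and 'name',
# mimicking the layout produced by the conversion step
toy = pd.DataFrame([["18990801", "0000", "AL991899", "EXAMPLE"]],
                   columns=[0, 1, "id", "name"])

# Build the new order explicitly: the last two labels first, then the rest
new_order = list(toy.columns[-2:]) + list(toy.columns[:-2])
toy = toy[new_order]
print(list(toy.columns))  # ['id', 'name', 0, 1]
```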

In [None]:
#Display the Columns of the Dataframe
atlantic_storms.columns

In [None]:
# Make the Dataframe's Columns Readable 
atlantic_storms.columns = [
 "id",
 "name",
 "date",
 "hours_minutes",
 "record_identifier",
 "status_of_system",
 "latitude",
 "longitude",
 "maximum_sustained_wind_knots",
 "minimum_pressure",
 "34_kt_ne",
 "34_kt_se",
 "34_kt_sw",
 "34_kt_nw",
 "50_kt_ne",
 "50_kt_se",
 "50_kt_sw",
 "50_kt_nw",
 "64_kt_ne",
 "64_kt_se",
 "64_kt_sw",
 "64_kt_nw",
 "na"
]
del atlantic_storms['na']
pd.set_option("display.max_columns", None)

In [None]:
# Let's have a look at our Data frame : 
atlantic_storms.head()

### Replacing Sentinel Values and Removing Empty Strings

Now that we have completed most of the parsing, let us do some final fixes by changing the sentinel values, which are '-999', to NaN (Not a Number).

In [None]:
# Check a wind-radius value of the first row before replacing the sentinels (-999)
print(atlantic_storms.iloc[0]['34_kt_sw'])

# We use NumPy (Numerical Python) to replace the sentinels with NaN.
import numpy as np
atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)
# Check the same value again after the replacement
atlantic_storms.iloc[0]['34_kt_sw']

In [None]:
# Checking Data types of Columns 
atlantic_storms.dtypes

In [None]:
atlantic_storms['record_identifier'].value_counts()

Now let us also replace all the empty strings with NaN.

In [None]:
# Replacing All Empty String with nan values
atlantic_storms = atlantic_storms.replace(to_replace="", value=np.nan)
atlantic_storms['record_identifier'].value_counts(dropna=False)
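
Both replacements can be tried together on a toy frame. This is a minimal sketch with made-up values; it relies on `DataFrame.replace` accepting a list of values to replace, so the sentinel and the empty string are mapped to NaN in one call.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: empty identifiers and '-999' sentinels
toy = pd.DataFrame({
    "record_identifier": ["", "L", ""],
    "34_kt_ne": ["-999", "30", "-999"],
})

# Map both placeholders to NaN in a single call
toy = toy.replace(to_replace=["-999", ""], value=np.nan)

# dropna=False makes value_counts() report the NaN entries as well
print(toy["record_identifier"].value_counts(dropna=False))
print(toy.isna().sum())
```

Without `dropna=False`, `value_counts()` silently skips the NaN entries, which makes it look as if the empty identifiers had disappeared.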

In [None]:
#Let us have a look at the Data frame now
atlantic_storms.head()

### Removing Unwanted Spaces and Reindexing the Data frame

In [None]:
# Final fixes

# Strip unwanted spaces from the names
atlantic_storms['name'] = atlantic_storms['name'].str.strip()

# Reindex the rows from 0 to n-1
atlantic_storms = atlantic_storms.reset_index(drop=True)
atlantic_storms.index.name = "index"

### Saving this Dataframe to a CSV File

Let us now save the Dataframe to a CSV file, which we will use for annotating the data.

In [None]:
atlantic_storms.tail()

In [None]:
atlantic_storms.to_csv("atlantic.csv")
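
When the CSV is read back later, pandas infers column types again, so text-like fields such as the date may come back as integers unless a type is requested. The sketch below shows one way to check a round trip on a toy frame; it writes to an in-memory buffer instead of a file, and the frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Toy frame with made-up values and the same named index as above
toy = pd.DataFrame({"id": ["AL991899"], "name": ["EXAMPLE"], "date": ["18990801"]})
toy.index.name = "index"

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the named index and keep the date as text instead of an integer
restored = pd.read_csv(buffer, index_col="index", dtype={"date": str})
print(restored.dtypes)
```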

## License

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).