{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load in Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Pandas is used for data manipulation\n",
    "import pandas as pd\n",
    "\n",
    "# Read in data as a dataframe\n",
    "features = pd.read_csv('data/temps_extended.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# One Hot Encoding\n",
    "features = pd.get_dummies(features)\n",
    "\n",
    "# Extract features and labels\n",
    "labels = features['actual']\n",
    "features = features.drop('actual', axis = 1)\n",
    "\n",
    "# List of features for later use\n",
    "feature_list = list(features.columns)\n",
    "\n",
    "# Convert to numpy arrays\n",
    "import numpy as np\n",
    "\n",
    "features = np.array(features)\n",
    "labels = np.array(labels)\n",
    "\n",
    "# Training and Testing Sets\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "train_features, test_features, train_labels, test_labels = train_test_split(features, labels, \n",
    "                                                                            test_size = 0.25, random_state = 42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training Features Shape: (1643, 17)\n",
      "Training Labels Shape: (1643,)\n",
      "Testing Features Shape: (548, 17)\n",
      "Testing Labels Shape: (548,)\n"
     ]
    }
   ],
   "source": [
    "print('Training Features Shape:', train_features.shape)\n",
    "print('Training Labels Shape:', train_labels.shape)\n",
    "print('Testing Features Shape:', test_features.shape)\n",
    "print('Testing Labels Shape:', test_labels.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.5 years of data in the training set\n",
      "1.5 years of data in the test set\n"
     ]
    }
   ],
   "source": [
    "print('{:0.1f} years of data in the training set'.format(train_features.shape[0] / 365.))\n",
    "print('{:0.1f} years of data in the test set'.format(test_features.shape[0] / 365.))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Restrict to the Most Important Features\n",
    "\n",
    "These were the six features required to reach a total feature importance of 95% in the first improving random forest notebook.\n",
    "We will use only these features in order to speed up the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Important train features shape: (1643, 6)\n",
      "Important test features shape: (548, 6)\n"
     ]
    }
   ],
   "source": [
    "# Names of five importances accounting for 95% of total importance\n",
    "important_feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']\n",
    "\n",
    "# Find the columns of the most important features\n",
    "important_indices = [feature_list.index(feature) for feature in important_feature_names]\n",
    "\n",
    "# Create training and testing sets with only the important features\n",
    "important_train_features = train_features[:, important_indices]\n",
    "important_test_features = test_features[:, important_indices]\n",
    "\n",
    "# Sanity check on operations\n",
    "print('Important train features shape:', important_train_features.shape)\n",
    "print('Important test features shape:', important_test_features.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Use only the most important features\n",
    "train_features = important_train_features[:]\n",
    "test_features = important_test_features[:]\n",
    "\n",
    "# Update feature list for visualizations\n",
    "feature_list = important_feature_names[:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examine the Default Random Forest to Determine Parameters\n",
    "\n",
    "We will use these parameters as a starting point. I relied on the [sklearn random forest documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to determine which features to change and the available options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameters currently in use:\n",
      "\n",
      "{'bootstrap': True,\n",
      " 'criterion': 'mse',\n",
      " 'max_depth': None,\n",
      " 'max_features': 'auto',\n",
      " 'max_leaf_nodes': None,\n",
      " 'min_impurity_decrease': 0.0,\n",
      " 'min_impurity_split': None,\n",
      " 'min_samples_leaf': 1,\n",
      " 'min_samples_split': 2,\n",
      " 'min_weight_fraction_leaf': 0.0,\n",
      " 'n_estimators': 10,\n",
      " 'n_jobs': 1,\n",
      " 'oob_score': False,\n",
      " 'random_state': 42,\n",
      " 'verbose': 0,\n",
      " 'warm_start': False}\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "rf = RandomForestRegressor(random_state = 42)\n",
    "\n",
    "from pprint import pprint\n",
    "\n",
    "# Look at parameters used by our current forest\n",
    "print('Parameters currently in use:\\n')\n",
    "pprint(rf.get_params())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random Search with Cross Validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "\n",
    "# Number of trees in random forest\n",
    "n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]\n",
    "# Number of features to consider at every split\n",
    "max_features = ['auto', 'sqrt']\n",
    "# Maximum number of levels in tree\n",
    "max_depth = [int(x) for x in np.linspace(10, 100, num = 10)]\n",
    "max_depth.append(None)\n",
    "# Minimum number of samples required to split a node\n",
    "min_samples_split = [2, 5, 10]\n",
    "# Minimum number of samples required at each leaf node\n",
    "min_samples_leaf = [1, 2, 4]\n",
    "# Method of selecting samples for training each tree\n",
    "bootstrap = [True, False]\n",
    "\n",
    "# Create the random grid\n",
    "random_grid = {'n_estimators': n_estimators,\n",
    "               'max_features': max_features,\n",
    "               'max_depth': max_depth,\n",
    "               'min_samples_split': min_samples_split,\n",
    "               'min_samples_leaf': min_samples_leaf,\n",
    "               'bootstrap': bootstrap}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Use the random grid to search for best hyperparameters\n",
    "# First create the base model to tune\n",
    "rf = RandomForestRegressor()\n",
    "# Random search of parameters, using 3 fold cross validation, \n",
    "# search across 100 different combinations, and use all available cores\n",
    "rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,\n",
    "                              n_iter = 100, scoring='neg_mean_absolute_error', \n",
    "                              cv = 3, verbose=2, random_state=42, n_jobs=-1)\n",
    "\n",
    "# Fit the random search model\n",
    "rf_random.fit(train_features, train_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rf_random.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def evaluate(model, test_features, test_labels):\n",
    "    predictions = model.predict(test_features)\n",
    "    errors = abs(predictions - test_labels)\n",
    "    mape = 100 * np.mean(errors / test_labels)\n",
    "    accuracy = 100 - mape\n",
    "    print('Model Performance')\n",
    "    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))\n",
    "    print('Accuracy = {:0.2f}%.'.format(accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate the Default Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Performance\n",
      "Average Error: 3.8210 degrees.\n",
      "Accuracy = 93.56%.\n"
     ]
    }
   ],
   "source": [
    "base_model = RandomForestRegressor(n_estimators = 1000, random_state = 42)\n",
    "base_model.fit(train_features, train_labels)\n",
    "evaluate(base_model, test_features, test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate the Best Random Search Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "best_random = rf_random.best_estimator_\n",
    "evaluate(best_random, test_features, test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Grid Search \n",
    "\n",
    "We can now perform grid search building on the result from the random search. \n",
    "We will test a range of hyperparameters around the best values returend by random search. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "# Create the parameter grid based on the results of random search \n",
    "param_grid = {\n",
    "    'bootstrap': [True],\n",
    "    'max_depth': [80, 90, 100, 110],\n",
    "    'max_features': [2, 3],\n",
    "    'min_samples_leaf': [3, 4, 5],\n",
    "    'min_samples_split': [8, 10, 12],\n",
    "    'n_estimators': [100, 200, 300, 1000]\n",
    "}\n",
    "\n",
    "# Create a based model\n",
    "rf = RandomForestRegressor()\n",
    "\n",
    "# Instantiate the grid search model\n",
    "grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, \n",
    "                           scoring = 'neg_mean_absolute_error', cv = 3, \n",
    "                           n_jobs = -1, verbose = 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 288 candidates, totalling 864 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=42)]: Done  78 tasks      | elapsed:   42.8s\n",
      "[Parallel(n_jobs=42)]: Done 281 tasks      | elapsed:  2.2min\n",
      "[Parallel(n_jobs=42)]: Done 564 tasks      | elapsed:  4.4min\n",
      "[Parallel(n_jobs=42)]: Done 864 out of 864 | elapsed:  6.5min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=3, error_score='raise',\n",
       "       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n",
       "           max_features='auto', max_leaf_nodes=None,\n",
       "           min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "           min_samples_leaf=1, min_samples_split=2,\n",
       "           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
       "           oob_score=False, random_state=None, verbose=0, warm_start=False),\n",
       "       fit_params=None, iid=True, n_jobs=42,\n",
       "       param_grid={'bootstrap': [True], 'max_depth': [80, 90, 100, 110], 'max_features': [2, 3], 'min_samples_leaf': [3, 4, 5], 'min_samples_split': [8, 10, 12], 'n_estimators': [100, 200, 300, 1000]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring='neg_mean_absolute_error', verbose=2)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fit the grid search to the data\n",
    "grid_search.fit(train_features, train_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'bootstrap': True,\n",
       " 'max_depth': 110,\n",
       " 'max_features': 3,\n",
       " 'min_samples_leaf': 5,\n",
       " 'min_samples_split': 10,\n",
       " 'n_estimators': 100}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Performance\n",
      "Average Error: 3.6662 degrees.\n",
      "Accuracy = 93.81%.\n"
     ]
    }
   ],
   "source": [
    "best_grid = grid_search.best_estimator_\n",
    "evaluate(best_grid, test_features, test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Another Round of Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 54 candidates, totalling 162 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    6.8s\n",
      "[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   25.3s\n",
      "[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:   26.5s finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=3, error_score='raise',\n",
       "       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n",
       "           max_features='auto', max_leaf_nodes=None,\n",
       "           min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "           min_samples_leaf=1, min_samples_split=2,\n",
       "           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
       "           oob_score=False, random_state=None, verbose=0, warm_start=False),\n",
       "       fit_params=None, iid=True, n_jobs=-1,\n",
       "       param_grid={'bootstrap': [True], 'max_depth': [110, 120, None], 'max_features': [3, 4], 'min_samples_leaf': [5, 6, 7], 'min_samples_split': [10], 'n_estimators': [75, 100, 125]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring='neg_mean_absolute_error', verbose=2)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "param_grid = {\n",
    "    'bootstrap': [True],\n",
    "    'max_depth': [110, 120, None],\n",
    "    'max_features': [3, 4],\n",
    "    'min_samples_leaf': [5, 6, 7],\n",
    "    'min_samples_split': [10],\n",
    "    'n_estimators': [75, 100, 125]\n",
    "}\n",
    "\n",
    "# Create a based model\n",
    "rf = RandomForestRegressor()\n",
    "\n",
    "# Instantiate the grid search model\n",
    "grid_search_ad = GridSearchCV(estimator = rf, param_grid = param_grid, \n",
    "                           scoring = 'neg_mean_absolute_error', cv = 3, \n",
    "                           n_jobs = -1, verbose = 2)\n",
    "\n",
    "grid_search_ad.fit(train_features, train_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'bootstrap': True,\n",
       " 'max_depth': 110,\n",
       " 'max_features': 4,\n",
       " 'min_samples_leaf': 7,\n",
       " 'min_samples_split': 10,\n",
       " 'n_estimators': 100}"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search_ad.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Performance\n",
      "Average Error: 3.6809 degrees.\n",
      "Accuracy = 93.79%.\n"
     ]
    }
   ],
   "source": [
    "best_grid_ad = grid_search_ad.best_estimator_\n",
    "evaluate(best_grid_ad, test_features, test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time our performance slightly decreased. Therefore, we will go back to the best model returned by the first grid search."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final Model\n",
    "\n",
    "The final model from hyperparameter tuning is as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model Parameters:\n",
      "\n",
      "{'bootstrap': True,\n",
      " 'criterion': 'mse',\n",
      " 'max_depth': 110,\n",
      " 'max_features': 3,\n",
      " 'max_leaf_nodes': None,\n",
      " 'min_impurity_decrease': 0.0,\n",
      " 'min_impurity_split': None,\n",
      " 'min_samples_leaf': 5,\n",
      " 'min_samples_split': 10,\n",
      " 'min_weight_fraction_leaf': 0.0,\n",
      " 'n_estimators': 100,\n",
      " 'n_jobs': 1,\n",
      " 'oob_score': False,\n",
      " 'random_state': None,\n",
      " 'verbose': 0,\n",
      " 'warm_start': False}\n",
      "\n",
      "\n",
      "Model Performance\n",
      "Average Error: 3.6662 degrees.\n",
      "Accuracy = 93.81%.\n"
     ]
    }
   ],
   "source": [
    "print('Model Parameters:\\n')\n",
    "pprint(best_grid.get_params())\n",
    "print('\\n')\n",
    "evaluate(best_grid, test_features, test_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}