{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Load in Data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Pandas is used for data manipulation\n", "import pandas as pd\n", "\n", "# Read in data as a dataframe\n", "features = pd.read_csv('data/temps_extended.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# One Hot Encoding\n", "features = pd.get_dummies(features)\n", "\n", "# Extract features and labels\n", "labels = features['actual']\n", "features = features.drop('actual', axis = 1)\n", "\n", "# List of features for later use\n", "feature_list = list(features.columns)\n", "\n", "# Convert to numpy arrays\n", "import numpy as np\n", "\n", "features = np.array(features)\n", "labels = np.array(labels)\n", "\n", "# Training and Testing Sets\n", "from sklearn.model_selection import train_test_split\n", "\n", "train_features, test_features, train_labels, test_labels = train_test_split(features, labels, \n", " test_size = 0.25, random_state = 42)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training Features Shape: (1643, 17)\n", "Training Labels Shape: (1643,)\n", "Testing Features Shape: (548, 17)\n", "Testing Labels Shape: (548,)\n" ] } ], "source": [ "print('Training Features Shape:', train_features.shape)\n", "print('Training Labels Shape:', train_labels.shape)\n", "print('Testing Features Shape:', test_features.shape)\n", "print('Testing Labels Shape:', test_labels.shape)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.5 years of data in the training set\n", "1.5 years of data in the test set\n" ] } ], 
"source": [ "print('{:0.1f} years of data in the training set'.format(train_features.shape[0] / 365.))\n", "print('{:0.1f} years of data in the test set'.format(test_features.shape[0] / 365.))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Restrict to the Most Important Features\n", "\n", "These were the six features required to reach a total feature importance of 95% in the first improving random forest notebook.\n", "We will use only these features in order to speed up the model." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Important train features shape: (1643, 6)\n", "Important test features shape: (548, 6)\n" ] } ], "source": [ "# Names of five importances accounting for 95% of total importance\n", "important_feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']\n", "\n", "# Find the columns of the most important features\n", "important_indices = [feature_list.index(feature) for feature in important_feature_names]\n", "\n", "# Create training and testing sets with only the important features\n", "important_train_features = train_features[:, important_indices]\n", "important_test_features = test_features[:, important_indices]\n", "\n", "# Sanity check on operations\n", "print('Important train features shape:', important_train_features.shape)\n", "print('Important test features shape:', important_test_features.shape)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Use only the most important features\n", "train_features = important_train_features[:]\n", "test_features = important_test_features[:]\n", "\n", "# Update feature list for visualizations\n", "feature_list = important_feature_names[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examine the Default Random Forest to Determine Parameters\n", "\n", "We will use these 
parameters as a starting point. I relied on the [sklearn random forest documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) to determine which hyperparameters to change and the available options." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameters currently in use:\n", "\n", "{'bootstrap': True,\n", " 'criterion': 'mse',\n", " 'max_depth': None,\n", " 'max_features': 'auto',\n", " 'max_leaf_nodes': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_impurity_split': None,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 10,\n", " 'n_jobs': 1,\n", " 'oob_score': False,\n", " 'random_state': 42,\n", " 'verbose': 0,\n", " 'warm_start': False}\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "rf = RandomForestRegressor(random_state = 42)\n", "\n", "from pprint import pprint\n", "\n", "# Look at parameters used by our current forest\n", "print('Parameters currently in use:\\n')\n", "pprint(rf.get_params())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Search with Cross Validation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Number of trees in random forest\n", "n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]\n", "# Number of features to consider at every split\n", "max_features = ['auto', 'sqrt']\n", "# Maximum number of levels in tree\n", "max_depth = [int(x) for x in np.linspace(10, 100, num = 10)]\n", "max_depth.append(None)\n", "# Minimum number of samples required to split a node\n", "min_samples_split = [2, 5, 10]\n", "# Minimum number of samples required at each leaf node\n", "min_samples_leaf = [1, 
2, 4]\n", "# Method of selecting samples for training each tree\n", "bootstrap = [True, False]\n", "\n", "# Create the random grid\n", "random_grid = {'n_estimators': n_estimators,\n", " 'max_features': max_features,\n", " 'max_depth': max_depth,\n", " 'min_samples_split': min_samples_split,\n", " 'min_samples_leaf': min_samples_leaf,\n", " 'bootstrap': bootstrap}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Use the random grid to search for best hyperparameters\n", "# First create the base model to tune\n", "rf = RandomForestRegressor()\n", "# Random search of parameters, using 3 fold cross validation, \n", "# search across 100 different combinations, and use all available cores\n", "rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,\n", " n_iter = 100, scoring='neg_mean_absolute_error', \n", " cv = 3, verbose=2, random_state=42, n_jobs=-1)\n", "\n", "# Fit the random search model\n", "rf_random.fit(train_features, train_labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rf_random.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation Function" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate(model, test_features, test_labels):\n", " predictions = model.predict(test_features)\n", " errors = abs(predictions - test_labels)\n", " mape = 100 * np.mean(errors / test_labels)\n", " accuracy = 100 - mape\n", " print('Model Performance')\n", " print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))\n", " print('Accuracy = {:0.2f}%.'.format(accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluate the Default Model" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Model Performance\n", "Average Error: 3.8210 degrees.\n", "Accuracy = 93.56%.\n" ] } ], "source": [ "base_model = RandomForestRegressor(n_estimators = 1000, random_state = 42)\n", "base_model.fit(train_features, train_labels)\n", "evaluate(base_model, test_features, test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluate the Best Random Search Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "best_random = rf_random.best_estimator_\n", "evaluate(best_random, test_features, test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Grid Search \n", "\n", "We can now perform grid search building on the result from the random search. \n", "We will test a range of hyperparameters around the best values returend by random search. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Create the parameter grid based on the results of random search \n", "param_grid = {\n", " 'bootstrap': [True],\n", " 'max_depth': [80, 90, 100, 110],\n", " 'max_features': [2, 3],\n", " 'min_samples_leaf': [3, 4, 5],\n", " 'min_samples_split': [8, 10, 12],\n", " 'n_estimators': [100, 200, 300, 1000]\n", "}\n", "\n", "# Create a based model\n", "rf = RandomForestRegressor()\n", "\n", "# Instantiate the grid search model\n", "grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, \n", " scoring = 'neg_mean_absolute_error', cv = 3, \n", " n_jobs = -1, verbose = 2)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 288 candidates, totalling 864 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=42)]: Done 78 tasks | elapsed: 42.8s\n", 
"[Parallel(n_jobs=42)]: Done 281 tasks | elapsed: 2.2min\n", "[Parallel(n_jobs=42)]: Done 564 tasks | elapsed: 4.4min\n", "[Parallel(n_jobs=42)]: Done 864 out of 864 | elapsed: 6.5min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=3, error_score='raise',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n", " max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0, warm_start=False),\n", " fit_params=None, iid=True, n_jobs=42,\n", " param_grid={'bootstrap': [True], 'max_depth': [80, 90, 100, 110], 'max_features': [2, 3], 'min_samples_leaf': [3, 4, 5], 'min_samples_split': [8, 10, 12], 'n_estimators': [100, 200, 300, 1000]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n", " scoring='neg_mean_absolute_error', verbose=2)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the grid search to the data\n", "grid_search.fit(train_features, train_labels)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'max_depth': 110,\n", " 'max_features': 3,\n", " 'min_samples_leaf': 5,\n", " 'min_samples_split': 10,\n", " 'n_estimators': 100}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_params_" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Performance\n", "Average Error: 3.6662 degrees.\n", "Accuracy = 93.81%.\n" ] } ], "source": [ "best_grid = grid_search.best_estimator_\n", "evaluate(best_grid, test_features, test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## Another Round of Grid Search" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 54 candidates, totalling 162 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 6.8s\n", "[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 25.3s\n", "[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed: 26.5s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=3, error_score='raise',\n", " estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n", " max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0, warm_start=False),\n", " fit_params=None, iid=True, n_jobs=-1,\n", " param_grid={'bootstrap': [True], 'max_depth': [110, 120, None], 'max_features': [3, 4], 'min_samples_leaf': [5, 6, 7], 'min_samples_split': [10], 'n_estimators': [75, 100, 125]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n", " scoring='neg_mean_absolute_error', verbose=2)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "param_grid = {\n", " 'bootstrap': [True],\n", " 'max_depth': [110, 120, None],\n", " 'max_features': [3, 4],\n", " 'min_samples_leaf': [5, 6, 7],\n", " 'min_samples_split': [10],\n", " 'n_estimators': [75, 100, 125]\n", "}\n", "\n", "# Create a based model\n", "rf = RandomForestRegressor()\n", "\n", "# Instantiate the grid search model\n", "grid_search_ad = GridSearchCV(estimator = rf, param_grid = param_grid, \n", " scoring = 'neg_mean_absolute_error', cv = 3, \n", " n_jobs = -1, verbose = 2)\n", "\n", "grid_search_ad.fit(train_features, train_labels)" ] }, { "cell_type": "code", 
"execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'max_depth': 110,\n", " 'max_features': 4,\n", " 'min_samples_leaf': 7,\n", " 'min_samples_split': 10,\n", " 'n_estimators': 100}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search_ad.best_params_" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Performance\n", "Average Error: 3.6809 degrees.\n", "Accuracy = 93.79%.\n" ] } ], "source": [ "best_grid_ad = grid_search_ad.best_estimator_\n", "evaluate(best_grid_ad, test_features, test_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time our performance slightly decreased. Therefore, we will go back to the best model returned by the first grid search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Model\n", "\n", "The final model from hyperparameter tuning is as follows." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Parameters:\n", "\n", "{'bootstrap': True,\n", " 'criterion': 'mse',\n", " 'max_depth': 110,\n", " 'max_features': 3,\n", " 'max_leaf_nodes': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_impurity_split': None,\n", " 'min_samples_leaf': 5,\n", " 'min_samples_split': 10,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 100,\n", " 'n_jobs': 1,\n", " 'oob_score': False,\n", " 'random_state': None,\n", " 'verbose': 0,\n", " 'warm_start': False}\n", "\n", "\n", "Model Performance\n", "Average Error: 3.6662 degrees.\n", "Accuracy = 93.81%.\n" ] } ], "source": [ "print('Model Parameters:\\n')\n", "pprint(best_grid.get_params())\n", "print('\\n')\n", "evaluate(best_grid, test_features, test_labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }