{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction: Testing Cyclical Encoding of Features for Machine Learning\n", "\n", "Cyclical encoding of time-series features is typically done with time/date features that repeat periodically such as the hour of the day or day of the year. However, it is not a foregone conclusion this technique actually improves the performance of complex machine learning algorithms, such as a random forest. In this notebook, we will evaluate the efficacy of cyclical encoding of time-series features for predicting building energy data with two different models. The hypothesis is that cyclical encoding will improve the performance of the simple model, Linear Regression, but will not improve the performance of the model with higher capacity, the Random Forest Regression.\n", "\n", "## Methods\n", "\n", "__Feature Sets__\n", "1. Standard = `[\"time_of_day\", \"day_of_year\", \"temperature\"]`\n", "2. Cyclical = `[\"sin_time_of_day\", \"cos_time_of_day\", \"sin_day_of_year\", \"cos_day_of_year\", \"temperature\"]`\n", "\n", "(We include temperature in both feature sets because when modeling time-series problems, we almost always have other features besides the date and time. Results may vary with the addition of features).\n", "\n", "__Models__\n", "1. Linear Regression = `LinearRegression(n_jobs=-1)`\n", "2. Random Forest Regression = `RandomForestRegressor(n_estimators=25, max_depth=15, n_jobs=-1, random_state=100)`\n", "\n", "(The hyperparameters for the random forest have proved to work well for use cases in our data science systems. Again, performance may vary with hyperparameter values).\n", "\n", "We'll use a relatively simple test set-up. After creating the standard and cyclically encoded sets of features, we'll evaluate each set with the two models. Validation will consist of testing on a week of data at a time with all previous data used for training. We go through the dataset one week at a time, training on the historical data, and predicting for that selected week. This mimics what we actually do evaluating models for forecasting energy use at [Cortex Building Intel](https://cortexintel.com). The weekly validation is representative of a real-world situation, in which we'll have past training data and want to make a prediction of future energy consumption. By validating on past data, we can get an estimate of how the model will perform in the future. \n", "\n", "For our use case, we are primarily concerned with the performance of each set of features. We'll use a single metric, __mean absolute percentage error__ to assess our predictions. The benefits of MAPE are: relative comparisons, interpretability, and simplicity. The downsides are that it cna be sensitive to outliers and it's asymmetric because lower true values will result in greater percentage error for the same absolute error. There are always trade-offs to make in machine learning! Primarily we want to assess the difference in MAPE between the standard features and the cyclical features. Defining the difference in errors as standard mape - cyclical mape, a positive value indicates the cyclical features outperform the standard features. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data \n", "\n", "We are using building energy data originally part of a DrivenData machine learning competition to forecast building energy use. You can download the original data from DrivenData here. 
I've cleaned up the building data and removed all features except for timestamp and temperature. The __objective__ is to predict the \"energy\" from the features." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 40 buildings to test.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import glob\n", "\n", "building_data_files = glob.glob('data/building*')\n", "print(f'There are {len(building_data_files)} buildings to test.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at one example dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergy
timestamp
2016-09-18 04:00:0056.2403001.682686
2016-09-18 04:15:0056.0875012.086212
2016-09-18 04:30:0056.2132321.687880
2016-09-18 04:45:0056.4000491.926518
2016-09-18 05:00:0056.5924971.922459
\n", "
" ], "text/plain": [ " temperature energy\n", "timestamp \n", "2016-09-18 04:00:00 56.240300 1.682686\n", "2016-09-18 04:15:00 56.087501 2.086212\n", "2016-09-18 04:30:00 56.213232 1.687880\n", "2016-09-18 04:45:00 56.400049 1.926518\n", "2016-09-18 05:00:00 56.592497 1.922459" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(building_data_files[10], parse_dates=['timestamp'], index_col=0).set_index('timestamp')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing too extraordinary here." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/temp-plot.html'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import cufflinks as cf\n", "from plotly.offline import plot\n", "\n", "fig = data.iplot(y='energy', secondary_y='temperature', title='Temperature and Energy', asFigure=True)\n", "plot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering\n", "\n", "To repeatedly create the same features, we'll make two scikit-learn transformers and then join them together in a pipeline. This is very simple code, but including it in a well-defined interface means we can use it over and over while confident we'll get the same operations applied to our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformers\n", "\n", "[Scikit-Learn transformers](http://scikit-learn.org/stable/data_transforms.html) give us a standard interface for applying operations to our data. Using transformers is efficient and allows you to build robust systems." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "class DateTimeFeatures(BaseEstimator, TransformerMixin):\n", " \"\"\"\n", " Extract day of year and time of day features from a timestamp\n", " \"\"\"\n", " def __init__(self):\n", " pass\n", "\n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # Timestamps must be in index\n", " field = X.index\n", " X[\"time_of_day\"] = field.hour + field.minute / 60\n", " X[\"day_of_year\"] = field.dayofyear\n", " return X\n", "\n", "\n", "class CyclicalDateTimeFeatures(BaseEstimator, TransformerMixin):\n", " \"\"\"\n", " Make cyclically encoded day of year and time of day features\n", " \"\"\"\n", " def __init__(self):\n", " pass\n", "\n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # Apply formula for sin and cosine\n", " X[\"sin_time_of_day\"], X[\"cos_time_of_day\"] = _cyclical_encoding(\n", " X[\"time_of_day\"], period=24\n", " )\n", " X[\"sin_day_of_year\"], X[\"cos_day_of_year\"] = _cyclical_encoding(\n", " X[\"day_of_year\"], period=366\n", " )\n", " return X\n", "\n", "def _cyclical_encoding(series, period):\n", " \"\"\"\n", " Cyclical encoding of a series with a specified period\n", " \"\"\"\n", " # Basic formula for sin/cosine equation\n", " base = 2 * np.pi * series / period\n", " return np.sin(base), np.cos(base)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "The pipeline is one of the most important features in sklearn. 
I highly recommend [reading up on the Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and using it throughout your data science projects." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "# Make a pipeline with the steps\n", "transforms = Pipeline(\n", " steps=[\n", " # Must create the date/time features before the cyclical encoding\n", " (\"date_time_features\", DateTimeFeatures()),\n", " (\"cyclical_date_time_features\", CyclicalDateTimeFeatures()),\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use our pipeline on any dataset." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergytime_of_dayday_of_yearsin_time_of_daycos_time_of_daysin_day_of_yearcos_day_of_year
timestamp
2016-09-18 04:00:0056.2403001.6826864.002620.8660250.500000-0.977064-0.212947
2016-09-18 04:15:0056.0875012.0862124.252620.8968730.442289-0.977064-0.212947
2016-09-18 04:30:0056.2132321.6878804.502620.9238800.382683-0.977064-0.212947
2016-09-18 04:45:0056.4000491.9265184.752620.9469300.321439-0.977064-0.212947
2016-09-18 05:00:0056.5924971.9224595.002620.9659260.258819-0.977064-0.212947
\n", "
" ], "text/plain": [ " temperature energy time_of_day day_of_year \\\n", "timestamp \n", "2016-09-18 04:00:00 56.240300 1.682686 4.00 262 \n", "2016-09-18 04:15:00 56.087501 2.086212 4.25 262 \n", "2016-09-18 04:30:00 56.213232 1.687880 4.50 262 \n", "2016-09-18 04:45:00 56.400049 1.926518 4.75 262 \n", "2016-09-18 05:00:00 56.592497 1.922459 5.00 262 \n", "\n", " sin_time_of_day cos_time_of_day sin_day_of_year \\\n", "timestamp \n", "2016-09-18 04:00:00 0.866025 0.500000 -0.977064 \n", "2016-09-18 04:15:00 0.896873 0.442289 -0.977064 \n", "2016-09-18 04:30:00 0.923880 0.382683 -0.977064 \n", "2016-09-18 04:45:00 0.946930 0.321439 -0.977064 \n", "2016-09-18 05:00:00 0.965926 0.258819 -0.977064 \n", "\n", " cos_day_of_year \n", "timestamp \n", "2016-09-18 04:00:00 -0.212947 \n", "2016-09-18 04:15:00 -0.212947 \n", "2016-09-18 04:30:00 -0.212947 \n", "2016-09-18 04:45:00 -0.212947 \n", "2016-09-18 05:00:00 -0.212947 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed_data = transforms.transform(data)\n", "transformed_data.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergytime_of_dayday_of_yearsin_time_of_daycos_time_of_daysin_day_of_yearcos_day_of_year
timestamp
2014-01-01 05:15:0036.7366740.4790605.2510.9807851.950903e-010.0171660.999853
2014-01-01 05:30:0036.6821870.4979915.5010.9914451.305262e-010.0171660.999853
2014-01-01 05:45:0036.6792550.4991075.7510.9978596.540313e-020.0171660.999853
2014-01-01 06:00:0036.6991380.4944276.0011.0000006.123234e-170.0171660.999853
2014-01-01 06:15:0036.5871240.5011656.2510.997859-6.540313e-020.0171660.999853
\n", "
" ], "text/plain": [ " temperature energy time_of_day day_of_year \\\n", "timestamp \n", "2014-01-01 05:15:00 36.736674 0.479060 5.25 1 \n", "2014-01-01 05:30:00 36.682187 0.497991 5.50 1 \n", "2014-01-01 05:45:00 36.679255 0.499107 5.75 1 \n", "2014-01-01 06:00:00 36.699138 0.494427 6.00 1 \n", "2014-01-01 06:15:00 36.587124 0.501165 6.25 1 \n", "\n", " sin_time_of_day cos_time_of_day sin_day_of_year \\\n", "timestamp \n", "2014-01-01 05:15:00 0.980785 1.950903e-01 0.017166 \n", "2014-01-01 05:30:00 0.991445 1.305262e-01 0.017166 \n", "2014-01-01 05:45:00 0.997859 6.540313e-02 0.017166 \n", "2014-01-01 06:00:00 1.000000 6.123234e-17 0.017166 \n", "2014-01-01 06:15:00 0.997859 -6.540313e-02 0.017166 \n", "\n", " cos_day_of_year \n", "timestamp \n", "2014-01-01 05:15:00 0.999853 \n", "2014-01-01 05:30:00 0.999853 \n", "2014-01-01 05:45:00 0.999853 \n", "2014-01-01 06:00:00 0.999853 \n", "2014-01-01 06:15:00 0.999853 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(building_data_files[20], parse_dates=['timestamp'], index_col=0).set_index('timestamp')\n", "transformed_data = transforms.transform(data)\n", "transformed_data.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/cylical_encoding_day_of_year.html'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig = data.iplot(y=['sin_day_of_year', 'cos_day_of_year'], title=\"Cyclically Encoded Day of Year\", asFigure=True)\n", "plot(fig, filename='cylical_encoding_day_of_year.html')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/temp-plot.html'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig = data.loc['2014-01-01': '2014-01-08'].iplot(y=['sin_time_of_day', 'cos_time_of_day'], title=\"Cyclically Encoded Time of Day\", asFigure=True)\n", "plot(fig, 'cyclical_encoding_time_of_day.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling\n", "\n", "The modeling itself is relatively simple thanks to Scikit-Learn. We'll make a function that takes a set of data iterates through the sets of features (2) and the models (2). On each iteration, we apply the testing framework by stepping through the dataset one week at a time. We train on the all previous data and then test on the selected week recording the predictions along with the actual value.\n", "\n", "This means that for each dataset, we are running 4 weekly validation testing sessions (2 x 2). Feel free to change the models or test using different features. You can also change the validation structure although we've found this to be an accurate indication of how well we can expect the model to perform on real data (forecasting for the future). \n", "\n", "If you were using this system in production, the most important aspect would be [writing unit tests](https://realpython.com/python-testing/)! These [should cover](https://docs.python-guide.org/writing/tests/) both the pipeline and the modeling. You are going to make mistakes in your code, that is guaranteed. Unit tests can help you find these mistakes before they cause unexpected errors in a machine learning system." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def run_weekly_validation(\n", " models,\n", " data,\n", " feature_sets=[\n", " [\"time_of_day\", \"day_of_year\", \"temperature\"],\n", " [\n", " \"sin_time_of_day\",\n", " \"cos_time_of_day\",\n", " \"sin_day_of_year\",\n", " \"cos_day_of_year\",\n", " \"temperature\",\n", " ],\n", " ],\n", "):\n", " \"\"\"\n", " Run the weekly validation modeling with models and feature sets on data\n", " \n", " :param models: list of sklearn models\n", " :param data: building energy dataframe\n", " :param feature_sets: list of lists of features\n", " \n", " :return results: dataframe of prediction results from each combination of model and feature set\n", " \"\"\"\n", "\n", " all_predictions = []\n", "\n", " # Iterate through feature sets\n", " for feature_set in feature_sets:\n", " features = \"standard\" if \"sin_time_of_day\" not in feature_set else \"cyclical\"\n", " \n", " # Subset to features in data\n", " X = data[feature_set + [\"energy\"]].copy()\n", "\n", " # Iterate through models\n", " for model in models:\n", " \n", " model_name = model.__class__.__name__\n", " \n", " # Iterate through weeks in the dataset\n", " # Must group by string formatted week and year\n", " for (week, year), X_test in X.groupby(\n", " # [week, year] grouping\n", " [X.index.strftime(\"%U\"), X.index.strftime(\"%Y\")]\n", " ):\n", " # Subset to training data; all data before test set\n", " X_train = X[X.index < X_test.index.min()].copy()\n", "\n", " # Can not train or test on zero observations\n", " if len(X_train) == 0 or len(X_test) == 0:\n", " continue\n", "\n", " # Targets\n", " y_train = X_train.pop(\"energy\")\n", " y_test = X_test.pop(\"energy\")\n", "\n", " # Fit and predict\n", " model.fit(X_train, y_train)\n", " predictions = model.predict(X_test)\n", "\n", " # Record predictions along with actual values, model, and feature set in a dataframe\n", " predictions = pd.DataFrame(\n", " dict(\n", " predicted=predictions,\n", " actual=y_test,\n", " model=model_name,\n", " features=features,\n", " ),\n", " # Index is same as for testing data\n", " index=X_test.index,\n", " )\n", " all_predictions.append(predictions)\n", "\n", " # Return dataframe of dataframes ordered by timestamp\n", " return (\n", " pd.concat(all_predictions)\n", " .reset_index()\n", " .sort_values([\"model\", \"features\", \"timestamp\"])\n", " .set_index(\"timestamp\")\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll run one example first to make sure the results are as expected (in a real system, you'd want to have unit tests for this function and the pipeline.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example Validation" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/williamkoehrsen/.local/share/virtualenvs/insight-engine-JU9-2WuC/lib/python3.7/site-packages/sklearn/linear_model/base.py:485: RuntimeWarning:\n", "\n", "internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). 
Falling back to 'gelss' driver.\n", "\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Create linear model and random forest model for regression\n", "models = [\n", " LinearRegression(n_jobs=-1),\n", " RandomForestRegressor(n_estimators=25, max_depth=15, n_jobs=-1, random_state=100),\n", "]\n", "\n", "validation = run_weekly_validation(models, data)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictedactualmodelfeatures
timestamp
2014-01-05 00:00:000.4242050.398189LinearRegressioncyclical
2014-01-05 00:15:000.4136240.397093LinearRegressioncyclical
2014-01-05 00:30:000.4080620.396569LinearRegressioncyclical
2014-01-05 00:45:000.4040330.400340LinearRegressioncyclical
2014-01-05 01:00:000.4006670.390858LinearRegressioncyclical
\n", "
" ], "text/plain": [ " predicted actual model features\n", "timestamp \n", "2014-01-05 00:00:00 0.424205 0.398189 LinearRegression cyclical\n", "2014-01-05 00:15:00 0.413624 0.397093 LinearRegression cyclical\n", "2014-01-05 00:30:00 0.408062 0.396569 LinearRegression cyclical\n", "2014-01-05 00:45:00 0.404033 0.400340 LinearRegression cyclical\n", "2014-01-05 01:00:00 0.400667 0.390858 LinearRegression cyclical" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validation.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictedactual
count388268.000000388268.000000
mean0.2435560.221177
std0.1270250.141697
min-0.4124010.014225
25%0.1574190.109462
50%0.2198840.175026
75%0.3019350.306944
max1.3890550.958976
\n", "
" ], "text/plain": [ " predicted actual\n", "count 388268.000000 388268.000000\n", "mean 0.243556 0.221177\n", "std 0.127025 0.141697\n", "min -0.412401 0.014225\n", "25% 0.157419 0.109462\n", "50% 0.219884 0.175026\n", "75% 0.301935 0.306944\n", "max 1.389055 0.958976" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validation.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validation for all Buildings\n", "\n", "Finally, we can run the validation for all buildings. This may take a while and I've already included the results in the `validation_results` directory." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook\n", "\n", "def run_all_buildings(building_data_files):\n", " \"\"\"\n", " Run weekly validation for all buildings\n", " \n", " :param building_data_files: list of filenames with building energy data csv files\n", " \n", " :return None: saves each file in `validation_results` for analysis\n", " \"\"\"\n", "\n", " # Run validation for all buildings\n", " for building_file_name in tqdm_notebook(building_data_files, desc=\"Buildings\"):\n", "\n", " building_data = pd.read_csv(\n", " building_file_name, parse_dates=[\"timestamp\"]\n", " ).set_index(\"timestamp\")\n", " # Create sets of features\n", " building_data = transforms.transform(building_data)\n", "\n", " # Run the validation and save the results\n", " building_validation = run_weekly_validation(models, building_data)\n", "\n", " # Save off results for analysis\n", " building_validation.to_csv(\n", " f\"{building_file_name.replace('energy_data', 'validation_results').replace('data', 'validation_results')}\"\n", " )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cd96ee33cbc84e7daed4fa3aeb769821", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, description='Buildings', max=40, style=ProgressStyle(description_width='in…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "run_all_buildings(building_data_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis\n", "\n", "Let's make sure we have all the datasets for analysis." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "building_result_files = glob.glob('validation_results/*_validation_results.csv')\n", "len(building_result_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll write our error metric as a function. There are some times when the actual value is 0 which results in infinite percentage error. To account for this, we replace the infinite values in the percentage error with a missing value and then ignore missing values in the mean absolute error calculation. The end outcome is a single float with the mean absolute error from the model's predictions." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def mean_absolute_percentage_error(actual, predicted):\n", " \"\"\"\n", " Calculate the mean absolute percentage error of two arrays\n", "\n", " :param actual: true values as a pandas series\n", " :param predicted: estimated values as a pandas series\n", "\n", " :return mape: float of mean absolute percentage error (%)\n", "\n", " NOTE: We avoid getting an inf for the mape by replacing infinite values in the\n", " absolute_percentage_error with nan and then ignoring the missing values (np.nanmean)\n", " \"\"\"\n", " ape_array = 100 * np.abs(predicted - actual) / actual\n", " return np.nanmean(ape_array.replace({np.inf: np.nan, -np.inf: np.nan}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll test out a simple block of code to calculate the error metric for one building. This groups by the model and feature set and finds the mean absolute error of the predictions." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
featuresmodel
cyclicalLinearRegression36.098788
RandomForestRegressor18.749387
standardLinearRegression42.860637
RandomForestRegressor18.747120
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error\n", "features model \n", "cyclical LinearRegression 36.098788\n", " RandomForestRegressor 18.749387\n", "standard LinearRegression 42.860637\n", " RandomForestRegressor 18.747120" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "building_result_file = building_result_files[20]\n", "\n", "validation_results = pd.read_csv(building_result_file)\n", "\n", "building_metrics = (\n", " validation_results.groupby(\n", " [validation_results[\"features\"], validation_results[\"model\"]]\n", " )\n", " .apply(lambda x: mean_absolute_percentage_error(x[\"actual\"], x[\"predicted\"]))\n", " .to_frame()\n", " .rename(columns={0: \"mean_absolute_percentage_error\"})\n", ")\n", "\n", "building_metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That works, so now we'll calculate the metrics for all buildings. The above code can go in a function." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def calculate_metrics(building_result_file):\n", " \"\"\"\n", " Calculate mean_absolute_percentage_error for a building from a predictions file\n", " \n", " :param building_result_file: location of validation results data\n", " \n", " :return building_metrics: dataframe of metrics for building\n", " \"\"\"\n", "\n", " # Read in the results\n", " validation_results = pd.read_csv(building_result_file)\n", "\n", " # Groupby the feature set and model\n", " building_metrics = (\n", " validation_results.groupby(\n", " [\"features\", \"model\"]\n", " )\n", " # Calculate the error metric\n", " .apply(lambda x: mean_absolute_percentage_error(x[\"actual\"], x[\"predicted\"]))\n", " # Format the dataframe\n", " .to_frame()\n", " .reset_index()\n", " .rename(columns={0: \"mean_absolute_percentage_error\"})\n", " )\n", " # Add the building id\n", " building_metrics[\"building_id\"] = building_result_file.split(\"_\")[2]\n", " return building_metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we apply that function to all building results." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Create a dataframe of all metrcs\n", "all_metrics = pd.concat(\n", " [calculate_metrics(br_file) for br_file in building_result_files]\n", ")\n", "all_metrics.to_csv(\"validation_results/all_metrics.csv\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featuresmodelmean_absolute_percentage_errorbuilding_id
0cyclicalLinearRegression17.90929724
1cyclicalRandomForestRegressor12.28072824
2standardLinearRegression20.35249024
3standardRandomForestRegressor12.48768324
0cyclicalLinearRegression29.12133815
\n", "
" ], "text/plain": [ " features model mean_absolute_percentage_error building_id\n", "0 cyclical LinearRegression 17.909297 24\n", "1 cyclical RandomForestRegressor 12.280728 24\n", "2 standard LinearRegression 20.352490 24\n", "3 standard RandomForestRegressor 12.487683 24\n", "0 cyclical LinearRegression 29.121338 15" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
count160.000000
mean63.880972
std83.319291
min9.202885
25%22.299167
50%32.141760
75%65.297555
max451.598475
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error\n", "count 160.000000\n", "mean 63.880972\n", "std 83.319291\n", "min 9.202885\n", "25% 22.299167\n", "50% 32.141760\n", "75% 65.297555\n", "max 451.598475" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_metrics.head()\n", "all_metrics.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results\n", "\n", "We can now answer our main question: does cyclical encoding of date/time features benefit machine learning models? If so, we expect to see a decrease in the model error with the cyclical features." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
countmeanstdmin25%50%75%max
featuresmodel
cyclicalLinearRegression40.068.86260985.37663010.37736227.98123334.57189467.116970438.634348
RandomForestRegressor40.055.00295277.7006149.20288519.22936423.83274348.580763413.009500
standardLinearRegression40.076.94193194.56491510.28441429.17814241.58674668.941804451.598475
RandomForestRegressor40.054.71639675.2724979.32182019.34440723.81366748.092324386.051141
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error \\\n", " count mean \n", "features model \n", "cyclical LinearRegression 40.0 68.862609 \n", " RandomForestRegressor 40.0 55.002952 \n", "standard LinearRegression 40.0 76.941931 \n", " RandomForestRegressor 40.0 54.716396 \n", "\n", " \\\n", " std min 25% 50% \n", "features model \n", "cyclical LinearRegression 85.376630 10.377362 27.981233 34.571894 \n", " RandomForestRegressor 77.700614 9.202885 19.229364 23.832743 \n", "standard LinearRegression 94.564915 10.284414 29.178142 41.586746 \n", " RandomForestRegressor 75.272497 9.321820 19.344407 23.813667 \n", "\n", " \n", " 75% max \n", "features model \n", "cyclical LinearRegression 67.116970 438.634348 \n", " RandomForestRegressor 48.580763 413.009500 \n", "standard LinearRegression 68.941804 451.598475 \n", " RandomForestRegressor 48.092324 386.051141 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_metrics.groupby(['features', 'model']).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the answer, we group by the building and model and calculate the difference in error between the sets of features." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
difference_in_error
countmeanstdmin25%50%75%max
model
LinearRegression40.08.07932214.717504-7.1060860.4220334.9540488.21627964.505072
RandomForestRegressor40.0-0.2865565.676348-26.958359-0.3584280.0961750.25563216.377598
\n", "
" ], "text/plain": [ " difference_in_error \\\n", " count mean std min \n", "model \n", "LinearRegression 40.0 8.079322 14.717504 -7.106086 \n", "RandomForestRegressor 40.0 -0.286556 5.676348 -26.958359 \n", "\n", " \n", " 25% 50% 75% max \n", "model \n", "LinearRegression 0.422033 4.954048 8.216279 64.505072 \n", "RandomForestRegressor -0.358428 0.096175 0.255632 16.377598 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def difference_between_features(metrics):\n", " \"\"\"\n", " Calculate the difference in mean absolute error between two sets of features.\n", " \"\"\"\n", " return float(\n", " metrics.loc[\n", " metrics[\"features\"] == \"standard\", \"mean_absolute_percentage_error\"\n", " ].values\n", " - metrics.loc[\n", " metrics[\"features\"] == \"cyclical\", \"mean_absolute_percentage_error\"\n", " ].values\n", " )\n", "\n", "# Groupby building and model\n", "feature_differences = (\n", " all_metrics.groupby([\"building_id\", \"model\"])\n", " # Calculate difference\n", " .apply(lambda x: difference_between_features(x))\n", " # Format as dataframe\n", " .to_frame()\n", " .reset_index()\n", " .rename(columns={0: \"difference_in_error\"})\n", ")\n", "\n", "feature_differences.groupby('model').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initial hypothesis holds: __Cyclical encoding of features benefits a simple linear regression, but not a complex model such as Random Forest Regression.__\n", "\n", "# Visualizations\n", "\n", "We've answered our main question, but we can still make some visualizations. We'll keep using the interactive [plotly library](https://plot.ly/python/) with its newest addition, [plotly express](https://medium.com/@plotlygraphs/introducing-plotly-express-808df010143d)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from plotly.offline import plot\n", "import plotly_express as px" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def graph_results(all_metrics, feature_differences):\n", " # Make a bar plot of all the errors for each model and set of features\n", " all_scores_fig = px.bar(\n", " all_metrics,\n", " x=\"building_id\",\n", " y=\"mean_absolute_percentage_error\",\n", " color=\"features\",\n", " barmode=\"group\",\n", " facet_row=\"model\",\n", " title=\"Comparison of Models and Errors\",\n", " template=\"plotly_white\",\n", " height=1400,\n", " )\n", " # Make a bar plot of the difference in errors from cyclical encoding\n", " diff_fig = px.bar(\n", " feature_differences,\n", " x=\"building_id\",\n", " y=\"difference_in_error\",\n", " facet_row=\"model\",\n", " height=1400,\n", " template=\"plotly_white\",\n", " title=\"Performance Gain from Cyclical Encoding of Features\",\n", " )\n", " return all_scores_fig, diff_fig\n", "\n", "\n", "all_scores_fig, diff_fig = graph_results(all_metrics, feature_differences)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'validation_results/all_scores.html'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'validation_results/diff_in_scores.html'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot(all_scores_fig, filename=\"validation_results/all_scores.html\")\n", "plot(diff_fig, filename='validation_results/diff_in_scores.html')" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "The graphs clearly show our conclusion: the cyclically encoding has lowered the error for the linear regression but not for the random forest. Feel free to explore the data and change the testing setup, model hyperparameters, feature sets, data sets to see if this result holds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "The main takeaway is that __cyclical feature encoding is not required for complex models with a high capacity.__ These models are able to learn the relationships between the day of year, time of day and the target without cyclical encoding. The feature transformation does not hurt the more complex models, but it adds unnecessary steps to a machine learning pipeline without adding benefits. \n", "\n", "Granted, this result may only hold for building energy data without many other features and I think it'd be great if others applied a similar testing procedure to different datasets. Currently, the only way to figure out if something is a good idea in machine learning is to get some data and try it out, a lesson we learned with this feature testing project." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }