{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction: Testing Cyclical Encoding of Features for Machine Learning\n", "\n", "Cyclical encoding of time-series features is typically done with time/date features that repeat periodically such as the hour of the day or day of the year. However, it is not a foregone conclusion this technique actually improves the performance of complex machine learning algorithms, such as a random forest. In this notebook, we will evaluate the efficacy of cyclical encoding of time-series features for predicting building energy data with two different models. The hypothesis is that cyclical encoding will improve the performance of the simple model, Linear Regression, but will not improve the performance of the model with higher capacity, the Random Forest Regression.\n", "\n", "## Methods\n", "\n", "__Feature Sets__\n", "1. Standard = `[\"time_of_day\", \"day_of_year\", \"temperature\"]`\n", "2. Cyclical = `[\"sin_time_of_day\", \"cos_time_of_day\", \"sin_day_of_year\", \"cos_day_of_year\", \"temperature\"]`\n", "\n", "(We include temperature in both feature sets because when modeling time-series problems, we almost always have other features besides the date and time. Results may vary with the addition of features).\n", "\n", "__Models__\n", "1. Linear Regression = `LinearRegression(n_jobs=-1)`\n", "2. Random Forest Regression = `RandomForestRegressor(n_estimators=25, max_depth=15, n_jobs=-1, random_state=100)`\n", "\n", "(The hyperparameters for the random forest have proved to work well for use cases in our data science systems. Again, performance may vary with hyperparameter values).\n", "\n", "We'll use a relatively simple test set-up. After creating the standard and cyclically encoded sets of features, we'll evaluate each set with the two models. Validation will consist of testing on a week of data at a time with all previous data used for training. We go through the dataset one week at a time, training on the historical data, and predicting for that selected week. This mimics what we actually do evaluating models for forecasting energy use at [Cortex Building Intel](https://cortexintel.com). The weekly validation is representative of a real-world situation, in which we'll have past training data and want to make a prediction of future energy consumption. By validating on past data, we can get an estimate of how the model will perform in the future. \n", "\n", "For our use case, we are primarily concerned with the performance of each set of features. We'll use a single metric, __mean absolute percentage error__ to assess our predictions. The benefits of MAPE are: relative comparisons, interpretability, and simplicity. The downsides are that it cna be sensitive to outliers and it's asymmetric because lower true values will result in greater percentage error for the same absolute error. There are always trade-offs to make in machine learning! Primarily we want to assess the difference in MAPE between the standard features and the cyclical features. Defining the difference in errors as standard mape - cyclical mape, a positive value indicates the cyclical features outperform the standard features. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data \n", "\n", "We are using building energy data originally part of a DrivenData machine learning competition to forecast building energy use. You can download the original data from DrivenData here. 
I've cleaned up the building data and removed all features except for timestamp and temperature. The __objective__ is to predict the \"energy\" from the features." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 40 buildings to test.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import glob\n", "\n", "building_data_files = glob.glob('data/building*')\n", "print(f'There are {len(building_data_files)} buildings to test.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at one example dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergy
timestamp
2016-09-18 04:00:0056.2403001.682686
2016-09-18 04:15:0056.0875012.086212
2016-09-18 04:30:0056.2132321.687880
2016-09-18 04:45:0056.4000491.926518
2016-09-18 05:00:0056.5924971.922459
\n", "
" ], "text/plain": [ " temperature energy\n", "timestamp \n", "2016-09-18 04:00:00 56.240300 1.682686\n", "2016-09-18 04:15:00 56.087501 2.086212\n", "2016-09-18 04:30:00 56.213232 1.687880\n", "2016-09-18 04:45:00 56.400049 1.926518\n", "2016-09-18 05:00:00 56.592497 1.922459" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(building_data_files[10], parse_dates=['timestamp'], index_col=0).set_index('timestamp')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing too extraordinary here." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/temp-plot.html'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import cufflinks as cf\n", "from plotly.offline import plot\n", "\n", "fig = data.iplot(y='energy', secondary_y='temperature', title='Temperature and Energy', asFigure=True)\n", "plot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering\n", "\n", "To repeatedly create the same features, we'll make two scikit-learn transformers and then join them together in a pipeline. This is very simple code, but including it in a well-defined interface means we can use it over and over while confident we'll get the same operations applied to our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformers\n", "\n", "[Scikit-Learn transformers](http://scikit-learn.org/stable/data_transforms.html) give us a standard interface for applying operations to our data. Using transformers is efficient and allows you to build robust systems." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "class DateTimeFeatures(BaseEstimator, TransformerMixin):\n", " \"\"\"\n", " Extract day of year and time of day features from a timestamp\n", " \"\"\"\n", " def __init__(self):\n", " pass\n", "\n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # Timestamps must be in index\n", " field = X.index\n", " X[\"time_of_day\"] = field.hour + field.minute / 60\n", " X[\"day_of_year\"] = field.dayofyear\n", " return X\n", "\n", "\n", "class CyclicalDateTimeFeatures(BaseEstimator, TransformerMixin):\n", " \"\"\"\n", " Make cyclically encoded day of year and time of day features\n", " \"\"\"\n", " def __init__(self):\n", " pass\n", "\n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # Apply formula for sin and cosine\n", " X[\"sin_time_of_day\"], X[\"cos_time_of_day\"] = _cyclical_encoding(\n", " X[\"time_of_day\"], period=24\n", " )\n", " X[\"sin_day_of_year\"], X[\"cos_day_of_year\"] = _cyclical_encoding(\n", " X[\"day_of_year\"], period=366\n", " )\n", " return X\n", "\n", "def _cyclical_encoding(series, period):\n", " \"\"\"\n", " Cyclical encoding of a series with a specified period\n", " \"\"\"\n", " # Basic formula for sin/cosine equation\n", " base = 2 * np.pi * series / period\n", " return np.sin(base), np.cos(base)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "The pipeline is one of the most important features in sklearn. 
I highly recommend [reading up on the Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and using it throughout your data science projects." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "# Make a pipeline with the steps\n", "transforms = Pipeline(\n", " steps=[\n", " # Must create the date/time features before the cyclical encoding\n", " (\"date_time_features\", DateTimeFeatures()),\n", " (\"cyclical_date_time_features\", CyclicalDateTimeFeatures()),\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use our pipeline on any dataset." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergytime_of_dayday_of_yearsin_time_of_daycos_time_of_daysin_day_of_yearcos_day_of_year
timestamp
2016-09-18 04:00:0056.2403001.6826864.002620.8660250.500000-0.977064-0.212947
2016-09-18 04:15:0056.0875012.0862124.252620.8968730.442289-0.977064-0.212947
2016-09-18 04:30:0056.2132321.6878804.502620.9238800.382683-0.977064-0.212947
2016-09-18 04:45:0056.4000491.9265184.752620.9469300.321439-0.977064-0.212947
2016-09-18 05:00:0056.5924971.9224595.002620.9659260.258819-0.977064-0.212947
\n", "
" ], "text/plain": [ " temperature energy time_of_day day_of_year \\\n", "timestamp \n", "2016-09-18 04:00:00 56.240300 1.682686 4.00 262 \n", "2016-09-18 04:15:00 56.087501 2.086212 4.25 262 \n", "2016-09-18 04:30:00 56.213232 1.687880 4.50 262 \n", "2016-09-18 04:45:00 56.400049 1.926518 4.75 262 \n", "2016-09-18 05:00:00 56.592497 1.922459 5.00 262 \n", "\n", " sin_time_of_day cos_time_of_day sin_day_of_year \\\n", "timestamp \n", "2016-09-18 04:00:00 0.866025 0.500000 -0.977064 \n", "2016-09-18 04:15:00 0.896873 0.442289 -0.977064 \n", "2016-09-18 04:30:00 0.923880 0.382683 -0.977064 \n", "2016-09-18 04:45:00 0.946930 0.321439 -0.977064 \n", "2016-09-18 05:00:00 0.965926 0.258819 -0.977064 \n", "\n", " cos_day_of_year \n", "timestamp \n", "2016-09-18 04:00:00 -0.212947 \n", "2016-09-18 04:15:00 -0.212947 \n", "2016-09-18 04:30:00 -0.212947 \n", "2016-09-18 04:45:00 -0.212947 \n", "2016-09-18 05:00:00 -0.212947 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed_data = transforms.transform(data)\n", "transformed_data.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperatureenergytime_of_dayday_of_yearsin_time_of_daycos_time_of_daysin_day_of_yearcos_day_of_year
timestamp
2014-01-01 05:15:0036.7366740.4790605.2510.9807851.950903e-010.0171660.999853
2014-01-01 05:30:0036.6821870.4979915.5010.9914451.305262e-010.0171660.999853
2014-01-01 05:45:0036.6792550.4991075.7510.9978596.540313e-020.0171660.999853
2014-01-01 06:00:0036.6991380.4944276.0011.0000006.123234e-170.0171660.999853
2014-01-01 06:15:0036.5871240.5011656.2510.997859-6.540313e-020.0171660.999853
\n", "
" ], "text/plain": [ " temperature energy time_of_day day_of_year \\\n", "timestamp \n", "2014-01-01 05:15:00 36.736674 0.479060 5.25 1 \n", "2014-01-01 05:30:00 36.682187 0.497991 5.50 1 \n", "2014-01-01 05:45:00 36.679255 0.499107 5.75 1 \n", "2014-01-01 06:00:00 36.699138 0.494427 6.00 1 \n", "2014-01-01 06:15:00 36.587124 0.501165 6.25 1 \n", "\n", " sin_time_of_day cos_time_of_day sin_day_of_year \\\n", "timestamp \n", "2014-01-01 05:15:00 0.980785 1.950903e-01 0.017166 \n", "2014-01-01 05:30:00 0.991445 1.305262e-01 0.017166 \n", "2014-01-01 05:45:00 0.997859 6.540313e-02 0.017166 \n", "2014-01-01 06:00:00 1.000000 6.123234e-17 0.017166 \n", "2014-01-01 06:15:00 0.997859 -6.540313e-02 0.017166 \n", "\n", " cos_day_of_year \n", "timestamp \n", "2014-01-01 05:15:00 0.999853 \n", "2014-01-01 05:30:00 0.999853 \n", "2014-01-01 05:45:00 0.999853 \n", "2014-01-01 06:00:00 0.999853 \n", "2014-01-01 06:15:00 0.999853 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(building_data_files[20], parse_dates=['timestamp'], index_col=0).set_index('timestamp')\n", "transformed_data = transforms.transform(data)\n", "transformed_data.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/cylical_encoding_day_of_year.html'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig = data.iplot(y=['sin_day_of_year', 'cos_day_of_year'], title=\"Cyclically Encoded Day of Year\", asFigure=True)\n", "plot(fig, filename='cylical_encoding_day_of_year.html')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'file:///Users/williamkoehrsen/git/Data-Analysis/cyclical-features/temp-plot.html'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig = data.loc['2014-01-01': '2014-01-08'].iplot(y=['sin_time_of_day', 'cos_time_of_day'], title=\"Cyclically Encoded Time of Day\", asFigure=True)\n", "plot(fig, 'cyclical_encoding_time_of_day.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling\n", "\n", "The modeling itself is relatively simple thanks to Scikit-Learn. We'll make a function that takes a set of data iterates through the sets of features (2) and the models (2). On each iteration, we apply the testing framework by stepping through the dataset one week at a time. We train on the all previous data and then test on the selected week recording the predictions along with the actual value.\n", "\n", "This means that for each dataset, we are running 4 weekly validation testing sessions (2 x 2). Feel free to change the models or test using different features. You can also change the validation structure although we've found this to be an accurate indication of how well we can expect the model to perform on real data (forecasting for the future). \n", "\n", "If you were using this system in production, the most important aspect would be [writing unit tests](https://realpython.com/python-testing/)! These [should cover](https://docs.python-guide.org/writing/tests/) both the pipeline and the modeling. You are going to make mistakes in your code, that is guaranteed. Unit tests can help you find these mistakes before they cause unexpected errors in a machine learning system." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def run_weekly_validation(\n", " models,\n", " data,\n", " feature_sets=[\n", " [\"time_of_day\", \"day_of_year\", \"temperature\"],\n", " [\n", " \"sin_time_of_day\",\n", " \"cos_time_of_day\",\n", " \"sin_day_of_year\",\n", " \"cos_day_of_year\",\n", " \"temperature\",\n", " ],\n", " ],\n", "):\n", " \"\"\"\n", " Run the weekly validation modeling with models and feature sets on data\n", " \n", " :param models: list of sklearn models\n", " :param data: building energy dataframe\n", " :param feature_sets: list of lists of features\n", " \n", " :return results: dataframe of prediction results from each combination of model and feature set\n", " \"\"\"\n", "\n", " all_predictions = []\n", "\n", " # Iterate through feature sets\n", " for feature_set in feature_sets:\n", " features = \"standard\" if \"sin_time_of_day\" not in feature_set else \"cyclical\"\n", " \n", " # Subset to features in data\n", " X = data[feature_set + [\"energy\"]].copy()\n", "\n", " # Iterate through models\n", " for model in models:\n", " \n", " model_name = model.__class__.__name__\n", " \n", " # Iterate through weeks in the dataset\n", " # Must group by string formatted week and year\n", " for (week, year), X_test in X.groupby(\n", " # [week, year] grouping\n", " [X.index.strftime(\"%U\"), X.index.strftime(\"%Y\")]\n", " ):\n", " # Subset to training data; all data before test set\n", " X_train = X[X.index < X_test.index.min()].copy()\n", "\n", " # Can not train or test on zero observations\n", " if len(X_train) == 0 or len(X_test) == 0:\n", " continue\n", "\n", " # Targets\n", " y_train = X_train.pop(\"energy\")\n", " y_test = X_test.pop(\"energy\")\n", "\n", " # Fit and predict\n", " model.fit(X_train, y_train)\n", " predictions = model.predict(X_test)\n", "\n", " # Record predictions along with actual values, model, and feature set in a dataframe\n", " predictions = pd.DataFrame(\n", " dict(\n", " predicted=predictions,\n", " actual=y_test,\n", " model=model_name,\n", " features=features,\n", " ),\n", " # Index is same as for testing data\n", " index=X_test.index,\n", " )\n", " all_predictions.append(predictions)\n", "\n", " # Return dataframe of dataframes ordered by timestamp\n", " return (\n", " pd.concat(all_predictions)\n", " .reset_index()\n", " .sort_values([\"model\", \"features\", \"timestamp\"])\n", " .set_index(\"timestamp\")\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll run one example first to make sure the results are as expected (in a real system, you'd want to have unit tests for this function and the pipeline.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example Validation" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/williamkoehrsen/.local/share/virtualenvs/insight-engine-JU9-2WuC/lib/python3.7/site-packages/sklearn/linear_model/base.py:485: RuntimeWarning:\n", "\n", "internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). 
Falling back to 'gelss' driver.\n", "\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Create linear model and random forest model for regression\n", "models = [\n", " LinearRegression(n_jobs=-1),\n", " RandomForestRegressor(n_estimators=25, max_depth=15, n_jobs=-1, random_state=100),\n", "]\n", "\n", "validation = run_weekly_validation(models, data)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictedactualmodelfeatures
timestamp
2014-01-05 00:00:000.4242050.398189LinearRegressioncyclical
2014-01-05 00:15:000.4136240.397093LinearRegressioncyclical
2014-01-05 00:30:000.4080620.396569LinearRegressioncyclical
2014-01-05 00:45:000.4040330.400340LinearRegressioncyclical
2014-01-05 01:00:000.4006670.390858LinearRegressioncyclical
\n", "
" ], "text/plain": [ " predicted actual model features\n", "timestamp \n", "2014-01-05 00:00:00 0.424205 0.398189 LinearRegression cyclical\n", "2014-01-05 00:15:00 0.413624 0.397093 LinearRegression cyclical\n", "2014-01-05 00:30:00 0.408062 0.396569 LinearRegression cyclical\n", "2014-01-05 00:45:00 0.404033 0.400340 LinearRegression cyclical\n", "2014-01-05 01:00:00 0.400667 0.390858 LinearRegression cyclical" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validation.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictedactual
count388268.000000388268.000000
mean0.2435560.221177
std0.1270250.141697
min-0.4124010.014225
25%0.1574190.109462
50%0.2198840.175026
75%0.3019350.306944
max1.3890550.958976
\n", "
" ], "text/plain": [ " predicted actual\n", "count 388268.000000 388268.000000\n", "mean 0.243556 0.221177\n", "std 0.127025 0.141697\n", "min -0.412401 0.014225\n", "25% 0.157419 0.109462\n", "50% 0.219884 0.175026\n", "75% 0.301935 0.306944\n", "max 1.389055 0.958976" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validation.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validation for all Buildings\n", "\n", "Finally, we can run the validation for all buildings. This may take a while and I've already included the results in the `validation_results` directory." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook\n", "\n", "def run_all_buildings(building_data_files):\n", " \"\"\"\n", " Run weekly validation for all buildings\n", " \n", " :param building_data_files: list of filenames with building energy data csv files\n", " \n", " :return None: saves each file in `validation_results` for analysis\n", " \"\"\"\n", "\n", " # Run validation for all buildings\n", " for building_file_name in tqdm_notebook(building_data_files, desc=\"Buildings\"):\n", "\n", " building_data = pd.read_csv(\n", " building_file_name, parse_dates=[\"timestamp\"]\n", " ).set_index(\"timestamp\")\n", " # Create sets of features\n", " building_data = transforms.transform(building_data)\n", "\n", " # Run the validation and save the results\n", " building_validation = run_weekly_validation(models, building_data)\n", "\n", " # Save off results for analysis\n", " building_validation.to_csv(\n", " f\"{building_file_name.replace('energy_data', 'validation_results').replace('data', 'validation_results')}\"\n", " )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cd96ee33cbc84e7daed4fa3aeb769821", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, description='Buildings', max=40, style=ProgressStyle(description_width='in…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "run_all_buildings(building_data_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis\n", "\n", "Let's make sure we have all the datasets for analysis." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "building_result_files = glob.glob('validation_results/*_validation_results.csv')\n", "len(building_result_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll write our error metric as a function. There are some times when the actual value is 0 which results in infinite percentage error. To account for this, we replace the infinite values in the percentage error with a missing value and then ignore missing values in the mean absolute error calculation. The end outcome is a single float with the mean absolute error from the model's predictions." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def mean_absolute_percentage_error(actual, predicted):\n", " \"\"\"\n", " Calculate the mean absolute percentage error of two arrays\n", "\n", " :param actual: true values as a pandas series\n", " :param predicted: estimated values as a pandas series\n", "\n", " :return mape: float of mean absolute percentage error (%)\n", "\n", " NOTE: We avoid getting an inf for the mape by replacing infinite values in the\n", " absolute_percentage_error with nan and then ignoring the missing values (np.nanmean)\n", " \"\"\"\n", " ape_array = 100 * np.abs(predicted - actual) / actual\n", " return np.nanmean(ape_array.replace({np.inf: np.nan, -np.inf: np.nan}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll test out a simple block of code to calculate the error metric for one building. This groups by the model and feature set and finds the mean absolute error of the predictions." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
featuresmodel
cyclicalLinearRegression36.098788
RandomForestRegressor18.749387
standardLinearRegression42.860637
RandomForestRegressor18.747120
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error\n", "features model \n", "cyclical LinearRegression 36.098788\n", " RandomForestRegressor 18.749387\n", "standard LinearRegression 42.860637\n", " RandomForestRegressor 18.747120" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "building_result_file = building_result_files[20]\n", "\n", "validation_results = pd.read_csv(building_result_file)\n", "\n", "building_metrics = (\n", " validation_results.groupby(\n", " [validation_results[\"features\"], validation_results[\"model\"]]\n", " )\n", " .apply(lambda x: mean_absolute_percentage_error(x[\"actual\"], x[\"predicted\"]))\n", " .to_frame()\n", " .rename(columns={0: \"mean_absolute_percentage_error\"})\n", ")\n", "\n", "building_metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That works, so now we'll calculate the metrics for all buildings. The above code can go in a function." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def calculate_metrics(building_result_file):\n", " \"\"\"\n", " Calculate mean_absolute_percentage_error for a building from a predictions file\n", " \n", " :param building_result_file: location of validation results data\n", " \n", " :return building_metrics: dataframe of metrics for building\n", " \"\"\"\n", "\n", " # Read in the results\n", " validation_results = pd.read_csv(building_result_file)\n", "\n", " # Groupby the feature set and model\n", " building_metrics = (\n", " validation_results.groupby(\n", " [\"features\", \"model\"]\n", " )\n", " # Calculate the error metric\n", " .apply(lambda x: mean_absolute_percentage_error(x[\"actual\"], x[\"predicted\"]))\n", " # Format the dataframe\n", " .to_frame()\n", " .reset_index()\n", " .rename(columns={0: \"mean_absolute_percentage_error\"})\n", " )\n", " # Add the building id\n", " building_metrics[\"building_id\"] = building_result_file.split(\"_\")[2]\n", " return building_metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we apply that function to all building results." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Create a dataframe of all metrcs\n", "all_metrics = pd.concat(\n", " [calculate_metrics(br_file) for br_file in building_result_files]\n", ")\n", "all_metrics.to_csv(\"validation_results/all_metrics.csv\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featuresmodelmean_absolute_percentage_errorbuilding_id
0cyclicalLinearRegression17.90929724
1cyclicalRandomForestRegressor12.28072824
2standardLinearRegression20.35249024
3standardRandomForestRegressor12.48768324
0cyclicalLinearRegression29.12133815
\n", "
" ], "text/plain": [ " features model mean_absolute_percentage_error building_id\n", "0 cyclical LinearRegression 17.909297 24\n", "1 cyclical RandomForestRegressor 12.280728 24\n", "2 standard LinearRegression 20.352490 24\n", "3 standard RandomForestRegressor 12.487683 24\n", "0 cyclical LinearRegression 29.121338 15" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
count160.000000
mean63.880972
std83.319291
min9.202885
25%22.299167
50%32.141760
75%65.297555
max451.598475
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error\n", "count 160.000000\n", "mean 63.880972\n", "std 83.319291\n", "min 9.202885\n", "25% 22.299167\n", "50% 32.141760\n", "75% 65.297555\n", "max 451.598475" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_metrics.head()\n", "all_metrics.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results\n", "\n", "We can now answer our main question: does cyclical encoding of date/time features benefit machine learning models? If so, we expect to see a decrease in the model error with the cyclical features." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_percentage_error
countmeanstdmin25%50%75%max
featuresmodel
cyclicalLinearRegression40.068.86260985.37663010.37736227.98123334.57189467.116970438.634348
RandomForestRegressor40.055.00295277.7006149.20288519.22936423.83274348.580763413.009500
standardLinearRegression40.076.94193194.56491510.28441429.17814241.58674668.941804451.598475
RandomForestRegressor40.054.71639675.2724979.32182019.34440723.81366748.092324386.051141
\n", "
" ], "text/plain": [ " mean_absolute_percentage_error \\\n", " count mean \n", "features model \n", "cyclical LinearRegression 40.0 68.862609 \n", " RandomForestRegressor 40.0 55.002952 \n", "standard LinearRegression 40.0 76.941931 \n", " RandomForestRegressor 40.0 54.716396 \n", "\n", " \\\n", " std min 25% 50% \n", "features model \n", "cyclical LinearRegression 85.376630 10.377362 27.981233 34.571894 \n", " RandomForestRegressor 77.700614 9.202885 19.229364 23.832743 \n", "standard LinearRegression 94.564915 10.284414 29.178142 41.586746 \n", " RandomForestRegressor 75.272497 9.321820 19.344407 23.813667 \n", "\n", " \n", " 75% max \n", "features model \n", "cyclical LinearRegression 67.116970 438.634348 \n", " RandomForestRegressor 48.580763 413.009500 \n", "standard LinearRegression 68.941804 451.598475 \n", " RandomForestRegressor 48.092324 386.051141 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_metrics.groupby(['features', 'model']).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the answer, we group by the building and model and calculate the difference in error between the sets of features." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
difference_in_error
countmeanstdmin25%50%75%max
model
LinearRegression40.08.07932214.717504-7.1060860.4220334.9540488.21627964.505072
RandomForestRegressor40.0-0.2865565.676348-26.958359-0.3584280.0961750.25563216.377598
\n", "
" ], "text/plain": [ " difference_in_error \\\n", " count mean std min \n", "model \n", "LinearRegression 40.0 8.079322 14.717504 -7.106086 \n", "RandomForestRegressor 40.0 -0.286556 5.676348 -26.958359 \n", "\n", " \n", " 25% 50% 75% max \n", "model \n", "LinearRegression 0.422033 4.954048 8.216279 64.505072 \n", "RandomForestRegressor -0.358428 0.096175 0.255632 16.377598 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def difference_between_features(metrics):\n", " \"\"\"\n", " Calculate the difference in mean absolute error between two sets of features.\n", " \"\"\"\n", " return float(\n", " metrics.loc[\n", " metrics[\"features\"] == \"standard\", \"mean_absolute_percentage_error\"\n", " ].values\n", " - metrics.loc[\n", " metrics[\"features\"] == \"cyclical\", \"mean_absolute_percentage_error\"\n", " ].values\n", " )\n", "\n", "# Groupby building and model\n", "feature_differences = (\n", " all_metrics.groupby([\"building_id\", \"model\"])\n", " # Calculate difference\n", " .apply(lambda x: difference_between_features(x))\n", " # Format as dataframe\n", " .to_frame()\n", " .reset_index()\n", " .rename(columns={0: \"difference_in_error\"})\n", ")\n", "\n", "feature_differences.groupby('model').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initial hypothesis holds: __Cyclical encoding of features benefits a simple linear regression, but not a complex model such as Random Forest Regression.__\n", "\n", "# Visualizations\n", "\n", "We've answered our main question, but we can still make some visualizations. We'll keep using the interactive [plotly library](https://plot.ly/python/) with its newest addition, [plotly express](https://medium.com/@plotlygraphs/introducing-plotly-express-808df010143d)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from plotly.offline import plot\n", "import plotly_express as px" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def graph_results(all_metrics, feature_differences):\n", " # Make a bar plot of all the errors for each model and set of features\n", " all_scores_fig = px.bar(\n", " all_metrics,\n", " x=\"building_id\",\n", " y=\"mean_absolute_percentage_error\",\n", " color=\"features\",\n", " barmode=\"group\",\n", " facet_row=\"model\",\n", " title=\"Comparison of Models and Errors\",\n", " template=\"plotly_white\",\n", " height=1400,\n", " )\n", " # Make a bar plot of the difference in errors from cyclical encoding\n", " diff_fig = px.bar(\n", " feature_differences,\n", " x=\"building_id\",\n", " y=\"difference_in_error\",\n", " facet_row=\"model\",\n", " height=1400,\n", " template=\"plotly_white\",\n", " title=\"Performance Gain from Cyclical Encoding of Features\",\n", " )\n", " return all_scores_fig, diff_fig\n", "\n", "\n", "all_scores_fig, diff_fig = graph_results(all_metrics, feature_differences)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'validation_results/all_scores.html'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'validation_results/diff_in_scores.html'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot(all_scores_fig, filename=\"validation_results/all_scores.html\")\n", "plot(diff_fig, filename='validation_results/diff_in_scores.html')" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "The graphs clearly show our conclusion: the cyclically encoding has lowered the error for the linear regression but not for the random forest. Feel free to explore the data and change the testing setup, model hyperparameters, feature sets, data sets to see if this result holds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "The main takeaway is that __cyclical feature encoding is not required for complex models with a high capacity.__ These models are able to learn the relationships between the day of year, time of day and the target without cyclical encoding. The feature transformation does not hurt the more complex models, but it adds unnecessary steps to a machine learning pipeline without adding benefits. \n", "\n", "Granted, this result may only hold for building energy data without many other features and I think it'd be great if others applied a similar testing procedure to different datasets. Currently, the only way to figure out if something is a good idea in machine learning is to get some data and try it out, a lesson we learned with this feature testing project." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }