{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2: Feature Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature standardization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `vinho verde` data set contains physico-chemical information on a number of Portuguese wines, as well as their rating by human tasters.\n", "\n", "Our goal is to use these data to automatically predict the rating of a wine, so as to assist oenologists, improve wine production, and target the taste of niche consumers.\n", "\n", "This data set has been made available on the UCI Machine Learning Repository, one of the oldest and best-known repositories of ML problems.\n", "\n", "It is available from: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ (a copy is already in your repository; we will focus on white wines here)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('data/winequality-white.csv', sep=\";\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have loaded the data in a _pandas DataFrame_ object. Let us examine what information is available:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.0 0.27 0.36 20.7 0.045 \n", "1 6.3 0.30 0.34 1.6 0.049 \n", "2 8.1 0.28 0.40 6.9 0.050 \n", "3 7.2 0.23 0.32 8.5 0.058 \n", "4 7.2 0.23 0.32 8.5 0.058 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 45.0 170.0 1.0010 3.00 0.45 \n", "1 14.0 132.0 0.9940 3.30 0.49 \n", "2 30.0 97.0 0.9951 3.26 0.44 \n", "3 47.0 186.0 0.9956 3.19 0.40 \n", "4 47.0 186.0 0.9956 3.19 0.40 \n", "\n", " alcohol quality \n", "0 8.8 6 \n", "1 9.5 6 \n", "2 10.1 6 \n", "3 9.9 6 \n", "4 9.9 6 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head(n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contains 12 columns. The first 11 (`fixed acidity` through `alcohol`) are physico-chemical features of the wines; the last one is their rating (or quality).\n", "\n", "Let us extract from this data a numpy array that contains the design matrix X:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4898, 11)\n" ] } ], "source": [ "X = data.values[:, :-1]\n", "print(X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Extract from this data a one-dimensional numpy array that contains the labels y." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "y = data['quality'].to_numpy()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Equivalent alternative: .values also returns a one-dimensional numpy array\n", "y = data['quality'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now plot a histogram of the values taken by each of our features:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# create a figure of size 16x12\n", "fig = plt.figure(figsize=(16, 12))\n", "\n", "for feat_idx in range(X.shape[1]):\n", "    # create a subplot in the (feat_idx+1) position of a 3x4 grid\n", "    ax = fig.add_subplot(3, 4, (feat_idx+1))\n", "    # plot the histogram of the feature at index feat_idx\n", "    h = ax.hist(X[:, feat_idx], bins=50, color='steelblue', edgecolor='none')\n", "    # use the name of the feature as a title for each histogram\n", "    ax.set_title(data.columns[feat_idx], fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__\n", "What are the ranges of values taken by the different features? What do you think is going to happen when one computes the Euclidean distance between two samples: will the `free sulfur dioxide` be accounted for in a manner similar to the `sulphates`? How is this going to affect the k-nearest-neighbor algorithm?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5-nearest-neighbor prediction\n", "We will now see how to use scikit-learn to split the data into a training set and a test set, train a nearest-neighbor regressor on the training data, and evaluate its performance on the test set." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Splitting the data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "X_train, X_test, y_train, y_test = \\\n", "    model_selection.train_test_split(X, y,\n", "                                     test_size=0.3  # 30% of the data goes into the test set\n", "                                     )" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3428, 11) (1470, 11) (3428,) (1470,)\n" ] } ], "source": [ "print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating a 5-nearest-neighbor regressor" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn import neighbors" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "model = neighbors.KNeighborsRegressor(n_neighbors=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training the 5-NN regressor on the training data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KNeighborsRegressor()" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Making predictions with the trained model" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "y_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7928764476414278" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the RMSE between the predictions and the true values\n", "from sklearn import metrics\n", "np.sqrt(metrics.mean_squared_error(y_test, 
y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature standardization" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "# Create a standardizer object and fit it to the training data.\n", "std_scale = preprocessing.StandardScaler().fit(X_train)\n", "\n", "# Apply the standardization to the training and the test data.\n", "X_train_std = std_scale.transform(X_train)\n", "X_test_std = std_scale.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Why did we fit the standardizer (i.e., compute the mean and standard deviation of each feature) on the training set only?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Visualize the scaled data again to check that the standardization had the intended effect." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# TODO\n", "# create a figure of size 16x12\n", "fig = plt.figure(figsize=(16, 12))\n", "\n", "for feat_idx in range(X_train_std.shape[1]):\n", "    # create a subplot in the (feat_idx+1) position of a 3x4 grid\n", "    ax = fig.add_subplot(3, 4, (feat_idx+1))\n", "    # plot the histogram of the standardized feature at index feat_idx\n", "    h = ax.hist(X_train_std[:, feat_idx], bins=50, color='steelblue', edgecolor='none')\n", "    # use the name of the feature as a title for each histogram\n", "    ax.set_title(data.columns[feat_idx], fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Effect of the feature standardization on the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Train a new model on the standardized data. Is it better than the one trained on non-standardized data?" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7113845846030713" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TODO\n", "model_std = neighbors.KNeighborsRegressor(n_neighbors=5)\n", "model_std.fit(X_train_std, y_train)\n", "y_pred_std = model_std.predict(X_test_std)\n", "# Compute the RMSE of the model trained on standardized data\n", "np.sqrt(metrics.mean_squared_error(y_test, y_pred_std))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will work with a data set that describes mushrooms according to the shape of their cap and stalk, their odor, the type of their veil, etc. This data set also contains information on whether a mushroom is edible or not, and that is what we will try to predict." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data are available as `data/mushrooms.csv`. Let us load them in a pandas DataFrame called `df`." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data/mushrooms.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at the first few lines of df" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color bruises odor gill-attachment \\\n", "0 p x s n t p f \n", "1 e x s y t a f \n", "2 e b s w t l f \n", "3 p x y w t p f \n", "4 e x s g f n f \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-below-ring \\\n", "0 c n k ... s \n", "1 c b k ... s \n", "2 c b n ... s \n", "3 c n n ... s \n", "4 w b k ... s \n", "\n", " stalk-color-above-ring stalk-color-below-ring veil-type veil-color \\\n", "0 w w p w \n", "1 w w p w \n", "2 w w p w \n", "3 w w p w \n", "4 w w p w \n", "\n", " ring-number ring-type spore-print-color population habitat \n", "0 o p k s u \n", "1 o p n n g \n", "2 o p n n m \n", "3 o p k s u \n", "4 o e n a g \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the features are encoded as _letters_. Each letter corresponds to a category. For example, for the `cap-shape` feature, `b` corresponds to a bell cap, `c` to a conical cap, `f` to a flat cap, `k` to a knobbed cap, `s` to a sunken cap, and `x` to a convex cap. For more details about their meaning, you can consult [the documentation of the data set](https://archive.ics.uci.edu/ml/datasets/Mushroom)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Direct conversion to numerical attributes\n", "In order to work with this data, we need to convert the categorical attributes into numerical values. Here we will simply convert each letter to an integer between 0 and the number of categories minus one, using scikit-learn's [preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "labelencoder = preprocessing.LabelEncoder()\n", "for col in df.columns:\n", " df[col] = labelencoder.fit_transform(df[col])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color bruises odor gill-attachment \\\n", "0 1 5 2 4 1 6 1 \n", "1 0 5 2 9 1 0 1 \n", "2 0 0 2 8 1 3 1 \n", "3 1 5 3 8 1 6 1 \n", "4 0 5 2 3 0 5 1 \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-below-ring \\\n", "0 0 1 4 ... 2 \n", "1 0 0 4 ... 2 \n", "2 0 0 5 ... 2 \n", "3 0 1 5 ... 2 \n", "4 1 0 4 ... 2 \n", "\n", " stalk-color-above-ring stalk-color-below-ring veil-type veil-color \\\n", "0 7 7 0 2 \n", "1 7 7 0 2 \n", "2 7 7 0 2 \n", "3 7 7 0 2 \n", "4 7 7 0 2 \n", "\n", " ring-number ring-type spore-print-color population habitat \n", "0 1 4 2 3 5 \n", "1 1 4 3 2 1 \n", "2 1 4 3 2 3 \n", "3 1 4 2 3 5 \n", "4 1 0 3 0 1 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This integer encoding is not necessarily the best: it imposes an arbitrary order on the categories. For example, an algorithm that uses the Euclidean distance will consider that a convex cap (`x`, converted to 5) is closer to a sunken cap (`s`, converted to 4) than to a conical cap (`c`, converted to 1), although the cap shapes have no natural ordering. [One-hot encoding](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), which replaces each feature by one binary indicator column per category, is a good alternative. However, it has the drawback of increasing the number of features and of creating correlated features." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Optionally reload the raw data; OneHotEncoder also accepts the integer codes produced above\n", "#df = pd.read_csv('data/mushrooms.csv')\n", "\n", "# Note: every column of df is encoded here, including the `class` label\n", "ohe_encoder = preprocessing.OneHotEncoder()\n", "X = ohe_encoder.fit_transform(df[df.columns])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<8124x119 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 186852 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 1., 0., ..., 0., 1., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 1., ..., 0., 0., 0.],\n", " ...,\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [0., 1., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.toarray()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 2 }