# Lab 2: Feature Processing

## Feature standardization

The `vinho verde` data set contains physico-chemical information on a number of Portuguese wines, as well as their rating by human tasters. 

Our goal is to use these data to automatically predict the rating of a wine, so as to assist oenologists, improve wine production, and target the taste of niche consumers.

This data set has been made available on the UCI Machine Learning Repository, one of the oldest and best-known repositories of ML problems.

It can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/, but a copy is already in your repository. We will focus on white wines here.

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('data/winequality-white.csv', sep=";")

In [None]:
type(data)

We have loaded the data in a _pandas DataFrame_ object. Let us examine what information is available:

In [None]:
data.head(n=5)

The data contains 12 columns. The first 11 (`fixed acidity` through `alcohol`) are physico-chemical features of the wines; the last one is their rating (or quality).

Let us extract from this data a numpy array that contains the design matrix X:

In [None]:
X = data.values[:, :-1]
print(X.shape)

__Question:__ Extract from this data a one-dimensional numpy array that contains the labels y.

In [None]:
# TODO

In [None]:
# .values converts the pandas Series into a one-dimensional numpy array
y = data['quality'].values

Let us now plot a histogram of the values taken by each of our features:

In [None]:
%pylab inline

In [None]:
# create a figure of size 16x12
fig = plt.figure(figsize=(16, 12))

for feat_idx in range(X.shape[1]):
    # create a subplot in the (feat_idx+1) position of a 3x4 grid
    ax = fig.add_subplot(3, 4, (feat_idx+1))
    # plot the histogram of feat_idx
    h = ax.hist(X[:, feat_idx], bins=50, color='steelblue', edgecolor='none')
    # use the name of the feature as a title for each histogram
    ax.set_title(data.columns[feat_idx], fontsize=14)

__Question:__
What are the ranges of values taken by the different features? What do you think is going to happen when one computes the Euclidean distance between two samples: will `free sulfur dioxide` be accounted for in a manner similar to `sulphates`? How is this going to affect the k-nearest-neighbor algorithm?
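As a hint, here is a minimal sketch of how the Euclidean distance behaves when two features live on very different scales. The three wines below are hypothetical and described by only two features; the values are illustrative and are not taken from the data set.

```python
import numpy as np

# Hypothetical wines described by (free sulfur dioxide, sulphates).
# The numbers are made up for illustration.
wine_a = np.array([30.0, 0.5])
wine_b = np.array([45.0, 0.5])  # differs from wine_a only in free sulfur dioxide
wine_c = np.array([30.0, 1.0])  # differs from wine_a only in sulphates

dist_ab = np.linalg.norm(wine_a - wine_b)
dist_ac = np.linalg.norm(wine_a - wine_c)

print(dist_ab)  # 15.0
print(dist_ac)  # 0.5
```

In this toy setting, doubling the sulphates moves a wine far less than a moderate change in free sulfur dioxide does.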

__Answer:__

### 5-nearest-neighbor prediction
We will now see how to use scikit-learn to split the data between a train and a test set, train a nearest neighbor regressor on the training data, and evaluate its performance on the test set.

#### Splitting the data

In [None]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y,
                                     test_size=0.3  # 30% of the data goes to the test set
                                     )

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
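To make the idea of a random split concrete, here is a minimal numpy sketch on toy data. It is not scikit-learn's implementation; the names `X_toy`, `y_toy`, and the fixed seed are assumptions introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so that the sketch is reproducible

# Toy data standing in for X and y: 10 samples, 2 features.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10)

# Shuffle the sample indices, then reserve 30% of them for the test set.
indices = rng.permutation(len(X_toy))
n_test = int(round(0.3 * len(X_toy)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print(X_toy_train.shape, X_toy_test.shape)  # (7, 2) (3, 2)
```

Note that `train_test_split` also accepts a `random_state` argument, which makes the split reproducible from one run to the next.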

#### Creating a 5-nearest-neighbor regressor

In [None]:
from sklearn import neighbors

In [None]:
model = neighbors.KNeighborsRegressor(n_neighbors=5)

#### Training the 5-NN regressor on the training data

In [None]:
model.fit(X_train, y_train)

#### Making predictions with the trained model

In [None]:
y_pred = model.predict(X_test)
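To see what such a prediction amounts to, here is a minimal numpy sketch of nearest-neighbor regression: each query point receives the mean label of its k closest training points (Euclidean distance, all neighbors weighted equally). The function `knn_predict` and the toy data are hypothetical and written for illustration; scikit-learn's implementation is more elaborate and much faster.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict each query point as the mean target of its k nearest training points."""
    predictions = []
    for x in X_query:
        # Euclidean distance from x to every training sample
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        predictions.append(y_train[nearest].mean())
    return np.array(predictions)

# Toy one-feature data (illustrative values)
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])

toy_pred = knn_predict(X_toy, y_toy, np.array([[1.0], [11.0]]), k=3)
print(toy_pred)  # [2. 8.]
```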

In [None]:
# Compute the RMSE between the predictions and the true values
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
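The RMSE is the square root of the mean of the squared differences between true and predicted values. As a sanity check, it can be sketched by hand in numpy; the helper `rmse` and the toy ratings below are assumptions made for illustration.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error between two one-dimensional arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

# One prediction out of four is off by 2 rating points.
toy_rmse = rmse([5, 6, 7, 6], [5, 6, 5, 6])
print(toy_rmse)  # 1.0
```

Because the errors are squared before averaging, a single large error weighs more than several small ones.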

### Feature standardization

In [None]:
from sklearn import preprocessing

# Create a standardizer object and fit it to the training data.
std_scale = preprocessing.StandardScaler().fit(X_train)

# Apply the standardization to the training and the test data.
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
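The `fit` and `transform` steps can be sketched by hand in numpy: the mean and standard deviation of each feature are estimated on the training set, then the same values are used to rescale both sets. The toy arrays below are illustrative assumptions, not values from the wine data.

```python
import numpy as np

# Toy training and test sets: 2 features on very different scales.
X_toy_train = np.array([[10.0, 0.2], [20.0, 0.4], [30.0, 0.6]])
X_toy_test = np.array([[40.0, 0.4]])

# "fit": per-feature statistics, estimated on the training set only
mean = X_toy_train.mean(axis=0)
std = X_toy_train.std(axis=0)

# "transform": the same statistics are applied to both sets
X_toy_train_std = (X_toy_train - mean) / std
X_toy_test_std = (X_toy_test - mean) / std

print(X_toy_train_std.mean(axis=0))  # each entry is close to 0
print(X_toy_train_std.std(axis=0))   # each entry is close to 1
print(X_toy_test_std)                # the test set is rescaled, but not centered on 0
```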

__Question:__ Why did we fit the standardizer (i.e. compute the mean and standard deviation of each feature) on the training set only?

__Answer:__

__Question:__ Visualize the scaled data again to check that the standardization had the intended effect.

In [None]:
# TODO

#### Effect of the feature standardization on the model

__Question:__ Train a new model on the standardized data. Is it better than the one trained on non-standardized data? 

In [None]:
# TODO

## Categorical features

We will work with a data set that describes mushrooms according to the shape of their cap and stalk, their odor, the type of their veil, etc. This data set also contains information on whether a mushroom is edible or not, and that is what we will try to predict.

Data are available as `data/mushrooms.csv`. Let us load them in a pandas DataFrame called `df`.

In [None]:
df = pd.read_csv('data/mushrooms.csv')

Let us look at the first few lines of `df`:

In [None]:
df.head()

As you can see, the features are encoded as _letters_. Each letter corresponds to a category. For example, for the `cap shape` feature, `b` corresponds to a bell cap, `c` to a conical cap, `f` to a flat cap, `k` to a knobbed cap, `s` to a sunken cap, and `x` to a convex cap. For more details about their meaning, you can consult [the documentation of the data set](https://archive.ics.uci.edu/ml/datasets/Mushroom).

#### Direct conversion to numerical attributes
In order to work with this data, we need to convert the categorical attributes into numerical values. Here we will simply convert each letter to an integer between 0 and the number of categories minus one, using scikit-learn's [preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [None]:
from sklearn import preprocessing

labelencoder = preprocessing.LabelEncoder()
for col in df.columns:
    df[col] = labelencoder.fit_transform(df[col])

In [None]:
df.head()
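The mapping from letters to integers follows the sorted order of the distinct values found in each column. A minimal numpy sketch of the same idea, on a made-up list of cap shapes (the array `cap_shapes` is an assumption for illustration):

```python
import numpy as np

cap_shapes = np.array(["x", "b", "s", "x", "c", "f", "k"])

# np.unique returns the sorted distinct values and, for each entry,
# the index of its value in that sorted list.
classes, codes = np.unique(cap_shapes, return_inverse=True)

print(classes)  # ['b' 'c' 'f' 'k' 's' 'x']
print(codes)    # [5 0 4 5 1 2 3]
```

The integers therefore reflect alphabetical order only; they carry no information about how similar two cap shapes are.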

### One-hot encoding

This integer encoding is not necessarily the best choice. For example, an algorithm that uses the Euclidean distance will consider that a convex cap (`x`, converted to 5) is closer to a sunken cap (`s`, converted to 4) than to a conical cap (`c`, converted to 1), although the categories have no natural order. [One-hot encoding](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), which creates one binary feature per category, is a good alternative. However, it has the drawback of increasing the number of features and of creating correlated features.
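The effect on distances can be checked with a minimal numpy sketch. The helpers `int_code` and `one_hot` are hypothetical and written by hand for illustration; they are not scikit-learn functions.

```python
import numpy as np

categories = ["b", "c", "f", "k", "s", "x"]  # cap shapes, in sorted order

def int_code(letter):
    """Integer code of a category: its position in the sorted list."""
    return float(categories.index(letter))

def one_hot(letter):
    """Vector of zeros with a single 1 at the position of the category."""
    vec = np.zeros(len(categories))
    vec[categories.index(letter)] = 1.0
    return vec

# Integer coding: distances depend on the arbitrary ordering of the letters.
print(abs(int_code("x") - int_code("s")))  # 1.0
print(abs(int_code("x") - int_code("c")))  # 4.0

# One-hot coding: any two distinct categories are equally far apart.
d_xs = np.linalg.norm(one_hot("x") - one_hot("s"))
d_xc = np.linalg.norm(one_hot("x") - one_hot("c"))
print(d_xs == d_xc)  # True
```

The price is visible in the sketch as well: a single column with six categories becomes six binary columns.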

In [None]:
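# Optionally load the raw data again: OneHotEncoder accepts the original
# letters as well as the integer codes produced above
#df = pd.read_csv('data/mushrooms.csv')

# Note: this encodes every column of df, including the edibility label
ohe_encoder = preprocessing.OneHotEncoder()
X = ohe_encoder.fit_transform(df)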

In [None]:
X

In [None]:
X.toarray()