{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&ensp;\n",
    "[Home Page](../../START_HERE.ipynb)\n",
    "\n",
    "[Previous Notebook](Challenge.ipynb)\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "[1](Challenge.ipynb)\n",
    "[2]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Challenge - Gene Expression Classification - Workbook\n",
    "\n",
    "\n",
    "### Introduction\n",
    "\n",
    "This notebook walks through an end-to-end GPU machine learning workflow where cuDF is used for processing the data and cuML is used to train machine learning models on it. \n",
    "\n",
    "After completing this excercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hote encoding and even write your own GPU kernels to efficiently transform feature columns. Additionaly you will learn how to pass this data to cuML, and how to train ML models on it. The trained model is saved and it will be used for prediction.\n",
    "\n",
    "It is not required that the user is familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n",
    "\n",
    "### Problem Statement:\n",
    "We are trying to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using machine learning (classification) algorithms. This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. \n",
    "\n",
    "Here is the dataset link: https://www.kaggle.com/crawford/gene-expression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Here is the list of exercises and modules to work on in the lab:\n",
    "\n",
    "- Convert the serial Pandas computations to CuDF operations.\n",
    "- Utilize CuML to accelerate the machine learning models.\n",
    "- Experiment with Dask to create a cluster and distribute the data and scale the operations.\n",
    "\n",
    "You will start writing code from <a href='#dask1'>here</a>, but make sure you execute the data processing blocks to understand the dataset.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Data Processing\n",
    "\n",
    "The first step is downloading the dataset and putting it in the data directory, for using in this tutorial. Download the dataset here, and place it in (host/data) folder. Now we will import the necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NumPy Version: 1.19.2\n",
      "Scikit-Learn Version: 0.23.1\n"
     ]
    }
   ],
   "source": [
    "import numpy as np; print('NumPy Version:', np.__version__)\n",
    "import pandas as pd\n",
    "import sys\n",
    "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n",
    "from sklearn import preprocessing \n",
    "from sklearn.utils import resample\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n",
    "from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n",
    "import cudf\n",
    "import cupy\n",
    "# import for model building\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from cuml.metrics.regression import r2_score\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn import linear_model\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn import model_selection, datasets\n",
    "from cuml.dask.common import utils as dask_utils\n",
    "from dask.distributed import Client, wait\n",
    "from dask_cuda import LocalCUDACluster\n",
    "import dask_cudf\n",
    "from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF\n",
    "from sklearn.ensemble import RandomForestClassifier as sklRF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll read the dataframe into y from the csv file, view its dimensions and observe the first 5 rows of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(72, 2)\n",
      "CPU times: user 4.54 ms, sys: 1.11 ms, total: 5.66 ms\n",
      "Wall time: 5.27 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patient</th>\n",
       "      <th>cancer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   patient cancer\n",
       "0        1    ALL\n",
       "1        2    ALL\n",
       "2        3    ALL\n",
       "3        4    ALL\n",
       "4        5    ALL"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "y = pd.read_csv('../../../data/actual.csv')\n",
    "print(y.shape)\n",
    "y.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's convert our target variable categories to numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "y['cancer'].value_counts()\n",
    "# Recode label to numeric\n",
    "y = y.replace({'ALL':0,'AML':1})\n",
    "labels = ['ALL', 'AML'] # for plotting convenience later on"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the training and test data provided in the challenge from the data folder. View their dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7129, 78)\n",
      "(7129, 70)\n"
     ]
    }
   ],
   "source": [
    "# Import training data\n",
    "df_train = pd.read_csv('../../../data/data_set_ALL_AML_train.csv')\n",
    "print(df_train.shape)\n",
    "\n",
    "# Import testing data\n",
    "df_test = pd.read_csv('../../../data/data_set_ALL_AML_independent.csv')\n",
    "print(df_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe the first few rows of the train dataframe and the data format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Gene Description</th>\n",
       "      <th>Gene Accession Number</th>\n",
       "      <th>1</th>\n",
       "      <th>call</th>\n",
       "      <th>2</th>\n",
       "      <th>call.1</th>\n",
       "      <th>3</th>\n",
       "      <th>call.2</th>\n",
       "      <th>4</th>\n",
       "      <th>call.3</th>\n",
       "      <th>...</th>\n",
       "      <th>29</th>\n",
       "      <th>call.33</th>\n",
       "      <th>30</th>\n",
       "      <th>call.34</th>\n",
       "      <th>31</th>\n",
       "      <th>call.35</th>\n",
       "      <th>32</th>\n",
       "      <th>call.36</th>\n",
       "      <th>33</th>\n",
       "      <th>call.37</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFFX-BioB-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-5_at</td>\n",
       "      <td>-214</td>\n",
       "      <td>A</td>\n",
       "      <td>-139</td>\n",
       "      <td>A</td>\n",
       "      <td>-76</td>\n",
       "      <td>A</td>\n",
       "      <td>-135</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>15</td>\n",
       "      <td>A</td>\n",
       "      <td>-318</td>\n",
       "      <td>A</td>\n",
       "      <td>-32</td>\n",
       "      <td>A</td>\n",
       "      <td>-124</td>\n",
       "      <td>A</td>\n",
       "      <td>-135</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AFFX-BioB-M_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-M_at</td>\n",
       "      <td>-153</td>\n",
       "      <td>A</td>\n",
       "      <td>-73</td>\n",
       "      <td>A</td>\n",
       "      <td>-49</td>\n",
       "      <td>A</td>\n",
       "      <td>-114</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-114</td>\n",
       "      <td>A</td>\n",
       "      <td>-192</td>\n",
       "      <td>A</td>\n",
       "      <td>-49</td>\n",
       "      <td>A</td>\n",
       "      <td>-79</td>\n",
       "      <td>A</td>\n",
       "      <td>-186</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AFFX-BioB-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-3_at</td>\n",
       "      <td>-58</td>\n",
       "      <td>A</td>\n",
       "      <td>-1</td>\n",
       "      <td>A</td>\n",
       "      <td>-307</td>\n",
       "      <td>A</td>\n",
       "      <td>265</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>2</td>\n",
       "      <td>A</td>\n",
       "      <td>-95</td>\n",
       "      <td>A</td>\n",
       "      <td>49</td>\n",
       "      <td>A</td>\n",
       "      <td>-37</td>\n",
       "      <td>A</td>\n",
       "      <td>-70</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AFFX-BioC-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-5_at</td>\n",
       "      <td>88</td>\n",
       "      <td>A</td>\n",
       "      <td>283</td>\n",
       "      <td>A</td>\n",
       "      <td>309</td>\n",
       "      <td>A</td>\n",
       "      <td>12</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>193</td>\n",
       "      <td>A</td>\n",
       "      <td>312</td>\n",
       "      <td>A</td>\n",
       "      <td>230</td>\n",
       "      <td>P</td>\n",
       "      <td>330</td>\n",
       "      <td>A</td>\n",
       "      <td>337</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AFFX-BioC-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-3_at</td>\n",
       "      <td>-295</td>\n",
       "      <td>A</td>\n",
       "      <td>-264</td>\n",
       "      <td>A</td>\n",
       "      <td>-376</td>\n",
       "      <td>A</td>\n",
       "      <td>-419</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-51</td>\n",
       "      <td>A</td>\n",
       "      <td>-139</td>\n",
       "      <td>A</td>\n",
       "      <td>-367</td>\n",
       "      <td>A</td>\n",
       "      <td>-188</td>\n",
       "      <td>A</td>\n",
       "      <td>-407</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 78 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      Gene Description Gene Accession Number    1 call    2  \\\n",
       "0  AFFX-BioB-5_at (endogenous control)        AFFX-BioB-5_at -214    A -139   \n",
       "1  AFFX-BioB-M_at (endogenous control)        AFFX-BioB-M_at -153    A  -73   \n",
       "2  AFFX-BioB-3_at (endogenous control)        AFFX-BioB-3_at  -58    A   -1   \n",
       "3  AFFX-BioC-5_at (endogenous control)        AFFX-BioC-5_at   88    A  283   \n",
       "4  AFFX-BioC-3_at (endogenous control)        AFFX-BioC-3_at -295    A -264   \n",
       "\n",
       "  call.1    3 call.2    4 call.3  ...   29 call.33   30 call.34   31 call.35  \\\n",
       "0      A  -76      A -135      A  ...   15       A -318       A  -32       A   \n",
       "1      A  -49      A -114      A  ... -114       A -192       A  -49       A   \n",
       "2      A -307      A  265      A  ...    2       A  -95       A   49       A   \n",
       "3      A  309      A   12      A  ...  193       A  312       A  230       P   \n",
       "4      A -376      A -419      A  ...  -51       A -139       A -367       A   \n",
       "\n",
       "    32 call.36   33 call.37  \n",
       "0 -124       A -135       A  \n",
       "1  -79       A -186       A  \n",
       "2  -37       A  -70       A  \n",
       "3  330       A  337       A  \n",
       "4 -188       A -407       A  \n",
       "\n",
       "[5 rows x 78 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe the first few rows of the test dataframe and the data format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Gene Description</th>\n",
       "      <th>Gene Accession Number</th>\n",
       "      <th>39</th>\n",
       "      <th>call</th>\n",
       "      <th>40</th>\n",
       "      <th>call.1</th>\n",
       "      <th>42</th>\n",
       "      <th>call.2</th>\n",
       "      <th>47</th>\n",
       "      <th>call.3</th>\n",
       "      <th>...</th>\n",
       "      <th>65</th>\n",
       "      <th>call.29</th>\n",
       "      <th>66</th>\n",
       "      <th>call.30</th>\n",
       "      <th>63</th>\n",
       "      <th>call.31</th>\n",
       "      <th>64</th>\n",
       "      <th>call.32</th>\n",
       "      <th>62</th>\n",
       "      <th>call.33</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFFX-BioB-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-5_at</td>\n",
       "      <td>-342</td>\n",
       "      <td>A</td>\n",
       "      <td>-87</td>\n",
       "      <td>A</td>\n",
       "      <td>22</td>\n",
       "      <td>A</td>\n",
       "      <td>-243</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-62</td>\n",
       "      <td>A</td>\n",
       "      <td>-58</td>\n",
       "      <td>A</td>\n",
       "      <td>-161</td>\n",
       "      <td>A</td>\n",
       "      <td>-48</td>\n",
       "      <td>A</td>\n",
       "      <td>-176</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AFFX-BioB-M_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-M_at</td>\n",
       "      <td>-200</td>\n",
       "      <td>A</td>\n",
       "      <td>-248</td>\n",
       "      <td>A</td>\n",
       "      <td>-153</td>\n",
       "      <td>A</td>\n",
       "      <td>-218</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-198</td>\n",
       "      <td>A</td>\n",
       "      <td>-217</td>\n",
       "      <td>A</td>\n",
       "      <td>-215</td>\n",
       "      <td>A</td>\n",
       "      <td>-531</td>\n",
       "      <td>A</td>\n",
       "      <td>-284</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AFFX-BioB-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-3_at</td>\n",
       "      <td>41</td>\n",
       "      <td>A</td>\n",
       "      <td>262</td>\n",
       "      <td>A</td>\n",
       "      <td>17</td>\n",
       "      <td>A</td>\n",
       "      <td>-163</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-5</td>\n",
       "      <td>A</td>\n",
       "      <td>63</td>\n",
       "      <td>A</td>\n",
       "      <td>-46</td>\n",
       "      <td>A</td>\n",
       "      <td>-124</td>\n",
       "      <td>A</td>\n",
       "      <td>-81</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AFFX-BioC-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-5_at</td>\n",
       "      <td>328</td>\n",
       "      <td>A</td>\n",
       "      <td>295</td>\n",
       "      <td>A</td>\n",
       "      <td>276</td>\n",
       "      <td>A</td>\n",
       "      <td>182</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>141</td>\n",
       "      <td>A</td>\n",
       "      <td>95</td>\n",
       "      <td>A</td>\n",
       "      <td>146</td>\n",
       "      <td>A</td>\n",
       "      <td>431</td>\n",
       "      <td>A</td>\n",
       "      <td>9</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AFFX-BioC-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-3_at</td>\n",
       "      <td>-224</td>\n",
       "      <td>A</td>\n",
       "      <td>-226</td>\n",
       "      <td>A</td>\n",
       "      <td>-211</td>\n",
       "      <td>A</td>\n",
       "      <td>-289</td>\n",
       "      <td>A</td>\n",
       "      <td>...</td>\n",
       "      <td>-256</td>\n",
       "      <td>A</td>\n",
       "      <td>-191</td>\n",
       "      <td>A</td>\n",
       "      <td>-172</td>\n",
       "      <td>A</td>\n",
       "      <td>-496</td>\n",
       "      <td>A</td>\n",
       "      <td>-294</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 70 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      Gene Description Gene Accession Number   39 call   40  \\\n",
       "0  AFFX-BioB-5_at (endogenous control)        AFFX-BioB-5_at -342    A  -87   \n",
       "1  AFFX-BioB-M_at (endogenous control)        AFFX-BioB-M_at -200    A -248   \n",
       "2  AFFX-BioB-3_at (endogenous control)        AFFX-BioB-3_at   41    A  262   \n",
       "3  AFFX-BioC-5_at (endogenous control)        AFFX-BioC-5_at  328    A  295   \n",
       "4  AFFX-BioC-3_at (endogenous control)        AFFX-BioC-3_at -224    A -226   \n",
       "\n",
       "  call.1   42 call.2   47 call.3  ...   65 call.29   66 call.30   63 call.31  \\\n",
       "0      A   22      A -243      A  ...  -62       A  -58       A -161       A   \n",
       "1      A -153      A -218      A  ... -198       A -217       A -215       A   \n",
       "2      A   17      A -163      A  ...   -5       A   63       A  -46       A   \n",
       "3      A  276      A  182      A  ...  141       A   95       A  146       A   \n",
       "4      A -211      A -289      A  ... -256       A -191       A -172       A   \n",
       "\n",
       "    64 call.32   62 call.33  \n",
       "0  -48       A -176       A  \n",
       "1 -531       A -284       A  \n",
       "2 -124       A  -81       A  \n",
       "3  431       A    9       A  \n",
       "4 -496       A -294       A  \n",
       "\n",
       "[5 rows x 70 columns]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the data set has categorical values but only for the columns starting with \"call\". We won't use the columns having categorical values, but remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove \"call\" columns from training and testing data\n",
    "train_to_keep = [col for col in df_train.columns if \"call\" not in col]\n",
    "test_to_keep = [col for col in df_test.columns if \"call\" not in col]\n",
    "\n",
    "X_train_tr = df_train[train_to_keep]\n",
    "X_test_tr = df_test[test_to_keep]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename the columns and reindex for formatting purposes and ease in reading the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_columns_titles = ['Gene Description', 'Gene Accession Number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',\n",
    "       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', \n",
    "       '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']\n",
    "\n",
    "X_train_tr = X_train_tr.reindex(columns=train_columns_titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_columns_titles = ['Gene Description', 'Gene Accession Number','39', '40', '41', '42', '43', '44', '45', '46',\n",
    "       '47', '48', '49', '50', '51', '52', '53',  '54', '55', '56', '57', '58', '59',\n",
    "       '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']\n",
    "\n",
    "X_test_tr = X_test_tr.reindex(columns=test_columns_titles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will take the transpose of the dataframe so that each row is a patient and each column is a gene."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(40, 7129)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>7119</th>\n",
       "      <th>7120</th>\n",
       "      <th>7121</th>\n",
       "      <th>7122</th>\n",
       "      <th>7123</th>\n",
       "      <th>7124</th>\n",
       "      <th>7125</th>\n",
       "      <th>7126</th>\n",
       "      <th>7127</th>\n",
       "      <th>7128</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Gene Description</th>\n",
       "      <td>AFFX-BioB-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-M_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioC-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioDn-5_at (endogenous control)</td>\n",
       "      <td>AFFX-BioDn-3_at (endogenous control)</td>\n",
       "      <td>AFFX-CreX-5_at (endogenous control)</td>\n",
       "      <td>AFFX-CreX-3_at (endogenous control)</td>\n",
       "      <td>AFFX-BioB-5_st (endogenous control)</td>\n",
       "      <td>...</td>\n",
       "      <td>Transcription factor Stat5b (stat5b) mRNA</td>\n",
       "      <td>Breast epithelial antigen BA46 mRNA</td>\n",
       "      <td>GB DEF = Calcium/calmodulin-dependent protein ...</td>\n",
       "      <td>TUBULIN ALPHA-4 CHAIN</td>\n",
       "      <td>CYP4B1 Cytochrome P450; subfamily IVB; polypep...</td>\n",
       "      <td>PTGER3 Prostaglandin E receptor 3 (subtype EP3...</td>\n",
       "      <td>HMG2 High-mobility group (nonhistone chromosom...</td>\n",
       "      <td>RB1 Retinoblastoma 1 (including osteosarcoma)</td>\n",
       "      <td>GB DEF = Glycophorin Sta (type A) exons 3 and ...</td>\n",
       "      <td>GB DEF = mRNA (clone 1A7)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gene Accession Number</th>\n",
       "      <td>AFFX-BioB-5_at</td>\n",
       "      <td>AFFX-BioB-M_at</td>\n",
       "      <td>AFFX-BioB-3_at</td>\n",
       "      <td>AFFX-BioC-5_at</td>\n",
       "      <td>AFFX-BioC-3_at</td>\n",
       "      <td>AFFX-BioDn-5_at</td>\n",
       "      <td>AFFX-BioDn-3_at</td>\n",
       "      <td>AFFX-CreX-5_at</td>\n",
       "      <td>AFFX-CreX-3_at</td>\n",
       "      <td>AFFX-BioB-5_st</td>\n",
       "      <td>...</td>\n",
       "      <td>U48730_at</td>\n",
       "      <td>U58516_at</td>\n",
       "      <td>U73738_at</td>\n",
       "      <td>X06956_at</td>\n",
       "      <td>X16699_at</td>\n",
       "      <td>X83863_at</td>\n",
       "      <td>Z17240_at</td>\n",
       "      <td>L49218_f_at</td>\n",
       "      <td>M71243_f_at</td>\n",
       "      <td>Z78285_f_at</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-214</td>\n",
       "      <td>-153</td>\n",
       "      <td>-58</td>\n",
       "      <td>88</td>\n",
       "      <td>-295</td>\n",
       "      <td>-558</td>\n",
       "      <td>199</td>\n",
       "      <td>-176</td>\n",
       "      <td>252</td>\n",
       "      <td>206</td>\n",
       "      <td>...</td>\n",
       "      <td>185</td>\n",
       "      <td>511</td>\n",
       "      <td>-125</td>\n",
       "      <td>389</td>\n",
       "      <td>-37</td>\n",
       "      <td>793</td>\n",
       "      <td>329</td>\n",
       "      <td>36</td>\n",
       "      <td>191</td>\n",
       "      <td>-37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-139</td>\n",
       "      <td>-73</td>\n",
       "      <td>-1</td>\n",
       "      <td>283</td>\n",
       "      <td>-264</td>\n",
       "      <td>-400</td>\n",
       "      <td>-330</td>\n",
       "      <td>-168</td>\n",
       "      <td>101</td>\n",
       "      <td>74</td>\n",
       "      <td>...</td>\n",
       "      <td>169</td>\n",
       "      <td>837</td>\n",
       "      <td>-36</td>\n",
       "      <td>442</td>\n",
       "      <td>-17</td>\n",
       "      <td>782</td>\n",
       "      <td>295</td>\n",
       "      <td>11</td>\n",
       "      <td>76</td>\n",
       "      <td>-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-76</td>\n",
       "      <td>-49</td>\n",
       "      <td>-307</td>\n",
       "      <td>309</td>\n",
       "      <td>-376</td>\n",
       "      <td>-650</td>\n",
       "      <td>33</td>\n",
       "      <td>-367</td>\n",
       "      <td>206</td>\n",
       "      <td>-215</td>\n",
       "      <td>...</td>\n",
       "      <td>315</td>\n",
       "      <td>1199</td>\n",
       "      <td>33</td>\n",
       "      <td>168</td>\n",
       "      <td>52</td>\n",
       "      <td>1138</td>\n",
       "      <td>777</td>\n",
       "      <td>41</td>\n",
       "      <td>228</td>\n",
       "      <td>-41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 7129 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                      0     \\\n",
       "Gene Description       AFFX-BioB-5_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioB-5_at   \n",
       "1                                                     -214   \n",
       "2                                                     -139   \n",
       "3                                                      -76   \n",
       "\n",
       "                                                      1     \\\n",
       "Gene Description       AFFX-BioB-M_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioB-M_at   \n",
       "1                                                     -153   \n",
       "2                                                      -73   \n",
       "3                                                      -49   \n",
       "\n",
       "                                                      2     \\\n",
       "Gene Description       AFFX-BioB-3_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioB-3_at   \n",
       "1                                                      -58   \n",
       "2                                                       -1   \n",
       "3                                                     -307   \n",
       "\n",
       "                                                      3     \\\n",
       "Gene Description       AFFX-BioC-5_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioC-5_at   \n",
       "1                                                       88   \n",
       "2                                                      283   \n",
       "3                                                      309   \n",
       "\n",
       "                                                      4     \\\n",
       "Gene Description       AFFX-BioC-3_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioC-3_at   \n",
       "1                                                     -295   \n",
       "2                                                     -264   \n",
       "3                                                     -376   \n",
       "\n",
       "                                                       5     \\\n",
       "Gene Description       AFFX-BioDn-5_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioDn-5_at   \n",
       "1                                                      -558   \n",
       "2                                                      -400   \n",
       "3                                                      -650   \n",
       "\n",
       "                                                       6     \\\n",
       "Gene Description       AFFX-BioDn-3_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-BioDn-3_at   \n",
       "1                                                       199   \n",
       "2                                                      -330   \n",
       "3                                                        33   \n",
       "\n",
       "                                                      7     \\\n",
       "Gene Description       AFFX-CreX-5_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-CreX-5_at   \n",
       "1                                                     -176   \n",
       "2                                                     -168   \n",
       "3                                                     -367   \n",
       "\n",
       "                                                      8     \\\n",
       "Gene Description       AFFX-CreX-3_at (endogenous control)   \n",
       "Gene Accession Number                       AFFX-CreX-3_at   \n",
       "1                                                      252   \n",
       "2                                                      101   \n",
       "3                                                      206   \n",
       "\n",
       "                                                      9     ...  \\\n",
       "Gene Description       AFFX-BioB-5_st (endogenous control)  ...   \n",
       "Gene Accession Number                       AFFX-BioB-5_st  ...   \n",
       "1                                                      206  ...   \n",
       "2                                                       74  ...   \n",
       "3                                                     -215  ...   \n",
       "\n",
       "                                                            7119  \\\n",
       "Gene Description       Transcription factor Stat5b (stat5b) mRNA   \n",
       "Gene Accession Number                                  U48730_at   \n",
       "1                                                            185   \n",
       "2                                                            169   \n",
       "3                                                            315   \n",
       "\n",
       "                                                      7120  \\\n",
       "Gene Description       Breast epithelial antigen BA46 mRNA   \n",
       "Gene Accession Number                            U58516_at   \n",
       "1                                                      511   \n",
       "2                                                      837   \n",
       "3                                                     1199   \n",
       "\n",
       "                                                                    7121  \\\n",
       "Gene Description       GB DEF = Calcium/calmodulin-dependent protein ...   \n",
       "Gene Accession Number                                          U73738_at   \n",
       "1                                                                   -125   \n",
       "2                                                                    -36   \n",
       "3                                                                     33   \n",
       "\n",
       "                                        7122  \\\n",
       "Gene Description       TUBULIN ALPHA-4 CHAIN   \n",
       "Gene Accession Number              X06956_at   \n",
       "1                                        389   \n",
       "2                                        442   \n",
       "3                                        168   \n",
       "\n",
       "                                                                    7123  \\\n",
       "Gene Description       CYP4B1 Cytochrome P450; subfamily IVB; polypep...   \n",
       "Gene Accession Number                                          X16699_at   \n",
       "1                                                                    -37   \n",
       "2                                                                    -17   \n",
       "3                                                                     52   \n",
       "\n",
       "                                                                    7124  \\\n",
       "Gene Description       PTGER3 Prostaglandin E receptor 3 (subtype EP3...   \n",
       "Gene Accession Number                                          X83863_at   \n",
       "1                                                                    793   \n",
       "2                                                                    782   \n",
       "3                                                                   1138   \n",
       "\n",
       "                                                                    7125  \\\n",
       "Gene Description       HMG2 High-mobility group (nonhistone chromosom...   \n",
       "Gene Accession Number                                          Z17240_at   \n",
       "1                                                                    329   \n",
       "2                                                                    295   \n",
       "3                                                                    777   \n",
       "\n",
       "                                                                7126  \\\n",
       "Gene Description       RB1 Retinoblastoma 1 (including osteosarcoma)   \n",
       "Gene Accession Number                                    L49218_f_at   \n",
       "1                                                                 36   \n",
       "2                                                                 11   \n",
       "3                                                                 41   \n",
       "\n",
       "                                                                    7127  \\\n",
       "Gene Description       GB DEF = Glycophorin Sta (type A) exons 3 and ...   \n",
       "Gene Accession Number                                        M71243_f_at   \n",
       "1                                                                    191   \n",
       "2                                                                     76   \n",
       "3                                                                    228   \n",
       "\n",
       "                                            7128  \n",
       "Gene Description       GB DEF = mRNA (clone 1A7)  \n",
       "Gene Accession Number                Z78285_f_at  \n",
       "1                                            -37  \n",
       "2                                            -14  \n",
       "3                                            -41  \n",
       "\n",
       "[5 rows x 7129 columns]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train = X_train_tr.T\n",
    "X_test = X_test_tr.T\n",
    "\n",
    "print(X_train.shape) \n",
    "X_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just clearning the data, removing extra columns and converting to numerical values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(38, 7129)\n",
      "(34, 7129)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Gene Accession Number</th>\n",
       "      <th>AFFX-BioB-5_at</th>\n",
       "      <th>AFFX-BioB-M_at</th>\n",
       "      <th>AFFX-BioB-3_at</th>\n",
       "      <th>AFFX-BioC-5_at</th>\n",
       "      <th>AFFX-BioC-3_at</th>\n",
       "      <th>AFFX-BioDn-5_at</th>\n",
       "      <th>AFFX-BioDn-3_at</th>\n",
       "      <th>AFFX-CreX-5_at</th>\n",
       "      <th>AFFX-CreX-3_at</th>\n",
       "      <th>AFFX-BioB-5_st</th>\n",
       "      <th>...</th>\n",
       "      <th>U48730_at</th>\n",
       "      <th>U58516_at</th>\n",
       "      <th>U73738_at</th>\n",
       "      <th>X06956_at</th>\n",
       "      <th>X16699_at</th>\n",
       "      <th>X83863_at</th>\n",
       "      <th>Z17240_at</th>\n",
       "      <th>L49218_f_at</th>\n",
       "      <th>M71243_f_at</th>\n",
       "      <th>Z78285_f_at</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-214</td>\n",
       "      <td>-153</td>\n",
       "      <td>-58</td>\n",
       "      <td>88</td>\n",
       "      <td>-295</td>\n",
       "      <td>-558</td>\n",
       "      <td>199</td>\n",
       "      <td>-176</td>\n",
       "      <td>252</td>\n",
       "      <td>206</td>\n",
       "      <td>...</td>\n",
       "      <td>185</td>\n",
       "      <td>511</td>\n",
       "      <td>-125</td>\n",
       "      <td>389</td>\n",
       "      <td>-37</td>\n",
       "      <td>793</td>\n",
       "      <td>329</td>\n",
       "      <td>36</td>\n",
       "      <td>191</td>\n",
       "      <td>-37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-139</td>\n",
       "      <td>-73</td>\n",
       "      <td>-1</td>\n",
       "      <td>283</td>\n",
       "      <td>-264</td>\n",
       "      <td>-400</td>\n",
       "      <td>-330</td>\n",
       "      <td>-168</td>\n",
       "      <td>101</td>\n",
       "      <td>74</td>\n",
       "      <td>...</td>\n",
       "      <td>169</td>\n",
       "      <td>837</td>\n",
       "      <td>-36</td>\n",
       "      <td>442</td>\n",
       "      <td>-17</td>\n",
       "      <td>782</td>\n",
       "      <td>295</td>\n",
       "      <td>11</td>\n",
       "      <td>76</td>\n",
       "      <td>-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-76</td>\n",
       "      <td>-49</td>\n",
       "      <td>-307</td>\n",
       "      <td>309</td>\n",
       "      <td>-376</td>\n",
       "      <td>-650</td>\n",
       "      <td>33</td>\n",
       "      <td>-367</td>\n",
       "      <td>206</td>\n",
       "      <td>-215</td>\n",
       "      <td>...</td>\n",
       "      <td>315</td>\n",
       "      <td>1199</td>\n",
       "      <td>33</td>\n",
       "      <td>168</td>\n",
       "      <td>52</td>\n",
       "      <td>1138</td>\n",
       "      <td>777</td>\n",
       "      <td>41</td>\n",
       "      <td>228</td>\n",
       "      <td>-41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-135</td>\n",
       "      <td>-114</td>\n",
       "      <td>265</td>\n",
       "      <td>12</td>\n",
       "      <td>-419</td>\n",
       "      <td>-585</td>\n",
       "      <td>158</td>\n",
       "      <td>-253</td>\n",
       "      <td>49</td>\n",
       "      <td>31</td>\n",
       "      <td>...</td>\n",
       "      <td>240</td>\n",
       "      <td>835</td>\n",
       "      <td>218</td>\n",
       "      <td>174</td>\n",
       "      <td>-110</td>\n",
       "      <td>627</td>\n",
       "      <td>170</td>\n",
       "      <td>-50</td>\n",
       "      <td>126</td>\n",
       "      <td>-91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-106</td>\n",
       "      <td>-125</td>\n",
       "      <td>-76</td>\n",
       "      <td>168</td>\n",
       "      <td>-230</td>\n",
       "      <td>-284</td>\n",
       "      <td>4</td>\n",
       "      <td>-122</td>\n",
       "      <td>70</td>\n",
       "      <td>252</td>\n",
       "      <td>...</td>\n",
       "      <td>156</td>\n",
       "      <td>649</td>\n",
       "      <td>57</td>\n",
       "      <td>504</td>\n",
       "      <td>-26</td>\n",
       "      <td>250</td>\n",
       "      <td>314</td>\n",
       "      <td>14</td>\n",
       "      <td>56</td>\n",
       "      <td>-25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 7129 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Gene Accession Number  AFFX-BioB-5_at  AFFX-BioB-M_at  AFFX-BioB-3_at  \\\n",
       "1                                -214            -153             -58   \n",
       "2                                -139             -73              -1   \n",
       "3                                 -76             -49            -307   \n",
       "4                                -135            -114             265   \n",
       "5                                -106            -125             -76   \n",
       "\n",
       "Gene Accession Number  AFFX-BioC-5_at  AFFX-BioC-3_at  AFFX-BioDn-5_at  \\\n",
       "1                                  88            -295             -558   \n",
       "2                                 283            -264             -400   \n",
       "3                                 309            -376             -650   \n",
       "4                                  12            -419             -585   \n",
       "5                                 168            -230             -284   \n",
       "\n",
       "Gene Accession Number  AFFX-BioDn-3_at  AFFX-CreX-5_at  AFFX-CreX-3_at  \\\n",
       "1                                  199            -176             252   \n",
       "2                                 -330            -168             101   \n",
       "3                                   33            -367             206   \n",
       "4                                  158            -253              49   \n",
       "5                                    4            -122              70   \n",
       "\n",
       "Gene Accession Number  AFFX-BioB-5_st  ...  U48730_at  U58516_at  U73738_at  \\\n",
       "1                                 206  ...        185        511       -125   \n",
       "2                                  74  ...        169        837        -36   \n",
       "3                                -215  ...        315       1199         33   \n",
       "4                                  31  ...        240        835        218   \n",
       "5                                 252  ...        156        649         57   \n",
       "\n",
       "Gene Accession Number  X06956_at  X16699_at  X83863_at  Z17240_at  \\\n",
       "1                            389        -37        793        329   \n",
       "2                            442        -17        782        295   \n",
       "3                            168         52       1138        777   \n",
       "4                            174       -110        627        170   \n",
       "5                            504        -26        250        314   \n",
       "\n",
       "Gene Accession Number  L49218_f_at  M71243_f_at  Z78285_f_at  \n",
       "1                               36          191          -37  \n",
       "2                               11           76          -14  \n",
       "3                               41          228          -41  \n",
       "4                              -50          126          -91  \n",
       "5                               14           56          -25  \n",
       "\n",
       "[5 rows x 7129 columns]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Clean up the column names for training and testing data\n",
    "X_train.columns = X_train.iloc[1]\n",
    "X_train = X_train.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n",
    "\n",
    "# Clean up the column names for Testing data\n",
    "X_test.columns = X_test.iloc[1]\n",
    "X_test = X_test.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n",
    "\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "X_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have the 38 patients as rows in the training set, and the other 34 as rows in the testing set. Each of those datasets has 7129 gene expression features. But we haven't yet associated the target labels with the right patients. You will recall that all the labels are all stored in a single dataframe. Let's split the data so that the patients and labels match up across the training and testing dataframes.We are now splitting the data into train and test sets. We will subset the first 38 patient's cancer types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X_train.reset_index(drop=True)\n",
    "y_train = y[y.patient <= 38].reset_index(drop=True)\n",
    "\n",
    "# Subset the rest for testing\n",
    "X_test = X_test.reset_index(drop=True)\n",
    "y_test = y[y.patient > 38].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate descriptive statistics to analyse the data further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Gene Accession Number</th>\n",
       "      <th>AFFX-BioB-5_at</th>\n",
       "      <th>AFFX-BioB-M_at</th>\n",
       "      <th>AFFX-BioB-3_at</th>\n",
       "      <th>AFFX-BioC-5_at</th>\n",
       "      <th>AFFX-BioC-3_at</th>\n",
       "      <th>AFFX-BioDn-5_at</th>\n",
       "      <th>AFFX-BioDn-3_at</th>\n",
       "      <th>AFFX-CreX-5_at</th>\n",
       "      <th>AFFX-CreX-3_at</th>\n",
       "      <th>AFFX-BioB-5_st</th>\n",
       "      <th>...</th>\n",
       "      <th>U48730_at</th>\n",
       "      <th>U58516_at</th>\n",
       "      <th>U73738_at</th>\n",
       "      <th>X06956_at</th>\n",
       "      <th>X16699_at</th>\n",
       "      <th>X83863_at</th>\n",
       "      <th>Z17240_at</th>\n",
       "      <th>L49218_f_at</th>\n",
       "      <th>M71243_f_at</th>\n",
       "      <th>Z78285_f_at</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>38.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>-120.868421</td>\n",
       "      <td>-150.526316</td>\n",
       "      <td>-17.157895</td>\n",
       "      <td>181.394737</td>\n",
       "      <td>-276.552632</td>\n",
       "      <td>-439.210526</td>\n",
       "      <td>-43.578947</td>\n",
       "      <td>-201.184211</td>\n",
       "      <td>99.052632</td>\n",
       "      <td>112.131579</td>\n",
       "      <td>...</td>\n",
       "      <td>178.763158</td>\n",
       "      <td>750.842105</td>\n",
       "      <td>8.815789</td>\n",
       "      <td>399.131579</td>\n",
       "      <td>-20.052632</td>\n",
       "      <td>869.052632</td>\n",
       "      <td>335.842105</td>\n",
       "      <td>19.210526</td>\n",
       "      <td>504.394737</td>\n",
       "      <td>-29.210526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>109.555656</td>\n",
       "      <td>75.734507</td>\n",
       "      <td>117.686144</td>\n",
       "      <td>117.468004</td>\n",
       "      <td>111.004431</td>\n",
       "      <td>135.458412</td>\n",
       "      <td>219.482393</td>\n",
       "      <td>90.838989</td>\n",
       "      <td>83.178397</td>\n",
       "      <td>211.815597</td>\n",
       "      <td>...</td>\n",
       "      <td>84.826830</td>\n",
       "      <td>298.008392</td>\n",
       "      <td>77.108507</td>\n",
       "      <td>469.579868</td>\n",
       "      <td>42.346031</td>\n",
       "      <td>482.366461</td>\n",
       "      <td>209.826766</td>\n",
       "      <td>31.158841</td>\n",
       "      <td>728.744405</td>\n",
       "      <td>30.851132</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-476.000000</td>\n",
       "      <td>-327.000000</td>\n",
       "      <td>-307.000000</td>\n",
       "      <td>-36.000000</td>\n",
       "      <td>-541.000000</td>\n",
       "      <td>-790.000000</td>\n",
       "      <td>-479.000000</td>\n",
       "      <td>-463.000000</td>\n",
       "      <td>-82.000000</td>\n",
       "      <td>-215.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>30.000000</td>\n",
       "      <td>224.000000</td>\n",
       "      <td>-178.000000</td>\n",
       "      <td>36.000000</td>\n",
       "      <td>-112.000000</td>\n",
       "      <td>195.000000</td>\n",
       "      <td>41.000000</td>\n",
       "      <td>-50.000000</td>\n",
       "      <td>-2.000000</td>\n",
       "      <td>-94.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-138.750000</td>\n",
       "      <td>-205.000000</td>\n",
       "      <td>-83.250000</td>\n",
       "      <td>81.250000</td>\n",
       "      <td>-374.250000</td>\n",
       "      <td>-547.000000</td>\n",
       "      <td>-169.000000</td>\n",
       "      <td>-239.250000</td>\n",
       "      <td>36.000000</td>\n",
       "      <td>-47.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>120.000000</td>\n",
       "      <td>575.500000</td>\n",
       "      <td>-42.750000</td>\n",
       "      <td>174.500000</td>\n",
       "      <td>-48.000000</td>\n",
       "      <td>595.250000</td>\n",
       "      <td>232.750000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>-42.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-106.500000</td>\n",
       "      <td>-141.500000</td>\n",
       "      <td>-43.500000</td>\n",
       "      <td>200.000000</td>\n",
       "      <td>-263.000000</td>\n",
       "      <td>-426.500000</td>\n",
       "      <td>-33.500000</td>\n",
       "      <td>-185.500000</td>\n",
       "      <td>99.500000</td>\n",
       "      <td>70.500000</td>\n",
       "      <td>...</td>\n",
       "      <td>174.500000</td>\n",
       "      <td>700.000000</td>\n",
       "      <td>10.500000</td>\n",
       "      <td>266.000000</td>\n",
       "      <td>-18.000000</td>\n",
       "      <td>744.500000</td>\n",
       "      <td>308.500000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>243.500000</td>\n",
       "      <td>-26.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>-68.250000</td>\n",
       "      <td>-94.750000</td>\n",
       "      <td>47.250000</td>\n",
       "      <td>279.250000</td>\n",
       "      <td>-188.750000</td>\n",
       "      <td>-344.750000</td>\n",
       "      <td>79.000000</td>\n",
       "      <td>-144.750000</td>\n",
       "      <td>152.250000</td>\n",
       "      <td>242.750000</td>\n",
       "      <td>...</td>\n",
       "      <td>231.750000</td>\n",
       "      <td>969.500000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>451.750000</td>\n",
       "      <td>9.250000</td>\n",
       "      <td>1112.000000</td>\n",
       "      <td>389.500000</td>\n",
       "      <td>30.250000</td>\n",
       "      <td>487.250000</td>\n",
       "      <td>-11.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>-20.000000</td>\n",
       "      <td>265.000000</td>\n",
       "      <td>392.000000</td>\n",
       "      <td>-51.000000</td>\n",
       "      <td>-155.000000</td>\n",
       "      <td>419.000000</td>\n",
       "      <td>-24.000000</td>\n",
       "      <td>283.000000</td>\n",
       "      <td>561.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>356.000000</td>\n",
       "      <td>1653.000000</td>\n",
       "      <td>218.000000</td>\n",
       "      <td>2527.000000</td>\n",
       "      <td>52.000000</td>\n",
       "      <td>2315.000000</td>\n",
       "      <td>1109.000000</td>\n",
       "      <td>115.000000</td>\n",
       "      <td>3193.000000</td>\n",
       "      <td>36.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 7129 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Gene Accession Number  AFFX-BioB-5_at  AFFX-BioB-M_at  AFFX-BioB-3_at  \\\n",
       "count                       38.000000       38.000000       38.000000   \n",
       "mean                      -120.868421     -150.526316      -17.157895   \n",
       "std                        109.555656       75.734507      117.686144   \n",
       "min                       -476.000000     -327.000000     -307.000000   \n",
       "25%                       -138.750000     -205.000000      -83.250000   \n",
       "50%                       -106.500000     -141.500000      -43.500000   \n",
       "75%                        -68.250000      -94.750000       47.250000   \n",
       "max                         17.000000      -20.000000      265.000000   \n",
       "\n",
       "Gene Accession Number  AFFX-BioC-5_at  AFFX-BioC-3_at  AFFX-BioDn-5_at  \\\n",
       "count                       38.000000       38.000000        38.000000   \n",
       "mean                       181.394737     -276.552632      -439.210526   \n",
       "std                        117.468004      111.004431       135.458412   \n",
       "min                        -36.000000     -541.000000      -790.000000   \n",
       "25%                         81.250000     -374.250000      -547.000000   \n",
       "50%                        200.000000     -263.000000      -426.500000   \n",
       "75%                        279.250000     -188.750000      -344.750000   \n",
       "max                        392.000000      -51.000000      -155.000000   \n",
       "\n",
       "Gene Accession Number  AFFX-BioDn-3_at  AFFX-CreX-5_at  AFFX-CreX-3_at  \\\n",
       "count                        38.000000       38.000000       38.000000   \n",
       "mean                        -43.578947     -201.184211       99.052632   \n",
       "std                         219.482393       90.838989       83.178397   \n",
       "min                        -479.000000     -463.000000      -82.000000   \n",
       "25%                        -169.000000     -239.250000       36.000000   \n",
       "50%                         -33.500000     -185.500000       99.500000   \n",
       "75%                          79.000000     -144.750000      152.250000   \n",
       "max                         419.000000      -24.000000      283.000000   \n",
       "\n",
       "Gene Accession Number  AFFX-BioB-5_st  ...   U48730_at    U58516_at  \\\n",
       "count                       38.000000  ...   38.000000    38.000000   \n",
       "mean                       112.131579  ...  178.763158   750.842105   \n",
       "std                        211.815597  ...   84.826830   298.008392   \n",
       "min                       -215.000000  ...   30.000000   224.000000   \n",
       "25%                        -47.000000  ...  120.000000   575.500000   \n",
       "50%                         70.500000  ...  174.500000   700.000000   \n",
       "75%                        242.750000  ...  231.750000   969.500000   \n",
       "max                        561.000000  ...  356.000000  1653.000000   \n",
       "\n",
       "Gene Accession Number   U73738_at    X06956_at   X16699_at    X83863_at  \\\n",
       "count                   38.000000    38.000000   38.000000    38.000000   \n",
       "mean                     8.815789   399.131579  -20.052632   869.052632   \n",
       "std                     77.108507   469.579868   42.346031   482.366461   \n",
       "min                   -178.000000    36.000000 -112.000000   195.000000   \n",
       "25%                    -42.750000   174.500000  -48.000000   595.250000   \n",
       "50%                     10.500000   266.000000  -18.000000   744.500000   \n",
       "75%                     57.000000   451.750000    9.250000  1112.000000   \n",
       "max                    218.000000  2527.000000   52.000000  2315.000000   \n",
       "\n",
       "Gene Accession Number    Z17240_at  L49218_f_at  M71243_f_at  Z78285_f_at  \n",
       "count                    38.000000    38.000000    38.000000    38.000000  \n",
       "mean                    335.842105    19.210526   504.394737   -29.210526  \n",
       "std                     209.826766    31.158841   728.744405    30.851132  \n",
       "min                      41.000000   -50.000000    -2.000000   -94.000000  \n",
       "25%                     232.750000     8.000000   136.000000   -42.750000  \n",
       "50%                     308.500000    20.000000   243.500000   -26.000000  \n",
       "75%                     389.500000    30.250000   487.250000   -11.500000  \n",
       "max                    1109.000000   115.000000  3193.000000    36.000000  \n",
       "\n",
       "[8 rows x 7129 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly there is some variation in the scales across the different features. Many machine learning models work much better with data that's on the same scale, so let's create a scaled version of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_fl = X_train.astype(float, 64)\n",
    "X_test_fl = X_test.astype(float, 64)\n",
    "\n",
    "# Apply the same scaling to both datasets\n",
    "scaler = StandardScaler()\n",
    "X_train = scaler.fit_transform(X_train_fl)\n",
    "X_test = scaler.transform(X_test_fl) # note that we transform rather than fit_transform"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='dask1'></a>\n",
    "\n",
    "### 2. Conversion to CuDF Dataframe\n",
    "Convert the pandas dataframes to CuDF dataframes to carry out the further CuML tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Modify the code in this cell\n",
    "\n",
    "%%time\n",
    "X_cudf_train = cudf.DataFrame()  #Pass X train dataframe here\n",
    "X_cudf_test = cudf.DataFrame()    #Pass X test dataframe here\n",
    "\n",
    "y_cudf_train = cudf.DataFrame()  #Pass y train dataframe here\n",
    "#y_cudf_test = cudf.Series(y_test.values)   #Pass y test dataframe here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Model Building\n",
    "#### Dask Integration\n",
    "\n",
    "We will try using the Random Forests Classifier  and implement using CuML and Dask."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Start Dask cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Modify the code in this cell\n",
    "\n",
    "# This will use all GPUs on the local host by default\n",
    "cluster = LocalCUDACluster() #Set 1 thread per worker using arguments to cluster\n",
    "c = Client() #Pass the cluster as an argument to Client\n",
    "\n",
    "# Query the client for all connected workers\n",
    "workers = c.has_what().keys()\n",
    "n_workers = len(workers)\n",
    "n_streams = 8 # Performance optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Define Parameters\n",
    "\n",
    "In addition to the number of examples, random forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower `max_depth` values can greatly speed up fitting, though going too low may reduce accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random Forest building parameters\n",
    "max_depth = 12\n",
    "n_bins = 16\n",
    "n_trees = 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Distribute data to worker GPUs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X_train.astype(np.float32)\n",
    "X_test = X_test.astype(np.float32)\n",
    "y_train = y_train.astype(np.int32)\n",
    "y_test = y_test.astype(np.int32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_partitions = n_workers\n",
    "\n",
    "def distribute(X, y):\n",
    "    # First convert to cudf (with real data, you would likely load in cuDF format to start)\n",
    "    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X))\n",
    "    y_cudf = cudf.Series(y)\n",
    "\n",
    "    # Partition with Dask\n",
    "    # In this case, each worker will train on 1/n_partitions fraction of the data\n",
    "    X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)\n",
    "    y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)\n",
    "\n",
    "    # Persist to cache the data in active memory\n",
    "    X_dask, y_dask = \\\n",
    "      dask_utils.persist_across_workers(c, [X_dask, y_dask], workers=workers)\n",
    "    \n",
    "    return X_dask, y_dask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Modify the code in this cell\n",
    "\n",
    "X_train_dask, y_train_dask = distribute() #Pass train data as arguments here\n",
    "X_test_dask, y_test_dask = distribute() #Pass test data as arguments here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create the  Scikit-learn model\n",
    "\n",
    "Since a scikit-learn equivalent to the multi-node multi-GPU K-means in cuML doesn't exist, we will use Dask-ML's implementation for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.48 s, sys: 790 ms, total: 4.27 s\n",
      "Wall time: 3.21 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(max_depth=12, n_estimators=1000, n_jobs=-1)"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Use all avilable CPU cores\n",
    "skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)\n",
    "skl_model.fit(X_train, y_train.iloc[:,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Train the distributed cuML model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Modify the code in this cell\n",
    "\n",
    "%%time\n",
    "\n",
    "cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)\n",
    "cuml_model.fit() # Pass X and y train dask data here\n",
    "\n",
    "wait(cuml_model.rfs) # Allow asynchronous training tasks to finish"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict and check accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Modify the code in this cell\n",
    "\n",
    "skl_y_pred = skl_model.predict(X_test)\n",
    "cuml_y_pred = cuml_model.predict().compute().to_array()  #Pass the X test dask data as argument here\n",
    "\n",
    "# Due to randomness in the algorithm, you may see slight variation in accuracies\n",
    "print(\"SKLearn accuracy:  \", accuracy_score(y_test.iloc[:,1], skl_y_pred))\n",
    "print(\"CuML accuracy:     \", accuracy_score())  #Pass the y test dask data  and predicted values from CuML model as argument here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='ex4'></a><br>\n",
    "\n",
    "### 4. CONCLUSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare the performance of our solution!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Algorithm     | Implementation | Accuracy      | Time | Algorithm     | Implementation | Accuracy      | Time |\n",
    "| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write down your observations and compare the CuML and Scikit learn scores. They should be approximately equal.  We hope that you found this exercise exciting and beneficial in understanding RAPIDS better. Share your highest accuracy and try to use the unique features of RAPIDS for accelerating your data science pipelines. Don't restrict yourself to the previously explained concepts, but use the documentation to apply more models and functions and achieve the best results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.  References\n",
    "\n",
    "\n",
    "\n",
    "<p xmlns:dct=\"http://purl.org/dc/terms/\">\n",
    "  <a rel=\"license\"\n",
    "     href=\"http://creativecommons.org/publicdomain/zero/1.0/\">\n",
    "    <center><img src=\"http://i.creativecommons.org/p/zero/1.0/88x31.png\" style=\"border-style: none;\" alt=\"CC0\"  /></center>\n",
    "  </a>\n",
    " \n",
    "</p>\n",
    "\n",
    "\n",
    "- The dataset is licensed under a CC0: Public Domain license.\n",
    "\n",
    "- Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science 286:531-537. (1999). Published: 1999.10.14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licensing\n",
    "  \n",
    "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Previous Notebook](Challenge.ipynb)\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "[1](Challenge.ipynb)\n",
    "[2]\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "\n",
    "\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&ensp;\n",
    "[Home Page](../../START_HERE.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}