{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Home Page](../../START_HERE.ipynb)\n", "\n", "[Previous Notebook](Challenge.ipynb)\n", "\n", "[1](Challenge.ipynb)\n", "[2]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Challenge - Gene Expression Classification - Workbook\n", "\n", "\n", "### Introduction\n", "\n", "This notebook walks through an end-to-end GPU machine learning workflow in which cuDF processes the data and cuML trains machine learning models on it.\n", "\n", "After completing this exercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hot encoding, and even write your own GPU kernels to efficiently transform feature columns. Additionally, you will learn how to pass this data to cuML and how to train ML models on it. The trained model is saved and later used for prediction.\n", "\n", "You do not need to be familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n", "\n", "### Problem Statement:\n", "We want to classify patients as having acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL) using machine learning classification algorithms. This dataset comes from a proof-of-concept study published in 1999 by Golub et al. The study showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray), and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes.\n", "\n", "The dataset is available at https://www.kaggle.com/crawford/gene-expression."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here is the list of exercises and modules to work on in the lab:\n", "\n", "- Convert the serial pandas computations to cuDF operations.\n", "- Use cuML to accelerate the machine learning models.\n", "- Experiment with Dask to create a cluster, distribute the data, and scale the operations.\n", "\n", "You will start writing code from here, but make sure you execute the data processing cells to understand the dataset.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data Processing\n", "\n", "The first step is to download the dataset from the Kaggle link above and place it in the data folder (host/data) used by this tutorial. Next, we import the necessary libraries." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NumPy Version: 1.19.2\n", "Scikit-Learn Version: 0.23.1\n" ] } ], "source": [ "import numpy as np; print('NumPy Version:', np.__version__)\n", "import pandas as pd\n", "import sys\n", "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n", "from sklearn import preprocessing \n", "from sklearn.utils import resample\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n", "from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n", "import cudf\n", "import cupy\n", "# import for model building\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import mean_squared_error\n", "from cuml.metrics.regression import r2_score\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn import linear_model\n", "from 
sklearn import model_selection, datasets\n", "from cuml.dask.common import utils as dask_utils\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "import dask_cudf\n", "from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF\n", "from sklearn.ensemble import RandomForestClassifier as sklRF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll read the dataframe into y from the csv file, view its dimensions and observe the first 5 rows of the dataframe." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(72, 2)\n", "CPU times: user 4.54 ms, sys: 1.11 ms, total: 5.66 ms\n", "Wall time: 5.27 ms\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " patient cancer\n", "0 1 ALL\n", "1 2 ALL\n", "2 3 ALL\n", "3 4 ALL\n", "4 5 ALL" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "y = pd.read_csv('../../../data/actual.csv')\n", "print(y.shape)\n", "y.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert our target variable categories to numbers." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "y['cancer'].value_counts()\n", "# Recode label to numeric\n", "y = y.replace({'ALL':0,'AML':1})\n", "labels = ['ALL', 'AML'] # for plotting convenience later on" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the training and test data provided in the challenge from the data folder. View their dimensions." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7129, 78)\n", "(7129, 70)\n" ] } ], "source": [ "# Import training data\n", "df_train = pd.read_csv('../../../data/data_set_ALL_AML_train.csv')\n", "print(df_train.shape)\n", "\n", "# Import testing data\n", "df_test = pd.read_csv('../../../data/data_set_ALL_AML_independent.csv')\n", "print(df_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the first few rows of the train dataframe and the data format." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Gene Description Gene Accession Number 1 call 2 \\\n", "0 AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -214 A -139 \n", "1 AFFX-BioB-M_at (endogenous control) AFFX-BioB-M_at -153 A -73 \n", "2 AFFX-BioB-3_at (endogenous control) AFFX-BioB-3_at -58 A -1 \n", "3 AFFX-BioC-5_at (endogenous control) AFFX-BioC-5_at 88 A 283 \n", "4 AFFX-BioC-3_at (endogenous control) AFFX-BioC-3_at -295 A -264 \n", "\n", " call.1 3 call.2 4 call.3 ... 29 call.33 30 call.34 31 call.35 \\\n", "0 A -76 A -135 A ... 15 A -318 A -32 A \n", "1 A -49 A -114 A ... -114 A -192 A -49 A \n", "2 A -307 A 265 A ... 2 A -95 A 49 A \n", "3 A 309 A 12 A ... 193 A 312 A 230 P \n", "4 A -376 A -419 A ... -51 A -139 A -367 A \n", "\n", " 32 call.36 33 call.37 \n", "0 -124 A -135 A \n", "1 -79 A -186 A \n", "2 -37 A -70 A \n", "3 330 A 337 A \n", "4 -188 A -407 A \n", "\n", "[5 rows x 78 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the first few rows of the test dataframe and the data format." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Gene Description Gene Accession Number 39 call 40 \\\n", "0 AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -342 A -87 \n", "1 AFFX-BioB-M_at (endogenous control) AFFX-BioB-M_at -200 A -248 \n", "2 AFFX-BioB-3_at (endogenous control) AFFX-BioB-3_at 41 A 262 \n", "3 AFFX-BioC-5_at (endogenous control) AFFX-BioC-5_at 328 A 295 \n", "4 AFFX-BioC-3_at (endogenous control) AFFX-BioC-3_at -224 A -226 \n", "\n", " call.1 42 call.2 47 call.3 ... 65 call.29 66 call.30 63 call.31 \\\n", "0 A 22 A -243 A ... -62 A -58 A -161 A \n", "1 A -153 A -218 A ... -198 A -217 A -215 A \n", "2 A 17 A -163 A ... -5 A 63 A -46 A \n", "3 A 276 A 182 A ... 141 A 95 A 146 A \n", "4 A -211 A -289 A ... -256 A -191 A -172 A \n", "\n", " 64 call.32 62 call.33 \n", "0 -48 A -176 A \n", "1 -531 A -284 A \n", "2 -124 A -81 A \n", "3 431 A 9 A \n", "4 -496 A -294 A \n", "\n", "[5 rows x 70 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the data set has categorical values but only for the columns starting with \"call\". We won't use the columns having categorical values, but remove them." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# Remove \"call\" columns from training and testing data\n", "train_to_keep = [col for col in df_train.columns if \"call\" not in col]\n", "test_to_keep = [col for col in df_test.columns if \"call\" not in col]\n", "\n", "X_train_tr = df_train[train_to_keep]\n", "X_test_tr = df_test[test_to_keep]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename the columns and reindex for formatting purposes and ease in reading the data." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "train_columns_titles = ['Gene Description', 'Gene Accession Number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',\n", " '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', \n", " '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']\n", "\n", "X_train_tr = X_train_tr.reindex(columns=train_columns_titles)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "test_columns_titles = ['Gene Description', 'Gene Accession Number','39', '40', '41', '42', '43', '44', '45', '46',\n", " '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59',\n", " '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']\n", "\n", "X_test_tr = X_test_tr.reindex(columns=test_columns_titles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will take the transpose of the dataframe so that each row is a patient and each column is a gene." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(40, 7129)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " 0 \\\n", "Gene Description AFFX-BioB-5_at (endogenous control) \n", "Gene Accession Number AFFX-BioB-5_at \n", "1 -214 \n", "2 -139 \n", "3 -76 \n", "\n", " 1 \\\n", "Gene Description AFFX-BioB-M_at (endogenous control) \n", "Gene Accession Number AFFX-BioB-M_at \n", "1 -153 \n", "2 -73 \n", "3 -49 \n", "\n", " 2 \\\n", "Gene Description AFFX-BioB-3_at (endogenous control) \n", "Gene Accession Number AFFX-BioB-3_at \n", "1 -58 \n", "2 -1 \n", "3 -307 \n", "\n", " 3 \\\n", "Gene Description AFFX-BioC-5_at (endogenous control) \n", "Gene Accession Number AFFX-BioC-5_at \n", "1 88 \n", "2 283 \n", "3 309 \n", "\n", " 4 \\\n", "Gene Description AFFX-BioC-3_at (endogenous control) \n", "Gene Accession Number AFFX-BioC-3_at \n", "1 -295 \n", "2 -264 \n", "3 -376 \n", "\n", " 5 \\\n", "Gene Description AFFX-BioDn-5_at (endogenous control) \n", "Gene Accession Number AFFX-BioDn-5_at \n", "1 -558 \n", "2 -400 \n", "3 -650 \n", "\n", " 6 \\\n", "Gene Description AFFX-BioDn-3_at (endogenous control) \n", "Gene Accession Number AFFX-BioDn-3_at \n", "1 199 \n", "2 -330 \n", "3 33 \n", "\n", " 7 \\\n", "Gene Description AFFX-CreX-5_at (endogenous control) \n", "Gene Accession Number AFFX-CreX-5_at \n", "1 -176 \n", "2 -168 \n", "3 -367 \n", "\n", " 8 \\\n", "Gene Description AFFX-CreX-3_at (endogenous control) \n", "Gene Accession Number AFFX-CreX-3_at \n", "1 252 \n", "2 101 \n", "3 206 \n", "\n", " 9 ... \\\n", "Gene Description AFFX-BioB-5_st (endogenous control) ... \n", "Gene Accession Number AFFX-BioB-5_st ... \n", "1 206 ... \n", "2 74 ... \n", "3 -215 ... 
\n", "\n", " 7119 \\\n", "Gene Description Transcription factor Stat5b (stat5b) mRNA \n", "Gene Accession Number U48730_at \n", "1 185 \n", "2 169 \n", "3 315 \n", "\n", " 7120 \\\n", "Gene Description Breast epithelial antigen BA46 mRNA \n", "Gene Accession Number U58516_at \n", "1 511 \n", "2 837 \n", "3 1199 \n", "\n", " 7121 \\\n", "Gene Description GB DEF = Calcium/calmodulin-dependent protein ... \n", "Gene Accession Number U73738_at \n", "1 -125 \n", "2 -36 \n", "3 33 \n", "\n", " 7122 \\\n", "Gene Description TUBULIN ALPHA-4 CHAIN \n", "Gene Accession Number X06956_at \n", "1 389 \n", "2 442 \n", "3 168 \n", "\n", " 7123 \\\n", "Gene Description CYP4B1 Cytochrome P450; subfamily IVB; polypep... \n", "Gene Accession Number X16699_at \n", "1 -37 \n", "2 -17 \n", "3 52 \n", "\n", " 7124 \\\n", "Gene Description PTGER3 Prostaglandin E receptor 3 (subtype EP3... \n", "Gene Accession Number X83863_at \n", "1 793 \n", "2 782 \n", "3 1138 \n", "\n", " 7125 \\\n", "Gene Description HMG2 High-mobility group (nonhistone chromosom... \n", "Gene Accession Number Z17240_at \n", "1 329 \n", "2 295 \n", "3 777 \n", "\n", " 7126 \\\n", "Gene Description RB1 Retinoblastoma 1 (including osteosarcoma) \n", "Gene Accession Number L49218_f_at \n", "1 36 \n", "2 11 \n", "3 41 \n", "\n", " 7127 \\\n", "Gene Description GB DEF = Glycophorin Sta (type A) exons 3 and ... \n", "Gene Accession Number M71243_f_at \n", "1 191 \n", "2 76 \n", "3 228 \n", "\n", " 7128 \n", "Gene Description GB DEF = mRNA (clone 1A7) \n", "Gene Accession Number Z78285_f_at \n", "1 -37 \n", "2 -14 \n", "3 -41 \n", "\n", "[5 rows x 7129 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train = X_train_tr.T\n", "X_test = X_test_tr.T\n", "\n", "print(X_train.shape) \n", "X_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we clean the data: use the gene accession numbers as column names, drop the two descriptive rows, and convert the values to numeric." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(38, 7129)\n", "(34, 7129)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Gene Accession Number AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at \\\n", "1 -214 -153 -58 \n", "2 -139 -73 -1 \n", "3 -76 -49 -307 \n", "4 -135 -114 265 \n", "5 -106 -125 -76 \n", "\n", "Gene Accession Number AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at \\\n", "1 88 -295 -558 \n", "2 283 -264 -400 \n", "3 309 -376 -650 \n", "4 12 -419 -585 \n", "5 168 -230 -284 \n", "\n", "Gene Accession Number AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at \\\n", "1 199 -176 252 \n", "2 -330 -168 101 \n", "3 33 -367 206 \n", "4 158 -253 49 \n", "5 4 -122 70 \n", "\n", "Gene Accession Number AFFX-BioB-5_st ... U48730_at U58516_at U73738_at \\\n", "1 206 ... 185 511 -125 \n", "2 74 ... 169 837 -36 \n", "3 -215 ... 315 1199 33 \n", "4 31 ... 240 835 218 \n", "5 252 ... 156 649 57 \n", "\n", "Gene Accession Number X06956_at X16699_at X83863_at Z17240_at \\\n", "1 389 -37 793 329 \n", "2 442 -17 782 295 \n", "3 168 52 1138 777 \n", "4 174 -110 627 170 \n", "5 504 -26 250 314 \n", "\n", "Gene Accession Number L49218_f_at M71243_f_at Z78285_f_at \n", "1 36 191 -37 \n", "2 11 76 -14 \n", "3 41 228 -41 \n", "4 -50 126 -91 \n", "5 14 56 -25 \n", "\n", "[5 rows x 7129 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Clean up the column names for training and testing data\n", "X_train.columns = X_train.iloc[1]\n", "X_train = X_train.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n", "\n", "# Clean up the column names for Testing data\n", "X_test.columns = X_test.iloc[1]\n", "X_test = X_test.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n", "\n", "print(X_train.shape)\n", "print(X_test.shape)\n", "X_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the 38 patients as rows in the training set, and the other 34 as rows in the testing set. Each of those datasets has 7129 gene expression features. 
But we haven't yet associated the target labels with the right patients. Recall that all the labels are stored in a single dataframe. Let's split the labels so that patients and labels match up across the training and testing dataframes: the cancer types of the first 38 patients go with the training set, and the rest go with the testing set." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# Subset the first 38 patients for training\n", "X_train = X_train.reset_index(drop=True)\n", "y_train = y[y.patient <= 38].reset_index(drop=True)\n", "\n", "# Subset the rest for testing\n", "X_test = X_test.reset_index(drop=True)\n", "y_test = y[y.patient > 38].reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate descriptive statistics to analyse the data further." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Gene Accession Number AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at \\\n", "count 38.000000 38.000000 38.000000 \n", "mean -120.868421 -150.526316 -17.157895 \n", "std 109.555656 75.734507 117.686144 \n", "min -476.000000 -327.000000 -307.000000 \n", "25% -138.750000 -205.000000 -83.250000 \n", "50% -106.500000 -141.500000 -43.500000 \n", "75% -68.250000 -94.750000 47.250000 \n", "max 17.000000 -20.000000 265.000000 \n", "\n", "Gene Accession Number AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at \\\n", "count 38.000000 38.000000 38.000000 \n", "mean 181.394737 -276.552632 -439.210526 \n", "std 117.468004 111.004431 135.458412 \n", "min -36.000000 -541.000000 -790.000000 \n", "25% 81.250000 -374.250000 -547.000000 \n", "50% 200.000000 -263.000000 -426.500000 \n", "75% 279.250000 -188.750000 -344.750000 \n", "max 392.000000 -51.000000 -155.000000 \n", "\n", "Gene Accession Number AFFX-BioDn-3_at AFFX-CreX-5_at AFFX-CreX-3_at \\\n", "count 38.000000 38.000000 38.000000 \n", "mean -43.578947 -201.184211 99.052632 \n", "std 219.482393 90.838989 83.178397 \n", "min -479.000000 -463.000000 -82.000000 \n", "25% -169.000000 -239.250000 36.000000 \n", "50% -33.500000 -185.500000 99.500000 \n", "75% 79.000000 -144.750000 152.250000 \n", "max 419.000000 -24.000000 283.000000 \n", "\n", "Gene Accession Number AFFX-BioB-5_st ... U48730_at U58516_at \\\n", "count 38.000000 ... 38.000000 38.000000 \n", "mean 112.131579 ... 178.763158 750.842105 \n", "std 211.815597 ... 84.826830 298.008392 \n", "min -215.000000 ... 30.000000 224.000000 \n", "25% -47.000000 ... 120.000000 575.500000 \n", "50% 70.500000 ... 174.500000 700.000000 \n", "75% 242.750000 ... 231.750000 969.500000 \n", "max 561.000000 ... 
356.000000 1653.000000 \n", "\n", "Gene Accession Number U73738_at X06956_at X16699_at X83863_at \\\n", "count 38.000000 38.000000 38.000000 38.000000 \n", "mean 8.815789 399.131579 -20.052632 869.052632 \n", "std 77.108507 469.579868 42.346031 482.366461 \n", "min -178.000000 36.000000 -112.000000 195.000000 \n", "25% -42.750000 174.500000 -48.000000 595.250000 \n", "50% 10.500000 266.000000 -18.000000 744.500000 \n", "75% 57.000000 451.750000 9.250000 1112.000000 \n", "max 218.000000 2527.000000 52.000000 2315.000000 \n", "\n", "Gene Accession Number Z17240_at L49218_f_at M71243_f_at Z78285_f_at \n", "count 38.000000 38.000000 38.000000 38.000000 \n", "mean 335.842105 19.210526 504.394737 -29.210526 \n", "std 209.826766 31.158841 728.744405 30.851132 \n", "min 41.000000 -50.000000 -2.000000 -94.000000 \n", "25% 232.750000 8.000000 136.000000 -42.750000 \n", "50% 308.500000 20.000000 243.500000 -26.000000 \n", "75% 389.500000 30.250000 487.250000 -11.500000 \n", "max 1109.000000 115.000000 3193.000000 36.000000 \n", "\n", "[8 rows x 7129 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, the scales vary considerably across the different features. Many machine learning models work much better with data that's on the same scale, so let's create a scaled version of the dataset." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# Cast to 64-bit floats (the second positional argument of astype is `copy`, not a bit width)\n", "X_train_fl = X_train.astype(np.float64)\n", "X_test_fl = X_test.astype(np.float64)\n", "\n", "# Apply the same scaling to both datasets\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train_fl)\n", "X_test = scaler.transform(X_test_fl) # note that we transform rather than fit_transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### 2. 
Conversion to cuDF DataFrame\n", "Convert the training and test data to cuDF DataFrames so that they can be passed to cuML." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Modify the code in this cell (%%time must stay on the first line of the cell)\n", "\n", "X_cudf_train = cudf.DataFrame() #Pass X train dataframe here\n", "X_cudf_test = cudf.DataFrame() #Pass X test dataframe here\n", "\n", "y_cudf_train = cudf.DataFrame() #Pass y train dataframe here\n", "#y_cudf_test = cudf.Series(y_test.values) #Pass y test dataframe here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Model Building\n", "#### Dask Integration\n", "\n", "We will train a random forest classifier using cuML and Dask." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Start Dask cluster" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "# This will use all GPUs on the local host by default\n", "cluster = LocalCUDACluster() #Set 1 thread per worker using arguments to cluster\n", "c = Client() #Pass the cluster as an argument to Client\n", "\n", "# Query the client for all connected workers\n", "workers = c.has_what().keys()\n", "n_workers = len(workers)\n", "n_streams = 8 # Performance optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define Parameters\n", "\n", "In addition to the number of examples, random forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower `max_depth` values can greatly speed up fitting, though going too low may reduce accuracy."
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# Random Forest building parameters\n", "max_depth = 12\n", "n_bins = 16\n", "n_trees = 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Distribute data to worker GPUs" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "X_train = X_train.astype(np.float32)\n", "X_test = X_test.astype(np.float32)\n", "y_train = y_train.astype(np.int32)\n", "y_test = y_test.astype(np.int32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_partitions = n_workers\n", "\n", "def distribute(X, y):\n", "    # First convert to cudf (with real data, you would likely load in cuDF format to start)\n", "    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X))\n", "    y_cudf = cudf.Series(y)\n", "\n", "    # Partition with Dask\n", "    # In this case, each worker will train on 1/n_partitions fraction of the data\n", "    X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)\n", "    y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)\n", "\n", "    # Persist to cache the data in active memory\n", "    X_dask, y_dask = \\\n", "        dask_utils.persist_across_workers(c, [X_dask, y_dask], workers=workers)\n", "\n", "    return X_dask, y_dask" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "X_train_dask, y_train_dask = distribute() #Pass train data as arguments here\n", "X_test_dask, y_test_dask = distribute() #Pass test data as arguments here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create the Scikit-learn model\n", "\n", "As a CPU baseline, we train scikit-learn's `RandomForestClassifier` with the same `max_depth` and number of trees, so that its accuracy and training time can be compared with the distributed cuML model."
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.48 s, sys: 790 ms, total: 4.27 s\n", "Wall time: 3.21 s\n" ] }, { "data": { "text/plain": [ "RandomForestClassifier(max_depth=12, n_estimators=1000, n_jobs=-1)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "# Use all available CPU cores\n", "skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)\n", "skl_model.fit(X_train, y_train.iloc[:,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train the distributed cuML model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Modify the code in this cell (%%time must stay on the first line of the cell)\n", "\n", "cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)\n", "cuml_model.fit() # Pass X and y train dask data here\n", "\n", "wait(cuml_model.rfs) # Allow asynchronous training tasks to finish" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict and check accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "skl_y_pred = skl_model.predict(X_test)\n", "cuml_y_pred = cuml_model.predict().compute().to_array() #Pass the X test dask data as argument here\n", "\n", "# Due to randomness in the algorithm, you may see slight variation in accuracies\n", "print(\"SKLearn accuracy: \", accuracy_score(y_test.iloc[:,1], skl_y_pred))\n", "print(\"CuML accuracy: \", accuracy_score()) #Pass the y test dask data and predicted values from CuML model as argument here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### 4. Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the performance of our solution!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Algorithm | Implementation | Accuracy | Time | Algorithm | Implementation | Accuracy | Time |\n", "| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write down your observations and compare the cuML and scikit-learn scores; they should be approximately equal. We hope that you found this exercise useful for understanding RAPIDS better. Share your highest accuracy, and try to use the features of RAPIDS to accelerate your own data science pipelines. Don't restrict yourself to the concepts explained so far: use the documentation to try other models and functions and get the best results you can." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. References\n", "\n", "\n", "\n", "

\n", " \n", "

\"CC0\"
\n", " \n", " \n", "

\n", "\n", "\n", "- The dataset is licensed under a CC0: Public Domain license.\n", "\n", "- Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science 286:531-537. (1999). Published: 1999.10.14. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", " \n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Previous Notebook](Challenge.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](Challenge.ipynb)\n", "[2]\n", "     \n", "     \n", "     \n", "     \n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }