{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " \n", " \n", " \n", " \n", " \n", " \n", "[Home Page](../../START_HERE.ipynb)\n", "\n", "[Previous Notebook](Challenge.ipynb)\n", " \n", " \n", " \n", " \n", "[1](Challenge.ipynb)\n", "[2]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Challenge - Gene Expression Classification - Workbook\n", "\n", "\n", "### Introduction\n", "\n", "This notebook walks through an end-to-end GPU machine learning workflow where cuDF is used for processing the data and cuML is used to train machine learning models on it. \n", "\n", "After completing this exercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hot encoding, and even write your own GPU kernels to efficiently transform feature columns. Additionally, you will learn how to pass this data to cuML and how to train ML models on it. The trained model is saved and later used for prediction.\n", "\n", "You do not need to be familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n", "\n", "### Problem Statement:\n", "We are trying to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using machine learning (classification) algorithms. This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. \n", "\n", "Here is the dataset link: https://www.kaggle.com/crawford/gene-expression." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here is the list of exercises and modules to work on in the lab:\n", "\n", "- Convert the serial pandas computations to cuDF operations.\n", "- Utilize cuML to accelerate the machine learning models.\n", "- Experiment with Dask to create a cluster, distribute the data, and scale the operations.\n", "\n", "You will start writing code from here, but make sure you execute the data processing blocks to understand the dataset.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data Processing\n", "\n", "The first step is to download the dataset and place it in the data directory for use in this tutorial. Download the dataset from the Kaggle link above and place it in the (host/data) folder. Now we will import the necessary libraries." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NumPy Version: 1.19.2\n", "Scikit-Learn Version: 0.23.1\n" ] } ], "source": [ "import numpy as np; print('NumPy Version:', np.__version__)\n", "import pandas as pd\n", "import sys\n", "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n", "from sklearn import preprocessing \n", "from sklearn.utils import resample\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n", "from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n", "import cudf\n", "import cupy\n", "# import for model building\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import mean_squared_error\n", "from cuml.metrics.regression import r2_score\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn import linear_model\n", "from sklearn import model_selection, datasets\n", "from cuml.dask.common import utils as dask_utils\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "import dask_cudf\n", "from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF\n", "from sklearn.ensemble import RandomForestClassifier as sklRF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll read the CSV file into the dataframe y, view its dimensions, and look at its first 5 rows." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(72, 2)\n", "CPU times: user 4.54 ms, sys: 1.11 ms, total: 5.66 ms\n", "Wall time: 5.27 ms\n" ] }, { "data": { "text/html": [ "
\n",
"|  | patient | cancer |\n",
"|---|---|---|\n",
"| 0 | 1 | ALL |\n",
"| 1 | 2 | ALL |\n",
"| 2 | 3 | ALL |\n",
"| 3 | 4 | ALL |\n",
"| 4 | 5 | ALL |\n",
"
\n",
"|  | Gene Description | Gene Accession Number | 1 | call | 2 | call.1 | 3 | call.2 | 4 | call.3 | ... | 29 | call.33 | 30 | call.34 | 31 | call.35 | 32 | call.36 | 33 | call.37 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | AFFX-BioB-5_at (endogenous control) | AFFX-BioB-5_at | -214 | A | -139 | A | -76 | A | -135 | A | ... | 15 | A | -318 | A | -32 | A | -124 | A | -135 | A |\n",
"| 1 | AFFX-BioB-M_at (endogenous control) | AFFX-BioB-M_at | -153 | A | -73 | A | -49 | A | -114 | A | ... | -114 | A | -192 | A | -49 | A | -79 | A | -186 | A |\n",
"| 2 | AFFX-BioB-3_at (endogenous control) | AFFX-BioB-3_at | -58 | A | -1 | A | -307 | A | 265 | A | ... | 2 | A | -95 | A | 49 | A | -37 | A | -70 | A |\n",
"| 3 | AFFX-BioC-5_at (endogenous control) | AFFX-BioC-5_at | 88 | A | 283 | A | 309 | A | 12 | A | ... | 193 | A | 312 | A | 230 | P | 330 | A | 337 | A |\n",
"| 4 | AFFX-BioC-3_at (endogenous control) | AFFX-BioC-3_at | -295 | A | -264 | A | -376 | A | -419 | A | ... | -51 | A | -139 | A | -367 | A | -188 | A | -407 | A |\n",
"\n",
"5 rows × 78 columns
\n", "\n",
"|  | Gene Description | Gene Accession Number | 39 | call | 40 | call.1 | 42 | call.2 | 47 | call.3 | ... | 65 | call.29 | 66 | call.30 | 63 | call.31 | 64 | call.32 | 62 | call.33 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | AFFX-BioB-5_at (endogenous control) | AFFX-BioB-5_at | -342 | A | -87 | A | 22 | A | -243 | A | ... | -62 | A | -58 | A | -161 | A | -48 | A | -176 | A |\n",
"| 1 | AFFX-BioB-M_at (endogenous control) | AFFX-BioB-M_at | -200 | A | -248 | A | -153 | A | -218 | A | ... | -198 | A | -217 | A | -215 | A | -531 | A | -284 | A |\n",
"| 2 | AFFX-BioB-3_at (endogenous control) | AFFX-BioB-3_at | 41 | A | 262 | A | 17 | A | -163 | A | ... | -5 | A | 63 | A | -46 | A | -124 | A | -81 | A |\n",
"| 3 | AFFX-BioC-5_at (endogenous control) | AFFX-BioC-5_at | 328 | A | 295 | A | 276 | A | 182 | A | ... | 141 | A | 95 | A | 146 | A | 431 | A | 9 | A |\n",
"| 4 | AFFX-BioC-3_at (endogenous control) | AFFX-BioC-3_at | -224 | A | -226 | A | -211 | A | -289 | A | ... | -256 | A | -191 | A | -172 | A | -496 | A | -294 | A |\n",
"\n",
"5 rows × 70 columns
\n", "\n",
"|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 7119 | 7120 | 7121 | 7122 | 7123 | 7124 | 7125 | 7126 | 7127 | 7128 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| Gene Description | AFFX-BioB-5_at (endogenous control) | AFFX-BioB-M_at (endogenous control) | AFFX-BioB-3_at (endogenous control) | AFFX-BioC-5_at (endogenous control) | AFFX-BioC-3_at (endogenous control) | AFFX-BioDn-5_at (endogenous control) | AFFX-BioDn-3_at (endogenous control) | AFFX-CreX-5_at (endogenous control) | AFFX-CreX-3_at (endogenous control) | AFFX-BioB-5_st (endogenous control) | ... | Transcription factor Stat5b (stat5b) mRNA | Breast epithelial antigen BA46 mRNA | GB DEF = Calcium/calmodulin-dependent protein ... | TUBULIN ALPHA-4 CHAIN | CYP4B1 Cytochrome P450; subfamily IVB; polypep... | PTGER3 Prostaglandin E receptor 3 (subtype EP3... | HMG2 High-mobility group (nonhistone chromosom... | RB1 Retinoblastoma 1 (including osteosarcoma) | GB DEF = Glycophorin Sta (type A) exons 3 and ... | GB DEF = mRNA (clone 1A7) |\n",
"| Gene Accession Number | AFFX-BioB-5_at | AFFX-BioB-M_at | AFFX-BioB-3_at | AFFX-BioC-5_at | AFFX-BioC-3_at | AFFX-BioDn-5_at | AFFX-BioDn-3_at | AFFX-CreX-5_at | AFFX-CreX-3_at | AFFX-BioB-5_st | ... | U48730_at | U58516_at | U73738_at | X06956_at | X16699_at | X83863_at | Z17240_at | L49218_f_at | M71243_f_at | Z78285_f_at |\n",
"| 1 | -214 | -153 | -58 | 88 | -295 | -558 | 199 | -176 | 252 | 206 | ... | 185 | 511 | -125 | 389 | -37 | 793 | 329 | 36 | 191 | -37 |\n",
"| 2 | -139 | -73 | -1 | 283 | -264 | -400 | -330 | -168 | 101 | 74 | ... | 169 | 837 | -36 | 442 | -17 | 782 | 295 | 11 | 76 | -14 |\n",
"| 3 | -76 | -49 | -307 | 309 | -376 | -650 | 33 | -367 | 206 | -215 | ... | 315 | 1199 | 33 | 168 | 52 | 1138 | 777 | 41 | 228 | -41 |\n",
"\n",
"5 rows × 7129 columns
\n",
"| Gene Accession Number | AFFX-BioB-5_at | AFFX-BioB-M_at | AFFX-BioB-3_at | AFFX-BioC-5_at | AFFX-BioC-3_at | AFFX-BioDn-5_at | AFFX-BioDn-3_at | AFFX-CreX-5_at | AFFX-CreX-3_at | AFFX-BioB-5_st | ... | U48730_at | U58516_at | U73738_at | X06956_at | X16699_at | X83863_at | Z17240_at | L49218_f_at | M71243_f_at | Z78285_f_at |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | -214 | -153 | -58 | 88 | -295 | -558 | 199 | -176 | 252 | 206 | ... | 185 | 511 | -125 | 389 | -37 | 793 | 329 | 36 | 191 | -37 |\n",
"| 2 | -139 | -73 | -1 | 283 | -264 | -400 | -330 | -168 | 101 | 74 | ... | 169 | 837 | -36 | 442 | -17 | 782 | 295 | 11 | 76 | -14 |\n",
"| 3 | -76 | -49 | -307 | 309 | -376 | -650 | 33 | -367 | 206 | -215 | ... | 315 | 1199 | 33 | 168 | 52 | 1138 | 777 | 41 | 228 | -41 |\n",
"| 4 | -135 | -114 | 265 | 12 | -419 | -585 | 158 | -253 | 49 | 31 | ... | 240 | 835 | 218 | 174 | -110 | 627 | 170 | -50 | 126 | -91 |\n",
"| 5 | -106 | -125 | -76 | 168 | -230 | -284 | 4 | -122 | 70 | 252 | ... | 156 | 649 | 57 | 504 | -26 | 250 | 314 | 14 | 56 | -25 |\n",
"\n",
"5 rows × 7129 columns
\n",
"| Gene Accession Number | AFFX-BioB-5_at | AFFX-BioB-M_at | AFFX-BioB-3_at | AFFX-BioC-5_at | AFFX-BioC-3_at | AFFX-BioDn-5_at | AFFX-BioDn-3_at | AFFX-CreX-5_at | AFFX-CreX-3_at | AFFX-BioB-5_st | ... | U48730_at | U58516_at | U73738_at | X06956_at | X16699_at | X83863_at | Z17240_at | L49218_f_at | M71243_f_at | Z78285_f_at |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| count | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | ... | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 | 38.000000 |\n",
"| mean | -120.868421 | -150.526316 | -17.157895 | 181.394737 | -276.552632 | -439.210526 | -43.578947 | -201.184211 | 99.052632 | 112.131579 | ... | 178.763158 | 750.842105 | 8.815789 | 399.131579 | -20.052632 | 869.052632 | 335.842105 | 19.210526 | 504.394737 | -29.210526 |\n",
"| std | 109.555656 | 75.734507 | 117.686144 | 117.468004 | 111.004431 | 135.458412 | 219.482393 | 90.838989 | 83.178397 | 211.815597 | ... | 84.826830 | 298.008392 | 77.108507 | 469.579868 | 42.346031 | 482.366461 | 209.826766 | 31.158841 | 728.744405 | 30.851132 |\n",
"| min | -476.000000 | -327.000000 | -307.000000 | -36.000000 | -541.000000 | -790.000000 | -479.000000 | -463.000000 | -82.000000 | -215.000000 | ... | 30.000000 | 224.000000 | -178.000000 | 36.000000 | -112.000000 | 195.000000 | 41.000000 | -50.000000 | -2.000000 | -94.000000 |\n",
"| 25% | -138.750000 | -205.000000 | -83.250000 | 81.250000 | -374.250000 | -547.000000 | -169.000000 | -239.250000 | 36.000000 | -47.000000 | ... | 120.000000 | 575.500000 | -42.750000 | 174.500000 | -48.000000 | 595.250000 | 232.750000 | 8.000000 | 136.000000 | -42.750000 |\n",
"| 50% | -106.500000 | -141.500000 | -43.500000 | 200.000000 | -263.000000 | -426.500000 | -33.500000 | -185.500000 | 99.500000 | 70.500000 | ... | 174.500000 | 700.000000 | 10.500000 | 266.000000 | -18.000000 | 744.500000 | 308.500000 | 20.000000 | 243.500000 | -26.000000 |\n",
"| 75% | -68.250000 | -94.750000 | 47.250000 | 279.250000 | -188.750000 | -344.750000 | 79.000000 | -144.750000 | 152.250000 | 242.750000 | ... | 231.750000 | 969.500000 | 57.000000 | 451.750000 | 9.250000 | 1112.000000 | 389.500000 | 30.250000 | 487.250000 | -11.500000 |\n",
"| max | 17.000000 | -20.000000 | 265.000000 | 392.000000 | -51.000000 | -155.000000 | 419.000000 | -24.000000 | 283.000000 | 561.000000 | ... | 356.000000 | 1653.000000 | 218.000000 | 2527.000000 | 52.000000 | 2315.000000 | 1109.000000 | 115.000000 | 3193.000000 | 36.000000 |\n",
"\n",
"8 rows × 7129 columns
\n", "\n",
" \n",
"