{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../../START_HERE.ipynb)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Challenge - Gene Expression Classification\n",
"\n",
"\n",
"### Introduction\n",
"\n",
"This notebook walks through an end-to-end GPU machine learning workflow where cuDF is used for processing the data and cuML is used to train machine learning models on it. \n",
"\n",
"After completing this excercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hote encoding and even write your own GPU kernels to efficiently transform feature columns. Additionaly you will learn how to pass this data to cuML, and how to train ML models on it. The trained model is saved and it will be used for prediction.\n",
"\n",
"It is not required that the user is familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n",
"\n",
"### Problem Statement:\n",
"We are trying to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using machine learning (classification) algorithms. This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. \n",
"\n",
"Here is the dataset link: https://www.kaggle.com/crawford/gene-expression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Here is the list of exercises and modules to work on in the lab:\n",
"\n",
"- Convert the serial Pandas computations to CuDF operations.\n",
"- Utilize CuML to accelerate the machine learning models.\n",
"- Experiment with Dask to create a cluster and distribute the data and scale the operations.\n",
"\n",
"You will start writing code from here, but make sure you execute the data processing blocks to understand the dataset.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Data Processing\n",
"\n",
"The first step is downloading the dataset and putting it in the data directory, for using in this tutorial. Download the dataset here, and place it in (host/data) folder. Now we will import the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np; print('NumPy Version:', np.__version__)\n",
"import pandas as pd\n",
"import sys\n",
"import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n",
"from sklearn import preprocessing \n",
"from sklearn.utils import resample\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_selection import SelectFromModel\n",
"from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n",
"from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n",
"import cudf\n",
"import cupy\n",
"# import for model building\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import mean_squared_error\n",
"from cuml.metrics.regression import r2_score\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn import linear_model\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn import model_selection, datasets\n",
"from cuml.dask.common import utils as dask_utils\n",
"from dask.distributed import Client, wait\n",
"from dask_cuda import LocalCUDACluster\n",
"import dask_cudf\n",
"from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF\n",
"from sklearn.ensemble import RandomForestClassifier as sklRF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll read the dataframe into y from the csv file, view its dimensions and observe the first 5 rows of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"y = pd.read_csv('../../../data/actual.csv')\n",
"print(y.shape)\n",
"y.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's convert our target variable categories to numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y['cancer'].value_counts()\n",
"# Recode label to numeric\n",
"y = y.replace({'ALL':0,'AML':1})\n",
"labels = ['ALL', 'AML'] # for plotting convenience later on"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the training and test data provided in the challenge from the data folder. View their dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import training data\n",
"df_train = pd.read_csv('../../../data/data_set_ALL_AML_train.csv')\n",
"print(df_train.shape)\n",
"\n",
"# Import testing data\n",
"df_test = pd.read_csv('../../../data/data_set_ALL_AML_independent.csv')\n",
"print(df_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe the first few rows of the train dataframe and the data format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe the first few rows of the test dataframe and the data format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the data set has categorical values but only for the columns starting with \"call\". We won't use the columns having categorical values, but remove them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove \"call\" columns from training and testing data\n",
"train_to_keep = [col for col in df_train.columns if \"call\" not in col]\n",
"test_to_keep = [col for col in df_test.columns if \"call\" not in col]\n",
"\n",
"X_train_tr = df_train[train_to_keep]\n",
"X_test_tr = df_test[test_to_keep]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the columns and reindex for formatting purposes and ease in reading the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_columns_titles = ['Gene Description', 'Gene Accession Number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',\n",
" '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', \n",
" '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']\n",
"\n",
"X_train_tr = X_train_tr.reindex(columns=train_columns_titles)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_columns_titles = ['Gene Description', 'Gene Accession Number','39', '40', '41', '42', '43', '44', '45', '46',\n",
" '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59',\n",
" '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']\n",
"\n",
"X_test_tr = X_test_tr.reindex(columns=test_columns_titles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will take the transpose of the dataframe so that each row is a patient and each column is a gene."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = X_train_tr.T\n",
"X_test = X_test_tr.T\n",
"\n",
"print(X_train.shape) \n",
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just clearning the data, removing extra columns and converting to numerical values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clean up the column names for training and testing data\n",
"X_train.columns = X_train.iloc[1]\n",
"X_train = X_train.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n",
"\n",
"# Clean up the column names for Testing data\n",
"X_test.columns = X_test.iloc[1]\n",
"X_test = X_test.drop([\"Gene Description\", \"Gene Accession Number\"]).apply(pd.to_numeric)\n",
"\n",
"print(X_train.shape)\n",
"print(X_test.shape)\n",
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have the 38 patients as rows in the training set, and the other 34 as rows in the testing set. Each of those datasets has 7129 gene expression features. But we haven't yet associated the target labels with the right patients. You will recall that all the labels are all stored in a single dataframe. Let's split the data so that the patients and labels match up across the training and testing dataframes.We are now splitting the data into train and test sets. We will subset the first 38 patient's cancer types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = X_train.reset_index(drop=True)\n",
"y_train = y[y.patient <= 38].reset_index(drop=True)\n",
"\n",
"# Subset the rest for testing\n",
"X_test = X_test.reset_index(drop=True)\n",
"y_test = y[y.patient > 38].reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate descriptive statistics to analyse the data further."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly there is some variation in the scales across the different features. Many machine learning models work much better with data that's on the same scale, so let's create a scaled version of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_fl = X_train.astype(float, 64)\n",
"X_test_fl = X_test.astype(float, 64)\n",
"\n",
"# Apply the same scaling to both datasets\n",
"scaler = StandardScaler()\n",
"X_train = scaler.fit_transform(X_train_fl)\n",
"X_test = scaler.transform(X_test_fl) # note that we transform rather than fit_transform"
]
},
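{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the same standardization can be done without leaving the GPU. The sketch below is illustrative only (it is not part of the challenge pipeline) and assumes your cuML version ships `cuml.preprocessing.StandardScaler`, the GPU counterpart of scikit-learn's scaler; the toy dataframe used here is hypothetical.\n",
"\n",
"```python\n",
"# Illustrative sketch only, on a toy dataframe (not the challenge data).\n",
"# Assumes cuml.preprocessing.StandardScaler is available in your cuML version.\n",
"import cudf\n",
"from cuml.preprocessing import StandardScaler as cuStandardScaler\n",
"\n",
"toy = cudf.DataFrame({'gene_a': [1.0, 2.0, 3.0], 'gene_b': [10.0, 20.0, 30.0]})\n",
"toy_scaled = cuStandardScaler().fit_transform(toy)  # data stays on the GPU\n",
"print(toy_scaled)\n",
"```"
]
},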
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 2. Conversion to CuDF Dataframe\n",
"Convert the pandas dataframes to CuDF dataframes to carry out the further CuML tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"%%time\n",
"X_cudf_train = cudf.DataFrame() #Pass X train dataframe here\n",
"X_cudf_test = cudf.DataFrame() #Pass X test dataframe here\n",
"\n",
"y_cudf_train = cudf.DataFrame() #Pass y train dataframe here\n",
"#y_cudf_test = cudf.Series(y_test.values) #Pass y test dataframe here"
]
},
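{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are unsure of the conversion API, here is a minimal, self-contained sketch using a toy dataframe (hypothetical names, not the challenge data): `cudf.DataFrame.from_pandas` moves a pandas DataFrame to the GPU, and `cudf.Series` wraps host values such as a label column.\n",
"\n",
"```python\n",
"# Illustrative sketch only: moving host (pandas/NumPy) data to cuDF.\n",
"import numpy as np\n",
"import pandas as pd\n",
"import cudf\n",
"\n",
"toy_pdf = pd.DataFrame({'f0': np.random.rand(4), 'f1': np.random.rand(4)})\n",
"toy_labels = pd.Series([0, 1, 0, 1], name='label')\n",
"\n",
"toy_gdf = cudf.DataFrame.from_pandas(toy_pdf)  # pandas DataFrame -> cuDF DataFrame\n",
"toy_gser = cudf.Series(toy_labels.values)      # host values -> cuDF Series\n",
"print(type(toy_gdf), type(toy_gser))\n",
"```"
]
},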
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Model Building\n",
"#### Dask Integration\n",
"\n",
"We will try using the Random Forests Classifier and implement using CuML and Dask."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Start Dask cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"# This will use all GPUs on the local host by default\n",
"cluster = LocalCUDACluster() #Set 1 thread per worker using arguments to cluster\n",
"c = Client() #Pass the cluster as an argument to Client\n",
"\n",
"# Query the client for all connected workers\n",
"workers = c.has_what().keys()\n",
"n_workers = len(workers)\n",
"n_streams = 8 # Performance optimization"
]
},
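{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the usual `dask_cuda` pattern is sketched below. Adapt it in the challenge cell above rather than running it separately, so you do not start a second cluster on the same GPUs.\n",
"\n",
"```python\n",
"# Sketch of the typical local GPU cluster setup with dask_cuda\n",
"from dask.distributed import Client\n",
"from dask_cuda import LocalCUDACluster\n",
"\n",
"cluster = LocalCUDACluster(threads_per_worker=1)  # one worker per GPU, one thread each\n",
"client = Client(cluster)                          # connect a Dask client to the cluster\n",
"print(client.has_what().keys())                   # one entry per connected worker\n",
"```"
]
},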
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Parameters\n",
"\n",
"In addition to the number of examples, random forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower `max_depth` values can greatly speed up fitting, though going too low may reduce accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Random Forest building parameters\n",
"max_depth = 12\n",
"n_bins = 16\n",
"n_trees = 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Distribute data to worker GPUs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = X_train.astype(np.float32)\n",
"X_test = X_test.astype(np.float32)\n",
"y_train = y_train.astype(np.int32)\n",
"y_test = y_test.astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_partitions = n_workers\n",
"\n",
"def distribute(X, y):\n",
" # First convert to cudf (with real data, you would likely load in cuDF format to start)\n",
" X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X))\n",
" y_cudf = cudf.Series(y)\n",
"\n",
" # Partition with Dask\n",
" # In this case, each worker will train on 1/n_partitions fraction of the data\n",
" X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)\n",
" y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)\n",
"\n",
" # Persist to cache the data in active memory\n",
" X_dask, y_dask = \\\n",
" dask_utils.persist_across_workers(c, [X_dask, y_dask], workers=workers)\n",
" \n",
" return X_dask, y_dask"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"X_train_dask, y_train_dask = distribute() #Pass train data as arguments here\n",
"X_test_dask, y_test_dask = distribute() #Pass test data as arguments here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create the Scikit-learn model\n",
"\n",
"Since a scikit-learn equivalent to the multi-node multi-GPU K-means in cuML doesn't exist, we will use Dask-ML's implementation for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"# Use all avilable CPU cores\n",
"skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)\n",
"skl_model.fit(X_train, y_train.iloc[:,1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Train the distributed cuML model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"%%time\n",
"\n",
"cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)\n",
"cuml_model.fit() # Pass X and y train dask data here\n",
"\n",
"wait(cuml_model.rfs) # Allow asynchronous training tasks to finish"
]
},
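{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the general fit pattern for cuML's Dask `RandomForestClassifier` is sketched below with hypothetical `X_dask`/`y_dask` handles (dask_cudf objects distributed as in the cells above), not the challenge variables:\n",
"\n",
"```python\n",
"# General pattern only; X_dask and y_dask are hypothetical dask_cudf handles\n",
"model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)\n",
"model.fit(X_dask, y_dask)  # each worker trains a sub-forest on its data partition\n",
"wait(model.rfs)            # block until the asynchronous training tasks finish\n",
"```"
]
},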
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Predict and check accuracy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"skl_y_pred = skl_model.predict(X_test)\n",
"cuml_y_pred = cuml_model.predict().compute().to_array() #Pass the X test dask data as argument here\n",
"\n",
"# Due to randomness in the algorithm, you may see slight variation in accuracies\n",
"print(\"SKLearn accuracy: \", accuracy_score(y_test.iloc[:,1], skl_y_pred))\n",
"print(\"CuML accuracy: \", accuracy_score()) #Pass the y test dask data and predicted values from CuML model as argument here"
]
},
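{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, distributed prediction follows the pattern sketched below (with a hypothetical `model` and `X_dask` handle): the result is lazy, so it has to be computed and gathered back to the host before it can be scored with scikit-learn's `accuracy_score`.\n",
"\n",
"```python\n",
"# General pattern only; model and X_dask are the hypothetical handles from the sketch above\n",
"preds = model.predict(X_dask)            # distributed, lazy prediction\n",
"preds_host = preds.compute().to_array()  # gather the predictions to host memory\n",
"print(accuracy_score(y_true_host, preds_host))  # y_true_host: the matching host labels\n",
"```"
]
},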
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"### 4. CONCLUSION"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare the performance of our solution!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Algorithm | Implementation | Accuracy | Time | Algorithm | Implementation | Accuracy | Time |\n",
"| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write down your observations and compare the CuML and Scikit learn scores. They should be approximately equal. We hope that you found this exercise exciting and beneficial in understanding RAPIDS better. Share your highest accuracy and try to use the unique features of RAPIDS for accelerating your data science pipelines. Don't restrict yourself to the previously explained concepts, but use the documentation to apply more models and functions and achieve the best results. Jump over to the next notebook for our sample solution.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. References\n",
"\n",
"\n",
"\n",
"
\n",
" \n",
"