{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)\n",
"\n",
"[Previous Notebook](03-CuML_and_Dask.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_Dask.ipynb)\n",
"[2](02-CuDF_and_Dask.ipynb)\n",
"[3](03-CuML_and_Dask.ipynb)\n",
"[4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-Means Challenge\n",
"\n",
"KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed, and this becomes the new centroid.\n",
"\n",
"cuML’s KMeans supports the scalable KMeans++ intialization method. This method is more stable than randomnly selecting K points.\n",
" \n",
"The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input.\n",
"\n",
"For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
"\n",
"For additional information on cuML's k-means implementation: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.KMeans.\n",
"\n",
"The given solution implements CuML on a single GPU. Your task is to convert the entire code using Dask so that it can run on Multi-node, Multi-GPU systems. Your coding task begins here."
]
},
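{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving to cuML, the cell below gives a minimal NumPy sketch of the assign-and-update (Expectation Maximization style) loop described above. It is only an illustration of the idea on a tiny random dataset, not cuML's implementation, and it uses plain random initialization instead of the scalable k-means++ scheme."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def kmeans_sketch(X, k, n_iters=10, seed=0):\n",
"    # Illustrative Lloyd-style iterations with plain random initialization.\n",
"    rng = np.random.default_rng(seed)\n",
"    centroids = X[rng.choice(len(X), size=k, replace=False)]\n",
"    for _ in range(n_iters):\n",
"        # Assignment step: attach every sample to its closest centroid.\n",
"        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)\n",
"        labels = dists.argmin(axis=1)\n",
"        # Update step: each centroid becomes the mean of its assigned samples.\n",
"        for j in range(k):\n",
"            if (labels == j).any():\n",
"                centroids[j] = X[labels == j].mean(axis=0)\n",
"    return centroids, labels\n",
"\n",
"toy = np.random.default_rng(0).normal(size=(200, 2))\n",
"centers, assignments = kmeans_sketch(toy, k=3)\n",
"print(centers)"
]
},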
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n",
"\n",
"Let's begin by importing the libraries necessary for this implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import cupy\n",
"import matplotlib.pyplot as plt\n",
"from cuml.cluster import KMeans as cuKMeans\n",
"from cuml.datasets import make_blobs\n",
"from sklearn.cluster import KMeans as skKMeans\n",
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Parameters\n",
"\n",
"Here we will define the data and model parameters which will be used while generating data and building our model. You can change these parameters and observe the change in the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_samples = 10000\n",
"n_features = 2\n",
"\n",
"n_clusters = 5\n",
"random_state = 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Data\n",
"\n",
"Generate isotropic Gaussian blobs for clustering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"device_data, device_labels = make_blobs(n_samples=n_samples,\n",
" n_features=n_features,\n",
" centers=n_clusters,\n",
" random_state=random_state,\n",
" cluster_std=0.1)\n",
"\n",
"device_data = cudf.DataFrame(device_data)\n",
"device_labels = cudf.Series(device_labels)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copy dataset from GPU memory to host memory.\n",
"# This is done to later compare CPU and GPU results.\n",
"host_data = device_data.to_pandas()\n",
"host_labels = device_labels.to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn model\n",
"\n",
"Here we will use Scikit-learn to define our model. The arguments to the model include:\n",
"\n",
"- n_clusters: int, default=8\n",
"The number of clusters to form as well as the number of centroids to generate.\n",
"\n",
"- init{‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’\n",
"Method for initialization:\n",
"\n",
"- ‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. \n",
"- max_iterint, default=300\n",
"Maximum number of iterations of the k-means algorithm for a single run.\n",
"\n",
"- random_state: int, RandomState instance or None, default=None\n",
"Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. .\n",
"\n",
"- n_jobs: int, default=None\n",
"The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center. None or -1 means using all processors.\n",
"\n",
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"kmeans_sk = skKMeans(init=\"k-means++\",\n",
" n_clusters=n_clusters,\n",
" n_jobs=-1,\n",
" random_state=random_state)\n",
"\n",
"kmeans_sk.fit(host_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cuML Model\n",
"\n",
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"kmeans_cuml = cuKMeans(init=\"k-means||\",\n",
" n_clusters=n_clusters,\n",
" oversampling_factor=40,\n",
" random_state=random_state)\n",
"\n",
"kmeans_cuml.fit(device_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize Centroids\n",
"\n",
"Scikit-learn's k-means implementation uses the `k-means++` initialization strategy while cuML's k-means uses `k-means||`. As a result, the exact centroids found may not be exact as the std deviation of the points around the centroids in `make_blobs` is increased.\n",
"\n",
"*Note*: Visualizing the centroids will only work when `n_features = 2` "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = plt.figure(figsize=(16, 10))\n",
"plt.scatter(host_data.iloc[:, 0], host_data.iloc[:, 1], c=host_labels, s=50, cmap='viridis')\n",
"\n",
"#plot the sklearn kmeans centers with blue filled circles\n",
"centers_sk = kmeans_sk.cluster_centers_\n",
"plt.scatter(centers_sk[:,0], centers_sk[:,1], c='blue', s=100, alpha=.5)\n",
"\n",
"#plot the cuml kmeans centers with red circle outlines\n",
"centers_cuml = kmeans_cuml.cluster_centers_\n",
"plt.scatter(cupy.asnumpy(centers_cuml[0].values), \n",
" cupy.asnumpy(centers_cuml[1].values), \n",
" facecolors = 'none', edgecolors='red', s=100)\n",
"\n",
"plt.title('cuml and sklearn kmeans clustering')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cuml_score = adjusted_rand_score(host_labels, kmeans_cuml.labels_.to_array())\n",
"sk_score = adjusted_rand_score(host_labels, kmeans_sk.labels_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"threshold = 1e-4\n",
"\n",
"passed = (cuml_score - sk_score) < threshold\n",
"print('compare kmeans: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))"
]
},
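{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two sets of labels are compared with the adjusted Rand index rather than element-wise, because k-means may give the same clusters different label IDs between runs. The tiny sketch below (an illustration only, separate from the workflow above) shows that the score is invariant to such relabeling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"# Both label vectors describe the same partition with swapped IDs,\n",
"# so the adjusted Rand index is 1.0 even though the arrays differ element-wise.\n",
"print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))"
]
},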
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"# K-Means Multi-Node Multi-GPU (MNMG) \n",
"\n",
"K-Means multi-Node multi-GPU implementation leverages Dask to spread data and computations across multiple workers. cuML uses One Process Per GPU (OPG) layout, which maps a single Dask worker to each GPU.\n",
"\n",
"The main difference between cuML's MNMG implementation of k-means and the single-GPU is that the fit can be performed in parallel for each iteration, sharing only the centroids between iterations. The MNMG version also provides the same scalable k-means++ initialization algorithm as the single-GPU version.\n",
"\n",
"Unlike the single-GPU implementation, The MNMG k-means API requires a Dask Dataframe or Array as input. `predict()` and `transform()` return the same type as input. The Dask cuDF Dataframe API is very similar to the Dask DataFrame API, but underlying Dataframes are cuDF, rather than Pandas. Dask cuPy arrays are also available.\n",
"\n",
"For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
"\n",
"For additional information on cuML's k-means implementation: \n",
"https://docs.rapids.ai/api/cuml/stable/api.html#cuml.dask.cluster.KMeans."
]
},
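{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the input requirement concrete, the short sketch below shows how a regular cuDF DataFrame can be wrapped into a Dask cuDF DataFrame with `dask_cudf.from_cudf`. The data values and the partition count are arbitrary illustrative choices; the MNMG estimators consume distributed objects like `ddf` instead of plain cuDF DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import dask_cudf\n",
"\n",
"# A small cuDF DataFrame living on a single GPU.\n",
"gdf = cudf.DataFrame({'x': [0.0, 0.1, 5.0, 5.1], 'y': [0.0, 0.2, 5.0, 5.2]})\n",
"\n",
"# Wrap it into a Dask cuDF DataFrame; npartitions=2 is an arbitrary choice for illustration.\n",
"ddf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
"print(ddf.npartitions, type(ddf))"
]
},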
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n",
"\n",
"Let's begin by importing the libraries necessary for this implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cuml.dask.cluster.kmeans import KMeans as cuKMeans\n",
"from cuml.dask.common import to_dask_df\n",
"from cuml.dask.datasets import make_blobs\n",
"from cuml.metrics import adjusted_rand_score\n",
"from dask.distributed import Client, wait\n",
"from dask_cuda import LocalCUDACluster\n",
"from dask_ml.cluster import KMeans as skKMeans\n",
"import cupy as cp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start Dask Cluster\n",
"\n",
"We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). "
]
},
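{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference for the pattern only (not the solution to the challenge cell below), this is how a Dask `Client` attaches to a CPU-only `LocalCluster`. The worker and thread counts are arbitrary illustrative values; in the next cell you should apply the same pattern with `LocalCUDACluster` so that one worker is created per GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client, LocalCluster\n",
"\n",
"# CPU-only illustration of the cluster/client pattern used throughout this notebook.\n",
"cpu_cluster = LocalCluster(n_workers=2, threads_per_worker=1)\n",
"cpu_client = Client(cpu_cluster)\n",
"print(cpu_client)\n",
"\n",
"# Shut the illustrative CPU cluster down so it does not interfere with the GPU cluster created next.\n",
"cpu_client.close()\n",
"cpu_cluster.close()"
]
},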
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"cluster = LocalCUDACluster() # pass threads_per_worker=1 as argument\n",
"client = Client() # pass the cluster object as argument to the Client "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Parameters\n",
"\n",
"Here we will define the data and model parameters which will be used while generating data and building our model. You can change these parameters and observe the change in the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_samples = 100000\n",
"n_features = 2\n",
"\n",
"n_total_partitions = len(list(client.has_what().keys()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Data\n",
"\n",
"Generate isotropic Gaussian blobs for clustering.\n",
"\n",
"### Device\n",
"\n",
"We can generate a Dask cuPY Array of synthetic data for multiple clusters using `cuml.dask.datasets.make_blobs`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_dca, Y_dca = make_blobs(n_samples, \n",
" n_features,\n",
" centers = 5, \n",
" n_parts = n_total_partitions,\n",
" cluster_std=0.1, \n",
" verbose=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Host\n",
"\n",
"We collect the Dask cuPy Array on a single node as a cuPy array. Then we transfer the cuPy array from device to host memory into a Numpy array."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"X_cp = X_dca #add the compute function on X_dca to convert it\n",
"X_np = cp.asnumpy(X_cp)\n",
"del X_cp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn model\n",
"\n",
"The arguments to the model object include:\n",
"\n",
"- n_clusters: int, default=8\n",
"The number of clusters to form as well as the number of centroids to generate.\n",
"\n",
"- init{‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’\n",
"Method for initialization:\n",
"\n",
"- ‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. \n",
"- max_iterint, default=300\n",
"Maximum number of iterations of the k-means algorithm for a single run.\n",
"\n",
"- random_state: int, RandomState instance or None, default=None\n",
"Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. .\n",
"\n",
"- n_jobs: int, default=None\n",
"The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center. None or -1 means using all processors.\n",
"\n",
"### Fit and predict\n",
"\n",
"Since a scikit-learn equivalent to the multi-node multi-GPU K-means in cuML doesn't exist, we will use Dask-ML's implementation for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Modify the code in this cell\n",
"\n",
"%%time\n",
"kmeans_sk = skKMeans(init=\"k-means||\",\n",
" n_clusters=5,\n",
" n_jobs=-1,\n",
" random_state=100)\n",
"\n",
"kmeans_sk.fit() #Pass the Numpy array here as argument"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Modify the code in this cell\n",
"\n",
"%%time\n",
"labels_sk = kmeans_sk.predict().compute() #Pass the Numpy array here as argument to thepredict function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cuML Model\n",
"\n",
"### Fit and predict"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Modify the code in this cell\n",
"\n",
"%%time\n",
"kmeans_cuml = cuKMeans(init=\"k-means||\",\n",
" n_clusters=5,\n",
" random_state=100)\n",
"\n",
"kmeans_cuml.fit() #Pass the Dask array here as argument"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Modify the code in this cell\n",
"\n",
"%%time\n",
"labels_cuml = kmeans_cuml.predict(X_dca).compute() #Pass the Dask array here as argument to the predict function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = adjusted_rand_score(labels_sk, labels_cuml)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"passed = score == 1.0\n",
"print('compare kmeans: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"Write down you observations here and compare the CuML with and without Dask Multi-GPU implementation. If you want to explore Dask in detail, refer to the documentation [here](https://docs.dask.org/en/latest/). If you are unable to solve the problem, jump to the next notebook to find the solution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licensing\n",
" \n",
"This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Previous Notebook](03-CuML_and_Dask.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_Dask.ipynb)\n",
"[2](02-CuDF_and_Dask.ipynb)\n",
"[3](03-CuML_and_Dask.ipynb)\n",
"[4]\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}