|
@@ -0,0 +1,599 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
|
|
|
+ "\n",
|
|
|
+ "# Numba Lab1: Numba For CUDA GPU\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+    "Before we begin, let's execute the cell below to display information about the CUDA driver and the GPUs running on the server, using the `nvidia-smi` command. Give the cell focus by clicking on it with your mouse, then press Ctrl-Enter or the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "!nvidia-smi"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "### Learning Objectives\n",
|
|
|
+ "- **The goal of this lab is to:**\n",
|
|
|
+ " - quickly get you started with Numba from beginner to advanced level\n",
|
|
|
+    "  - teach you how to apply CUDA GPU programming concepts in HPC applications\n",
|
|
|
+ " - show you how to maximize the throughput of your HPC implementation through computational speedup on the GPU. \n",
|
|
|
+ " \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "## Introduction\n",
|
|
|
+    "- Numba is a just-in-time (jit) compiler for Python that works best on code that uses NumPy arrays, functions, and loops. Numba has a set of decorators that can be placed before user-defined functions to determine how they are compiled. \n",
|
|
|
+    "- A decorated function written in Python is compiled into a CUDA kernel to speed up execution; in this way, Numba supports the CUDA GPU programming model. \n",
|
|
|
+    "- A kernel written in Numba automatically has direct access to NumPy arrays, which provides good data visibility between the host (CPU) and the device (GPU). \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "### Definition of Terms\n",
|
|
|
+ "- The CPU is called a **Host**. \n",
|
|
|
+ "- The GPU is called a **Device**.\n",
|
|
|
+    "- A GPU function launched by the host and executed on the device is called a **Kernel**.\n",
|
|
|
+ "- A GPU function executed on the device which can only be called from the device is called a **Device function**.\n",
|
|
|
+ "\n",
|
|
|
+ "### Note\n",
|
|
|
+    "- It is recommended to visit the official NVIDIA documentation and read through the [CUDA C programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide), because most CUDA programming features exposed by Numba map directly to the CUDA C language provided by NVIDIA. \n",
|
|
|
+    "- Numba does not implement the following CUDA features:\n",
|
|
|
+ " - dynamic parallelism\n",
|
|
|
+ " - texture memory\n",
|
|
|
+ "\n",
|
|
|
+ "## CUDA Kernel\n",
|
|
|
+    "- In CUDA, code is executed by hundreds or thousands of threads in a single run, so a solution is modeled on the following thread hierarchy: \n",
|
|
|
+ " - **Grid**: A kernel executed as a collection of blocks \n",
|
|
|
+ " - **Thread Block**: Collection of threads that can communicate via a shared memory. Each thread is executed by a core.\n",
|
|
|
+ " - **Thread**: Single execution units that run kernels on GPU.\n",
|
|
|
+ "- Numba exposes three kinds of GPU memory: \n",
|
|
|
+ " - global device memory \n",
|
|
|
+ " - shared memory \n",
|
|
|
+ " - local memory. \n",
|
|
|
+    "- Memory access should be carefully considered in order to keep bandwidth contention to a minimum; a short sketch of declaring shared memory inside a kernel follows the figures below.\n",
|
|
|
+ "\n",
|
|
|
+ " <img src=\"../images/thread_blocks.JPG\"/> <img src=\"../images/memory_architecture.png\"/> \n",
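+    "\n",
+    "As a quick illustration of the shared memory listed above, here is a minimal sketch (not part of the lab exercises; the kernel name `copy_via_shared` and the block size `TPB` are placeholders) of a kernel that stages data in a per-block buffer created with `cuda.shared.array`, whose shape must be a compile-time constant:\n",
+    "\n",
+    "```python\n",
+    "from numba import cuda, float32\n",
+    "\n",
+    "TPB = 16  # threads per block; also the size of the per-block shared buffer\n",
+    "\n",
+    "@cuda.jit\n",
+    "def copy_via_shared(src, dst):\n",
+    "    tile = cuda.shared.array(shape=TPB, dtype=float32)  # per-block scratch space\n",
+    "    tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
+    "    if tid < src.size:\n",
+    "        tile[cuda.threadIdx.x] = src[tid]  # stage the element in shared memory\n",
+    "    cuda.syncthreads()                     # every thread in the block reaches the barrier\n",
+    "    if tid < src.size:\n",
+    "        dst[tid] = tile[cuda.threadIdx.x]  # write it back out to global memory\n",
+    "```\n",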
|
|
|
+ "\n",
|
|
|
+ "### Kernel Declaration\n",
|
|
|
+    "- A kernel function is a GPU function that is called from CPU code by specifying the number of blocks per grid and threads per block; it cannot explicitly return a value, so results must be written into an array passed to it. \n",
|
|
|
+    "- Once it has been compiled, a kernel can be called multiple times with varying numbers of blocks per grid and threads per block.\n",
|
|
|
+ "\n",
|
|
|
+ "Example:\n",
|
|
|
+ "\n",
|
|
|
+ "```python\n",
|
|
|
+ "@cuda.jit\n",
|
|
|
+ "def arrayAdd(array_A, array_B, array_out):\n",
|
|
|
+ " #...code body ...\n",
|
|
|
+ "```\n",
|
|
|
+ "###### Kernel Invocation\n",
|
|
|
+ "- A kernel is typically launched in the following way:\n",
|
|
|
+ "```python\n",
|
|
|
+ "threadsperblock = 128\n",
|
|
|
+ "N = array_out.size\n",
|
|
|
+    "blockspergrid = (N + (threadsperblock - 1)) // threadsperblock\n",
|
|
|
+ "arrayAdd[blockspergrid, threadsperblock](array_A, array_B, array_out)\n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "###### Choosing Block Size\n",
|
|
|
+ "- The block size determines how many threads share a given area of shared memory.\n",
|
|
|
+    "- The block size must be large enough for full occupation of the execution units; see more details [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) and the query sketch below.\n",
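+    "\n",
+    "A minimal sketch of querying these limits at run time through `cuda.get_current_device()` (assuming a CUDA-capable GPU is visible to Numba):\n",
+    "\n",
+    "```python\n",
+    "import numba.cuda as cuda\n",
+    "\n",
+    "device = cuda.get_current_device()\n",
+    "print(device.MAX_THREADS_PER_BLOCK)   # upper limit for the block size\n",
+    "print(device.WARP_SIZE)               # block sizes work best as a multiple of this\n",
+    "```\n",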
|
|
|
+ "\n",
|
|
|
+ "### Thread Positioning \n",
|
|
|
+    "- When running a kernel, the kernel function’s code is executed by every thread once. Hence it is important to uniquely identify distinct threads.\n",
|
|
|
+ "- The default way to determine a thread position in a grid and block is to manually compute the corresponding array position:\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "<img src=\"../images/thread_position.png\"/>\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "```python\n",
|
|
|
+ "threadsperblock = 128\n",
|
|
|
+ "N = array_out.size\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit\n",
|
|
|
+ "def arrayAdd(array_A, array_B, array_out):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " if tid < N: #Check array boundaries\n",
|
|
|
+ " array_out[tid] = array_A[tid] + array_B[tid]\n",
|
|
|
+ "\n",
|
|
|
+    "# Unless you are sure the total number of launched threads (blocks per grid x threads per block) exactly matches your array size, you must check boundaries as shown in the code block above. \n",
|
|
|
+ "```\n",
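+    "\n",
+    "Numba also provides the convenience function `cuda.grid(ndim)`, which returns the same absolute thread position as the manual computation above; a short equivalent sketch:\n",
+    "\n",
+    "```python\n",
+    "import numba.cuda as cuda\n",
+    "\n",
+    "@cuda.jit\n",
+    "def arrayAdd(array_A, array_B, array_out):\n",
+    "    tid = cuda.grid(1)        # same as threadIdx.x + blockIdx.x * blockDim.x\n",
+    "    if tid < array_out.size:  # check array boundaries\n",
+    "        array_out[tid] = array_A[tid] + array_B[tid]\n",
+    "```\n",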
|
|
|
+ "### Example 1: Addition on 1D-Arrays\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "\n",
|
|
|
+ "N = 500000\n",
|
|
|
+ "threadsperblock = 1000\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit()\n",
|
|
|
+ "def arrayAdd(array_A, array_B, array_out):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " if tid < N:\n",
|
|
|
+ " array_out[tid] = array_A[tid] + array_B[tid]\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ " \n",
|
|
|
+    "array_A = np.arange(N, dtype=np.int32)\n",
|
|
|
+    "array_B = np.arange(N, dtype=np.int32)\n",
|
|
|
+    "array_out = np.zeros(N, dtype=np.int32)\n",
|
|
|
+ "\n",
|
|
|
+    "blockpergrid = (N + (threadsperblock - 1)) // threadsperblock\n",
|
|
|
+ "\n",
|
|
|
+ "arrayAdd[blockpergrid, threadsperblock](array_A, array_B, array_out)\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"result: {} \".format(array_out))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "**From Example 1:** \n",
|
|
|
+    "> - N is the size of the array and the number of threads in a single block is 1000.\n",
|
|
|
+    "> - The **cuda.jit()** decorator indicates that the function below it (arrayAdd) is a device kernel that should run in parallel. The **tid** is a unique index for each thread within the grid: \n",
|
|
|
+ ">> **tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x**.\n",
|
|
|
+    "> - **array_A** and **array_B** are the input arrays, while **array_out** is the output array, preloaded with zeros.\n",
|
|
|
+    "> - The statement **blockpergrid = (N + (threadsperblock - 1)) // threadsperblock** computes the number of blocks per grid. This line of code is commonly used as the default formula for estimating the number of blocks per grid in GPU programming documentation; the sketch after this list shows that it is a ceiling division.\n",
|
|
|
+    "> - **arrayAdd[blockpergrid, threadsperblock](array_A, array_B, array_out)** indicates a call to the kernel function **arrayAdd**, with the number of blocks per grid and the number of threads per block in square brackets and the kernel arguments in round brackets.\n",
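+    "\n",
+    "For reference, a quick sketch (using the same N and threadsperblock values as Example 1) of the equivalence between this expression and `math.ceil`, which the 2D examples below use:\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "\n",
+    "N = 500000\n",
+    "threadsperblock = 1000\n",
+    "blockpergrid = (N + (threadsperblock - 1)) // threadsperblock  # 500\n",
+    "assert blockpergrid == math.ceil(N / threadsperblock)\n",
+    "```\n",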
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "### Matrix multiplication on 2D Array \n",
|
|
|
+ "\n",
|
|
|
+ "<img src=\"../images/2d_array.png\"/>\n",
|
|
|
+ "\n",
|
|
|
+ "<img src=\"../images/2d_col_mult.png\"/>\n",
|
|
|
+ "\n",
|
|
|
+ "> **Note**\n",
|
|
|
+    "> - **Approach 2** would not be possible if the matrix size exceeded the maximum number of threads per block on the device, while **Approach 1** would continue to work. Most recent GPUs have a maximum of 1024 threads per thread block. \n",
|
|
|
+ "\n",
|
|
|
+ "### Example 2: Matrix multiplication "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "import math\n",
|
|
|
+ "\n",
|
|
|
+ "N = 4\n",
|
|
|
+ "@cuda.jit()\n",
|
|
|
+ "def MatrixMul2D(array_A, array_B, array_out):\n",
|
|
|
+ " row, col = cuda.grid(2)\n",
|
|
|
+ " if row < array_out.shape[0] and col < array_out.shape[1]:\n",
|
|
|
+ " for k in range(N):\n",
|
|
|
+ " array_out[row][col]+= array_A[row][k] * array_B[k][col]\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "array_A = np.array([[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]], dtype=np.int32)\n",
|
|
|
+ "array_B = np.array([[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]], dtype=np.int32)\n",
|
|
|
+ "array_out = np.zeros(N*N, dtype=np.int32).reshape(N, N)\n",
|
|
|
+ "\n",
|
|
|
+ "threadsperblock = (2,2)\n",
|
|
|
+ "blockpergrid_x = (math.ceil( N / threadsperblock[0]))\n",
|
|
|
+ "blockpergrid_y = (math.ceil( N / threadsperblock[1]))\n",
|
|
|
+ "blockpergrid = (blockpergrid_x, blockpergrid_y)\n",
|
|
|
+ "\n",
|
|
|
+ "MatrixMul2D[blockpergrid,threadsperblock](array_A, array_B, array_out)\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"array_A:\\n {}\\n\".format(array_A))\n",
|
|
|
+ "print(\"array_B:\\n {}\\n\".format(array_B))\n",
|
|
|
+ "print(\"array_A * array_B:\\n {}\".format(array_out))\n",
|
|
|
+ "\n",
|
|
|
+ "#Note\n",
|
|
|
+    "# cuda.grid(2) returns the absolute thread indices in the X and Y (row & col) directions of the grid\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+    "### Example 3: A 225 × 225 Matrix Multiplication"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+    "import numba.cuda as cuda\n",
+    "import numpy as np\n",
+    "import math\n",
+    "\n",
+    "N = 225\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit()\n",
|
|
|
+ "def MatrixMul2D(array_A, array_B, array_out):\n",
|
|
|
+ " x, y = cuda.grid(2)\n",
|
|
|
+ " if x < array_out.shape[0] and y < array_out.shape[1]:\n",
|
|
|
+ " for k in range(N):\n",
|
|
|
+ " array_out[x][y] += array_A[x][k] * array_B[k][y]\n",
|
|
|
+ "\n",
|
|
|
+ "threadsperblock = (25,25)\n",
|
|
|
+ "array_A = np.arange((N*N), dtype=np.int32).reshape(N,N)\n",
|
|
|
+ "array_B = np.arange((N*N), dtype=np.int32).reshape(N,N)\n",
|
|
|
+ "array_out = np.zeros((N*N), dtype=np.int32).reshape(N,N)\n",
|
|
|
+ "\n",
|
|
|
+ "blockpergrid_x = (math.ceil( N / threadsperblock[0]))\n",
|
|
|
+ "blockpergrid_y = (math.ceil( N / threadsperblock[1]))\n",
|
|
|
+ "blockpergrid = (blockpergrid_x, blockpergrid_y)\n",
|
|
|
+ "\n",
|
|
|
+ "MatrixMul2D[blockpergrid,threadsperblock](array_A, array_B, array_out)\n",
|
|
|
+ "\n",
|
|
|
+ "print(array_out)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Thread reuse \n",
|
|
|
+ "\n",
|
|
|
+    "- It is possible to specify fewer threads than there are data elements so that threads are reused to process the entire data set. This is one of the approaches used when the data to be processed is larger than the maximum number of threads that can be launched on the device. \n",
|
|
|
+    "- Inside a while loop, each thread advances its index by the total number of launched threads: ***tid += cuda.blockDim.x * cuda.gridDim.x***\n",
|
|
|
+    "- An example illustrating thread reuse is given below. In the example, a small number of threads is specified on purpose to demonstrate this approach. \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "#### Example 4: "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "\n",
|
|
|
+ "N = 500000\n",
|
|
|
+ "threadsperblock = 1000\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit\n",
|
|
|
+ "def arrayAdd(array_A, array_B, array_out):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " while tid < N:\n",
|
|
|
+ " array_out[tid] = array_A[tid] + array_B[tid]\n",
|
|
|
+ " tid += cuda.blockDim.x * cuda.gridDim.x\n",
|
|
|
+ "\n",
|
|
|
+ "array_A = np.arange(N, dtype=np.int32)\n",
|
|
|
+ "array_B = np.arange(N, dtype=np.int32)\n",
|
|
|
+ "array_out = np.zeros(N, dtype=np.int32)\n",
|
|
|
+ "\n",
|
|
|
+ "arrayAdd[1, threadsperblock](array_A, array_B, array_out)\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"result: {} \".format(array_out))\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "> **Note**\n",
|
|
|
+    "> - The task in **Example 4** is the same as in **Example 1** but with a limited number of threads specified; nevertheless, the same result is achieved. \n",
|
|
|
+    "> - Note that a launch configuration may still create more threads than required (when the blocks-per-grid formula rounds up, up to one extra block's worth); any excess threads fail the while-loop condition and simply do no work.\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Memory Management\n",
|
|
|
+ "\n",
|
|
|
+ "### Data Transfer \n",
|
|
|
+    "- When a kernel is executed, Numba automatically transfers NumPy arrays to the device and back.\n",
|
|
|
+    "- To avoid unnecessary transfers of read-only arrays, the following APIs can be used to control data movement manually.\n",
|
|
|
+ "\n",
|
|
|
+ "##### 1. Copy host to device\n",
|
|
|
+ "```python\n",
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "\n",
|
|
|
+ "N = 500000\n",
|
|
|
+    "h_A = np.arange(N, dtype=np.int32)\n",
|
|
|
+    "h_B = np.arange(N, dtype=np.int32)\n",
|
|
|
+    "h_C = np.zeros(N, dtype=np.int32)\n",
|
|
|
+ "\n",
|
|
|
+ "d_A = cuda.to_device(h_A)\n",
|
|
|
+ "d_B = cuda.to_device(h_B)\n",
|
|
|
+ "d_C = cuda.to_device(h_C)\n",
|
|
|
+ "```\n",
|
|
|
+ "##### 2. Enqueue the transfer to a stream\n",
|
|
|
+ "```python\n",
|
|
|
+    "h_A = np.arange(N, dtype=np.int32)\n",
|
|
|
+ "stream = cuda.stream()\n",
|
|
|
+ "d_A = cuda.to_device(h_A, stream=stream)\n",
|
|
|
+ "```\n",
|
|
|
+ "##### 3. Copy device to host / enqueue the transfer to a stream \n",
|
|
|
+ "```python\n",
|
|
|
+ "h_C = d_C.copy_to_host()\n",
|
|
|
+ "h_C = d_C.copy_to_host(stream=stream)\n",
|
|
|
+ "```\n",
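+    "##### 4. Allocate output directly on the device (optional)\n",
+    "A minimal sketch, assuming the same `N` as above: `cuda.device_array` creates an uninitialized array that lives only on the device, so no initial host-to-device copy of zeros is needed.\n",
+    "```python\n",
+    "import numba.cuda as cuda\n",
+    "import numpy as np\n",
+    "\n",
+    "N = 500000\n",
+    "d_C = cuda.device_array(N, dtype=np.int32)  # uninitialized, device-only output buffer\n",
+    "# ... launch a kernel that writes every element of d_C ...\n",
+    "h_C = d_C.copy_to_host()\n",
+    "```\n",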
|
|
|
+ "### Example 5: data movement "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "N = 200\n",
|
|
|
+ "threadsperblock = 25\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit\n",
|
|
|
+ "def arrayAdd(d_A, d_B, d_C):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " if tid < N:\n",
|
|
|
+ " d_C[tid] = d_A[tid] + d_B[tid]\n",
|
|
|
+ " \n",
|
|
|
+ "h_A = np.arange(N, dtype=np.int32)\n",
|
|
|
+ "h_B = np.arange(N, dtype=np.int32)\n",
|
|
|
+ "h_C = np.zeros(N, dtype=np.int32)\n",
|
|
|
+ "\n",
|
|
|
+ "d_A = cuda.to_device(h_A)\n",
|
|
|
+ "d_B = cuda.to_device(h_B)\n",
|
|
|
+ "d_C = cuda.to_device(h_C)\n",
|
|
|
+ "\n",
|
|
|
+    "blockpergrid = (N + (threadsperblock - 1)) // threadsperblock\n",
|
|
|
+ "arrayAdd[blockpergrid, threadsperblock](d_A, d_B, d_C)\n",
|
|
|
+ "\n",
|
|
|
+ "h_C = d_C.copy_to_host()\n",
|
|
|
+ "print(h_C)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Atomic Operation\n",
|
|
|
+ "\n",
|
|
|
+    "- Atomic operations are required when multiple threads attempt to modify the same portion of memory. \n",
|
|
|
+    "- Typical examples include simultaneous withdrawals from a bank account through ATMs, or a large number of threads modifying a particular index of an array based on certain condition(s).\n",
|
|
|
+    "- The atomic operations presently implemented by Numba are:\n",
|
|
|
+ "> **import numba.cuda as cuda**\n",
|
|
|
+ "> - cuda.atomic.add(array, index, value)\n",
|
|
|
+ "> - cuda.atomic.min(array, index, value)\n",
|
|
|
+ "> - cuda.atomic.max(array, index, value)\n",
|
|
|
+ "> - cuda.atomic.nanmax(array, index, value)\n",
|
|
|
+ "> - cuda.atomic.nanmin(array, index, value)\n",
|
|
|
+ "> - cuda.atomic.compare_and_swap(array, old_value, current_value)\n",
|
|
|
+ "> - cuda.atomic.sub(array, index, value)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Task ==> sum of an array: [1,2,3,4,5,6,7,8,9,10] in parallel\n",
|
|
|
+    "# Note that threads execute in no guaranteed order\n",
|
|
|
+ "\n",
|
|
|
+    "import numba.cuda as cuda\n",
+    "import numpy as np\n",
+    "\n",
+    "# atomic operation example \n",
|
|
|
+ "size = 10\n",
|
|
|
+ "nthread = 10\n",
|
|
|
+ "@cuda.jit()\n",
|
|
|
+ "def add_atomic(my_array, total):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " cuda.atomic.add(total,0, my_array[tid])\n",
|
|
|
+ "\n",
|
|
|
+ "my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)\n",
|
|
|
+ "total = np.zeros(1, dtype=np.int32)\n",
|
|
|
+ "nblock = int(size / nthread)\n",
|
|
|
+ "add_atomic[nblock, nthread](my_array, total)\n",
|
|
|
+ "print(\"Atomic:\", total)\n",
|
|
|
+ "\n",
|
|
|
+ "######################################################################################\n",
|
|
|
+ "# Non-atomic operation example \n",
|
|
|
+ "size = 10\n",
|
|
|
+ "nthread = 10\n",
|
|
|
+ "@cuda.jit()\n",
|
|
|
+    "def add_nonatomic(my_array, total):\n",
|
|
|
+ " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
|
|
|
+ " total[0] += my_array[tid]\n",
|
|
|
+ " \n",
|
|
|
+ "\n",
|
|
|
+ "my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)\n",
|
|
|
+ "total = np.zeros(1, dtype=np.int32)\n",
|
|
|
+ "nblock = int(size / nthread)\n",
|
|
|
+    "add_nonatomic[nblock, nthread](my_array, total)\n",
|
|
|
+ "print(\"Non atomic: \", total)\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+    "### CUDA Ufuncs\n",
|
|
|
+ "\n",
|
|
|
+ "- The CUDA ufunc supports passing intra-device arrays to reduce traffic over the PCI-express bus. \n",
|
|
|
+    "- It also supports an asynchronous mode via the stream keyword. A short sketch of the intra-device path is given below.\n",
|
|
|
+ "\n",
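+    "A minimal sketch of the intra-device path (assuming the `compute` ufunc and the arrays `A` and `B` defined in the example cell below): when the inputs are device arrays, the result is returned as a device array and must be copied back to the host explicitly.\n",
+    "\n",
+    "```python\n",
+    "import numba.cuda as cuda\n",
+    "\n",
+    "# assumes 'compute', 'A' and 'B' from the example cell below\n",
+    "d_A = cuda.to_device(A)\n",
+    "d_B = cuda.to_device(B)\n",
+    "d_C = compute(d_A, d_B)   # executes on the GPU; the result stays on the device\n",
+    "C = d_C.copy_to_host()    # explicit copy back to the host\n",
+    "```\n",
+    "\n",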
|
|
|
+ "<img src=\"../images/ufunc.png\"/>"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from numba import vectorize\n",
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "\n",
|
|
|
+ "@vectorize(['float32(float32, float32)'],target='cuda')\n",
|
|
|
+ "def compute(a, b):\n",
|
|
|
+ " return (a - b) * (a + b)\n",
|
|
|
+ "\n",
|
|
|
+ "N = 10000\n",
|
|
|
+ "A = np.arange(N , dtype=np.float32)\n",
|
|
|
+ "B = np.arange(N, dtype=np.float32)\n",
|
|
|
+ "C = compute(A, B)\n",
|
|
|
+ "\n",
|
|
|
+ "print(C.reshape(100,100))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Device function\n",
|
|
|
+ "\n",
|
|
|
+    "- CUDA device functions can only be invoked from within the device (by a kernel or another device function) and can return a value like normal functions. A device function is usually defined before the CUDA ufunc that calls it; otherwise the call may not be visible inside the ufunc kernel."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from numba import vectorize\n",
|
|
|
+ "import numba.cuda as cuda\n",
|
|
|
+ "import numpy as np\n",
|
|
|
+ "import math\n",
|
|
|
+ "\n",
|
|
|
+ "@cuda.jit('float32(float32)', device=True, inline=True)\n",
|
|
|
+ "def device_ufunc(c):\n",
|
|
|
+ " return math.sqrt(c)\n",
|
|
|
+ "\n",
|
|
|
+ "@vectorize(['float32(float32, float32)'],target='cuda')\n",
|
|
|
+ "def compute(a, b):\n",
|
|
|
+ " c = (a - b) * (a + b)\n",
|
|
|
+ " return device_ufunc(c)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Summary\n",
|
|
|
+ "\n",
|
|
|
+ "<img src=\"../images/numba_summary1.png\"/>\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+ "## Lab Task\n",
|
|
|
+ "\n",
|
|
|
+    "In this section, you are expected to click on the **Serial Code Lab Assignment** link and proceed to Lab 2. In that lab you will find three Python serial code functions. You are required to revise the **pair_gpu** function to make it run on the GPU, and likewise make a few modifications to the **main** function.\n",
|
|
|
+ "\n",
|
|
|
+ "## <div style=\"text-align:center; color:#FF0000; border:3px solid red;height:80px;\"> <b><br/> [Serial Code Lab Assignment](serial_RDF.ipynb) </b> </div>\n",
|
|
|
+ "\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+ "## Post-Lab Summary\n",
|
|
|
+ "\n",
|
|
|
+    "If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "%%bash\n",
|
|
|
+ "cd ..\n",
|
|
|
+ "rm -f nways_files.zip\n",
|
|
|
+ "zip -r nways_files.zip *"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "\n",
|
|
|
+ "**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip).\n",
|
|
|
+ "\n",
|
|
|
+ "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n",
|
|
|
+ "\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+ "# <p style=\"text-align:center;border:3px; border-style:solid; border-color:#FF0000 ; padding: 1em\"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>\n",
|
|
|
+ "\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "# Links and Resources\n",
|
|
|
+ "\n",
|
|
|
+    "[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)\n",
|
|
|
+ "\n",
|
|
|
+ "[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)\n",
|
|
|
+ "\n",
|
|
|
+    "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n",
|
|
|
+ "\n",
|
|
|
+ "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n",
|
|
|
+ "\n",
|
|
|
+ "---\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "## References\n",
|
|
|
+ "\n",
|
|
|
+ "- Numba Documentation, Release 0.52.0-py3.7-linux-x86_64.egg, Anaconda, Nov 30, 2020.\n",
|
|
|
+ "- Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.\n",
|
|
|
+ "- https://docs.nvidia.com/cuda/cuda-c-programming-guide/\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "--- \n",
|
|
|
+ "\n",
|
|
|
+ "## Licensing \n",
|
|
|
+ "\n",
|
|
|
+ "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "celltoolbar": "Raw Cell Format",
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.8.5"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 4
|
|
|
+}
|