@@ -143,7 +143,7 @@
"outputs": [],
"source": [
"#Compile the code for multicore\n",
- "!cd ../../source_code/openacc && nvfortran -acc -ta=multicore -Minfo=accel -o rdf nvtx.f90 rdf.f90 -I/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
+ "!cd ../../source_code/openacc && nvfortran -acc -ta=multicore -Minfo=accel -o rdf nvtx.f90 rdf.f90 -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
]
},
{
@@ -201,7 +201,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's checkout the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_multicore.qdrep) and open it via the GUI. From the timeline view, checkout the NVTX markers displays as part of threads. **Why are we using NVTX?** Please see the section on [Using NVIDIA Tools Extension (NVTX)](../profiling-c.ipynb#Using-NVIDIA-Tools-Extension-(NVTX)).\n",
+ "Let's check out the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_multicore.qdrep) and open it via the GUI. From the timeline view, check out the NVTX markers displayed as part of the threads. **Why are we using NVTX?** Please see the section on [Using NVIDIA Tools Extension (NVTX)](../../../../../profiler/English/jupyter_notebook/profiling.ipynb#Using-NVIDIA-Tools-Extension-(NVTX)).\n",
"\n",
- "From the timeline view, right click on the nvtx row and click the \"show in events view\". Now you can see the nvtx statistic at the bottom of the window which shows the duration of each range. \n",
+ "From the timeline view, right-click on the NVTX row and click \"show in events view\". Now you can see the NVTX statistics at the bottom of the window, which show the duration of each range.\n",
"\n",
@@ -233,7 +233,7 @@
"outputs": [],
"source": [
"#compile for Tesla GPU\n",
- "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
+ "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
]
},
{
@@ -256,8 +256,7 @@
|
|
|
"\n",
|
|
|
"- Using `-ta=tesla:managed`, instruct the compiler to build for an NVIDIA Tesla GPU using \"CUDA Managed Memory\"\n",
|
|
|
"- Using `-Minfo` command-line option, we will see all output from the compiler. In this example, we use `-Minfo=accel` to only see the output corresponding to the accelerator (in this case an NVIDIA GPU).\n",
|
|
|
- "- The first line of the output, `round(float)`, tells us which function the following information is in reference to.\n",
|
|
|
- "- The line starting with 177, shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang. `99, acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`\n",
|
|
|
+ "- The line starting with 97, shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang. `99, acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`\n",
|
|
|
"- The rest of the information concerns data movement. Compiler detected possible need to move data and handled it for us. We will get into this later in this lab.\n",
|
|
|
"\n",
|
|
|
"It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. Now, let's profile the code."
|
|
@@ -323,7 +322,7 @@
|
|
|
"\n",
|
|
|
"| Compiler | Latest Version | Maintained by | Full or Partial Support |\n",
|
|
|
"| --- | --- | --- | --- |\n",
|
|
|
- "| HPC SDK| 20.11 | NVIDIA HPC SDK | Full 2.5 spec |\n",
|
|
|
+ "| HPC SDK| 21.3 | NVIDIA HPC SDK | Full 2.5 spec |\n",
|
|
|
"| GCC | 10 | Mentor Graphics, SUSE | 2.0 spec, Limited Kernel directive support, No Unified Memory |\n",
|
|
|
"| CCE| latest | Cray | 2.0 Spec | \n"
|
|
|
]
|
|
@@ -365,7 +364,7 @@
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
"#compile for Tesla GPU\n",
|
|
|
- "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
|
|
|
+ "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -398,9 +397,19 @@
|
|
|
"\n",
|
|
|
"If you only replaced the parallel directive with kernels (meaning only wrapping the loop with `!$acc kernels`), then the compiler feedback will look similar to below:\n",
|
|
|
"\n",
|
|
|
- "<img src=\"../images/kernel_feedback.png\">\n",
|
|
|
+ "```\n",
|
|
|
+ "rdf:\n",
|
|
|
+ " 97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]\n",
|
|
|
+ " Generating implicit copy(g(:)) [if not already present]\n",
|
|
|
+ " 99, Loop carried dependence due to exposed use of g(:) prevents parallelization\n",
|
|
|
+ " Accelerator serial kernel generated\n",
|
|
|
+ " Generating Tesla code\n",
|
|
|
+ " 99, !$acc loop seq\n",
|
|
|
+ " 101, !$acc loop seq\n",
|
|
|
+ " 101, Loop carried dependence due to exposed use of g(:) prevents parallelization\n",
|
|
|
+ "```\n",
|
|
|
"\n",
|
|
|
- "The line starting with 179, shows we created a serial kernel and the following loops will run in serial. When we use kernel directives, we let the compiler make decisions for us. In this case, the compiler thinks loop are not safe to parallelise due to dependency.\n",
|
|
|
+ "The line starting with 99, shows we created a serial kernel and the following loops will run in serial. When we use kernel directives, we let the compiler make decisions for us. In this case, the compiler thinks loop are not safe to parallelise due to dependency.\n",
|
|
|
"\n",
|
|
|
"### OpenACC Independent Clause\n",
|
|
|
"\n",
|
|
@@ -426,7 +435,7 @@
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
"#compile for Tesla GPU\n",
|
|
|
- "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
|
|
|
+ "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -435,7 +444,16 @@
|
|
|
"source": [
|
|
|
"Let's inspect the compiler feedback and see if it does what we expect it to do now. You should get a compiler feedback similar to the below:\n",
|
|
|
"\n",
|
|
|
- "<img src=\"../images/kernel_indep_feedback.png\">\n",
|
|
|
+ "```\n",
|
|
|
+ "rdf:\n",
|
|
|
+ " 97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]\n",
|
|
|
+ " Generating implicit copy(g(:)) [if not already present]\n",
|
|
|
+ " 99, Loop is parallelizable\n",
|
|
|
+ " 101, Loop is parallelizable\n",
|
|
|
+ " Generating Tesla code\n",
|
|
|
+ " 99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x\n",
|
|
|
+ " 101, ! blockidx%x threadidx%x auto-collapsed\n",
|
|
|
+ "```\n",
|
|
|
"\n",
|
|
|
"We can see that the compiler knows that the loop is parallelisable (`99, Loop is parallelizable`). Note that the loop is parallelized using vector(128) which that the compiler generated instructions for chunk of data of length 128 (vector size being 128 per gang) `99, acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`\n",
|
|
|
"\n",
|
|
@@ -549,7 +567,7 @@
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
"#compile for Tesla GPU without managed memory\n",
|
|
|
- "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
|
|
|
+ "!cd ../../source_code/openacc && nvfortran -acc -ta=tesla,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -558,9 +576,18 @@
|
|
|
"source": [
|
|
|
"Let us start inspecting the compiler feedback and see if it applied the optimizations. Below is the screenshot of expected compiler feedback after adding the `data` directives. \n",
|
|
|
"\n",
|
|
|
- "<img src=\"../images/data_feedback.png\">\n",
|
|
|
+ "```\n",
|
|
|
+ "rdf:\n",
|
|
|
+ " 95, Generating copy(g(:)) [if not already present]\n",
|
|
|
+ " Generating copyin(y(y$sd8:(y$sd8-1)+y$sd8,y$sd8:(y$sd8-1)+y$sd8),z(z$sd7:(z$sd7-1)+z$sd7,z$sd7:(z$sd7-1)+z$sd7),x(x$sd9:(x$sd9-1)+x$sd9,x$sd9:(x$sd9-1)+x$sd9)) [if not already present]\n",
|
|
|
+ " 98, Generating Tesla code\n",
|
|
|
+ " 99, !$acc loop gang, vector(128) ! blockidx%x threadidx%x\n",
|
|
|
+ " 100, !$acc loop seq\n",
|
|
|
+ " 100, Loop carried dependence of g prevents parallelization\n",
|
|
|
+ " Loop carried backward dependence of g prevents vectorization\n",
|
|
|
+ "```\n",
|
|
|
"\n",
|
|
|
- "You can see that on line 182, compiler is generating default present for `d_g2`, `d_x`,`d_z`, and `d_y` arrays. In other words, it is assuming that data is present on the GPU and it only copies data to the GPU only if the data do not exist.\n",
|
|
|
+ "You can see that on line 95, compiler is generating default present for `g2`, `x`,`z`, and `y` arrays. In other words, it is assuming that data is present on the GPU and it only copies data to the GPU only if the data do not exist.\n",
|
|
|
"\n",
|
|
|
"\n",
|
|
|
"Make sure to validate the output by running the executable and validate the output. "
|