{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, we will optimize the weather simulation application written in Fortran (if you prefer to use C++, click [this link](../../C/jupyter_notebook/profiling-c.ipynb)). \n", "\n", "Let's execute the cell below to display information about the GPUs running on the server by running the nvaccelinfo command, which ships with the NVIDIA HPC compiler that we will be using. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvaccelinfo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NVIDIA Profiler\n", "\n", "### What is profiling\n", "Profiling is the first step in optimizing and tuning your application. Profiling an application would help us understand where most of the execution time is spent. You will gain an understanding of its performance characteristics and can easily identify parts of the code that present opportunities for improvement. Finding hotspots and bottlenecks in your application, can help you decide where to focus our optimization efforts.\n", "\n", "### NVIDIA Nsight Tools\n", "NVIDIA offers Nsight tools (Nsight Systems, Nsight Compute, Nsight Graphics), a collection of applications which enable developers to debug, profile the performance of CUDA, OpenACC, or OpenMP applications. \n", "\n", "Your profiling workflow will change to reflect the individual Nsight tools. Start with Nsight Systems to get a system-level overview of the workload and eliminate any system level bottlenecks, such as unnecessary thread synchronization or data movement, and improve the system level parallelism of your algorithms. 
Once you have done that, proceed to Nsight Compute or Nsight Graphics to optimize the most significant CUDA kernels or graphics workloads, respectively. Periodically return to Nsight Systems to ensure that you remain focused on the largest bottleneck; otherwise the bottleneck may have shifted and your kernel-level optimizations may not achieve as large an improvement as expected.\n", "\n", "- **Nsight Systems**: analyze application algorithms system-wide\n", "- **Nsight Compute**: debug and optimize CUDA kernels \n", "- **Nsight Graphics**: debug and optimize graphics workloads\n", "\n", "\n", "*The data flows between the NVIDIA Nsight tools.*\n", "\n", "This exercise is intended for advanced users, and we focus only on the Nsight Compute tool to profile kernels in the mini application.\n", "\n", "### Introduction to Nsight Compute\n", "NVIDIA Nsight Compute is an interactive kernel profiler for GPU applications that provides detailed performance metrics and API debugging via a user interface and a command-line tool. The NVIDIA Nsight Compute CLI (`nv-nsight-cu-cli`) provides a non-interactive way to profile applications from the command line and can print the results directly on the command line or store them in a report file. Results can then be imported into the GUI version for inspection. With the command-line profiler, you can instrument the target API and collect profile results for the specified kernels or for all of them.\n", "\n", "\n", "\n", "- **Navigating the report via GUI**\n", "The Nsight Compute UI consists of a header with general information, as well as controls to switch between report pages or individual collected kernel launches. By default, the profile report comes up on the *Details* page. You can easily switch between report pages with the dropdown labeled *Page* at the top-left of the page. 
\n", "\n", "\n", "\n", "A report can contain any number of results from kernel launches. The *Launch* dropdown allows switching between the different results in the report.\n", "\n", "\n", "\n", "\n", "- **Sections and Sets**\n", "Nsight Compute uses section sets to decided the amount of metrics to be collected. By default, a relatively small number of metrics is collected such as SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis. You can optionally select which of these sections are collected and displayed with command-line parameters. To learn more about available sections, checkout [this table](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#sections-and-rules).\n", "\n", "\n", "\n", "\n", "Below screenshots show close-up view of example sections in the Nsight Compute profiler. Some of these sections are not collected by default. Learn more about how to collect these metrics in the following section.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "- **Comparing multiple results**\n", "With Nsight Compute GUI, you can create a baseline and compare results against each other. On the *Details* page, press the button *Add Baseline* to make the current report/result, the baseline for all other results from this report and any other report opened in the same instance of Nsight Compute. When a baseline is set, every element on the *Details* page shows two values: The current value of the result in focus and the corresponding value of the baseline or the percentage of change from the corresponding baseline value.\n", "\n", "\n", "\n", "\n", "- **Applying Rules**\n", "Sections on the *Details* page may provide rules. By pressing the *Apply Rules* button on the top of the page, all available rules for the current report is executed. 
\n", "\n", "\n", "\n", "\n", "### Profiling using command line interface \n", "To profile the application, you can either use the Graphical User Interface(GUI) or Command Line Interface (CLI). During this lab, we will profile the mini application using CLI. The Nsight Compute command line executable is named `nv-nsight-cu-cli`. To collect the default set of data for all kernel launches in the application, run:\n", "\n", "`nv-nsight-cu-cli ./miniWeather`\n", "\n", "For all kernel invocations in the application code, details page data will be gathered and displayed. Example screenshot shows major sections (highlighted in yellow) for SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis for the example kernel function *set_halo_values_x_409_gpu(double*)* (annotated with red line). \n", "\n", "\n", "\n", " \n", "\n", "You can optionally select which of these sections are collected and displayed with command-line parameters. Simply run `nv-nsight-cu-cli --list-sets` from the command line to see list of available sets. \n", "\n", "\n", "\n", "\n", "\n", "To view all sections and sets when profiling your application with Nsight Compute, run `nv-nsight-cu-cli --set=full ./miniWeather`. Now, you can see Memory and Compute Workload Analysis, scheduler, warp state and instruction statistics in addition to the default sections added to the profiling report. \n", "\n", "**Note**: The choice of sections and metrics will affect profiling time and will slow down the process. It also increases the size of the output.\n", "\n", "\n", "There are also options available to specify for which kernels data should be collected. -k allows you to filter the kernels by a regex match of their names. --kernel-id allows you to filter kernels by context, stream, name and invocation, similar to nvprof. 
Below is a typical command-line invocation to collect the default set of data for selected kernel launches in the target application:\n", "\n", "`nv-nsight-cu-cli -c 1 -s 10 -k semi_discrete_step -f -o miniWeather ./miniWeather`\n", "\n", "where the command switch options used for this lab are:\n", "- `-c`: specify the number of kernel launches to collect\n", "- `-s`: specify the number of kernel launches to skip before collection starts\n", "- `-k`: filter the kernels by a regex match of their names\n", "- `-f`: overwrite an existing generated report\n", "- `-o`: name of the intermediate result file, created at the end of the collection (a .nsight-cuprof-report or .ncu-rep filename)\n", "\n", "**Customising data collection**: How would you decide on the number of kernel launches to skip and how many to collect? Since data is collected per launch, it makes sense to collect more than one kernel launch if the launches have different behavior or performance characteristics. The decision on how many launches to skip or collect depends on whether you want the performance metrics for those particular launches.\n", "\n", "\n", "*The screenshot shows profiling the full kernel launch for the compute_tendencies_x kernel function.*\n", "\n", "\n", "*The screenshot shows profiling 3 kernel launches for the compute_tendencies_x kernel function, while skipping 5 launches.*\n", "\n", "**Note**: You do not need to memorize the profiler options. You can always run `nv-nsight-cu-cli --help` from the command line and use the necessary options or profiler arguments. For more info on the Nsight profiler and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)__.\n", "\n", "\n", "### How to view the report\n", "The profiler report contains all the information collected during profiling for each kernel launch. When using the CLI to profile the application, there are two ways to view the profiler's report. 
\n", "\n", "1) On the Terminal: By default, a temporary file is used to store profiling results, and data is printed to the command line. You can also use `--summary=per=kernel` option to view the summary of each kernel type on the terminal.\n", "\n", "\n", "\n", "2) NVIDIA Nsight Compute UI: To permanently store the profiler report, use `-o` to specify the output filename. After the profiling session ends, a `*.nsight-cuprof-report` or `*.ncu-rep` file will be created. This file can be loaded into Nsight Compute UI using *File -> Open*. If you would like to view this on your local machine, this requires that the local system has CUDA toolkit installed of same version and the Nsight Compute UI version should match the CLI version. More details on where to download CUDA toolkit can be found in the “Links and Resources” at the end of this page.\n", "\n", "To view the profiler report, simply open the file from the GUI (File > Open).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started \n", "In this section, we use Nsight Compute profiler to inspect kernels within the parallel mini application we worked on in previous exercises. You will profile the code with Nsight Systems (`nsys`), identify certain areas/kernels in the code, where they don't behave as expected. Then, you use Nsight Compute (`nv-nsight-cu-cli`) to profile specific kernels. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 5 (Optional)\n", "\n", "### Learning objectives\n", "Learn how to assess your parallel application via Nsight compute and find the hotspots. 
In this exercise you will:\n", "\n", "- Learn how to profile your application with Nsight Compute\n", "- Learn how to navigate inside the Nsight Compute UI\n", "- Learn how to inspect the application's kernels with Nsight Compute\n", "- Learn how to execute rules inside the Nsight Compute profiler and find bottlenecks\n", "- Learn how to add baselines and compare results/reports\n", "\n", "As mentioned earlier, Nsight Compute and Nsight Systems each serve a different purpose in profiling, and their functionalities differ. In previous exercises we inspected timelines, measured activity durations, and tracked CPU events via the Nsight Systems profiler. The purpose of this exercise is to get familiar with the Nsight Compute tool, which provides access to kernel-level analysis using GPU performance metrics.\n", "\n", "We first profile the GPU application and identify areas in the code that don't behave as expected. Then we isolate those kernels and profile them via Nsight Compute. \n", "\n", "**Understand and analyze** the code present at:\n", "\n", "[OpenACC Code](../source_code/lab5/miniWeather_openacc.f90) \n", "\n", "[Makefile](../source_code/lab5/Makefile)\n", "\n", "Open the downloaded file for inspection. Once done, **Compile** the code with `make` and **Profile** it with `nsys`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab5 && make clean && make" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, **Profile** the code with Nsight System CLI:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab5 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_5 ./miniWeather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download and save the report file by holding down Shift and Right-Clicking [Here](../source_code/lab5/miniWeather_5.qdrep) and open it via the Nsight System UI. From the timeline view, inspect the less efficient kernel.\n", "\n", "\n", "\n", "As you can see in the example output, the initialization looks very expensive and kernels are very small that the GPU compute part of the problem is very small. Check how much time (what percentage) is spend in each relative to the time it takes to run the code. \n", "\n", "Let's take the most expensive kernel and take a closer look and see what the Nsight Compute recommends us to do." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, **Profile** the application via Nsight Compute CLI (`nv-nsight-cu-cli`): " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab5 && nv-nsight-cu-cli -c 1 -s 10 -k compute_tendencies_x -f -o miniWeather1 ./miniWeather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download and save the report file by holding down Shift and Right-Clicking [Here](../source_code/lab5/miniWeather1.ncu-rep) and open it via the Nsight Compute UI. This tool has a lot of sections that focuses on different areas of the GPU and presents them all in one page. Inspect these sections, then click on the \"Apply Rules\" to check the bottlenecks. 
As you can see from the example output below, the Nsight Compute profiler suggests looking at the \"Launch Statistics\" section because the kernel grid is too small to fill the available resources on the GPU.\n", "\n", "\n", "\n", "We previously discussed Amdahl's law in the first exercise. It is very important to understand the relationship between problem size and computational performance, as this determines the speedup and benefit you would get from parallelizing on the GPU. Due to the small problem size (`nx_glob`, `nz_glob`, and `sim_time` in this example), most of the run time is dominated by the initialization and there is not enough work/computation to make it suitable for the GPU. Let's change the values of `nx_glob`, `nz_glob`, and `sim_time` in the code to `nx_glob` = 400, `nz_glob` = 200, and `sim_time` = 10. \n", "\n", "Click on the [miniWeather_openacc.f90](../source_code/lab5/miniWeather_openacc.f90) link and modify `miniWeather_openacc.f90`. Remember to **SAVE** your code after making changes, before running the cells below.\n", "\n", "Once done, **Compile** and **Profile** the application again with the Nsight Compute CLI (`nv-nsight-cu-cli`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab5 && make clean && make" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab5 && nv-nsight-cu-cli -c 1 -s 10 -k semi_discrete_step -f -o miniWeather2 ./miniWeather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download and save the report file by holding down Shift and Right-Clicking [Here](../source_code/lab5/miniWeather2.ncu-rep) and open it via the Nsight Compute UI. \n", "\n", "**Diff the reports**\n", "\n", "Open both reports via the Nsight Compute UI. From the top of the first report, click on *Add Baseline*, then do the same in the second tab, which holds the second report. 
Have a look at the expected output:\n", "\n", "\n", "\n", "You can now compare the two reports and see how the changes you made to the problem size affected specific metrics from different sections. You can also see that Nsight Compute now suggests looking at the \"Memory Workload Analysis\" section to find where the memory bottleneck is. The detailed Memory Workload Analysis section shows all the data traffic between the various stages of the GPU and what your kernel is actually transferring. This is beyond the scope of this tutorial, but you can have a look at the algorithm and see whether you can change anything to do more work per memory access.\n", "\n", "**Note:** If you don't specify a kernel name when profiling, all kernels will be profiled, which slows down profiling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Post-Lab Summary\n", "\n", "If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are saved as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ..\n", "rm -f openacc_profiler_files.zip\n", "zip -r openacc_profiler_files.zip *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../openacc_profiler_files.zip)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "# [HOME](../../profiling_start.ipynb)
\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Links and Resources\n", "\n", "[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)\n", "\n", "[NVIDIA Nsight Compute](https://docs.nvidia.com/nsight-compute/index.html)\n", "\n", "[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)\n", "\n", "**NOTE**: To be able to see the Nsight Compute profiler output, please download Nsight Compute latest version from [here](https://developer.nvidia.com/nsight-compute).\n", "\n", "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }