{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This lab gives an overview of the Nvidia Nsight Compute and steps to profile a kernel with Nsight Compute command line interface.\n", "\n", "Let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the nvidia-smi command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction to Nsight Compute\n", "Nsight Compute tool provides detailed performance metrics and API debugging via a user interface and command line tool. NVIDIA Nsight Compute is an interactive kernel profiler for GPU applications which provides detailed performance metrics and API debugging via a user interface and command line tool. The NVIDIA Nsight Compute CLI (`ncu`) provides a non-interactive way to profile applications from the command line and can print the results directly on the command line or store them in a report file. \n", "\n", "Results can then be imported to the GUI version for inspection. With command line profiler, you can instrument the target API, and collect profile results for the specified kernels or all of them.\n", "\n", "\n", "\n", "- **Navigating the report via GUI**\n", "The Nsight Compute UI consists of a header with general information, as well as controls to switch between report pages or individual collected kernel launches. By default, the profile report comes up on the *Details* page. You can easily switch between different report pages of the report with the dropdown labeled *Page* on the top-left of the page. \n", "\n", "\n", "\n", "A report can contain any number of results from kernel launches. The *Launch* dropdown allows switching between the different results in the report.\n", "\n", "\n", "\n", "\n", "- **Sections and Sets**\n", "Nsight Compute uses section sets to decided the amount of metrics to be collected. By default, a relatively small number of metrics is collected such as SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis. You can optionally select which of these sections are collected and displayed with command-line parameters. If you are profiling from the command-line, use the flag `--set detailed` or `--set full`. In the later sections, you will learn how to collect these metrics. Read more at https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules.\n", "\n", "\n", "\n", "\n", "Below screenshots show close-up view of example sections in the Nsight Compute profiler. You can expand each section by clicking on each. Under each section, there is description explaining what it shows (some of these sections are not collected by default). \n", "\n", "\n", "\n", "Various sections have a triangle with an exclamation mark inside in front of them. Follow the warning sign/icon and it tells you what the bottleneck is and gives you guidance on how you can improve it.\n", "\n", "\n", "\n", "Some of sections have one or more bodies with additional charts or tables. You can click on the triangle expander icon in the top-left corner of each section to show or hide those. If a section has multiple bodies, a dropdown in their top-right corner allows you to switch between them. As shown in the example screenshot below, you can switch between different bodies in the SOL section and choose to view *SOL Chart*, *SOL breakdown*, *SOL Rooflines*, or all together.\n", "\n", "\n", "\n", "Let's have a look at some of these sections:\n", "\n", "The _**GPU Speed Of Light Roofline**_ Chart section contains a Roofline chart that is helpful for visualizing kernel performance. More information on how to use and read this chart can be found in [*Roofline Charts*](#roofline) section.\n", "\n", "\n", "\n", "_**Memory Workload Analysis**_ section contains a Memory chart that visualizes data transfers, cache hit rates, instructions and memory requests. More information on how to use and read this chart can be found in [*Memory Charts*](#memory) section.\n", "\n", "\n", "\n", "_**Source Counters**_ can contain source hotspot tables that indicate the N highest or lowest values of one or more metrics in the kernel source code. In other words, it depicts performance problems in the source code.\n", "\n", "\n", "\n", "You can select the location links to navigate directly to this location in the *Source Page* (it displays metrics that can be correlated with source code). Please note for the correlation of SASS and source code to work, the source code needs to be compiled with the `-lineinfo` flag.\n", " \n", "\n", "\n", "To read more about different sections in NVIDIA Nsight Compute, checkout the documentation : http://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- **Comparing multiple results**\n", "With Nsight Compute GUI, you can create a baseline and compare results against each other. On the *Details* page, press the button *Add Baseline* to make the current report/result, the baseline for all other results from this report and any other report opened in the same instance of Nsight Compute. When a baseline is set, every element on the *Details* page shows two values: The current value of the result in focus and the corresponding value of the baseline or the percentage of change from the corresponding baseline value.\n", "\n", "\n", "\n", "\n", "- **Applying Rules**\n", "Sections on the *Details* page may provide rules. By pressing the *Apply Rules* button on the top of the page, all available rules for the current report is executed. \n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Roofline Charts \n", "\n", "Once you wrote the high performance software code, you need to understand how well the application performs on the the available hardware. Different platforms, whether they are CPUs, GPUs, or something else, will have different hardware limitations such as available memory bandwidth and theoretical compute limits. The Roofline performance model visualizes achieved performance and helps you understand how well your application is using the available hardware resources and find the performance limiters. \n", "\n", "Kernel performance is not only dependent on the operational speed of the GPU. Since a kernel requires data to work on, performance is also dependent on the rate at which the GPU can feed data to the kernel. A typical roofline chart combines the peak performance and memory bandwidth of the GPU, with a metric called *Arithmetic Intensity* (a ratio between Work and Memory Traffic), into a single chart, to more realistically represent the achieved performance of the profiled kernel.\n", "\n", "With *Arithmetic intensity* and *FLOP/s*, you can plot a kernel on a graph that includes rooflines and ceilings of performance limits and visualize how your kernel is affected by them.\n", "\n", "- *Arithmetic intensity* The ratio between compute work (FLOPs) and data movement (bytes)\n", "- *FLOP/s*: Floating-point operations per second\n", "\n", "\n", "Nsight compute collects and displays roofline analysis data in the roofline chart. This chart is part of the Speed of Light (SOl) section. \n", "\n", "\n", "\n", "\n", "\n", "This chart actually shows two different rooflines. However, the following components can be identified for each:\n", "\n", "- *Vertical Axis* represents Floating Point Operations per Second (FLOPS) (Note: For GPUs this number can get quite large and to better accommodate the range, this axis is rendered using a logarithmic scale.)\n", "- *Horizontal Axis* - The horizontal axis represents Arithmetic Intensity, which is the ratio between Work (expressed in floating point operations per second), and Memory Traffic (expressed in bytes per second). The resulting unit is in floating point operations per byte. This axis is also shown using a logarithmic scale.\n", "- *Memory Bandwidth Boundary* is the sloped part of the roofline. By default, this slope is determined entirely by the memory transfer rate of the GPU but it can be customized too.\n", "- *Peak Performance Boundary* - The peak performance boundary is the flat part of the roofline By default, this value is determined entirely by the peak performance of the GPU but but it can be customized too.\n", "- *Ridge Point* is the point at which the memory bandwidth boundary meets the peak performance boundary (a useful reference when analyzing kernel performance).\n", "- *Achieved Value* represents the performance of the profiled kernel.\n", "\n", "To learn more about customizing NVIDIA Nsight Compute tools, read the Nsight Compute Customization Guide: https://docs.nvidia.com/nsight-compute/2021.2/CustomizationGuide/index.html#abstract\n", "\n", "#### Roofline Analysis\n", "\n", "The roofline chart can be very helpful in guiding performance optimization efforts for a particular kernel.\n", "\n", "\n", "\n", "As shown here, the ridge point partitions the roofline chart into two regions. The area shaded in blue under the sloped Memory Bandwidth Boundary is the *Memory Bound* region, while the area shaded in green under the Peak Performance Boundary is the *Compute Bound* region. The region in which the achieved value falls, determines the current limiting factor of kernel performance.\n", "\n", "The distance from the achieved value to the respective roofline boundary (shown in this figure as a dotted white line), represents the opportunity for performance improvement. The closer the achieved value is to the roofline boundary, the more optimal is its performance. An achieved value that lies on the *Memory Bandwidth Boundary* but is not yet at the height of the ridge point would indicate that any further improvements in overall FLOP/s are only possible if the *Arithmetic Intensity* is increased at the same time. \n", "\n", "If you hover your mouse over the achieved value, you can see the achieved performance (FLOP/s)(see below example).\n", "\n", "\n", "\n", "Using the baseline feature in combination with roofline charts, is a good way to track optimization progress over a number of kernel executions. As shown in the example below, the roofline chart also contains an achieved value for each baseline. The outline color of the plotted achieved value point can be used to determine from which baseline the point came.In this example, the outline colors are light blue and green showing the achieved value points.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Memory Charts \n", "\n", "Memory Workload Analysis section shows detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or by reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow to identify the exact bottleneck in the memory system.\n", "\n", "Below is a memory chart of an NVIDIA V100 GPU:\n", "\n", "\n", "\n", "*Logical unit* (e.g: Kernel, Global memory) are shown in green and *physical units* (e.g: L2 Cache, Device Memory, System memory) are shown in blue color. Since not all GPUs have all units, exact set of shown units may vary for a specific GPU architecture.\n", "\n", "*Links* between *Kernel* and other logical units represent the number of executed instructions (Inst) targeting the respective unit. For example, the link between Kernel and Global represents the instructions loading from or storing to the global memory space. \n", "\n", "Links between logical units (green) and physical units (blue) represent the number of requests (Req) issued as a result of their respective instructions. For example, the link going from L1/TEX Cache to Global shows the number of requests generated due to global load instructions.\n", "\n", "The color of each link represents the percentage of peak utilization of the corresponding communication path. The color legend to the right of the chart shows the applied color gradient from unused (0%) to operating at peak performance (100%). Triangle markers to the left of the legend correspond to the links in the chart. \n", "\n", "\n", "Colored rectangles inside the units located at the incoming and outgoing links represents port utilization. Units often share a common data port for incoming and outgoing traffic. Ports use the same color gradient as the data links. Below screenshot shows the mapping of the peak values between the memory chart and the table. An example of the correlation between the peak values reported in the memory tables and the ports in the memory chart is shown below:\n", "\n", "\n", "\n", "\n", "Memory tables shows detailed metrics for the various memory hardware units such as device memory. To learn more, please read the profiling guide: https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#memory-tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling using command line interface \n", "To profile the application, you can either use the Graphical User Interface(GUI) or Command Line Interface (CLI). During this lab, we will profile the applications using CLI. The Nsight Compute command line executable is named `ncu`. To collect the default set of data for all kernel launches in the application, run:\n", "\n", "`ncu -o output ./rdf`\n", "\n", "For all kernel invocations in the application code, details page data will be gathered and displayed and the results are written to `output.ncu-rep`. \n", "\n", "\n", "\n", "As seen in the above screenshot, each output from the compute profiler starts with `==PROF==`. The other lines are output from the application itself. For each profiled kernel, the name of the kernel function and the progress of data collection is shown. In the example screenshot, the kernel function name starts with `_Z16pair_gpu_183_gpuPKdS0_S0_...`.\n", "\n", "\n", "\n", "\n", "Example screenshot shows major sections (annotated in green) for SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis for the example kernel function `pair_gpu`. You can optionally select which of these sections are collected and displayed with command-line parameters. Simply run `ncu --list-sets` from the terminal to see list of available sets. \n", "\n", "\n", " \n", "\n", "\n", "To see the list of currently available sections, use `--list-sections`.\n", "\n", "\n", " \n", "\n", "To collect all sections and sets when profiling your application with Nsight Compute, add `--set=full` to the command line. Then it collects Memory and Compute Workload Analysis, scheduler, warp state and instruction statistics in addition to the default sections and all will be added to the profiling report. \n", "\n", "**Note**: The choice of sections and metrics will affect profiling time and will slow down the process. It also increases the size of the output.\n", "\n", "\n", "There are also options available to specify for which kernels data should be collected. Below is a typical command line invocation to collect the default set of data for all kernel launches in the target application:\n", "\n", "`ncu -k _Z16pair_gpu_183_gpuPKdS0_S0_Pyiidddi --launch-skip 1 --launch-count 1 -f -o output ./rdf`\n", "\n", "where command switch options used for this lab are:\n", "- `-c` or `--launch-count`: to specify number of kernel launches to collect\n", "- `-s` or `--launch-skip`: to specify number of kernels to skip before collection starts\n", "- `-k` or `--kernel-name`: to specify the matching kernel name\n", "- `-f`: Overwrites the existing generated report\n", "- `-o`: name for the intermediate result file, created at the end of the collection (.nsight-cuprof-report or .ncu-rep filename)\n", "\n", "**Customising data collection**: One may ask how would you decide on the number of kernels to skip and how many kernel launches to collect? Since data is collected per kernel, it makes sense to collect for more than one kernel launches if kernels have different behavior or performance characteristics. The decision on how many kernel launches to skip or collect depends on if you want to collect the performance metrics for those kernel launches or not.\n", "\n", "You can also profile the kernel from inside the Nsight Systems or you can copy the command line options for the specific kernel you want to profile. To achieve this, you would need to right click on the kernel in the timeline view from inside the Nsight Systems. \n", "\n", "\n", "\n", "Then click on the \"Analyze the selected Kernel with NVIDIA Nsight Compute\". \n", "\n", "\n", "\n", "Then choose \"Display the command line to use NVIDIA Nsight Compute CLI\". Then, you copy the command and run it on the target system to analyze the selected kernel.\n", "\n", "\n", "\n", "\n", "**Note**: You do not need to memorize the profiler options. You can always run `ncu --help` from the command line and use the necessary options or profiler arguments. For more info on Nsight compute profiler, please read the __[documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)__.\n", "\n", "\n", "### How to view the report\n", "The profiler report contains all the information collected during profiling for each kernel launch. When using CLI to profile the application, there are two ways to view the profiler's report. \n", "\n", "1) On the Terminal: By default, a temporary file is used to store profiling results, and data is printed to the command line. You can also use `--print-summary per-kernel` option to view the summary of each kernel type on the terminal. To read more about Console output options, checkout the guide at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-console-output .\n", "\n", "\n", "2) NVIDIA Nsight Compute UI: To permanently store the profiler report, use `-o` to specify the output filename. After the profiling session ends, a `*.nsight-cuprof-report` or `*.ncu-rep` file will be created. This file can be loaded into Nsight Compute UI using *File -> Open*. If you would like to view this on your local machine, this requires that the local system has CUDA toolkit installed of same version and the Nsight Compute UI version should match the CLI version. More details on where to download CUDA toolkit can be found in the “Links and Resources” at the end of this page.\n", "\n", "To view the profiler report, simply open the file from the GUI (File > Open).\n", "\n", "\n", "\n", "**NOTE**: Example screenshots are for reference only and you may not get identical profiler report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "#
[HOME](../../../nways_start.ipynb)
\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Links and Resources\n", "\n", "\n", "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", "\n", "\n", "**NOTE**: To be able to see the Nsight Compute profiler output, please download Nsight Compute's latest version from [here](https://developer.nvidia.com/nsight-compute).\n", "\n", "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }