{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This lab gives an overview of the Nvidia Nsight Compute and steps to profile a kernel with Nsight Compute command line interface.\n", "\n", "Let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the nvidia-smi command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction to Nsight Compute\n", "Nsight Compute tool provides detailed performance metrics and API debugging via a user interface and command line tool. NVIDIA Nsight Compute is an interactive kernel profiler for GPU applications which provides detailed performance metrics and API debugging via a user interface and command line tool. The NVIDIA Nsight Compute CLI (`ncu`) provides a non-interactive way to profile applications from the command line and can print the results directly on the command line or store them in a report file. \n", "\n", "Results can then be imported to the GUI version for inspection. With command line profiler, you can instrument the target API, and collect profile results for the specified kernels or all of them.\n", "\n", "\n", "\n", "- **Navigating the report via GUI**\n", "The Nsight Compute UI consists of a header with general information, as well as controls to switch between report pages or individual collected kernel launches. By default, the profile report comes up on the *Details* page. You can easily switch between different report pages of the report with the dropdown labeled *Page* on the top-left of the page. \n", "\n", "\n", "\n", "A report can contain any number of results from kernel launches. The *Launch* dropdown allows switching between the different results in the report.\n", "\n", "\n", "\n", "\n", "- **Sections and Sets**\n", "Nsight Compute uses section sets to decided the amount of metrics to be collected. By default, a relatively small number of metrics is collected such as SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis. You can optionally select which of these sections are collected and displayed with command-line parameters. If you are profiling from the command-line, use the flag `--set detailed` or `--set full`. In the later sections, you will learn how to collect these metrics. Read more at https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules.\n", "\n", "\n", "\n", "\n", "Below screenshots show close-up view of example sections in the Nsight Compute profiler. You can expand each section by clicking on each. Under each section, there is description explaining what it shows (some of these sections are not collected by default). \n", "\n", "\n", "\n", "Various sections have a triangle with an exclamation mark inside in front of them. Follow the warning sign/icon and it tells you what the bottleneck is and gives you guidance on how you can improve it.\n", "\n", "\n", "\n", "Some of sections have one or more bodies with additional charts or tables. You can click on the triangle expander icon in the top-left corner of each section to show or hide those. 
"\n", "\n", "The screenshots below show close-up views of example sections in the Nsight Compute profiler. You can expand each section by clicking on it. Under each section there is a description explaining what it shows (some of these sections are not collected by default).\n", "\n", "\n", "\n", "Several sections have a triangle icon with an exclamation mark in front of them. This warning icon tells you what the bottleneck is and gives you guidance on how to improve it.\n", "\n", "\n", "\n", "Some sections have one or more bodies with additional charts or tables. You can click on the triangle expander icon in the top-left corner of each section to show or hide those. If a section has multiple bodies, a dropdown in their top-right corner allows you to switch between them. As shown in the example screenshot below, you can switch between different bodies in the SOL section and choose to view *SOL Chart*, *SOL Breakdown*, *SOL Rooflines*, or all of them together.\n", "\n", "\n", "\n", "Let's have a look at some of these sections:\n", "\n", "The _**GPU Speed Of Light Roofline**_ Chart section contains a Roofline chart that is helpful for visualizing kernel performance. More information on how to use and read this chart can be found in the [*Roofline Charts*](#roofline) section.\n", "\n", "\n", "\n", "The _**Memory Workload Analysis**_ section contains a Memory chart that visualizes data transfers, cache hit rates, instructions, and memory requests. More information on how to use and read this chart can be found in the [*Memory Charts*](#memory) section.\n", "\n", "\n", "\n", "_**Source Counters**_ can contain source hotspot tables that indicate the N highest or lowest values of one or more metrics in the kernel source code. In other words, it points to performance problems in the source code.\n", "\n", "\n", "\n", "You can select the location links to navigate directly to that location on the *Source Page* (it displays metrics that can be correlated with source code). Please note that for the correlation of SASS and source code to work, the source code needs to be compiled with the `-lineinfo` flag.\n", "\n",
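"For example, with `nvcc` the flag is passed directly, as in the sketch below (NVIDIA HPC SDK compilers such as `nvc++` expose an equivalent option, `-gpu=lineinfo`; the `rdf.cu` filename here is illustrative):\n", "\n", "```bash\n", "# Embed source line information in the generated code so Nsight Compute\n", "# can correlate SASS with source lines on the Source page\n", "nvcc -lineinfo -o rdf rdf.cu\n", "```\n", "\n",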
" \n", "\n", "\n", "To read more about the different sections in NVIDIA Nsight Compute, check out the documentation: http://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- **Comparing multiple results**\n", "With the Nsight Compute GUI, you can create a baseline and compare results against each other. On the *Details* page, press the *Add Baseline* button to make the current result the baseline for all other results from this report and from any other report opened in the same instance of Nsight Compute. When a baseline is set, every element on the *Details* page shows two values: the current value of the result in focus, and the corresponding value of the baseline or the percentage of change from the corresponding baseline value.\n", "\n", "\n", "\n", "\n", "- **Applying Rules**\n", "Sections on the *Details* page may provide rules. By pressing the *Apply Rules* button at the top of the page, all available rules for the current report are executed.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Roofline Charts \n", "\n", "Once you have written high-performance code, you need to understand how well the application performs on the available hardware. Different platforms, whether they are CPUs, GPUs, or something else, have different hardware limitations, such as available memory bandwidth and theoretical compute limits. The Roofline performance model visualizes achieved performance, helping you understand how well your application is using the available hardware resources and find the performance limiters.\n", "\n", "Kernel performance is not only dependent on the operational speed of the GPU. Since a kernel requires data to work on, performance is also dependent on the rate at which the GPU can feed data to the kernel. A typical roofline chart combines the peak performance and memory bandwidth of the GPU with a metric called *Arithmetic Intensity* (the ratio between work and memory traffic) in a single chart, to more realistically represent the achieved performance of the profiled kernel.\n", "\n", "With *Arithmetic Intensity* and *FLOP/s*, you can plot a kernel on a graph that includes rooflines and ceilings of performance limits and visualize how your kernel is affected by them.\n", "\n", "- *Arithmetic Intensity*: the ratio between compute work (FLOPs) and data movement (bytes)\n", "- *FLOP/s*: floating-point operations per second\n", "\n", "\n", "Nsight Compute collects and displays roofline analysis data in the roofline chart. This chart is part of the Speed Of Light (SOL) section.\n", "\n", "\n", "\n", "\n", "\n", "This chart actually shows two different rooflines. However, the following components can be identified for each:\n", "\n", "- *Vertical Axis*: represents floating-point operations per second (FLOP/s). (Note: for GPUs this number can get quite large, so to better accommodate the range, this axis is rendered using a logarithmic scale.)\n", "- *Horizontal Axis*: represents Arithmetic Intensity, which is the ratio between work (expressed in floating-point operations per second) and memory traffic (expressed in bytes per second). The resulting unit is floating-point operations per byte. This axis is also shown using a logarithmic scale.\n", "- *Memory Bandwidth Boundary*: the sloped part of the roofline. By default, this slope is determined entirely by the memory transfer rate of the GPU, but it can be customized.\n", "- *Peak Performance Boundary*: the flat part of the roofline. By default, this value is determined entirely by the peak performance of the GPU, but it can be customized.\n", "- *Ridge Point*: the point at which the memory bandwidth boundary meets the peak performance boundary (a useful reference when analyzing kernel performance).\n", "- *Achieved Value*: represents the performance of the profiled kernel.\n", "\n", "To learn more about customizing NVIDIA Nsight Compute tools, read the Nsight Compute Customization Guide: https://docs.nvidia.com/nsight-compute/2021.2/CustomizationGuide/index.html#abstract\n", "\n", "#### Roofline Analysis\n", "\n", "The roofline chart can be very helpful in guiding performance optimization efforts for a particular kernel.\n", "\n", "\n", "\n", "As shown here, the ridge point partitions the roofline chart into two regions. The area shaded in blue under the sloped Memory Bandwidth Boundary is the *Memory Bound* region, while the area shaded in green under the Peak Performance Boundary is the *Compute Bound* region. The region in which the achieved value falls determines the current limiting factor of kernel performance.\n", "\n", "The distance from the achieved value to the respective roofline boundary (shown in this figure as a dotted white line) represents the opportunity for performance improvement: the closer the achieved value is to the roofline boundary, the more optimal its performance. An achieved value that lies on the *Memory Bandwidth Boundary* but is not yet at the height of the ridge point indicates that further improvements in overall FLOP/s are only possible if the *Arithmetic Intensity* is increased at the same time.\n",
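"\n", "As a quick worked example (the numbers are illustrative, not measured in this lab): a kernel that performs $2\\times10^{9}$ floating-point operations while moving $4\\times10^{8}$ bytes has an arithmetic intensity of $2\\times10^{9} / 4\\times10^{8} = 5$ FLOP/byte. On a GPU with roughly $900$ GB/s of memory bandwidth (approximately that of a V100), the memory-bound ceiling at that intensity is $5 \\times 900\\ \\mathrm{GB/s} = 4.5$ TFLOP/s; if the GPU's peak compute rate is higher than this, the kernel falls in the memory-bound region of the chart.\n",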
\n", "\n", "If you hover your mouse over the achieved value, you can see the achieved performance (FLOP/s)(see below example).\n", "\n", "\n", "\n", "Using the baseline feature in combination with roofline charts, is a good way to track optimization progress over a number of kernel executions. As shown in the example below, the roofline chart also contains an achieved value for each baseline. The outline color of the plotted achieved value point can be used to determine from which baseline the point came.In this example, the outline colors are light blue and green showing the achieved value points.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Memory Charts \n", "\n", "Memory Workload Analysis section shows detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or by reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow to identify the exact bottleneck in the memory system.\n", "\n", "Below is a memory chart of an NVIDIA V100 GPU:\n", "\n", "\n", "\n", "*Logical unit* (e.g: Kernel, Global memory) are shown in green and *physical units* (e.g: L2 Cache, Device Memory, System memory) are shown in blue color. Since not all GPUs have all units, exact set of shown units may vary for a specific GPU architecture.\n", "\n", "*Links* between *Kernel* and other logical units represent the number of executed instructions (Inst) targeting the respective unit. For example, the link between Kernel and Global represents the instructions loading from or storing to the global memory space. \n", "\n", "Links between logical units (green) and physical units (blue) represent the number of requests (Req) issued as a result of their respective instructions. For example, the link going from L1/TEX Cache to Global shows the number of requests generated due to global load instructions.\n", "\n", "The color of each link represents the percentage of peak utilization of the corresponding communication path. The color legend to the right of the chart shows the applied color gradient from unused (0%) to operating at peak performance (100%). Triangle markers to the left of the legend correspond to the links in the chart. \n", "\n", "\n", "Colored rectangles inside the units located at the incoming and outgoing links represents port utilization. Units often share a common data port for incoming and outgoing traffic. Ports use the same color gradient as the data links. Below screenshot shows the mapping of the peak values between the memory chart and the table. An example of the correlation between the peak values reported in the memory tables and the ports in the memory chart is shown below:\n", "\n", "\n", "\n", "\n", "Memory tables shows detailed metrics for the various memory hardware units such as device memory. To learn more, please read the profiling guide: https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#memory-tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling using command line interface \n", "To profile the application, you can either use the Graphical User Interface(GUI) or Command Line Interface (CLI). During this lab, we will profile the applications using CLI. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling using the command line interface \n", "To profile the application, you can use either the Graphical User Interface (GUI) or the Command Line Interface (CLI). During this lab, we will profile the applications using the CLI. The Nsight Compute command line executable is named `ncu`. To collect the default set of data for all kernel launches in the application, run:\n", "\n", "`ncu -o output ./rdf`\n", "\n", "For all kernel invocations in the application code, *Details* page data will be gathered and displayed, and the results will be written to `output.ncu-rep`.\n", "\n", "\n", "\n", "As seen in the above screenshot, each output line from the compute profiler starts with `==PROF==`. The other lines are output from the application itself. For each profiled kernel, the name of the kernel function and the progress of data collection are shown. In the example screenshot, the kernel function name starts with `_Z16pair_gpu_183_gpuPKdS0_S0_...`.\n", "\n", "\n", "\n", "\n", "The example screenshot shows the major sections (annotated in green) for SOL (Speed Of Light – a comparison against the best possible behavior), launch statistics, and occupancy analysis for the example kernel function `pair_gpu`. You can optionally select which of these sections are collected and displayed with command-line parameters. Simply run `ncu --list-sets` from the terminal to see the list of available sets.\n", "\n", "\n", " \n", "\n", "\n", "To see the list of currently available sections, use `--list-sections`.\n", "\n", "\n", " \n", "\n", "To collect all sections and sets when profiling your application with Nsight Compute, add `--set=full` to the command line. It then collects Memory and Compute Workload Analysis, scheduler, warp state, and instruction statistics in addition to the default sections, and all of them will be added to the profiling report.\n", "\n", "**Note**: The choice of sections and metrics affects profiling time and will slow down the process. It also increases the size of the output.\n", "\n", "\n", "There are also options available to specify for which kernels data should be collected. Below is a typical command line invocation to collect the default set of data for a single launch of a specific kernel in the target application:\n", "\n", "`ncu -k _Z16pair_gpu_183_gpuPKdS0_S0_Pyiidddi --launch-skip 1 --launch-count 1 -f -o output ./rdf`\n", "\n", "where the command switch options used for this lab are:\n", "- `-c` or `--launch-count`: specifies the number of kernel launches to collect\n", "- `-s` or `--launch-skip`: specifies the number of kernel launches to skip before collection starts\n", "- `-k` or `--kernel-name`: specifies the matching kernel name\n", "- `-f`: overwrites an existing generated report\n", "- `-o`: the name for the result file, created at the end of the collection (a `.nsight-cuprof-report` or `.ncu-rep` file)\n", "\n", "**Customising data collection**: How would you decide on the number of kernel launches to skip and how many to collect? Since data is collected per kernel launch, it makes sense to collect more than one launch if launches have different behavior or performance characteristics. The decision of how many kernel launches to skip or collect depends on whether you want the performance metrics for those particular launches; for instance, you might skip initial launches that merely warm up the GPU (see the sketch below).\n",
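"\n", "For example, a command along these lines (using the flags described above) skips the first two launches of every kernel and profiles only the third:\n", "\n", "```bash\n", "# Skip the first 2 kernel launches (often warm-up runs) and profile the\n", "# next one, overwriting any existing report named third_launch.ncu-rep\n", "ncu --launch-skip 2 --launch-count 1 -f -o third_launch ./rdf\n", "```\n", "\n",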
"\n", "You can also profile a kernel from inside Nsight Systems, or copy the command line options for the specific kernel you want to profile. To achieve this, right-click on the kernel in the timeline view inside Nsight Systems.\n", "\n", "\n", "\n", "Then click on \"Analyze the selected Kernel with NVIDIA Nsight Compute\".\n", "\n", "\n", "\n", "Then choose \"Display the command line to use NVIDIA Nsight Compute CLI\". Finally, copy the command and run it on the target system to analyze the selected kernel.\n", "\n", "\n", "\n", "\n", "**Note**: You do not need to memorize the profiler options. You can always run `ncu --help` from the command line and look up the necessary options or profiler arguments. For more information on the Nsight Compute profiler, please read the __[documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)__.\n", "\n", "\n", "### How to view the report\n", "The profiler report contains all the information collected during profiling for each kernel launch. When using the CLI to profile the application, there are two ways to view the profiler's report.\n", "\n", "1) On the terminal: By default, a temporary file is used to store profiling results, and the data is printed to the command line. You can also use the `--print-summary per-kernel` option to view a summary of each kernel type on the terminal, as sketched below. To read more about console output options, check out the guide at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-console-output .\n", "\n",
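"For example, a run along these lines prints one summary row per kernel instead of the full details for every launch:\n", "\n", "```bash\n", "# Print a per-kernel summary table to the terminal instead of the\n", "# full section output for every launch\n", "ncu --print-summary per-kernel ./rdf\n", "```\n", "\n",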
"\n", "2) In the NVIDIA Nsight Compute UI: To permanently store the profiler report, use `-o` to specify the output filename. After the profiling session ends, a `*.nsight-cuprof-report` or `*.ncu-rep` file will be created. This file can be loaded into the Nsight Compute UI using *File -> Open*. If you would like to view it on your local machine, the local system must have a CUDA toolkit of the same version installed, and the Nsight Compute UI version should match the CLI version. More details on where to download the CUDA toolkit can be found in the “Links and Resources” section at the end of this page.\n", "\n", "To view the profiler report, simply open the file from the GUI (File > Open).\n", "\n", "\n", "\n", "**NOTE**: Example screenshots are for reference only and you may not get an identical profiler report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "\n", "# [HOME](../../../nways_start.ipynb)
\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Links and Resources\n", "\n", "\n", "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", "\n", "\n", "**NOTE**: To be able to see the Nsight Compute profiler output, please download Nsight Compute's latest version from [here](https://developer.nvidia.com/nsight-compute).\n", "\n", "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }