
Fixed Review comments for Fortran

bharatk-parallel 3 years ago
parent
commit
e019847d0f
31 changed files with 61 additions and 48 deletions
  1. 1 1
      hpc/nways/README.md
  2. 7 10
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/Final_Remarks.ipynb
  3. 3 7
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/cudafortran/nways_cuda.ipynb
  4. 3 3
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/doconcurrent/nways_doconcurrent.ipynb
  5. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/cuda_profile_timeline.jpg
  6. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/cuda_profile_timeline.png
  7. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/do_concurrent_gpu.jpg
  8. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/do_concurrent_multicore.jpg
  9. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_multicore.jpg
  10. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_multicore.png
  11. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_serial.jpg
  12. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_serial.png
  13. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc correlation.jpg
  14. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc correlation.png
  15. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc_construct.jpg
  16. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc_construct.png
  17. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_data.jpg
  18. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_data.png
  19. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_expand.jpg
  20. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_expand.png
  21. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_timeline.jpg
  22. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_timeline.png
  23. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_unified.jpg
  24. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_unified.png
  25. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/serial.jpg
  26. Binary
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/serial.png
  27. 23 14
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/openacc/nways_openacc.ipynb
  28. 4 4
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/openmp/nways_openmp.ipynb
  29. 2 2
      hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/serial/rdf_overview.ipynb
  30. 17 6
      hpc/nways/nways_labs/nways_MD/English/nways_MD_start.ipynb
  31. 1 1
      hpc/nways/nways_labs/nways_start.ipynb

File diff suppressed because it is too large
+ 1 - 1
hpc/nways/README.md


+ 7 - 10
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/Final_Remarks.ipynb

@@ -29,7 +29,7 @@
     "There is a very thin line between these categories and within that limited scope and view we could categorize different approaches as follows:\n",
     "\n",
     " \n",
-    "| | OpenACC | OpenMP | stdpar | Kokkos | CUDA Laguages |\n",
+    "| | OpenACC | OpenMP | DO-CONCURRENT | Kokkos | CUDA Laguages |\n",
     "| --- | --- | --- | --- | --- | --- |\n",
     "| Ease | High  | High | High  | Intermediate | Low |\n",
     "| Performance  | Depends | Depends | Depends | High | Best |\n",
@@ -40,29 +40,26 @@
     "\n",
     "## Ease of Programming\n",
     "- The directive‐based OpenMP and OpenACC programming models are generally least intrusive when applied to the loops. \n",
-    "- Kokkos required restructuring of the existing code for the parallel dispatch via functors or lambda functions\n",
     "- CUDA required a comparable amount of rewriting effort, in particular, to map the loops onto a CUDA grid of threads and thread blocks\n",
-    "- stdpar also required us to change the constructs to make use of C++17 templates and may be preferred for new developments having C++ template style coding. \n",
-    "- The overhead for OpenMP and OpenACC in terms of lines of code is the smallest, followed by stdpar and Kokkos\n",
+    "- DO-CONCURRENT also required us to do minimal change by replacing the *do* loop to *do concurrent* . \n",
+    "- The overhead for OpenMP, OpenACC and DO-CONCURRENT in terms of lines of code is the smallest\n",
     "\n",
     "## Performance\n",
     "While we have not gone into the details of optimization for any of these programming model the analysis provided here is based on the general design of the programming model itself.\n",
     "\n",
-    "- Kokkos when compiled enables the use of correct compiler optimization flags for the respective platform, while for the other frameworks, the user has to set these flags manually. This gives kokkos an upper hand over OpenACC and OpenMP. \n",
     "- OpenACC and OpenMP abstract model defines a least common denominator for accelerator devices, but cannot represent architectural specifics of these devices without making the language less portable.\n",
-    "- stdpar on the other hand is more abstract and gives less control to developers to optimize the code\n",
+    "- DO-CONCURRENT on the other hand is more abstract and gives less control to developers to optimize the code\n",
     "\n",
     "## Portability\n",
-    "We observed the same code being run on moth multicore and GPU using OpenMP, OpenACC, Kokkos and stdpar. The point we highlight here is how a programming model supports the divergent cases where developers may choose to use different directive variant to get more performance. In a real application the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and application \n",
+    "We observed the same code being run on moth multicore and GPU using OpenMP, OpenACC and DO-CONCURRENT. The point we highlight here is how a programming model supports the divergent cases where developers may choose to use different directive variant to get more performance. In a real application the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and application \n",
     "- OpenMP supports [Metadirective](https://www.openmp.org/spec-html/5.0/openmpsu28.html) where the developer can choose to activate different directive variant based on the condition selected.\n",
     "- In OpenACC when using ```kernel``` construct, the compiler is responsible for mapping and partitioning the program to the underlying hardware. Since the compiler will mostly take care of the parallelization issues, the descriptive approach may generate performance code for specific architecture. The downside is the quality of the generated accelerated code depends significantly on the capability of the compiler used and hence the term \"may\".\n",
     "\n",
     "\n",
     "## Support\n",
-    "- Kokkos project is very well documented and the developers support on GitHub is excellent \n",
     "- OpenACC implementation is present in most popular compilers like NVIDIA HPC SDK, PGI, GCC, Clang and CRAY. \n",
     "- OpenMP GPU support is currently available on limited compilers but being the most supported programming model for multicore it is matter of time when it comes at par with other models for GPU support.\n",
-    "- stdpar being part of the C++ standard is bound to become integral part of most compiler supporting parallelism. \n",
+    "- DO-CONCURRENT being part of the ISO Fortran standard is bound to become integral part of most compiler supporting parallelism. \n",
     "\n",
     "\n",
     "Parallel Computing in general has been a difficult task and requires developers not just to know a programming approach but also think in parallel. While this tutorial provide you a good start, it is highly recommended to go through Profiling and Optimization bootcamps as next steps.\n",
@@ -110,7 +107,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.4"
+   "version": "3.6.2"
   }
  },
  "nbformat": 4,

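The "minimal change" claim above (replace a *do* loop with *do concurrent*) can be made concrete with a small sketch. The loop below is hypothetical, not the lab's actual `rdf.f90` kernel; with `nvfortran`, a flag along the lines of `-stdpar=multicore` or `-stdpar=gpu` would map the construct to the respective target:

```fortran
! Hypothetical sketch of the minimal-change claim: only the loop
! header differs between the serial and the ISO-parallel version.
program do_concurrent_sketch
  implicit none
  integer :: i
  integer, parameter :: n = 1024
  real :: g(n)

  g = 0.0

  ! Serial form would be:   do i = 1, n
  ! The parallel form below declares the iterations order-independent,
  ! so the compiler is free to run them on multicore or on a GPU.
  do concurrent (i = 1:n)
    g(i) = g(i) + 1.0
  end do

  print *, sum(g)   ! sums to 1024
end program do_concurrent_sketch
```

A possible compile line, assuming the NVIDIA HPC SDK used elsewhere in this lab, might be `nvfortran -stdpar=gpu -o sketch sketch.f90`.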
File diff suppressed because it is too large
+ 3 - 7
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/cudafortran/nways_cuda.ipynb


+ 3 - 3
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/doconcurrent/nways_doconcurrent.ipynb

@@ -168,7 +168,7 @@
    "source": [
     "Let's checkout the profiler's report. [Download the profiler output](../../source_code/doconcurrent/rdf_doconcurrent_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n",
     "\n",
-    "<img src=\"../images/stdpar_multicore.png\">\n",
+    "<img src=\"../images/do_concurrent_multicore.jpg\">\n",
     "\n",
     "\n",
     "### Compile and run for Nvidia GPU\n",
@@ -246,11 +246,11 @@
    "source": [
     "Let's checkout the profiler's report. [Download the profiler output](../../source_code/doconcurrent/rdf_doconcurrent_gpu.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n",
     "\n",
-    "<img src=\"../images/stdpar_gpu.png\">\n",
+    "<img src=\"../images/do_concurrent_gpu.jpg\">\n",
     "\n",
     "If you inspect the output of the profiler closer, you can see the usage of *Unified Memory* annotated with green rectangle which was explained in previous sections.\n",
     "\n",
-    "Moreover, if you compare the NVTX marker `Pair_Calculation` (from the NVTX row) in both multicore and GPU version, you can see how much improvement you achieved. In the *example screenshot*, we were able to reduce that range from 1.52 seconds to 28 mseconds.\n",
+    "Moreover, if you compare the NVTX marker `Pair_Calculation` (from the NVTX row) in both multicore and GPU version, you can see how much improvement you achieved. In the *example screenshot*, we were able to reduce that range from 1.57 seconds to 26 mseconds.\n",
     "\n",
     "Feel free to checkout the [solution](../../source_code/doconcurrent/SOLUTION/rdf.f90) to help you understand better or compare your implementation with the sample solution."
    ]

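The `Pair_Calculation` NVTX range compared in the hunk above comes from instrumenting the source. A hedged sketch of what that instrumentation could look like, assuming the `nvtx.f90` wrapper compiled alongside `rdf.f90` in this lab exposes an `nvtx` module with `nvtxStartRange`/`nvtxEndRange` (names taken from NVIDIA's commonly used Fortran NVTX wrapper, not verified against this repo's copy):

```fortran
! Hedged sketch: the module and routine names below are assumed from
! NVIDIA's common nvtx.f90 wrapper; check the lab's own
! source_code/*/nvtx.f90 for the exact interface.
subroutine pair_calculation_timed
  use nvtx
  implicit none

  call nvtxStartRange("Pair_Calculation")  ! opens the named range
  ! ... pair-distribution computation being timed ...
  call nvtxEndRange                        ! closes it; Nsight Systems
                                           ! shows it on the NVTX row
end subroutine pair_calculation_timed
```

Linking against `-lnvToolsExt`, as the compile cells in these notebooks do, makes the ranges visible in the `.qdrep` reports.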
Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/cuda_profile_timeline.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/cuda_profile_timeline.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/do_concurrent_gpu.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/do_concurrent_multicore.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_multicore.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_multicore.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_serial.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/nvtx_serial.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc correlation.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc correlation.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc_construct.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/openacc_construct.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_data.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_data.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_expand.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_expand.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_timeline.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_timeline.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_unified.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/parallel_unified.png


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/serial.jpg


Binary
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/images/serial.png


File diff suppressed because it is too large
+ 23 - 14
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/openacc/nways_openacc.ipynb


+ 4 - 4
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/openmp/nways_openmp.ipynb

@@ -271,7 +271,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Inspect the compiler feedback (you should get a similar output as below) you can see from *Line 174* that it is generating a multicore code `98, Generating Multicore code`.\n",
+    "Inspect the compiler feedback (you should get a similar output as below) you can see from *Line 98* that it is generating a multicore code `98, Generating Multicore code`.\n",
     "\n",
     "```\n",
     "\tMulticore output\n",
@@ -322,7 +322,7 @@
    "source": [
     "Let's checkout the profiler's report. [Download the profiler output](../../source_code/openmp/rdf_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n",
     "\n",
-    "<img src=\"../images/openmp_multicore.png\">\n",
+    "<img src=\"../images/nvtx_multicore.jpg\">\n",
     "\n",
     "Feel free to checkout the [solution](../../source_code/openmp/SOLUTION/rdf_offload.f90) to help you understand better."
    ]
@@ -363,7 +363,7 @@
     "```\n",
     "rdf.f90:\n",
     "rdf:\n",
-    "     98, !$omp target teams distribute parallel for\n",
+    "     98, !$omp target teams distribute parallel do\n",
     "         94, Generating map(tofrom:g(:),x(z_b_0:z_b_1,z_b_3:z_b_4),y(z_b_7:z_b_8,z_b_10:z_b_11),z(z_b_14:z_b_15,z_b_17:z_b_18)) \n",
     "         98, Generating Tesla and Multicore code\n",
     "             Generating \"nvkernel_MAIN__F1L98_1\" GPU kernel\n",
@@ -482,7 +482,7 @@
    "outputs": [],
    "source": [
     "#compile for Tesla GPU\n",
-    "!cd ../../source_code/openmp && nvfortran -mp=gpu -Minfo=mp -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/cuda/11.0/lib64 -lnvToolsExt"
+    "!cd ../../source_code/openmp && nvfortran -mp=gpu -Minfo=mp -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt"
    ]
   },
   {

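The compiler feedback in the hunk above references the `!$omp target teams distribute parallel do` construct with a `map(tofrom: ...)` clause. A minimal, hypothetical illustration (the array name echoes the feedback, but the size and loop body are invented, not the lab's rdf kernel):

```fortran
! Hypothetical sketch of the offload construct named in the compiler
! feedback above; not the lab's actual rdf.f90 kernel.
program omp_offload_sketch
  implicit none
  integer :: i
  integer, parameter :: n = 1000
  real :: g(n)

  g = 0.0

  ! map(tofrom:) copies g to the device on entry and back on exit
  !$omp target teams distribute parallel do map(tofrom: g)
  do i = 1, n
    g(i) = g(i) + 1.0
  end do
  !$omp end target teams distribute parallel do

  print *, g(n)
end program omp_offload_sketch
```

With `nvfortran` this would be compiled along the lines of the cell shown above (`nvfortran -mp=gpu -Minfo=mp ...`).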
+ 2 - 2
hpc/nways/nways_labs/nways_MD/English/Fortran/jupyter_notebook/serial/rdf_overview.ipynb

@@ -64,13 +64,13 @@
    "source": [
     "Once you run the above cell, you should see the following in the terminal.\n",
     "\n",
-    "<img src=\"../images/serial.png\" width=\"70%\" height=\"70%\">\n",
+    "<img src=\"../images/serial.jpg\" width=\"70%\" height=\"70%\">\n",
     "\n",
     "To view the profiler report, you would need to [Download the profiler output](../../source_code/serial/rdf_serial.qdrep) and open it via the GUI. For more information on how to open the report via the GUI, please checkout the section on [How to view the report](../../../../../profiler/English/jupyter_notebook/profiling-c.ipynb#gui-report). \n",
     "\n",
     "From the timeline view, right click on the nvtx row and click the \"show in events view\". Now you can see the nvtx statistic at the bottom of the window which shows the duration of each range. In the following labs, we will look in to the profiler report in more detail. \n",
     "\n",
-    "<img src=\"../images/nvtx_serial.png\" width=\"100%\" height=\"100%\">\n",
+    "<img src=\"../images/nvtx_serial.jpg\" width=\"100%\" height=\"100%\">\n",
     "\n",
     "The obvious next step is to make **Pair Calculation** algorithm parallel using different approaches to GPU Programming. Please follow the below link and choose one of the approaches to parallelise th serial code.\n",
     "\n",

+ 17 - 6
hpc/nways/nways_labs/nways_MD/English/nways_MD_start.ipynb

@@ -55,8 +55,10 @@
     "1. [stdpar](C/jupyter_notebook/stdpar/nways_stdpar.ipynb)\n",
     "2. [OpenACC](C/jupyter_notebook/openacc/nways_openacc.ipynb)<!-- , [OpenACC Advanced](C/jupyter_notebook/openacc/nways_openacc_opt.ipynb)-->\n",
     "<!--3. [Kokkos](C/jupyter_notebook/kokkos/nways_kokkos.ipynb)-->\n",
-    "4. [OpenMP](C/jupyter_notebook/openmp/nways_openmp.ipynb) \n",
-    "5. [CUDA C](C/jupyter_notebook/cudac/nways_cuda.ipynb) \n",
+    "3. [OpenMP](C/jupyter_notebook/openmp/nways_openmp.ipynb) \n",
+    "4. [CUDA C](C/jupyter_notebook/cudac/nways_cuda.ipynb) \n",
+    "\n",
+    "To finish the lab let us go through some final [remarks](C/jupyter_notebook/Final_Remarks.ipynb)\n",
     "\n",
     "#### Fortran Programming Language\n",
     "\n",
@@ -67,15 +69,17 @@
     "1. [do-concurrent](Fortran/jupyter_notebook/doconcurrent/nways_doconcurrent.ipynb)\n",
     "2. [OpenACC](Fortran/jupyter_notebook/openacc/nways_openacc.ipynb)<!-- , [OpenACC Advanced](C/jupyter_notebook/openacc/nways_openacc_opt.ipynb)-->\n",
     "<!--3. [Kokkos](C/jupyter_notebook/kokkos/nways_kokkos.ipynb)-->\n",
-    "4. [OpenMP](Fortran/jupyter_notebook/openmp/nways_openmp.ipynb) \n",
-    "5. [CUDA Fortran](Fortran/jupyter_notebook/cudafortran/nways_cuda.ipynb) \n"
+    "3. [OpenMP](Fortran/jupyter_notebook/openmp/nways_openmp.ipynb) \n",
+    "4. [CUDA Fortran](Fortran/jupyter_notebook/cudafortran/nways_cuda.ipynb) \n",
+    "\n",
+    "To finish the lab let us go through some final [remarks](Fortran/jupyter_notebook/Final_Remarks.ipynb)\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To finish the lab let us go through some final [remarks](C/jupyter_notebook/Final_Remarks.ipynb)\n",
+    "\n",
     "\n",
     "### Tutorial Duration\n",
     "The lab material will be presented in a 8hr session. Link to material is available for download at the end of the lab.\n",
@@ -86,7 +90,7 @@
     "### Target Audience and Prerequisites\n",
     "The target audience for this lab is researchers/graduate students and developers who are interested in learning about programming various ways to programming GPUs to accelerate their scientific applications.\n",
     "\n",
-    "Basic experience with C/C++ programming is needed. No GPU programming knowledge is required.\n",
+    "Basic experience with Fortran programming is needed. No GPU programming knowledge is required.\n",
     "\n",
     "-----\n",
     "\n",
@@ -98,6 +102,13 @@
     "\n",
     "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {

+ 1 - 1
hpc/nways/nways_labs/nways_start.ipynb

@@ -49,7 +49,7 @@
     "### Target Audience and Prerequisites\n",
     "The target audience for this lab is researchers/graduate students and developers who are interested in learning about programming various ways to programming GPUs to accelerate their scientific applications.\n",
     "\n",
-    "Basic experience with C/C++ programming is needed. No GPU programming knowledge is required. \n",
+    "Basic experience with C/C++ or Fortran programming is needed. No GPU programming knowledge is required. \n",
     "\n",
     "--- \n",
     "\n",