| 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374 | 0	Today, NVIDIA announced new pretrained models and the general availability of Transfer Learning Toolkit (TLT) 3.0, a core component of the NVIDIA Train, Adapt, and Optimize (TAO) platform-guided workflow for creating AI. The new release includes a variety of highly accurate and performant pretrained models in computer vision and conversational AI, as well as a set of powerful productivity features that boost AI development by up to 10x. As enterprises race to bring AI-enabled solutions to market, your competitiveness relies on access to the best development tools. The development journey to deploy custom, high-accuracy, and performant AI models in production can be treacherous for many engineering and research teams attempting to train with open-source models for AI product creation. NVIDIA offers high-quality, pretrained models and TLT to help reduce costs with large-scale data collection and labeling. It also eliminates the burden of training AI/ML models from scratch. New entrants to the computer vision and speech-enabled service market can now deploy production-class AI without a massive AI development team. Highlights of the new release:TLT 3.0 is also now integrated with platforms from several leading partners who provide large, diverse, and high-quality labeled data—enabling faster end-to-end AI/ML workflows. You can now use these partners’ services to generate and annotate data, seamlessly integrate with TLT for model training and optimization, and deploy the model using DeepStream SDK or Riva to create reliable applications in computer vision and conversational AI. Check out more partner posts and tutorials about synthetic data and data annotation with TLT:Learn more about NVIDIA pretrained models and the Transfer Learning Toolkit > >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Astrophysics researchers have long faced a tradeoff when simulating space— simulations could be either high-resolution or cover a large swath of the universe. With the help of generative adversarial networks, they can accomplish both at once.Carnegie Mellon University and University of California researchers developed a deep learning model that upgrades cosmological simulations from low to high resolution, allowing scientists to create a complex simulated universe within a day. These simulations are critical for researchers to unravel mysteries around galaxy formation, dark matter and dark energy. “Cosmological simulations need to cover a large volume for cosmological studies, while also requiring high resolution to resolve the small-scale galaxy formation physics, which would incur daunting computational challenges. said Yueying Ni, a Ph.D. candidate at Carnegie Mellon. “Our technique can be used as a powerful and promising tool to match those two requirements simultaneously by modeling the small-scale galaxy formation physics in large cosmological volumes.”  The team’s GAN model can take full-scale, low-resolution models and turn them into super-resolution simulations with up to 512 times as many particles. Though it was trained on data from only small areas of space, the model was able to replicate large-scale structures seen only in massive simulations. Published in PNAS, the journal of the National Academy of Sciences, the project used the hundreds of NVIDIA RTX GPUs on the Texas Advanced Computing Center’s Frontera system.   While existing methods would take over three weeks on a single processing core to create a detailed simulation of 134 million particles, the GPU-accelerated deep learning approach does it in just 36 minutes. And for simulations 1,000 times as large, the new method shrunk simulation time down from months on a dedicated supercomputer to 16 hours on a single GPU.This acceleration can help scientists run more simulations to predict how the universe would look in different scenarios. “With our previous simulations, we showed that we could simulate the universe to discover new and interesting physics, but only at small or low-res scales,” said Rupert Croft, physics professor at Carnegie Mellon. “By incorporating machine learning, the technology is able to catch up with our ideas.”Since the current neural networks focused on how gravity moves dark matter around over time, other phenomena such as supernovae and black holes were left out of the simulations. The team next plans to extend their methods to capture the forces responsible for these events. “The universe is the biggest data set there is,” said Scott Dodelson, head of the department of physics at Carnegie Mellon and director of the National Science Foundation Planning Institute for Artificial Intelligence in Physics. And “artificial intelligence is the key to understanding the universe and revealing new physics.” Read the full article in PNAS >> Read more >> Main image from TNG SimulationsHave a story to share? Submit an idea.Get the developer news feed straight to your inbox.1	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.In this post, you learn how to deploy TensorFlow trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.First, a network is trained using any framework. After a network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized to run inference using the TensorRT runtime. To optimize models implemented in TensorFlow, the only thing you have to do is convert models to the ONNX format and use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow. In this post, we discuss how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and to the TensorRT engine with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks. Download the code examples and unzip. You can run either the TensorFlow 1 or the TensorFlow 2 code example by follow the appropriate README. After downloading the file, you should also download labels.py from the Cityscapes dataset scripts repo and place it in the same folder as the other scripts. ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format. It defines a common set of operators, common sets of building blocks of deep learning, and a common file format. It provides a definition of a computation graph, as well as built-in operators. The list of ONNX nodes that may have one or more inputs or outputs forms an acyclic graph. In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50.The workflow consists of the following steps:The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:In addition to Keras, you can also download ResNet-50 from the following locations: The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx. After installing tf2onnx, there are two ways of converting the model from a .pb file to the ONNX format. The first way is to use the command line and the second method is by using Python API. Run the following command:To create the TensorRT engine from the ONNX file, run the following command: This code should be saved in the engine.py file, and is used later in the post. This code example contains the following variable:  The builder creates an empty network (builder.create_network()) and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). You set the input shape for the network (network.get_input(0).shape = shape), after which the builder creates the engine (engine = builder.build_cuda_engine(network)). To create the engine, run the following code example:In this code example, you first get the input shape from the ONNX model. Next,  create the engine, and then save the engine in a .plan file.The TensorRT engine runs inference in the following workflow: These steps are explained in detail in the following code example.  This code should be saved in the inference.py file, and is used later in this post.   The first two lines are for determining the dimensions for input and output. You create page-locked memory buffers in host (h_input_1, h_output). Then, you allocate device memory for input and output the same size as host input and output (d_input_1, d_output). The next step is to create the CUDA stream for copying data between the allocated memory from device and host. In this code example, in the do_inference function, the first step is to load images to buffers in the host using the load_images_to_buffer function. Then the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally the results are copied from GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).In the post Fast INT8 Inference for Autonomous Vehicles with TensorRT 3, the author covered the process of UFF workflow for a semantic segmentation model.In this post, you use similar networks to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers implemented using a deconvolutional layer. The network is trained in about 40,000 iterations on the Cityscapes Dataset. There are multiple ways of converting the TensorFlow model to an ONNX file. One way is the one explained in the ResNet50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes, some of the layers are not supported in the TensorFlow-to-ONNX but they are supported in the Keras to ONNX converter. Depending on the Keras framework and the type of layers used, you may need to choose between converters. In the following code example, you directly convert the Keras model to ONNX using the Keras-to-ONNX converter. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.Figure 2 shows the architecture of the network.As in the previous example, use the following code example to create the engine for semantic segmentation.To test the output of the model, use the Cityscapes Dataset. To work with Cityscapes, you must have the following functions:   sub_mean_chw and color_map.  In the following code example, sub_mean_chw is for subtracting the mean value from the image as the preprocessing step and color_map is the mapping from the class ID to a color. The latter is used for visualization. The following code example is the rest of the code for the previous example. You must run the previous block first because you need the defined functions. Use the example to compare the output of the Keras model and TensorRT engine semantic .plan file and then visualize both outputs.  Replace the placeholders /path/to/semantic_segmentation.hdf5  and input_file_path as appropriate.Figure 3 shows the actual image and the ground truth, and the output of Keras versus the output of the TensorRT engine. As you can see, the output for the TensorRT engine is similar to the one for Keras. Now you can try the ONNX workflow on other networks. For more information about good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub. As an example, we show how to use the ONNX workflow with other networks. The network in this example is U-Net from the segmentation_models library. Here, we only loaded the model and did not train it. You may need to train these models on your preferred dataset. One important point about these networks is that when you load these networks, their input layer sizes are as follows: (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before you convert this model to ONNX, change the network by assigning the size to its input and then convert it to the ONNX format. As an example, load the U-Net network from this library (segmentation_models) and assign the size (244, 244, 3) to its input. After creating the TensorRT engine for the inference, do a similar conversion to what you did for semantic segmentation. Depending on the application and dataset, you may need to have a different color mapping.As we mentioned earlier in this post, another way of downloading pretrained models is to download them from NVIDIA NGC Models. It has a list of checkpoints for pretrained models. As an example, you can search for UNet for TensorFlow and then go to the Download page to get the latest checkpoint. In this post, we explained how to deploy deep learning applications using a TensorFlow-to-ONNX-to-TensorRT workflow, with several examples. The first example was ONNX-TensorRT on ResNet-50, and the second example was VGG16-based semantic segmentation that was trained on the Cityscapes Dataset. At the end of the post, we demonstrated how to apply this workflow on other networks.  For more information about the best performance of training and inference, see NVIDIA Data Center Deep Learning Product Performance. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.In this post, you learn how to deploy TensorFlow trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.First, a network is trained using any framework. After a network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized to run inference using the TensorRT runtime. To optimize models implemented in TensorFlow, the only thing you have to do is convert models to the ONNX format and use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow. In this post, we discuss how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and to the TensorRT engine with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks. Download the code examples and unzip. You can run either the TensorFlow 1 or the TensorFlow 2 code example by follow the appropriate README. After downloading the file, you should also download labels.py from the Cityscapes dataset scripts repo and place it in the same folder as the other scripts. ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format. It defines a common set of operators, common sets of building blocks of deep learning, and a common file format. It provides a definition of a computation graph, as well as built-in operators. The list of ONNX nodes that may have one or more inputs or outputs forms an acyclic graph. In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50.The workflow consists of the following steps:The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:In addition to Keras, you can also download ResNet-50 from the following locations: The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx. After installing tf2onnx, there are two ways of converting the model from a .pb file to the ONNX format. The first way is to use the command line and the second method is by using Python API. Run the following command:To create the TensorRT engine from the ONNX file, run the following command: This code should be saved in the engine.py file, and is used later in the post. This code example contains the following variable:  The builder creates an empty network (builder.create_network()) and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). You set the input shape for the network (network.get_input(0).shape = shape), after which the builder creates the engine (engine = builder.build_cuda_engine(network)). To create the engine, run the following code example:In this code example, you first get the input shape from the ONNX model. Next,  create the engine, and then save the engine in a .plan file.The TensorRT engine runs inference in the following workflow: These steps are explained in detail in the following code example.  This code should be saved in the inference.py file, and is used later in this post.   The first two lines are for determining the dimensions for input and output. You create page-locked memory buffers in host (h_input_1, h_output). Then, you allocate device memory for input and output the same size as host input and output (d_input_1, d_output). The next step is to create the CUDA stream for copying data between the allocated memory from device and host. In this code example, in the do_inference function, the first step is to load images to buffers in the host using the load_images_to_buffer function. Then the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally the results are copied from GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).In the post Fast INT8 Inference for Autonomous Vehicles with TensorRT 3, the author covered the process of UFF workflow for a semantic segmentation model.In this post, you use similar networks to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers implemented using a deconvolutional layer. The network is trained in about 40,000 iterations on the Cityscapes Dataset. There are multiple ways of converting the TensorFlow model to an ONNX file. One way is the one explained in the ResNet50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes, some of the layers are not supported in the TensorFlow-to-ONNX but they are supported in the Keras to ONNX converter. Depending on the Keras framework and the type of layers used, you may need to choose between converters. In the following code example, you directly convert the Keras model to ONNX using the Keras-to-ONNX converter. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.Figure 2 shows the architecture of the network.As in the previous example, use the following code example to create the engine for semantic segmentation.To test the output of the model, use the Cityscapes Dataset. To work with Cityscapes, you must have the following functions:   sub_mean_chw and color_map.  In the following code example, sub_mean_chw is for subtracting the mean value from the image as the preprocessing step and color_map is the mapping from the class ID to a color. The latter is used for visualization. The following code example is the rest of the code for the previous example. You must run the previous block first because you need the defined functions. Use the example to compare the output of the Keras model and TensorRT engine semantic .plan file and then visualize both outputs.  Replace the placeholders /path/to/semantic_segmentation.hdf5  and input_file_path as appropriate.Figure 3 shows the actual image and the ground truth, and the output of Keras versus the output of the TensorRT engine. As you can see, the output for the TensorRT engine is similar to the one for Keras. Now you can try the ONNX workflow on other networks. For more information about good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub. As an example, we show how to use the ONNX workflow with other networks. The network in this example is U-Net from the segmentation_models library. Here, we only loaded the model and did not train it. You may need to train these models on your preferred dataset. One important point about these networks is that when you load these networks, their input layer sizes are as follows: (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before you convert this model to ONNX, change the network by assigning the size to its input and then convert it to the ONNX format. As an example, load the U-Net network from this library (segmentation_models) and assign the size (244, 244, 3) to its input. After creating the TensorRT engine for the inference, do a similar conversion to what you did for semantic segmentation. Depending on the application and dataset, you may need to have a different color mapping.As we mentioned earlier in this post, another way of downloading pretrained models is to download them from NVIDIA NGC Models. It has a list of checkpoints for pretrained models. As an example, you can search for UNet for TensorFlow and then go to the Download page to get the latest checkpoint. In this post, we explained how to deploy deep learning applications using a TensorFlow-to-ONNX-to-TensorRT workflow, with several examples. The first example was ONNX-TensorRT on ResNet-50, and the second example was VGG16-based semantic segmentation that was trained on the Cityscapes Dataset. At the end of the post, we demonstrated how to apply this workflow on other networks.  For more information about the best performance of training and inference, see NVIDIA Data Center Deep Learning Product Performance. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.2	In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.To measure the performance impact of the new stream-ordered allocator in a real application, here are results from the RAPIDS GPU Big Data Benchmark (gpu-bdb). gpu-bdb is a benchmark of 30 queries representing real-world data science and machine learning workflows at various scale factors: SF1000 is 1 TB of data and SF10000 is 10 TB. Each query is, in fact, a model workflow that can include SQL, user-defined functions, careful subsetting and aggregation, and machine learning.Figure 1 shows the performance of cudaMallocAsync compared to cudaMalloc for a subset of gpu-bdb queries conducted at SF1000 on an NVIDIA DGX-2 across 16 V100 GPUs. As you can see, thanks to memory reuse and eliminating extraneous synchronization, there’s a 2–5x improvement in end-to-end performance when using cudaMallocAsync.An application can use cudaFreeAsync to free a pointer allocated by cudaMalloc. The underlying memory is not freed until the next synchronization of the stream passed to cudaFreeAsync.Similarly, an application can use cudaFree to free memory allocated using cudaMallocAsync. However, cudaFree does not implicitly synchronize in this case, so the application must insert the appropriate synchronization to ensure that all accesses to the to-be-freed memory are complete. Any application code that may be intentionally or accidentally relying on the implicit synchronization behavior of cudaFree must be updated.By default, memory allocated using cudaMallocAsync is accessible from the device associated with the specified stream. Accessing the memory from any other device requires enabling access to the entire pool from that other device. It also requires the two devices to be peer capable, as reported by cudaDeviceCanAccessPeer. Unlike cudaMalloc allocations, cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on memory allocated from memory pools.For example, consider enabling device 4access to the memory pool of device 3:Access from a device other than the device on which the memory pool resides can be revoked by using cudaMemAccessFlagsProtNone when calling cudaMemPoolSetAccess. Access from the memory pool’s own device cannot be revoked.Memory allocated using the default memory pool associated with a device cannot be shared with other processes. An application must explicitly create its own memory pools to share memory allocated using cudaMallocAsync with other processes. The following code sample shows how to create an explicit memory pool with interprocess communication (IPC) capabilities:The location type Device and location ID deviceId indicate that the pool memory must be allocated on a specific GPU. The allocation type Pinned indicates that the memory should be non-migratable, also known as non-pageable. The handle type PosixFileDescriptor indicates that the user intends to query a file descriptor for the pool to share it with another process.The first step to share memory from this pool through IPC is to query the file descriptor that represents the pool:The application can then share the file descriptor with another process, for example through a UNIX domain socket. The other process can then import the file descriptor and obtain a process-local pool handle:The next step is for the exporting process to allocate memory from the pool:There is also an overloaded version of cudaMallocAsync that takes the same arguments as cudaMallocFromPoolAsync:After memory is allocated from this pool through either of these two APIs, the pointer can then be shared with the importing process. First, the exporting process gets an opaque handle representing the memory allocation:This opaque data can then be shared with the importing process through any standard IPC mechanism, such as through shared memory, pipes, and so on The importing process then converts the opaque data into a process-local pointer:Now both processes share access to the same memory allocation. The memory must be freed in the importing process before it is freed in the exporting process. This is to ensure that the memory does not get reutilized for another cudaMallocAsync request in the exporting process while the importing process is still accessing the previously shared memory allocation, potentially causing undefined behavior.The existing function cudaIpcGetMemHandle works only with memory allocated through cudaMalloc and cannot be used on any memory allocated through cudaMallocAsync, regardless of whether the memory was allocated from an explicit pool.If the application expects to use an explicit memory pool most of the time, it can consider setting that as the current pool for the device through cudaDeviceSetMemPool. This enables the application to avoid having to specify the pool argument each time that it must allocate memory from that pool.This has the advantage that any other function allocating with cudaMallocAsync now automatically uses the new pool as its default. The current pool associated with a device can be queried using cudaDeviceGetMemPool.In general, libraries should not change a device’s pool, as doing so affects the entire top-level application. If a library must allocate memory with different properties than those of the default device pool, it may create its own pool and then allocate from that pool using cudaMallocFromPoolAsync. The library could also use the overloaded version of cudaMallocAsync that takes the pool as an argument.To make interoperability easier for applications, libraries should consider providing APIs for the top-level application to coordinate the pools used. For example, libraries could provide set or get APIs to enable the application to control the pool in a more explicit manner. The library could also take the pool as a parameter to individual APIs.When porting an existing application that uses cudaMalloc or cudaFree to the new cudaMallocAsync or cudaFreeAsync APIs, consider the following guidelines.Guidelines for determining the appropriate pool:Guidelines for setting the release threshold for all memory pools:Guidelines for replacing cudaMalloc with cudaMallocAsync:Guidelines for replacing cudaFree with cudaFreeAsync:The stream-ordered allocator and cudaMallocAsync and cudaFreeAsync API functions added in CUDA 11.2 extend the CUDA stream programming model by introducing memory allocation and deallocation as stream-ordered operations. This enables allocations to be scoped to the kernels, which use them while avoiding costly device-wide synchronization that can occur with traditional cudaMalloc/cudaFree.Furthermore, these API functions add the concept of memory pools to CUDA, enabling the reuse of memory to avoid costly system calls and improve performance. Use the guidelines to migrate your existing code and see how much your application performance improves!Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Edge computing has been around for a long time, but has recently become a hot topic because of the convergence of three major trends – IoT, 5G, and AI. IoT devices are becoming smarter and more capable, increasing the breadth of applications that can be deployed on them and the environments they can be deployed in. Simultaneously, recent advancements in 5G capabilities give confidence that this technology will soon be able to connect IoT devices wirelessly anywhere they are deployed. In fact, analysts predict that there will be over 1 billion 5G connected devices by 2023. Lastly, AI successfully moved from research projects into practical applications, changing the landscape for retailers, factories, hospitals, and many more. So what does the convergence of these trends mean? An explosion in the number of IoT devices deployed. Experts estimate there are over 30 billion IoT devices installed today, and Arm predicts that by 2035, there will be over 1 trillion devices. With that many IoT devices deployed, the amount of data collected skyrocketed, putting strain on current cloud infrastructures. Organizations soon found themselves in a position where the AI applications they deployed needed large amounts of data to generate compelling insights, but the latency for their cloud infrastructure to process data and send insights back to the edge were unsustainable. So they turned to edge computing. By putting the processing power at the location that sensors are collecting data, organizations reduce the latency for applications to deliver insights. For some situations, such as autonomous machines at factories, the latency reduction represents a critical safety component. That is where NVIDIA comes in. The NVIDIA Edge AI solution offers a complete end-to-end AI platform for deploying AI at the edge. It starts with NVIDIA-Certified Systems. NVIDIA-Certified Systems combine the computing power of NVIDIA GPUs with secure high-bandwidth, low-latency networking solutions from NVIDIA. Validated for performance, functionality, scalability, and security – IT teams ensure AI workloads deployed from the NGC catalog, NVIDIA’s GPU-optimized hub of HPC and AI software, run at full performance. These servers are backed by enterprise-grade support, including direct access to NVIDIA experts, minimizing system downtime and maximizing user productivity. To build and accelerate applications running on NVIDIA-Certified Systems, NVIDIA offers an extensive toolkit of SDKs, application frameworks, and other tools designed to help developers build AI applications for every industry. These include pretrained models, training scripts, optimized framework containers, inference engines, and more. With these tools, organizations get a head start on building unique AI applications regardless of workload or industry. Once organizations have the hardware to accelerate AI and an AI application to deploy, the next step is to ensure that there is infrastructure in place to manage and scale the application. Without a platform to manage AI at the edge, organizations face the difficult and costly task of manually updating systems at edge locations every time a new software update is released. NVIDIA Fleet Command is a cloud service that securely deploys, manages, and scales AI applications across distributed edge infrastructure. Purpose-built for AI, Fleet Command is a turnkey solution for AI lifecycle management, offering streamlined deployments, layered security, and detailed monitoring capabilities — so organizations can go from zero to AI in minutes.The complete edge AI solution gives organizations the tools needed to build an end-to-end edge deployment. KION Group, the number one global supply chain solutions provider, uses NVIDIA solutions to fulfill order faster and more efficiently. To learn more about NVIDIA edge AI solutions, check out Deploying and Accelerating AI at the Edge With the NVIDIA EGX Platform.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.3	The NVIDIA NGC team is hosting a webinar with live Q&A to dive into our new Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey.NVIDIA NGC Jupyter Notebook Day: Building a 3D Medical Imaging Segmentation ModelThursday, July 22 at 9:00 AM PTImage segmentation deals with placing each pixel (or voxel in the case of 3D) of an image into specific classes that share common characteristics. In medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and more. However, building, training, and optimizing an accurate image segmentation AI model from scratch can be time consuming for novices and experts alike.By joining this webinar, you will learn:Register now >>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NVIDIA NGC team is hosting a webinar with live Q&A to dive into our new Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey.NVIDIA NGC Jupyter Notebook Day: Building a 3D Medical Imaging Segmentation ModelThursday, July 22 at 9:00 AM PTImage segmentation deals with placing each pixel (or voxel in the case of 3D) of an image into specific classes that share common characteristics. In medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and more. However, building, training, and optimizing an accurate image segmentation AI model from scratch can be time consuming for novices and experts alike.By joining this webinar, you will learn:Register now >>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.4	NVIDIA announces our newest release of the CUDA development environment consisting of GPU-accelerated libraries, debugging and optimization tools, an updated C/C++ compiler, and a runtime library to build and deploy your application on major architectures including NVIDIA Ampere, x86, Arm server processors, and POWER. The latest release, CUDA 11.4, and its features are focused on enhancing the programming model, new language support, and performance of your CUDA applications.Click here to read the recently published blog post that provides an overview of all the new features in CUDA 11.4.Key features:CUDA 11.4 ships with R470 driver. The driver now includes GPUDirect RDMA, as well as GPUDirect Storage packages that streamline and enable you to leverage these technologies without the need for separate installation of additional packages. The driver also enables new MIG configurations for the recently launched NVIDIA A30 GPU to double the amount of memory per MIG slice. This results in optimal performance for various workloads on the A30 GPU, especially for AI Inference workloads.Looking to learn more about the release? We will be publishing a technical blog post next week that will take a deep dive in all the new CUDA 11.4 features. Resources:Learn More & Download NowCUDA 11.4 Developer Blog postHave a story to share? Submit an idea.Get the developer news feed straight to your inbox.	As an undergraduate student excited about AI for healthcare applications, I was thrilled to be joining the NVIDIA Clara Deploy team for an internship. It was the perfect combination: the opportunity to work at a leading technology company enabling the acceleration and adoption of AI while contributing to a team building the future (and the present!) of AI deployment for healthcare. The next few months were filled with learning from brilliant yet humble colleagues, picking up new skills like CUDA programming, and the opportunity to focus on unique technical challenges posed by histopathology data.The Clara Deploy SDK is a container-based, cloud-native development and deployment framework for multi-AI and multidomain workflows in smart hospitals. It enables you to define container-based pipelines consisting of multiple stages, each stage defined by an operator. A pipeline consists of multiple operators and is a directed acyclic graph (DAG) from the data source to the data sink. Each operator is a step of the pipeline, such as loading input, preprocessing, AI inference, and so on.As I explored setting up the NVIDIA Clara Deploy platform and running AI inference pipelines, I gained firsthand experience in the challenges of deploying AI workflows, particularly in standardizing workflows and scaling up execution. While running digital pathology pipelines, I gained awareness of the performance bottleneck of I/O and preprocessing steps that are usually not GPU-accelerated. This influenced my choice to focus on accelerating preprocessing filters for digital pathology during my internship.cuCIM is a RAPIDS library for accelerated n-dimensional image processing and image I/O, with a focus on medical imaging applications. cuCIM consists of I/O, file system, and operation modules. Operations in cuCIM can be extended using a plug-in architecture. cuCIM is uniquely positioned to be a leading library for medical image-processing applications, and I am excited to have gained exposure to and contributed to it during my time at NVIDIA.A significant challenge in the digitization of histopathology analysis is the stain variation observed in pathology images. These images can have large variations in staining caused by multiple factors, including stain vendors, storage conditions, staining protocols, digital scanners, and so on.Given the range of factors, it is impractical to control for staining variation during image acquisition. Instead, an image preprocessing step called stain normalization is often used to algorithmically standardize image staining. A stain normalization filter accepts as input a source image and a target image. The source image is to be stain normalized, and the target image contains the ideal stain, to be transferred to the source image. Ultimately, a normalized source image is returned as output.Prior work has shown that stain normalization used as a preprocessing step in digital pathology AI pipelines can shorten training time, improve accuracy, and enable data from different sources to be used together. Because you are operating in a relatively small data regime due to the scarcity of stained pathology images, stain normalization enables you to optimize the signal obtained amidst noisy stain variations.However, prior implementations of stain normalization were relatively slow as they were not GPU-accelerated. There was an opportunity to implement a GPU-accelerated stain normalization algorithm and enable fast and effective preprocessing for digital pathology AI pipelines.Stain normalization methods fall into three broad categories:For more information, see Stain Color Adaptive Normalization (SCAN) algorithm: Separation and standardization of histological stains in digital pathology.I chose to focus on stain deconvolution-based methods, as prior literature showed greater performance compared to global color normalization and better theoretical guarantees regarding the maintenance of biological structure integrity compared to deep network-based methods.Stain deconvolution-based methods assume that each image is characterized by a stain matrix, which contains the red, green, blue (RGB) values for each of the two stains in H&E stained images: hematoxylin and eosin.Using the Beer-Lambert law, an RGB image is transformed into an optical density image. Then, the optical density image may be related to the product of a pixel concentration matrix and the stain matrix for that image. The pixel concentration matrix indicates the concentration of each stain for each pixel. If the stain matrix is estimated, done here with the Macenko method, then the concentration matrix may be obtained.Finally, for stain normalization, the stain matrix of a source image is replaced with the stain matrix of a target image. This serves the purpose of transferring the stain profile from the target image to the source image. Because the concentration matrix of the source image is unchanged, the morphology of the biological structures is maintained. The Macenko method for estimating the stain matrix is an unsupervised method using the singular value decomposition.I designed and implemented a filter for the Macenko method for stain normalization in CuPy, after modifying an existing version in NumPy. Next, I compared the performance of the two.Figure 3 shows the relative performance of the NumPy and CuPy implementations of stain normalization for different image sizes, using an NVIDIA DGX-1. Performance for the CuPy implementation is plotted in terms of acceleration factor relative to the NumPy implementation.Given the goal of enabling GPU-accelerated stain normalization to be used as a preprocessing step for digital pathology pipelines, I began the integration of this filter as a transform (array-based and dictionary-based) into MONAI. MONAI is an open-source, PyTorch-based framework for deep learning in medical imaging. After being fully integrated, the stain normalization transform can be added to pathology pipelines in Clara Train or MONAI.Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, which is a commonly used function available in scikit-image and the cuCIM Python layer, among other libraries. Color space conversion from RGB to HED is closely related to stain normalization, as this function involves obtaining stain concentration values, assuming that the stain vectors are a constant, precalculated approximation. This ignores variations between the staining of different images. This function is to be integrated into cuCIM through a C++ based operator plugin mechanism.I compared the performance of a pure C++ implementation and the CUDA C++ implementation. Figure 4 shows the relative performance of the two versions, for different image sizes, using an NVIDIA GV100 GPU and Intel(R) Core(TM) i7-7800X CPU. Performance for the CUDA C++ implementation is plotted in terms of acceleration factor relative to the pure C++ implementation.It’s important to note that the performance gains do not account for any transfer of data to and from the GPU. I did this because I am considering the common scenario where data transfers are minimized by remaining on the GPU for several subsequent operations in an image processing workflow, with transfer back to the host occurring only at the end.In summary, my internship project was focused on accelerating color conversion filters for digital pathology. Specifically, I worked on designing and implementing the Macenko stain normalization method, using CuPy for GPU-acceleration. I began the integration of this into MONAI as a transform, for future use as a preprocessing step for digital pathology pipelines. Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, to be integrated into cuCIM through a C++ based operator plugin mechanism.Both the CuPy implementation of Macenko stain normalization and the CUDA C++ implementation of the rgb2hed function showed significant performance gains compared to the NumPy version and pure C++ version, respectively. The stain normalization preprocessing time for training a pipeline over 500 epochs with a dataset of 250 images and image size of 4000 by 4000 pixels is roughly estimated at 13 days with the NumPy-based filter. It decreases to 3.5 hours for the CuPy-based filter.Ultimately, accelerating pre– and post-processing filters for digital pathology can improve the performance of deep learning pipelines in digital pathology, expedite the adoption of digital pathology, and enable AI to revolutionize pathology.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.5	Relying on the capabilities of GPUs, a team from Facebook AI Research has developed a faster, more efficient way for AI to run similarity searches. The study, published in IEEE Transactions on Big Data, creates a deep learning algorithm capable of handling and comparing high-dimensional data from media that is notably faster, while just as accurate as previous techniques. In a world with an ever-growing supply of data, the work promises to ease both the compute power and time needed for processing large libraries.“The most straightforward technique for searching and indexing [high-dimensional data] is by brute-force comparison, whereby you need to check [each image] against every other image in the database. This is impractical for collections containing billions of vectors,” Jeff Johnson, study colead and a research engineer at Facebook, said in a press release.Containing millions of pixels and data points, every image and video creates billions of vectors. This large amount of data is valuable for analyzing, detecting, indexing, and comparing vectors. It is also problematic for calculating similarities of large libraries with traditional CPU algorithms that rely on several supercomputer components, slowing down overall computing time.Using only four GPUs with CUDA, the researchers designed an algorithm for GPUs to both host and analyze library image data points. The method also compresses the data, making it easier, and thus faster to analyze.  The new algorithm processed over 95 million high-dimensional images in 35 minutes. A graph of a billion vectors took less than 12 hours to compute. According to a comparison test in the study, handling the same database with a cluster of 128 CPU servers took 108.7 hours-about 8.5x longer.“By keeping computations purely on a GPU, we can take advantage of the much faster memory available on the accelerator, instead of dealing with the slower memories of CPU servers and even slower machine-to-machine network interconnects within a traditional supercomputer cluster,” said Johnson. The researchers state the methods are already being applied to a wide variety of tasks, including a language processing search for translations. Known as the Facebook AI Similarity Search library, the approach is open source for implementation, testing, and comparison. Read more >>>Read the full article in IEEE Transactions on Big Data >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	As an undergraduate student excited about AI for healthcare applications, I was thrilled to be joining the NVIDIA Clara Deploy team for an internship. It was the perfect combination: the opportunity to work at a leading technology company enabling the acceleration and adoption of AI while contributing to a team building the future (and the present!) of AI deployment for healthcare. The next few months were filled with learning from brilliant yet humble colleagues, picking up new skills like CUDA programming, and the opportunity to focus on unique technical challenges posed by histopathology data.The Clara Deploy SDK is a container-based, cloud-native development and deployment framework for multi-AI and multidomain workflows in smart hospitals. It enables you to define container-based pipelines consisting of multiple stages, each stage defined by an operator. A pipeline consists of multiple operators and is a directed acyclic graph (DAG) from the data source to the data sink. Each operator is a step of the pipeline, such as loading input, preprocessing, AI inference, and so on.As I explored setting up the NVIDIA Clara Deploy platform and running AI inference pipelines, I gained firsthand experience in the challenges of deploying AI workflows, particularly in standardizing workflows and scaling up execution. While running digital pathology pipelines, I gained awareness of the performance bottleneck of I/O and preprocessing steps that are usually not GPU-accelerated. This influenced my choice to focus on accelerating preprocessing filters for digital pathology during my internship.cuCIM is a RAPIDS library for accelerated n-dimensional image processing and image I/O, with a focus on medical imaging applications. cuCIM consists of I/O, file system, and operation modules. Operations in cuCIM can be extended using a plug-in architecture. cuCIM is uniquely positioned to be a leading library for medical image-processing applications, and I am excited to have gained exposure to and contributed to it during my time at NVIDIA.A significant challenge in the digitization of histopathology analysis is the stain variation observed in pathology images. These images can have large variations in staining caused by multiple factors, including stain vendors, storage conditions, staining protocols, digital scanners, and so on.Given the range of factors, it is impractical to control for staining variation during image acquisition. Instead, an image preprocessing step called stain normalization is often used to algorithmically standardize image staining. A stain normalization filter accepts as input a source image and a target image. The source image is to be stain normalized, and the target image contains the ideal stain, to be transferred to the source image. Ultimately, a normalized source image is returned as output.Prior work has shown that stain normalization used as a preprocessing step in digital pathology AI pipelines can shorten training time, improve accuracy, and enable data from different sources to be used together. Because you are operating in a relatively small data regime due to the scarcity of stained pathology images, stain normalization enables you to optimize the signal obtained amidst noisy stain variations.However, prior implementations of stain normalization were relatively slow as they were not GPU-accelerated. There was an opportunity to implement a GPU-accelerated stain normalization algorithm and enable fast and effective preprocessing for digital pathology AI pipelines.Stain normalization methods fall into three broad categories:For more information, see Stain Color Adaptive Normalization (SCAN) algorithm: Separation and standardization of histological stains in digital pathology.I chose to focus on stain deconvolution-based methods, as prior literature showed greater performance compared to global color normalization and better theoretical guarantees regarding the maintenance of biological structure integrity compared to deep network-based methods.Stain deconvolution-based methods assume that each image is characterized by a stain matrix, which contains the red, green, blue (RGB) values for each of the two stains in H&E stained images: hematoxylin and eosin.Using the Beer-Lambert law, an RGB image is transformed into an optical density image. Then, the optical density image may be related to the product of a pixel concentration matrix and the stain matrix for that image. The pixel concentration matrix indicates the concentration of each stain for each pixel. If the stain matrix is estimated, done here with the Macenko method, then the concentration matrix may be obtained.Finally, for stain normalization, the stain matrix of a source image is replaced with the stain matrix of a target image. This serves the purpose of transferring the stain profile from the target image to the source image. Because the concentration matrix of the source image is unchanged, the morphology of the biological structures is maintained. The Macenko method for estimating the stain matrix is an unsupervised method using the singular value decomposition.I designed and implemented a filter for the Macenko method for stain normalization in CuPy, after modifying an existing version in NumPy. Next, I compared the performance of the two.Figure 3 shows the relative performance of the NumPy and CuPy implementations of stain normalization for different image sizes, using an NVIDIA DGX-1. Performance for the CuPy implementation is plotted in terms of acceleration factor relative to the NumPy implementation.Given the goal of enabling GPU-accelerated stain normalization to be used as a preprocessing step for digital pathology pipelines, I began the integration of this filter as a transform (array-based and dictionary-based) into MONAI. MONAI is an open-source, PyTorch-based framework for deep learning in medical imaging. After being fully integrated, the stain normalization transform can be added to pathology pipelines in Clara Train or MONAI.Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, which is a commonly used function available in scikit-image and the cuCIM Python layer, among other libraries. Color space conversion from RGB to HED is closely related to stain normalization, as this function involves obtaining stain concentration values, assuming that the stain vectors are a constant, precalculated approximation. This ignores variations between the staining of different images. This function is to be integrated into cuCIM through a C++ based operator plugin mechanism.I compared the performance of a pure C++ implementation and the CUDA C++ implementation. Figure 4 shows the relative performance of the two versions, for different image sizes, using an NVIDIA GV100 GPU and Intel(R) Core(TM) i7-7800X CPU. Performance for the CUDA C++ implementation is plotted in terms of acceleration factor relative to the pure C++ implementation.It’s important to note that the performance gains do not account for any transfer of data to and from the GPU. I did this because I am considering the common scenario where data transfers are minimized by remaining on the GPU for several subsequent operations in an image processing workflow, with transfer back to the host occurring only at the end.In summary, my internship project was focused on accelerating color conversion filters for digital pathology. Specifically, I worked on designing and implementing the Macenko stain normalization method, using CuPy for GPU-acceleration. I began the integration of this into MONAI as a transform, for future use as a preprocessing step for digital pathology pipelines. Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, to be integrated into cuCIM through a C++ based operator plugin mechanism.Both the CuPy implementation of Macenko stain normalization and the CUDA C++ implementation of the rgb2hed function showed significant performance gains compared to the NumPy version and pure C++ version, respectively. The stain normalization preprocessing time for training a pipeline over 500 epochs with a dataset of 250 images and image size of 4000 by 4000 pixels is roughly estimated at 13 days with the NumPy-based filter. It decreases to 3.5 hours for the CuPy-based filter.Ultimately, accelerating pre– and post-processing filters for digital pathology can improve the performance of deep learning pipelines in digital pathology, expedite the adoption of digital pathology, and enable AI to revolutionize pathology.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.6	NVIDIA and Mozilla are proud to announce the latest release of the Common Voice dataset, with over 13,000 hours of crowd-sourced speech data, and adding another 16 languages to the corpus. Common Voice is the world’s largest open data voice dataset and designed to democratize voice technology. It is used by researchers, academics, and developers around the world.  Contributors mobilize their own communities to donate speech data to the MCV public database, which anyone can then use to train voice-enabled technology.  As part of NVIDIA’s collaboration with Mozilla Common Voice, the models trained on this and other public datasets are made available for free via an open-source toolkit called NVIDIA NeMo. Highlights of this release include:Pretrained Models:NVIDIA has released multilingual speech recognition models in NGC for free as part of the partnership mission to democratize voice technology. NeMo is an open-source toolkit for researchers developing state-of-the-art conversational AI models. Researchers can further fine-tune these models on multilingual datasets. See an example in this notebook that fine tunes an English speech recognition model on the MCV Japanese dataset. Contribute Your Voice, and Validate Samples: The dataset relies on the amazing effort and contribution from many communities across the world. Take the time to feed back into the dataset by recording your voice and validating samples from other contributors: https://commonvoice.mozilla.org/speakYou can download the latest MCV dataset from https://commonvoice.mozilla.org/datasets, including the repo for full stats https://github.com/common-voice/cv-dataset/, and NVIDIA NeMo from NGC Catalog and GitHub.Dataset ‘Ask Me Anything’:August 4, 2021 from 3:00 – 4:00 p.m. UTC / 2:00 – 3:00 p.m. EDT / 11:00 a.m. – 12:00 p.m. PDT:In celebration of the dataset release, on August 4th Mozilla is hosting an AMA discussion with Lead Engineer Jenny Zhang. Jenny will be available to answer your questions live, to join and ask a question please use the following AMA discourse topic.   Read more > >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	NVIDIA and Mozilla are proud to announce the latest release of the Common Voice dataset, with over 13,000 hours of crowd-sourced speech data, and adding another 16 languages to the corpus. Common Voice is the world’s largest open data voice dataset and designed to democratize voice technology. It is used by researchers, academics, and developers around the world.  Contributors mobilize their own communities to donate speech data to the MCV public database, which anyone can then use to train voice-enabled technology.  As part of NVIDIA’s collaboration with Mozilla Common Voice, the models trained on this and other public datasets are made available for free via an open-source toolkit called NVIDIA NeMo. Highlights of this release include:Pretrained Models:NVIDIA has released multilingual speech recognition models in NGC for free as part of the partnership mission to democratize voice technology. NeMo is an open-source toolkit for researchers developing state-of-the-art conversational AI models. Researchers can further fine-tune these models on multilingual datasets. See an example in this notebook that fine tunes an English speech recognition model on the MCV Japanese dataset. Contribute Your Voice, and Validate Samples: The dataset relies on the amazing effort and contribution from many communities across the world. Take the time to feed back into the dataset by recording your voice and validating samples from other contributors: https://commonvoice.mozilla.org/speakYou can download the latest MCV dataset from https://commonvoice.mozilla.org/datasets, including the repo for full stats https://github.com/common-voice/cv-dataset/, and NVIDIA NeMo from NGC Catalog and GitHub.Dataset ‘Ask Me Anything’:August 4, 2021 from 3:00 – 4:00 p.m. UTC / 2:00 – 3:00 p.m. EDT / 11:00 a.m. – 12:00 p.m. PDT:In celebration of the dataset release, on August 4th Mozilla is hosting an AMA discussion with Lead Engineer Jenny Zhang. Jenny will be available to answer your questions live, to join and ask a question please use the following AMA discourse topic.   Read more > >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.7	Today, NVIDIA announced TensorRT 8.0 which brings BERT-Large inference latency down to 1.2 ms with new optimizations. This version also delivers 2x the accuracy for INT8 precision with Quantization Aware Training, and significantly higher performance through support for Sparsity, which was introduced in Ampere GPUs.TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and runtime that delivers low latency and high throughput. TensorRT is used across industries such as Healthcare, Automotive, Manufacturing, Internet/Telecom services, Financial Services, Energy, and has been downloaded nearly 2.5 million times.There have been several kinds of new transformer-based models used across conversational AI. New generalized optimizations in TensorRT can accelerate all such models reducing inference time to half the time vs TensorRT 7.Highlights from this version include:You can learn more about Sparsity here.One of the biggest social media platforms in China, WeChat accelerates its search using TensorRT serving 500M users a month.“We have implemented TensorRT-and-INT8 QAT-based model inference acceleration to accelerate core tasks of WeChat Search such as Query Understanding and Results Ranking. The conventional limitation of NLP model complexity has been broken-through by our solution with GPU + TensorRT, and BERT/Transformer can be fully integrated in our solution. In addition, we have achieved significant reduction (70%) in allocated computational resources using superb performance optimization methods. ” – Huili/Raccoonliu/Dickzhu, WeChat SearchNVIDIA TensorRT is freely available to members of the NVIDIA Developer Program. To learn more, visit the TensorRT product page.To learn more about TensorRT 8 and its features:Follow these GTC Sessions to get yourself familiar with Technologies:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Human pose estimation is a popular computer vision task of estimating key points on a person’s body such as eyes, arms, and legs. This can help classify a person’s actions, such as standing, sitting, walking, lying down, jumping, and so on.Understanding the context of what a person might be doing in a scene has broad application across a wide range of industries. In a retail setting, this information can be used to understand customer behavior, enhance security, and provide richer analytics. In healthcare, this can be used to monitor patients and alert medical personnel if the patient needs immediate attention. On a factory floor, human pose can be used to identify if proper safety protocols are being followed.In general, this is a reliable approach in applications that require understanding of human activity and commonly used as one of the key components in more complex tasks such as gesture, tracking, anomaly detection, and so on.Open-source methods of developing pose estimation exist but are not optimal in terms of inference performance and are time consuming to integrate into production applications. With this post, we show you how to develop and deploy pose estimation models that are easy to use across device profiles, perform extremely well, and are highly accurate.Pose estimation has been integrated with the NVIDIA Transfer Learning Toolkit (TLT) 3.0 so that you can take advantage of all the TLT features, like model pruning and quantization, to create both an accurate and a high-performance model. After it’s trained, you can deploy this model for inference for real-time performance.This post series walks you through the steps of training, optimizing, deploying a real-time high performance pose estimation model. In part 1, you learn how to train a 2D pose estimation model using open-source COCO dataset. In part 2, you learn how to optimize the model for inference throughput and then deploy the model using TLT CV inference pipeline. We compare the trained model from TLT with other state-of-the-art models.In this section, we cover the following topics on training a 2D pose estimation model with TLT:The BodyPoseNet model aims to predict the skeleton for every person in a given input image, which consists of keypoints and the connections between them.The two commonly used approaches to pose estimation are top-down and bottom-up. A top-down approach typically uses an object detection network to localize the bounding boxes of all humans in a frame, and then uses a pose network to localize the body parts within that bounding box. A bottom-up approach, as the name suggests, builds the skeleton from bottom-up. It first detects all human body parts within a frame and then uses a methodology to group the parts that belong to a specific person.There are several reasons to adopt a bottom-up approach. One is higher inference performance. With a bottom-up approach, there is no need for a separate person detector, unlike top-down pose estimation methods. The compute does not scale linearly with the number of persons in the scene. This enables you to achieve real-time performance for crowded scenes as well. Moreover, bottom-up also has the advantage of having global context as the entire image is provided as input to the network. It can handle complex poses and crowding better.Given some of those reasons, this approach aims to achieve efficient single-shot, bottom-up pose estimation while also delivering competitive accuracy. The default model used in this post is a fully convolutional model and consists of a backbone network, an initial prediction stage which does a pixel-wise prediction of confidence maps (heatmap) and part-affinity fields (PAF) followed by multistage refinement (0 to N stages) on the initial predictions. This solution simplifies and abstracts much of the complexities of the bottom-up approach while allowing for the necessary knobs to be tuned for specific applications.PAFs are one way to represent association scores in a bottom-up approach. For more information, see Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. It consists of a set of 2D vector fields that encode the location and orientation of limbs. This, in association with the heatmap, is used to build up the skeleton during post-processing by performing a bipartite matching and associating body part candidates.NVIDIA TLT toolkit helps abstract away the AI/DL framework complexity and enables you to build production quality models faster, with no coding required. For more information about hardware and software requirements, setting up required dependencies, and installing the TLT launcher, see the TLT Quick Start Guide.Download the latest samples using the following command:You can find the sample notebook located at tlt_cv_samples:v1.1.0/bpnet, which also includes all the steps in detail.Set up env variables for cleaner command line commands. Update the following variable values:To run the TLT launcher, map the ~/tlt-experiments directory on the local machine to the Docker container using the ~/.tlt_mounts.json file. For more information, see TLT Launcher.Create the ~/.tlt_mounts.json file and update the following content inside:Make sure that the source directory paths to be mounted are valid. This mounts the path /home/<username>/tlt-experiments on the host machine to be the path /workspace/tlt-experiments inside the container. It also mounts the downloaded specs on the host machine to be the path /workspace/examples/bpnet/specs, /workspace/examples/bpnet/data_pose_config, and /workspace/examples/bpnet/model_pose_config inside the container.Make sure that you have installed the required dependencies by running the following command:To get started, set up an NGC account and then download the pretrained model. Currently, only the vgg19 backbone is supported.We use the COCO (common objects on context) 2017 dataset in this post as an example. Download the dataset and extract as per the instructions:Unzip the images directories into the $LOCAL_DATA_DIR directory and the annotations into $LOCAL_DATA_DIR/annotations.To prepare the data for training, you must generate segmentation masks to be used for masking the loss of unlabeled persons and tfrecords to feed to the training pipeline. The mask folder is based on the path provided in the coco_spec.json file. mask_root_dir_path directory is a relative path to root_directory_path, as are mask_root_dir_path and annotation_root_dir_path.To use this example with a custom dataset:For more information, see the following docs:The next step is to configure the spec file for training. The experiment spec file is essential, as it compiles all the necessary hyperparameters for achieving a good model. The specification file for BodyPoseNet training configures these components of the training pipe:You can find the default specification file at $SPECS_DIR/bpnet_train_m1_coco.yaml. We expand on each component of the specification file but we don’t cover all the parameters here. For more information, see Create a Train Experiment Configuration File.The top-level experiment configs include basic parameters for an experiment; for example, number of epochs, pretrained weights, whether to load the pretrained graph, and so on. An encrypted checkpoint is saved per the checkpoint_n_epoch value. Here’s a code example of some of the top-level configs.All the paths (checkpoint_dir and pretrained_weights) are internal to the Docker container. To verify correctness, check ~/.tlt_mounts.json. For more information about these parameters, see the Body Pose Trainer section.This section helps you with defining datapaths, image configuration, the target pose configuration, normalization parameters, and so on. The augmentation_config section provides some on-the-fly augmentation options. It supports basic spatial augmentations, such as flip, zoom, rotate, and translate, which can be configured before training experiments. The label_processor_config section provides the required parameters to configure the ground truth feature map generation.For more information about each parameter, see the Dataloader section.The BodyPoseNet model can be configured using the model option in the spec file. The following is a sample model config to instantiate a custom VGG19-backbone-based model.The number of total stages for pose estimation (stages of refinement + 1) in the network is captured by the stages param which takes any value >= 2. We recommend using the L1 regularizer when training a network before pruning, as L1 regularization makes it easier to prune the network weights. For more information about each parameter in the model, see the Model section.This section describes how to configure the optimizer and learning-rate schedule:The default base_learning_rate is set for a single-GPU training. To use multi-GPU training, you may have to modify the learning_rate value to get similar accuracy. In most cases, scaling up the learning rate by a factor of $NUM_GPUS would be a good start. For instance, if you are using two GPUs, use 2 * base_learning_rate used in one GPU setting, and if you are using four GPUs, use 4 * base_learning_rate. For more information about each parameter in the model, see the Optimizer section.After following the steps to generate TFRecords and masks and setting up a train specification file, you are now ready to start training the body pose estimation network. Use the following command to launch training:Training with more GPUs enables networks to ingest more data faster, saving you precious time during the development process. TLT supports multi-GPU training so that you can train the model with several GPUs in parallel. We recommend using four GPUs or more for training the model as one GPU might take several days to complete. The training time roughly decreases by a factor of $NUM_GPUS. Make sure that you update the learning rates accordingly, based on the linear scaling method described in the Optimizer section.BodyPoseNet supports restarting from checkpoint. In case the training job is killed prematurely, you may resume training from the last saved checkpoint by simply rerunning the same command. Make sure that you use the same number of GPUs when restarting the training.Start with configuring the inference and evaluation specification file. The following code example is a sample specification:The value of input_shape here can be different from the input_dims value used for training. The multi_scale_inference parameter enables multiscale refinement over the provided scales. Because you are using a model of stride 8, output_upsampling_factor is set to 8.To keep the evaluation consistent with bottom-up human pose estimation research, there are two modes and specification files to evaluate the model:There is another mode used primarily to verify against the final exported TRT models. You use this in later sections.The --model_filename argument overrides the model_path variable in the inference specification file.To evaluate the model, use the following command:Now that you’ve trained the model, run inference and verify the predictions. To verify the model visually with TLT, use the tlt bpnet inference command. The tool supports running inference on the .tlt model, as well as the TensorRT .engine model. It generates annotated images with skeleton rendered on them and serialized frame-by-frame keypoint labels and metadata in detections.json. For example, to run inference with a trained .tlt model, run the following command:Figure 1 shows an example of the original image and Figure 2 shows the output image with pose results rendered. As you can see, the model is robust to an image that is different from the COCO training data.In this post, you learned about training body pose models using the BodyPoseNet app in TLT. The post showed taking an open-source COCO dataset with a pretrained backbone from NGC to train a model with TLT. To optimize the trained model for inference and deployment, see Training and Optimizing the 2D Pose Estimation Model, Part 2.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.8	Researchers, developers, and engineers worldwide are gathering virtually this year for the annual Conference on Computer Vision and Pattern Recognition (CVPR) from June 19th to June 25th. Throughout the week, NVIDIA Research will present their recent computer vision-related projects via presentations and interactive Q&As. The nearly 30 accepted papers from NVIDIA range from simulating dynamic driving environments, to powering neural architecture search for medical imaging. Here are a few featured papers:DriveGAN: Towards a Controllable High-Quality Neural SimulationAuthors: Seung Wook Kim (University of Toronto, NVIDIA)*; Jonah Philion (University of Toronto, NVIDIA); Antonio Torralba (MIT); Sanja Fidler (University of Toronto, NVIDIA)DriveGAN is a fully differentiable simulator, it further allows for re-simulation of a given video sequence, offering an agent to drive through a recorded scene again, possibly taking different actions. The talk will be live on Tuesday, June 22, 2021 at 10:00pm ESTDiNTS: Differentiable Neural Network Topology Search for 3D Medical Image SegmentationAuthors: Yufan He (Johns Hopkins University)*; Dong Yang (NVIDIA); Holger R Roth (NVIDIA); Can Zhao (NVIDIA); Daguang Xu (NVIDIA)From the abstract: In this work, we focus on three important aspects of NAS in 3D medical image segmentation: flexible multi-path network topology, high search efficiency, and budgeted GPU memory usage. Our method achieves the state-of-the-art performance and the top ranking on the MSD challenge leaderboard.The talk will be live on Tuesday, June 22, 2021 at 10:00 pm ESTTo view the complete list of NVIDIA Research accepted papers, workshop and tutorials, demos, and to explore job opportunities at NVIDIA, visit the NVIDIA at CVPR 2021 website.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Edge computing has been around for a long time, but has recently become a hot topic because of the convergence of three major trends – IoT, 5G, and AI. IoT devices are becoming smarter and more capable, increasing the breadth of applications that can be deployed on them and the environments they can be deployed in. Simultaneously, recent advancements in 5G capabilities give confidence that this technology will soon be able to connect IoT devices wirelessly anywhere they are deployed. In fact, analysts predict that there will be over 1 billion 5G connected devices by 2023. Lastly, AI successfully moved from research projects into practical applications, changing the landscape for retailers, factories, hospitals, and many more. So what does the convergence of these trends mean? An explosion in the number of IoT devices deployed. Experts estimate there are over 30 billion IoT devices installed today, and Arm predicts that by 2035, there will be over 1 trillion devices. With that many IoT devices deployed, the amount of data collected skyrocketed, putting strain on current cloud infrastructures. Organizations soon found themselves in a position where the AI applications they deployed needed large amounts of data to generate compelling insights, but the latency for their cloud infrastructure to process data and send insights back to the edge were unsustainable. So they turned to edge computing. By putting the processing power at the location that sensors are collecting data, organizations reduce the latency for applications to deliver insights. For some situations, such as autonomous machines at factories, the latency reduction represents a critical safety component. That is where NVIDIA comes in. The NVIDIA Edge AI solution offers a complete end-to-end AI platform for deploying AI at the edge. It starts with NVIDIA-Certified Systems. NVIDIA-Certified Systems combine the computing power of NVIDIA GPUs with secure high-bandwidth, low-latency networking solutions from NVIDIA. Validated for performance, functionality, scalability, and security – IT teams ensure AI workloads deployed from the NGC catalog, NVIDIA’s GPU-optimized hub of HPC and AI software, run at full performance. These servers are backed by enterprise-grade support, including direct access to NVIDIA experts, minimizing system downtime and maximizing user productivity. To build and accelerate applications running on NVIDIA-Certified Systems, NVIDIA offers an extensive toolkit of SDKs, application frameworks, and other tools designed to help developers build AI applications for every industry. These include pretrained models, training scripts, optimized framework containers, inference engines, and more. With these tools, organizations get a head start on building unique AI applications regardless of workload or industry. Once organizations have the hardware to accelerate AI and an AI application to deploy, the next step is to ensure that there is infrastructure in place to manage and scale the application. Without a platform to manage AI at the edge, organizations face the difficult and costly task of manually updating systems at edge locations every time a new software update is released. NVIDIA Fleet Command is a cloud service that securely deploys, manages, and scales AI applications across distributed edge infrastructure. Purpose-built for AI, Fleet Command is a turnkey solution for AI lifecycle management, offering streamlined deployments, layered security, and detailed monitoring capabilities — so organizations can go from zero to AI in minutes.The complete edge AI solution gives organizations the tools needed to build an end-to-end edge deployment. KION Group, the number one global supply chain solutions provider, uses NVIDIA solutions to fulfill order faster and more efficiently. To learn more about NVIDIA edge AI solutions, check out Deploying and Accelerating AI at the Edge With the NVIDIA EGX Platform.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.9	The long, cumbersome slog of data procurement has been slowing down innovation in AI, especially in computer vision, which relies on labeled images and video for training. But now you can jumpstart your machine learning process by quickly generating synthetic data using AI.Reverie.With the AI.Reverie synthetic data platform, you can create the exact training data that you need in a fraction of the time it would take to find and label the right real photography. In AI.Reverie’s photorealistic 3D environments, you can generate data for all possible scenarios, including hard to reach places, unusual environmental conditions, and rare or unique events.Training data generation includes labels. Choose the needed types, such as 2D or 3D bounding boxes, depth masks, and so on. After you test your model, you can return to the platform to quickly generate additional data to improve accuracy. Test and repeat in quick, iterative cycles.We wanted to test performance of AI.Reverie synthetic data in NVIDIA Transfer Learning Toolkit 3.0. Originally, we set out to replicate the results in the research paper RarePlanes: Synthetic Data Takes Flight, which used synthetic imagery to create object detection models. We discovered new tools in TLT that made it possible to create more lightweight models that were as accurate as, but much faster than, those featured in the original paper.In this post, we show you how we used the TLT quantized-aware training and model pruning to accomplish this, and how to replicate the results yourself. We show you how to create an airplane detector, but you should be able to fine-tune the model for various satellite detection scenarios of your own.To replicate these results, you can clone the GitHub repository and follow along with the included Jupyter notebook.Clone the following repo:Create a conda environment:Activate the model:Start Jupyter:We tested the code with Python 3.8.8, using Anaconda 4.9.2 to manage dependencies and the virtual environment. The code may work with different versions of Python and other virtual environment solutions, but we haven’t tested those configurations. We used Ubuntu 18.04.5 LTS and NVIDIA driver 460.32.03 and CUDA Version 11.2. TLT requires driver 455.xx or later.For more information about the contents of the RarePlanes dataset, see RarePlanes Public User Guide.For this tutorial, you need only download a subset of the data. The following code example is meant to be executed from within the Jupyter notebook. First, create the folders:Now use this function to download the datasets from Amazon S3, extract them, and verify:Then download the dataset:TLT uses the KITTI format for object detection model training. RarePlanes is in the COCO format, so you must run a conversion script from within the Jupyter notebook. This converts the real train/test and synthetic train/test datasets.There should now be a folder for each dataset split inside of data/kitti that contains the KITTI formatted annotation text files and symlinks to the original images.The notebook has a script to generate a ~/.tlt_mounts.json file. For more information about the various settings, see Running the launcher.You must turn the KITTI labels into the TFRecord format used by TLT. The convert_split function in the notebook helps you bulk convert all the datasets:Using your NGC account and command-line tool, you can now download the model:Using your NGC account and command-line tool, you can now download the model:The model is now located at the following path:The following command starts training and logs results to a file that you can tail:Follow along with the following command:After training is complete, you can use the functions defined in the notebook to get relevant statistics on your model:You get something like the following output:To reevaluate your trained model on your test set or other dataset, run the following:The output should look something like this:Running an experiment with synthetic dataYou can see the results for each epoch by running: !cat out_resnet18_synth_amp16.log | grep -i aircraftExample output:Now, fine-tune your best-performing synthetic-data-trained model with 10% of the real data. To do so, you must first create the 10% split.You then use this function to replace the checkpoint in your template spec with the best performing model from the synthetic-only training.You can now begin a TLT training. Start your fine-tuning with the best-performing epoch of the model trained on synthetic data alone, in the previous section.After training has completed, you should see a best epoch of between 91-93% mAP50, which gets you close to the real-only model performance with only 10% of the real data.In the notebook, there’s a command to evaluate the best performing model checkpoint on the test set:You should see something like the following output:Data enhancement is fine-tuning a model training on AI.Reverie’s synthetic data with just 10% of the original, real dataset. As you can see, this technique produces a model as accurate as one trained on real data alone. That represents roughly 90% cost savings on real, labeled data and saves you from having to endure a long hand-labeling and QA process.Having trained a well-performing model, you can now decrease the number of weights to cut down on file size and inference time. TLT includes an easy-to-use pruning tool.The one argument to play with is -pth, which sets the threshold for neurons to prune. The higher you set this, the more parameters are pruned, but after a certain point your accuracy metric may drop too low. We found that a value of 0.5 worked for these experiments, but you may find different results on other datasets.You can now evaluate the pruned model:Now you can see how many parameters remain:You should see something like the following outputs:This is 70% smaller than the original model, which had 11.2 million parameters! Of course, you’ve lost performance by dropping so many parameters, which you can verify:Luckily, you can recover almost all the performance by retraining the pruned model.As before, there is a template spec to run this experiment that only requires you to fill in the location of the pruned model:You can now retrain the pruned model:On a run of this experiment, the best performing epoch achieved 91.925 mAP50, which is about the same as the original nonpruned experiment.The final step in this process is quantizing the pruned model so that you can achieve much higher levels of inference speed with TensorRT. We have a quantization aware training (QAT) spec template available:Run the QAT training:Use the TLT export tool to export to INT8 quantized TensorRT format:At this point, you can now evaluate your quantized model using TensorRT:Looking at the output:We were impressed by these results. AI.Reverie’s synthetic data platform, with just 10% of the real dataset, enabled us to achieve the same performance as we did when training on the full real dataset. That represents a cost savings of roughly 90%, not to mention the time saved on procurement. It now takes days, not months, to generate the needed synthetic data.TLT also produced a 25.2x reduction in parameter count, a 33.6x reduction in file size, a 174.7x increase in performance (QPS), while retaining 95% of the original performance. TLT’s capabilities were particularly valuable for pruning and quantizing.Go to AI.Reverie, download the synthetic training data for your project, and start training with TLT.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The long, cumbersome slog of data procurement has been slowing down innovation in AI, especially in computer vision, which relies on labeled images and video for training. But now you can jumpstart your machine learning process by quickly generating synthetic data using AI.Reverie.With the AI.Reverie synthetic data platform, you can create the exact training data that you need in a fraction of the time it would take to find and label the right real photography. In AI.Reverie’s photorealistic 3D environments, you can generate data for all possible scenarios, including hard to reach places, unusual environmental conditions, and rare or unique events.Training data generation includes labels. Choose the needed types, such as 2D or 3D bounding boxes, depth masks, and so on. After you test your model, you can return to the platform to quickly generate additional data to improve accuracy. Test and repeat in quick, iterative cycles.We wanted to test performance of AI.Reverie synthetic data in NVIDIA Transfer Learning Toolkit 3.0. Originally, we set out to replicate the results in the research paper RarePlanes: Synthetic Data Takes Flight, which used synthetic imagery to create object detection models. We discovered new tools in TLT that made it possible to create more lightweight models that were as accurate as, but much faster than, those featured in the original paper.In this post, we show you how we used the TLT quantized-aware training and model pruning to accomplish this, and how to replicate the results yourself. We show you how to create an airplane detector, but you should be able to fine-tune the model for various satellite detection scenarios of your own.To replicate these results, you can clone the GitHub repository and follow along with the included Jupyter notebook.Clone the following repo:Create a conda environment:Activate the model:Start Jupyter:We tested the code with Python 3.8.8, using Anaconda 4.9.2 to manage dependencies and the virtual environment. The code may work with different versions of Python and other virtual environment solutions, but we haven’t tested those configurations. We used Ubuntu 18.04.5 LTS and NVIDIA driver 460.32.03 and CUDA Version 11.2. TLT requires driver 455.xx or later.For more information about the contents of the RarePlanes dataset, see RarePlanes Public User Guide.For this tutorial, you need only download a subset of the data. The following code example is meant to be executed from within the Jupyter notebook. First, create the folders:Now use this function to download the datasets from Amazon S3, extract them, and verify:Then download the dataset:TLT uses the KITTI format for object detection model training. RarePlanes is in the COCO format, so you must run a conversion script from within the Jupyter notebook. This converts the real train/test and synthetic train/test datasets.There should now be a folder for each dataset split inside of data/kitti that contains the KITTI formatted annotation text files and symlinks to the original images.The notebook has a script to generate a ~/.tlt_mounts.json file. For more information about the various settings, see Running the launcher.You must turn the KITTI labels into the TFRecord format used by TLT. The convert_split function in the notebook helps you bulk convert all the datasets:Using your NGC account and command-line tool, you can now download the model:Using your NGC account and command-line tool, you can now download the model:The model is now located at the following path:The following command starts training and logs results to a file that you can tail:Follow along with the following command:After training is complete, you can use the functions defined in the notebook to get relevant statistics on your model:You get something like the following output:To reevaluate your trained model on your test set or other dataset, run the following:The output should look something like this:Running an experiment with synthetic dataYou can see the results for each epoch by running: !cat out_resnet18_synth_amp16.log | grep -i aircraftExample output:Now, fine-tune your best-performing synthetic-data-trained model with 10% of the real data. To do so, you must first create the 10% split.You then use this function to replace the checkpoint in your template spec with the best performing model from the synthetic-only training.You can now begin a TLT training. Start your fine-tuning with the best-performing epoch of the model trained on synthetic data alone, in the previous section.After training has completed, you should see a best epoch of between 91-93% mAP50, which gets you close to the real-only model performance with only 10% of the real data.In the notebook, there’s a command to evaluate the best performing model checkpoint on the test set:You should see something like the following output:Data enhancement is fine-tuning a model training on AI.Reverie’s synthetic data with just 10% of the original, real dataset. As you can see, this technique produces a model as accurate as one trained on real data alone. That represents roughly 90% cost savings on real, labeled data and saves you from having to endure a long hand-labeling and QA process.Having trained a well-performing model, you can now decrease the number of weights to cut down on file size and inference time. TLT includes an easy-to-use pruning tool.The one argument to play with is -pth, which sets the threshold for neurons to prune. The higher you set this, the more parameters are pruned, but after a certain point your accuracy metric may drop too low. We found that a value of 0.5 worked for these experiments, but you may find different results on other datasets.You can now evaluate the pruned model:Now you can see how many parameters remain:You should see something like the following outputs:This is 70% smaller than the original model, which had 11.2 million parameters! Of course, you’ve lost performance by dropping so many parameters, which you can verify:Luckily, you can recover almost all the performance by retraining the pruned model.As before, there is a template spec to run this experiment that only requires you to fill in the location of the pruned model:You can now retrain the pruned model:On a run of this experiment, the best performing epoch achieved 91.925 mAP50, which is about the same as the original nonpruned experiment.The final step in this process is quantizing the pruned model so that you can achieve much higher levels of inference speed with TensorRT. We have a quantization aware training (QAT) spec template available:Run the QAT training:Use the TLT export tool to export to INT8 quantized TensorRT format:At this point, you can now evaluate your quantized model using TensorRT:Looking at the output:We were impressed by these results. AI.Reverie’s synthetic data platform, with just 10% of the real dataset, enabled us to achieve the same performance as we did when training on the full real dataset. That represents a cost savings of roughly 90%, not to mention the time saved on procurement. It now takes days, not months, to generate the needed synthetic data.TLT also produced a 25.2x reduction in parameter count, a 33.6x reduction in file size, a 174.7x increase in performance (QPS), while retaining 95% of the original performance. TLT’s capabilities were particularly valuable for pruning and quantizing.Go to AI.Reverie, download the synthetic training data for your project, and start training with TLT.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.10	The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.Image segmentation partitions a digital image into multiple segments by changing the representation into something more meaningful and easier to analyze. In the field of medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information. It does this by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and other formats.To achieve state-of-the-art models that deliver the desired accuracy and performance for a use case, you must set up the right environment, train with the ideal hyperparameters, and optimize it to achieve the desired accuracy. All of this can be time-consuming. Data scientists and developers need the right set of tools to quickly overcome tedious tasks. That’s why we built the NGC catalog.The NGC catalog is a hub of GPU-optimized AI and HPC applications and tools. NGC provides easy access to performance-optimized containers, shortens model development time with pretrained models, and provides industry-specific SDKs to help build complete AI solutions and speed up AI workflows. These diverse assets can be used for a variety of use cases, ranging from computer vision and speech recognition to language understanding. The potential solutions span industries such as automotive, healthcare, manufacturing, and retail.In this post, we show how you can use the Medical 3D Image Segmentation notebook to predict brain tumors in MRI images. This post is suitable for anyone who is new to AI and has a particular interest in image segmentation as it applies to medical imaging. 3D U-Net enables the seamless segmentation of 3D volumes, with high accuracy and performance. It can be adapted to solve many different segmentation problems. Figure 2 shows that 3D U-Net consists of a contractive (left) and expanding (right) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling.In deep learning, a convolutional neural network (CNN) is a subset of deep neural networks, mostly used in image recognition and image processing. CNNs use deep learning to perform both generative and descriptive tasks, often using machine vision along with recommender systems and natural language processing.Padding in CNNs refers to the number of pixels added to an image when it is processed by the kernel of a CNN. Unpadded CNNs means that no pixels are added to the image.Pooling is a downsampling approach in CNN. Max pooling is one of common pooling methods that summarize the most activated presence of a feature. Every step in the expanding path consists of a feature map upsampling and a concatenation with the correspondingly cropped feature map from the contractive path.This resource contains a Dockerfile that extends the TensorFlow NGC container and encapsulates some dependencies. The resource can be downloaded using the following commands:wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/unet3d_medical_for_tensorflow/versions/20.06.0/zip -O unet3d_medical_for_tensorflow_20.06.0.zipAside from these dependencies, you also need the following components:To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the 3D U-Net model on the Brain Tumor Segmentation 2019 dataset.Download the resource manually by clicking the three dots at the top-right corner of the resource page.You could also use the following wget command:wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/unet3d_medical_for_tensorflow/versions/20.06.0/zip -O unet3d_medical_for_tensorflow_20.06.0.zipThis command uses the Dockerfile to create a Docker image named unet3d_tf, downloading all the required components automatically.docker build -t unet3d_tf .Data can be obtained by registering on the Brain Tumor Segmentation 2019 dataset website. The data should be downloaded and placed where /data in the container is mounted.To start an interactive session in the NGC container to run preprocessing, training, and inference, you must run the following command. This launches the container and mounts the ./data directory as a volume to the /data directory inside the container, mounts the ./results directory to the /results directory in the container.The advantage of using a container is that it packages all the necessary libraries and dependencies into a single, isolated environment. This way you don’t have to worry about the complex install process.Use this command to start a Jupyter notebook inside the container:jupyter notebook --ip 0.0.0.0 --port 8888 --allow-rootMove the dataset to the /data directory inside the container. Download the notebook with the following command: wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/med_3dunet/versions/1/zip -O med_3dunet_1.zipThen, upload the downloaded notebook into JupyterLab and run the cells of the notebook to preprocess the dataset and train, benchmark, and test the model.By running the cells of this Jupyter notebook, you can first check the downloaded dataset and see the brain tumor images. After that, see the data preprocessing command and prepare the data for training. The next step is training the model and using the checkpoints of the training process for the predicting step. Finally, check the output of the predict function visually.To check the dataset, you can use nibabel, which is a package that provides read/write access to some common medical and neuroimaging file formats.By running the next three cells, you can install nibabel using pip install, choose an image from the dataset, and plot the chosen third image from the dataset using matplotlib. You can check other dataset images by changing the image address in the code.The result is something like Figure 4.The dataset/preprocess_data.py script converts the raw data into the TFRecord format used for training and evaluation. This dataset, from the 2019 BraTS challenge, contains over 3 TB multiinstitutional, routine, clinically acquired, preoperative, multimodal, MRI scans of glioblastoma (GBM/HGG) and lower-grade glioma (LGG), with the pathologically confirmed diagnosis. When available, overall survival (OS) data for the patient is also included. This data is structured in training, validation, and testing datasets.The format of images is nii.gz. NIfTI is a type of file format for neuroimaging. You can preprocess the downloaded dataset by running the following command:python dataset/preprocess_data.py -i /data/MICCAI_BraTS_2019_Data_Training -o /data/preprocessed -vThe final format of the processed images is tfrecord. To help you read data efficiently, serialize your data and store it in a set of files (~100 to 200 MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. It can also be useful for caching any data preprocessing. The TFRecord format is a simple format for storing a sequence of binary records, which speeds up the data loading process considerably.After the Docker container is launched, you can start the training of a single fold (fold 0) with the default hyperparameters (for example, {1 to 8} GPUs {TF-AMP/FP32/TF32}):Bash examples/unet3d_train_single{_TF-AMP}.sh <number/of/gpus> <path/to/dataset> <path/to/checkpoint> <batch/size>For example, to run with 32-bit precision (FP32 or TF32) with batch size 2 on one GPU, run the following command:bash examples/unet3d_train_single.sh 1 /data/preprocessed /results 2To train a single fold with mixed precision (TF-AMP) with on eight GPUs and batch size 2 per GPU, run the following command:bash examples/unet3d_train_single_TF-AMP.sh 8 /data/preprocessed /results 2The training performance can be evaluated by running benchmarking scripts:bash examples/unet3d_{train,infer}_benchmark{_TF-AMP}.sh <number/of/gpus/for/training> <path/to/dataset> <path/to/checkpoint> <batch/size>This script makes the model run and reports the performance. For example, to benchmark training with TF-AMP with batch size 2 on four GPUs, run the following command:bash examples/unet3d_train_benchmark_TF-AMP.sh 4 /data/preprocessed /results 2You can use the test dataset and predict as exec-mode to test the model. The result is saved in the model_dir directory and data_dir is the path to the dataset:python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_testIn the following code example, you plot one of the chosen results from the \results folder:For those of you looking to explore advanced features built into this notebook, you can see the full list of available options for main.py using -h or --help. By running the next cell, you can see how to change execute mode and other parameters of this script. You can perform model training, predicting, evaluating, and inferencing using customized hyperparameters using this script.python main.py --helpThe main.py parameters can be changed to perform different tasks, including training, evaluation, and prediction.You can also train the model using default hyperparameters. By running the python main.py --help command, you can see the list of arguments that you can change, including training hyperparameters. For example, in training mode, you can change the learning rate from the default 0.0002 to 0.001 and the training steps from 16000 to 1000 using the following command:python main.py --model_dir /results --exec_mode train --data_dir /data/preprocessed_test --learning_rate 0.001 --max_steps 1000You can run other execution modes available in main.py. For example, in this post, we used the prediction execution mode of python.py by running the following command:python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_testIn this post, we showed how you can get started with a medical imaging model using a simple Jupyter notebook from the NGC catalog. As you make the transition from this Jupyter notebook to building your own medical imaging workflows, consider using NVIDIA Clara Train. Clara Train includes AI-Assisted Annotation APIs and an Annotation server that can be seamlessly integrated into any medical viewer, making it AI-capable. The training framework includes decentralized learning techniques, such federated learning and transfer learning for your AI workflows.To learn how to use these resources and kickstart your AI journey, register for the upcoming webinar with live Q&A, NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	In Case You Missed It (ICYMI) is a series in which we spotlight essential talks, whitepapers, blogs, and success stories showcasing NVIDIA technologies accelerating real world solutions.In this update, we look at the ways NVIDIA TensorRT and the Triton Inference Server can help your business deploy high-performance models with resilience at scale. We start with an in-depth, step-by-step introduction to TensorRT and Triton. Next, we dig into exactly how Triton and Clara Deploy complement each other in your healthcare use cases. Finally, to round things out our whitepaper covers exactly what you’ll need to know when migrating your applications to Triton.On-Demand: Inception Café – Accelerating Deep Learning Inference with NVIDIA TensorRT and TritonA step-by-step walkthrough applying NVIDIA TensorRT and Triton in conjunction with NVIDIA Clara Deploy.Watch >Whitepaper: Inception Café – Migrating Your Medical AI App to TritonThis whitepaper explores the end-to-end process of migrating an existing medical AI application to Triton.Read >On-Demand: Introduction to TensorRT and Triton A Walkthrough of Optimizing Your First Deep Learning Inference ModelAn overview of TensorRT optimization of a PyTorch model followed by deployment of the optimized model using Triton. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.Watch >On-Demand: Clara Train 4.0 – 101 Getting StartedThis session provides a walk through the Clara Train SDK features and capabilities with a set of Jupyter Notebooks covering a range of topics, including Medical Model Archives (MMARs), AI-assisted annotation, and AutoML. These features help data scientists quickly annotate, train, and optimize hyperparameters for their deep learning model.Watch >On-Demand: Clara Train 4.0 – 201 Federated LearningThis session delivers an overview of federated learning, a distributed AI model development technique that allows models to be created without transferring data outside of hospitals or imaging centers. The session will finish with a walkthrough of Clara Train federated learning capabilities by going through a set of Jupyter Notebooks.Watch >On-Demand: Medical Imaging AI with MONAI BootcampMONAI is a freely available, community-supported, open-source PyTorch-based framework for deep learning in medical imaging. It provides domain-optimized foundational capabilities for developing medical imaging training workflows in a native PyTorch paradigm. This MONAI Bootcamp offers medical imaging researchers an architectural deep dive of MONAI and finishes with a walkthrough of MONAI’s capabilities through a set of four Jupyter Notebooks.Watch >On-Demand: Clara Guardian 101: A Hello World Walkthrough on the Jetson PlatformNVIDIA Clara Guardian provides healthcare-specific pretrained models and sample applications that can significantly reduce the time-to-solution for developers building smart-hospital applications. It targets three categories—public safety (thermal screening, mask detection, and social distancing monitoring), patient care (patient monitoring, fall detection, and patient engagement), and operational efficiency (operating room workflow automation, surgery analytics, and contactless control). In this session, attendees will get a walkthrough of how to use Clara Guardian on the Jetson NX platform, including how to use the pretrained models for tasks like automatic speech recognition and body pose estimation.Watch >On-Demand: GPU-Accelerated Genomics Using Clara Parabricks, Gary BurnettNVIDIA Clara Parabricks is a software suite for performing secondary analysis of next generation sequencing (NGS) DNA and RNA data. A major benefit of Parabricks is that it is designed to deliver results at blazing fast speeds and low cost. Parabricks can analyze whole human genomes in under 30 minutes, compared to about 30 hours for 30x WGS data. In this session, attendees will take a guided tour of the Parabricks suite featuring live examples and real world applications.Watch >On-Demand: Using Ethernet to Stream High-Throughput, Low-Latency Medical Sensor DataMedical sensors in various medical devices generate high-throughput data. System designers are challenged to move the sensor data to the GPU for processing. Over the last decade, Ethernet speeds have increased from 10G to 100G, enabling new ways to meet this challenge. We’ll explore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient — NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices. Finally, a step-by-step demo will walk attendees through installing the software, initializing a link, and testing for throughput and CPU overhead.Watch >NEW on NGC: Simplify and Unify Biomedical Analytics with VyasaLearn how Vyasa Analytics leverages Clara Discovery, Triton Inference Server, RAPIDS, and DGX to develop solutions for pharmaceutical and biotechnology companies. Vyasa Analytics solutions are available from the NVIDIA NGC catalog for rapid evaluation and deployment.Read >Do you have a startup? Join NVIDIA Inception’s global network of over 8,000 startups.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.11	Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management. It’s an extremely popular tool, and can be used for automated rollouts and rollbacks, horizontal scaling, storage orchestration, and more. For many organizations, Kubernetes is a key component to their infrastructure. A critical step to installing and scaling Kubernetes is ensuring that it is properly utilizing the other components of the infrastructure. NVIDIA Operators streamline installing and managing GPUs and NICs on Kubernetes to make the software stack ready to run the most resource-demanding workloads, such as AI, ML, DL, and HPC, in the cloud, data center, and at the edge. NVIDIA Operators consist of the GPU Operator and the Network Operator, and are open source and based on the Operator Framework. NVIDIA GPU OperatorThe NVIDIA GPU Operator is packaged as a Helm Chart and installs and manages the lifecycle of software components so that the GPU-accelerated applications can be run on Kubernetes. The components are the GPU feature discovery, the NVIDIA Driver, the Kubernetes Device Plugin, the NVIDIA Container Toolkit, and DCGM Monitoring. The GPU Operator enables infrastructure teams to manage the lifecycle of GPUs when used with Kubernetes at the Cluster level, therefore eliminating the need to manage each node individually. Previously infrastructure teams had to manage two operating system images, one for GPU nodes and one CPU nodes. When using the GPU Operator, infrastructure teams can use the CPU image with GPU worker nodes as well.NVIDIA Network OperatorThe Network Operator is responsible for automating the deployment and management of the host networking components in a Kubernetes cluster. It includes the Kubernetes Device Plugin, NVIDIA Driver, NVIDIA Peer Memory Driver, and the Multus, macvlan CNIs. These components were previously installed manually, but are automated through the Network Operator, streamlining the deployment process and enabling accelerated computing with enhanced customer experience.Used independently or together, NVIDIA Operators simplify GPU and SmartNIC configurations on Kubernetes and are compatible with partner cloud platforms. To learn more about these components and how the NVIDIA Operators solve the key challenges to running AI, ML, DL, and HPC workloads and simplify initial setup and Day 2 operations, check out the on-demand webinar “Accelerating Kubernetes with NVIDIA Operators“.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management. It’s an extremely popular tool, and can be used for automated rollouts and rollbacks, horizontal scaling, storage orchestration, and more. For many organizations, Kubernetes is a key component to their infrastructure. A critical step to installing and scaling Kubernetes is ensuring that it is properly utilizing the other components of the infrastructure. NVIDIA Operators streamline installing and managing GPUs and NICs on Kubernetes to make the software stack ready to run the most resource-demanding workloads, such as AI, ML, DL, and HPC, in the cloud, data center, and at the edge. NVIDIA Operators consist of the GPU Operator and the Network Operator, and are open source and based on the Operator Framework. NVIDIA GPU OperatorThe NVIDIA GPU Operator is packaged as a Helm Chart and installs and manages the lifecycle of software components so that the GPU-accelerated applications can be run on Kubernetes. The components are the GPU feature discovery, the NVIDIA Driver, the Kubernetes Device Plugin, the NVIDIA Container Toolkit, and DCGM Monitoring. The GPU Operator enables infrastructure teams to manage the lifecycle of GPUs when used with Kubernetes at the Cluster level, therefore eliminating the need to manage each node individually. Previously infrastructure teams had to manage two operating system images, one for GPU nodes and one CPU nodes. When using the GPU Operator, infrastructure teams can use the CPU image with GPU worker nodes as well.NVIDIA Network OperatorThe Network Operator is responsible for automating the deployment and management of the host networking components in a Kubernetes cluster. It includes the Kubernetes Device Plugin, NVIDIA Driver, NVIDIA Peer Memory Driver, and the Multus, macvlan CNIs. These components were previously installed manually, but are automated through the Network Operator, streamlining the deployment process and enabling accelerated computing with enhanced customer experience.Used independently or together, NVIDIA Operators simplify GPU and SmartNIC configurations on Kubernetes and are compatible with partner cloud platforms. To learn more about these components and how the NVIDIA Operators solve the key challenges to running AI, ML, DL, and HPC workloads and simplify initial setup and Day 2 operations, check out the on-demand webinar “Accelerating Kubernetes with NVIDIA Operators“.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.12	Researchers from University of Washington and Facebook used deep learning to convert still images into realistic animated looping videos. Their approach, which will be presented at the upcoming Conference on Computer Vision and Pattern Recognition (CVPR), imitates continuous fluid motion — such as flowing water, smoke and clouds — to turn still images into short videos that loop seamlessly. “What’s special about our method is that it doesn’t require any user input or extra information,” said Aleksander Hołyński, University of Washington doctoral student in computer science and engineering and lead author on the project. “All you need is a picture. And it produces as output a high-resolution, seamlessly looping video that quite often looks like a real video.”The team created a method known as “symmetric splatting”  to predict the past and future motion from a still image, combining that data to create a seamless animation. “When we see a waterfall, we know how the water should behave. The same is true for fire or smoke. These types of motions obey the same set of physical laws, and there are usually cues in the image that tell us how things should be moving,” Hołyński said. “We’d love to extend our work to operate on a wider range of objects, like animating a person’s hair blowing in the wind. I’m hoping that eventually the pictures that we share with our friends and family won’t be static images. Instead, they’ll all be dynamic animations like the ones our method produces.”To teach their neural network to estimate motion, the team trained the model on more than 1,000 videos of fluid motion such as waterfalls, rivers and oceans. Given only the first frame of the video, the system would predict what should happen in future frames, and compare its prediction with the original video. This comparison helped the model improve its predictions of whether and how each pixel in an image should move. The researchers used the NVIDIA Pix2PixHD GAN model for motion estimation network training, as well as FlowNet2 and PWC-Net. NVIDIA GPUs were used for both training and inference of the model. The training data included 1196 unique videos, 1096 for training, 50 for validation and 50 for testing.Read the University of Washington news release for more >>The researchers’ paper is available here.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	With up to 93% accuracy, and a median rate of 75%, the model decoded the participants word’s at a rate of up to 18 per minute.  “We want to get to 1,000 words, and eventually all words. This is just the starting point,” Chang said. The study builds off previous work by Chang and his colleagues, which developed a deep learning method for decoding and converting brain signals. Unlike the current work, participants in the previous study were able to speak.  Read more >>>Read the full article in The New England Journal of Medicine >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.13	Here are the latest resources and news for healthcare developers from GTC 21, including demos and specialized sessions for building AI in drug discovery, medical imaging, genomics, and smart hospitals. Learn about new features now available in NVIDIA Clara Train 4.0, an application framework for medical imaging that includes pre-trained models, AI-assisted annotation, AutoML, and federated learning.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here.On-Demand SessionsAccelerating Drug Discovery with Advanced Computational ModelingSpeaker: Robert Abel, Executive Vice President, Chief Computational Scientist, SchrödingerLearn about how integrated deployment and collaborative use of advanced computational modeling and next-generation machine learning can accelerate drug discovery from Robert Abel, Executive Vice President, Chief Computational Scientist at Schrödinger.Using Ethernet to Stream Medical Sensor DataSpeaker: Mathias Blake, Platform Architect for Medical Devices, NVIDIAExplore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient—NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices.Automate 3D Medical Imaging Segmentation with AutoML and Neural Architecture SearchSpeaker: Dong Yang, Applied Research Scientist, NVIDIARecently, neural architecture search (NAS) has been applied to automatically search high-performance networks for medical image segmentation. Hear from NVIDIA Applied Research Scientist, Dong Yang, to learn about AutoML and NAS techniques in the Clara Train SDK.Deep Learning and Accelerated Computing for Single-Cell Genomic DataSpeaker: Avantika Lal, Sr. Scientist in Deep Learning and Genomics, NVIDIALearn about accelerating discovery of cell types in the human body with RAPIDS and AtacWorks, a deep learning toolkit to enhance ATAC-seq data and identify active regulatory DNA more accurately than existing state-of-the-art methods.BlogCreating Medical Imaging Models with Clara Train 4.0Learn about the upcoming release of NVIDIA Clara Train 4.0, including infrastructure upgrades based on MONAI, expansion into digital pathology, and updates to DeepGrow for annotating organs effectively in 3D images.DemosAccelerating Drug Discovery with Clara Discovery’s MegaMolBartSee how NVIDIA Clara Discovery’s MegaMolBart, a transformer-based NLP model developed with AstraZeneca, trained on millions of molecules, can accelerate the drug discovery process.NVIDIA Triton Inference Server: Generative Chemical StructuresWatch NVIDIA Triton Inference Server power deep learning models to propose thousands of molecules per second for drug design that can be further refined with physics-based simulations.Visit NVIDIA On-Demand to explore the extensive catalog of sessions, podcasts, demos, research posters and more.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Here are the latest resources and news for healthcare developers from GTC 21, including demos and specialized sessions for building AI in drug discovery, medical imaging, genomics, and smart hospitals. Learn about new features now available in NVIDIA Clara Train 4.0, an application framework for medical imaging that includes pre-trained models, AI-assisted annotation, AutoML, and federated learning.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here.On-Demand SessionsAccelerating Drug Discovery with Advanced Computational ModelingSpeaker: Robert Abel, Executive Vice President, Chief Computational Scientist, SchrödingerLearn about how integrated deployment and collaborative use of advanced computational modeling and next-generation machine learning can accelerate drug discovery from Robert Abel, Executive Vice President, Chief Computational Scientist at Schrödinger.Using Ethernet to Stream Medical Sensor DataSpeaker: Mathias Blake, Platform Architect for Medical Devices, NVIDIAExplore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient—NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices.Automate 3D Medical Imaging Segmentation with AutoML and Neural Architecture SearchSpeaker: Dong Yang, Applied Research Scientist, NVIDIARecently, neural architecture search (NAS) has been applied to automatically search high-performance networks for medical image segmentation. Hear from NVIDIA Applied Research Scientist, Dong Yang, to learn about AutoML and NAS techniques in the Clara Train SDK.Deep Learning and Accelerated Computing for Single-Cell Genomic DataSpeaker: Avantika Lal, Sr. Scientist in Deep Learning and Genomics, NVIDIALearn about accelerating discovery of cell types in the human body with RAPIDS and AtacWorks, a deep learning toolkit to enhance ATAC-seq data and identify active regulatory DNA more accurately than existing state-of-the-art methods.BlogCreating Medical Imaging Models with Clara Train 4.0Learn about the upcoming release of NVIDIA Clara Train 4.0, including infrastructure upgrades based on MONAI, expansion into digital pathology, and updates to DeepGrow for annotating organs effectively in 3D images.DemosAccelerating Drug Discovery with Clara Discovery’s MegaMolBartSee how NVIDIA Clara Discovery’s MegaMolBart, a transformer-based NLP model developed with AstraZeneca, trained on millions of molecules, can accelerate the drug discovery process.NVIDIA Triton Inference Server: Generative Chemical StructuresWatch NVIDIA Triton Inference Server power deep learning models to propose thousands of molecules per second for drug design that can be further refined with physics-based simulations.Visit NVIDIA On-Demand to explore the extensive catalog of sessions, podcasts, demos, research posters and more.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.14	The retail supply chain is complex and includes everything from creating a product, distributing it, putting it on shelves in stores, to getting it into customer hands. Retailers and Consumer Packaged Goods (CPG) companies must look at the entire supply chain for critical gaps and problems that can be solved with technology and automation. Computer vision has been implemented by many of these companies for years, with cameras distributed in their stores, warehouses, and on assembly lines. This is where edge computing comes in, AI applications can be run in remote locations that allow companies to turn these cameras from sources of information to sources of intelligence. With AI, these cameras can turn from sources of information to sources of intelligence. Whether providing in-store analytics to help evaluate traffic patterns and optimize product placement, to improving packaging detection and analysis, and overall health and safety within warehouses.The challenge with computer vision applications in the retail space is the heavy data requirement that is needed to ensure AI models are accurate and safe. Once trained, these models then need to be deployed to many locations at the edge, often without the IT resources onsite. Kinetic Vision has partnered with NVIDIA to develop a new solution to this problem that allows retailers and CPG companies to generate accurate models, and scale them out at the edge.Solving the challenge of data is key to enabling the training of AI models using NVIDIA tools like the DeepStream SDK and Transfer Learning Toolkit (TLT). With a Synthetic Data Generator, Kinetic Vision not only produces data volume, but also with the required variances to ensure the model will perform in any environment. Numerous angles, lighting, backgrounds, and product types can be generated quickly and easily using different methods including GANs, simulated sensor data (LIDAR, RADAR, IMU), photorealistic 3D environment, synthetic x-rays, and physics simulations. The synthetic data is then used to train a model that can be tested in a digital twin, a virtual representation of the warehouse, supply line, store, or whatever environment the model will be deployed. Using the synthetic data and the digital twin, Kinetic Vision can train, simulate, and re-train the model to achieve the required level of accuracy. Once the AI model has achieved the desired level of performance, it must be tested in the real world. This is where NVIDIA Fleet Command comes in. Fleet Command is a hybrid-cloud platform for deploying and managing AI models at the edge. The pre-trained model is simply loaded into the NGC catalog and then deployed on the edge system using the Fleet Command UI in just a few clicks. Once deployed at the edge, the model can continue to be optimized with real world data sent back from the store or warehouse. These updates are once again easily deployed and managed using Fleet Command. The advantages of this new approach to creating retail computer vision applications include both ROI and technological benefits. The cost of developing an AI model with a digital twin is easily 10 percent the time and cost required to do the same thing in a physical environment. With the digital twin, testing can be done without physical infrastructure or requiring production interruptions. Additionally, new products and product variations can be easily accommodated without requiring inventory photography that must be manually annotated. Finally, the digital twin results in a generalized and scalable model that still provides the accuracy required for production deployment.To learn more about how to use synthetic data and Fleet Command to deploy highly accurate and scalable models, check out the GTC session “Novel Approach to Deploy Highly Accurate AI Retail Computer Vision Applications at the Edge“.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The retail supply chain is complex and includes everything from creating a product, distributing it, putting it on shelves in stores, to getting it into customer hands. Retailers and Consumer Packaged Goods (CPG) companies must look at the entire supply chain for critical gaps and problems that can be solved with technology and automation. Computer vision has been implemented by many of these companies for years, with cameras distributed in their stores, warehouses, and on assembly lines. This is where edge computing comes in, AI applications can be run in remote locations that allow companies to turn these cameras from sources of information to sources of intelligence. With AI, these cameras can turn from sources of information to sources of intelligence. Whether providing in-store analytics to help evaluate traffic patterns and optimize product placement, to improving packaging detection and analysis, and overall health and safety within warehouses.The challenge with computer vision applications in the retail space is the heavy data requirement that is needed to ensure AI models are accurate and safe. Once trained, these models then need to be deployed to many locations at the edge, often without the IT resources onsite. Kinetic Vision has partnered with NVIDIA to develop a new solution to this problem that allows retailers and CPG companies to generate accurate models, and scale them out at the edge.Solving the challenge of data is key to enabling the training of AI models using NVIDIA tools like the DeepStream SDK and Transfer Learning Toolkit (TLT). With a Synthetic Data Generator, Kinetic Vision not only produces data volume, but also with the required variances to ensure the model will perform in any environment. Numerous angles, lighting, backgrounds, and product types can be generated quickly and easily using different methods including GANs, simulated sensor data (LIDAR, RADAR, IMU), photorealistic 3D environment, synthetic x-rays, and physics simulations. The synthetic data is then used to train a model that can be tested in a digital twin, a virtual representation of the warehouse, supply line, store, or whatever environment the model will be deployed. Using the synthetic data and the digital twin, Kinetic Vision can train, simulate, and re-train the model to achieve the required level of accuracy. Once the AI model has achieved the desired level of performance, it must be tested in the real world. This is where NVIDIA Fleet Command comes in. Fleet Command is a hybrid-cloud platform for deploying and managing AI models at the edge. The pre-trained model is simply loaded into the NGC catalog and then deployed on the edge system using the Fleet Command UI in just a few clicks. Once deployed at the edge, the model can continue to be optimized with real world data sent back from the store or warehouse. These updates are once again easily deployed and managed using Fleet Command. The advantages of this new approach to creating retail computer vision applications include both ROI and technological benefits. The cost of developing an AI model with a digital twin is easily 10 percent the time and cost required to do the same thing in a physical environment. With the digital twin, testing can be done without physical infrastructure or requiring production interruptions. Additionally, new products and product variations can be easily accommodated without requiring inventory photography that must be manually annotated. Finally, the digital twin results in a generalized and scalable model that still provides the accuracy required for production deployment.To learn more about how to use synthetic data and Fleet Command to deploy highly accurate and scalable models, check out the GTC session “Novel Approach to Deploy Highly Accurate AI Retail Computer Vision Applications at the Edge“.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.15	In Case You Missed It (ICYMI) is a series in which we spotlight essential talks, whitepapers, blogs, and success stories showcasing NVIDIA technologies accelerating real world solutions.In this update, we look at the ways NVIDIA TensorRT and the Triton Inference Server can help your business deploy high-performance models with resilience at scale. We start with an in-depth, step-by-step introduction to TensorRT and Triton. Next, we dig into exactly how Triton and Clara Deploy complement each other in your healthcare use cases. Finally, to round things out our whitepaper covers exactly what you’ll need to know when migrating your applications to Triton.On-Demand: Inception Café – Accelerating Deep Learning Inference with NVIDIA TensorRT and TritonA step-by-step walkthrough applying NVIDIA TensorRT and Triton in conjunction with NVIDIA Clara Deploy.Watch >Whitepaper: Inception Café – Migrating Your Medical AI App to TritonThis whitepaper explores the end-to-end process of migrating an existing medical AI application to Triton.Read >On-Demand: Introduction to TensorRT and Triton A Walkthrough of Optimizing Your First Deep Learning Inference ModelAn overview of TensorRT optimization of a PyTorch model followed by deployment of the optimized model using Triton. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.Watch >On-Demand: Clara Train 4.0 – 101 Getting StartedThis session provides a walk through the Clara Train SDK features and capabilities with a set of Jupyter Notebooks covering a range of topics, including Medical Model Archives (MMARs), AI-assisted annotation, and AutoML. These features help data scientists quickly annotate, train, and optimize hyperparameters for their deep learning model.Watch >On-Demand: Clara Train 4.0 – 201 Federated LearningThis session delivers an overview of federated learning, a distributed AI model development technique that allows models to be created without transferring data outside of hospitals or imaging centers. The session will finish with a walkthrough of Clara Train federated learning capabilities by going through a set of Jupyter Notebooks.Watch >On-Demand: Medical Imaging AI with MONAI BootcampMONAI is a freely available, community-supported, open-source PyTorch-based framework for deep learning in medical imaging. It provides domain-optimized foundational capabilities for developing medical imaging training workflows in a native PyTorch paradigm. This MONAI Bootcamp offers medical imaging researchers an architectural deep dive of MONAI and finishes with a walkthrough of MONAI’s capabilities through a set of four Jupyter Notebooks.Watch >On-Demand: Clara Guardian 101: A Hello World Walkthrough on the Jetson PlatformNVIDIA Clara Guardian provides healthcare-specific pretrained models and sample applications that can significantly reduce the time-to-solution for developers building smart-hospital applications. It targets three categories—public safety (thermal screening, mask detection, and social distancing monitoring), patient care (patient monitoring, fall detection, and patient engagement), and operational efficiency (operating room workflow automation, surgery analytics, and contactless control). In this session, attendees will get a walkthrough of how to use Clara Guardian on the Jetson NX platform, including how to use the pretrained models for tasks like automatic speech recognition and body pose estimation.Watch >On-Demand: GPU-Accelerated Genomics Using Clara Parabricks, Gary BurnettNVIDIA Clara Parabricks is a software suite for performing secondary analysis of next generation sequencing (NGS) DNA and RNA data. A major benefit of Parabricks is that it is designed to deliver results at blazing fast speeds and low cost. Parabricks can analyze whole human genomes in under 30 minutes, compared to about 30 hours for 30x WGS data. In this session, attendees will take a guided tour of the Parabricks suite featuring live examples and real world applications.Watch >On-Demand: Using Ethernet to Stream High-Throughput, Low-Latency Medical Sensor DataMedical sensors in various medical devices generate high-throughput data. System designers are challenged to move the sensor data to the GPU for processing. Over the last decade, Ethernet speeds have increased from 10G to 100G, enabling new ways to meet this challenge. We’ll explore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient — NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices. Finally, a step-by-step demo will walk attendees through installing the software, initializing a link, and testing for throughput and CPU overhead.Watch >NEW on NGC: Simplify and Unify Biomedical Analytics with VyasaLearn how Vyasa Analytics leverages Clara Discovery, Triton Inference Server, RAPIDS, and DGX to develop solutions for pharmaceutical and biotechnology companies. Vyasa Analytics solutions are available from the NVIDIA NGC catalog for rapid evaluation and deployment.Read >Do you have a startup? Join NVIDIA Inception’s global network of over 8,000 startups.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Targeting areas populated with disease-carrying mosquitoes just got easier thanks to a new study. The research, recently published in IEEE Explore, uses deep learning to recognize tiger mosquitoes with near perfect accuracy from images taken by citizen scientists .“Identifying the mosquitoes is fundamental, as the diseases they transmit continue to be a major public health issue,” said lead author Gereziher Adhane.The study, from researchers at the Scene UNderstanding & Artificial Intelligence (SUnAI) research group, of the Universitat Oberta de Catalunya’s (UOC) Faculty of Computer Science, Multimedia and Telecommunications and of the eHealth Center, uses images from the Mosquito Alert app. Developed in Spain and currently expanding globally, the platform brings together citizens, entomologists, public health authorities, and mosquito-control services to reduce mosquito-borne diseases. Anyone in the world can upload geo-tagged images of mosquitoes to the app. Three expert entomologists inspect and validate the submitted images before they are added to the database, classified, and mapped. Travel and migration, along with climate change and urbanization, has broadened the range and habitat of mosquitoes. The quick identification of species such as the tiger—known to transmit dengue, Zika, chikungunya, and yellow fever—remains a key step in assisting relevant authorities to curb their spread. “This type of analysis depends largely on human expertise and requires the collaboration of professionals, is typically time-consuming, and is not cost-effective because of the possible rapid propagation of invasive species,” said Adhane. “This is where neural networks can play a role as a practical solution for controlling the spread of mosquitoes.”The research team developed a deep convolutional neural network that distinguishes between mosquito species. Starting with a pretrained model, they fine-tuned it using the hand-labeled Mosquito Alert dataset. Using NVIDIA GPUs and the cuDNN-accelerated PyTorch deep learning framework, the classification models were taught to pinpoint tiger mosquitoes based on identifiable morphological features such as white stripes on the legs, abdominal patches, head, and thorax shape.  Deep learning models typically rely on millions of samples. However, using only 6,378 images of both tiger and non-tiger mosquitoes from Mosquito Alert, the researchers were able to train the model with about 94% accuracy. “The neural network we have developed can perform as well or nearly as well as a human expert and the algorithm is sufficiently powerful to process massive amounts of images,” said Adhane.According to the researchers, as Mosquito Alert scales up, the study can be expanded to classify multiple species of mosquitoes and their breeding sites across the globe. “The model we have developed could be used in practical applications with small modifications to work with mobile apps. Using this trained network, it is possible to make predictions about images of mosquitoes taken using smartphones efficiently and in real time,” Adhane said. The GPU used in the research was a donation provided by the NVIDIA Academic Hardware Grant Program.Read the full article in IEEE Explore >>Read more >>  Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.16	The NVIDIA Deep Learning Institute (DLI) extended its popular public workshop summer series through September 2021. These workshops are conducted live in a virtual classroom environment with expert guidance from NVIDIA-certified instructors. Participants have access to fully configured GPU-accelerated servers in the cloud to perform hands-on exercises. To register, visit our website. Space is limited so we encourage you to sign up early.Here is our current public workshop schedule:AugustFundamentals of Deep LearningWed, August 25, 9:00 a.m. to 5:00 p.m. PDT (NALA)Fundamentals of Accelerated Computing with CUDA PythonThu, August 26, 9:00 a.m. to 5:00 p.m. CEST (EMEA)SeptemberFundamentals of Deep LearningThu, September 16, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Deep Learning for Industrial InspectionTue, September 21, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Tue, September 21, 9:00 a.m. to 5:00 p.m. PDT (NALA)Applications of AI for Anomaly DetectionWed, September 22, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Wed, September 22, 9:00 a.m. to 5:00 p.m. PDT (NALA)Building Transformer-Based Natural Language Processing Applications Thu, September 23, 9:00 a.m. to 5:00 p.m. PDT (NALA)Visit the DLI website for details on each course and the full schedule of upcoming instructor-led workshops, which is regularly updated with new training opportunities.For more information, e-mail nvdli@nvidia.com.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NVIDIA Deep Learning Institute (DLI) extended its popular public workshop summer series through September 2021. These workshops are conducted live in a virtual classroom environment with expert guidance from NVIDIA-certified instructors. Participants have access to fully configured GPU-accelerated servers in the cloud to perform hands-on exercises. To register, visit our website. Space is limited so we encourage you to sign up early.Here is our current public workshop schedule:AugustFundamentals of Deep LearningWed, August 25, 9:00 a.m. to 5:00 p.m. PDT (NALA)Fundamentals of Accelerated Computing with CUDA PythonThu, August 26, 9:00 a.m. to 5:00 p.m. CEST (EMEA)SeptemberFundamentals of Deep LearningThu, September 16, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Deep Learning for Industrial InspectionTue, September 21, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Tue, September 21, 9:00 a.m. to 5:00 p.m. PDT (NALA)Applications of AI for Anomaly DetectionWed, September 22, 9:00 a.m. to 5:00 p.m. CEST (EMEA)Wed, September 22, 9:00 a.m. to 5:00 p.m. PDT (NALA)Building Transformer-Based Natural Language Processing Applications Thu, September 23, 9:00 a.m. to 5:00 p.m. PDT (NALA)Visit the DLI website for details on each course and the full schedule of upcoming instructor-led workshops, which is regularly updated with new training opportunities.For more information, e-mail nvdli@nvidia.com.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.17	NVIDIA Clara AGX SDK 3.0 is available today! The Clara AGX SDK runs on the NVIDIA Jetson and Clara AGX platform and provides developers with capabilities to build end-to-end streaming workflows for medical imaging. The focus of this release is to provide added support for NGC containers, including TensorFlow and PyTorch frameworks, a new ultrasound application, and updated Transfer Learning Toolkit scripts.  There is now support for the leading deep learning framework containers, including TensorFlow 1, TensorFlow 2, and PyTorch, as well as the Triton Inference Server. These containers can help you quickly get started using the Clara AGX Development Kit, NVIDIA’s GPU super-charged development platform for AI medical devices and edge-based inferencing. We’ve also released three new application containers along with the SDK, available on NGC. These application containers include: Clara AGX SDK has also been updated to the latest Transfer Learning Toolkit (TLT) 3.0 release. Developers can now use TLT 3.0 out-of-the-box and includes compatibility with DeepStream SDK for real-time, low latency, high-resolution image AI deployments.  Download Clara AGX SDK 3.0 through the Clara AGX Developer Site. An NVIDIA Developer Program account is needed to access the SDK. You can also find all of our containers through NGC. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Robotics researchers from NVIDIA and University of Southern California presented their work at the 2021 Robotics: Science and Systems (RSS) conference called DiSECt, the first differentiable simulator for robotic cutting. The simulator accurately predicts the forces acting on a knife as it presses and slices through natural soft materials, such as fruits and vegetables.Robots that are intelligent, adaptive and generalize cutting behavior with either a kitchen butter knife, or a surgical resection remains a difficult problem for researchers.As it turns out, the process of cutting with feedback requires adaptation to stiffness of the objects, applied force during the cut, and often a sawing motion to cut through. To achieve this, researchers use a family of techniques which leverage feedback to guide the controller adaptation. However, fluid controller adaptation requires very careful parameter tuning for each instance of the same problem. While these techniques are successful in industrial settings, no two cucumbers (or tomatoes) are the same, hence rendering these family of algorithms ineffective in a more generic setting.In contrast, recent focus in research has been in building differentiable algorithms for control problems, which in simpler terms means that the sensitivity of output with respect to input can be evaluated without excessive sampling. Efficient solutions for control problems are achievable when the simulated dynamics is differentiable [1,2,3], but the process of simulating cutting has not been differentiable so far!Differentiable simulation for cutting poses a challenge, since naturally cutting is a discontinuous process where crack formation and fracture propagation occur that prohibit the calculation of gradients. We tackle this problem by proposing a novel way of simulating cutting that represents the process of crack propagation and damage mechanics in a continuous manner.DiSECt implements the commonly used Finite Element Method (FEM) to simulate deformable materials, such as foodstuffs. The object to be cut is represented by a 3D mesh which consists of tetrahedral elements. Along the cutting surface we slice the mesh following the Virtual Node Algorithm [4]. This algorithm duplicates the mesh elements that intersect the cutting surface, and adds additional, so-called “virtual” vertices on the edges where these elements are cut. The virtual nodes add extra degrees of freedom to accurately simulate the contact dynamics of the knife when it presses and slices through the mesh.Next, DiSECt inserts springs connecting the virtual nodes on either side of the cutting surface. These cutting springs allow us to simulate damage mechanics and crack propagation in a continuous manner, by weakening them in proportion to the contact force the knife exerts on the mesh. This continuous treatment allows us to differentiate through the dynamics in order to compute gradients for the parameters defining the properties of the material or the trajectory of the knife. For example, given the gradients for the vertical and sideways velocity of the knife, we can efficiently determine an energy-minimizing yet fast cutting motion through gradient-based optimization.We leverage reverse-mode automatic differentiation to efficiently compute gradients for hundreds of simulation parameters. Our simulator uses source code transformation which automatically generates efficient CUDA kernels for the forward and backward passes of all our simulation routines, such as the FEM or contact model. Such an approach allows us to implement complex simulation routines which are parallelized on the GPU, while the gradients of the inputs with respect to the outputs of such routines are automatically derived from analyzing the abstract syntax tree (AST) of the simulation code. Through gradient-based optimization algorithms, we can automatically tune the simulation parameters to achieve a close match between the simulator and real-world measurements. In one of our experiments, we leverage an existing dataset [5] of knife force profiles measured by a real-world robot while cutting various foodstuffs. We set up our simulator with the corresponding mesh and its material properties, and optimize the remaining parameters to reduce the discrepancy between the simulated and the real knife force profile. Within 150 gradient evaluations, our simulator closely predicts the knife force profile, as we demonstrate on the examples of cutting an actual apple and a potato. As shown in the figure, the initial parameter guess yielded a force profile that was far from the real observation, and our approach automatically found an accurate fit. We present further results in our accompanying paper that demonstrate that the found parameters generalize to different conditions, such as the downward velocity of the knife or the length of the reference trajectory window.We also generate additional data using a highly established commercial simulator, which allows us to precisely control the experimental setup, such as object shape and material properties. Given such data, we can also leverage the motion of the mesh vertices as an additional ground-truth signal. After optimizing the simulation parameters, DiSECt is able to predict the vertex positions and velocities, as well as the force profile, much more accurately.Aside from parameter inference, the gradients of our differentiable cutting simulator can also be used to optimize the cutting motion of the knife. In our full cutting simulation, we represent a trajectory by keyframes, where each frame prescribes the downward velocity, as well as the frequency and amplitude of a sinusoidal sideways velocity. At the start of the optimization, the initial motion is a straight downward-pressing motion.We optimize this trajectory with the objective to minimize the mean force on the knife and penalize the time it takes to cut the object. Studies have shown that humans perform sawing motions when cutting biomaterials in order to reduce the required force. Such behavior emerges from our optimization as well.After 50 iterations with the Adam optimizer, we see a reduction in average knife force by 15 percent. However, the knife slices sideways further than its blade length. Therefore, we add a hard constraint to keep the lateral motion within valid limits and perform constrained optimization. Thanks to the end-to-end-differentiability of DiSECt, accurate gradients for such constraints are available, and lead to a valid knife motion which requires only 0.3 percent more force than the unconstrained result.Cutting food items multiple times results in slightly different force profiles for each instance, depending on the geometry of such materials. We additionally present results to transfer simulation parameters between different meshes corresponding to the same material. Our approach leverages optimal transport to find correspondences between simulation parameters of a source mesh and a target mesh (e.g., local stiffnesses) based on the location of the virtual nodes. As shown in the following figure, the 2D positions of these nodes along the cutting surface allow us to map simulation parameters (shown here is the softness of the cutting springs) to topologically different target geometries.In our ongoing research, we are bringing our differentiable simulation approach to real-world robotic cutting. We investigate a closed-loop control system where the simulator is updated online from force measurements, while the robot is cutting foodstuffs. Through model-predictive planning and optimal control, we aim to find time- and energy-efficient cutting actions that apply to the physical system.We thank Yan-Bin Jia and Prajjwal Jamdagni for kindly providing us a dataset of real-world cutting trajectories which we used throughout our experiments. Be sure to also check out their research on robotic cutting!DiSECt is a finalist at 2021 RSS for the best (student) paper. Visit the team’s project webpage to learn more.DiSECt Research Paper: DiSECt: A Differentiable Simulation Engine for Autonomous Robotic CuttingEric Heiden, Miles Macklin, Yashraj S Narang, Dieter Fox, Animesh Garg, Fabio RamosRobotics Science and Systems (RSS) 2021.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.18	The first post in this series covered how to train a 2D pose estimation model using an open-source COCO dataset with the BodyPoseNet app in the NVIDIA Transfer Learning Toolkit.In this post, you learn how to optimize the pose estimation model in the NVIDIA Transfer Learning Toolkit. It walks you through the steps of model pruning and INT8 quantization to optimize the model for inference.This section covers few topics of model optimization and export:BodyPoseNet supports model pruning to remove unnecessary connections, reducing the number of parameters by an order of magnitude. This results in an optimized model architecture.To prune the model, use the following command:Usually, you just have to adjust -pth (threshold) for accuracy and model size trade off. For some internal studies, we’ve noticed that a pth value between the range [0.05, 3.0] is a good starting point for BodyPoseNet models.After the model has been pruned, there might be a slight decrease in accuracy because some previously useful weights may have been removed. To regain the accuracy, we recommend retraining this pruned model over the same dataset. You can follow the same instructions as in the Train experiment configuration file section. The main change is now to specify pretrained_weights as the path to pruned model and enable load_graph. Because the model is being initialized with pruned model weights, the model converges faster.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the pruned model. After retraining the pruned model with pth 0.05, you can observe an accuracy of 56.1% AP with multiscale inference. Here are the metrics on COCO validation set:Inference throughput and how quickly you can create an efficient model are two key metrics for deploying deep learning applications because they directly affect the time to market and the cost of deployment. TLT includes an export command to export and prepare TLT models for deployment.The model is exported as a .etlt (encrypted TLT) file. The file is consumable by the TLT CV Inference, which decrypts the model and converts it to a TensorRT engine. Exporting the model decouples the training process from inference and allows conversion to TensorRT engines outside the TLT environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment. The following code example shows the export of the pruned, retrained model.The export command can optionally generate the calibration cache for running inference at INT8 precision. This is described more in detail in later sections.The BodyPoseNet model supports int8 inference mode in TensorRT. To do this, the model is first calibrated to run 8-bit inferences. To calibrate the model, you need a directory with a sampled set of images to be used for calibration.We’ve provided a helper script that parses the annotations and samples the required number of images at random based on specified criteria like number of people in the image, number of keypoints per person, and so on.The following command exports the pruned, retrained model to the .etlt format, performs INT8 calibration, and generates the INT8 calibration cache and TensorRT engine for the current hardware.Make sure that the directory mentioned in --cal_image_dir has at least (batch_size * batches) number of images in it. To generate a F16 engine for the current hardware, specify --data_type as FP16. For more information about the parameters used here, see the INT8 model overview.This evaluation is mainly used as a sanity check for the exported TRT (INT8/FP16) models. This doesn’t reflect the true accuracy of the model as the input aspect ratio here can vary a lot from the aspect ratio of the images in the validation set. The set has a collection of images with various resolutions. Here, you retain a strict input resolution and pad the image to retrain the aspect ratio. So, the accuracy here might vary based on the aspect ratio and the network resolution that you choose.You can run the evaluation of the .tlt model in strict mode as well to compare with the accuracies of the INT8/FP16/FP32 models for any drop in accuracy. The FP16 and FP32 models should have no or minimal drop in accuracy when compared to the .tlt model in this step. The INT8 models would have similar accuracies (or comparable within 2-3% AP range) to the .tlt  model.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the models. One change would be that you now use  $SPECS_DIR/infer_spec_retrained_strict.yaml as inference_spec and the model to use would be a pruned TLT model, INT8 engine, or FP16 engine.After the INT8/FP16/FP32 model is verified, you must reexport the model so it can be used to run on inference platforms like TLT CV Inference. You use the same guidelines as in the previous sections, but you must add the --sdk_compatible_model flag to the export command, which adds a few nontraininable post-process layers to the model to enable compatibility with the inference pipelines. Reuse the calibration tensorfile (cal_data_file) generated in the earlier step to keep it consistent, but you must regenerate the cal_cache_file and the .etlt model.In this section, we look at some best practices to improve model performance and accuracy.Network input resolution of the model is one of the major factors that determine the accuracy of bottom-up approaches. Bottom-up methods must feed the whole image at one time, resulting in a smaller resolution per person. Hence, higher input resolution yields better accuracy, especially on small- and medium-scale persons with regard to the image scale. However, with a higher input resolution, the runtime of the CNN also would be higher. So, the accuracy/runtime tradeoff should be determined by the accuracy and runtime requirements for the target use case.If your application involves pose estimation for one or more persons close to the camera such that the scale of the person is relatively large, then you could go with a smaller network input height. If you are targeting to use the network for persons with smaller relative scales, like crowded scenes, you might want to go with a higher network input height. After you freeze the height of the network, the width can be decided based on the aspect ratio for your input data used during deployment time.These are approximate runtimes and accuracies for the default architecture and spec used in the notebook. Any changes to the architecture or params yields different results. This is primarily to get a better sense of which resolution would suit your needs.You can expect to see a 7-10% AP increase in the area=medium category when going from 224×320 to 288×384 and an additional 7-10% AP when you choose 320×448. The accuracy for area=large remains almost the same across these resolutions, so you can stick to a lower resolution if this is what you need. As per the COCO keypoint evaluation, medium area is defined as persons occupying less than area between 36^2 to 96^2. Anything higher is categorized as large.We use a default size 288×384 in this post. To use a different resolution, you need the following changes:The height and width should be a multiple of 8, preferably a multiple of 16/32/64.Figure 1 shows that the model architecture includes refinement stages, where each stage refines the results of the previous stage. You can use the stages parameter under the model section to configure this. stages include both the initial prediction stage and the refinement stages. We recommend using a minimum of one refinement stage, and a maximum of six, which corresponds to stages within the range [2, 7].When you use more stages of refinement, it may help improve the accuracy but keep in mind that this would result in an increased inference time. We use a default of two refinement stages (stages=3) in this post, which is tuned for optimal performance and accuracy. For even faster performance, use stages=2.Pruning can help with a significant decrease in the number of parameters and maximize speed while preserving the accuracy or at the cost of some drop in accuracy. A higher pruning threshold gives you a smaller model and thus higher inference speed but might cause a drop in accuracy.The threshold to use depends on the dataset. If the retrain accuracy is good, you can increase this value to get smaller models. Otherwise, lower this value to get better accuracy. We recommend iterating with the prune-retrain cycle until you are satisfied with the accuracy-speed tradeoff. You can also use a higher L1 regularization weight when training the model before pruning. It would push more weights towards zero, making it easier to prune the network weights.In this section, we dive deeper into the model accuracy and performance, and compare it against the state of the art, and across platforms.We compare this approach against OpenPose as this method follows a similar single-shot bottom-up methodology. Figure 4 shows that you achieve a much better accuracy-performance tradeoff as compared to the OpenPose model. The accuracy is lower by ~8% AP whereas you achieve close to a 9x speedup for the model trained with the default parameters provided in this post.The following table shows the inference performance of the BodyPoseNet model trained with TLT by using the default parameters. We profiled the model inference with the trtexec command of TensorRT.In this post, you learned about optimizing body pose models using the BodyPoseNet app in TLT. The post showed taking an open-source COCO dataset with a pretrained backbone from NGC to train and optimize a model with TLT. For information regarding model deployment, see the TLT CV inference pipeline Quick Start Scripts and Deployment instructions.With this model, you can get up to 9x improvement in inference performance as compared to OpenPose, helping you achieve real-time performance even on embedded devices. Pruning plus INT8 precision gives you the highest inference performance on your edge devices.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The first post in this series covered how to train a 2D pose estimation model using an open-source COCO dataset with the BodyPoseNet app in the NVIDIA Transfer Learning Toolkit.In this post, you learn how to optimize the pose estimation model in the NVIDIA Transfer Learning Toolkit. It walks you through the steps of model pruning and INT8 quantization to optimize the model for inference.This section covers few topics of model optimization and export:BodyPoseNet supports model pruning to remove unnecessary connections, reducing the number of parameters by an order of magnitude. This results in an optimized model architecture.To prune the model, use the following command:Usually, you just have to adjust -pth (threshold) for accuracy and model size trade off. For some internal studies, we’ve noticed that a pth value between the range [0.05, 3.0] is a good starting point for BodyPoseNet models.After the model has been pruned, there might be a slight decrease in accuracy because some previously useful weights may have been removed. To regain the accuracy, we recommend retraining this pruned model over the same dataset. You can follow the same instructions as in the Train experiment configuration file section. The main change is now to specify pretrained_weights as the path to pruned model and enable load_graph. Because the model is being initialized with pruned model weights, the model converges faster.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the pruned model. After retraining the pruned model with pth 0.05, you can observe an accuracy of 56.1% AP with multiscale inference. Here are the metrics on COCO validation set:Inference throughput and how quickly you can create an efficient model are two key metrics for deploying deep learning applications because they directly affect the time to market and the cost of deployment. TLT includes an export command to export and prepare TLT models for deployment.The model is exported as a .etlt (encrypted TLT) file. The file is consumable by the TLT CV Inference, which decrypts the model and converts it to a TensorRT engine. Exporting the model decouples the training process from inference and allows conversion to TensorRT engines outside the TLT environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment. The following code example shows the export of the pruned, retrained model.The export command can optionally generate the calibration cache for running inference at INT8 precision. This is described more in detail in later sections.The BodyPoseNet model supports int8 inference mode in TensorRT. To do this, the model is first calibrated to run 8-bit inferences. To calibrate the model, you need a directory with a sampled set of images to be used for calibration.We’ve provided a helper script that parses the annotations and samples the required number of images at random based on specified criteria like number of people in the image, number of keypoints per person, and so on.The following command exports the pruned, retrained model to the .etlt format, performs INT8 calibration, and generates the INT8 calibration cache and TensorRT engine for the current hardware.Make sure that the directory mentioned in --cal_image_dir has at least (batch_size * batches) number of images in it. To generate a F16 engine for the current hardware, specify --data_type as FP16. For more information about the parameters used here, see the INT8 model overview.This evaluation is mainly used as a sanity check for the exported TRT (INT8/FP16) models. This doesn’t reflect the true accuracy of the model as the input aspect ratio here can vary a lot from the aspect ratio of the images in the validation set. The set has a collection of images with various resolutions. Here, you retain a strict input resolution and pad the image to retrain the aspect ratio. So, the accuracy here might vary based on the aspect ratio and the network resolution that you choose.You can run the evaluation of the .tlt model in strict mode as well to compare with the accuracies of the INT8/FP16/FP32 models for any drop in accuracy. The FP16 and FP32 models should have no or minimal drop in accuracy when compared to the .tlt model in this step. The INT8 models would have similar accuracies (or comparable within 2-3% AP range) to the .tlt  model.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the models. One change would be that you now use  $SPECS_DIR/infer_spec_retrained_strict.yaml as inference_spec and the model to use would be a pruned TLT model, INT8 engine, or FP16 engine.After the INT8/FP16/FP32 model is verified, you must reexport the model so it can be used to run on inference platforms like TLT CV Inference. You use the same guidelines as in the previous sections, but you must add the --sdk_compatible_model flag to the export command, which adds a few nontraininable post-process layers to the model to enable compatibility with the inference pipelines. Reuse the calibration tensorfile (cal_data_file) generated in the earlier step to keep it consistent, but you must regenerate the cal_cache_file and the .etlt model.In this section, we look at some best practices to improve model performance and accuracy.Network input resolution of the model is one of the major factors that determine the accuracy of bottom-up approaches. Bottom-up methods must feed the whole image at one time, resulting in a smaller resolution per person. Hence, higher input resolution yields better accuracy, especially on small- and medium-scale persons with regard to the image scale. However, with a higher input resolution, the runtime of the CNN also would be higher. So, the accuracy/runtime tradeoff should be determined by the accuracy and runtime requirements for the target use case.If your application involves pose estimation for one or more persons close to the camera such that the scale of the person is relatively large, then you could go with a smaller network input height. If you are targeting to use the network for persons with smaller relative scales, like crowded scenes, you might want to go with a higher network input height. After you freeze the height of the network, the width can be decided based on the aspect ratio for your input data used during deployment time.These are approximate runtimes and accuracies for the default architecture and spec used in the notebook. Any changes to the architecture or params yields different results. This is primarily to get a better sense of which resolution would suit your needs.You can expect to see a 7-10% AP increase in the area=medium category when going from 224×320 to 288×384 and an additional 7-10% AP when you choose 320×448. The accuracy for area=large remains almost the same across these resolutions, so you can stick to a lower resolution if this is what you need. As per the COCO keypoint evaluation, medium area is defined as persons occupying less than area between 36^2 to 96^2. Anything higher is categorized as large.We use a default size 288×384 in this post. To use a different resolution, you need the following changes:The height and width should be a multiple of 8, preferably a multiple of 16/32/64.Figure 1 shows that the model architecture includes refinement stages, where each stage refines the results of the previous stage. You can use the stages parameter under the model section to configure this. stages include both the initial prediction stage and the refinement stages. We recommend using a minimum of one refinement stage, and a maximum of six, which corresponds to stages within the range [2, 7].When you use more stages of refinement, it may help improve the accuracy but keep in mind that this would result in an increased inference time. We use a default of two refinement stages (stages=3) in this post, which is tuned for optimal performance and accuracy. For even faster performance, use stages=2.Pruning can help with a significant decrease in the number of parameters and maximize speed while preserving the accuracy or at the cost of some drop in accuracy. A higher pruning threshold gives you a smaller model and thus higher inference speed but might cause a drop in accuracy.The threshold to use depends on the dataset. If the retrain accuracy is good, you can increase this value to get smaller models. Otherwise, lower this value to get better accuracy. We recommend iterating with the prune-retrain cycle until you are satisfied with the accuracy-speed tradeoff. You can also use a higher L1 regularization weight when training the model before pruning. It would push more weights towards zero, making it easier to prune the network weights.In this section, we dive deeper into the model accuracy and performance, and compare it against the state of the art, and across platforms.We compare this approach against OpenPose as this method follows a similar single-shot bottom-up methodology. Figure 4 shows that you achieve a much better accuracy-performance tradeoff as compared to the OpenPose model. The accuracy is lower by ~8% AP whereas you achieve close to a 9x speedup for the model trained with the default parameters provided in this post.The following table shows the inference performance of the BodyPoseNet model trained with TLT by using the default parameters. We profiled the model inference with the trtexec command of TensorRT.In this post, you learned about optimizing body pose models using the BodyPoseNet app in TLT. The post showed taking an open-source COCO dataset with a pretrained backbone from NGC to train and optimize a model with TLT. For information regarding model deployment, see the TLT CV inference pipeline Quick Start Scripts and Deployment instructions.With this model, you can get up to 9x improvement in inference performance as compared to OpenPose, helping you achieve real-time performance even on embedded devices. Pruning plus INT8 precision gives you the highest inference performance on your edge devices.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.19	Five NVIDIA Inception partners were named finalists at the 2020-2021 Artificial Intelligence Tech Sprint, a competition aimed at improving healthcare for veterans using the latest AI technology. Hosted by the Department of Veterans Affairs (VA), the sprint is designed to foster collaboration with industry and academic partners on AI-enabled tools that leverage federal data to address a need for veterans. See the official news release for more details about the competition. Participating teams gave presentations and demonstrations judged by panels of Veterans and other experts. 44 teams from industry and universities participated, addressing a range of health care challenges such as chronic conditions management, cancer screening, rehabilitation, patient experiences and more. Majority of the solutions from NVIDIA Inception partners were powered by NVIDIA Clara Guardian, a smart hospital solution for patient monitoring, mental health, and patient care.  “We were overwhelmed with the overall quality of proposals in this very competitive cycle, it’s a great tribute to our mission to serve our veterans,” said Artificial Intelligence Tech Sprint lead Rafael Fricks. “Very sophisticated AI capabilities are more accessible than ever; a few years ago these proposals wouldn’t have been possible outside the realm of very specialized high performance computing.” NVIDIA looks forward to providing continued support to these winning Inception Partners in the coming Pilot Implementation program phase, and to the contributions their AI solutions will make to the important VA/VHA healthcare mission serving our Nation’s veterans. Build your patient care solutions on Clara Guardian >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	NVIDIA recently released NVIDIA Riva with world-class speech recognition capability for enterprises to generate highly accurate transcriptions and NVIDIA NeMo 1.0, which includes new state-of-the-art speech and language models for democratizing and accelerating conversational AI research.NVIDIA Riva world-class speech recognition is an out-of-the-box speech service that can be easily deployed in any cloud or datacenter. Enterprises can use the Transfer Learning Toolkit (TLT) to customize speech service across a variety of industries and use cases.  With TLT, developers can accelerate development of custom speech and language models by 10x.  The speech recognition model is highly accurate and trained on domain-agnostic vocabulary from telecommunications, finance, healthcare, education, and also various proprietary and open-source datasets. Additionally, it was trained on noisy data, multiple sampling rates including 8khz for call centers, variety of accents, and dialogue all of which contribute to the model’s accuracy. With the Riva speech service, you can generate a transcription in under 10 milliseconds. It is evaluated on multiple proprietary datasets with over ninety percent accuracy and can be adapted to a wide variety of use cases and domains. It can be used in several apps such as transcribing audio in call centers, video conferencing and in virtual assistants.T-Mobile, one of the largest telecommunication operators in the United States, used Riva to offer exceptional customer service.“With NVIDIA Riva services, fine-tuned using T-Mobile data, we’re building products to help us resolve customer issues in real time,” said Matthew Davis, vice president of product and technology at T-Mobile. “After evaluating several automatic speech recognition solutions, T-Mobile has found Riva to deliver a quality model at extremely low latency, enabling experiences our customers love.”You can download the Riva speech service from the NGC Catalog to start building your own transcription application today. NVIDIA NeMo is an open-source toolkit for researchers developing state-of-the-art (SOTA) conversational AI models. It includes collections for automatic speech recognition (ASR), natural language processing (NLP) and text-to-speech (TTS), which enable researchers to quickly experiment with new SOTA neural networks and create new models or build on top of existing ones. NeMo is tightly coupled with PyTorch, PyTorch Lightning and Hydra frameworks. These integrations enable researchers to develop and use NeMo models and modules in conjunction with PyTorch and PyTorch Lightning modules. Also, with the Hydra framework and NeMo, researchers can easily customize complex conversational AI models.Highlights of this version include:Also, most NeMo models can be exported to NVIDIA Riva for production deployment and high-performance inference. Learn more about what is included in NeMo 1.0 from the NVIDIA Developer Blog. NeMo is open-sourced and is available for download and use from the NGC Catalog and GitHub.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.20	‘Meet the Researcher’ is a series in which we spotlight researchers in academia who use NVIDIA technologies to accelerate their work. This month we spotlight Antti Honkela, Associate Professor in Computer Science at University of Helsinki in Finland. Honkela is the Coordinating Professor of the Research Program in Privacy-preserving and Secure AI at the Finnish Center for Artificial Intelligence (FCAI). In addition, he serves as a privacy and anonymity expert in the steering group of Findata, the recently established Finnish Health and Social Data Permit Authority, and has given expert statements to the Finnish Parliament on legislation related to health data privacy.What are your research areas of focus?Most of my group focuses on developing machine learning and probabilistic inference methods under differential privacy. This provides a strong guarantee that the results cannot be used to violate the privacy of data subjects. I also supervise two students working on applying probabilistic models to analyze genetic data.What motivated you to pursue this research area?I have been interested in math and computers for a while. After one year at university, I was given the opportunity to work as a research assistant for Dr. Harri Valpola, who is now the CEO and co-founder of Curious AI. It was during this time that I became hooked on Bayesian machine learning.Bioinformatics came into the picture after I received my PhD when I struggled to find an application for my machine learning work, which was not apparent at the time. Thanks to an opportunity at a NeurIPS workshop, I met Professor Eric Mjolsness; he told me that some of the models I had been developing might be perfect for modeling gene regulation.After a few years of working on bioinformatics, I moved back into machine learning to work on differential privacy. This has been an excellent opportunity to link my research, my long-term interest in digital human rights, and my solid theoretical background in mathematics to help solve what I believe will be a significant bottleneck for machine learning for health.Tell us about a few of your current research projects.One major project in my group is the work led by Dr. Antti Koskela on using numerical methods for accurate privacy accounting for differential privacy. Differential privacy allows the deriving of an upper bound on so-called privacy loss of data subjects when their data is used. However, the loss increases with each additional access to the data, and it is easy to derive very loose upper bounds on the total loss. Still, these provide a very pessimistic view of the actual privacy loss. Deriving accurate bounds for complex algorithms such as training a neural network with differentially private stochastic gradient descent has been a major challenge, but our work provides an efficient numerical solution with provable error bounds.Another major initiative is developing tools for differentially private probabilistic programming, which allows the user to specify the structure of a probabilistic model. At the same time the system will automatically derive an algorithm for learning the model from data. Such models allow creating anonymized twins of sensitive data sets more efficiently by easily incorporating prior knowledge. This work is based on a very close collaboration with researchers from the group of Professor Samuel Kaski at Aalto University, and led by Joonas Jälkö and Lukas Prediger.What problems or challenges does your research address?We want to develop technologies that allow using sensitive personal data such as health data for things like precision healthcare with guarantees that data subject privacy is maintained. I believe these will be essential for achieving the desired AI revolution in healthcare in a societally sustainable way.What technological breakthroughs are you most proud of?From our recent work, I am excited about noise-aware differentially private Bayesian inference we have recently developed for generalized linear models such as logistic regression (led by Dr. Tejas Kulkarni from Aalto University) as well as Gaussian processes. These methods beautifully bring together two important technologies: differential privacy for strong privacy protection, and Bayesian inference for quantifying the uncertainty of predictions and inferences. These are a perfect combination because differential privacy requires injecting more randomness to guarantee the privacy and with these methods we can quantify the impact of that randomness in the final result.Going further back and really technical, things that stand-out are using natural gradients in variational inference that can really speed-up learning, and has led to significant later breakthroughs in stochastic variational inference and Bayesian deep learning.A small but significant technical breakthrough that enabled a few major papers, but did not make the headlines, was a method for expressing the computations by using numerically stable evaluation of differences of so-called error functions. These come up in operations with the Gaussian distribution, and recently came up even in some differential privacy work. My original MATLAB code has now been ported to many other languages.How are using NVIDIA technologies for your research?GPUs make training large machine learning models a lot faster and we use NVIDIA V100 and A100 GPUs extensively in my group. I really wish such tools would have been available when I was doing my PhD in the early 2000s using weeks to train neural networks.Training models under differential privacy has caused some problems here, because it needs access to per-example gradients that standard deep learning frameworks do not support efficiently. I am really happy about the great collaboration we had with the NVIDIA AI Technology Center in Helsinki who helped make our differentially private probabilistic programming code run really fast on NVIDIA GPUs.What is next for your research?I have two big goals at the moment: developing new methods to allow doing machine learning and Bayesian inference better under differential privacy, and bringing these to users in open source tools that integrate nicely with their existing workflows and run efficiently.Any advice for new researchers, especially to those who are inspired and motivated by your work?Lasting scientific contributions arise from rigorous work built on a solid foundation. There are a lot of components out there that tempt you to try some quick hacks for a quick result, but these very seldom lead to lasting results. This is especially true in fields like privacy, where mathematically rigorous privacy proofs are essential, and seemingly minor details may break the proof for some otherwise attractive combination of methods.To learn more about the work that Antti Honkela and his group is doing, visit his academia webpage.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	New research out of the University of California, San Francisco has given a paralyzed man the ability to communicate by translating his brain signals into computer generated writing. The study, published in The New England Journal of Medicine, marks a significant milestone toward restoring communication for people who have lost the ability to speak. “To our knowledge, this is the first successful demonstration of direct decoding of full words from the brain activity of someone who is paralyzed and cannot speak,” senior author and the Joan and Sanford Weill Chair of Neurological Surgery at UCSF, Edward Chang said in a press release. “It shows strong promise to restore communication by tapping into the brain’s natural speech machinery.” Some with speech limitations use assistive devices–such as touchscreens, keyboards, or speech-generating computers to communicate. However, every year thousands lose their speech ability from paralysis or brain damage, leaving them unable to use assistive technologies. The participant lost his ability to speak in 2003, paralyzed by a brain stroke following a car accident. The researchers were not sure if his brain retained neural activity linked to speech. To track his brain signals, a neuroprosthetic device consisting of electrodes was positioned on the left side of the brain, across several regions known for speech processing. Over about four months the team embarked on 50 training sessions, where the participant was prompted to say individual words, form sentences, or respond to questions on a display screen. While responding to the prompts, the electrode device captured neural activity and transmitted the information to a computer with custom software. “Our models needed to learn the mapping between complex brain activity patterns and intended speech. That poses a major challenge when the participant can’t speak,” David Moses, a postdoctoral engineer in the Chang lab and one of the lead authors of the study, said in a press release.To decode the responses from his brain activity, the team created speech-detection and word classification models. Using the cuDNN-accelerated TensorFlow framework and 32 NVIDIA V100 Tensor Core GPUs the researchers trained, fine-tuned, and evaluated the models.“Utilizing neural networks was essential to getting the classification and detection performance we did, and our final product was the result of lots of experimentation,’ said study co-lead Sean Metzger. “Because our dataset was constantly evolving and growing, being able to adapt the models we were using was critical. The GPUs helped us make changes, monitor progress, and understand our dataset.”21	Solving a mystery that stumped scientists for decades, last November a group of computational biologists from Alphabet’s DeepMind used AI to predict a protein’s structure from its amino acid sequence. Not even a year later, a new study offers a more powerful model, capable of computing protein structures in as little as 10 minutes, on one gaming computer.The research, from scientists at the University of Washington (UW), holds promise for faster drug development, which could unlock solutions for treating diseases like cancer.Present in every cell in the body, proteins play a role in many processes such as blood clotting, hormone regulation, immune system response, vision, and cell and tissue repair. Made from long chains of amino acids that interact to form a folded three-dimensional structure, the shape of a protein determines its function.Unfolded or misfolded proteins are also thought to cause degenerative disorders including cystic fibrosis, Alzheimer’s disease, Parkinson’s disease, and Huntington’s disease. Understanding and predicting how a protein structure develops could help scientists design effective interventions for many of these diseases. The researchers at UW developed the RoseTTAFold model by creating a three-track neural network that simultaneously considers the sequence patterns, amino acid interaction, and possible three-dimensional structure of a protein. To train the model, the team used discontinuous crops of protein segments, with 260 unique amino acid elements. With the cuDNN-accelerated PyTorch deep learning framework, and NVIDIA GeForce 2080 GPUs, this information flows back and forth within the deep learning model. The network is then able to deduce a protein’s chemical parts along with its folded structure.“The end-to-end version of RoseTTAFold requires about 10 minutes on an RTX 2080 GPU to generate backbone coordinates for proteins with less than 400 residues. The pyRosetta version requires 5 minutes for network calculations on a single NVIDIA RTX 2080 GPU, and an hour for all-atom structure generation with 15 CPU cores,” the researchers write in the study.  The tool not only quickly predicts proteins, but can do so with limited input. It also has the ability to compute beyond simple structures, predicting complexes consisting of several proteins bound together. More complex models are computed in about 30 minutes on a 24G NVIDIA TITAN RTX.A public server is available for anyone interested in submitting protein sequences. The source code is also freely available to the scientific community.“In just the last month, over 4,500 proteins have been submitted to our new web server, and we have made the RoseTTAFold code available through the GitHub website. We hope this new tool will continue to benefit the entire research community,” said lead author Minkyung Baek, a postdoctoral scholar at the University of Washington, Institute for Protein Design.  Read more >>>Read the full article in Science >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Astrophysics researchers have long faced a tradeoff when simulating space— simulations could be either high-resolution or cover a large swath of the universe. With the help of generative adversarial networks, they can accomplish both at once.Carnegie Mellon University and University of California researchers developed a deep learning model that upgrades cosmological simulations from low to high resolution, allowing scientists to create a complex simulated universe within a day. These simulations are critical for researchers to unravel mysteries around galaxy formation, dark matter and dark energy. “Cosmological simulations need to cover a large volume for cosmological studies, while also requiring high resolution to resolve the small-scale galaxy formation physics, which would incur daunting computational challenges. said Yueying Ni, a Ph.D. candidate at Carnegie Mellon. “Our technique can be used as a powerful and promising tool to match those two requirements simultaneously by modeling the small-scale galaxy formation physics in large cosmological volumes.”  The team’s GAN model can take full-scale, low-resolution models and turn them into super-resolution simulations with up to 512 times as many particles. Though it was trained on data from only small areas of space, the model was able to replicate large-scale structures seen only in massive simulations. Published in PNAS, the journal of the National Academy of Sciences, the project used the hundreds of NVIDIA RTX GPUs on the Texas Advanced Computing Center’s Frontera system.   While existing methods would take over three weeks on a single processing core to create a detailed simulation of 134 million particles, the GPU-accelerated deep learning approach does it in just 36 minutes. And for simulations 1,000 times as large, the new method shrunk simulation time down from months on a dedicated supercomputer to 16 hours on a single GPU.This acceleration can help scientists run more simulations to predict how the universe would look in different scenarios. “With our previous simulations, we showed that we could simulate the universe to discover new and interesting physics, but only at small or low-res scales,” said Rupert Croft, physics professor at Carnegie Mellon. “By incorporating machine learning, the technology is able to catch up with our ideas.”Since the current neural networks focused on how gravity moves dark matter around over time, other phenomena such as supernovae and black holes were left out of the simulations. The team next plans to extend their methods to capture the forces responsible for these events. “The universe is the biggest data set there is,” said Scott Dodelson, head of the department of physics at Carnegie Mellon and director of the National Science Foundation Planning Institute for Artificial Intelligence in Physics. And “artificial intelligence is the key to understanding the universe and revealing new physics.” Read the full article in PNAS >> Read more >> Main image from TNG SimulationsHave a story to share? Submit an idea.Get the developer news feed straight to your inbox.22	Deep learning models have been successfully used in medical image analysis problems but they require a large, curated amount of labeled images to obtain good performance. Creating such annotations are tedious, time-consuming and typically require clinical expertise.To address this gap, Project MONAI has released MONAI Label v0.1 – an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets easily and quickly, and build AI models in a standardized MONAI paradigm.MONAI Label enables the adaptation of AI models to the clinical task at hand by continuously learning from the user’s interactions and new labels. It powers an AI Assisted annotation experience, allowing researchers and developers to make continuous improvements to their applications with iterative feedback from clinicians who are typically the end users of the medical imaging AI models.At the Children’s Hospital of Philadelphia (CHOP), Dr. Matthew Jolley explains how they are innovating and driving clinical impact with machine learning algorithms.“Children with congenital heart disease demonstrate a broad range of anatomy, and there are few readily available tools to facilitate image-based structural phenotyping and patient specific planning of complex cardiac interventions. However, currently 3D image-based heart model creation is slow, even in the hands of experienced modelers.  As such, we have been working to develop machine learning algorithms to create models of heart valves in children with congenital heart disease, such as the tricuspid valve in hypoplastic left heart syndrome. Ongoing development of automation based on machine learning will allow rapid modeling and precise quantification of how a dysfunctional valve differs from normal valves across multiple parameters.  That “structural valve profile” for an individual can then be contextualized within the spectrum of anatomy and function we see in the population, which may eventually inform improved medical decision making and interventions for children.”With MONAI Label we envision creating a community of researchers and clinicians like Dr. Jolley and his team who can build upon a well maintained software foundation that will accelerate collaboration through continuous learning. The MONAI Label team and CHOP collaborated through a Slicer week project, and successfully developed a MONAI Label application for leaflet segmentation of heart valves in 3D echocardiographic (3DE) images. The team is now working to deploy this model as a MONAI Label application on a public facing server at CHOP where clinicians can directly interface with the model and trigger a training loop for adaptation – learn more.It is incredibly important for an open source initiative like Project MONAI to have clinicians in the loop as we converge to develop a common set of best practices for AI lifecycle management in healthcare imaging. To quote Dr. Jolley:“Open-source frameworks like Project MONAI provide a standardized, transparent, and reproducible template for the creation of, and deployment of medical imaged-focused machine learning models, potentiating efforts such as ours. They allow us to focus on investigating novel algorithms and their application, rather than developing and maintaining software infrastructure.  This in turn has accelerated research progress which we are actively translating into tools of practical relevance to the pediatric community we serve.”WHAT IS INCLUDED IN MONAI LABEL V0.1MONAI Label is an open-source server-client system that is easy to set up and can run locally on a machine with one or two GPUs. The initial release does not yet support multiple user sessions, therefore both server and client operate on the same machine.MONAI Label delivers on MONAI’s core promise of being modular, Pythonic, extensible, easy to debug, user friendly, and portable.MONAI v0.1 includes:Future releases of NVIDIA Clara AIAA will also leverage the MONAI Label framework. We continue to bring together development efforts for NVIDIA Clara medical imaging tools and MONAI to deliver domain-optimized, robust software tools for researchers and developers in healthcare imaging.With contributions from an engaged community, MONAI Label aims to reduce the cost of labeling and maximize the collaboration between researchers & clinicians. Get started today with sample applications available on the MONAI Label GitHub and follow along with our step-by-step getting started guide available in the MONAI Label Documentation.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Deep learning models have been successfully used in medical image analysis problems but they require a large, curated amount of labeled images to obtain good performance. Creating such annotations are tedious, time-consuming and typically require clinical expertise.To address this gap, Project MONAI has released MONAI Label v0.1 – an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets easily and quickly, and build AI models in a standardized MONAI paradigm.MONAI Label enables the adaptation of AI models to the clinical task at hand by continuously learning from the user’s interactions and new labels. It powers an AI Assisted annotation experience, allowing researchers and developers to make continuous improvements to their applications with iterative feedback from clinicians who are typically the end users of the medical imaging AI models.At the Children’s Hospital of Philadelphia (CHOP), Dr. Matthew Jolley explains how they are innovating and driving clinical impact with machine learning algorithms.“Children with congenital heart disease demonstrate a broad range of anatomy, and there are few readily available tools to facilitate image-based structural phenotyping and patient specific planning of complex cardiac interventions. However, currently 3D image-based heart model creation is slow, even in the hands of experienced modelers.  As such, we have been working to develop machine learning algorithms to create models of heart valves in children with congenital heart disease, such as the tricuspid valve in hypoplastic left heart syndrome. Ongoing development of automation based on machine learning will allow rapid modeling and precise quantification of how a dysfunctional valve differs from normal valves across multiple parameters.  That “structural valve profile” for an individual can then be contextualized within the spectrum of anatomy and function we see in the population, which may eventually inform improved medical decision making and interventions for children.”With MONAI Label we envision creating a community of researchers and clinicians like Dr. Jolley and his team who can build upon a well maintained software foundation that will accelerate collaboration through continuous learning. The MONAI Label team and CHOP collaborated through a Slicer week project, and successfully developed a MONAI Label application for leaflet segmentation of heart valves in 3D echocardiographic (3DE) images. The team is now working to deploy this model as a MONAI Label application on a public facing server at CHOP where clinicians can directly interface with the model and trigger a training loop for adaptation – learn more.It is incredibly important for an open source initiative like Project MONAI to have clinicians in the loop as we converge to develop a common set of best practices for AI lifecycle management in healthcare imaging. To quote Dr. Jolley:“Open-source frameworks like Project MONAI provide a standardized, transparent, and reproducible template for the creation of, and deployment of medical imaged-focused machine learning models, potentiating efforts such as ours. They allow us to focus on investigating novel algorithms and their application, rather than developing and maintaining software infrastructure.  This in turn has accelerated research progress which we are actively translating into tools of practical relevance to the pediatric community we serve.”WHAT IS INCLUDED IN MONAI LABEL V0.1MONAI Label is an open-source server-client system that is easy to set up and can run locally on a machine with one or two GPUs. The initial release does not yet support multiple user sessions, therefore both server and client operate on the same machine.MONAI Label delivers on MONAI’s core promise of being modular, Pythonic, extensible, easy to debug, user friendly, and portable.MONAI v0.1 includes:Future releases of NVIDIA Clara AIAA will also leverage the MONAI Label framework. We continue to bring together development efforts for NVIDIA Clara medical imaging tools and MONAI to deliver domain-optimized, robust software tools for researchers and developers in healthcare imaging.With contributions from an engaged community, MONAI Label aims to reduce the cost of labeling and maximize the collaboration between researchers & clinicians. Get started today with sample applications available on the MONAI Label GitHub and follow along with our step-by-step getting started guide available in the MONAI Label Documentation.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.23	AI has been instrumental in providing exciting features and improving quality and operational efficiency for conferencing, media delivery and content creation. This year at GTC we announced the release of NVIDIA Maxine, a GPU-accelerated SDK for building innovative virtual collaboration and content creation applications such as video conferencing and live streaming. Check out some of the most popular sessions, demos, and videos from GTC showcasing Maxine’s latest advancements:The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here. SDKNVIDIA Maxine Now Available With Maxine’s AI SDKs—Video Effects, Audio Effects, and Augmented Reality (AR)—developers can now create real-time, video-based experiences easily deployed to PCs, data centers, and the cloud. Maxine can also leverage NVIDIA Jarvis to access conversational AI capabilities such as transcription, translation, and virtual assistants.On-DemandHow NVIDIA’s Maxine Changed the Way We CommunicateHear from Avaya’s Mike Kuch, Sr. Director of Solutions Marketing, and Paul Relf, Sr. Director of Product Management about Avaya Spaces built on CPaaS. Avaya is making capabilities associated with meetings available in contact centers. With AI noise elimination, agents and customers can hear each other in noisy environments. We’re combining components to realize the art of the possible for unique experiences by Avaya with NVIDIA AI.Real-time AI for Video-Conferencing with MaxineLearn from Andrew Rabinovich, Co-Founder and CTO, Julian Green, Co-Founder and CEO, and Tarrence van As, Co-Founder and Principal Engineer, from Headroom about applying the latest AI research on real-time video and audio streams for a more-human video-conferencing application. Explore employing generative models for super-resolution, giving order-of-magnitude reduced bandwidth. See new solutions for saliency segmentation delivering contextual virtual backgrounds of stuff that matters. DemoBuilding AI-Powered Virtual Collaboration and Content Creation Solutions with NVIDIA Maxine With new state-of the-art AI features for video, audio, and augmented reality—including AI face codec, eye contact, super resolution, noise removal, and more—NVIDIA Maxine is reinventing virtual collaboration on PCs, in the data center, and in the cloud. Reinvent Video Conferencing, Content Creation & Streaming with AI Using NVIDIA MaxineDevelopers from video conferencing, content creation and streaming providers such as Notch, Headroom, Be.Live, and Touchcast are using the Maxine SDK to create real-time video-based experiences easily deployed to PCs, data centers or in the cloud.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	In Case You Missed It (ICYMI) is a series in which we spotlight essential talks, whitepapers, blogs, and success stories showcasing NVIDIA technologies accelerating real world solutions.In this update, we look at the ways NVIDIA TensorRT and the Triton Inference Server can help your business deploy high-performance models with resilience at scale. We start with an in-depth, step-by-step introduction to TensorRT and Triton. Next, we dig into exactly how Triton and Clara Deploy complement each other in your healthcare use cases. Finally, to round things out our whitepaper covers exactly what you’ll need to know when migrating your applications to Triton.On-Demand: Inception Café – Accelerating Deep Learning Inference with NVIDIA TensorRT and TritonA step-by-step walkthrough applying NVIDIA TensorRT and Triton in conjunction with NVIDIA Clara Deploy.Watch >Whitepaper: Inception Café – Migrating Your Medical AI App to TritonThis whitepaper explores the end-to-end process of migrating an existing medical AI application to Triton.Read >On-Demand: Introduction to TensorRT and Triton A Walkthrough of Optimizing Your First Deep Learning Inference ModelAn overview of TensorRT optimization of a PyTorch model followed by deployment of the optimized model using Triton. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.Watch >On-Demand: Clara Train 4.0 – 101 Getting StartedThis session provides a walk through the Clara Train SDK features and capabilities with a set of Jupyter Notebooks covering a range of topics, including Medical Model Archives (MMARs), AI-assisted annotation, and AutoML. These features help data scientists quickly annotate, train, and optimize hyperparameters for their deep learning model.Watch >On-Demand: Clara Train 4.0 – 201 Federated LearningThis session delivers an overview of federated learning, a distributed AI model development technique that allows models to be created without transferring data outside of hospitals or imaging centers. The session will finish with a walkthrough of Clara Train federated learning capabilities by going through a set of Jupyter Notebooks.Watch >On-Demand: Medical Imaging AI with MONAI BootcampMONAI is a freely available, community-supported, open-source PyTorch-based framework for deep learning in medical imaging. It provides domain-optimized foundational capabilities for developing medical imaging training workflows in a native PyTorch paradigm. This MONAI Bootcamp offers medical imaging researchers an architectural deep dive of MONAI and finishes with a walkthrough of MONAI’s capabilities through a set of four Jupyter Notebooks.Watch >On-Demand: Clara Guardian 101: A Hello World Walkthrough on the Jetson PlatformNVIDIA Clara Guardian provides healthcare-specific pretrained models and sample applications that can significantly reduce the time-to-solution for developers building smart-hospital applications. It targets three categories—public safety (thermal screening, mask detection, and social distancing monitoring), patient care (patient monitoring, fall detection, and patient engagement), and operational efficiency (operating room workflow automation, surgery analytics, and contactless control). In this session, attendees will get a walkthrough of how to use Clara Guardian on the Jetson NX platform, including how to use the pretrained models for tasks like automatic speech recognition and body pose estimation.Watch >On-Demand: GPU-Accelerated Genomics Using Clara Parabricks, Gary BurnettNVIDIA Clara Parabricks is a software suite for performing secondary analysis of next generation sequencing (NGS) DNA and RNA data. A major benefit of Parabricks is that it is designed to deliver results at blazing fast speeds and low cost. Parabricks can analyze whole human genomes in under 30 minutes, compared to about 30 hours for 30x WGS data. In this session, attendees will take a guided tour of the Parabricks suite featuring live examples and real world applications.Watch >On-Demand: Using Ethernet to Stream High-Throughput, Low-Latency Medical Sensor DataMedical sensors in various medical devices generate high-throughput data. System designers are challenged to move the sensor data to the GPU for processing. Over the last decade, Ethernet speeds have increased from 10G to 100G, enabling new ways to meet this challenge. We’ll explore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient — NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices. Finally, a step-by-step demo will walk attendees through installing the software, initializing a link, and testing for throughput and CPU overhead.Watch >NEW on NGC: Simplify and Unify Biomedical Analytics with VyasaLearn how Vyasa Analytics leverages Clara Discovery, Triton Inference Server, RAPIDS, and DGX to develop solutions for pharmaceutical and biotechnology companies. Vyasa Analytics solutions are available from the NVIDIA NGC catalog for rapid evaluation and deployment.Read >Do you have a startup? Join NVIDIA Inception’s global network of over 8,000 startups.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.24	A team of scientists from Argonne National Laboratory developed a new method for turning X-ray data into visible, 3D images with the help of AI. The study, published in Applied Physics Reviews, develops a computational framework capable of taking data from the lab’s Advanced Photon Source (APS) and creating 3D visualizations hundreds of times faster than traditional methods. “In order to make full use of what the upgraded APS will be capable of, we have to reinvent data analytics. Our current methods are not enough to keep up. Machine learning can make full use and go beyond what is currently possible,” Mathew Cherukara, a computational scientist at Argonne and study coauthor, said in a press release. The advancement could have wide-ranging benefits to many areas of study relying on sizable amounts of 3D data, ranging from astronomy to nanoscale imaging.Described as one of the most technologically complex machines in the world, the APS uses extremely bright X-ray beams to help researchers see the structure of materials at the molecular and atomic level. As these beams of light bounce off an object, detectors collect them in the form of data. With time and complex computations, this data is converted into images, revealing the object’s structure.However, detectors are unable to capture all the beam data, leaving missing pieces of information. The researchers fill this gap by using neural networks that train computer models to identify objects and visualize an image, based on the raw data it is fed. With 3D images this can be extremely timely due to the amount of information processed.“We used computer simulations to create crystals of different shapes and sizes, and we converted them into images and diffraction patterns for the neural network to learn. The ease of quickly generating many realistic crystals for training is the benefit of simulations,” said Henry Chan, an Argonne postdoctoral researcher, and study coauthor.  The work for the new computational framework, known as 3D-CDI-NN, was developed using GPU resources at Argonne’s Joint Laboratory for System Evaluation, consisting of NVIDIA A100 and RTX 8000 GPUs.“This paper… greatly facilitates the imaging process. We want to know what a material is, and how it changes over time, and this will help us make better pictures of it as we make measurements,” said Stephan Hruszkewycz, study coauthor and physicist with Argonne’s Materials Science Division.	NVIDIA Clara AGX SDK 3.0 is available today! The Clara AGX SDK runs on the NVIDIA Jetson and Clara AGX platform and provides developers with capabilities to build end-to-end streaming workflows for medical imaging. The focus of this release is to provide added support for NGC containers, including TensorFlow and PyTorch frameworks, a new ultrasound application, and updated Transfer Learning Toolkit scripts.  There is now support for the leading deep learning framework containers, including TensorFlow 1, TensorFlow 2, and PyTorch, as well as the Triton Inference Server. These containers can help you quickly get started using the Clara AGX Development Kit, NVIDIA’s GPU super-charged development platform for AI medical devices and edge-based inferencing. We’ve also released three new application containers along with the SDK, available on NGC. These application containers include: Clara AGX SDK has also been updated to the latest Transfer Learning Toolkit (TLT) 3.0 release. Developers can now use TLT 3.0 out-of-the-box and includes compatibility with DeepStream SDK for real-time, low latency, high-resolution image AI deployments.  Download Clara AGX SDK 3.0 through the Clara AGX Developer Site. An NVIDIA Developer Program account is needed to access the SDK. You can also find all of our containers through NGC. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.25	 Read more >>>Read the full article in Applied Physics Reviews >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Read more >>>Read the full article in Applied Physics Reviews >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.26	When it comes to production, companies spend endless cycles improving their processes to drive the most revenue. Manufacturing lines are rigorously tested, and any changes require downtime that can eat up a company’s profits. That’s where AI comes in.Manufacturing as an industry is ripe to experience the benefits of AI because it performs highly repeatable tasks that can each be tuned and optimized for overall performance. AI takes readily-available historical data from sensors, cameras, and even outcomes and processes it faster than any human could, without getting tired. Once the data is fed into the AI, the AI makes sense of it, then it has to make a prediction based on past data, it makes a choice based on the best option available, and finally it takes action.At GTC ’21, Data Monsters, who builds AI solutions for production and packaging, discussed the growth of AI in manufacturing and how AI is being used to optimize every part of the supply chain, from forecasting and production planning to quality control. The session “Getting Started with AI in Manufacturing” shared how AI could be used to improve the Overall Equipment Effectiveness (OEE) of any organization using data that is already available today. OEE consists of three factors: availability, performance, and quality. Each of these factors can be optimized to improve the effectiveness and therefore profits of manufacturers. Let’s take a look at the various AI techniques that can be used for each.Availability is measured by the amount of uptime compared to downtime. As downtime at any part of the system can result in dramatic productivity loss, predictive maintenance is something many manufacturers are looking to in order to improve the uptime of machinery. Predictive maintenance models learn from the system and identify indicators that predict a failure. This model can alert the team prior to a failure and make recommendations about what needs to be fixed, both of which can reduce downtime. Performance looks at how fast products are being produced compared to how fast they could be produced. With highly repetitive tasks in the manufacturing space, AI can be used to help identify the most efficient schedule based on objective function parameters, and make suggestions on where bottlenecks can be removed. Depending on the parameters, process optimization can determine the most efficient outcome based on technology variables and historical outcomes, thus maximizing throughput, minimizing cost, and reducing leftover stock.Quality of production means looking at what proportion of products are being produced without defects. Here, computer vision provides a lot of data for analysis. Manufacturers can improve the overall quality by identifying where in the process the defects are happening so they can be prevented in the future. Reducing defects and improving the overall quality of products can have a dramatic impact on not only productivity, but also revenue. AI becomes a huge differentiator in the manufacturing space, as it reduces manual operation, and improves efficiency and the competitive position in the market with optimized costs and scheduling. Due to the intense calculations of AI required to perform these tasks, manufacturers are bringing the compute close to sensors generating the data. Moving compute to the edge has the benefit of lowering latency and bandwidth requirements to run AI applications, ensuring the fastest and most accurate responses. With numerous compute systems on production lines, AI models are downloaded from the cloud, data is collected and processed locally. Models are fine-tuned and uploaded back to the cloud for further distribution between several edge systems.To learn more about implementing inspections, diagnostics, and predictive maintenance in the manufacturing pipeline, check out the Data Monster’s session “Getting Started with AI in Manufacturing“. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	When it comes to production, companies spend endless cycles improving their processes to drive the most revenue. Manufacturing lines are rigorously tested, and any changes require downtime that can eat up a company’s profits. That’s where AI comes in.Manufacturing as an industry is ripe to experience the benefits of AI because it performs highly repeatable tasks that can each be tuned and optimized for overall performance. AI takes readily-available historical data from sensors, cameras, and even outcomes and processes it faster than any human could, without getting tired. Once the data is fed into the AI, the AI makes sense of it, then it has to make a prediction based on past data, it makes a choice based on the best option available, and finally it takes action.At GTC ’21, Data Monsters, who builds AI solutions for production and packaging, discussed the growth of AI in manufacturing and how AI is being used to optimize every part of the supply chain, from forecasting and production planning to quality control. The session “Getting Started with AI in Manufacturing” shared how AI could be used to improve the Overall Equipment Effectiveness (OEE) of any organization using data that is already available today. OEE consists of three factors: availability, performance, and quality. Each of these factors can be optimized to improve the effectiveness and therefore profits of manufacturers. Let’s take a look at the various AI techniques that can be used for each.Availability is measured by the amount of uptime compared to downtime. As downtime at any part of the system can result in dramatic productivity loss, predictive maintenance is something many manufacturers are looking to in order to improve the uptime of machinery. Predictive maintenance models learn from the system and identify indicators that predict a failure. This model can alert the team prior to a failure and make recommendations about what needs to be fixed, both of which can reduce downtime. Performance looks at how fast products are being produced compared to how fast they could be produced. With highly repetitive tasks in the manufacturing space, AI can be used to help identify the most efficient schedule based on objective function parameters, and make suggestions on where bottlenecks can be removed. Depending on the parameters, process optimization can determine the most efficient outcome based on technology variables and historical outcomes, thus maximizing throughput, minimizing cost, and reducing leftover stock.Quality of production means looking at what proportion of products are being produced without defects. Here, computer vision provides a lot of data for analysis. Manufacturers can improve the overall quality by identifying where in the process the defects are happening so they can be prevented in the future. Reducing defects and improving the overall quality of products can have a dramatic impact on not only productivity, but also revenue. AI becomes a huge differentiator in the manufacturing space, as it reduces manual operation, and improves efficiency and the competitive position in the market with optimized costs and scheduling. Due to the intense calculations of AI required to perform these tasks, manufacturers are bringing the compute close to sensors generating the data. Moving compute to the edge has the benefit of lowering latency and bandwidth requirements to run AI applications, ensuring the fastest and most accurate responses. With numerous compute systems on production lines, AI models are downloaded from the cloud, data is collected and processed locally. Models are fine-tuned and uploaded back to the cloud for further distribution between several edge systems.To learn more about implementing inspections, diagnostics, and predictive maintenance in the manufacturing pipeline, check out the Data Monster’s session “Getting Started with AI in Manufacturing“. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.27	In an increasingly complex world, the need to automate and improve operational efficiency and safety in our physical spaces has never been greater. Whether it is streamlining the retail experience, tackling traffic congestion in our growing cities, or improving productivity in our factories—the power of AI and edge computing is critical. Video cameras are one of the most important IoT sensors. With approximately 1 billion cameras deployed worldwide, they generate a wealth of data that, when combined with AI-enabled perception and reasoning, is key to transforming ordinary areas into smart spaces. As the number of IoT sensors grows, more data is getting generated in remote edge locations. Sending data from sensors at the edge to data centers is extremely costly. However, data movement is critical for the successful operation of AI applications, which means that these applications are susceptible to high costs and latency when processing through the cloud. Sending the data to data centers is not ideal in contexts where every second counts, such as managing real-time traffic or addressing medical emergencies.This leads us to edge computing, a distributed computing model that allows computing to take place near the sensor where data is being collected and analyzed.Edge computing is the technology that powers edge AI, an architecture that processes the sensor data with deep learning algorithms close to the sensors generating the data. Edge AI enables any device or computer to process data and make decisions in real-time with minimal latency. Hence, edge computing is essential for real-time applications that require low latency to enable quick responses. Examples include spotting obstacles on rail lines, inspecting defects on fast moving assembly lines, or detecting patient falls in hospitals.By bringing AI processing tasks closer to the source, edge computing overcomes issues that can occur with cloud computing, like high latency and compromised security. Some advantages of edge computing are:Use Cases of Edge Computing for Smart CitiesCities, school campuses, and shopping malls are several of many places that have started to use AI at the edge to transform themselves into smart spaces. From traffic management to city planning, these entities are using AI to make their spaces more efficient, accessible, and safe. The following examples illustrate how edge computing has been used to transform operations and improve safety around the world. To reduce traffic congestionNota developed a real-time traffic control solution that uses edge computing and computer vision to identify traffic volume, analyze congestion, and optimize traffic signal controls at intersections. Nota’s solutions are used by cities to improve traffic flow, saving them traffic congestion-related costs and minimizing the amount of time drivers spend in traffic. [Read more]To assess and avoid operational hazards in citiesViisights helps manage operations within Israel’s cities. Viisights’ edge computing application assists city officials in identifying and managing events in densely populated areas. Its real-time detection of behavior helps officials predict how quickly an event is growing and determine if there is reason for alarm or a need to take action. [Read more]To revolutionize the retail industryMany retail stores and distribution centers use edge computing and computer vision to bring real-time insights to retailers, enabling them to protect their assets and streamline distribution system processes. The technology can help retailers grow their top line with efficiencies that can improve retailers’ net profit margins. [Read more]To save lives at beachesSightbit developed an image detection application that helps spot dangers at beaches. Speed is very critical in these life or death situations which is why processing is done at the edge. The system detects potential dangers such as rip currents, or hazardous ocean conditions allowing authorities to enact life-saving procedures. [Read more]To improve airline and airport operation efficiencyAirports around the world are partnering with ASSAIA to use edge computing to improve turnaround times and reduce delays. ASSAIA’s AI-enabled video analytics application produces insights that help airlines and airports make better and quicker decisions around capacity, sustainability, and safety. [Read more]A new generation of AI applications at the edge is driving incredible operational efficiency and safety gains across a broad range of spaces. Download this free e-book to learn how edge computing is helping build smarter and safer spaces around the world.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NGC team is hosting a webinar and live Q&A. Topics include how to use containers from the NGC catalog deployed from Google Cloud Marketplace to GKE, a managed Kubernetes service on Google Cloud, that easily builds, deploys, and runs AI solutions.Building a Computer Vision Service Using NVIDIA NGC and Google CloudAugust 25 at 10 a.m. PTOrganizations are using computer vision to improve the product experience, increase production, and drive operational efficiencies. But, building a solution requires large amounts of labeled data, the software and hardware infrastructure to train AI models, and the tools to run real-time inference that will scale with demand.With one click, NGC containers for AI can be deployed from Google Cloud Marketplace to GKE. This managed Kubernetes service on Google Cloud, makes it easy for enterprises to build, deploy, and run their AI solutions.By joining this webinar, you will learn:Register now >>> Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.28	JPEG 2000 (.jp2, .jpg2, .j2k) is an image compression standard defined by the Joint Photographers Expert Group (JPEG) as the more flexible successor to the still popular JPEG standard. Part 1 of the JPEG 2000 standard, which forms the core coding system, was first approved in August 2002. To date, the standard has expanded to 17 parts, covering areas like Motion JPEG2000 (Part 3) which extends the standard for video, extensions for three-dimensional data (Part 10), and so on.Features like mathematically lossless compression and large precision and higher dynamic range per component helped JPEG 2000 find adoption in digital cinema applications. JPEG 2000 is also widely used in digital pathology and geospatial imaging, where image dimensions exceed 4K but regions of interest (ROI) stay small.The JPEG 2000 feature set provides ample opportunities for GPU acceleration when compared to its predecessor, JPEG. Through GPU acceleration, images can be decoded in parallel and larger images can be processed quicker. nvJPEG2000 is a new library that accelerates the decoding of JPEG 2000 images on NVIDIA GPUs. It supports codec features commonly used in geospatial imaging, remote sensing, and digital pathology. Figure 1 overviews the decoding stages that nvJPEG2000 accelerates.The Tier1 Decode (entropy decode) stage is the most compute-intensive stage of the entire decode process. The entropy decode algorithm used in the legacy JPEG codec was serial in nature and was hard to parallelize.In JPEG 2000, the entropy decode stage is applied at a block-based granularity (typical block sizes are 64×64 and 32×32) that makes it possible to offload the entropy decode stage entirely to the GPU. For more information about the entropy decode process, see Section C of the JPEG 2000 Core coding system specification.The JPEG 2000 core coding system allows for two types of wavelet transforms (5-3 Reversible and 9-7 Irreversible), both of which benefit from GPU acceleration. For more information about the wavelet transforms, see Section F of the JPEG 2000 Core coding system specification.In this section, we concentrate on the new nvJPEG2000 API tailored for the geospatial domain, which enables decoding specific tiles within an image instead of decoding the full image. Imaging data captured by the European Space Agency’s Sentinel 2 satellites are stored as JPEG 2000 bitstreams. Sentinel 2 level 2A data downloaded from the Copernicus hub can be used with the nvJPEG2000 decoding examples. The imaging data has 12 bands or channels and each of them is stored as an independent JPEG 2000 bitstream. The image in Figure 2 is subdivided into 121 tiles. To speed up the decode of multitile images, a new API called nvjpeg2kDecodeTile has been added in nvJPEG2000 v 0.2, which enables you to decode each tile independently.For multitile images, decoding each tile sequentially would be suboptimal. The GitHub multitile decode sample demonstrates how to decode each tile on a separate cudaStream_t. By taking this approach, you can simultaneously decode multiple tiles on the GPU. Nsight Systems trace in Figure 3 shows the decoding of Sentinel 2 data set consisting of 12 bands. By using 10 CUDA streams, up to 10 tiles are being decoded in parallel at any point during the decode process.Table 1 shows performance data comparing a single stream and multiple streams on a GV100 GPU.Using 10 CUDA streams reduces the total decode time of the entire dataset by about 75% on a Quadro GV100 GPU. For more information, see the Accelerating Geospatial Remote Sensing Workflows Using NVIDIA SDKs [S32150] GTC’21 talk. It discusses geospatial image-processing workflows in more detail and the role nvJPEG2000 plays there.JPEG 2000 is used in digital pathology to store whole slide images (WSI). Figure 4 gives an overview of various deep learning techniques that can be applied to WSI. Deep learning models can be used to distinguish between cancerous and healthy cells. Image segmentation methods can be used to identify a tumor location in the WSI. For more information, see Deep neural network models for computational histopathology: A survey.Table 2 lists the key parameters and their commonly used values of a whole slide image (WSI) compressed using JPEG 2000.The image in question is large and it is not possible to decode the entire image at one time due to the amount of memory required. The size of the decode output is around 53 GB (92000×201712 * 3). This is excluding the decoder memory requirements.There are several approaches to handling such large images. In this post, we describe two of them:Both approaches can be easily performed using specific nvJPEG2000 APIs.The nvJPEG2000 library enables the decoding of a specific area of interest in an image supported as part of the  nvjpeg2kDecodeTile API. The following code example shows how to set the area of interest in terms of image coordinates. The nvjpeg2kDecodeParams_t type enables you to control the decode output settings, such as the area of interest to decode.For more information about how to partially decode an image with multiple tiles, see the Decode Tile Decode GitHub sample.The second approach to decode a large image is to decode the image at lower resolutions. The ability to decode only the lower resolutions is a benefit of JPEG 2000 using wavelet transforms. In Figure 5, wavelet transform is applied up to two levels, which gives you access to the image at three resolutions. By controlling how the inverse wavelet transform is applied, you decode only the lower resolutions of an image.The digital pathology image described in Table 2 has 12 resolutions. This information can be retrieved on a per-tile basis:The image has a size of 92000×201712 with 12 resolutions. If you choose to discard the four higher resolutions and decode the image up to eight resolutions, that means you can extract an image of size 5750×12574. By dropping four higher resolutions, you are scaling the result by a factor of 16.To show the performance improvement that decoding JPEG2000 on GPU brings, compare GPU-based nvJPEG2000 with CPU-based OpenJPEG.Figures 6 and 7 show the average speedup when decoding one image at a time. The following images are used in the measurements:The tables were compiled with OpenJPEG CPU Performance – Intel Xeon Gold 6240@2GHz 3.9GHz Turbo (Cascade Lake) HT On, Number of CPU threads per image=16.On NVIDIA Ampere Architecture GPUs such as NVIDIA RTX A6000, the speedup factor is more than 8x for decoding. This speedup is measured for single-image latency.Even higher speedups can be achieved by batching the decode of multiple images. Figures 8 and 9 compare the speed of decoding a 1920×1080 8-bit image with 444 chroma subsampling (Full HD) in both lossless and lossy modes respectively across multiple GPUs.Figures 8 and 9 demonstrate the benefits of batched decode using the nvJPEG2000 library. There’s a significant performance increase on GPUs with a large number of streaming multiprocessors (SMs), such as A100 and NVIDIA RTX A6000, than with smaller numbers of SMs, such as NVIDIA RTX 4000 and T4. By batching, you are making sure that the compute resources available are efficiently used.As observed from Figure 8, the decode speed on an NVIDIA RTX A6000 is 232 images per second for a batch size of 20. This equates to an additional 3x speed over batch size = 1, based on a benchmark image with a low compression ratio. The compressed bitstream is only about 3x smaller than the uncompressed image. At higher compression ratios, the speedup is faster.The following GitHub samples show how to achieve this speedup both at image and tile granularity:The nvJPEG2000 library accelerates the decoding of JPEG2000 images both in size and volume using NVIDIA GPUs by targeting specific image-processing tasks of interest. Decoding JPEG 2000 images using the nvJPEG2000 library can be as much as 8x faster on GPU (NVIDIA RTX A6000) than on CPU. A further speedup of 3x (24x faster than CPU) is achieved by batching the decode of multiple images.The simple nvJPEG2000 APIs make it easy to include in your applications and workflows. It is also integrated into the NVIDIA Data Loading Library (DALI), a data loading and preprocessing library to accelerate deep learning applications. Using nvJPEG2000 and DALI together makes it easy to use JPEG2000 images as part of deep learning training workflows.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.In this post, you learn how to deploy TensorFlow trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.First, a network is trained using any framework. After a network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized to run inference using the TensorRT runtime. To optimize models implemented in TensorFlow, the only thing you have to do is convert models to the ONNX format and use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow. In this post, we discuss how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and to the TensorRT engine with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks. Download the code examples and unzip. You can run either the TensorFlow 1 or the TensorFlow 2 code example by follow the appropriate README. After downloading the file, you should also download labels.py from the Cityscapes dataset scripts repo and place it in the same folder as the other scripts. ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format. It defines a common set of operators, common sets of building blocks of deep learning, and a common file format. It provides a definition of a computation graph, as well as built-in operators. The list of ONNX nodes that may have one or more inputs or outputs forms an acyclic graph. In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50.The workflow consists of the following steps:The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:In addition to Keras, you can also download ResNet-50 from the following locations: The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx. After installing tf2onnx, there are two ways of converting the model from a .pb file to the ONNX format. The first way is to use the command line and the second method is by using Python API. Run the following command:To create the TensorRT engine from the ONNX file, run the following command: This code should be saved in the engine.py file, and is used later in the post. This code example contains the following variable:  The builder creates an empty network (builder.create_network()) and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). You set the input shape for the network (network.get_input(0).shape = shape), after which the builder creates the engine (engine = builder.build_cuda_engine(network)). To create the engine, run the following code example:In this code example, you first get the input shape from the ONNX model. Next,  create the engine, and then save the engine in a .plan file.The TensorRT engine runs inference in the following workflow: These steps are explained in detail in the following code example.  This code should be saved in the inference.py file, and is used later in this post.   The first two lines are for determining the dimensions for input and output. You create page-locked memory buffers in host (h_input_1, h_output). Then, you allocate device memory for input and output the same size as host input and output (d_input_1, d_output). The next step is to create the CUDA stream for copying data between the allocated memory from device and host. In this code example, in the do_inference function, the first step is to load images to buffers in the host using the load_images_to_buffer function. Then the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally the results are copied from GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).In the post Fast INT8 Inference for Autonomous Vehicles with TensorRT 3, the author covered the process of UFF workflow for a semantic segmentation model.In this post, you use similar networks to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers implemented using a deconvolutional layer. The network is trained in about 40,000 iterations on the Cityscapes Dataset. There are multiple ways of converting the TensorFlow model to an ONNX file. One way is the one explained in the ResNet50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes, some of the layers are not supported in the TensorFlow-to-ONNX but they are supported in the Keras to ONNX converter. Depending on the Keras framework and the type of layers used, you may need to choose between converters. In the following code example, you directly convert the Keras model to ONNX using the Keras-to-ONNX converter. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.Figure 2 shows the architecture of the network.As in the previous example, use the following code example to create the engine for semantic segmentation.To test the output of the model, use the Cityscapes Dataset. To work with Cityscapes, you must have the following functions:   sub_mean_chw and color_map.  In the following code example, sub_mean_chw is for subtracting the mean value from the image as the preprocessing step and color_map is the mapping from the class ID to a color. The latter is used for visualization. The following code example is the rest of the code for the previous example. You must run the previous block first because you need the defined functions. Use the example to compare the output of the Keras model and TensorRT engine semantic .plan file and then visualize both outputs.  Replace the placeholders /path/to/semantic_segmentation.hdf5  and input_file_path as appropriate.Figure 3 shows the actual image and the ground truth, and the output of Keras versus the output of the TensorRT engine. As you can see, the output for the TensorRT engine is similar to the one for Keras. Now you can try the ONNX workflow on other networks. For more information about good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub. As an example, we show how to use the ONNX workflow with other networks. The network in this example is U-Net from the segmentation_models library. Here, we only loaded the model and did not train it. You may need to train these models on your preferred dataset. One important point about these networks is that when you load these networks, their input layer sizes are as follows: (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before you convert this model to ONNX, change the network by assigning the size to its input and then convert it to the ONNX format. As an example, load the U-Net network from this library (segmentation_models) and assign the size (244, 244, 3) to its input. After creating the TensorRT engine for the inference, do a similar conversion to what you did for semantic segmentation. Depending on the application and dataset, you may need to have a different color mapping.As we mentioned earlier in this post, another way of downloading pretrained models is to download them from NVIDIA NGC Models. It has a list of checkpoints for pretrained models. As an example, you can search for UNet for TensorFlow and then go to the Download page to get the latest checkpoint. In this post, we explained how to deploy deep learning applications using a TensorFlow-to-ONNX-to-TensorRT workflow, with several examples. The first example was ONNX-TensorRT on ResNet-50, and the second example was VGG16-based semantic segmentation that was trained on the Cityscapes Dataset. At the end of the post, we demonstrated how to apply this workflow on other networks.  For more information about the best performance of training and inference, see NVIDIA Data Center Deep Learning Product Performance. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.29	Deep learning (DL) is the state-of-the-art solution for many machine learning problems, such as computer vision or natural language problems and it outperforms alternative methods. Recent trends include applying DL techniques to recommendation engines. Many large companies—such as AirBnB, Facebook, Google, Home Depot, LinkedIn, and Pinterest—share their experience in using DL for recommender systems.Recently, NVIDIA and the RAPIDS.AI team won three competitions with DL: the ACM RecSys2021 Challenge, SIGIR eCom Data Challenge, and ACM WSDM2021 Booking.com Challenge.The field of recommender systems is complex. In this post, I focus on the neural network architecture and its components, such as embedding and fully connected layers, recurrent neural network cells (LSTM or GRU), and transformer blocks. I discuss popular network architectures, such as Google’s Wide & Deep and Facebook’s Deep Learning Recommender Model (DLRM).There are many different techniques to design a recommender system, such as association rules, content-based or collaborative filtering, matrix factorization, or training a linear or tree-based model to predict the interaction likelihood.What are the advantages of using neural networks? In general, DL models achieve higher accuracy. First, DL can leverage additional data. Many traditional machine-learning techniques plateau with more data. However, when you increase the capacity of neural networks, the model can improve performance with more data.Second, neural networks are flexible in their design. For example, you can train a DL model on multiple objectives (multitask learning), such as “will the user add the item to the cart?”, “start the checkout with the item?”, or “purchase the item?” Each goal helps the model to extract information from the data and the goals can support each other.Other design approaches include adding multimodal data to the recommender model. You can do this by processing product images with a convolutional neural network or product description with an NLP model. Neural networks are used in many domains. You can transfer new developments, such as optimizers or new layers, to recommender systems.Finally, DL frameworks are highly optimized to process terabytes to petabytes of data for all kinds of domains. Here’s how you can design neural networks for recommender systems.Embedding layers represent categories with dense vectors. This technique is popular in NLP to embed words with dense representation. Words with similar meaning have a similar embedding vector.You can apply the same technique to recommender systems. The most trivial recommender system is based on users and items: Which items should you recommend to a user? You have user IDs and item IDs. The words are the users and items, so you use two embedding tables (Figure 1).Calculate the dot-product between the user embedding and the item embedding to get a final score, the likelihood that a user interacts with an item. You may apply the sigmoid activation function as a last step to transform the output to a probability between 0 and 1.This method is equivalent to matrix factorization or alternating least squares (ALS).The performance of neural networks is based on deep architectures with multiple, nonlinear layers. You can extend the previous model by feeding the output of your embedding layers through multiple, fully connected layers with ReLU activations.A design choice is how to combine the two embedding vectors. Either you can concatenate only the embedding vectors or you can multiply the vectors element-wise with each other, similar to a dot product. The output is followed by multiple hidden layers.So far, you’ve used only the user ID and product ID as an input, but you often have more information available. Additional user information could be gender, age, city (address), time since last visit, or credit card used for payment. An item usually has a brand, price, categories, or quantity sold in the last 7 days. This side information can help the model to generalize better. Modify the neural network to use the additional features as input.Embedding layers and fully connected layers are the main components to understand some of the latest published neural network architectures. In this post, I cover Google’s Wide and Deep from 2016 and Facebook’s DLRM from 2019.Google’s Wide and Deep contains two components:The innovation is that both components are trained simultaneously, which is possible as neural networks are flexible. The deep tower feeds categorical features through embedding layers and concatenates the output with numerical input features. The concatenated vector is fed through multiple fully connected layers.Does that sound familiar to you? Yes, that is your previous neural network design. The new component is the wide tower, which is just a linear combination of the input features, with a similar linear/logistic regression. The output of each tower is summed for the final prediction value.Facebook’s DLRM has a similar structure to the neural network architecture with metadata but has some specific differences. The dataset can contain multiple categorical features. DLRM requires that all categorical inputs are fed through an embedding layer with the same dimensionality. Later, I discuss why this is important.Next, the continuous inputs are concatenated and fed through multiple, fully connected layers, called bottom multilayer perceptron (MLP). The final layer of the bottom MLP has the same dimensionality as the embedding layer vectors.DLRM uses a new combination layer. It applies element-wise multiplication between all pairs of embedding vectors and bottom MLP output. That is the reason each vector has the same dimensionality. The resulting vectors are concatenated and fed through another set of fully connected layers (top MLP).When I analyzed different DL-based architectures for recommender systems, I assumed that the input has a tabular data structure and ignored the nature of user interactions. However, a user has multiple interactions in one session when they visit the website. For example, they visit a shop and view multiple product pages. Can you use the sequence of user interactions as an input to extract patterns?In one session, the user views multiple pairs of jeans in a row and you should recommend another pair of jeans. In another session, the same user views multiple pairs of shoes in a row and you should recommend another pair of shoes. That is the intuition behind session-based recommender systems.Thankfully, you can apply some techniques from NLP to the recommender system domain. The user’s interactions have a sequential structure.The sequence can be processed by using either a recurrent neural network (RNN) or transformer-based architecture as the Sequence Layer. You represent the item IDs with embedding vectors and feed the output through your sequence layer. The hidden representation of the sequence layer can be added as an input to your deep learning architecture.As I focused this post on the theory of applying DL to recommender systems, I didn’t cover many other challenges. I briefly describe them here to provide a starting point:This post introduced you to DL-based recommender systems. I started with basic matrix factorization based on two inputs and went over the latest session-based architecture using transformer layers.You can process the sequence by using either a recurrent neural network (RNN) or transformer-based architecture as the sequence layer. Represent the item IDs with embedding vectors and feed the output through the sequence layer. Add the hidden representation of the sequence layer as an input to your DL architecture.Interested in learning more about recommender systems? NVIDIA Merlin is an open-source framework to accelerate recommender systems end-to-end on the GPU. NVIDIA continuously develop more resources to train and deploy DL-based recommender systems easily. Here are some resources to help:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Deep learning (DL) is the state-of-the-art solution for many machine learning problems, such as computer vision or natural language problems and it outperforms alternative methods. Recent trends include applying DL techniques to recommendation engines. Many large companies—such as AirBnB, Facebook, Google, Home Depot, LinkedIn, and Pinterest—share their experience in using DL for recommender systems.Recently, NVIDIA and the RAPIDS.AI team won three competitions with DL: the ACM RecSys2021 Challenge, SIGIR eCom Data Challenge, and ACM WSDM2021 Booking.com Challenge.The field of recommender systems is complex. In this post, I focus on the neural network architecture and its components, such as embedding and fully connected layers, recurrent neural network cells (LSTM or GRU), and transformer blocks. I discuss popular network architectures, such as Google’s Wide & Deep and Facebook’s Deep Learning Recommender Model (DLRM).There are many different techniques to design a recommender system, such as association rules, content-based or collaborative filtering, matrix factorization, or training a linear or tree-based model to predict the interaction likelihood.What are the advantages of using neural networks? In general, DL models achieve higher accuracy. First, DL can leverage additional data. Many traditional machine-learning techniques plateau with more data. However, when you increase the capacity of neural networks, the model can improve performance with more data.Second, neural networks are flexible in their design. For example, you can train a DL model on multiple objectives (multitask learning), such as “will the user add the item to the cart?”, “start the checkout with the item?”, or “purchase the item?” Each goal helps the model to extract information from the data and the goals can support each other.Other design approaches include adding multimodal data to the recommender model. You can do this by processing product images with a convolutional neural network or product description with an NLP model. Neural networks are used in many domains. You can transfer new developments, such as optimizers or new layers, to recommender systems.Finally, DL frameworks are highly optimized to process terabytes to petabytes of data for all kinds of domains. Here’s how you can design neural networks for recommender systems.Embedding layers represent categories with dense vectors. This technique is popular in NLP to embed words with dense representation. Words with similar meaning have a similar embedding vector.You can apply the same technique to recommender systems. The most trivial recommender system is based on users and items: Which items should you recommend to a user? You have user IDs and item IDs. The words are the users and items, so you use two embedding tables (Figure 1).Calculate the dot-product between the user embedding and the item embedding to get a final score, the likelihood that a user interacts with an item. You may apply the sigmoid activation function as a last step to transform the output to a probability between 0 and 1.This method is equivalent to matrix factorization or alternating least squares (ALS).The performance of neural networks is based on deep architectures with multiple, nonlinear layers. You can extend the previous model by feeding the output of your embedding layers through multiple, fully connected layers with ReLU activations.A design choice is how to combine the two embedding vectors. Either you can concatenate only the embedding vectors or you can multiply the vectors element-wise with each other, similar to a dot product. The output is followed by multiple hidden layers.So far, you’ve used only the user ID and product ID as an input, but you often have more information available. Additional user information could be gender, age, city (address), time since last visit, or credit card used for payment. An item usually has a brand, price, categories, or quantity sold in the last 7 days. This side information can help the model to generalize better. Modify the neural network to use the additional features as input.Embedding layers and fully connected layers are the main components to understand some of the latest published neural network architectures. In this post, I cover Google’s Wide and Deep from 2016 and Facebook’s DLRM from 2019.Google’s Wide and Deep contains two components:The innovation is that both components are trained simultaneously, which is possible as neural networks are flexible. The deep tower feeds categorical features through embedding layers and concatenates the output with numerical input features. The concatenated vector is fed through multiple fully connected layers.Does that sound familiar to you? Yes, that is your previous neural network design. The new component is the wide tower, which is just a linear combination of the input features, with a similar linear/logistic regression. The output of each tower is summed for the final prediction value.Facebook’s DLRM has a similar structure to the neural network architecture with metadata but has some specific differences. The dataset can contain multiple categorical features. DLRM requires that all categorical inputs are fed through an embedding layer with the same dimensionality. Later, I discuss why this is important.Next, the continuous inputs are concatenated and fed through multiple, fully connected layers, called bottom multilayer perceptron (MLP). The final layer of the bottom MLP has the same dimensionality as the embedding layer vectors.DLRM uses a new combination layer. It applies element-wise multiplication between all pairs of embedding vectors and bottom MLP output. That is the reason each vector has the same dimensionality. The resulting vectors are concatenated and fed through another set of fully connected layers (top MLP).When I analyzed different DL-based architectures for recommender systems, I assumed that the input has a tabular data structure and ignored the nature of user interactions. However, a user has multiple interactions in one session when they visit the website. For example, they visit a shop and view multiple product pages. Can you use the sequence of user interactions as an input to extract patterns?In one session, the user views multiple pairs of jeans in a row and you should recommend another pair of jeans. In another session, the same user views multiple pairs of shoes in a row and you should recommend another pair of shoes. That is the intuition behind session-based recommender systems.Thankfully, you can apply some techniques from NLP to the recommender system domain. The user’s interactions have a sequential structure.The sequence can be processed by using either a recurrent neural network (RNN) or transformer-based architecture as the Sequence Layer. You represent the item IDs with embedding vectors and feed the output through your sequence layer. The hidden representation of the sequence layer can be added as an input to your deep learning architecture.As I focused this post on the theory of applying DL to recommender systems, I didn’t cover many other challenges. I briefly describe them here to provide a starting point:This post introduced you to DL-based recommender systems. I started with basic matrix factorization based on two inputs and went over the latest session-based architecture using transformer layers.You can process the sequence by using either a recurrent neural network (RNN) or transformer-based architecture as the sequence layer. Represent the item IDs with embedding vectors and feed the output through the sequence layer. Add the hidden representation of the sequence layer as an input to your DL architecture.Interested in learning more about recommender systems? NVIDIA Merlin is an open-source framework to accelerate recommender systems end-to-end on the GPU. NVIDIA continuously develop more resources to train and deploy DL-based recommender systems easily. Here are some resources to help:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.30	The NVIDIA, Facebook, and TensorFlow recommender teams will be hosting a summit with live Q&A to dive into best practices and insights on how to develop and optimize deep learning recommender systems.Develop and Optimize Deep Learning Recommender SystemsThursday, July 29 at 10 a.m. PTBy joining this Deep Learning Recommender Summit, you will hear from fellow ML engineers and data scientists from NVIDIA, Facebook, and TensorFlow on best practices, learnings, and insights for building and optimizing highly effective DL recommender systems.Sessions include:High-Performance Recommendation Model Training at FacebookIn this talk, we will first analyze how model architecture affects the GPU performance and efficiency, and also present the performance optimizations techniques we applied to improve the GPU utilization, which includes optimized PyTorch-based training stack supporting both model and data parallelism, high-performance GPU operators, efficient embedding table sharding, memory hierarchy and pipelining.RecSys2021 Challenge: Predicting User Engagements with Deep Learning Recommender SystemsThe NVIDIA team, a collaboration of Kaggle Grandmaster and NVIDIA Merlin, won the RecSys2021 challenge. It was hosted by Twitter, who provided almost 1 billion tweet-user pairs as a dataset. The team will present their winning solution with a focus on deep learning architectures and how to optimize them.Revisiting Recommender Systems on GPUA new era of faster ETL, Training, and Inference is coming to the RecSys space and this talk will walk through some of the patterns of optimization that guide the tools we are building to make recommenders faster and easier to use on the GPU.TensorFlow RecommendersTensorFlow Recommenders is an end-to-end library for recommender system models: from retrieval, through ranking, to post-ranking. In this talk, we describe how TensorFlow Recommenders can be used to fit and safely deploy sophisticated recommender systems at scale.Register now >>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Deep learning (DL) is the state-of-the-art solution for many machine learning problems, such as computer vision or natural language problems and it outperforms alternative methods. Recent trends include applying DL techniques to recommendation engines. Many large companies—such as AirBnB, Facebook, Google, Home Depot, LinkedIn, and Pinterest—share their experience in using DL for recommender systems.Recently, NVIDIA and the RAPIDS.AI team won three competitions with DL: the ACM RecSys2021 Challenge, SIGIR eCom Data Challenge, and ACM WSDM2021 Booking.com Challenge.The field of recommender systems is complex. In this post, I focus on the neural network architecture and its components, such as embedding and fully connected layers, recurrent neural network cells (LSTM or GRU), and transformer blocks. I discuss popular network architectures, such as Google’s Wide & Deep and Facebook’s Deep Learning Recommender Model (DLRM).There are many different techniques to design a recommender system, such as association rules, content-based or collaborative filtering, matrix factorization, or training a linear or tree-based model to predict the interaction likelihood.What are the advantages of using neural networks? In general, DL models achieve higher accuracy. First, DL can leverage additional data. Many traditional machine-learning techniques plateau with more data. However, when you increase the capacity of neural networks, the model can improve performance with more data.Second, neural networks are flexible in their design. For example, you can train a DL model on multiple objectives (multitask learning), such as “will the user add the item to the cart?”, “start the checkout with the item?”, or “purchase the item?” Each goal helps the model to extract information from the data and the goals can support each other.Other design approaches include adding multimodal data to the recommender model. You can do this by processing product images with a convolutional neural network or product description with an NLP model. Neural networks are used in many domains. You can transfer new developments, such as optimizers or new layers, to recommender systems.Finally, DL frameworks are highly optimized to process terabytes to petabytes of data for all kinds of domains. Here’s how you can design neural networks for recommender systems.Embedding layers represent categories with dense vectors. This technique is popular in NLP to embed words with dense representation. Words with similar meaning have a similar embedding vector.You can apply the same technique to recommender systems. The most trivial recommender system is based on users and items: Which items should you recommend to a user? You have user IDs and item IDs. The words are the users and items, so you use two embedding tables (Figure 1).Calculate the dot-product between the user embedding and the item embedding to get a final score, the likelihood that a user interacts with an item. You may apply the sigmoid activation function as a last step to transform the output to a probability between 0 and 1.This method is equivalent to matrix factorization or alternating least squares (ALS).The performance of neural networks is based on deep architectures with multiple, nonlinear layers. You can extend the previous model by feeding the output of your embedding layers through multiple, fully connected layers with ReLU activations.A design choice is how to combine the two embedding vectors. Either you can concatenate only the embedding vectors or you can multiply the vectors element-wise with each other, similar to a dot product. The output is followed by multiple hidden layers.So far, you’ve used only the user ID and product ID as an input, but you often have more information available. Additional user information could be gender, age, city (address), time since last visit, or credit card used for payment. An item usually has a brand, price, categories, or quantity sold in the last 7 days. This side information can help the model to generalize better. Modify the neural network to use the additional features as input.Embedding layers and fully connected layers are the main components to understand some of the latest published neural network architectures. In this post, I cover Google’s Wide and Deep from 2016 and Facebook’s DLRM from 2019.Google’s Wide and Deep contains two components:The innovation is that both components are trained simultaneously, which is possible as neural networks are flexible. The deep tower feeds categorical features through embedding layers and concatenates the output with numerical input features. The concatenated vector is fed through multiple fully connected layers.Does that sound familiar to you? Yes, that is your previous neural network design. The new component is the wide tower, which is just a linear combination of the input features, with a similar linear/logistic regression. The output of each tower is summed for the final prediction value.Facebook’s DLRM has a similar structure to the neural network architecture with metadata but has some specific differences. The dataset can contain multiple categorical features. DLRM requires that all categorical inputs are fed through an embedding layer with the same dimensionality. Later, I discuss why this is important.Next, the continuous inputs are concatenated and fed through multiple, fully connected layers, called bottom multilayer perceptron (MLP). The final layer of the bottom MLP has the same dimensionality as the embedding layer vectors.DLRM uses a new combination layer. It applies element-wise multiplication between all pairs of embedding vectors and bottom MLP output. That is the reason each vector has the same dimensionality. The resulting vectors are concatenated and fed through another set of fully connected layers (top MLP).When I analyzed different DL-based architectures for recommender systems, I assumed that the input has a tabular data structure and ignored the nature of user interactions. However, a user has multiple interactions in one session when they visit the website. For example, they visit a shop and view multiple product pages. Can you use the sequence of user interactions as an input to extract patterns?In one session, the user views multiple pairs of jeans in a row and you should recommend another pair of jeans. In another session, the same user views multiple pairs of shoes in a row and you should recommend another pair of shoes. That is the intuition behind session-based recommender systems.Thankfully, you can apply some techniques from NLP to the recommender system domain. The user’s interactions have a sequential structure.The sequence can be processed by using either a recurrent neural network (RNN) or transformer-based architecture as the Sequence Layer. You represent the item IDs with embedding vectors and feed the output through your sequence layer. The hidden representation of the sequence layer can be added as an input to your deep learning architecture.As I focused this post on the theory of applying DL to recommender systems, I didn’t cover many other challenges. I briefly describe them here to provide a starting point:This post introduced you to DL-based recommender systems. I started with basic matrix factorization based on two inputs and went over the latest session-based architecture using transformer layers.You can process the sequence by using either a recurrent neural network (RNN) or transformer-based architecture as the sequence layer. Represent the item IDs with embedding vectors and feed the output through the sequence layer. Add the hidden representation of the sequence layer as an input to your DL architecture.Interested in learning more about recommender systems? NVIDIA Merlin is an open-source framework to accelerate recommender systems end-to-end on the GPU. NVIDIA continuously develop more resources to train and deploy DL-based recommender systems easily. Here are some resources to help:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.31	Building a state-of-the-art deep learning model is a complex and time-consuming process. To achieve this, large datasets collected for the model must be of high quality. Once the data is collected, it must be prepared and then trained, and optimized over several iterations. This is not always an option for many enterprises looking to bring their AI applications to market faster while reducing operational costs.NVIDIA TAO is being developed to address these challenges. NVIDIA Train, Adapt, and Optimize (TAO) is an AI model adaptation platform that simplifies and accelerates the creation of enterprise AI applications. By fine-tuning state-of-the-art pre-trained models created by NVIDIA experts with custom data through a UI-based, guided workflow, you can produce highly accurate computer vision, speech, and language understanding models in hours rather than months, eliminating the need for large training runs and deep AI expertise.As a managed and guided workflow, TAO lowers the barrier to building AI by unifying key existing NVIDIA technologies, such as pre-trained models from the NGC catalog, Transfer Learning Toolkit (TLT), Federated Learning with NVIDIA Clara, and TensorRT.Registration for the Early Access Program is now open. Later this year we will begin accepting applicants into the program which will provide you with an exclusive opportunity to collaborate with the NVIDIA product team to help shape TAO.Key Highlights of the Early Access Program:Apply to the TAO Early Access Program here.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Building a state-of-the-art deep learning model is a complex and time-consuming process. To achieve this, large datasets collected for the model must be of high quality. Once the data is collected, it must be prepared and then trained, and optimized over several iterations. This is not always an option for many enterprises looking to bring their AI applications to market faster while reducing operational costs.NVIDIA TAO is being developed to address these challenges. NVIDIA Train, Adapt, and Optimize (TAO) is an AI model adaptation platform that simplifies and accelerates the creation of enterprise AI applications. By fine-tuning state-of-the-art pre-trained models created by NVIDIA experts with custom data through a UI-based, guided workflow, you can produce highly accurate computer vision, speech, and language understanding models in hours rather than months, eliminating the need for large training runs and deep AI expertise.As a managed and guided workflow, TAO lowers the barrier to building AI by unifying key existing NVIDIA technologies, such as pre-trained models from the NGC catalog, Transfer Learning Toolkit (TLT), Federated Learning with NVIDIA Clara, and TensorRT.Registration for the Early Access Program is now open. Later this year we will begin accepting applicants into the program which will provide you with an exclusive opportunity to collaborate with the NVIDIA product team to help shape TAO.Key Highlights of the Early Access Program:Apply to the TAO Early Access Program here.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.32	The NVIDIA Hardware Grant Program helps advance AI and data science by partnering with academic institutions around the world to enable researchers and educators with industry-leading hardware and software.Applicants can request compute support from a large portfolio of NVIDIA products. Awardees of this highly selective program will receive a hardware donation to use in their teaching or research.The hardware granted to qualified applicants could include NVIDIA RTX workstation GPUs powered by NVIDIA Ampere Architecture, NVIDIA BlueField Data Processing Units (DPUs), Remote V100 instances in the cloud with prebuilt container images, NVIDIA Jetson developer kits, and more. Alternatively, certain projects may be awarded with cloud compute credits instead of physical hardware.Please note: NVIDIA RTX 30 Series GPUs are not available through the Academic Hardware Grant Program.The current application submission window will begin on July 12 and close on July 23, 2021. The next submission window will open in early 2022.LEARN MORE >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NVIDIA Hardware Grant Program helps advance AI and data science by partnering with academic institutions around the world to enable researchers and educators with industry-leading hardware and software.Applicants can request compute support from a large portfolio of NVIDIA products. Awardees of this highly selective program will receive a hardware donation to use in their teaching or research.The hardware granted to qualified applicants could include NVIDIA RTX workstation GPUs powered by NVIDIA Ampere Architecture, NVIDIA BlueField Data Processing Units (DPUs), Remote V100 instances in the cloud with prebuilt container images, NVIDIA Jetson developer kits, and more. Alternatively, certain projects may be awarded with cloud compute credits instead of physical hardware.Please note: NVIDIA RTX 30 Series GPUs are not available through the Academic Hardware Grant Program.The current application submission window will begin on July 12 and close on July 23, 2021. The next submission window will open in early 2022.LEARN MORE >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.33	Looking to reveal secrets of days past, historical scholars across the globe spend their life’s work translating ancient manuscripts. A team at the University of Notre Dame looks to help in this quest, with a newly developed machine learning model for translating and recording handwritten documents centuries old.  Using digitized manuscripts from the Abbey Library of Saint Gall, and a machine learning model that takes into account human perception, the study offers a notable improvement in the capabilities of deep learning transcription.“We’re dealing with historical documents written in styles that have long fallen out of fashion, going back many centuries, and in languages like Latin, which are rarely ever used anymore. You can get beautiful photos of these materials, but what we’ve set out to do is automate transcription in a way that mimics the perception of the page through the eyes of the expert reader and provides a quick, searchable reading of the text,” Walter Scheirer, senior author and an associate professor at Notre Dame said in a press release. Founded in 719, the Abbey Library of Saint Gall holds one of the oldest and richest library collections in the world. The library houses approximately 160,000 volumes and 2,000 manuscripts, dating back to the eighth century. Hand-written on parchment paper in languages rarely used today, many of these materials have yet to be read—a potential fortune of historical archives, waiting to be unearthed.Machine learning methods capable of automatically transcribing these types of historical documents have been in the works, however challenges remain. Up until now, large datasets have been necessary to boost the performance of these language models. With the vast number of volumes available, the work takes time, and relies on a relatively small number of expert scholars for annotation. Missing knowledge, such as the Medieval Latin dictionary that has never been compiled, poses even greater obstacles. The team combined traditional machine learning methods with the science of visual psychophysics, which studies the relationship between the physical world and human behavior, to create more information-rich annotations. In this case, they incorporated the measurements of human vision into the training process of the neural networks when processing the ancient texts.“It’s a strategy not typically used in machine learning. We’re labeling the data through these psychophysical measurements, which comes directly from psychological studies of perception—by taking behavioral measurements. We then inform the network of common difficulties in the perception of these characters and can make corrections based on those measurements,” Scheirer said.To train, validate, and test the models the researchers used a set of digitized handwritten Latin manuscripts from St. Gall dating back to the ninth century. They asked experts to read and enter manual transcriptions from lines of text into custom designed software. Measuring the time for each transcription, gives insight into the difficulty of words, characters, or passages. According to the authors, this data helps reduce errors in the algorithm and provides more realistic readings.  All of the experiments were run using the cuDNN-accelerated PyTorch deep learning framework and GPUs. “We definitely could not have accomplished what we did without NVIDIA hardware and software,” said Scheirer.The research introduces a novel loss formulation for deep learning that incorporates measurements of human vision, which can be applied to different processing pipelines for handwritten document transcription. Credit: Scheirer et al/IEEEThere are still areas the team is working to improve. Damaged and incomplete documents, along with illustrations and abbreviations pose a special challenge for the models. “The inflection point AI reached thanks to Internet-scale data and GPU hardware is going to benefit cultural heritage and the humanities just as much as other fields. We’re just scratching the surface of what we can do with this project,” said Scheirer.  Read the full article in IEEE Transactions on Pattern Analysis and Machine Intelligence  >>Read more >>    Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Breast cancer is the most frequently diagnosed cancer among women worldwide. It’s also the leading cause of cancer-related deaths. Identifying breast cancer at an early stage before metastasis enables more effective treatments and therefore significantly improves survival rates.Although mammography is the most widely used imaging technique for early detection of breast cancer, it is not always available in low-resource settings. Its sensitivity also drops for women with dense breast tissue.Breast ultrasound is often used as a supplementary imaging modality to mammography in screening settings, and as the primary imaging modality in diagnostic settings. Despite its advantages, including lower costs relative to mammography, it is difficult to interpret breast ultrasound images as evident by the considerable intra-reader variability. This leads to increased false-positive findings, unnecessary biopsies, and significant discomfort to patients.Previous work using deep learning for breast ultrasound has been based predominantly on small datasets on the scale of thousands of images. Many of these efforts also rely on expensive and time-consuming manual annotation of images to obtain image-level (presence of cancer in each image) or pixel-level (exact location of each lesion) labels.In our recent paper, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams, we leverage the full potential of deep learning and eliminate the need for manual annotations by designing a weakly supervised deep neural network whose working resembles the diagnostic procedure of radiologists (Figure 1).The following table compares how radiologists make predictions compared to our AI system.We compared the performance of the trained network to 10 board-certified breast radiologists in a reader study and to hybrid AI-radiologist models, which average the prediction of the AI and each radiologist. The neural network was trained with a dataset consisting of approximately four million ultrasound images on an HPC cluster powered by NVIDIA technologies. The cluster consists of 34 computation nodes each of which is equipped with 80 CPUs and four NVIDIA V100 GPUs (16/32 GB). With this cluster, we performed hyperparameter search by launching experiments (each taking around 300 GPU hours) over a broad range of hyperparameters.To complete this ambitious project, we preprocessed more than eight million breast ultrasound images collected at NYU Langone between 2012 and 2019 and extracted breast-level cancer labels by mining pathology reports.Our results show that a hybrid AI-radiologist model decreased false positive rates by 37.4% (that is, false suspicions of malignancy). This would lead to a reduction in the number of requested biopsies by 27.8%, while maintaining the same level of sensitivity as radiologists (Figure 3).When acting independently, the AI system achieved higher area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) than individual readers. Figure 3 shows how each reader compares to the network’s performance.Within the internal test set, the AI system maintained high diagnostic accuracy (0.940-0.990 AUROC) across all age groups, mammographic breast densities, and device manufacturers, including GE, Philips, and Siemens. In the biopsied population, it also achieved a 0.940 AUROC.In an external test set collected in Egypt, the system achieved 0.911 AUROC, highlighting its generalization ability in patient demographics not seen during training (Figure 4). Based on qualitative assessment, the network produced appropriate localization information of benign and malignant lesions through its saliency maps. In the exam shown in Figure 4, all 10 breast radiologists thought the lesion appeared suspicious for malignancy and recommended that it undergo biopsy, while the AI system correctly classified it as benign. Most impressively, locations of lesions were never given during training, as it was trained in a weakly supervised manner!For our next steps, we’d like to evaluate our system through prospective validation before it can be widely deployed in clinical practice. This enables us to measure its potential impact in improving the experience of women who undergo breast ultrasound examinations each year on a global level.In conclusion, our work highlights the complementary role of an AI system in improving diagnostic accuracy by significantly decreasing unnecessary biopsies. Beyond improving radiologists’ performance, we have made technical contributions to the methodology of deep learning for medical imaging analysis.This work would not have been possible without state-of-the-art computational resources. For more information, see the preprint, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.34	There is a high chance that you have asked your smart speaker a question like, “How tall is Mount Everest?” If you did, it probably said, “Mount Everest is 29,032 feet above sea level.” Have you ever wondered how it found an answer for you?Question answering (QA) is loosely defined as a system consisting of information retrieval (IR) and natural language processing (NLP), which is concerned with answering questions posed by humans in a natural language. If you are not familiar with information retrieval, it is a technique to obtain relevant information to a query, from a pool of resources, webpages, or documents in the database, for example. The easiest way to understand the concept is the search engine that you use daily. You then need an NLP system to find an answer within the IR system that is relevant to the query. Although I just listed what you need for building a QA system, it is not a trivial task to build IR and NLP from scratch. Here’s how NVIDIA Riva makes it easy to develop a QA system.NVIDIA Riva is a GPU-accelerated SDK for building multimodal conversational AI services that use an end-to-end deep learning pipeline. The Riva framework includes optimized services for speech, vision, and natural language understanding (NLU) tasks. In addition to providing several pretrained models for the entire pipeline of your conversational AI service, Riva is also architected for deployment at scale. In this post, I look closely into the QA function of Riva and how you can create your own QA application with it.To understand how the Riva QA function works, start with Bidirectional Encoder Representations from Transformers (BERT). It’s a transformer-based, NLP, pretraining method developed by Google in 2018, and it completely changed the field of NLP. BERT understands the contextual representation of a given word in a text. It is pretrained on a large corpus of data, including Wikipedia. With the pretrained BERT, a strong NLP engine, you can further fine-tune it to perform QA with many question-answer pairs like those in the Stanford Question Answering Dataset (SQuAD). The model can now find an answer for a question in natural language from a given context: sentences or paragraphs. Figure 1 shows an example of QA, where it highlights the word “gravity” as an answer to the query, “What causes precipitation to fall?”. In this example, the paragraph is the context and the successfully fine-tuned QA model returns the word “gravity” as an answer.Teams of engineers and researchers at NVIDIA deliver a quality QA function that you can use right out-of-the-box with Riva. The Riva NLP service provides a set of high-level API actions that include QA, NaturalQuery. The Wikipedia API action allows you to fetch articles posted on Wikipedia, an online encyclopedia, with a query in natural language. That’s the information retrieval system that I discussed earlier. Combining the Wikipedia API action and Riva QA function, you can create a simple QA system with a few lines of Python code. Start by installing the Wikipedia API for Python. Next, import the Riva NLP service API and gRPC, the underlying communication framework for Riva.Now, create an input query. Use the Wikipedia API action to fetch the relevant articles and define the number of them to fetch, defined as max_articles_combine. Ask a question, “What is speech recognition?” You then print out the titles of the articles returned from the search. Finally, you add the summaries of each article into a variable: combined_summary.Next, open a gRPC channel that points to the location where the Riva server is running. Because you are running the Riva server locally, it is ‘localhost:50051‘. Then, instantiate NaturalQueryRequest, and send a request to the Riva server, passing both the query and the context. Finally, print the response, returned from the Riva server.With Riva QA and the Wikipedia API action, you just created a simple QA application. If there’s an article in Wikipedia that is relevant to your query, you can theoretically find answers. Imagine that you have a database full of articles relevant to your domain, company, industry, or anything of interest. You can create a QA service that can find answers to the questions specific to your field of interest. Obviously, you would need an IR system that would fetch relevant articles from your database, like the Wikipedia API action used in this post. When you have the IR system in your pipeline, Riva can help you find an answer for you. We look forward to the cool applications that you’ll create with Riva. .Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.To measure the performance impact of the new stream-ordered allocator in a real application, here are results from the RAPIDS GPU Big Data Benchmark (gpu-bdb). gpu-bdb is a benchmark of 30 queries representing real-world data science and machine learning workflows at various scale factors: SF1000 is 1 TB of data and SF10000 is 10 TB. Each query is, in fact, a model workflow that can include SQL, user-defined functions, careful subsetting and aggregation, and machine learning.Figure 1 shows the performance of cudaMallocAsync compared to cudaMalloc for a subset of gpu-bdb queries conducted at SF1000 on an NVIDIA DGX-2 across 16 V100 GPUs. As you can see, thanks to memory reuse and eliminating extraneous synchronization, there’s a 2–5x improvement in end-to-end performance when using cudaMallocAsync.An application can use cudaFreeAsync to free a pointer allocated by cudaMalloc. The underlying memory is not freed until the next synchronization of the stream passed to cudaFreeAsync.Similarly, an application can use cudaFree to free memory allocated using cudaMallocAsync. However, cudaFree does not implicitly synchronize in this case, so the application must insert the appropriate synchronization to ensure that all accesses to the to-be-freed memory are complete. Any application code that may be intentionally or accidentally relying on the implicit synchronization behavior of cudaFree must be updated.By default, memory allocated using cudaMallocAsync is accessible from the device associated with the specified stream. Accessing the memory from any other device requires enabling access to the entire pool from that other device. It also requires the two devices to be peer capable, as reported by cudaDeviceCanAccessPeer. Unlike cudaMalloc allocations, cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on memory allocated from memory pools.For example, consider enabling device 4access to the memory pool of device 3:Access from a device other than the device on which the memory pool resides can be revoked by using cudaMemAccessFlagsProtNone when calling cudaMemPoolSetAccess. Access from the memory pool’s own device cannot be revoked.Memory allocated using the default memory pool associated with a device cannot be shared with other processes. An application must explicitly create its own memory pools to share memory allocated using cudaMallocAsync with other processes. The following code sample shows how to create an explicit memory pool with interprocess communication (IPC) capabilities:The location type Device and location ID deviceId indicate that the pool memory must be allocated on a specific GPU. The allocation type Pinned indicates that the memory should be non-migratable, also known as non-pageable. The handle type PosixFileDescriptor indicates that the user intends to query a file descriptor for the pool to share it with another process.The first step to share memory from this pool through IPC is to query the file descriptor that represents the pool:The application can then share the file descriptor with another process, for example through a UNIX domain socket. The other process can then import the file descriptor and obtain a process-local pool handle:The next step is for the exporting process to allocate memory from the pool:There is also an overloaded version of cudaMallocAsync that takes the same arguments as cudaMallocFromPoolAsync:After memory is allocated from this pool through either of these two APIs, the pointer can then be shared with the importing process. First, the exporting process gets an opaque handle representing the memory allocation:This opaque data can then be shared with the importing process through any standard IPC mechanism, such as through shared memory, pipes, and so on The importing process then converts the opaque data into a process-local pointer:Now both processes share access to the same memory allocation. The memory must be freed in the importing process before it is freed in the exporting process. This is to ensure that the memory does not get reutilized for another cudaMallocAsync request in the exporting process while the importing process is still accessing the previously shared memory allocation, potentially causing undefined behavior.The existing function cudaIpcGetMemHandle works only with memory allocated through cudaMalloc and cannot be used on any memory allocated through cudaMallocAsync, regardless of whether the memory was allocated from an explicit pool.If the application expects to use an explicit memory pool most of the time, it can consider setting that as the current pool for the device through cudaDeviceSetMemPool. This enables the application to avoid having to specify the pool argument each time that it must allocate memory from that pool.This has the advantage that any other function allocating with cudaMallocAsync now automatically uses the new pool as its default. The current pool associated with a device can be queried using cudaDeviceGetMemPool.In general, libraries should not change a device’s pool, as doing so affects the entire top-level application. If a library must allocate memory with different properties than those of the default device pool, it may create its own pool and then allocate from that pool using cudaMallocFromPoolAsync. The library could also use the overloaded version of cudaMallocAsync that takes the pool as an argument.To make interoperability easier for applications, libraries should consider providing APIs for the top-level application to coordinate the pools used. For example, libraries could provide set or get APIs to enable the application to control the pool in a more explicit manner. The library could also take the pool as a parameter to individual APIs.When porting an existing application that uses cudaMalloc or cudaFree to the new cudaMallocAsync or cudaFreeAsync APIs, consider the following guidelines.Guidelines for determining the appropriate pool:Guidelines for setting the release threshold for all memory pools:Guidelines for replacing cudaMalloc with cudaMallocAsync:Guidelines for replacing cudaFree with cudaFreeAsync:The stream-ordered allocator and cudaMallocAsync and cudaFreeAsync API functions added in CUDA 11.2 extend the CUDA stream programming model by introducing memory allocation and deallocation as stream-ordered operations. This enables allocations to be scoped to the kernels, which use them while avoiding costly device-wide synchronization that can occur with traditional cudaMalloc/cudaFree.Furthermore, these API functions add the concept of memory pools to CUDA, enabling the reuse of memory to avoid costly system calls and improve performance. Use the guidelines to migrate your existing code and see how much your application performance improves!Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.35	Today, in partnership with NVIDIA, Google Cloud announced Dataflow is bringing GPUs to the world of big data processing to unlock new possibilities. With Dataflow GPU, users can now leverage the power of NVIDIA GPUs in their machine learning inference workflows. Here we show you how to access these performance benefits with BERT. Google Cloud’s Dataflow is a managed service for executing a wide variety of data processing patterns including both streaming and batch analytics. It has recently added GPU support can now accelerate machine learning inference workflows, which are running on Dataflow pipelines. Please check out Google Cloud’s launch post for more exciting new features. In this post, we will showcase the performance benefits and TCO improvement with NVIDIA GPU acceleration by deploying a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on “Question Answering” tasks on Dataflow. We show TensorFlow inference in Dataflow with CPU, how to run the same code on GPU with a significant performance boost, showcase the best performance after we convert the model through NVIDIA TensorRT, and deploy through TensorRT’s python API with Dataflow. Check out NVIDIA sample code to try now. There are several steps we will be touching on in this post. We start by creating an environment on our local machine to run all of these Dataflow jobs. For additional details, please refer to the Dataflow Python quick start guide. It is recommended to create a virtual environment for Python, we use virtualenv here:When using Dataflow, it is required to align the Python version in your development environment with the Dataflow runtime Python version. More specifically, when running a Dataflow pipeline, you should use the same Python version and Apache Beam SDK version to avoid unexpected errors.Now, we activate the virtual environment.One of the most important things to pay attention to before activating a virtual environment is to be sure that you are not operating in another virtual environment, as this usually causes issues.After activating our virtual environment, we are ready to install the required packages. Even though our jobs are running on Dataflow, we still need a couple of packages locally so that Python does not complain when we run our code locally.You can experiment with different versions of TensorFlow but the key here is to align the version you have here and the version that you will be using in the Dataflow environment. Apache Beam and its Google Cloud components are also required.NVIDIA NGC has plenty of resources ranging from GPU-optimized containers to fine-tuned models.  We explore several NGC resources.The first resource we will be using is a BERT large model that is fine-tuned for the SquadV2 question answering task and contains 340 million parameters. The following command will download the BERT model.With the BERT model we just downloaded, automatic mixed precision (AMP) is used during training and the sequence length is 384.We also need a vocabulary file and we get it from a BERT checkpoint that can be obtained from NGC with the following command:After getting these resources, we just need to uncompress them and locate them in our working folder. We will be using a custom docker container and these models will be included in our image.We will be using a custom Dockerfile that is derived from a GPU-optimized NGC TensorFlow container. NGC TensorFlow (TF) containers are the best option when accelerating TF models using NVIDIA GPUs.We then add a couple of more steps to copy these models and the files we have. You can find the Dockerfile here and below is a snapshot of the Dockerfile.The next steps are to build the docker file and push it to the Google Container Registry (GCR). You can do this with the following command. Alternatively, you can use the script we created here. If you are using the script from our repo, you can simply do bash build_and_push.shIf you have already authenticated your Google account, you can simply run the Python files we provided here by calling the run_cpu.sh and run_gpu.sh scripts are available in the same repo.The bert_squad2_qa_cpu.py file in the repo is designed to answer questions based on a description text document. The batch size is 16, meaning that we will be answering 16 questions at each inference call and there are 16,000 questions (1,000 batches of questions). Note that BERT could be fine-tuned for other tasks given a specific use case.When running a job on Dataflow, by default it auto-scales based on real-time CPU usage. If you want to disable this feature you need to set autoscaling_algorithm to NONE. This will let you pick how many workers to use throughout the life of your job. Alternatively, you can let Dataflow auto-scale your job and limit the maximum number of workers to be used by setting the max_num_workers parameter.We recommend setting a job name rather than using the auto-generated name to better follow your jobs by setting the job_name parameter. This job name will be the prefix for the compute instance that is running your job.To execute the same dataflow TensorFlow inference job with GPU support, we need to set the following parameters. For additional information, please refer to Dataflow GPU documentation. For additional information, please refer to Dataflow GPU documentation.The parameter preceding enables us to have an NVIDIA T4 Tensor Core GPU attached to the Dataflow worker VM, which is also visible as a Compute VM instance running our job. Dataflow will automatically install required NVIDIA drivers that support CUDA11.The bert_squad2_qa_gpu.py file is almost the same as the bert_squad2_qa_cpu.py file. This means that with very little to no changes we can have our jobs running using NVIDIA GPUs. In our examples, we have a couple of additional GPU setups such as setting the memory growth with the code below.NVIDIA TensorRT optimizes Deep Learning models for inference and provides low latency and high throughput (for more information). Here, we use the NVIDIA TensorRT optimization to BERT model and use it to answer questions on a Dataflow pipeline with GPU at the speed of light. Users could follow the TensorRT demo BERT github repository.We also use Polygraphy, which is a high-level python API for TensorRT to load the TensorRT engine file and run inference. In Dataflow code, the TensorRT model is encapsulated with a shared utility class, allowing all threads from a Dataflow worker process to make use of it.In Table 10, we provided total run times and resources used for sample runs. The final cost for a Dataflow job is a linear combination of total vCPU time, total memory time, and total hard disk usage. For the GPU case, there is a GPU component as well.Note that the table preceding is compiled based on a run and the exact number might slightly fluctuate but according to our experiments the ratios did not change much.The total savings including the cost and run-time savings is more than 10x when accelerating our model with NVIDIA GPUs (TF-GPU) compared to using CPUs (TF-CPU). This means that when we use NVIDIA GPUs for inference on this task, we can have faster run times and lower costs compared to running your model using only CPUs.With NVIDIA optimized inference libraries such as TensorRT, the user could run more complex and bigger models on GPU in Dataflow. TensorRT further accelerates the same job 3.6x faster compared to running it with TF-GPU, which yields 4.2x cost saving. Compare TensorRT with TF-CPU, we get 17x less execution time that provides around 38x less bill.In this post, we compared TF-CPU, TF-GPU, and TensorRT inference performance for the question answering task running on Google Cloud Dataflow. Dataflow users can get great benefits by leveraging GPU workers and NVIDIA optimized libraries.Accelerating deep learning model inference with NVIDIA GPUs and NVIDIA software is super easy. By adding or changing a couple of lines, we can run models using TF-GPU or TensorRT. We provided scripts and source files here and here for reference.We would like to thank Shan Kulandaivel, Valentyn Tymofieiev, and Reza Rokni from the Google Cloud Dataflow team, and Jill Milton and Fraser Gardiner from NVIDIA for their support and invaluable feedback.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Today, in partnership with NVIDIA, Google Cloud announced Dataflow is bringing GPUs to the world of big data processing to unlock new possibilities. With Dataflow GPU, users can now leverage the power of NVIDIA GPUs in their machine learning inference workflows. Here we show you how to access these performance benefits with BERT. Google Cloud’s Dataflow is a managed service for executing a wide variety of data processing patterns including both streaming and batch analytics. It has recently added GPU support can now accelerate machine learning inference workflows, which are running on Dataflow pipelines. Please check out Google Cloud’s launch post for more exciting new features. In this post, we will showcase the performance benefits and TCO improvement with NVIDIA GPU acceleration by deploying a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on “Question Answering” tasks on Dataflow. We show TensorFlow inference in Dataflow with CPU, how to run the same code on GPU with a significant performance boost, showcase the best performance after we convert the model through NVIDIA TensorRT, and deploy through TensorRT’s python API with Dataflow. Check out NVIDIA sample code to try now. There are several steps we will be touching on in this post. We start by creating an environment on our local machine to run all of these Dataflow jobs. For additional details, please refer to the Dataflow Python quick start guide. It is recommended to create a virtual environment for Python, we use virtualenv here:When using Dataflow, it is required to align the Python version in your development environment with the Dataflow runtime Python version. More specifically, when running a Dataflow pipeline, you should use the same Python version and Apache Beam SDK version to avoid unexpected errors.Now, we activate the virtual environment.One of the most important things to pay attention to before activating a virtual environment is to be sure that you are not operating in another virtual environment, as this usually causes issues.After activating our virtual environment, we are ready to install the required packages. Even though our jobs are running on Dataflow, we still need a couple of packages locally so that Python does not complain when we run our code locally.You can experiment with different versions of TensorFlow but the key here is to align the version you have here and the version that you will be using in the Dataflow environment. Apache Beam and its Google Cloud components are also required.NVIDIA NGC has plenty of resources ranging from GPU-optimized containers to fine-tuned models.  We explore several NGC resources.The first resource we will be using is a BERT large model that is fine-tuned for the SquadV2 question answering task and contains 340 million parameters. The following command will download the BERT model.With the BERT model we just downloaded, automatic mixed precision (AMP) is used during training and the sequence length is 384.We also need a vocabulary file and we get it from a BERT checkpoint that can be obtained from NGC with the following command:After getting these resources, we just need to uncompress them and locate them in our working folder. We will be using a custom docker container and these models will be included in our image.We will be using a custom Dockerfile that is derived from a GPU-optimized NGC TensorFlow container. NGC TensorFlow (TF) containers are the best option when accelerating TF models using NVIDIA GPUs.We then add a couple of more steps to copy these models and the files we have. You can find the Dockerfile here and below is a snapshot of the Dockerfile.The next steps are to build the docker file and push it to the Google Container Registry (GCR). You can do this with the following command. Alternatively, you can use the script we created here. If you are using the script from our repo, you can simply do bash build_and_push.shIf you have already authenticated your Google account, you can simply run the Python files we provided here by calling the run_cpu.sh and run_gpu.sh scripts are available in the same repo.The bert_squad2_qa_cpu.py file in the repo is designed to answer questions based on a description text document. The batch size is 16, meaning that we will be answering 16 questions at each inference call and there are 16,000 questions (1,000 batches of questions). Note that BERT could be fine-tuned for other tasks given a specific use case.When running a job on Dataflow, by default it auto-scales based on real-time CPU usage. If you want to disable this feature you need to set autoscaling_algorithm to NONE. This will let you pick how many workers to use throughout the life of your job. Alternatively, you can let Dataflow auto-scale your job and limit the maximum number of workers to be used by setting the max_num_workers parameter.We recommend setting a job name rather than using the auto-generated name to better follow your jobs by setting the job_name parameter. This job name will be the prefix for the compute instance that is running your job.To execute the same dataflow TensorFlow inference job with GPU support, we need to set the following parameters. For additional information, please refer to Dataflow GPU documentation. For additional information, please refer to Dataflow GPU documentation.The parameter preceding enables us to have an NVIDIA T4 Tensor Core GPU attached to the Dataflow worker VM, which is also visible as a Compute VM instance running our job. Dataflow will automatically install required NVIDIA drivers that support CUDA11.The bert_squad2_qa_gpu.py file is almost the same as the bert_squad2_qa_cpu.py file. This means that with very little to no changes we can have our jobs running using NVIDIA GPUs. In our examples, we have a couple of additional GPU setups such as setting the memory growth with the code below.NVIDIA TensorRT optimizes Deep Learning models for inference and provides low latency and high throughput (for more information). Here, we use the NVIDIA TensorRT optimization to BERT model and use it to answer questions on a Dataflow pipeline with GPU at the speed of light. Users could follow the TensorRT demo BERT github repository.We also use Polygraphy, which is a high-level python API for TensorRT to load the TensorRT engine file and run inference. In Dataflow code, the TensorRT model is encapsulated with a shared utility class, allowing all threads from a Dataflow worker process to make use of it.In Table 10, we provided total run times and resources used for sample runs. The final cost for a Dataflow job is a linear combination of total vCPU time, total memory time, and total hard disk usage. For the GPU case, there is a GPU component as well.Note that the table preceding is compiled based on a run and the exact number might slightly fluctuate but according to our experiments the ratios did not change much.The total savings including the cost and run-time savings is more than 10x when accelerating our model with NVIDIA GPUs (TF-GPU) compared to using CPUs (TF-CPU). This means that when we use NVIDIA GPUs for inference on this task, we can have faster run times and lower costs compared to running your model using only CPUs.With NVIDIA optimized inference libraries such as TensorRT, the user could run more complex and bigger models on GPU in Dataflow. TensorRT further accelerates the same job 3.6x faster compared to running it with TF-GPU, which yields 4.2x cost saving. Compare TensorRT with TF-CPU, we get 17x less execution time that provides around 38x less bill.In this post, we compared TF-CPU, TF-GPU, and TensorRT inference performance for the question answering task running on Google Cloud Dataflow. Dataflow users can get great benefits by leveraging GPU workers and NVIDIA optimized libraries.Accelerating deep learning model inference with NVIDIA GPUs and NVIDIA software is super easy. By adding or changing a couple of lines, we can run models using TF-GPU or TensorRT. We provided scripts and source files here and here for reference.We would like to thank Shan Kulandaivel, Valentyn Tymofieiev, and Reza Rokni from the Google Cloud Dataflow team, and Jill Milton and Fraser Gardiner from NVIDIA for their support and invaluable feedback.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.36	NVIDIA recently released NVIDIA Riva with world-class speech recognition capability for enterprises to generate highly accurate transcriptions and NVIDIA NeMo 1.0, which includes new state-of-the-art speech and language models for democratizing and accelerating conversational AI research.NVIDIA Riva world-class speech recognition is an out-of-the-box speech service that can be easily deployed in any cloud or datacenter. Enterprises can use the Transfer Learning Toolkit (TLT) to customize speech service across a variety of industries and use cases.  With TLT, developers can accelerate development of custom speech and language models by 10x.  The speech recognition model is highly accurate and trained on domain-agnostic vocabulary from telecommunications, finance, healthcare, education, and also various proprietary and open-source datasets. Additionally, it was trained on noisy data, multiple sampling rates including 8khz for call centers, variety of accents, and dialogue all of which contribute to the model’s accuracy. With the Riva speech service, you can generate a transcription in under 10 milliseconds. It is evaluated on multiple proprietary datasets with over ninety percent accuracy and can be adapted to a wide variety of use cases and domains. It can be used in several apps such as transcribing audio in call centers, video conferencing and in virtual assistants.T-Mobile, one of the largest telecommunication operators in the United States, used Riva to offer exceptional customer service.“With NVIDIA Riva services, fine-tuned using T-Mobile data, we’re building products to help us resolve customer issues in real time,” said Matthew Davis, vice president of product and technology at T-Mobile. “After evaluating several automatic speech recognition solutions, T-Mobile has found Riva to deliver a quality model at extremely low latency, enabling experiences our customers love.”You can download the Riva speech service from the NGC Catalog to start building your own transcription application today. NVIDIA NeMo is an open-source toolkit for researchers developing state-of-the-art (SOTA) conversational AI models. It includes collections for automatic speech recognition (ASR), natural language processing (NLP) and text-to-speech (TTS), which enable researchers to quickly experiment with new SOTA neural networks and create new models or build on top of existing ones. NeMo is tightly coupled with PyTorch, PyTorch Lightning and Hydra frameworks. These integrations enable researchers to develop and use NeMo models and modules in conjunction with PyTorch and PyTorch Lightning modules. Also, with the Hydra framework and NeMo, researchers can easily customize complex conversational AI models.Highlights of this version include:Also, most NeMo models can be exported to NVIDIA Riva for production deployment and high-performance inference. Learn more about what is included in NeMo 1.0 from the NVIDIA Developer Blog. NeMo is open-sourced and is available for download and use from the NGC Catalog and GitHub.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NVIDIA, Facebook, and TensorFlow recommender teams will be hosting a summit with live Q&A to dive into best practices and insights on how to develop and optimize deep learning recommender systems.Develop and Optimize Deep Learning Recommender SystemsThursday, July 29 at 10 a.m. PTBy joining this Deep Learning Recommender Summit, you will hear from fellow ML engineers and data scientists from NVIDIA, Facebook, and TensorFlow on best practices, learnings, and insights for building and optimizing highly effective DL recommender systems.Sessions include:High-Performance Recommendation Model Training at FacebookIn this talk, we will first analyze how model architecture affects the GPU performance and efficiency, and also present the performance optimizations techniques we applied to improve the GPU utilization, which includes optimized PyTorch-based training stack supporting both model and data parallelism, high-performance GPU operators, efficient embedding table sharding, memory hierarchy and pipelining.RecSys2021 Challenge: Predicting User Engagements with Deep Learning Recommender SystemsThe NVIDIA team, a collaboration of Kaggle Grandmaster and NVIDIA Merlin, won the RecSys2021 challenge. It was hosted by Twitter, who provided almost 1 billion tweet-user pairs as a dataset. The team will present their winning solution with a focus on deep learning architectures and how to optimize them.Revisiting Recommender Systems on GPUA new era of faster ETL, Training, and Inference is coming to the RecSys space and this talk will walk through some of the patterns of optimization that guide the tools we are building to make recommenders faster and easier to use on the GPU.TensorFlow RecommendersTensorFlow Recommenders is an end-to-end library for recommender system models: from retrieval, through ranking, to post-ranking. In this talk, we describe how TensorFlow Recommenders can be used to fit and safely deploy sophisticated recommender systems at scale.Register now >>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.37	Simulations are pervasive in every domain of science and engineering, but they are often constrained by large computational times, limited compute resources, tedious manual setup efforts, and the need for technical expertise. NVIDIA SimNet is a simulation toolkit that addresses these challenges with a combination of AI and physics. A success story of SimNet’s application today is in modeling the flow and transport in porous media. This effort was led by Cedric Frances, a PhD student at Stanford University.Cedric is researching the applicability and limitations of mesh-free reservoir simulations using physics-informed neural networks (PINNs). He’s keenly interested in the flow and transport problem in porous media (conservation of mass and Darcy flow). Cedric’s application is a Python-based reservoir simulator, which computes the pressure and concentrations of various fluids in a porous media and enables predictions that typically affect large, industrial energy projects. This includes the production of hydrocarbons, storage of carbon dioxide, water disposal, air storage, waste management, and so on.Researchers previously tried to use the PINNs approach to capture the solution of a hyperbolic problem with nonconvex flux term (Riemann problem) in a forward setting with no data other than initial and boundary conditions. Unfortunately, these attempts were unsuccessful.Before trying out SimNet, Cedric developed his own implementations of PINNs using Python and deep learning frameworks such as TensorFlow and Keras. He used various architectures of networks, such as residual, GAN, periodic activation, CNN, PDE-Net, and so on. However, it was difficult to implement all of them to find which one worked best or worked at all. The emergence of open-source code on GitHub made it easy to test these implementations out. The high overhead involved in every new implementation, such as environment setup, hardware configuration, modification of code to test his own problem, and so on, was not efficient.Cedric wanted to have a good, unified framework maintained by a team of professional software developers to solve problems that allowed him to focus on the physics of the problem and extensively test the methods that have been recently published. His search for such a framework ended when he stumbled upon SimNet.Cedric downloaded SimNet and started using fully connected networks with tanh activation functions and spatial weighing of the loss function. He discovered that SimNet’s general-purpose framework with multiple architectures and well-documented examples served as a good starting point. Its ability to emulate solutions with sharp shocks, introduce new dynamic constraints such as entropy and velocity saved him weeks of development. More importantly, it provided a quick turnaround on testing methods to determine their usefulness.The problem presented here is that of incompressible, immiscible displacement of two phases in a porous medium. This is also referred to as transport problem and has been delineated in various forms over the years. It has been applied to the displacement of oil by water for waterflood problems in reservoir for over half a century. More recently, it’s been applied to the displacement of brine by CO2 in carbon sequestration applications. For more information, see Mechanism of Fluid Displacement in Sands and Theory of Gas Injection Processes.Assume that a wetting phase (w) is displacing a nonwetting phase (n). Wettability is the preference of a fluid to be in contact with a solid surrounded by another fluid; for example, water is wetting on most surfaces compared to air. Conservation of mass applies to both phases. For the wetting phase:     (1)In this formula,  is the porosity,  is the saturation (or concentration) of that wetting phase, and is the saturation of the nonwetting. The flow rate of the wetting phase  can be written as follows:     (2)In this formula,  is the absolute permeability that quantifies the propensity of a material to allow liquid or gas to flow through it.  is the wetting phase relative permeability that is function of the saturation and characterizes the effective permeability of a given phase in the presence or absence of it. A phase preferentially flows through a path where it is already present. Think of a drop of water dripping down from a window and following existing trails.You can formulate the phase flux of the wetting phase  as a function of the total flux  using a simple homogenization rule:                                                     (3)You can rewrite this equation as a function of the total flux. This gives rise to the fractional flow:     (4)The conservation equation can now be written:     (5)For a one-dimensional case where you assume that the total flux is equal to one pore volume injected per time step (), you can obtain the equation:     (6)In this formula, the fractional flow is a nonlinear equation defined as follows:      (7) In this formula,   and   are the residual (irreducible) saturations for the wetting and nonwettings resulting from trapping mechanisms and  is the endpoint mobility ratio defined as the ratio of endpoint-relative permeability and viscosity of both phases. We used the Corey and Brooks relative permeability relationship. For more information, see Hydraulic Properties of Porous Media.The partial differential equation solved here is a hyperbolic of order 1 and the fractional flow term is nonconvex. It belongs to the class of Riemann conservation problem that is typically solved using finite volume methods. For more information, see Hyperbolic systems of conservation laws and the mathematical theory of shock waves.With uniform Dirichlet boundary conditions:        (8)        (9)You can apply the method of characteristics (MOC) to build an analytical solution to this equation. For the MOC or any finite volume method to be conservative, you must modify the fractional flow term as shown in Figure 1.Until now, no other known method solved such a problem using a sampling method, so this remained an open question. A previous attempt by Fuks and Tchelepi concluded that physics-informed approaches were not suitable for the problem described (Figure 2).Cedric’s research on this topic has now been published: Physics Informed Deep Learning for Flow and Transport in Porous Media.Important theoretical milestones are now being reached on simple yet challenging 1D examples. Cedric plans on expanding his study to larger dimensions (2D and 3D), where the scalability of the code and the easy deployment on larger arrays will be put to the test. He expects to encounter similar issues and is looking forward to the gain provided by SimNet going from 2D to 3D, for example.Cedric elaborated on his experience with SimNet. “SimNet’s clear APIs, clean and easily navigable code, environment and hardware configurations well handled with Docker containers, scalability, ease of deployment and the competent support team made it easy to adopt and has provided some very promising results. This has been great so far and we look forward to using SimNet on problems with much larger dimensions.”To view the GTC’21 session, see Physics-Informed Neural Network for Flow and Transport in Porous Media. For more information about features and to download the toolkit, see NVIDIA SimNet.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	NVIDIA Clara AGX SDK 3.0 is available today! The Clara AGX SDK runs on the NVIDIA Jetson and Clara AGX platform and provides developers with capabilities to build end-to-end streaming workflows for medical imaging. The focus of this release is to provide added support for NGC containers, including TensorFlow and PyTorch frameworks, a new ultrasound application, and updated Transfer Learning Toolkit scripts.  There is now support for the leading deep learning framework containers, including TensorFlow 1, TensorFlow 2, and PyTorch, as well as the Triton Inference Server. These containers can help you quickly get started using the Clara AGX Development Kit, NVIDIA’s GPU super-charged development platform for AI medical devices and edge-based inferencing. We’ve also released three new application containers along with the SDK, available on NGC. These application containers include: Clara AGX SDK has also been updated to the latest Transfer Learning Toolkit (TLT) 3.0 release. Developers can now use TLT 3.0 out-of-the-box and includes compatibility with DeepStream SDK for real-time, low latency, high-resolution image AI deployments.  Download Clara AGX SDK 3.0 through the Clara AGX Developer Site. An NVIDIA Developer Program account is needed to access the SDK. You can also find all of our containers through NGC. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.38	Robotics researchers from NVIDIA and University of Southern California presented their work at the 2021 Robotics: Science and Systems (RSS) conference called DiSECt, the first differentiable simulator for robotic cutting. The simulator accurately predicts the forces acting on a knife as it presses and slices through natural soft materials, such as fruits and vegetables.Robots that are intelligent, adaptive and generalize cutting behavior with either a kitchen butter knife, or a surgical resection remains a difficult problem for researchers.As it turns out, the process of cutting with feedback requires adaptation to stiffness of the objects, applied force during the cut, and often a sawing motion to cut through. To achieve this, researchers use a family of techniques which leverage feedback to guide the controller adaptation. However, fluid controller adaptation requires very careful parameter tuning for each instance of the same problem. While these techniques are successful in industrial settings, no two cucumbers (or tomatoes) are the same, hence rendering these family of algorithms ineffective in a more generic setting.In contrast, recent focus in research has been in building differentiable algorithms for control problems, which in simpler terms means that the sensitivity of output with respect to input can be evaluated without excessive sampling. Efficient solutions for control problems are achievable when the simulated dynamics is differentiable [1,2,3], but the process of simulating cutting has not been differentiable so far!Differentiable simulation for cutting poses a challenge, since naturally cutting is a discontinuous process where crack formation and fracture propagation occur that prohibit the calculation of gradients. We tackle this problem by proposing a novel way of simulating cutting that represents the process of crack propagation and damage mechanics in a continuous manner.DiSECt implements the commonly used Finite Element Method (FEM) to simulate deformable materials, such as foodstuffs. The object to be cut is represented by a 3D mesh which consists of tetrahedral elements. Along the cutting surface we slice the mesh following the Virtual Node Algorithm [4]. This algorithm duplicates the mesh elements that intersect the cutting surface, and adds additional, so-called “virtual” vertices on the edges where these elements are cut. The virtual nodes add extra degrees of freedom to accurately simulate the contact dynamics of the knife when it presses and slices through the mesh.Next, DiSECt inserts springs connecting the virtual nodes on either side of the cutting surface. These cutting springs allow us to simulate damage mechanics and crack propagation in a continuous manner, by weakening them in proportion to the contact force the knife exerts on the mesh. This continuous treatment allows us to differentiate through the dynamics in order to compute gradients for the parameters defining the properties of the material or the trajectory of the knife. For example, given the gradients for the vertical and sideways velocity of the knife, we can efficiently determine an energy-minimizing yet fast cutting motion through gradient-based optimization.We leverage reverse-mode automatic differentiation to efficiently compute gradients for hundreds of simulation parameters. Our simulator uses source code transformation which automatically generates efficient CUDA kernels for the forward and backward passes of all our simulation routines, such as the FEM or contact model. Such an approach allows us to implement complex simulation routines which are parallelized on the GPU, while the gradients of the inputs with respect to the outputs of such routines are automatically derived from analyzing the abstract syntax tree (AST) of the simulation code. Through gradient-based optimization algorithms, we can automatically tune the simulation parameters to achieve a close match between the simulator and real-world measurements. In one of our experiments, we leverage an existing dataset [5] of knife force profiles measured by a real-world robot while cutting various foodstuffs. We set up our simulator with the corresponding mesh and its material properties, and optimize the remaining parameters to reduce the discrepancy between the simulated and the real knife force profile. Within 150 gradient evaluations, our simulator closely predicts the knife force profile, as we demonstrate on the examples of cutting an actual apple and a potato. As shown in the figure, the initial parameter guess yielded a force profile that was far from the real observation, and our approach automatically found an accurate fit. We present further results in our accompanying paper that demonstrate that the found parameters generalize to different conditions, such as the downward velocity of the knife or the length of the reference trajectory window.We also generate additional data using a highly established commercial simulator, which allows us to precisely control the experimental setup, such as object shape and material properties. Given such data, we can also leverage the motion of the mesh vertices as an additional ground-truth signal. After optimizing the simulation parameters, DiSECt is able to predict the vertex positions and velocities, as well as the force profile, much more accurately.Aside from parameter inference, the gradients of our differentiable cutting simulator can also be used to optimize the cutting motion of the knife. In our full cutting simulation, we represent a trajectory by keyframes, where each frame prescribes the downward velocity, as well as the frequency and amplitude of a sinusoidal sideways velocity. At the start of the optimization, the initial motion is a straight downward-pressing motion.We optimize this trajectory with the objective to minimize the mean force on the knife and penalize the time it takes to cut the object. Studies have shown that humans perform sawing motions when cutting biomaterials in order to reduce the required force. Such behavior emerges from our optimization as well.After 50 iterations with the Adam optimizer, we see a reduction in average knife force by 15 percent. However, the knife slices sideways further than its blade length. Therefore, we add a hard constraint to keep the lateral motion within valid limits and perform constrained optimization. Thanks to the end-to-end-differentiability of DiSECt, accurate gradients for such constraints are available, and lead to a valid knife motion which requires only 0.3 percent more force than the unconstrained result.Cutting food items multiple times results in slightly different force profiles for each instance, depending on the geometry of such materials. We additionally present results to transfer simulation parameters between different meshes corresponding to the same material. Our approach leverages optimal transport to find correspondences between simulation parameters of a source mesh and a target mesh (e.g., local stiffnesses) based on the location of the virtual nodes. As shown in the following figure, the 2D positions of these nodes along the cutting surface allow us to map simulation parameters (shown here is the softness of the cutting springs) to topologically different target geometries.In our ongoing research, we are bringing our differentiable simulation approach to real-world robotic cutting. We investigate a closed-loop control system where the simulator is updated online from force measurements, while the robot is cutting foodstuffs. Through model-predictive planning and optimal control, we aim to find time- and energy-efficient cutting actions that apply to the physical system.We thank Yan-Bin Jia and Prajjwal Jamdagni for kindly providing us a dataset of real-world cutting trajectories which we used throughout our experiments. Be sure to also check out their research on robotic cutting!DiSECt is a finalist at 2021 RSS for the best (student) paper. Visit the team’s project webpage to learn more.DiSECt Research Paper: DiSECt: A Differentiable Simulation Engine for Autonomous Robotic CuttingEric Heiden, Miles Macklin, Yashraj S Narang, Dieter Fox, Animesh Garg, Fabio RamosRobotics Science and Systems (RSS) 2021.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Robotics researchers from NVIDIA and University of Southern California presented their work at the 2021 Robotics: Science and Systems (RSS) conference called DiSECt, the first differentiable simulator for robotic cutting. The simulator accurately predicts the forces acting on a knife as it presses and slices through natural soft materials, such as fruits and vegetables.Robots that are intelligent, adaptive and generalize cutting behavior with either a kitchen butter knife, or a surgical resection remains a difficult problem for researchers.As it turns out, the process of cutting with feedback requires adaptation to stiffness of the objects, applied force during the cut, and often a sawing motion to cut through. To achieve this, researchers use a family of techniques which leverage feedback to guide the controller adaptation. However, fluid controller adaptation requires very careful parameter tuning for each instance of the same problem. While these techniques are successful in industrial settings, no two cucumbers (or tomatoes) are the same, hence rendering these family of algorithms ineffective in a more generic setting.In contrast, recent focus in research has been in building differentiable algorithms for control problems, which in simpler terms means that the sensitivity of output with respect to input can be evaluated without excessive sampling. Efficient solutions for control problems are achievable when the simulated dynamics is differentiable [1,2,3], but the process of simulating cutting has not been differentiable so far!Differentiable simulation for cutting poses a challenge, since naturally cutting is a discontinuous process where crack formation and fracture propagation occur that prohibit the calculation of gradients. We tackle this problem by proposing a novel way of simulating cutting that represents the process of crack propagation and damage mechanics in a continuous manner.DiSECt implements the commonly used Finite Element Method (FEM) to simulate deformable materials, such as foodstuffs. The object to be cut is represented by a 3D mesh which consists of tetrahedral elements. Along the cutting surface we slice the mesh following the Virtual Node Algorithm [4]. This algorithm duplicates the mesh elements that intersect the cutting surface, and adds additional, so-called “virtual” vertices on the edges where these elements are cut. The virtual nodes add extra degrees of freedom to accurately simulate the contact dynamics of the knife when it presses and slices through the mesh.Next, DiSECt inserts springs connecting the virtual nodes on either side of the cutting surface. These cutting springs allow us to simulate damage mechanics and crack propagation in a continuous manner, by weakening them in proportion to the contact force the knife exerts on the mesh. This continuous treatment allows us to differentiate through the dynamics in order to compute gradients for the parameters defining the properties of the material or the trajectory of the knife. For example, given the gradients for the vertical and sideways velocity of the knife, we can efficiently determine an energy-minimizing yet fast cutting motion through gradient-based optimization.We leverage reverse-mode automatic differentiation to efficiently compute gradients for hundreds of simulation parameters. Our simulator uses source code transformation which automatically generates efficient CUDA kernels for the forward and backward passes of all our simulation routines, such as the FEM or contact model. Such an approach allows us to implement complex simulation routines which are parallelized on the GPU, while the gradients of the inputs with respect to the outputs of such routines are automatically derived from analyzing the abstract syntax tree (AST) of the simulation code. Through gradient-based optimization algorithms, we can automatically tune the simulation parameters to achieve a close match between the simulator and real-world measurements. In one of our experiments, we leverage an existing dataset [5] of knife force profiles measured by a real-world robot while cutting various foodstuffs. We set up our simulator with the corresponding mesh and its material properties, and optimize the remaining parameters to reduce the discrepancy between the simulated and the real knife force profile. Within 150 gradient evaluations, our simulator closely predicts the knife force profile, as we demonstrate on the examples of cutting an actual apple and a potato. As shown in the figure, the initial parameter guess yielded a force profile that was far from the real observation, and our approach automatically found an accurate fit. We present further results in our accompanying paper that demonstrate that the found parameters generalize to different conditions, such as the downward velocity of the knife or the length of the reference trajectory window.We also generate additional data using a highly established commercial simulator, which allows us to precisely control the experimental setup, such as object shape and material properties. Given such data, we can also leverage the motion of the mesh vertices as an additional ground-truth signal. After optimizing the simulation parameters, DiSECt is able to predict the vertex positions and velocities, as well as the force profile, much more accurately.Aside from parameter inference, the gradients of our differentiable cutting simulator can also be used to optimize the cutting motion of the knife. In our full cutting simulation, we represent a trajectory by keyframes, where each frame prescribes the downward velocity, as well as the frequency and amplitude of a sinusoidal sideways velocity. At the start of the optimization, the initial motion is a straight downward-pressing motion.We optimize this trajectory with the objective to minimize the mean force on the knife and penalize the time it takes to cut the object. Studies have shown that humans perform sawing motions when cutting biomaterials in order to reduce the required force. Such behavior emerges from our optimization as well.After 50 iterations with the Adam optimizer, we see a reduction in average knife force by 15 percent. However, the knife slices sideways further than its blade length. Therefore, we add a hard constraint to keep the lateral motion within valid limits and perform constrained optimization. Thanks to the end-to-end-differentiability of DiSECt, accurate gradients for such constraints are available, and lead to a valid knife motion which requires only 0.3 percent more force than the unconstrained result.Cutting food items multiple times results in slightly different force profiles for each instance, depending on the geometry of such materials. We additionally present results to transfer simulation parameters between different meshes corresponding to the same material. Our approach leverages optimal transport to find correspondences between simulation parameters of a source mesh and a target mesh (e.g., local stiffnesses) based on the location of the virtual nodes. As shown in the following figure, the 2D positions of these nodes along the cutting surface allow us to map simulation parameters (shown here is the softness of the cutting springs) to topologically different target geometries.In our ongoing research, we are bringing our differentiable simulation approach to real-world robotic cutting. We investigate a closed-loop control system where the simulator is updated online from force measurements, while the robot is cutting foodstuffs. Through model-predictive planning and optimal control, we aim to find time- and energy-efficient cutting actions that apply to the physical system.We thank Yan-Bin Jia and Prajjwal Jamdagni for kindly providing us a dataset of real-world cutting trajectories which we used throughout our experiments. Be sure to also check out their research on robotic cutting!DiSECt is a finalist at 2021 RSS for the best (student) paper. Visit the team’s project webpage to learn more.DiSECt Research Paper: DiSECt: A Differentiable Simulation Engine for Autonomous Robotic CuttingEric Heiden, Miles Macklin, Yashraj S Narang, Dieter Fox, Animesh Garg, Fabio RamosRobotics Science and Systems (RSS) 2021.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.39	The new Isaac simulation engine not only creates better photorealistic environments, but also streamlines synthetic data generation and domain randomization to build ground-truth datasets to train robots in applications from logistics and warehouses to factories of the future.NVIDIA Omniverse is the underlying foundation for NVIDIA’s simulators, including the Isaac platform — which now includes several new features. Discover the next level in simulation capabilities for robots with NVIDIA Isaac Sim open beta, available now. Built on the Omniverse platform, Isaac Sim is a robotics simulation application and synthetic data generation tool. It allows roboticists to train and test their robots more efficiently by providing a realistic simulation of the robot interacting with compelling environments that can expand coverage beyond what is possible in the real world.This release of Isaac Sim also adds improved multi-camera support and sensor capabilities, and a PTC Onshape CAD importer to make it easier to bring in 3D assets. These new features will expand the breadth of robots and environments that can be successfully modeled and deployed in every aspect: from design and development of the physical robot, then training the robot, to deploying in a “digital twin” in which the robot is simulated and tested in an accurate and photorealistic virtual environment.Summary of Key New FeaturesIsaac Sim Enables More Robotics SimulationDevelopers have long seen the benefits of having a powerful simulation environment for testing and training robots. But all too often, the simulators have had shortcomings which limited their adoption. Isaac Sim addresses these drawbacks with the benefits described below. Realistic Simulation In order to deliver realistic robotics simulations, Isaac Sim leverages the Omniverse platform’s powerful technologies including advanced GPU-enabled physics simulation with PhysX 5, photorealism with real-time ray and path tracing, and Material Definition Language (MDL) support for physically-based rendering.Modular, Breadth of ApplicationsIsaac Sim is built to address many of the most common robotics use cases including manipulation, autonomous navigation, and synthetic data generation for training data. Its modular design allows users to easily customize and extend the toolset to accommodate many applications and and environments.Seamless Connectivity and InteroperabilityIsaac Sim benefits from Omniverse Nucleus and Omniverse Connectors, enabling collaborative building, sharing, and importing of environments and robot models in Universal Scene Description (USD). Easily connect the robot’s brain to a virtual world through Isaac SDK and ROS/ROS2 interface, fully-featured Python scripting, plugins for importing robot and environment models.Synthetic Data Generation in Isaac Sim Bootstraps Machine LearningSynthetic Data Generation is an important tool that is increasingly used to train the perception models found in today’s robots. Getting real-world, properly labeled data is a time consuming and costly endeavor. But in the case of robotics, some of the required training data could be too difficult or dangerous to collect in the real world. This is especially true of robots that must operate in close proximity to humans.Isaac Sim has built-in support for a variety of sensor types that are important in training perception models. These sensors include RGB, depth, bounding boxes, and segmentation.In the open beta, we have the ability to output synthetic data in the KITTI format. This data can then be used directly with the NVIDIA Transfer Learning Toolkit to enhance model performance with use case-specific data.Domain RandomizationDomain Randomization varies the parameters that define a simulated scene, such as the lighting, color and texture of materials in the scene. One of the main objectives of domain randomization is to enhance the training of machine learning (ML) models by exposing the neural network to a wide variety of domain parameters in simulation. This will help the model to generalize well when it encounters real world scenarios. In effect, this technique helps teach models what to ignore.Isaac Sim supports the randomization of many different attributes that help define a given scene. With these capabilities, the ML engineers can ensure that the synthetic dataset contains sufficient diversity to drive robust model performance.Randomizable ParametersIn Isaac Sim open beta, we have enhanced the domain randomization capabilities by allowing the user  to define a region for randomization. Developers can now draw a box around the region in the scene that is to be randomized and the rest of the scene will remain static. More Information on Isaac SimCheck out the latest Isaac Sim GTC 2021 session, Sim-to-Real.Also, learn more about Isaac Sim by exploring the growing number of video tutorials.Learn more about using Isasac Sim to train your Jetbot by exploring these developer blogs:.Getting Started Join the thousands of developers who have worked with Isaac Sim across the robotics community via our early access program. Get started with the next step in robotics simulation by downloading Isaac Sim.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Here are the latest resources and news for healthcare developers from GTC 21, including demos and specialized sessions for building AI in drug discovery, medical imaging, genomics, and smart hospitals. Learn about new features now available in NVIDIA Clara Train 4.0, an application framework for medical imaging that includes pre-trained models, AI-assisted annotation, AutoML, and federated learning.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here.On-Demand SessionsAccelerating Drug Discovery with Advanced Computational ModelingSpeaker: Robert Abel, Executive Vice President, Chief Computational Scientist, SchrödingerLearn about how integrated deployment and collaborative use of advanced computational modeling and next-generation machine learning can accelerate drug discovery from Robert Abel, Executive Vice President, Chief Computational Scientist at Schrödinger.Using Ethernet to Stream Medical Sensor DataSpeaker: Mathias Blake, Platform Architect for Medical Devices, NVIDIAExplore three technologies from NVIDIA that make streaming high-throughput medical sensor data over Ethernet easy and efficient—NVIDIA Networking ConnectX NICs, Rivermax SDK with GPUDirect, and Clara AGX. Learn about the capabilities of each of these technologies and explore examples of how these technologies can be leveraged by several different types of medical devices.Automate 3D Medical Imaging Segmentation with AutoML and Neural Architecture SearchSpeaker: Dong Yang, Applied Research Scientist, NVIDIARecently, neural architecture search (NAS) has been applied to automatically search high-performance networks for medical image segmentation. Hear from NVIDIA Applied Research Scientist, Dong Yang, to learn about AutoML and NAS techniques in the Clara Train SDK.Deep Learning and Accelerated Computing for Single-Cell Genomic DataSpeaker: Avantika Lal, Sr. Scientist in Deep Learning and Genomics, NVIDIALearn about accelerating discovery of cell types in the human body with RAPIDS and AtacWorks, a deep learning toolkit to enhance ATAC-seq data and identify active regulatory DNA more accurately than existing state-of-the-art methods.BlogCreating Medical Imaging Models with Clara Train 4.0Learn about the upcoming release of NVIDIA Clara Train 4.0, including infrastructure upgrades based on MONAI, expansion into digital pathology, and updates to DeepGrow for annotating organs effectively in 3D images.DemosAccelerating Drug Discovery with Clara Discovery’s MegaMolBartSee how NVIDIA Clara Discovery’s MegaMolBart, a transformer-based NLP model developed with AstraZeneca, trained on millions of molecules, can accelerate the drug discovery process.NVIDIA Triton Inference Server: Generative Chemical StructuresWatch NVIDIA Triton Inference Server power deep learning models to propose thousands of molecules per second for drug design that can be further refined with physics-based simulations.Visit NVIDIA On-Demand to explore the extensive catalog of sessions, podcasts, demos, research posters and more.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.40	At GTC 21, NVIDIA announced several major breakthroughs in conversational AI for building and deploying automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications. The conference also hosted over 60 engaging sessions and workshops featuring the latest tools, technologies and research in conversational AI and NLP.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here. On-Demand SessionsConversational AI DemystifiedSpeaker: Meriem Bendris, Senior Solution Architect, NVIDIAConversational AI technologies are becoming ubiquitous, with countless products taking advantage of automatic speech recognition, natural language understanding, and speech synthesis coming to market. Thanks to new tools and technologies, developing conversational AI applications is easier than ever, enabling a much broader range of applications, such as virtual assistants, real-time transcription, and many more. We will give an overview of the conversational AI landscape and discuss how any organization can get started developing conversational AI applications today.Building and Deploying a Custom Conversational AI App with NVIDIA Transfer Learning Toolkit and JarvisSpeakers: Tripti Singhal, Solutions Architect, NVIDIA; Nikhil Srihari, Technical Marketing Engineer – Deep Learning, NVIDIA; Arun Venkatesan, Product Manager, NVIDIATailoring the deep learning models in a conversational AI pipeline to your enterprise needs is time-consuming. Developing a domain-specific application typically requires several cycles of re-training, fine-tuning, and deploying the model until it satisfies the requirements. NVIDIA Jarvis helps you easily build production-ready conversational AI applications and provides tools for fine-tuning on your domain. In this session, we will walk you through the process of customizing automatic speech recognition and natural language processing pipelines to build a truly customized production-ready Conversational AI application.Megatron GPT-3 Large Model Inference with Triton and ONNX RuntimeSpeaker: Denis Timonin, AI Solutions Architect, NVIDIAHuge NLP models like Megatron-LM GPT-3, Megatron-LM Bert require tens/hundreds of gigabytes of memory to store their weights or run inference. Frequently, one GPU is not enough for such a task. One way to run inference and maximize throughput of these models is to divide them into smaller sub-parts in the pipeline-parallelism (in-depth) style and run these subparts on multiple GPUs. This method will allow us to use bigger batch size and run inference through an ensemble of subparts in a conveyor manner. TRITON inference server is an open-source inference serving software that lets teams deploy trained AI models from any framework. And this is a perfect tool that allows us to run this ensemble. In this talk, we will take Megatron LM with billions of parameters, convert it in ONNX format, and will learn how to divide it into subparts with the new tool – ONNX-GraphSurgeon. Then, we will use TRITON ensemble API and ONNX runtime background and run this model inference on an NVIDIA DGX.BlogAnnouncing Megatron for Training Trillion Parameter Models and NVIDIA Jarvis AvailabilityNVIDIA announced Megatron for training giant transformer-based language models and major capabilities in NVIDIA Jarvis for building state-of-the-art interactive conversational AI applications.DemoWorld-Class ASR | Real-Time Machine Translation | Controllable Text-to-SpeechWatch this demo to see Jarvis’ automatic speech recognition (ASR) accuracy when fine-tuned on medical jargon, real-time neural machine translation from English to Spanish and Japanese, and powerful controllability of neural text-to-speech.New pre-trained models, notebooks, and sample applications for conversational AI are all available to try from the NGC catalog. You can also find tutorials for building and deploying conversational AI applications at the NVIDIA Developer Blog.Join the NVIDIA Developer Program for all of the latest tools and resources for building with NVIDIA technologies.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	At GTC 21, NVIDIA announced several major breakthroughs in conversational AI for building and deploying automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications. The conference also hosted over 60 engaging sessions and workshops featuring the latest tools, technologies and research in conversational AI and NLP.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here. On-Demand SessionsConversational AI DemystifiedSpeaker: Meriem Bendris, Senior Solution Architect, NVIDIAConversational AI technologies are becoming ubiquitous, with countless products taking advantage of automatic speech recognition, natural language understanding, and speech synthesis coming to market. Thanks to new tools and technologies, developing conversational AI applications is easier than ever, enabling a much broader range of applications, such as virtual assistants, real-time transcription, and many more. We will give an overview of the conversational AI landscape and discuss how any organization can get started developing conversational AI applications today.Building and Deploying a Custom Conversational AI App with NVIDIA Transfer Learning Toolkit and JarvisSpeakers: Tripti Singhal, Solutions Architect, NVIDIA; Nikhil Srihari, Technical Marketing Engineer – Deep Learning, NVIDIA; Arun Venkatesan, Product Manager, NVIDIATailoring the deep learning models in a conversational AI pipeline to your enterprise needs is time-consuming. Developing a domain-specific application typically requires several cycles of re-training, fine-tuning, and deploying the model until it satisfies the requirements. NVIDIA Jarvis helps you easily build production-ready conversational AI applications and provides tools for fine-tuning on your domain. In this session, we will walk you through the process of customizing automatic speech recognition and natural language processing pipelines to build a truly customized production-ready Conversational AI application.Megatron GPT-3 Large Model Inference with Triton and ONNX RuntimeSpeaker: Denis Timonin, AI Solutions Architect, NVIDIAHuge NLP models like Megatron-LM GPT-3, Megatron-LM Bert require tens/hundreds of gigabytes of memory to store their weights or run inference. Frequently, one GPU is not enough for such a task. One way to run inference and maximize throughput of these models is to divide them into smaller sub-parts in the pipeline-parallelism (in-depth) style and run these subparts on multiple GPUs. This method will allow us to use bigger batch size and run inference through an ensemble of subparts in a conveyor manner. TRITON inference server is an open-source inference serving software that lets teams deploy trained AI models from any framework. And this is a perfect tool that allows us to run this ensemble. In this talk, we will take Megatron LM with billions of parameters, convert it in ONNX format, and will learn how to divide it into subparts with the new tool – ONNX-GraphSurgeon. Then, we will use TRITON ensemble API and ONNX runtime background and run this model inference on an NVIDIA DGX.BlogAnnouncing Megatron for Training Trillion Parameter Models and NVIDIA Jarvis AvailabilityNVIDIA announced Megatron for training giant transformer-based language models and major capabilities in NVIDIA Jarvis for building state-of-the-art interactive conversational AI applications.DemoWorld-Class ASR | Real-Time Machine Translation | Controllable Text-to-SpeechWatch this demo to see Jarvis’ automatic speech recognition (ASR) accuracy when fine-tuned on medical jargon, real-time neural machine translation from English to Spanish and Japanese, and powerful controllability of neural text-to-speech.New pre-trained models, notebooks, and sample applications for conversational AI are all available to try from the NGC catalog. You can also find tutorials for building and deploying conversational AI applications at the NVIDIA Developer Blog.Join the NVIDIA Developer Program for all of the latest tools and resources for building with NVIDIA technologies.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.41	Deep learning research requires working at scale. Training on massive data sets or multilayered deep networks is computationally intensive and can take an impractically long time as deep learning models are bound by memory. The key here is to compose the deep learning models in a structured way so that they are decoupled from the engineering and data, enabling researchers to conduct fast research.PyTorch Lightning, developed by Grid.AI, is now available as a container on the NGC catalog, NVIDIA’s hub of GPU-optimized AI and HPC software. Pytorch Lightning was designed to remove the roadblocks in deep learning research and allows researchers to focus on science. Lightning is more of a style guide than a framework, enabling you to structure and organize your code while providing utilities for common functions. With PyTorch Lightning, you can scale your models to multiple GPUs and leverage state-of-the-art training features such as 16-bit precision, early stopping, logging, pruning and quantization, while enabling faster iteration and reproducibility.A Lightning model is composed of the following:The Lightning objects are implemented as hooks that can be overridden, making every single aspect of deep learning training highly configurable. With Lightning, you have full control over every detail:Get started today with NGC PyTorch Lightning Docker Container from the NGC catalog.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Deep learning research requires working at scale. Training on massive data sets or multilayered deep networks is computationally intensive and can take an impractically long time as deep learning models are bound by memory. The key here is to compose the deep learning models in a structured way so that they are decoupled from the engineering and data, enabling researchers to conduct fast research.PyTorch Lightning, developed by Grid.AI, is now available as a container on the NGC catalog, NVIDIA’s hub of GPU-optimized AI and HPC software. Pytorch Lightning was designed to remove the roadblocks in deep learning research and allows researchers to focus on science. Lightning is more of a style guide than a framework, enabling you to structure and organize your code while providing utilities for common functions. With PyTorch Lightning, you can scale your models to multiple GPUs and leverage state-of-the-art training features such as 16-bit precision, early stopping, logging, pruning and quantization, while enabling faster iteration and reproducibility.A Lightning model is composed of the following:The Lightning objects are implemented as hooks that can be overridden, making every single aspect of deep learning training highly configurable. With Lightning, you have full control over every detail:Get started today with NGC PyTorch Lightning Docker Container from the NGC catalog.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.42	Edge computing has been around for a long time, but has recently become a hot topic because of the convergence of three major trends – IoT, 5G, and AI. IoT devices are becoming smarter and more capable, increasing the breadth of applications that can be deployed on them and the environments they can be deployed in. Simultaneously, recent advancements in 5G capabilities give confidence that this technology will soon be able to connect IoT devices wirelessly anywhere they are deployed. In fact, analysts predict that there will be over 1 billion 5G connected devices by 2023. Lastly, AI successfully moved from research projects into practical applications, changing the landscape for retailers, factories, hospitals, and many more. So what does the convergence of these trends mean? An explosion in the number of IoT devices deployed. Experts estimate there are over 30 billion IoT devices installed today, and Arm predicts that by 2035, there will be over 1 trillion devices. With that many IoT devices deployed, the amount of data collected skyrocketed, putting strain on current cloud infrastructures. Organizations soon found themselves in a position where the AI applications they deployed needed large amounts of data to generate compelling insights, but the latency for their cloud infrastructure to process data and send insights back to the edge were unsustainable. So they turned to edge computing. By putting the processing power at the location that sensors are collecting data, organizations reduce the latency for applications to deliver insights. For some situations, such as autonomous machines at factories, the latency reduction represents a critical safety component. That is where NVIDIA comes in. The NVIDIA Edge AI solution offers a complete end-to-end AI platform for deploying AI at the edge. It starts with NVIDIA-Certified Systems. NVIDIA-Certified Systems combine the computing power of NVIDIA GPUs with secure high-bandwidth, low-latency networking solutions from NVIDIA. Validated for performance, functionality, scalability, and security – IT teams ensure AI workloads deployed from the NGC catalog, NVIDIA’s GPU-optimized hub of HPC and AI software, run at full performance. These servers are backed by enterprise-grade support, including direct access to NVIDIA experts, minimizing system downtime and maximizing user productivity. To build and accelerate applications running on NVIDIA-Certified Systems, NVIDIA offers an extensive toolkit of SDKs, application frameworks, and other tools designed to help developers build AI applications for every industry. These include pretrained models, training scripts, optimized framework containers, inference engines, and more. With these tools, organizations get a head start on building unique AI applications regardless of workload or industry. Once organizations have the hardware to accelerate AI and an AI application to deploy, the next step is to ensure that there is infrastructure in place to manage and scale the application. Without a platform to manage AI at the edge, organizations face the difficult and costly task of manually updating systems at edge locations every time a new software update is released. NVIDIA Fleet Command is a cloud service that securely deploys, manages, and scales AI applications across distributed edge infrastructure. Purpose-built for AI, Fleet Command is a turnkey solution for AI lifecycle management, offering streamlined deployments, layered security, and detailed monitoring capabilities — so organizations can go from zero to AI in minutes.The complete edge AI solution gives organizations the tools needed to build an end-to-end edge deployment. KION Group, the number one global supply chain solutions provider, uses NVIDIA solutions to fulfill order faster and more efficiently. To learn more about NVIDIA edge AI solutions, check out Deploying and Accelerating AI at the Edge With the NVIDIA EGX Platform.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Edge computing has been around for a long time, but has recently become a hot topic because of the convergence of three major trends – IoT, 5G, and AI. IoT devices are becoming smarter and more capable, increasing the breadth of applications that can be deployed on them and the environments they can be deployed in. Simultaneously, recent advancements in 5G capabilities give confidence that this technology will soon be able to connect IoT devices wirelessly anywhere they are deployed. In fact, analysts predict that there will be over 1 billion 5G connected devices by 2023. Lastly, AI successfully moved from research projects into practical applications, changing the landscape for retailers, factories, hospitals, and many more. So what does the convergence of these trends mean? An explosion in the number of IoT devices deployed. Experts estimate there are over 30 billion IoT devices installed today, and Arm predicts that by 2035, there will be over 1 trillion devices. With that many IoT devices deployed, the amount of data collected skyrocketed, putting strain on current cloud infrastructures. Organizations soon found themselves in a position where the AI applications they deployed needed large amounts of data to generate compelling insights, but the latency for their cloud infrastructure to process data and send insights back to the edge were unsustainable. So they turned to edge computing. By putting the processing power at the location that sensors are collecting data, organizations reduce the latency for applications to deliver insights. For some situations, such as autonomous machines at factories, the latency reduction represents a critical safety component. That is where NVIDIA comes in. The NVIDIA Edge AI solution offers a complete end-to-end AI platform for deploying AI at the edge. It starts with NVIDIA-Certified Systems. NVIDIA-Certified Systems combine the computing power of NVIDIA GPUs with secure high-bandwidth, low-latency networking solutions from NVIDIA. Validated for performance, functionality, scalability, and security – IT teams ensure AI workloads deployed from the NGC catalog, NVIDIA’s GPU-optimized hub of HPC and AI software, run at full performance. These servers are backed by enterprise-grade support, including direct access to NVIDIA experts, minimizing system downtime and maximizing user productivity. To build and accelerate applications running on NVIDIA-Certified Systems, NVIDIA offers an extensive toolkit of SDKs, application frameworks, and other tools designed to help developers build AI applications for every industry. These include pretrained models, training scripts, optimized framework containers, inference engines, and more. With these tools, organizations get a head start on building unique AI applications regardless of workload or industry. Once organizations have the hardware to accelerate AI and an AI application to deploy, the next step is to ensure that there is infrastructure in place to manage and scale the application. Without a platform to manage AI at the edge, organizations face the difficult and costly task of manually updating systems at edge locations every time a new software update is released. NVIDIA Fleet Command is a cloud service that securely deploys, manages, and scales AI applications across distributed edge infrastructure. Purpose-built for AI, Fleet Command is a turnkey solution for AI lifecycle management, offering streamlined deployments, layered security, and detailed monitoring capabilities — so organizations can go from zero to AI in minutes.The complete edge AI solution gives organizations the tools needed to build an end-to-end edge deployment. KION Group, the number one global supply chain solutions provider, uses NVIDIA solutions to fulfill order faster and more efficiently. To learn more about NVIDIA edge AI solutions, check out Deploying and Accelerating AI at the Edge With the NVIDIA EGX Platform.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.43	The NGC team is hosting a webinar and live Q&A. Topics include how to use containers from the NGC catalog deployed from Google Cloud Marketplace to GKE, a managed Kubernetes service on Google Cloud, that easily builds, deploys, and runs AI solutions.Building a Computer Vision Service Using NVIDIA NGC and Google CloudAugust 25 at 10 a.m. PTOrganizations are using computer vision to improve the product experience, increase production, and drive operational efficiencies. But, building a solution requires large amounts of labeled data, the software and hardware infrastructure to train AI models, and the tools to run real-time inference that will scale with demand.With one click, NGC containers for AI can be deployed from Google Cloud Marketplace to GKE. This managed Kubernetes service on Google Cloud, makes it easy for enterprises to build, deploy, and run their AI solutions.By joining this webinar, you will learn:Register now >>> Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The first post in this series covered how to train a 2D pose estimation model using an open-source COCO dataset with the BodyPoseNet app in the NVIDIA Transfer Learning Toolkit.In this post, you learn how to optimize the pose estimation model in the NVIDIA Transfer Learning Toolkit. It walks you through the steps of model pruning and INT8 quantization to optimize the model for inference.This section covers few topics of model optimization and export:BodyPoseNet supports model pruning to remove unnecessary connections, reducing the number of parameters by an order of magnitude. This results in an optimized model architecture.To prune the model, use the following command:Usually, you just have to adjust -pth (threshold) for accuracy and model size trade off. For some internal studies, we’ve noticed that a pth value between the range [0.05, 3.0] is a good starting point for BodyPoseNet models.After the model has been pruned, there might be a slight decrease in accuracy because some previously useful weights may have been removed. To regain the accuracy, we recommend retraining this pruned model over the same dataset. You can follow the same instructions as in the Train experiment configuration file section. The main change is now to specify pretrained_weights as the path to pruned model and enable load_graph. Because the model is being initialized with pruned model weights, the model converges faster.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the pruned model. After retraining the pruned model with pth 0.05, you can observe an accuracy of 56.1% AP with multiscale inference. Here are the metrics on COCO validation set:Inference throughput and how quickly you can create an efficient model are two key metrics for deploying deep learning applications because they directly affect the time to market and the cost of deployment. TLT includes an export command to export and prepare TLT models for deployment.The model is exported as a .etlt (encrypted TLT) file. The file is consumable by the TLT CV Inference, which decrypts the model and converts it to a TensorRT engine. Exporting the model decouples the training process from inference and allows conversion to TensorRT engines outside the TLT environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment. The following code example shows the export of the pruned, retrained model.The export command can optionally generate the calibration cache for running inference at INT8 precision. This is described more in detail in later sections.The BodyPoseNet model supports int8 inference mode in TensorRT. To do this, the model is first calibrated to run 8-bit inferences. To calibrate the model, you need a directory with a sampled set of images to be used for calibration.We’ve provided a helper script that parses the annotations and samples the required number of images at random based on specified criteria like number of people in the image, number of keypoints per person, and so on.The following command exports the pruned, retrained model to the .etlt format, performs INT8 calibration, and generates the INT8 calibration cache and TensorRT engine for the current hardware.Make sure that the directory mentioned in --cal_image_dir has at least (batch_size * batches) number of images in it. To generate a F16 engine for the current hardware, specify --data_type as FP16. For more information about the parameters used here, see the INT8 model overview.This evaluation is mainly used as a sanity check for the exported TRT (INT8/FP16) models. This doesn’t reflect the true accuracy of the model as the input aspect ratio here can vary a lot from the aspect ratio of the images in the validation set. The set has a collection of images with various resolutions. Here, you retain a strict input resolution and pad the image to retrain the aspect ratio. So, the accuracy here might vary based on the aspect ratio and the network resolution that you choose.You can run the evaluation of the .tlt model in strict mode as well to compare with the accuracies of the INT8/FP16/FP32 models for any drop in accuracy. The FP16 and FP32 models should have no or minimal drop in accuracy when compared to the .tlt model in this step. The INT8 models would have similar accuracies (or comparable within 2-3% AP range) to the .tlt  model.You can follow similar instructions as in the Evaluation and Model verification sections to evaluate and verify the models. One change would be that you now use  $SPECS_DIR/infer_spec_retrained_strict.yaml as inference_spec and the model to use would be a pruned TLT model, INT8 engine, or FP16 engine.After the INT8/FP16/FP32 model is verified, you must reexport the model so it can be used to run on inference platforms like TLT CV Inference. You use the same guidelines as in the previous sections, but you must add the --sdk_compatible_model flag to the export command, which adds a few nontraininable post-process layers to the model to enable compatibility with the inference pipelines. Reuse the calibration tensorfile (cal_data_file) generated in the earlier step to keep it consistent, but you must regenerate the cal_cache_file and the .etlt model.In this section, we look at some best practices to improve model performance and accuracy.Network input resolution of the model is one of the major factors that determine the accuracy of bottom-up approaches. Bottom-up methods must feed the whole image at one time, resulting in a smaller resolution per person. Hence, higher input resolution yields better accuracy, especially on small- and medium-scale persons with regard to the image scale. However, with a higher input resolution, the runtime of the CNN also would be higher. So, the accuracy/runtime tradeoff should be determined by the accuracy and runtime requirements for the target use case.If your application involves pose estimation for one or more persons close to the camera such that the scale of the person is relatively large, then you could go with a smaller network input height. If you are targeting to use the network for persons with smaller relative scales, like crowded scenes, you might want to go with a higher network input height. After you freeze the height of the network, the width can be decided based on the aspect ratio for your input data used during deployment time.These are approximate runtimes and accuracies for the default architecture and spec used in the notebook. Any changes to the architecture or params yields different results. This is primarily to get a better sense of which resolution would suit your needs.You can expect to see a 7-10% AP increase in the area=medium category when going from 224×320 to 288×384 and an additional 7-10% AP when you choose 320×448. The accuracy for area=large remains almost the same across these resolutions, so you can stick to a lower resolution if this is what you need. As per the COCO keypoint evaluation, medium area is defined as persons occupying less than area between 36^2 to 96^2. Anything higher is categorized as large.We use a default size 288×384 in this post. To use a different resolution, you need the following changes:The height and width should be a multiple of 8, preferably a multiple of 16/32/64.Figure 1 shows that the model architecture includes refinement stages, where each stage refines the results of the previous stage. You can use the stages parameter under the model section to configure this. stages include both the initial prediction stage and the refinement stages. We recommend using a minimum of one refinement stage, and a maximum of six, which corresponds to stages within the range [2, 7].When you use more stages of refinement, it may help improve the accuracy but keep in mind that this would result in an increased inference time. We use a default of two refinement stages (stages=3) in this post, which is tuned for optimal performance and accuracy. For even faster performance, use stages=2.Pruning can help with a significant decrease in the number of parameters and maximize speed while preserving the accuracy or at the cost of some drop in accuracy. A higher pruning threshold gives you a smaller model and thus higher inference speed but might cause a drop in accuracy.The threshold to use depends on the dataset. If the retrain accuracy is good, you can increase this value to get smaller models. Otherwise, lower this value to get better accuracy. We recommend iterating with the prune-retrain cycle until you are satisfied with the accuracy-speed tradeoff. You can also use a higher L1 regularization weight when training the model before pruning. It would push more weights towards zero, making it easier to prune the network weights.In this section, we dive deeper into the model accuracy and performance, and compare it against the state of the art, and across platforms.We compare this approach against OpenPose as this method follows a similar single-shot bottom-up methodology. Figure 4 shows that you achieve a much better accuracy-performance tradeoff as compared to the OpenPose model. The accuracy is lower by ~8% AP whereas you achieve close to a 9x speedup for the model trained with the default parameters provided in this post.The following table shows the inference performance of the BodyPoseNet model trained with TLT by using the default parameters. We profiled the model inference with the trtexec command of TensorRT.In this post, you learned about optimizing body pose models using the BodyPoseNet app in TLT. The post showed taking an open-source COCO dataset with a pretrained backbone from NGC to train and optimize a model with TLT. For information regarding model deployment, see the TLT CV inference pipeline Quick Start Scripts and Deployment instructions.With this model, you can get up to 9x improvement in inference performance as compared to OpenPose, helping you achieve real-time performance even on embedded devices. Pruning plus INT8 precision gives you the highest inference performance on your edge devices.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.44	Texas Children’s Hospital, one of the top-ranked children’s hospitals in the United States, hosted its inaugural innovation-focused hackathon, themed “Hospital of the Future,” from May 14th – May 24th 2021. This hackathon attracted 500 participants, and sponsored by NVIDIA, Mark III, HPE, Google Cloud, T-Mobile, Microsoft, Unity, and H-E-B Digital.NVIDIA Inception member, IPMD won the Touchless Experience category of Texas Children’s Hospital Healthcare Hackathon with its Project M emotional AI solution. The Touchless Experience category aimed to transform physical touchpoints into touchless and seamless, but still personal and trustworthy digital healthcare experience. Participants were encouraged to show solutions that could integrate into the workspace to fit current and future needs of smart hospitals. Project M is an accurate emotional AI platform designed to detect human emotions based on hidden and micro facial expressions. This medical device utilizes machine learning solutions with custom-built CNN-based algorithms to detect human emotions on NVIDIA Clara Guardian. There are eight universal categories of human emotions (anger, contempt, disgust, fear, happy, neutral, sad, and surprise), and each of these emotions can be identified to have at least four different intensities, totaling approximately 400,000 different variable emotions. IPMD collected over 200,000 labeled inputs and reached over 95% overall ROC/AUC scores during the hackathon and won the category. They used NVIDIA Clara Guardian, a smart hospital application framework, which consists of CUDA-X software, such as TensorFlow, TensorRT,  cuDNN, and CUDA, to train and deploy many emotions and moods with high accuracy. It was trained on a 12 NVIDIA V100 GPUs for 500 hours per month, and inference was on NVIDIA V100 GPUs for 1,000 hours a month on AWS. You can give it a test spin here. “We’re excited to recognize IPMD and believe this project sparked a lot of great ideas amongst our panel of executive judges and added to the excitement for all of the possibilities new technologies bring to the hospital of the future.” said Melanie Lowther, Director of Entrepreneurship and Innovation for Texas Children’s Hospital.“As an NVIDIA NPN Elite Partner, Mark III was proud to partner with Texas Children’s around The Hospital of the Future hackathon,” said Andy Lin, VP Strategy and Innovation for Mark III Systems.  “We are huge advocates of the NVIDIA developer ecosystem and the NVIDIA Inception program and thrilled at what IPMD was able to put together around emotional AI, as healthcare moves closer to the reality of a smart hospital.”IPMD’s future plan is to register Project M as a Software as Medical Device under the US FDA. Soon they will be embedded into a mental telehealth platform to help physicians and mental health professionals better understand their patients’ emotional states to maximize treatment outcomes. Read more about Clara Guardian > Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Texas Children’s Hospital, one of the top-ranked children’s hospitals in the United States, hosted its inaugural innovation-focused hackathon, themed “Hospital of the Future,” from May 14th – May 24th 2021. This hackathon attracted 500 participants, and sponsored by NVIDIA, Mark III, HPE, Google Cloud, T-Mobile, Microsoft, Unity, and H-E-B Digital.NVIDIA Inception member, IPMD won the Touchless Experience category of Texas Children’s Hospital Healthcare Hackathon with its Project M emotional AI solution. The Touchless Experience category aimed to transform physical touchpoints into touchless and seamless, but still personal and trustworthy digital healthcare experience. Participants were encouraged to show solutions that could integrate into the workspace to fit current and future needs of smart hospitals. Project M is an accurate emotional AI platform designed to detect human emotions based on hidden and micro facial expressions. This medical device utilizes machine learning solutions with custom-built CNN-based algorithms to detect human emotions on NVIDIA Clara Guardian. There are eight universal categories of human emotions (anger, contempt, disgust, fear, happy, neutral, sad, and surprise), and each of these emotions can be identified to have at least four different intensities, totaling approximately 400,000 different variable emotions. IPMD collected over 200,000 labeled inputs and reached over 95% overall ROC/AUC scores during the hackathon and won the category. They used NVIDIA Clara Guardian, a smart hospital application framework, which consists of CUDA-X software, such as TensorFlow, TensorRT,  cuDNN, and CUDA, to train and deploy many emotions and moods with high accuracy. It was trained on a 12 NVIDIA V100 GPUs for 500 hours per month, and inference was on NVIDIA V100 GPUs for 1,000 hours a month on AWS. You can give it a test spin here. “We’re excited to recognize IPMD and believe this project sparked a lot of great ideas amongst our panel of executive judges and added to the excitement for all of the possibilities new technologies bring to the hospital of the future.” said Melanie Lowther, Director of Entrepreneurship and Innovation for Texas Children’s Hospital.“As an NVIDIA NPN Elite Partner, Mark III was proud to partner with Texas Children’s around The Hospital of the Future hackathon,” said Andy Lin, VP Strategy and Innovation for Mark III Systems.  “We are huge advocates of the NVIDIA developer ecosystem and the NVIDIA Inception program and thrilled at what IPMD was able to put together around emotional AI, as healthcare moves closer to the reality of a smart hospital.”IPMD’s future plan is to register Project M as a Software as Medical Device under the US FDA. Soon they will be embedded into a mental telehealth platform to help physicians and mental health professionals better understand their patients’ emotional states to maximize treatment outcomes. Read more about Clara Guardian > Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.45	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.When deploying a neural network, it’s useful to think about how the network could be made to run faster or take less space. A more efficient network can make better predictions in a limited time budget, react more quickly to unexpected input, or fit into constrained deployment environments.Sparsity is one optimization technique that holds the promise of meeting these goals. If there are zeros in the network, then you don’t need to store or operate on them. The benefits of sparsity only seem straightforward. There have long been three challenges to realizing the promised gains.In this post, we discuss how the NVIDIA Ampere Architecture addresses these challenges. Today, NVIDIA is releasing TensorRT version 8.0, which introduces support for the Sparse Tensor Cores available on the NVIDIA Ampere Architecture GPUs.TensorRT is an SDK for high-performance deep learning inference, which includes an optimizer and runtime that minimizes latency and maximizes throughput in production. Using a simple training workflow and deploying with TensorRT 8.0, Sparse Tensor Cores can eliminate unnecessary calculations in neural networks, resulting in over 30% performance/watt gain compared to dense networks.The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores.  Sparse Tensor Cores accelerate a 2:4 sparsity pattern. In each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained. There are no vector or block structures pruned together. Such a regular pattern is easy to compress and has a low metadata overhead (Figure 1).Sparse Tensor Cores accelerate this format by operating only on the nonzero values in the compressed matrix. They use the metadata that is stored with the nonzeros to pull only the necessary values from the other, uncompressed operand. So, for a sparsity of 2x, they can complete the same effective calculation in half the time. Table 1 shows details on the wide variety of data types supported by Sparse Tensor Cores.Of course, performance is pointless without good accuracy. We’ve developed a simple training workflow that can easily generate a 2:4 structured sparse network matching the accuracy of the dense network:This workflow uses one-shot pruning in Step 2. After the pruning stage, the sparsity pattern is fixed. There are many ways to make pruning decisions. Which weights should stay, and which should be forced to zero? We’ve found that a simple answer works well: weight magnitude. We prefer to prune values that are already close to zero. As you might expect, suddenly turning half of the weights in a network to zero can affect the network’s accuracy. Step 3 recovers that accuracy with enough weight update steps to let the weights converge and a high enough learning rate to let the weights move around sufficiently.  This recipe works incredibly well. Across a wide range of networks, it generates a sparse model that maintains the accuracy of the dense network from Step 1. Table 2 has a sample of FP16 accuracy results that we obtained using this workflow implemented in the PyTorch Library Automatic SParsity (ASP). For more information about the full results for both FP16 and INT8, see the Accelerating Sparse Deep Neural Networks whitepaper.Here’s how easy the workflow is to use with ResNeXt-101_32x8d as a target.You use the torchvision pretrained model, so step 1 is done already. Because you’re using ASP, the first code change is to import the library:Load the pretrained model for this training run. Instead of training the dense weights, though, prune the model and prepare the optimizer before the training loop (step 2 of the workflow):That’s it. The training loop proceeds as normal with the default command augmented to begin with the pretrained model, which reuses the original hyperparameters and optimizer settings for the retraining:When training completes (Step 3), the network accuracy should have recovered to match that of the pretrained model, as shown in Table 2. As usual, the best-performing checkpoint may not be from the final epoch.For inference, use TensorRT 8.0 to import the trained model’s sparse checkpoint. The model needs to be converted from the native framework format into the ONNX format before importing into TensorRT. Conversion can be done by following the notebooks in the quickstart/IntroNotebooks GitHub repo.We have already converted the sparse ResNeXt-101_32x8d to ONNX format. You can download this model from NGC. If you don’t have NGC installed, use the following command to install NGC:After NGC is installed, download sparse ResNeXt-101_32x8d in ONNX format by running the following command:To import the ONNX model into TensorRT, clone the TensorRT repo and set up the Docker environment, as mentioned in the NVIDIA/TensorRT readme.After you are in the TensorRT root directory, convert the sparse ONNX model to TensorRT engine using trtexec. Make a directory to store the model and engine:Copy the downloaded ResNext ONNX model to the /workspace/TensorRT/model directory and then execute the trtexec command as follows:A new file named resnext101_engine.trt is created at /workspace/TensorRT/model/. The resnext101_engine.trt file can now be serialized to perform inference by one of the following methods:Benchmarking this sparse model in TensorRT 8.0 on an A100 GPU at various batch sizes shows two important trends:Don’t forget, this network has the exact same accuracy as the dense baseline. This extra efficiency and performance doesn’t require penalizing accuracy.Sparsity is popular in neural network compression and simplification research. Until now, though, fine-grained sparsity has not delivered on its promise of performance andaccuracy. We developed 2:4 fine-grained structured sparsity and built support directly into NVIDIA Ampere Architecture Sparse Tensor Cores. With this simple, three-step sparse retraining workflow, you can generate sparse neural networks that match the baseline accuracy, and TensorRT 8.0 accelerates them by default.For more information, see the Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture GTC2021 session, all about accelerating sparsity in the NVIDIA Ampere Architecture, or read the Accelerating Sparse Deep Neural Networks whitepaper.Ready to jump in and try 2:4 sparsity on your own networks? The Automatic SParsity (ASP) PyTorch library makes it easy to generate a sparse network, and TensorRT 8.0 can deploy them efficiently.To learn more about TensorRT 8.0 and it’s new features, see the Accelerate Deep Learning Inference with TensorRT 8.0 GTC’21 session or the TensorRT page.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.When deploying a neural network, it’s useful to think about how the network could be made to run faster or take less space. A more efficient network can make better predictions in a limited time budget, react more quickly to unexpected input, or fit into constrained deployment environments.Sparsity is one optimization technique that holds the promise of meeting these goals. If there are zeros in the network, then you don’t need to store or operate on them. The benefits of sparsity only seem straightforward. There have long been three challenges to realizing the promised gains.In this post, we discuss how the NVIDIA Ampere Architecture addresses these challenges. Today, NVIDIA is releasing TensorRT version 8.0, which introduces support for the Sparse Tensor Cores available on the NVIDIA Ampere Architecture GPUs.TensorRT is an SDK for high-performance deep learning inference, which includes an optimizer and runtime that minimizes latency and maximizes throughput in production. Using a simple training workflow and deploying with TensorRT 8.0, Sparse Tensor Cores can eliminate unnecessary calculations in neural networks, resulting in over 30% performance/watt gain compared to dense networks.The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores.  Sparse Tensor Cores accelerate a 2:4 sparsity pattern. In each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained. There are no vector or block structures pruned together. Such a regular pattern is easy to compress and has a low metadata overhead (Figure 1).Sparse Tensor Cores accelerate this format by operating only on the nonzero values in the compressed matrix. They use the metadata that is stored with the nonzeros to pull only the necessary values from the other, uncompressed operand. So, for a sparsity of 2x, they can complete the same effective calculation in half the time. Table 1 shows details on the wide variety of data types supported by Sparse Tensor Cores.Of course, performance is pointless without good accuracy. We’ve developed a simple training workflow that can easily generate a 2:4 structured sparse network matching the accuracy of the dense network:This workflow uses one-shot pruning in Step 2. After the pruning stage, the sparsity pattern is fixed. There are many ways to make pruning decisions. Which weights should stay, and which should be forced to zero? We’ve found that a simple answer works well: weight magnitude. We prefer to prune values that are already close to zero. As you might expect, suddenly turning half of the weights in a network to zero can affect the network’s accuracy. Step 3 recovers that accuracy with enough weight update steps to let the weights converge and a high enough learning rate to let the weights move around sufficiently.  This recipe works incredibly well. Across a wide range of networks, it generates a sparse model that maintains the accuracy of the dense network from Step 1. Table 2 has a sample of FP16 accuracy results that we obtained using this workflow implemented in the PyTorch Library Automatic SParsity (ASP). For more information about the full results for both FP16 and INT8, see the Accelerating Sparse Deep Neural Networks whitepaper.Here’s how easy the workflow is to use with ResNeXt-101_32x8d as a target.You use the torchvision pretrained model, so step 1 is done already. Because you’re using ASP, the first code change is to import the library:Load the pretrained model for this training run. Instead of training the dense weights, though, prune the model and prepare the optimizer before the training loop (step 2 of the workflow):That’s it. The training loop proceeds as normal with the default command augmented to begin with the pretrained model, which reuses the original hyperparameters and optimizer settings for the retraining:When training completes (Step 3), the network accuracy should have recovered to match that of the pretrained model, as shown in Table 2. As usual, the best-performing checkpoint may not be from the final epoch.For inference, use TensorRT 8.0 to import the trained model’s sparse checkpoint. The model needs to be converted from the native framework format into the ONNX format before importing into TensorRT. Conversion can be done by following the notebooks in the quickstart/IntroNotebooks GitHub repo.We have already converted the sparse ResNeXt-101_32x8d to ONNX format. You can download this model from NGC. If you don’t have NGC installed, use the following command to install NGC:After NGC is installed, download sparse ResNeXt-101_32x8d in ONNX format by running the following command:To import the ONNX model into TensorRT, clone the TensorRT repo and set up the Docker environment, as mentioned in the NVIDIA/TensorRT readme.After you are in the TensorRT root directory, convert the sparse ONNX model to TensorRT engine using trtexec. Make a directory to store the model and engine:Copy the downloaded ResNext ONNX model to the /workspace/TensorRT/model directory and then execute the trtexec command as follows:A new file named resnext101_engine.trt is created at /workspace/TensorRT/model/. The resnext101_engine.trt file can now be serialized to perform inference by one of the following methods:Benchmarking this sparse model in TensorRT 8.0 on an A100 GPU at various batch sizes shows two important trends:Don’t forget, this network has the exact same accuracy as the dense baseline. This extra efficiency and performance doesn’t require penalizing accuracy.Sparsity is popular in neural network compression and simplification research. Until now, though, fine-grained sparsity has not delivered on its promise of performance andaccuracy. We developed 2:4 fine-grained structured sparsity and built support directly into NVIDIA Ampere Architecture Sparse Tensor Cores. With this simple, three-step sparse retraining workflow, you can generate sparse neural networks that match the baseline accuracy, and TensorRT 8.0 accelerates them by default.For more information, see the Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture GTC2021 session, all about accelerating sparsity in the NVIDIA Ampere Architecture, or read the Accelerating Sparse Deep Neural Networks whitepaper.Ready to jump in and try 2:4 sparsity on your own networks? The Automatic SParsity (ASP) PyTorch library makes it easy to generate a sparse network, and TensorRT 8.0 can deploy them efficiently.To learn more about TensorRT 8.0 and it’s new features, see the Accelerate Deep Learning Inference with TensorRT 8.0 GTC’21 session or the TensorRT page.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.46	Today NVIDIA announced the availability of the NVIDIA Arm HPC Developer Kit with the NVIDIA HPC SDK version 21.7.  The DevKit is an integrated hardware-software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications for Arm server based accelerated platforms. The HPC SDK v21.7 is the latest update of the software development kit, and fully supports the new Arm HPC DevKit.This DevKit targets heterogeneous GPU/CPU system development, and includes an Arm CPU, two NVIDIA A100 Tensor Core GPUs, two NVIDIA BlueField-2 data processing units (DPUs), and the NVIDIA HPC SDK suite of tools. The integrated HW/SW DevKit delivers:The NVIDIA Arm HPC Developer Kit is based on the GIGABYTE G242-P32 2U server, and leverages the NVIDIA HPC SDK, a comprehensive suite of compilers, libraries, and tools for HPC delivering performance, portability, and productivity. The platform will support Ubuntu, SLES, and RHEL operating systems. HPC SDK 21.7 includes:Previously HPC SDK 21.5 introduced support for:The NVIDIA HPC SDK C++ and Fortran compilers are the first compilers to support automatic GPU acceleration of standard language constructs including C++17 parallel algorithms and Fortran intrinsics.Download the HPC SDK v21.7 for free today.Contact our partner GIGABYTE about the hardware pricing and availability via the link on our DevKit web page.Learn more about NVIDIA and Arm support:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	A team of scientists from Argonne National Laboratory developed a new method for turning X-ray data into visible, 3D images with the help of AI. The study, published in Applied Physics Reviews, develops a computational framework capable of taking data from the lab’s Advanced Photon Source (APS) and creating 3D visualizations hundreds of times faster than traditional methods. “In order to make full use of what the upgraded APS will be capable of, we have to reinvent data analytics. Our current methods are not enough to keep up. Machine learning can make full use and go beyond what is currently possible,” Mathew Cherukara, a computational scientist at Argonne and study coauthor, said in a press release. The advancement could have wide-ranging benefits to many areas of study relying on sizable amounts of 3D data, ranging from astronomy to nanoscale imaging.Described as one of the most technologically complex machines in the world, the APS uses extremely bright X-ray beams to help researchers see the structure of materials at the molecular and atomic level. As these beams of light bounce off an object, detectors collect them in the form of data. With time and complex computations, this data is converted into images, revealing the object’s structure.However, detectors are unable to capture all the beam data, leaving missing pieces of information. The researchers fill this gap by using neural networks that train computer models to identify objects and visualize an image, based on the raw data it is fed. With 3D images this can be extremely timely due to the amount of information processed.“We used computer simulations to create crystals of different shapes and sizes, and we converted them into images and diffraction patterns for the neural network to learn. The ease of quickly generating many realistic crystals for training is the benefit of simulations,” said Henry Chan, an Argonne postdoctoral researcher, and study coauthor.  The work for the new computational framework, known as 3D-CDI-NN, was developed using GPU resources at Argonne’s Joint Laboratory for System Evaluation, consisting of NVIDIA A100 and RTX 8000 GPUs.“This paper… greatly facilitates the imaging process. We want to know what a material is, and how it changes over time, and this will help us make better pictures of it as we make measurements,” said Stephan Hruszkewycz, study coauthor and physicist with Argonne’s Materials Science Division.47	The NVIDIA NGC catalog is a hub of GPU-optimized deep learning, machine learning and HPC applications. With highly performant software containers, pre-trained models, industry specific SDKs and Helm charts you can simplify and accelerate your end-to-end workflows. The NVIDIA NGC team works closely with our internal and external partners to update the content in the catalog on a regular basis. Below are some of the highlights: NVIDIA Maxine is a GPU-accelerated SDK with state-of-the-art AI features for developers to build virtual collaboration and content creation solutions, including video conferencing and streaming applications. You can add any of Maxine’s AI effects – Video, Audio, and Augmented Reality – into your existing application or develop a new pipeline from scratch.Maxine’s Video Effects SDK and Audio Effects SDK are now available through the Maxine collection on the NGC catalog that includes a container for each SDK:Clara Train v4.0 is now powered by MONAI, a domain-specialized open-source PyTorch framework, accelerating deep learning in Healthcare imaging. The latest version also expands into Digital Pathology and introduces homomorphic encryption for server side aggregation in federated learning.The NVIDIA Transfer Learning Toolkit (TLT) is the AI toolkit that abstracts away the AI/DL framework complexity and leverages high quality pre-trained models to enable you to build production quality models faster with only a fraction of data required. Version 3.0 of TLT is now available for computer vision and conversational AI use cases. Get started today by exploring the TLT collections for: Our most popular deep learning frameworks for training and inference have also been updated to the latest 21.02 versionHave a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Despite substantial progress in natural language processing (NLP) research over the last two years and its commercial success, little effort has been devoted to adapting this capability to other significant languages, such as Hindi, Arabic, Portuguese, or Spanish. Obviously, catering to the entire human population with more than 6,500 languages is challenging. At the same time, supporting just 40 languages addresses the NLP needs of more than 60% of the human population. Figure 2 shows that, even across the most frequently used languages, the performance of language models varies tremendously. Bear in mind that this comparison is not perfect, as those languages do have different language entropy. More importantly, the research on most capable large-scale language models seems to be limited to only a handful of high resource languages (languages with a high number of documents available publicly), such as English or Chinese.  The situation is even more complex when you account for domain-specific languages (such as medical, technical, or legal jargon), where besides English only a few high-quality models exist. This is regretful as those domain-specific language models are currently transforming the way that clinicians, engineers, researchers or other experts access information. Unfortunately, there is a limited number of equivalent models outside of English. Fortunately, replicating the success of English-language models across other languages is no longer a research task but predominantly an engineering activity. It no longer requires inventing new models and training approaches, but instead systematic and iterative dataset engineering, model training, and its continuous validation. This does not mean that engineering those models is trivial. Because of the model and dataset sizes used in modern NLP, the training process requires a substantial amount of computing power. Secondly, to use large models, you must collect large textual datasets. Thirdly, because of the shared size of models used, new approaches to training and inference are required.NVIDIA has extensive experience not only in building large-scale language models (ranging from 1 billion to 175 billion parameters) but also in deploying them to production. The goal of this post is to share our knowledge around project organization, infrastructure requirements, and budgeting and to support projects in this area.As hypothesized in Deep Learning Scaling is Predictable, Empirically, the NLP model performance seems to follow a power law with respect to both the model size and the volume of data used for training. As you make models and datasets bigger, the performance continues to improve. The following diagram from Scaling Laws for Neural Language Models demonstrates not only that this relationship holds but more importantly, it holds across nine orders of magnitude of compute.In the NLP scaling law, despite the models at the far right reaching as much as 175 billion parameters (more than 500 times larger than BERT Large), this relationship does not show signs of stopping. This suggests that even further improvement can be expected from larger models. Indeed, switch transformers when scaled to 1.6 trillion parameters (roughly 5000x larger than BERT Large) continue to demonstrate the previously mentioned behavior. More importantly, the large NLP models seem to generate much more robust features capable of solving complex problems even without large-scale fine-tuning datasets. Figure 4 shows this capability across three orders of magnitude of models.Due to this capability, and despite the relatively high cost of their development, large NLP models are likely to not only continue to dominate the NLP processing landscape but also continue to grow, at least by another order of magnitude approaching trillions of parameters.This relationship between model size, dataset size, and model performance is not unique to NLP. I see the same behavior of automatic speech recognition and computer vision models, and across many other disciplines that are a backbone of conversational AI.At the same time, a limited amount of work was devoted to the development of both large-scale datasets and models for other languages. Indeed, the majority of the work that focuses on languages other than English take advantage of smaller models and less curated datasets. For example, NLP using subset from general datasets such as raw Common Crawl).Even less effort is devoted to supporting any of the following:The current status creates an opportunity for local companies willing to invest in model training to lead the development of NLP technologies in the region.Building large-scale language models is not trivial for many reasons. First, the large-scale datasets are not trivial to curate. In the raw format, they are actually quite easy to obtain. Second, the infrastructure required to train these huge models requires substantial systems knowledge to set up. Finally, they require extensive research expertise to train and optimize. What is less widely understood is that training such large models requires software engineering effort. Most interesting models are larger than the memory capacity of not only individual GPUs but also of many multi-GPU servers. The number of mathematical operations required to train them can also make training times unmanageable, measured in months even on fairly sizable systems. Approaches such as model and pipeline parallelism overcome some of those challenges. However, applying them in a naive way could lead to scaling issues, exacerbating an already long training time. Together with organizations such as Microsoft and Stanford University, NVIDIA has worked towards developing tools that streamline the development process of the largest language models and provide computational efficiency and scalability to allow cost-effective training. As a consequence, a wide range of tools abstracting the complexity of large model development are now available, including the following:As a result of those efforts, I’ve seen a substantial reduction in training times of large models. Indeed, the GPT-3 model with 175 billion parameters trained on 300 billion tokens using 1024 NVIDIA A100 Tensor Core GPUs can be trained today in 34 days (as shown in Efficient Large-Scale Language Model Training on GPU Clusters). Based on experimentation, NVIDIA estimates that 1 trillion parameter models can be trained in approximately 84 days with 3072 A100 GPUs. Despite the training cost of those models being high, it is not beyond reach of most large organizations. With further software advances, it is likely to reduce further.Because the development of large language models requires scalable infrastructure, NVIDIA has also consolidated knowledge from building the internal Selene cluster (used for NLP internal research and to deliver record-breaking performance in MLPerf training and inference benchmarks) into a fully packaged product called NVIDIA DGX SuperPOD. This cluster is more than just a system reference design. In fact, it can be bought in its entirety together with software and support of NVIDIA data scientists and applied researchers, similar to the NLP-focused deployment by Naver Clova). Such an approach has already had a substantial impact on the NLP landscape, as it enables organizations with extensive NLP expertise to scale out their efforts fast. More importantly, it enables organizations with limited systems, HPC, or large-scale NLP workload expertise to start iterating in weeks, rather than months or years. The ability to build large language models is just an academic achievement when it’s impossible to take advantage of the results of your work by deploying your models to production. The challenge of deploying models such as GPT-3 correlates to their sheer size, which exceeds the memory capacity of a GPU, and computational complexity. Both are factors that contribute to decreased throughput and high latency of inference. This is a widely understood problem and a range of tools and solutions currently exist to make serving the largest language models simple and cost-effective. NVIDIA Triton Inference Server is an open-source cloud and edge inferencing solution optimized for both CPUs and GPUs. It can be used to host distributed models effectively. To deploy a large model using pipeline parallelism, the model must be split into several parts, for example, manipulating the ONNX graph with tools such as ONNX Graph Surgeon. Each of the parts must be small enough to fit into the memory space of a single GPU.After the model is subdivided, it can be distributed across multiple GPUs without the need to develop any code. You create an NVIDIA Triton YAML configuration file defining how individual parts of the model should be connected.The traffic between individual model parts and their load balancing can be managed automatically by Triton Inference Server. The communication overheads are also kept to a minimum as Triton takes advantage of the latest NVIDIA NVSwitch and third-generation NVIDIA NVLink technology, providing 600 GB/sec GPU-to-GPU direct bandwidth, which is 10X  higher than PCIe gen4. This means that you can efficiently deploy not only medium-scale models of multi billions of parameters but even the largest of models, including GPT-3 with trillions of parameters.For more information, see Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime (GTC21 session).Beyond the ability to host trained large models, it is important to look at optimization techniques. Such techniques can reduce the memory footprint of the models, through quantization and pruning; substantially accelerate the execution; and reduce latency by optimizing memory access, taking advantage of TensorCores or sparsity acceleration.Utilities such as TensorRT provide a wide range of optimized kernels for execution of transformer-based architectures. They can automatically do half precision (FP16) or, in certain cases, INT8 quantization. TensorRT also supports quantization-aware training and provides early support for hardware-accelerated sparsity.The NVIDIA FasterTransformer library specializes in the inference of the transformer neural networks and can be used with models such as BERT or GPT-2/3. This library includes a tensor-parallel inference backend that provides the ability to do the inference of the huge GPT-3 models in parallel on multiple GPUs within the DGX A100 system. This enables you to reduce inference latency by as much as 1.2–3x, depending on model size. With FasterTransformer, you can deploy the largest of Megatron Models with a single line of code.The Microsoft DeepSpeed library has a number of features focused on inference, including support for Mixture-of-Quantization (MoQ), high-performance INT8 kernels, or DeepFusion.Thanks to all of those advances, large language models are no longer limited to academic research as they are making headway into commercial AI-based products.Correct sizing of the challenge is critical for the success of your NLP initiative. The amount of engineering and research staff needed as well as training and inference infrastructure significantly affects your business case. The following factors have significant impact on the overall cost of the development: After the fundamental business questions are addressed, it is possible to estimate the effort and compute required for their development. When you have an understanding of how good your model must be to allow for the product or service, it is possible to estimate the model size needed. The relationship between the performance of language models and the amount of data and model size is widely understood (Figure 9).After you understand the size of the model and dataset that you need, you can estimate the amount of infrastructure required and training time. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters. Furthermore, the scaling of large language models is superlinear, meaning that the training performance does not degrade with the increasing model size but actually increases (Figure 10).Here are the key factors to consider for initial infrastructure sizing:Large language models have appealing properties and will help expand the availability of NLP around the globe. They more performant across a wide number of NLP tasks, but they are also much more sample efficient. They are what are known as few-shot learners and in certain ways are easier to design, as their exact hyperparameter configuration seems unimportant in comparison to their size.  As a consequence, the NLP models are likely to continue to grow. I see empirical evidence justifying at least one if not two orders of magnitude of growth. Fortunately, the technology to build and deploy them to production has matured considerably. The software required to train them has also matured considerably and is broadly available, such as the NVIDIA open source Megatron-based implementation of GPT-3. Quality is continuing to improve, driving down the training times. The infrastructure required to train models in this space is also well understood and commercially available (DGX SuperPOD). It is now possible to deploy the largest of NLP models to production using tools such as Triton Inference Server.  As a consequence, big NLP models are in reach of everyone with the will to pursue them.NVIDIA actively supports customers in the scoping and delivery of large training and inference systems, as well as supporting them in establishing NLP training capability. If you are working towards building your NLP capability, reach out to your local NVIDIA account team. You can also join one of our Deep Learning Institute NLP classes. During the course, you learn how to work with modern NLP models, optimise them with TensorRT, and deploy for cost-effective production with Triton Inference Server. For more information, see any of the following NLP-related GTC presentations:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.48	As an undergraduate student excited about AI for healthcare applications, I was thrilled to be joining the NVIDIA Clara Deploy team for an internship. It was the perfect combination: the opportunity to work at a leading technology company enabling the acceleration and adoption of AI while contributing to a team building the future (and the present!) of AI deployment for healthcare. The next few months were filled with learning from brilliant yet humble colleagues, picking up new skills like CUDA programming, and the opportunity to focus on unique technical challenges posed by histopathology data.The Clara Deploy SDK is a container-based, cloud-native development and deployment framework for multi-AI and multidomain workflows in smart hospitals. It enables you to define container-based pipelines consisting of multiple stages, each stage defined by an operator. A pipeline consists of multiple operators and is a directed acyclic graph (DAG) from the data source to the data sink. Each operator is a step of the pipeline, such as loading input, preprocessing, AI inference, and so on.As I explored setting up the NVIDIA Clara Deploy platform and running AI inference pipelines, I gained firsthand experience in the challenges of deploying AI workflows, particularly in standardizing workflows and scaling up execution. While running digital pathology pipelines, I gained awareness of the performance bottleneck of I/O and preprocessing steps that are usually not GPU-accelerated. This influenced my choice to focus on accelerating preprocessing filters for digital pathology during my internship.cuCIM is a RAPIDS library for accelerated n-dimensional image processing and image I/O, with a focus on medical imaging applications. cuCIM consists of I/O, file system, and operation modules. Operations in cuCIM can be extended using a plug-in architecture. cuCIM is uniquely positioned to be a leading library for medical image-processing applications, and I am excited to have gained exposure to and contributed to it during my time at NVIDIA.A significant challenge in the digitization of histopathology analysis is the stain variation observed in pathology images. These images can have large variations in staining caused by multiple factors, including stain vendors, storage conditions, staining protocols, digital scanners, and so on.Given the range of factors, it is impractical to control for staining variation during image acquisition. Instead, an image preprocessing step called stain normalization is often used to algorithmically standardize image staining. A stain normalization filter accepts as input a source image and a target image. The source image is to be stain normalized, and the target image contains the ideal stain, to be transferred to the source image. Ultimately, a normalized source image is returned as output.Prior work has shown that stain normalization used as a preprocessing step in digital pathology AI pipelines can shorten training time, improve accuracy, and enable data from different sources to be used together. Because you are operating in a relatively small data regime due to the scarcity of stained pathology images, stain normalization enables you to optimize the signal obtained amidst noisy stain variations.However, prior implementations of stain normalization were relatively slow as they were not GPU-accelerated. There was an opportunity to implement a GPU-accelerated stain normalization algorithm and enable fast and effective preprocessing for digital pathology AI pipelines.Stain normalization methods fall into three broad categories:For more information, see Stain Color Adaptive Normalization (SCAN) algorithm: Separation and standardization of histological stains in digital pathology.I chose to focus on stain deconvolution-based methods, as prior literature showed greater performance compared to global color normalization and better theoretical guarantees regarding the maintenance of biological structure integrity compared to deep network-based methods.Stain deconvolution-based methods assume that each image is characterized by a stain matrix, which contains the red, green, blue (RGB) values for each of the two stains in H&E stained images: hematoxylin and eosin.Using the Beer-Lambert law, an RGB image is transformed into an optical density image. Then, the optical density image may be related to the product of a pixel concentration matrix and the stain matrix for that image. The pixel concentration matrix indicates the concentration of each stain for each pixel. If the stain matrix is estimated, done here with the Macenko method, then the concentration matrix may be obtained.Finally, for stain normalization, the stain matrix of a source image is replaced with the stain matrix of a target image. This serves the purpose of transferring the stain profile from the target image to the source image. Because the concentration matrix of the source image is unchanged, the morphology of the biological structures is maintained. The Macenko method for estimating the stain matrix is an unsupervised method using the singular value decomposition.I designed and implemented a filter for the Macenko method for stain normalization in CuPy, after modifying an existing version in NumPy. Next, I compared the performance of the two.Figure 3 shows the relative performance of the NumPy and CuPy implementations of stain normalization for different image sizes, using an NVIDIA DGX-1. Performance for the CuPy implementation is plotted in terms of acceleration factor relative to the NumPy implementation.Given the goal of enabling GPU-accelerated stain normalization to be used as a preprocessing step for digital pathology pipelines, I began the integration of this filter as a transform (array-based and dictionary-based) into MONAI. MONAI is an open-source, PyTorch-based framework for deep learning in medical imaging. After being fully integrated, the stain normalization transform can be added to pathology pipelines in Clara Train or MONAI.Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, which is a commonly used function available in scikit-image and the cuCIM Python layer, among other libraries. Color space conversion from RGB to HED is closely related to stain normalization, as this function involves obtaining stain concentration values, assuming that the stain vectors are a constant, precalculated approximation. This ignores variations between the staining of different images. This function is to be integrated into cuCIM through a C++ based operator plugin mechanism.I compared the performance of a pure C++ implementation and the CUDA C++ implementation. Figure 4 shows the relative performance of the two versions, for different image sizes, using an NVIDIA GV100 GPU and Intel(R) Core(TM) i7-7800X CPU. Performance for the CUDA C++ implementation is plotted in terms of acceleration factor relative to the pure C++ implementation.It’s important to note that the performance gains do not account for any transfer of data to and from the GPU. I did this because I am considering the common scenario where data transfers are minimized by remaining on the GPU for several subsequent operations in an image processing workflow, with transfer back to the host occurring only at the end.In summary, my internship project was focused on accelerating color conversion filters for digital pathology. Specifically, I worked on designing and implementing the Macenko stain normalization method, using CuPy for GPU-acceleration. I began the integration of this into MONAI as a transform, for future use as a preprocessing step for digital pathology pipelines. Next, I worked on implementing the color conversion rgb2hed function in CUDA C++, to be integrated into cuCIM through a C++ based operator plugin mechanism.Both the CuPy implementation of Macenko stain normalization and the CUDA C++ implementation of the rgb2hed function showed significant performance gains compared to the NumPy version and pure C++ version, respectively. The stain normalization preprocessing time for training a pipeline over 500 epochs with a dataset of 250 images and image size of 4000 by 4000 pixels is roughly estimated at 13 days with the NumPy-based filter. It decreases to 3.5 hours for the CuPy-based filter.Ultimately, accelerating pre– and post-processing filters for digital pathology can improve the performance of deep learning pipelines in digital pathology, expedite the adoption of digital pathology, and enable AI to revolutionize pathology.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.The following code example on the left is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown on the right.This increases code complexity in the application because the memory management code is separated out from the business logic. The problem is exacerbated when other libraries are involved. For example, consider the case where kernelA is launched by a library function instead:This is much harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when that function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory.Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. Consider the following code example:It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.The following example shows various valid usages.Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order.The CUDA driver uses memory pools to achieve the behavior of returning a pointer immediately.The stream-ordered memory allocator introduces the concept of memory pools to CUDA. A memory pool is a collection of previously allocated memory that can be reused for future allocations. In CUDA, a pool is represented by a cudaMemPool_t handle. Each device has a notion of a default pool whose handle can be queried using cudaDeviceGetDefaultMemPool.You can also explicitly create your own pools and either use them directly or set them as the current pool for a device and use them indirectly. Reasons for explicit pool creation include custom configuration, as described later in this post. When no explicitly created pool has been set as the current pool for a device, the default pool acts as the current pool.When called without an explicit pool argument, each call to cudaMallocAsync infers the device from the specified stream and attempts to allocate memory from that device’s current pool. If the pool has insufficient memory, the CUDA driver calls into the OS to allocate more memory. Each call to cudaFreeAsync returns memory to the pool, which is then available for re-use on subsequent cudaMallocAsync requests. Pools are managed by the CUDA driver, which means that applications can enable pool sharing between multiple libraries without those libraries having to coordinate with each other.If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.By default, unused memory accumulated in the pool is returned to the OS during the next synchronization operation on an event, stream, or device, as the following code example shows.Returning memory from the pool to the system can affect performance in some cases. Consider the following code example:By default, stream synchronization causes any pools associated with that stream’s device to release all unused memory back to the system. In this example, that would happen at the end of every iteration. As a result, there is no memory to reuse for the next cudaMallocAsync call and instead memory must be allocated through an expensive system call.To avoid this expensive reallocation, the application can configure a release threshold to enable unused memory to persist beyond the synchronization operation. The release threshold specifies the maximum amount of memory the pool caches. It releases all excess memory back to the OS during a synchronization operation.By default, the release threshold of a pool is zero. This means that allunused memory in the pool is released back to the OS during every synchronization operation. The following code example shows how to change the release threshold.Using a nonzero release threshold enables reusing memory from one iteration to the next. This requires only simple bookkeeping and makes the performance of cudaMallocAsync independent of the size of the allocation, which results in dramatically improved memory allocation performance (Figure 2).The pool threshold is just a hint. Memory in the pool can also be released implicitly by the CUDA driver to enable an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc or cuMemCreate could cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request.This is especially helpful in scenarios where an application makes use of multiple libraries, some of which use cudaMallocAsync and some that do not. By automatically freeing up unused pool memory, those libraries do not have to coordinate with each other to have their respective allocation requests succeed.There are limitations to when the CUDA driver automatically reassigns memory from a pool to unrelated allocation requests. For example, the application may be using a different interface, like Vulkan or DirectX, to access the GPU, or there may be more than one process using the GPU at the same time. Memory allocation requests in those contexts do not cause automatic freeing of unused pool memory. In such cases, the application may have to explicitly free unused memory in the pool, by invoking cudaMemPoolTrimTo.The bytesToKeep argument tells the CUDA driver how many bytes it can retain in the pool. Any unused memory that exceeds that size is released back to the OS. The stream parameter to cudaMallocAsync and cudaFreeAsync helps CUDA reuse memory efficiently and avoid expensive calls into the OS. Consider the following trivial code example.In this code example, ptr2 is allocated in stream order after ptr1 is freed. The ptr2 allocation could reuse some, or all, of the memory that was used for ptr1 without any synchronization, because kernelA and kernelB are launched in the same stream. So, stream-ordering semantics guarantee that kernelB cannot begin execution and access the memory until kernelA has completed. This way, the CUDA driver can help keep the memory footprint of the application low while also improving allocation performance.The CUDA driver can also follow dependencies between streams inserted through CUDA events, as shown in the following code example:As the CUDA driver is aware of the dependency between streams A and B, it can reuse the memory used by ptr1 for ptr2. The dependency chain between streams A and B can contain any number of streams, as shown in the following code example.If necessary, the application can disable this feature on a per-pool basis:The CUDA driver can also reuse memory opportunistically in the absence of explicit dependencies specified by the application. While such heuristics may help improve performance or avoid memory allocation failures, they can add nondeterminism to the application and so can be disabled on a per-pool basis. Consider the following code example:In this scenario, there are no explicit dependencies between streamA and streamB. However, the CUDA driver is aware of how far each stream has executed. If, on the second call to cudaMallocAsync in streamB, the CUDA driver determines that kernelA has finished execution on the GPU, then it can reuse some or all of the memory used by ptr1 for ptr2.If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.The application can disable these heuristics as follows:In part 1 of this series, we introduced the new API functions cudaMallocAsync and cudaFreeAsync , which enable memory allocation and deallocation to be stream-ordered operations. Use them to avoid expensive calls to the OS through memory pools maintained by the CUDA driver.In part 2 of this series, we share some benchmark results to show the benefits of stream-ordered memory allocation. We also provide a step-by-step recipe for modifying your existing applications to take full advantage of this advanced CUDA capability.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.49	Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.The following code example on the left is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown on the right.This increases code complexity in the application because the memory management code is separated out from the business logic. The problem is exacerbated when other libraries are involved. For example, consider the case where kernelA is launched by a library function instead:This is much harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when that function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory.Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. Consider the following code example:It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.The following example shows various valid usages.Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order.The CUDA driver uses memory pools to achieve the behavior of returning a pointer immediately.The stream-ordered memory allocator introduces the concept of memory pools to CUDA. A memory pool is a collection of previously allocated memory that can be reused for future allocations. In CUDA, a pool is represented by a cudaMemPool_t handle. Each device has a notion of a default pool whose handle can be queried using cudaDeviceGetDefaultMemPool.You can also explicitly create your own pools and either use them directly or set them as the current pool for a device and use them indirectly. Reasons for explicit pool creation include custom configuration, as described later in this post. When no explicitly created pool has been set as the current pool for a device, the default pool acts as the current pool.When called without an explicit pool argument, each call to cudaMallocAsync infers the device from the specified stream and attempts to allocate memory from that device’s current pool. If the pool has insufficient memory, the CUDA driver calls into the OS to allocate more memory. Each call to cudaFreeAsync returns memory to the pool, which is then available for re-use on subsequent cudaMallocAsync requests. Pools are managed by the CUDA driver, which means that applications can enable pool sharing between multiple libraries without those libraries having to coordinate with each other.If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.By default, unused memory accumulated in the pool is returned to the OS during the next synchronization operation on an event, stream, or device, as the following code example shows.Returning memory from the pool to the system can affect performance in some cases. Consider the following code example:By default, stream synchronization causes any pools associated with that stream’s device to release all unused memory back to the system. In this example, that would happen at the end of every iteration. As a result, there is no memory to reuse for the next cudaMallocAsync call and instead memory must be allocated through an expensive system call.To avoid this expensive reallocation, the application can configure a release threshold to enable unused memory to persist beyond the synchronization operation. The release threshold specifies the maximum amount of memory the pool caches. It releases all excess memory back to the OS during a synchronization operation.By default, the release threshold of a pool is zero. This means that allunused memory in the pool is released back to the OS during every synchronization operation. The following code example shows how to change the release threshold.Using a nonzero release threshold enables reusing memory from one iteration to the next. This requires only simple bookkeeping and makes the performance of cudaMallocAsync independent of the size of the allocation, which results in dramatically improved memory allocation performance (Figure 2).The pool threshold is just a hint. Memory in the pool can also be released implicitly by the CUDA driver to enable an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc or cuMemCreate could cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request.This is especially helpful in scenarios where an application makes use of multiple libraries, some of which use cudaMallocAsync and some that do not. By automatically freeing up unused pool memory, those libraries do not have to coordinate with each other to have their respective allocation requests succeed.There are limitations to when the CUDA driver automatically reassigns memory from a pool to unrelated allocation requests. For example, the application may be using a different interface, like Vulkan or DirectX, to access the GPU, or there may be more than one process using the GPU at the same time. Memory allocation requests in those contexts do not cause automatic freeing of unused pool memory. In such cases, the application may have to explicitly free unused memory in the pool, by invoking cudaMemPoolTrimTo.The bytesToKeep argument tells the CUDA driver how many bytes it can retain in the pool. Any unused memory that exceeds that size is released back to the OS. The stream parameter to cudaMallocAsync and cudaFreeAsync helps CUDA reuse memory efficiently and avoid expensive calls into the OS. Consider the following trivial code example.In this code example, ptr2 is allocated in stream order after ptr1 is freed. The ptr2 allocation could reuse some, or all, of the memory that was used for ptr1 without any synchronization, because kernelA and kernelB are launched in the same stream. So, stream-ordering semantics guarantee that kernelB cannot begin execution and access the memory until kernelA has completed. This way, the CUDA driver can help keep the memory footprint of the application low while also improving allocation performance.The CUDA driver can also follow dependencies between streams inserted through CUDA events, as shown in the following code example:As the CUDA driver is aware of the dependency between streams A and B, it can reuse the memory used by ptr1 for ptr2. The dependency chain between streams A and B can contain any number of streams, as shown in the following code example.If necessary, the application can disable this feature on a per-pool basis:The CUDA driver can also reuse memory opportunistically in the absence of explicit dependencies specified by the application. While such heuristics may help improve performance or avoid memory allocation failures, they can add nondeterminism to the application and so can be disabled on a per-pool basis. Consider the following code example:In this scenario, there are no explicit dependencies between streamA and streamB. However, the CUDA driver is aware of how far each stream has executed. If, on the second call to cudaMallocAsync in streamB, the CUDA driver determines that kernelA has finished execution on the GPU, then it can reuse some or all of the memory used by ptr1 for ptr2.If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.The application can disable these heuristics as follows:In part 1 of this series, we introduced the new API functions cudaMallocAsync and cudaFreeAsync , which enable memory allocation and deallocation to be stream-ordered operations. Use them to avoid expensive calls to the OS through memory pools maintained by the CUDA driver.In part 2 of this series, we share some benchmark results to show the benefits of stream-ordered memory allocation. We also provide a step-by-step recipe for modifying your existing applications to take full advantage of this advanced CUDA capability.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.50	In AI and computer vision, data acquisition is costly and time-consuming and human-based labeling can be error-prone. The accuracy of the models is also affected by insufficient and poorly balanced data and the prolonged time required to improve the deep learning models. It always requires the reacquisition of data in the real world.The collection, preparation of data, and development of accurate and reliable software solutions based on AI training is an extremely laborious process. The required investment costs offset the expected benefits of deploying the system.One way to bridge the data gap and accelerate model training is by using synthetic data instead of real data for training. SKY ENGINE provides an AI platform to move deep learning to virtual reality. It is possible to generate synthetic data using simulations where the synthetic images come with the annotation that can be used directly in training AI models.Synthetic data can now be directly exported to run on the NVIDIA Transfer Learning Toolkit (TLT), an AI training toolkit that simplifies training by abstracting away the AI/DL framework complexity. This enables you to build production-quality models faster without needing any AI expertise. With the SKY ENGINE AI platform and TLT, you can quickly iterate and build AI.In this post, you learn how you can harness the power of synthetic data by taking preannotated synthetic data and training it on TLT. I demonstrate a simple inspection use case to identify antennas on a telco tower using segmentation.SKY ENGINE introduces a full-stack AI platform for deep learning in virtual reality, which is the next-generation active learning AI system for image and video analysis applications. The SKY ENGINE AI platform can generate data using a proprietary, dedicated simulation system where images come already annotated and ready for deep learning.The output data stream can include any of the following:SKY ENGINE AI also includes advanced domain adaptation algorithms that can understand the characteristics of real data examples. They assure the high-quality performance of any trained AI model during the inference.The SKY ENGINE simulation system enables physics-driven sensor simulations (cameras, thermal vision, IR, lidars, radars, and more) and sensor data fusion. It is tightly coupled with a deep learning pipeline to ensure evolution. During training, SKY ENGINE AI can spot ambiguous situations that deteriorate the accuracy of the AI model. It obtains more imagery data to reflect those problematic situations that the deep learning accuracy could instantaneously improve. SKY ENGINE AI learns more with every performed experiment.SKY ENGINE AI delivers a garden of deep neural networks fully implemented, tested, and optimized. Provided models are dedicated to popular computer vision tasks like object detection and semantic segmentation. They can also serve as more sophisticated topologies designed and implemented for 3D position and pose estimation, 3D geometry reasoning, or representation learning.SKY ENGINE AI also includes advanced domain adaptation algorithms that can understand the characteristics of real data examples and assure the performance of trained model inference. SKY ENGINE AI does not require sophisticated rendering and imaging knowledge, so the entry barrier is very low. It has a Python API, including a large number of helpers to quickly build and configure the environment.The SKY ENGINE AI platform can generate the datasets and enable the training of deep learning models that can use input data originating from any source. The input stream for AI models training in NVIDIA TLT and AI-driven inference can effectively include low-quality images obtained using smartphones, data from CCTV cameras, or cameras mounted on drones.You can deploy analytical modules for telecommunication network performance optimization on the cloud, including data storage and multi-GPU scaling. The majority of software projects driven by machine learning in this space are unable to reach the final stage of solution deployment. This could be because of the high dependence of machine learning capabilities on the quality of the input data. The development of AI models with deep training on synthetic data, offered by SKY ENGINE, is a solution with predictable project development and guaranteed deployment in several industrial business processes.One of the common computer vision tasks is the localization and classification of the equipment of interest. In this post, I present the process of neural network optimization for bounding box localization of antenna instances on a telecommunication tower using the NVIDIA TLT environment with MaskRCNN. You use the synthetic data from SKY ENGINE AI to train the MaskRCNN model. The high-level workflow is as follows:To follow along, see the SKY ENGINE AI Jupyter notebook on GitHub.Given the real samples of a telco tower, I used the SE Rendering Engine to create an annotated synthetic dataset.To launch automatic generation of labeled data using SKY ENGINE AI and to prepare the data source object, you must define basic tools like empty renderer context, as well as paths where the assets for the synthetic scene are located.In this rendering scenario, I randomized the following:There can be many projects in which the samples returned by SKY ENGINE are not shuffled enough. One example would be when your rendering process follows the camera trajectory. For this reason, I recommend extra shuffling of the data before dividing it into train and test sets.After generating the images, convert them to COCO format using the data export module of SKY ENGINE. This is required by the NVIDIA TLT framework. After you prepare the configuration file according to the documentation, you can run the training for the TLT pretrained Mask RCNN model with the TensorFlow backend:As a final step, run a trained deep learning model for inference on real data to see if the model is accurately performing tasks of interest.Figure 3 shows some results of telecommunication antenna detection.In this post, I demonstrated how you can reduce your data collection and annotation effort by using the synthetic data from SKY ENGINE and training and optimizing it with NVIDIA TLT. I presented a single SKY ENGINE AI use case for telecommunication industry. However, this platform unlocks the universe of further potential applications delivering several advanced functionalities:For more information, see the SKY ENGINE AI solution on GitHub. For more computer vision use cases developed in the SKY ENGINE AI Platform, see the following videos:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	In AI and computer vision, data acquisition is costly and time-consuming and human-based labeling can be error-prone. The accuracy of the models is also affected by insufficient and poorly balanced data and the prolonged time required to improve the deep learning models. It always requires the reacquisition of data in the real world.The collection, preparation of data, and development of accurate and reliable software solutions based on AI training is an extremely laborious process. The required investment costs offset the expected benefits of deploying the system.One way to bridge the data gap and accelerate model training is by using synthetic data instead of real data for training. SKY ENGINE provides an AI platform to move deep learning to virtual reality. It is possible to generate synthetic data using simulations where the synthetic images come with the annotation that can be used directly in training AI models.Synthetic data can now be directly exported to run on the NVIDIA Transfer Learning Toolkit (TLT), an AI training toolkit that simplifies training by abstracting away the AI/DL framework complexity. This enables you to build production-quality models faster without needing any AI expertise. With the SKY ENGINE AI platform and TLT, you can quickly iterate and build AI.In this post, you learn how you can harness the power of synthetic data by taking preannotated synthetic data and training it on TLT. I demonstrate a simple inspection use case to identify antennas on a telco tower using segmentation.SKY ENGINE introduces a full-stack AI platform for deep learning in virtual reality, which is the next-generation active learning AI system for image and video analysis applications. The SKY ENGINE AI platform can generate data using a proprietary, dedicated simulation system where images come already annotated and ready for deep learning.The output data stream can include any of the following:SKY ENGINE AI also includes advanced domain adaptation algorithms that can understand the characteristics of real data examples. They assure the high-quality performance of any trained AI model during the inference.The SKY ENGINE simulation system enables physics-driven sensor simulations (cameras, thermal vision, IR, lidars, radars, and more) and sensor data fusion. It is tightly coupled with a deep learning pipeline to ensure evolution. During training, SKY ENGINE AI can spot ambiguous situations that deteriorate the accuracy of the AI model. It obtains more imagery data to reflect those problematic situations that the deep learning accuracy could instantaneously improve. SKY ENGINE AI learns more with every performed experiment.SKY ENGINE AI delivers a garden of deep neural networks fully implemented, tested, and optimized. Provided models are dedicated to popular computer vision tasks like object detection and semantic segmentation. They can also serve as more sophisticated topologies designed and implemented for 3D position and pose estimation, 3D geometry reasoning, or representation learning.SKY ENGINE AI also includes advanced domain adaptation algorithms that can understand the characteristics of real data examples and assure the performance of trained model inference. SKY ENGINE AI does not require sophisticated rendering and imaging knowledge, so the entry barrier is very low. It has a Python API, including a large number of helpers to quickly build and configure the environment.The SKY ENGINE AI platform can generate the datasets and enable the training of deep learning models that can use input data originating from any source. The input stream for AI models training in NVIDIA TLT and AI-driven inference can effectively include low-quality images obtained using smartphones, data from CCTV cameras, or cameras mounted on drones.You can deploy analytical modules for telecommunication network performance optimization on the cloud, including data storage and multi-GPU scaling. The majority of software projects driven by machine learning in this space are unable to reach the final stage of solution deployment. This could be because of the high dependence of machine learning capabilities on the quality of the input data. The development of AI models with deep training on synthetic data, offered by SKY ENGINE, is a solution with predictable project development and guaranteed deployment in several industrial business processes.One of the common computer vision tasks is the localization and classification of the equipment of interest. In this post, I present the process of neural network optimization for bounding box localization of antenna instances on a telecommunication tower using the NVIDIA TLT environment with MaskRCNN. You use the synthetic data from SKY ENGINE AI to train the MaskRCNN model. The high-level workflow is as follows:To follow along, see the SKY ENGINE AI Jupyter notebook on GitHub.Given the real samples of a telco tower, I used the SE Rendering Engine to create an annotated synthetic dataset.To launch automatic generation of labeled data using SKY ENGINE AI and to prepare the data source object, you must define basic tools like empty renderer context, as well as paths where the assets for the synthetic scene are located.In this rendering scenario, I randomized the following:There can be many projects in which the samples returned by SKY ENGINE are not shuffled enough. One example would be when your rendering process follows the camera trajectory. For this reason, I recommend extra shuffling of the data before dividing it into train and test sets.After generating the images, convert them to COCO format using the data export module of SKY ENGINE. This is required by the NVIDIA TLT framework. After you prepare the configuration file according to the documentation, you can run the training for the TLT pretrained Mask RCNN model with the TensorFlow backend:As a final step, run a trained deep learning model for inference on real data to see if the model is accurately performing tasks of interest.Figure 3 shows some results of telecommunication antenna detection.In this post, I demonstrated how you can reduce your data collection and annotation effort by using the synthetic data from SKY ENGINE and training and optimizing it with NVIDIA TLT. I presented a single SKY ENGINE AI use case for telecommunication industry. However, this platform unlocks the universe of further potential applications delivering several advanced functionalities:For more information, see the SKY ENGINE AI solution on GitHub. For more computer vision use cases developed in the SKY ENGINE AI Platform, see the following videos:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.51	Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Read more >>>Read the full article in Applied Physics Reviews >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.52	This post was originally published in August 2019 and has been updated for NVIDIA TensorRT 8.0.Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought exciting leaps in accuracy for many natural language processing (NLP) tasks. Since its release in October 2018, BERT (Bidirectional Encoder Representations from Transformers), with all its many variants, remains one of the most popular language models and still delivers state-of-the-art accuracy.BERT provided a leap in accuracy for NLP tasks that brought high-quality, language-based services within the reach of companies across many industries. To use the model in production, you must consider factors such as latency and accuracy, which influences end-user satisfaction with a service. BERT requires significant compute during inference due to its 12/24-layer stacked, multihead attention network. This has posed a challenge for companies to deploy BERT as part of real-time applications.Today, NVIDIA is releasing version 8 of TensorRT, which brings the inference latency of BERT-Large down to 1.2 ms on NVIDIA A100 GPUs with new optimizations on transformer-based networks. New generalized optimizations in TensorRT can accelerate all such models, reducing inference time to half the time compared to TensorRT 7.TensorRT is a platform for high-performance, deep learning inference, which includes an optimizer and runtime that minimizes latency and maximizes throughput in production. With TensorRT, you can optimize models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy in production.All the code for achieving this performance with BERT is being released as open source in this NVIDIA/TensorRT GitHub repo. We have optimized the Transformer layer, which is a fundamental building block of the BERT encoder so that you can adapt these optimizations to any BERT-based NLP task. BERT is applied to an expanding set of speech and NLP applications beyond conversational AI, all of which can take advantage of these optimizations.Question answering (QA) or reading comprehension is a popular way to test the ability of models to understand the context. The SQuAD leaderboard tracks the top performers for this task, for a dataset and test set that they provide. There has been rapid progress in QA ability in the last few years, with global contributions from academia and companies.In this post, we demonstrate how to create a simple QA application using Python, powered by TensorRT-optimized BERT code that NVIDIA released today. The example provides an API to input passages and questions, and it returns responses generated by the BERT model. Here’s a brief review of the steps to perform training and inference using TensorRT for BERT.A major problem faced by NLP researchers and developers is scarcity of high-quality labeled training data for their specific NLP task. To overcome the problem of learning a model for the task from scratch, breakthroughs in NLP use the vast amounts of unlabeled text and break the NLP task into two parts:These two stages are typically referred to as pretraining and fine-tuning. This paradigm enables the use of the pretrained language model to a wide range of tasks without any task-specific change to the model architecture. In this example, BERT provides a high-quality language model that is fine-tuned for QA but suitable for other tasks such as sentence classification and sentiment analysis. You can either start with the pretrained checkpoints available online or pretrain BERT on your own custom corpus (Figure 1). You can also initialize pretraining from a checkpoint and then continue training on custom data.Pretraining with custom or domain-specific data may yield interesting results, for example BioBert. However, it is computationally intensive and requires a massively parallel compute infrastructure to complete within a reasonable amount of time. GPU-enabled, multinode training is an ideal solution for such scenarios. For more information about how NVIDIA developers were able to train BERT in less than an hour, see Training BERT with GPUs.In the fine-tuning step, the task-specific network based on the pretrained BERT language model is trained using the task-specific training data. For QA, this is (paragraph, question, answer) triples. Compared to pretraining, fine-tuning is generally far less computationally demanding.To perform inference using a QA neural network:Figure 2 shows the entire workflow.Set up your environment to perform BERT inference with the following steps: We use scripts to perform these steps, which you can find in the TensorRT BERT sample repo. While we describe several options that you can pass to each script, to get started quickly, you could also run the following code example:This last command builds an engine with a maximum batch size of 1 (-b 1), and sequence length of 128 (-s 128) using mixed precision (--fp16) and the BERT Large SQuAD v2 FP16 Sequence Length 128 checkpoint (-c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1).Now, give it a passage and see how much information it can decipher by asking a few questions.The result of this command should be something similar to the following:Given the same passage with a different question, you should get the following result:The answers provided by the model are accurate based on the text of the passage that was provided. The sample uses FP16 precision for performing inference with TensorRT. This helps achieve the highest performance possible on Tensor Cores in NVIDIA GPUs. In our tests, we measured the accuracy of TensorRT to be comparable to in-framework inference with FP16 precision. Here are the options available with the scripts. The docker/build.sh script builds the docker image using the Dockerfile supplied in the docker folder. It installs all necessary packages, depending on the OS you are selecting as Dockerfile. For this post, we used ubuntu-18.04 but dockerfiles for ubuntu-16.04 and ubuntu-20.04 are also provided.Run the script as follows:After creating and running the environment, download fine-tuned weights for BERT. Note that you do not need the pretrained weights to create the TensorRT engine (just the fine-tuned weights). Along with the fine-tuned weights, use the associated configuration file, which specifies parameters such as number of attention heads, number of layers, and the vocab.txt file, which contains the learned vocabulary from the training process. These are packaged with the fine-tuned model downloaded from NGC; download them using the download_model.sh script. As part of this script, you can specify the set of fine-tuned weights for the BERT model to download. The command-line parameters control the exact BERT model to be used later for model building and inference:Examples:This script by default downloads fine-tuned TensorFlow BERT-large, with FP16 precision and a sequence length of 128. In addition to the fine-tuned model, you use the configuration file, enumerating model parameters and the vocabulary file used to convert BERT model output to a textual answer. Next, you can build the TensorRT engine and use it for a QA example, that is, inference. The script builder.py builds the TensorRT engine for inference based on the downloaded BERT fine-tuned model. Make sure that the sequence length provided to the following script matches the sequence length of the model that was downloaded. Here are the optional arguments:Example:You should now have a TensorRT engine, engines/bert_large_128.engine,  to use in the inference.py script for QA.Later in this post, we describe the process to build the TensorRT engine. You can now provide a passage and a query to inference.py and see if the model is able to answer your queries correctly.There are few ways to interact with the inference script: Here are the parameters for the inference.py script:This script uses a prebuilt TensorRT BERT QA engine to answer a question based on the provided passage.Here are the optional arguments:For a step-by-step description and walkthrough of the inference process, see the Python script inference.py and the detailed Jupyter notebook inference.ipynb in the sample folder. Here are a few key parameters and concepts for performing inference with TensorRT. BERT, or more specifically, the encoder layer, uses the following parameters to govern its operation:The value of these parameters, which depend on the BERT model chosen, are used to set the configuration parameters for the TensorRT plan file (execution engine). For each encoder, also specify the number of hidden layers and the attention head size. You can also read all the earlier parameters from the TensorFlow checkpoint file. As the BERT model we are using has been fine-tuned for a downstream task of QA on the SQuAD dataset, the output for the network (that is, the output fully connected layer) is a span of text where the answer appears in the passage, referred to as  h_output in the sample. After you generate the TensorRT engine, you can serialize it and use it later with TensorRT runtime.During inference, you perform memory copies from CPU to GPU and the reverse asynchronously to get tensors into and out of the GPU memory, respectively. Asynchronous memory copy operation hides the latency of memory transfer by overlapping computations with memory copy operation between device and host. Figure 3 shows the asynchronous memory copies and kernel execution.The inputs to the BERT model (Figure 3) include the following:The outputs (start_logits and end_logits) represent the span of the answer, which the network predicts inside the passage based on the question.BERT can be applied both for online and offline use cases. Online NLP applications, such as conversational AI, place tight latency budgets during inference. Several models need to execute in a sequence in response to a single user query. When used as a service, the total time a customer experiences includes compute time as well as input and output network latency. Longer times lead to a sluggish performance and a poor customer experience. While the exact latency available for a single model can vary by application, several real-time applications need the language model to execute in under 10 ms. Using an NVIDIA Ampere Architecture A100 GPU, BERT-Large optimized with TensorRT 8 can perform inference in 1.2ms for a QA task similar to that available in SQuAD with batch size = 1 and sequence length = 128. Using the TensorRT optimized sample, you can execute different batch sizes for BERT-base or BERT-large within the 10 ms latency budget. For example, the latency for inference on a BERT-Large model with sequence length = 384 batch size = 1 on A30 with TensorRT8 was 3.62ms. The same model, sequence length =384 with highly optimized code on a CPU-only platform (**) for batch size = 1 was 76ms.The performance measures the compute-only latency time for executing the network on a QA task between passing tensors as input and gathering logits as output. You can find the code used to benchmark the sample in the script scripts/inference_benchmark.sh in the repo. NVIDIA is releasing TensorRT 8.0, which makes it possible to perform BERT inference in 0.74ms on A30 GPUs. The code for benchmarking inference on BERT is available as a sample in the TensorRT open-source repo. This post gives an overview of how to use the TensorRT sample and performance results. We further describe a workflow of how to use the BERT sample as part of a simple application and Jupyter notebook where you can pass a paragraph and ask questions related to it. The new optimizations and performance achievable makes it practical to use BERT in production for applications with tight latency budgets, such as conversational AI.We are always looking for new ideas for new examples and applications to share. What NLP applications do you use BERT for and what examples would you like to see from us in the future?If you have questions regarding the TensorRT sample repo, check the NVIDIA TensorRT Developer Forum to see if other members of the TensorRT community have a resolution first. NVIDIA Registered Developer Program members can also file bugs at https://developer.nvidia.com/nvidia-developer-program.(*)(**) Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.53	Whether you need to monitor cybersecurity threats, fraudulent financial transactions, product defects, or equipment health, artificial intelligence can help you catch data abnormalities before they impact your business. AI models can be trained and deployed to automatically analyze datasets, define “normal behavior,” and identify breaches in patterns quickly and effectively. These models can then be used to predict future anomalies. With massive amounts of data available across industries and subtle distinctions between normal and abnormal patterns, it’s critical that organizations use AI to quickly detect anomalies that pose a threat.The NVIDIA Deep Learning Institute (DLI) is offering instructor-led, hands-on training on how to implement multiple AI-based approaches to solve a specific use case of identifying network intrusions for telecommunications. You’ll learn three different anomaly detection techniques using GPU-accelerated XGBoost, deep learning-based autoencoders, and generative adversarial networks (GANs) and then implement and compare supervised and unsupervised learning techniques. At the end of the workshop, you’ll be able to use AI to detect anomalies in your work across telecommunications, cybersecurity, finance, manufacturing, and other key industries.By participating in this workshop, you’ll:This training will be offered:Tue, Sep 21, 2021, 9:00 a.m. – 5:00 p.m. CEST/EMEA, UTC+2Tue, Sep 21, 2021, 9:00 a.m. – 5:00 p.m. PDT, UTC-7Space is limited, register now.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	When it comes to production, companies spend endless cycles improving their processes to drive the most revenue. Manufacturing lines are rigorously tested, and any changes require downtime that can eat up a company’s profits. That’s where AI comes in.Manufacturing as an industry is ripe to experience the benefits of AI because it performs highly repeatable tasks that can each be tuned and optimized for overall performance. AI takes readily-available historical data from sensors, cameras, and even outcomes and processes it faster than any human could, without getting tired. Once the data is fed into the AI, the AI makes sense of it, then it has to make a prediction based on past data, it makes a choice based on the best option available, and finally it takes action.At GTC ’21, Data Monsters, who builds AI solutions for production and packaging, discussed the growth of AI in manufacturing and how AI is being used to optimize every part of the supply chain, from forecasting and production planning to quality control. The session “Getting Started with AI in Manufacturing” shared how AI could be used to improve the Overall Equipment Effectiveness (OEE) of any organization using data that is already available today. OEE consists of three factors: availability, performance, and quality. Each of these factors can be optimized to improve the effectiveness and therefore profits of manufacturers. Let’s take a look at the various AI techniques that can be used for each.Availability is measured by the amount of uptime compared to downtime. As downtime at any part of the system can result in dramatic productivity loss, predictive maintenance is something many manufacturers are looking to in order to improve the uptime of machinery. Predictive maintenance models learn from the system and identify indicators that predict a failure. This model can alert the team prior to a failure and make recommendations about what needs to be fixed, both of which can reduce downtime. Performance looks at how fast products are being produced compared to how fast they could be produced. With highly repetitive tasks in the manufacturing space, AI can be used to help identify the most efficient schedule based on objective function parameters, and make suggestions on where bottlenecks can be removed. Depending on the parameters, process optimization can determine the most efficient outcome based on technology variables and historical outcomes, thus maximizing throughput, minimizing cost, and reducing leftover stock.Quality of production means looking at what proportion of products are being produced without defects. Here, computer vision provides a lot of data for analysis. Manufacturers can improve the overall quality by identifying where in the process the defects are happening so they can be prevented in the future. Reducing defects and improving the overall quality of products can have a dramatic impact on not only productivity, but also revenue. AI becomes a huge differentiator in the manufacturing space, as it reduces manual operation, and improves efficiency and the competitive position in the market with optimized costs and scheduling. Due to the intense calculations of AI required to perform these tasks, manufacturers are bringing the compute close to sensors generating the data. Moving compute to the edge has the benefit of lowering latency and bandwidth requirements to run AI applications, ensuring the fastest and most accurate responses. With numerous compute systems on production lines, AI models are downloaded from the cloud, data is collected and processed locally. Models are fine-tuned and uploaded back to the cloud for further distribution between several edge systems.To learn more about implementing inspections, diagnostics, and predictive maintenance in the manufacturing pipeline, check out the Data Monster’s session “Getting Started with AI in Manufacturing“. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.54	Breast cancer is the most frequently diagnosed cancer among women worldwide. It’s also the leading cause of cancer-related deaths. Identifying breast cancer at an early stage before metastasis enables more effective treatments and therefore significantly improves survival rates.Although mammography is the most widely used imaging technique for early detection of breast cancer, it is not always available in low-resource settings. Its sensitivity also drops for women with dense breast tissue.Breast ultrasound is often used as a supplementary imaging modality to mammography in screening settings, and as the primary imaging modality in diagnostic settings. Despite its advantages, including lower costs relative to mammography, it is difficult to interpret breast ultrasound images as evident by the considerable intra-reader variability. This leads to increased false-positive findings, unnecessary biopsies, and significant discomfort to patients.Previous work using deep learning for breast ultrasound has been based predominantly on small datasets on the scale of thousands of images. Many of these efforts also rely on expensive and time-consuming manual annotation of images to obtain image-level (presence of cancer in each image) or pixel-level (exact location of each lesion) labels.In our recent paper, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams, we leverage the full potential of deep learning and eliminate the need for manual annotations by designing a weakly supervised deep neural network whose working resembles the diagnostic procedure of radiologists (Figure 1).The following table compares how radiologists make predictions compared to our AI system.We compared the performance of the trained network to 10 board-certified breast radiologists in a reader study and to hybrid AI-radiologist models, which average the prediction of the AI and each radiologist. The neural network was trained with a dataset consisting of approximately four million ultrasound images on an HPC cluster powered by NVIDIA technologies. The cluster consists of 34 computation nodes each of which is equipped with 80 CPUs and four NVIDIA V100 GPUs (16/32 GB). With this cluster, we performed hyperparameter search by launching experiments (each taking around 300 GPU hours) over a broad range of hyperparameters.To complete this ambitious project, we preprocessed more than eight million breast ultrasound images collected at NYU Langone between 2012 and 2019 and extracted breast-level cancer labels by mining pathology reports.Our results show that a hybrid AI-radiologist model decreased false positive rates by 37.4% (that is, false suspicions of malignancy). This would lead to a reduction in the number of requested biopsies by 27.8%, while maintaining the same level of sensitivity as radiologists (Figure 3).When acting independently, the AI system achieved higher area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) than individual readers. Figure 3 shows how each reader compares to the network’s performance.Within the internal test set, the AI system maintained high diagnostic accuracy (0.940-0.990 AUROC) across all age groups, mammographic breast densities, and device manufacturers, including GE, Philips, and Siemens. In the biopsied population, it also achieved a 0.940 AUROC.In an external test set collected in Egypt, the system achieved 0.911 AUROC, highlighting its generalization ability in patient demographics not seen during training (Figure 4). Based on qualitative assessment, the network produced appropriate localization information of benign and malignant lesions through its saliency maps. In the exam shown in Figure 4, all 10 breast radiologists thought the lesion appeared suspicious for malignancy and recommended that it undergo biopsy, while the AI system correctly classified it as benign. Most impressively, locations of lesions were never given during training, as it was trained in a weakly supervised manner!For our next steps, we’d like to evaluate our system through prospective validation before it can be widely deployed in clinical practice. This enables us to measure its potential impact in improving the experience of women who undergo breast ultrasound examinations each year on a global level.In conclusion, our work highlights the complementary role of an AI system in improving diagnostic accuracy by significantly decreasing unnecessary biopsies. Beyond improving radiologists’ performance, we have made technical contributions to the methodology of deep learning for medical imaging analysis.This work would not have been possible without state-of-the-art computational resources. For more information, see the preprint, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.In this post, you learn how to deploy TensorFlow trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.First, a network is trained using any framework. After a network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized to run inference using the TensorRT runtime. To optimize models implemented in TensorFlow, the only thing you have to do is convert models to the ONNX format and use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow. In this post, we discuss how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and to the TensorRT engine with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks. Download the code examples and unzip. You can run either the TensorFlow 1 or the TensorFlow 2 code example by follow the appropriate README. After downloading the file, you should also download labels.py from the Cityscapes dataset scripts repo and place it in the same folder as the other scripts. ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format. It defines a common set of operators, common sets of building blocks of deep learning, and a common file format. It provides a definition of a computation graph, as well as built-in operators. The list of ONNX nodes that may have one or more inputs or outputs forms an acyclic graph. In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50.The workflow consists of the following steps:The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:In addition to Keras, you can also download ResNet-50 from the following locations: The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx. After installing tf2onnx, there are two ways of converting the model from a .pb file to the ONNX format. The first way is to use the command line and the second method is by using Python API. Run the following command:To create the TensorRT engine from the ONNX file, run the following command: This code should be saved in the engine.py file, and is used later in the post. This code example contains the following variable:  The builder creates an empty network (builder.create_network()) and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). You set the input shape for the network (network.get_input(0).shape = shape), after which the builder creates the engine (engine = builder.build_cuda_engine(network)). To create the engine, run the following code example:In this code example, you first get the input shape from the ONNX model. Next,  create the engine, and then save the engine in a .plan file.The TensorRT engine runs inference in the following workflow: These steps are explained in detail in the following code example.  This code should be saved in the inference.py file, and is used later in this post.   The first two lines are for determining the dimensions for input and output. You create page-locked memory buffers in host (h_input_1, h_output). Then, you allocate device memory for input and output the same size as host input and output (d_input_1, d_output). The next step is to create the CUDA stream for copying data between the allocated memory from device and host. In this code example, in the do_inference function, the first step is to load images to buffers in the host using the load_images_to_buffer function. Then the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally the results are copied from GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).In the post Fast INT8 Inference for Autonomous Vehicles with TensorRT 3, the author covered the process of UFF workflow for a semantic segmentation model.In this post, you use similar networks to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers implemented using a deconvolutional layer. The network is trained in about 40,000 iterations on the Cityscapes Dataset. There are multiple ways of converting the TensorFlow model to an ONNX file. One way is the one explained in the ResNet50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes, some of the layers are not supported in the TensorFlow-to-ONNX but they are supported in the Keras to ONNX converter. Depending on the Keras framework and the type of layers used, you may need to choose between converters. In the following code example, you directly convert the Keras model to ONNX using the Keras-to-ONNX converter. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.Figure 2 shows the architecture of the network.As in the previous example, use the following code example to create the engine for semantic segmentation.To test the output of the model, use the Cityscapes Dataset. To work with Cityscapes, you must have the following functions:   sub_mean_chw and color_map.  In the following code example, sub_mean_chw is for subtracting the mean value from the image as the preprocessing step and color_map is the mapping from the class ID to a color. The latter is used for visualization. The following code example is the rest of the code for the previous example. You must run the previous block first because you need the defined functions. Use the example to compare the output of the Keras model and TensorRT engine semantic .plan file and then visualize both outputs.  Replace the placeholders /path/to/semantic_segmentation.hdf5  and input_file_path as appropriate.Figure 3 shows the actual image and the ground truth, and the output of Keras versus the output of the TensorRT engine. As you can see, the output for the TensorRT engine is similar to the one for Keras. Now you can try the ONNX workflow on other networks. For more information about good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub. As an example, we show how to use the ONNX workflow with other networks. The network in this example is U-Net from the segmentation_models library. Here, we only loaded the model and did not train it. You may need to train these models on your preferred dataset. One important point about these networks is that when you load these networks, their input layer sizes are as follows: (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before you convert this model to ONNX, change the network by assigning the size to its input and then convert it to the ONNX format. As an example, load the U-Net network from this library (segmentation_models) and assign the size (244, 244, 3) to its input. After creating the TensorRT engine for the inference, do a similar conversion to what you did for semantic segmentation. Depending on the application and dataset, you may need to have a different color mapping.As we mentioned earlier in this post, another way of downloading pretrained models is to download them from NVIDIA NGC Models. It has a list of checkpoints for pretrained models. As an example, you can search for UNet for TensorFlow and then go to the Download page to get the latest checkpoint. In this post, we explained how to deploy deep learning applications using a TensorFlow-to-ONNX-to-TensorRT workflow, with several examples. The first example was ONNX-TensorRT on ResNet-50, and the second example was VGG16-based semantic segmentation that was trained on the Cityscapes Dataset. At the end of the post, we demonstrated how to apply this workflow on other networks.  For more information about the best performance of training and inference, see NVIDIA Data Center Deep Learning Product Performance. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.55	New research out of the University of California, San Francisco has given a paralyzed man the ability to communicate by translating his brain signals into computer generated writing. The study, published in The New England Journal of Medicine, marks a significant milestone toward restoring communication for people who have lost the ability to speak. “To our knowledge, this is the first successful demonstration of direct decoding of full words from the brain activity of someone who is paralyzed and cannot speak,” senior author and the Joan and Sanford Weill Chair of Neurological Surgery at UCSF, Edward Chang said in a press release. “It shows strong promise to restore communication by tapping into the brain’s natural speech machinery.” Some with speech limitations use assistive devices–such as touchscreens, keyboards, or speech-generating computers to communicate. However, every year thousands lose their speech ability from paralysis or brain damage, leaving them unable to use assistive technologies. The participant lost his ability to speak in 2003, paralyzed by a brain stroke following a car accident. The researchers were not sure if his brain retained neural activity linked to speech. To track his brain signals, a neuroprosthetic device consisting of electrodes was positioned on the left side of the brain, across several regions known for speech processing. Over about four months the team embarked on 50 training sessions, where the participant was prompted to say individual words, form sentences, or respond to questions on a display screen. While responding to the prompts, the electrode device captured neural activity and transmitted the information to a computer with custom software. “Our models needed to learn the mapping between complex brain activity patterns and intended speech. That poses a major challenge when the participant can’t speak,” David Moses, a postdoctoral engineer in the Chang lab and one of the lead authors of the study, said in a press release.To decode the responses from his brain activity, the team created speech-detection and word classification models. Using the cuDNN-accelerated TensorFlow framework and 32 NVIDIA V100 Tensor Core GPUs the researchers trained, fine-tuned, and evaluated the models.“Utilizing neural networks was essential to getting the classification and detection performance we did, and our final product was the result of lots of experimentation,’ said study co-lead Sean Metzger. “Because our dataset was constantly evolving and growing, being able to adapt the models we were using was critical. The GPUs helped us make changes, monitor progress, and understand our dataset.”	At GTC 21, NVIDIA announced several major breakthroughs in conversational AI for building and deploying automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications. The conference also hosted over 60 engaging sessions and workshops featuring the latest tools, technologies and research in conversational AI and NLP.The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here. On-Demand SessionsConversational AI DemystifiedSpeaker: Meriem Bendris, Senior Solution Architect, NVIDIAConversational AI technologies are becoming ubiquitous, with countless products taking advantage of automatic speech recognition, natural language understanding, and speech synthesis coming to market. Thanks to new tools and technologies, developing conversational AI applications is easier than ever, enabling a much broader range of applications, such as virtual assistants, real-time transcription, and many more. We will give an overview of the conversational AI landscape and discuss how any organization can get started developing conversational AI applications today.Building and Deploying a Custom Conversational AI App with NVIDIA Transfer Learning Toolkit and JarvisSpeakers: Tripti Singhal, Solutions Architect, NVIDIA; Nikhil Srihari, Technical Marketing Engineer – Deep Learning, NVIDIA; Arun Venkatesan, Product Manager, NVIDIATailoring the deep learning models in a conversational AI pipeline to your enterprise needs is time-consuming. Developing a domain-specific application typically requires several cycles of re-training, fine-tuning, and deploying the model until it satisfies the requirements. NVIDIA Jarvis helps you easily build production-ready conversational AI applications and provides tools for fine-tuning on your domain. In this session, we will walk you through the process of customizing automatic speech recognition and natural language processing pipelines to build a truly customized production-ready Conversational AI application.Megatron GPT-3 Large Model Inference with Triton and ONNX RuntimeSpeaker: Denis Timonin, AI Solutions Architect, NVIDIAHuge NLP models like Megatron-LM GPT-3, Megatron-LM Bert require tens/hundreds of gigabytes of memory to store their weights or run inference. Frequently, one GPU is not enough for such a task. One way to run inference and maximize throughput of these models is to divide them into smaller sub-parts in the pipeline-parallelism (in-depth) style and run these subparts on multiple GPUs. This method will allow us to use bigger batch size and run inference through an ensemble of subparts in a conveyor manner. TRITON inference server is an open-source inference serving software that lets teams deploy trained AI models from any framework. And this is a perfect tool that allows us to run this ensemble. In this talk, we will take Megatron LM with billions of parameters, convert it in ONNX format, and will learn how to divide it into subparts with the new tool – ONNX-GraphSurgeon. Then, we will use TRITON ensemble API and ONNX runtime background and run this model inference on an NVIDIA DGX.BlogAnnouncing Megatron for Training Trillion Parameter Models and NVIDIA Jarvis AvailabilityNVIDIA announced Megatron for training giant transformer-based language models and major capabilities in NVIDIA Jarvis for building state-of-the-art interactive conversational AI applications.DemoWorld-Class ASR | Real-Time Machine Translation | Controllable Text-to-SpeechWatch this demo to see Jarvis’ automatic speech recognition (ASR) accuracy when fine-tuned on medical jargon, real-time neural machine translation from English to Spanish and Japanese, and powerful controllability of neural text-to-speech.New pre-trained models, notebooks, and sample applications for conversational AI are all available to try from the NGC catalog. You can also find tutorials for building and deploying conversational AI applications at the NVIDIA Developer Blog.Join the NVIDIA Developer Program for all of the latest tools and resources for building with NVIDIA technologies.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.56	With up to 93% accuracy, and a median rate of 75%, the model decoded the participants word’s at a rate of up to 18 per minute.  “We want to get to 1,000 words, and eventually all words. This is just the starting point,” Chang said. The study builds off previous work by Chang and his colleagues, which developed a deep learning method for decoding and converting brain signals. Unlike the current work, participants in the previous study were able to speak.  Read more >>>Read the full article in The New England Journal of Medicine >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	With up to 93% accuracy, and a median rate of 75%, the model decoded the participants word’s at a rate of up to 18 per minute.  “We want to get to 1,000 words, and eventually all words. This is just the starting point,” Chang said. The study builds off previous work by Chang and his colleagues, which developed a deep learning method for decoding and converting brain signals. Unlike the current work, participants in the previous study were able to speak.  Read more >>>Read the full article in The New England Journal of Medicine >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.57	King’s College London, along with partner hospitals and university collaborators, unveiled new details today about one of the first projects on Cambridge-1, the United Kingdom’s most powerful supercomputer.The Synthetic Brain Project is focused on building deep learning models that can synthesize artificial 3D MRI images of human brains. These models can help scientists understand what a human brain looks like across a variety of ages, genders, and diseases. The AI models were developed by King’s College London, and NVIDIA data scientists and engineers, as part of The London Medical Imaging & AI Centre for Value Based Healthcare. The research was funded by UK Research and Innovation and a Wellcome Flagship Programme (in collaboration with University College London). The use of synthetic data has the additional benefit of ensuring patient privacy and gives King’s the ability to open the research to the broader UK healthcare community. Without Cambridge-1, the AI models would have taken months rather than weeks to train, and the resulting image quality would not have been as clear. King’s and NVIDIA researchers used Cambridge-1 to scale the models to the necessary size using multiple GPUs, and then applied a process known as hyperparameter tuning, which dramatically improved the accuracy of the models.“Cambridge-1 enables accelerated generation of synthetic data that gives researchers at King’s the ability to understand how different factors affect the brain, anatomy, and pathology,” said Jorge Cardoso, senior lecturer in Artificial Medical Intelligence at King’s College London. “We can ask our models to generate an almost infinite amount of data, with prescribed ages and diseases; with this, we can start tackling problems such as how diseases affect the brain and when abnormalities might exist.” Introduction of the NVIDIA Cambridge-1 supercomputer poses new possibilities for groundbreaking research like the Synthetic Brain Project and could be used to accelerate research in digital biology on disease, drug design, and the human genome. As one of the world’s top 50 fastest supercomputers, Cambridge-1 is built on 80 DGX A100 systems, integrating NVIDIA A100 GPUs, Bluefield-2 DPUs, and NVIDIA HDR InfiniBand networking. King’s College London is leveraging NVIDIA hardware and the open-source MONAI software framework supported by PyTorch, with cuDNN and Omniverse for their Synthetic Brain Project. MONAI is a freely available, community-supported PyTorch-based framework for deep learning in healthcare imaging. The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library for deep neural networks. Omniverse is an open platform for virtual collaboration and real-time simulation. King’s has just begun using it to visualize brains, which can help physicians better understand the morphology and pathology of brain diseases. The increasing efficiency of deep learning architectures—together with hardware improvements—have enabled complex and high-dimensional modelling of medical volumetric data at higher resolutions. Vector-Quantized Variational Autoencoders (VQ-VAE) have been an option for an efficient generative unsupervised learning approach that can encode images to a substantially compressed representation compared to its initial size, while preserving the decoded fidelity. King’s used a VQ-VAE inspired and 3D optimized network to efficiently encode a full-resolution brain volume, compressing the data to less than 1% of the original size while maintaining image fidelity, and outperforming the previous State-of-the-Art. After the images are encoded by the VQ-VAE, the latent space is learned through a long-range transformer model optimized for the volumetric nature of the data and associated sequence length. The sequence length caused by the three-dimensional nature of the data requires unparalleled model sizes made possible by the multi-GPU and multinode scaling provided by Cambridge-1. By sampling from these large transformer models, and conditioning on clinical variables of interest (such as age or disease), new latent space sequences can be generated, and decoded into volumetric brain images using the VQ-VAE. Transformer AI models adopt the mechanism of attention, differentially weighing the significance of each part of the input data, and used to understand these sequence lengths. Creating generative brain images that are eerily similar to real life neurological radiology studies helps understand how the brain forms, how trauma and disease affect it, and how to help it recover. Instead of real patient data, the use of synthetic data mitigates problems with data access and patient privacy. As part of the synthetic brain generation project from King’s College London, the code and models are open-source. NVIDIA has made open-source contributions to improve the performance of the fast-transformers project, on which The Synthetic Brain Project depends upon.To learn more about Cambridge-1, watch the replay of the Cambridge-1 Inauguration featuring a special address from NVIDIA founder and CEO Jensen Huang, and a panel with UK healthcare experts from AstraZeneca, GSK, Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and Oxford Nanopore.	King’s College London, along with partner hospitals and university collaborators, unveiled new details today about one of the first projects on Cambridge-1, the United Kingdom’s most powerful supercomputer.The Synthetic Brain Project is focused on building deep learning models that can synthesize artificial 3D MRI images of human brains. These models can help scientists understand what a human brain looks like across a variety of ages, genders, and diseases. The AI models were developed by King’s College London, and NVIDIA data scientists and engineers, as part of The London Medical Imaging & AI Centre for Value Based Healthcare. The research was funded by UK Research and Innovation and a Wellcome Flagship Programme (in collaboration with University College London). The use of synthetic data has the additional benefit of ensuring patient privacy and gives King’s the ability to open the research to the broader UK healthcare community. Without Cambridge-1, the AI models would have taken months rather than weeks to train, and the resulting image quality would not have been as clear. King’s and NVIDIA researchers used Cambridge-1 to scale the models to the necessary size using multiple GPUs, and then applied a process known as hyperparameter tuning, which dramatically improved the accuracy of the models.“Cambridge-1 enables accelerated generation of synthetic data that gives researchers at King’s the ability to understand how different factors affect the brain, anatomy, and pathology,” said Jorge Cardoso, senior lecturer in Artificial Medical Intelligence at King’s College London. “We can ask our models to generate an almost infinite amount of data, with prescribed ages and diseases; with this, we can start tackling problems such as how diseases affect the brain and when abnormalities might exist.” Introduction of the NVIDIA Cambridge-1 supercomputer poses new possibilities for groundbreaking research like the Synthetic Brain Project and could be used to accelerate research in digital biology on disease, drug design, and the human genome. As one of the world’s top 50 fastest supercomputers, Cambridge-1 is built on 80 DGX A100 systems, integrating NVIDIA A100 GPUs, Bluefield-2 DPUs, and NVIDIA HDR InfiniBand networking. King’s College London is leveraging NVIDIA hardware and the open-source MONAI software framework supported by PyTorch, with cuDNN and Omniverse for their Synthetic Brain Project. MONAI is a freely available, community-supported PyTorch-based framework for deep learning in healthcare imaging. The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library for deep neural networks. Omniverse is an open platform for virtual collaboration and real-time simulation. King’s has just begun using it to visualize brains, which can help physicians better understand the morphology and pathology of brain diseases. The increasing efficiency of deep learning architectures—together with hardware improvements—have enabled complex and high-dimensional modelling of medical volumetric data at higher resolutions. Vector-Quantized Variational Autoencoders (VQ-VAE) have been an option for an efficient generative unsupervised learning approach that can encode images to a substantially compressed representation compared to its initial size, while preserving the decoded fidelity. King’s used a VQ-VAE inspired and 3D optimized network to efficiently encode a full-resolution brain volume, compressing the data to less than 1% of the original size while maintaining image fidelity, and outperforming the previous State-of-the-Art. After the images are encoded by the VQ-VAE, the latent space is learned through a long-range transformer model optimized for the volumetric nature of the data and associated sequence length. The sequence length caused by the three-dimensional nature of the data requires unparalleled model sizes made possible by the multi-GPU and multinode scaling provided by Cambridge-1. By sampling from these large transformer models, and conditioning on clinical variables of interest (such as age or disease), new latent space sequences can be generated, and decoded into volumetric brain images using the VQ-VAE. Transformer AI models adopt the mechanism of attention, differentially weighing the significance of each part of the input data, and used to understand these sequence lengths. Creating generative brain images that are eerily similar to real life neurological radiology studies helps understand how the brain forms, how trauma and disease affect it, and how to help it recover. Instead of real patient data, the use of synthetic data mitigates problems with data access and patient privacy. As part of the synthetic brain generation project from King’s College London, the code and models are open-source. NVIDIA has made open-source contributions to improve the performance of the fast-transformers project, on which The Synthetic Brain Project depends upon.To learn more about Cambridge-1, watch the replay of the Cambridge-1 Inauguration featuring a special address from NVIDIA founder and CEO Jensen Huang, and a panel with UK healthcare experts from AstraZeneca, GSK, Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and Oxford Nanopore.58	 Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.59	MLPerf is an industry-wide AI consortium tasked with developing a suite of performance benchmarks that cover a range of leading AI workloads widely in use. The latest MLPerf v1.0 training round includes vision, language and recommender systems, and reinforcement learning tasks. It is continually evolving to reflect the state-of-the-art AI applications.NVIDIA submitted MLPerf v1.0 training results for all eight benchmarks, as is our tradition. In fact, systems built upon the NVIDIA AI platform are the only commercially available systems to make submissions across the board.Compared to our previous MLPerf v0.7 submissions, we improved up to 2.1x on a chip-to-chip basis and up to 3.5x at scale. We set 16 performance records with eight on a per-chip basis and eight at-scale training in the commercially available solutions category.(*) Per Accelerator performance for A100 computed using NVIDIA 8xA100 server time-to-train and multiplying it by 8 | Per Chip Performance comparisons to others arrived at by comparing performance at the closest similar scale. Per-Accelerator Records:  BERT: 1.0-1033  | DLRM: 1.0-1037  |  Mask R-CNN: 1.0-1057  |  ResNet50 v1.5: 1.0-1038  |  SSD: 1.0-1038  |  RNN-T: 1.0-1060  |  3D U-Net: 1.0-1053  |  MiniGo: 1.0-1061Max Scale Records:  BERT: 1.0-1077  | DLRM: 1.0-1067  |  Mask R-CNN: 1.0-1070  |  ResNet50 v1.5: 1.0-1076  |  SSD: 1.0-1072  |  RNN-T: 1.0-1074  |  3D U-Net: 1.0-1071  |  MiniGo: 1.0-1075MLPerf name and logo are trademarks. For more information, see www.mlperf.org.This is the second MLPerf training round featuring NVIDIA A100 GPUs. Our continual year-over-year improvement on the same hardware is a lively testament to the strength of the NVIDIA platform and commitment to continuous software improvement. As in previous MLPerf rounds, NVIDIA engineers developed a host of innovations to achieve these new levels of performance:The CUDA graph and SHARP enhancements enabled us to increase our scale to a record number of 4096 GPUs used to solve a single AI network.This post provides insights into many of the optimizations used to deliver the outstanding scale and performance. Many of these improvements are available on NGC, which is the hub for NVIDIA GPU-optimized software. You can realize the benefits of these optimizations in your real-world applications, instead of just observing better benchmark scores from the sideline.Large-scale training requires system hardware and software to be precisely tuned to work together and support the unique performance requirements that arise at scale. NVIDIA made major advances on both dimensions, which are now available for production use.On the system side, the key building block of our at-scale training is the NVIDIA DGX SuperPOD. DGX SuperPOD is the culmination of years of expertise in HPC and AI data centers. It is based on the NVIDIA DGX A100 with the latest NVIDIA A100 Tensor Core GPU, third-generation NVIDIA NVLink, NVSwitch, and the NVIDIA ConnectX-6 VPI 200 Gbps HDR InfiniBand. These were combined to make Selene a top 5 supercomputer in the Top 500 supercomputer list, with the following components:On the software side, the NGC container release v. 21.05 enhances and enables several capabilities:In this section, we dive into the optimizations for selected individual MLPerf workloads.Recommendation is arguably the most pervasive AI workload in data centers today. The NVIDIA MLPerf DLRM submission was based on HugeCTR, a GPU-accelerated recommendation framework that is part of the NVIDIA Merlin open beta framework. The HugeCTR v3.1 beta release added the following optimizations:One of the major challenges in scaling DLRM to multiple nodes is the ~10x difference in per-GPU all-to-all bandwidth between NVLink and Infiniband. This makes the embedding exchange between nodes a significant bottleneck during training.To counteract this, HugeCTR implemented hybrid embedding, a novel embedding design that deduplicates the categories in a batch before doing the embedding weight exchange in the forward pass. It also reduces the gradients locally before doing gradient exchange in the backward pass.For efficient deduplication, the hybrid embedding maps the categories to frequent and infrequent embeddings based on the statistical access frequency of categories. The frequent embedding is implemented in a data-parallel fashion that takes away most of the replicated categories in a batch, reducing the embedding exchange traffic. Infrequent embedding follows the distributed model parallel-embedding paradigm. This enables DLRM to scale to multiple nodes with unprecedented efficiency.All-to-all and all-reduce collective latencies play a significant role in scaling efficiency. Multinode all-to-all throughput for small message sizes was limited by the Infiniband message rate. To mitigate this, HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange.You can optimize internode all-to-all and all-reduce latencies further:Intranode all-reduce is also optimized using a single-shot reduction algorithm as opposed to ring.Frequent embedding all-reduce and MLP all-reduce are fused into a single all-reduce operation to save on exposed all-reduce latency.Input pipeline plays a significant role in training performance. To achieve peak I/O throughput, HugeCTR implemented a fully asynchronous data reader using the Linux asynchronous I/O library (AIO). Because hybrid embedding requires the whole batch to be present on all the GPUs, direct host-to-device (H2D) for each GPU would make PCIe a bottleneck. So, the data is copied onto the GPUs using a hierarchical approach, by first doing a H2D over PCIe and then a broadcast over NVLink.Moreover, H2D traffic from data readers may interfere with internode all-to-all and all-reduce traffic over PCIe. So, HugeCTR implements intelligent data-reader scheduling to avoid such interference.Because the bottom MLP has no data dependencies with embedding, several components of the bottom MLP could be overlapped with embedding for efficient utilization of GPU resources.HugeCTR implemented a fused, fully connected layer that made use of cublasLt GEMM fusions:To reduce launch latencies and prevent PCIe interference between kernel launches, data-reader, and communication traffic, all DLRM compute and communication kernels are designed to be stream-capturable. The whole training iteration is captured into a single CUDA graph.With the preceding optimizations, we scaled to multiple nodes and completed the DLRM training task in just under a minute on 14 DGX-A100 nodes. This is a 3.3x speedup compared to the previous v0.7 submission.BERT is arguably one of the most important workloads in the NLP domain today. In the MLPerf v1.0 round, we improved upon our v0.7 submission with the following optimizations:The size of the activation tensors inside the multihead attention block grows with the square of the sequence length. This results in increased memory footprint, as well as longer runtimes due to the accompanying memory access operations. We fused softmax, masking, and dropout operations into a single kernel both in forward pass and backward pass. By doing so, we avoided several memory access operations for large activation tensors inside the multihead attention block, which resulted in a meaningful performance boost.For more information, see SelfMultiheadAttn in the NVIDIA Apex library.In this MLPerf round, we implemented distributed LAMB. In distributed LAMB, the gradients are first split across eight GPUs within each DGX-A100 node. This is followed by an all-reduce between the nodes in eight separate groups. After this operation, each GPU has one of eight chunks that constitute the all-reduced gradient tensor, and the LAMB optimizer is run on 1/8th of the full gradient tensor.When necessary, gradient norms are computed by computing local norms and performing an all-reduce operation. After the optimizer, an intranode all-gather operation is performed at each node, so that each GPU has the full updated parameter tensor. Execution is continued with the forward pass of the next iteration.Distributed LAMB substantially improves performance both for single-node and multinode configurations. For more information, see DistributedFusedLAMB in the Apex library.There are cases where the GPU execution depends on some value that is stored or calculated on the CPU. An example is when a specific tensor has a varying size that depends on the computation for each iteration. Because tensor size information is kept on the CPU, there must be a synchronization between GPU and CPU to pass the tensor size information for proper buffer allocation.Our solution was using a tensor with fixed size, but indicating which elements are valid using a separate Boolean mask. With this approach, no CPU-GPU synchronization was needed, as the tensor sizes are known. When a subsequent computation must know the real size of the tensor, as for an averaging operation, the elements of the Boolean mask can be summed on the GPU.Even though this approach resulted in slightly more access to GPU memory, it is much faster than having CPU synchronization in the critical path. This optimization resulted in a significant performance boost for small local batch size, which is the case for our max-scale configuration. This is because CPU synchronizations can’t keep up with fast GPU execution for small batch sizes.     Another source of CPU-GPU synchronization is the data that is kept on CPU, such as learning rate or potentially other optimizer states. We kept all the optimizer states on the GPU for distributed LAMB to achieve synchronization-free execution.As a result of these optimizations, we eliminated all the synchronizations between CPU and GPU during a training cycle. The only synchronizations are the ones that happen at the evaluation points, to log the evaluation accuracy in a file in real time for every evaluation point.Traditionally, CPU launches each GPU kernel individually. In general, even though GPU kernels do more work for large batch sizes, CPU kernel launch work and related CPU overheads stay fixed, barring the variations in CPU scheduling. As a result, for small local batch sizes, CPU overhead can become a significant performance bottleneck. This is what happened in our max-scale BERT configuration in MLPerf.On top of that, when CPU execution becomes a bottleneck, variations in CPU execution result in different runtimes across all GPUs for each iteration. This introduces a significant synchronization overhead when the workload is scaled to many GPUs (4096 GPUs in this case). Each GPU synchronizes every iteration for gradient reductions, and iteration time is determined by the slowest worker.     CUDA Graphs is a feature that enables launching an entire sequence of kernels at one time, eliminating CPU overheads between kernel executions. CUDA Graphs recently became available in PyTorch. By graph capturing the model, we eliminated CPU overhead and the accompanying synchronization overhead. The CUDA Graphs implementation resulted in a 1.7x performance boost just by itself for our max-scale BERT configuration.SHARP improved the performance of collectives significantly for BERT, especially for our max-scale configuration. End-to-end performance boost from SHARP is 17% for this BERT configuration. ResNet-50 is the veteran among MLPerf workloads. In this edition of MLPerf, we continue to optimize ResNet by improving Conv+BN+ReLu fusion kernels in CuDNN, along with the following optimizations:At large scales (>128 nodes) for ResNet-50, we reduced the local batch size per GPU to extremely small values. This often results in sub-20-ms iteration time. To reduce the overhead of the data pipeline, we introduced the input batch multiplier (IBM). DALI throughput is higher at large batch sizes than smaller batch sizes. To take advantage of this fact, we created super batches that are much larger than the local batch size. For each iteration, we then derived the needed samples from these super batches, increasing the DALI processing throughput and reducing the data pipeline overhead.At these small iteration times, gapless and continuous execution is the key to perfect scaling. Pre-allocating DALI buffers through hints is another feature that we introduced to reduce the overhead of dynamic GPU memory allocation while exploring the dataset.For ResNet-50, batch norm (BN) is a significant portion of the network’s iteration time. We optimized the fused BN+ReLu and BN+Add+ReLu kernels in MXNet through vectorization, cache-friendly memory traversals, and reducing quantization.The new MXNet dependency engine provides an asynchronous approach to scheduling work on the GPU, reducing the host (CPU) overhead and jitter such as overhead arising from MXNet and Horovord handshake.In the new dependency engine, the operation updates the dependency as soon as the work is scheduled on the GPU, not when the work is finished. It is the subsequent operation that must perform the synchronization to ensure correctness. This is further enhanced by removing the need for synchronization and using cudaStreamWait events to manage dependencies.U-Net3D is one of the two new workloads in this round of MLPerf training. We used the following optimizations:In 3D U-Net, because the sample number in the training dataset is relatively small, there is a fundamental limit to how much it can be scaled with naive data parallelism. To break that limit, we used spatial parallelism to split a single image across eight GPUs. At the end of the backward propagation, the gradients from each partition can be all-reduced as usual to get the resultant gradients, which can then be used to calculate the weight gradients.The naive approach to implementing spatial parallel convolution is to transfer the halo information from the neighboring GPU before running the convolution. However, to increase efficiency, we implemented a different scheme, in which we transfer the halo from the neighboring GPU in parallel to running the main inner convolutions. The error term to this main convolution is calculated independently using the halo and added to get the result. By hiding the transfer costs, we saw much better scaling efficiency than with the naive approach.For the backward propagation, similarly, the halos needed for the dgrad operation are transferred in parallel with the computation of the weight gradients and data gradients. The halos transferred for the data gradients are then reused for computing the correction terms for both weight and data gradients.3D U-Net has a bottleneck region in the middle of the network with much smaller activation sizes, which are not suited for spatial parallelism. We used a hybrid approach where we used spatial parallelism only for the layers that benefit from it. We gathered the activations for the sample from all GPUs right before this bottleneck region and executed these layers serially on each GPU. We split the work among the GPUs again when cooperating became beneficial. This made sure that we made the best choice separately for each region of the network.Evaluation contributes a significant amount of time in the reference code. Because evaluation can be run concurrently with training, we assigned dedicated nodes for running just evaluation.To hide the evaluation behind the training cycle entirely, we used spatial parallelism to speed up the validation step. In addition, as evaluation uses the same set of images, the images were loaded only one time and then cached in the GPU memory.Because the evaluation doesn’t start until a third of the way through the training, the evaluation nodes have enough time to load, process, and cache the dataset, as well as initialize all required libraries.At the end of the training cycle, training nodes use InfiniBand to transfer the model quickly to the evaluation nodes and continue running subsequent training iterations. The evaluation nodes run evaluation after the model parameters are transferred. At the end of the evaluation, the evaluation node communicates to the training nodes if the target accuracy is reached.The number of evaluation nodes added are just enough to hide the entire evaluation cycle behind the training cycle.We optimized the data loader in two ways: optimizing the augmentations and caching the dataset.Augmentations: 3D U-Net requires heavy augmentation due to the small size of the dataset. One of the most expensive operations is something that we call “biased crop”. On contrary to a random crop, biased crop selects regions with a positive label with a given probability. This requires heavy computations of 3D-connected components on labels every time the expensive path is selected. To avoid calculating the connected components every time that the sample is loaded, the result is cached in the host and reused so that it is calculated only one time.Data loading: As the training gets faster with the new features, the I/O starts to show up as the bottleneck. To alleviate this, we cached the entire image dataset in the GPU memory. This removes the PCIe and I/O from the critical data loader path. While the images are loaded from the large and high-bandwidth GPU memory, the labels are loaded from the CPU to perform augmentations.Because the Channels-Last layout is more efficient for convolution kernels, native support for the Channels-Last format was added in MXNet. This avoids any additional transposes needed in the model to take advantage of highly efficient GPU kernels.3D U-Net has multiple encoder and decoder layers with small channel counts. Using a typical tile size of 256×64 for the kernels used in these operations results in significant tile-size quantization effects. To optimize this, cuDNN added kernels optimized for smaller tile sizes with better cache reuse. This helped 3D U-Net achieve better compute utilization. Apart from these optimizations, 3D U-Net benefited from the optimized BatchNorm + ReLu activation kernel. The BatchNorm kernel was run repeatedly with a BatchSize value of 1 to get the Instance-Norm functionality. The asynchronous dependency engine implemented in MXNet, CUDA Graphs, and SHARP also helped performance significantly.With the array of optimizations made for 3D U-Net, we scaled to 100 DGX A100 nodes (800 GPUs), with training running on 80 nodes (640 GPUs) and evaluation running on 20 nodes (160 GPUs). The max-scale configuration of 100 nodes got over 9.7x speedup as compared to the single-node configuration.This is the fourth time that the lightweight SSD has been featured in MLPerf. In this round, the evaluation schedule was changed to happen every fifth epoch, starting from the first. In previous rounds, the evaluation scheduled started from the 40th epoch. Even with the extra computational requirement, we sped up our submissions time by more than x1.6.SSD consists of many smaller convolution layers. The benchmark was particularly affected by the improvements to the MXNet dependency engine, CUDA Graphs, and the enablement of SHARP, as discussed earlier.The training time of a deep learning model is a multivariable function. In its most basic form, the equation is as follows:The goal is to minimize  where both it and  and  are functions of the batch size. is a monotonically non-decreasing function. Batch sizes are more computationally efficient, but they take more time per iteration.On the other hand, , up to a certain batch size, is a monotonically nonincreasing function. Larger batch sizes require fewer iterations to converge because the model sees more images per iteration.Compared to the v0.7 submission where we used a batch size of 2048, the v1.0 batch size was 3072, which required 22% fewer iterations. Because the larger iteration was only 20% slower, the result was an 8% faster time to convergence.In this example, going to a batch size of 4096 instead of 3072 would’ve resulted in a longer training time. The 11% fewer iterations didn’t make up for the extra 20% run time per iteration.Evaluation can be broken into two phases:The new evaluation in v1.0 adds eight validation cycles to the base submission. Worse, improvements to the epoch train time means that scoring needs to take less than 2 seconds or the training time of five epochs. Otherwise, it won’t be fully hidden and any training time improvements are pointless.To improve inference time, we made sure that the inference graph was static. We improved the nonmaximum suppression implementation and moved the Boolean mask, used to filter negative detections, to outside the graph. Static graphs save memory reallocation time and make switching between training and inference contexts faster.For scoring, we used nv-cocoapi, which is a C++ implementation of cocoapi and 60x times faster. For v1.0, we improved the nv-cocoapi performance by 2x with multithreaded results accumulation, faster indices sorting, and caching the ground truth data structures.We optimized object detection with the following techniques: Deep learning frameworks use GPUs to accelerate computations, but a significant amount of code still runs on CPU cores. CPU codes process metadata like tensor shapes to prepare arguments needed to launch GPU kernels. Processing metadata is a fixed cost while the cost of the computational work done by the GPUs is positively correlated with batch size. For large batch sizes, CPU overhead is a negligible percentage of total run time cost. At small batch sizes, CPU overhead can become larger than GPU run time. When that happens, GPUs go idle between kernel calls.This issue can be identified on an Nsight Systems timeline plot. The plot below shows the “backbone” portion of Mask R-CNN with per-GPU batch size of 1 before graphing. The green portion shows CPU load while the blue portion shows GPU load. In this profile, you see that the CPU is maxed out at 100% load while GPU is idle most of the time. There is a lot of empty space between GPU kernels.CUDA graph is a tool that can automatically eliminate CPU overhead when tensor shapes are static. A complete graph of all kernel calls is captured during the first step. In subsequent steps, the entire graph is launched with a single operation, eliminating all the CPU overhead. PyTorch now has support for CUDA graph, we used this to speed up Mask R-CNN for MLPerf 1.0.With graphing, we see that the GPU kernels are tightly packed and GPU utilization remains high. The graphed portion now runs in 6 ms instead of 31 ms, a speedup of 5x. We mostly just graphed the ResNet backbone, not the entire model. Even then, we saw >2x uplift for the entire benchmark just from graphing.There are many PyTorch modules that make the main process wait until the GPU has finished all previously launched kernels. This can be detrimental to performance, because it makes the CPU sit idle when it could be working on launching more kernels. The CPU can get ahead of the GPU in low overhead segments and start launching kernels from succeeding segments. As long as total CPU overhead is less than total GPU kernel time, the CPU never becomes the bottleneck, but this breaks when sync points are introduced. Also, model segments that have sync points cannot be graphed with CUDA graph, so removing syncs is important.We did some of this work for MLPerf 1.0. For instance, torch.randperm was rewritten to use CUB instead of Thrust because the latter is a synchronous C++ template library. These improvements are available in the latest NGC container.Removing all the syncs improved the uplift that we saw from CUDA Graphs from 1.6x to 2.5x.Our MLPerf 0.7 submission did asynchronous evaluation, but it wasn’t fast enough to keep up with training after optimizations. Evaluation took 18 seconds per epoch, and 4 seconds of that was fully exposed time. Without changes to the evaluation code, our at-scale submission would have clocked in about 100 seconds slower.Of the three evaluation phases, inference and prep account for all the exposed time. To speed up inference, we cached the test images in GPU memory, as they never change. We moved the prep phase to a pool of background processes, as each sample in the test dataset can be processed independently. We scored segmentation masks and boxes simultaneously in two background processes. These optimizations reduced evaluation time to ~4 seconds per epoch.This component loads and augments images during training. In our MLPerf 0.7 submission, all data loading work was done by CPU cores. The old dataloader was not fast enough to keep up with training after optimizations. To remedy that, we developed a hybrid dataloader.The hybrid dataloader decodes the images on CPU and then does image augmentation work on GPU using Torchvision. To hide the cost of dataloading completely, we moved the load next image call in the main training loop after the loss backward call. The CPUs are idle for several milliseconds after the loss backward call because of theCUDA Graph launch. This is more than enough time to decode the next image. After the GPUs finish back propagation, they sit idle while the optimizer does all-reduce on the gradients. During this idle time, the dataloader does image augmentation work.The basic building block of ResNet50 is a three layer-stack composed of a convolution, batch norm, and activation function. For Mask R-CNN, the batch norm is frozen, which means that both the batch norm and activation function are pointwise operations that can be fused. In previous rounds, we used the PyTorch JIT fuser to fuse the two pointwise operations. Thanks to the new fusion engine in CUDNN v8, we improved on this by fusing the pointwise operations with the convolution. The flexible API of the fusion engine also enabled us to fuse all three basic layers under one autograd function. That let us work around a limitation of the fuser by doing asymmetric fusions in backpropagation for an even bigger performance boost.Speech recognition with RNN-T is the other new workload in this round of MLPerf training. We used the following optimizations:RNN-T uses a special loss function that we call transducer loss function. The algorithm that computes the loss is iterative in nature. A naive implementation is often inefficient due to the irregular memory access pattern and the exposed long memory read latency.To overcome this difficulty, we developed apex.contrib.transducer.TransducerLoss. It uses a diagonal-wave-front-like computing paradigm to exploit the parallelism in the algorithm. Shared memory and registers are used extensively to cache the data exchanged between iterations. The loss function also employs prefetch to hide the memory access latency.Another component that is often found in a transducer-type network is the transducer joint operation. To accelerate this operation, we developed apex.contrib.transducer.TransducerJoint. This Apex extension is not only faster than its native PyTorch counterpart, but also enables output packing, reducing the workload seen by following layers.Figure 17 shows the packing operation by the Apex transducer joint. In the baseline joint operation, the paddings from the input sequences are carried over to the output, as the joint operation is oblivious to the input padding. In the Apex transducer joint operation, the paddings are removed at the output, reducing the size of the tensor fed to the following operations.To reduce computations of LSTMs that are wasted on paddings, we split batch processing into two phases (Figure 18). In the first pass, all the samples in the minibatch up to certain time steps (enclosed by the black boxes) are evaluated. Half of the samples in the minibatch are completed in the first pass. The remaining time steps of the other half of the samples (enclosed by the red boxes) are evaluated in the second pass. The regions enclosed by blue boxes represent the savings from batch splitting.The black dashed line in Figure 18 estimates the workload seen by the GPUs. Because the batch size is halved for the second pass, the workload seen by the GPU is roughly halved. In multi-GPU training, it is often the slowest GPU that limits the training throughput. The dashed line is obtained from the GPU with most workloads.To mitigate this load imbalance, we employed a technique called presorting, where samples in a minibatch are sorted based on their sequence lengths. The longest and shortest sequences are placed on the same GPU to balance the workload. The intuition behind this is that GPUs with long sequences are likely to be the bottleneck. Therefore, short sequences should be placed on these GPUs as well to maximize the benefit of sequence splitting.RNN-T has an interesting network structure where the LSTMs deal with relatively small tensors, whereas the joint net takes much larger tensors. To enable LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory capacity by having a huge tensor in the joint net, we employed a technique called batch splitting (Figure 17). We used a reasonably large batch size so that LSTMs achieved a decent GPU utilization. In contrast, joint net operates on a portion of the batch and loops through those subbatches one by one.In Figure 19, a batch splitting factor of 2 is used. In this case, the batch sizes of the inputs to the LSTMs and the joint net are B and B/2, respectively. Because all the tensors generated by the joint net, except the gradients for the weights, are no longer needed after the backpropagation is completed, they can be released and create room for the next subbatch in the loop.Other than accelerating training, evaluation of RNN-T has also been scrutinized. The evaluation of RNN-T is iterative in nature and the evaluation of the predict network is performed step by step. Each sample in a batch might pick different code paths in the same time step, depending on the execution results. Because of these, a naive implementation leads to a low GPU utilization rate and a long evaluation time that is comparable to the training itself.To overcome these difficulties, we performed two categories of optimizations in the evaluation. The first optimization performed evaluation in batch mode and take care of the different control flows in a batch with predicates. The second optimization graphed the main RNN-T evaluation loop, which consists of many short GPU kernels. We also used loop unrolling and overlapping CPU-GPU communication with GPU execution to amortize associated overheads. The optimized evaluation was more than 100x faster than the reference code for the single-node configuration, and more than 30x faster for the max-scale configuration.LSTM is the main building block of RNN-T. A large portion of the end-to-end network time is spent on LSTMs. In cuDNN v8, the performance of LSTMs has been heavily optimized. For example, better horizontal fusion algorithms and heuristics were applied to the GEMMs in LSTM cells and drop out in between LSTM layers, improving performance and reducing the overhead from dropout.MLPerf v1.0 showcases the continuous innovation happening in the AI domain. The NVIDIA AI platform delivers leadership performance with tight integration of hardware, data center technologies, and software to realize the full potential of AI.In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. The NVIDIA platform excels in both performance and usability, offering a single leadership platform from data center to edge to cloud.All software used for NVIDIA submissions is available from the MLPerf repository, to enable you to reproduce our benchmark results. We constantly add these cutting-edge MLPerf improvements into our deep learning frameworks containers available on NGC, our software hub for GPU-optimized applications.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	MLPerf is an industry-wide AI consortium tasked with developing a suite of performance benchmarks that cover a range of leading AI workloads widely in use. The latest MLPerf v1.0 training round includes vision, language and recommender systems, and reinforcement learning tasks. It is continually evolving to reflect the state-of-the-art AI applications.NVIDIA submitted MLPerf v1.0 training results for all eight benchmarks, as is our tradition. In fact, systems built upon the NVIDIA AI platform are the only commercially available systems to make submissions across the board.Compared to our previous MLPerf v0.7 submissions, we improved up to 2.1x on a chip-to-chip basis and up to 3.5x at scale. We set 16 performance records with eight on a per-chip basis and eight at-scale training in the commercially available solutions category.(*) Per Accelerator performance for A100 computed using NVIDIA 8xA100 server time-to-train and multiplying it by 8 | Per Chip Performance comparisons to others arrived at by comparing performance at the closest similar scale. Per-Accelerator Records:  BERT: 1.0-1033  | DLRM: 1.0-1037  |  Mask R-CNN: 1.0-1057  |  ResNet50 v1.5: 1.0-1038  |  SSD: 1.0-1038  |  RNN-T: 1.0-1060  |  3D U-Net: 1.0-1053  |  MiniGo: 1.0-1061Max Scale Records:  BERT: 1.0-1077  | DLRM: 1.0-1067  |  Mask R-CNN: 1.0-1070  |  ResNet50 v1.5: 1.0-1076  |  SSD: 1.0-1072  |  RNN-T: 1.0-1074  |  3D U-Net: 1.0-1071  |  MiniGo: 1.0-1075MLPerf name and logo are trademarks. For more information, see www.mlperf.org.This is the second MLPerf training round featuring NVIDIA A100 GPUs. Our continual year-over-year improvement on the same hardware is a lively testament to the strength of the NVIDIA platform and commitment to continuous software improvement. As in previous MLPerf rounds, NVIDIA engineers developed a host of innovations to achieve these new levels of performance:The CUDA graph and SHARP enhancements enabled us to increase our scale to a record number of 4096 GPUs used to solve a single AI network.This post provides insights into many of the optimizations used to deliver the outstanding scale and performance. Many of these improvements are available on NGC, which is the hub for NVIDIA GPU-optimized software. You can realize the benefits of these optimizations in your real-world applications, instead of just observing better benchmark scores from the sideline.Large-scale training requires system hardware and software to be precisely tuned to work together and support the unique performance requirements that arise at scale. NVIDIA made major advances on both dimensions, which are now available for production use.On the system side, the key building block of our at-scale training is the NVIDIA DGX SuperPOD. DGX SuperPOD is the culmination of years of expertise in HPC and AI data centers. It is based on the NVIDIA DGX A100 with the latest NVIDIA A100 Tensor Core GPU, third-generation NVIDIA NVLink, NVSwitch, and the NVIDIA ConnectX-6 VPI 200 Gbps HDR InfiniBand. These were combined to make Selene a top 5 supercomputer in the Top 500 supercomputer list, with the following components:On the software side, the NGC container release v. 21.05 enhances and enables several capabilities:In this section, we dive into the optimizations for selected individual MLPerf workloads.Recommendation is arguably the most pervasive AI workload in data centers today. The NVIDIA MLPerf DLRM submission was based on HugeCTR, a GPU-accelerated recommendation framework that is part of the NVIDIA Merlin open beta framework. The HugeCTR v3.1 beta release added the following optimizations:One of the major challenges in scaling DLRM to multiple nodes is the ~10x difference in per-GPU all-to-all bandwidth between NVLink and Infiniband. This makes the embedding exchange between nodes a significant bottleneck during training.To counteract this, HugeCTR implemented hybrid embedding, a novel embedding design that deduplicates the categories in a batch before doing the embedding weight exchange in the forward pass. It also reduces the gradients locally before doing gradient exchange in the backward pass.For efficient deduplication, the hybrid embedding maps the categories to frequent and infrequent embeddings based on the statistical access frequency of categories. The frequent embedding is implemented in a data-parallel fashion that takes away most of the replicated categories in a batch, reducing the embedding exchange traffic. Infrequent embedding follows the distributed model parallel-embedding paradigm. This enables DLRM to scale to multiple nodes with unprecedented efficiency.All-to-all and all-reduce collective latencies play a significant role in scaling efficiency. Multinode all-to-all throughput for small message sizes was limited by the Infiniband message rate. To mitigate this, HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange.You can optimize internode all-to-all and all-reduce latencies further:Intranode all-reduce is also optimized using a single-shot reduction algorithm as opposed to ring.Frequent embedding all-reduce and MLP all-reduce are fused into a single all-reduce operation to save on exposed all-reduce latency.Input pipeline plays a significant role in training performance. To achieve peak I/O throughput, HugeCTR implemented a fully asynchronous data reader using the Linux asynchronous I/O library (AIO). Because hybrid embedding requires the whole batch to be present on all the GPUs, direct host-to-device (H2D) for each GPU would make PCIe a bottleneck. So, the data is copied onto the GPUs using a hierarchical approach, by first doing a H2D over PCIe and then a broadcast over NVLink.Moreover, H2D traffic from data readers may interfere with internode all-to-all and all-reduce traffic over PCIe. So, HugeCTR implements intelligent data-reader scheduling to avoid such interference.Because the bottom MLP has no data dependencies with embedding, several components of the bottom MLP could be overlapped with embedding for efficient utilization of GPU resources.HugeCTR implemented a fused, fully connected layer that made use of cublasLt GEMM fusions:To reduce launch latencies and prevent PCIe interference between kernel launches, data-reader, and communication traffic, all DLRM compute and communication kernels are designed to be stream-capturable. The whole training iteration is captured into a single CUDA graph.With the preceding optimizations, we scaled to multiple nodes and completed the DLRM training task in just under a minute on 14 DGX-A100 nodes. This is a 3.3x speedup compared to the previous v0.7 submission.BERT is arguably one of the most important workloads in the NLP domain today. In the MLPerf v1.0 round, we improved upon our v0.7 submission with the following optimizations:The size of the activation tensors inside the multihead attention block grows with the square of the sequence length. This results in increased memory footprint, as well as longer runtimes due to the accompanying memory access operations. We fused softmax, masking, and dropout operations into a single kernel both in forward pass and backward pass. By doing so, we avoided several memory access operations for large activation tensors inside the multihead attention block, which resulted in a meaningful performance boost.For more information, see SelfMultiheadAttn in the NVIDIA Apex library.In this MLPerf round, we implemented distributed LAMB. In distributed LAMB, the gradients are first split across eight GPUs within each DGX-A100 node. This is followed by an all-reduce between the nodes in eight separate groups. After this operation, each GPU has one of eight chunks that constitute the all-reduced gradient tensor, and the LAMB optimizer is run on 1/8th of the full gradient tensor.When necessary, gradient norms are computed by computing local norms and performing an all-reduce operation. After the optimizer, an intranode all-gather operation is performed at each node, so that each GPU has the full updated parameter tensor. Execution is continued with the forward pass of the next iteration.Distributed LAMB substantially improves performance both for single-node and multinode configurations. For more information, see DistributedFusedLAMB in the Apex library.There are cases where the GPU execution depends on some value that is stored or calculated on the CPU. An example is when a specific tensor has a varying size that depends on the computation for each iteration. Because tensor size information is kept on the CPU, there must be a synchronization between GPU and CPU to pass the tensor size information for proper buffer allocation.Our solution was using a tensor with fixed size, but indicating which elements are valid using a separate Boolean mask. With this approach, no CPU-GPU synchronization was needed, as the tensor sizes are known. When a subsequent computation must know the real size of the tensor, as for an averaging operation, the elements of the Boolean mask can be summed on the GPU.Even though this approach resulted in slightly more access to GPU memory, it is much faster than having CPU synchronization in the critical path. This optimization resulted in a significant performance boost for small local batch size, which is the case for our max-scale configuration. This is because CPU synchronizations can’t keep up with fast GPU execution for small batch sizes.     Another source of CPU-GPU synchronization is the data that is kept on CPU, such as learning rate or potentially other optimizer states. We kept all the optimizer states on the GPU for distributed LAMB to achieve synchronization-free execution.As a result of these optimizations, we eliminated all the synchronizations between CPU and GPU during a training cycle. The only synchronizations are the ones that happen at the evaluation points, to log the evaluation accuracy in a file in real time for every evaluation point.Traditionally, CPU launches each GPU kernel individually. In general, even though GPU kernels do more work for large batch sizes, CPU kernel launch work and related CPU overheads stay fixed, barring the variations in CPU scheduling. As a result, for small local batch sizes, CPU overhead can become a significant performance bottleneck. This is what happened in our max-scale BERT configuration in MLPerf.On top of that, when CPU execution becomes a bottleneck, variations in CPU execution result in different runtimes across all GPUs for each iteration. This introduces a significant synchronization overhead when the workload is scaled to many GPUs (4096 GPUs in this case). Each GPU synchronizes every iteration for gradient reductions, and iteration time is determined by the slowest worker.     CUDA Graphs is a feature that enables launching an entire sequence of kernels at one time, eliminating CPU overheads between kernel executions. CUDA Graphs recently became available in PyTorch. By graph capturing the model, we eliminated CPU overhead and the accompanying synchronization overhead. The CUDA Graphs implementation resulted in a 1.7x performance boost just by itself for our max-scale BERT configuration.SHARP improved the performance of collectives significantly for BERT, especially for our max-scale configuration. End-to-end performance boost from SHARP is 17% for this BERT configuration. ResNet-50 is the veteran among MLPerf workloads. In this edition of MLPerf, we continue to optimize ResNet by improving Conv+BN+ReLu fusion kernels in CuDNN, along with the following optimizations:At large scales (>128 nodes) for ResNet-50, we reduced the local batch size per GPU to extremely small values. This often results in sub-20-ms iteration time. To reduce the overhead of the data pipeline, we introduced the input batch multiplier (IBM). DALI throughput is higher at large batch sizes than smaller batch sizes. To take advantage of this fact, we created super batches that are much larger than the local batch size. For each iteration, we then derived the needed samples from these super batches, increasing the DALI processing throughput and reducing the data pipeline overhead.At these small iteration times, gapless and continuous execution is the key to perfect scaling. Pre-allocating DALI buffers through hints is another feature that we introduced to reduce the overhead of dynamic GPU memory allocation while exploring the dataset.For ResNet-50, batch norm (BN) is a significant portion of the network’s iteration time. We optimized the fused BN+ReLu and BN+Add+ReLu kernels in MXNet through vectorization, cache-friendly memory traversals, and reducing quantization.The new MXNet dependency engine provides an asynchronous approach to scheduling work on the GPU, reducing the host (CPU) overhead and jitter such as overhead arising from MXNet and Horovord handshake.In the new dependency engine, the operation updates the dependency as soon as the work is scheduled on the GPU, not when the work is finished. It is the subsequent operation that must perform the synchronization to ensure correctness. This is further enhanced by removing the need for synchronization and using cudaStreamWait events to manage dependencies.U-Net3D is one of the two new workloads in this round of MLPerf training. We used the following optimizations:In 3D U-Net, because the sample number in the training dataset is relatively small, there is a fundamental limit to how much it can be scaled with naive data parallelism. To break that limit, we used spatial parallelism to split a single image across eight GPUs. At the end of the backward propagation, the gradients from each partition can be all-reduced as usual to get the resultant gradients, which can then be used to calculate the weight gradients.The naive approach to implementing spatial parallel convolution is to transfer the halo information from the neighboring GPU before running the convolution. However, to increase efficiency, we implemented a different scheme, in which we transfer the halo from the neighboring GPU in parallel to running the main inner convolutions. The error term to this main convolution is calculated independently using the halo and added to get the result. By hiding the transfer costs, we saw much better scaling efficiency than with the naive approach.For the backward propagation, similarly, the halos needed for the dgrad operation are transferred in parallel with the computation of the weight gradients and data gradients. The halos transferred for the data gradients are then reused for computing the correction terms for both weight and data gradients.3D U-Net has a bottleneck region in the middle of the network with much smaller activation sizes, which are not suited for spatial parallelism. We used a hybrid approach where we used spatial parallelism only for the layers that benefit from it. We gathered the activations for the sample from all GPUs right before this bottleneck region and executed these layers serially on each GPU. We split the work among the GPUs again when cooperating became beneficial. This made sure that we made the best choice separately for each region of the network.Evaluation contributes a significant amount of time in the reference code. Because evaluation can be run concurrently with training, we assigned dedicated nodes for running just evaluation.To hide the evaluation behind the training cycle entirely, we used spatial parallelism to speed up the validation step. In addition, as evaluation uses the same set of images, the images were loaded only one time and then cached in the GPU memory.Because the evaluation doesn’t start until a third of the way through the training, the evaluation nodes have enough time to load, process, and cache the dataset, as well as initialize all required libraries.At the end of the training cycle, training nodes use InfiniBand to transfer the model quickly to the evaluation nodes and continue running subsequent training iterations. The evaluation nodes run evaluation after the model parameters are transferred. At the end of the evaluation, the evaluation node communicates to the training nodes if the target accuracy is reached.The number of evaluation nodes added are just enough to hide the entire evaluation cycle behind the training cycle.We optimized the data loader in two ways: optimizing the augmentations and caching the dataset.Augmentations: 3D U-Net requires heavy augmentation due to the small size of the dataset. One of the most expensive operations is something that we call “biased crop”. On contrary to a random crop, biased crop selects regions with a positive label with a given probability. This requires heavy computations of 3D-connected components on labels every time the expensive path is selected. To avoid calculating the connected components every time that the sample is loaded, the result is cached in the host and reused so that it is calculated only one time.Data loading: As the training gets faster with the new features, the I/O starts to show up as the bottleneck. To alleviate this, we cached the entire image dataset in the GPU memory. This removes the PCIe and I/O from the critical data loader path. While the images are loaded from the large and high-bandwidth GPU memory, the labels are loaded from the CPU to perform augmentations.Because the Channels-Last layout is more efficient for convolution kernels, native support for the Channels-Last format was added in MXNet. This avoids any additional transposes needed in the model to take advantage of highly efficient GPU kernels.3D U-Net has multiple encoder and decoder layers with small channel counts. Using a typical tile size of 256×64 for the kernels used in these operations results in significant tile-size quantization effects. To optimize this, cuDNN added kernels optimized for smaller tile sizes with better cache reuse. This helped 3D U-Net achieve better compute utilization. Apart from these optimizations, 3D U-Net benefited from the optimized BatchNorm + ReLu activation kernel. The BatchNorm kernel was run repeatedly with a BatchSize value of 1 to get the Instance-Norm functionality. The asynchronous dependency engine implemented in MXNet, CUDA Graphs, and SHARP also helped performance significantly.With the array of optimizations made for 3D U-Net, we scaled to 100 DGX A100 nodes (800 GPUs), with training running on 80 nodes (640 GPUs) and evaluation running on 20 nodes (160 GPUs). The max-scale configuration of 100 nodes got over 9.7x speedup as compared to the single-node configuration.This is the fourth time that the lightweight SSD has been featured in MLPerf. In this round, the evaluation schedule was changed to happen every fifth epoch, starting from the first. In previous rounds, the evaluation scheduled started from the 40th epoch. Even with the extra computational requirement, we sped up our submissions time by more than x1.6.SSD consists of many smaller convolution layers. The benchmark was particularly affected by the improvements to the MXNet dependency engine, CUDA Graphs, and the enablement of SHARP, as discussed earlier.The training time of a deep learning model is a multivariable function. In its most basic form, the equation is as follows:The goal is to minimize  where both it and  and  are functions of the batch size. is a monotonically non-decreasing function. Batch sizes are more computationally efficient, but they take more time per iteration.On the other hand, , up to a certain batch size, is a monotonically nonincreasing function. Larger batch sizes require fewer iterations to converge because the model sees more images per iteration.Compared to the v0.7 submission where we used a batch size of 2048, the v1.0 batch size was 3072, which required 22% fewer iterations. Because the larger iteration was only 20% slower, the result was an 8% faster time to convergence.In this example, going to a batch size of 4096 instead of 3072 would’ve resulted in a longer training time. The 11% fewer iterations didn’t make up for the extra 20% run time per iteration.Evaluation can be broken into two phases:The new evaluation in v1.0 adds eight validation cycles to the base submission. Worse, improvements to the epoch train time means that scoring needs to take less than 2 seconds or the training time of five epochs. Otherwise, it won’t be fully hidden and any training time improvements are pointless.To improve inference time, we made sure that the inference graph was static. We improved the nonmaximum suppression implementation and moved the Boolean mask, used to filter negative detections, to outside the graph. Static graphs save memory reallocation time and make switching between training and inference contexts faster.For scoring, we used nv-cocoapi, which is a C++ implementation of cocoapi and 60x times faster. For v1.0, we improved the nv-cocoapi performance by 2x with multithreaded results accumulation, faster indices sorting, and caching the ground truth data structures.We optimized object detection with the following techniques: Deep learning frameworks use GPUs to accelerate computations, but a significant amount of code still runs on CPU cores. CPU codes process metadata like tensor shapes to prepare arguments needed to launch GPU kernels. Processing metadata is a fixed cost while the cost of the computational work done by the GPUs is positively correlated with batch size. For large batch sizes, CPU overhead is a negligible percentage of total run time cost. At small batch sizes, CPU overhead can become larger than GPU run time. When that happens, GPUs go idle between kernel calls.This issue can be identified on an Nsight Systems timeline plot. The plot below shows the “backbone” portion of Mask R-CNN with per-GPU batch size of 1 before graphing. The green portion shows CPU load while the blue portion shows GPU load. In this profile, you see that the CPU is maxed out at 100% load while GPU is idle most of the time. There is a lot of empty space between GPU kernels.CUDA graph is a tool that can automatically eliminate CPU overhead when tensor shapes are static. A complete graph of all kernel calls is captured during the first step. In subsequent steps, the entire graph is launched with a single operation, eliminating all the CPU overhead. PyTorch now has support for CUDA graph, we used this to speed up Mask R-CNN for MLPerf 1.0.With graphing, we see that the GPU kernels are tightly packed and GPU utilization remains high. The graphed portion now runs in 6 ms instead of 31 ms, a speedup of 5x. We mostly just graphed the ResNet backbone, not the entire model. Even then, we saw >2x uplift for the entire benchmark just from graphing.There are many PyTorch modules that make the main process wait until the GPU has finished all previously launched kernels. This can be detrimental to performance, because it makes the CPU sit idle when it could be working on launching more kernels. The CPU can get ahead of the GPU in low overhead segments and start launching kernels from succeeding segments. As long as total CPU overhead is less than total GPU kernel time, the CPU never becomes the bottleneck, but this breaks when sync points are introduced. Also, model segments that have sync points cannot be graphed with CUDA graph, so removing syncs is important.We did some of this work for MLPerf 1.0. For instance, torch.randperm was rewritten to use CUB instead of Thrust because the latter is a synchronous C++ template library. These improvements are available in the latest NGC container.Removing all the syncs improved the uplift that we saw from CUDA Graphs from 1.6x to 2.5x.Our MLPerf 0.7 submission did asynchronous evaluation, but it wasn’t fast enough to keep up with training after optimizations. Evaluation took 18 seconds per epoch, and 4 seconds of that was fully exposed time. Without changes to the evaluation code, our at-scale submission would have clocked in about 100 seconds slower.Of the three evaluation phases, inference and prep account for all the exposed time. To speed up inference, we cached the test images in GPU memory, as they never change. We moved the prep phase to a pool of background processes, as each sample in the test dataset can be processed independently. We scored segmentation masks and boxes simultaneously in two background processes. These optimizations reduced evaluation time to ~4 seconds per epoch.This component loads and augments images during training. In our MLPerf 0.7 submission, all data loading work was done by CPU cores. The old dataloader was not fast enough to keep up with training after optimizations. To remedy that, we developed a hybrid dataloader.The hybrid dataloader decodes the images on CPU and then does image augmentation work on GPU using Torchvision. To hide the cost of dataloading completely, we moved the load next image call in the main training loop after the loss backward call. The CPUs are idle for several milliseconds after the loss backward call because of theCUDA Graph launch. This is more than enough time to decode the next image. After the GPUs finish back propagation, they sit idle while the optimizer does all-reduce on the gradients. During this idle time, the dataloader does image augmentation work.The basic building block of ResNet50 is a three layer-stack composed of a convolution, batch norm, and activation function. For Mask R-CNN, the batch norm is frozen, which means that both the batch norm and activation function are pointwise operations that can be fused. In previous rounds, we used the PyTorch JIT fuser to fuse the two pointwise operations. Thanks to the new fusion engine in CUDNN v8, we improved on this by fusing the pointwise operations with the convolution. The flexible API of the fusion engine also enabled us to fuse all three basic layers under one autograd function. That let us work around a limitation of the fuser by doing asymmetric fusions in backpropagation for an even bigger performance boost.Speech recognition with RNN-T is the other new workload in this round of MLPerf training. We used the following optimizations:RNN-T uses a special loss function that we call transducer loss function. The algorithm that computes the loss is iterative in nature. A naive implementation is often inefficient due to the irregular memory access pattern and the exposed long memory read latency.To overcome this difficulty, we developed apex.contrib.transducer.TransducerLoss. It uses a diagonal-wave-front-like computing paradigm to exploit the parallelism in the algorithm. Shared memory and registers are used extensively to cache the data exchanged between iterations. The loss function also employs prefetch to hide the memory access latency.Another component that is often found in a transducer-type network is the transducer joint operation. To accelerate this operation, we developed apex.contrib.transducer.TransducerJoint. This Apex extension is not only faster than its native PyTorch counterpart, but also enables output packing, reducing the workload seen by following layers.Figure 17 shows the packing operation by the Apex transducer joint. In the baseline joint operation, the paddings from the input sequences are carried over to the output, as the joint operation is oblivious to the input padding. In the Apex transducer joint operation, the paddings are removed at the output, reducing the size of the tensor fed to the following operations.To reduce computations of LSTMs that are wasted on paddings, we split batch processing into two phases (Figure 18). In the first pass, all the samples in the minibatch up to certain time steps (enclosed by the black boxes) are evaluated. Half of the samples in the minibatch are completed in the first pass. The remaining time steps of the other half of the samples (enclosed by the red boxes) are evaluated in the second pass. The regions enclosed by blue boxes represent the savings from batch splitting.The black dashed line in Figure 18 estimates the workload seen by the GPUs. Because the batch size is halved for the second pass, the workload seen by the GPU is roughly halved. In multi-GPU training, it is often the slowest GPU that limits the training throughput. The dashed line is obtained from the GPU with most workloads.To mitigate this load imbalance, we employed a technique called presorting, where samples in a minibatch are sorted based on their sequence lengths. The longest and shortest sequences are placed on the same GPU to balance the workload. The intuition behind this is that GPUs with long sequences are likely to be the bottleneck. Therefore, short sequences should be placed on these GPUs as well to maximize the benefit of sequence splitting.RNN-T has an interesting network structure where the LSTMs deal with relatively small tensors, whereas the joint net takes much larger tensors. To enable LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory capacity by having a huge tensor in the joint net, we employed a technique called batch splitting (Figure 17). We used a reasonably large batch size so that LSTMs achieved a decent GPU utilization. In contrast, joint net operates on a portion of the batch and loops through those subbatches one by one.In Figure 19, a batch splitting factor of 2 is used. In this case, the batch sizes of the inputs to the LSTMs and the joint net are B and B/2, respectively. Because all the tensors generated by the joint net, except the gradients for the weights, are no longer needed after the backpropagation is completed, they can be released and create room for the next subbatch in the loop.Other than accelerating training, evaluation of RNN-T has also been scrutinized. The evaluation of RNN-T is iterative in nature and the evaluation of the predict network is performed step by step. Each sample in a batch might pick different code paths in the same time step, depending on the execution results. Because of these, a naive implementation leads to a low GPU utilization rate and a long evaluation time that is comparable to the training itself.To overcome these difficulties, we performed two categories of optimizations in the evaluation. The first optimization performed evaluation in batch mode and take care of the different control flows in a batch with predicates. The second optimization graphed the main RNN-T evaluation loop, which consists of many short GPU kernels. We also used loop unrolling and overlapping CPU-GPU communication with GPU execution to amortize associated overheads. The optimized evaluation was more than 100x faster than the reference code for the single-node configuration, and more than 30x faster for the max-scale configuration.LSTM is the main building block of RNN-T. A large portion of the end-to-end network time is spent on LSTMs. In cuDNN v8, the performance of LSTMs has been heavily optimized. For example, better horizontal fusion algorithms and heuristics were applied to the GEMMs in LSTM cells and drop out in between LSTM layers, improving performance and reducing the overhead from dropout.MLPerf v1.0 showcases the continuous innovation happening in the AI domain. The NVIDIA AI platform delivers leadership performance with tight integration of hardware, data center technologies, and software to realize the full potential of AI.In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. The NVIDIA platform excels in both performance and usability, offering a single leadership platform from data center to edge to cloud.All software used for NVIDIA submissions is available from the MLPerf repository, to enable you to reproduce our benchmark results. We constantly add these cutting-edge MLPerf improvements into our deep learning frameworks containers available on NGC, our software hub for GPU-optimized applications.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.60	There is an increasing demand for manufacturers to achieve high-quality control standards in their production processes. Traditionally, manufacturers have relied on manual inspection to guarantee product quality. However, manual inspection is expensive, often only covers a small sample of production, and ultimately results in production bottlenecks, lowered productivity, and reduced efficiency.By automating defect inspection with AI and computer vision, manufacturers can revolutionize their quality control processes. However, one major obstacle stands between manufacturers and full automation. Building an AI system and production-ready application is hard and typically requires a skilled AI team to train and fine-tune the model. The average manufacturer does not employ this expertise and resorts to using manual inspection.The goal of this project was to show how the NVIDIA Transfer Learning Toolkit (TLT) and a pretrained model can be used to quickly build more accurate quality control in the manufacturing process. This project was done without an army of AI specialists or data scientists. To see how effective the NVIDIA TLT is in training an AI system for commercial quality-control purposes, a publicly available dataset on the steel welding process was used to retrain a pretrained ResNet-18 model, from the NGC catalog, a GPU-optimized hub of AI and HPC software, using TLT. We compared the effort and resulting model’s accuracy to a model built from scratch on the dataset in a previously published work by a team of AI researchers.NVIDIA TLT is user-friendly and fast, and can be easily used by engineers who do not have AI expertise. We observed that NVIDIA TLT was faster to set up and produced more accurate results, posting a macro average F1 score of 97% compared to 78% from previously published “built from scratch” work on the dataset.  This post explores how NVIDIA TLT can quickly and accurately train AI models, showing how AI and transfer learning can transform how image and video analysis and industrial processes are deployed.NVIDIA TLT, a core component of the NVIDIA Train, Adapt, and Optimize (TAO) platform, follows the zero-coding paradigm tofast-track AI development. TLT comes with a set of ready-to-use Jupyter notebooks, Python scripts, and configuration specifications with default parameter values that enable you to start training and fine-tuning your datasets quickly and easily.To get started with the NVIDIA TLT, we followed these Quick Start Guide instructions.For more information about the configuration options for training, see Model Config in the NVIDIA TLT User Guide.The dataset used in this project was created by researchers at the University of Birmingham for their paper, Automated defect classification of SS304 TIG welding process using visible spectrum camera and machine learning.The dataset consists of over 45K grayscale welding images, which can be obtained through Kaggle. The dataset describes one class of proper execution: good_weld. It has five classes of defects that can occur during a tungsten inert gas (TIG) welding process: burn_through, contamination, lack_of_fusion, lack_of_shielding_gas, and high_travel_speed.Like many industrial datasets, this dataset is quite imbalanced as it can be difficult to collect data on defects that occur with a low likelihood. Table 1 shows the class distribution for the train, validation, and test datasets.Figure 2 visualizes the imbalance in the test dataset. The test dataset contains 75x more images of good_weld than it does of lack_of_shielding.The approach taken focuses on minimizing both development time and tuning time while ensuring that the accuracy is applicable for a production environment. TLT was used in combination with the standard configuration files shipped with the example notebooks. The setup, training, and tuning was done in under 8 hours.We conducted parameter sweeps regarding network depth and the number of training epochs. We observed that changing the learning rate from its default did not improve the results, so we did not investigate this further and left it at the default. The best results were obtained with a pretrained ResNet-18 model obtained from the NGC catalog after 30 epochs of training with a learning rate of 0.006.Check out the step-by-step approach in krygol/304SteelWeldingClassification GitHub repo.The obtained results were comparably good across all classes. Some lack_of_fusion gas images got misclassified as burn_through and contamination images. This effect was also observed when training the deeper ResNet50, which was even more prone to misclassifying lack_of_fusion as another defective class.The researchers at the University of Birmingham chose a different AI workflow. They manually prepared the dataset to lessen the imbalance by undersampling it. They also rescaled their images to different sizes and chose custom network structures.They used a fully connected neural network (Fully-con6), neural network with two hidden layers. They also implemented a convolutional neural network (Conv6) with three convolutional layers each followed by a max pooling layer and a fully connected layer as the final hidden layer. They did not use skip connections as ResNet does.The results obtained with TLT are even more impressive when compared to results of the custom implementation by the researchers at the University of Birmingham.Conv6 performed better on average with a macro average F1 of 0.78 but failed completely at recognizing lack_of_shielding gas defects. Fully-con6 performed worse on average, with a macro average F1 of 0.56. Fully-con6 could classify some of the lack_of_shielding gas images but had problems with burn_through and high_travel_speed images. Both Fully-con6 and Conv6 had distinct weaknesses that would prohibit them from being qualified as production-ready. The best F1-scores for every class are marked in green in the table. As you can see, the ResNet-18 model trained by TLT provided better results with a macro average of 0.97.We had a great experience with TLT, which in general was user-friendly and effective. It was fast to set up, easy to use, and produced results acceptable for production within a short computational time. Based on our experience, we believe that TLT provides a great advantage for engineers who are not AI experts but wish to use AI in their production environments. Using TLT in a manufacturing environment to automate quality control does not come at a performance cost and the application can often be used with the default settings and a few minor tweaks to outperform custom architectures.This exploration of using NVIDIA TLT to quickly and accurately train AI models shows that there is great potential for AI in industrial processes.Thanks and gratitude to the researchers—Daniel Bacioiu, Geoff Melton, Mayorkinos Papaelias, and Rob Shaw—for their paper, Automated defect classification of SS304 TIG welding process using visible spectrum camera and machine learning, and publishing their data on Kaggle. Thanks to Ekaterina Sirazitdinova for her help during the development process of this project.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Despite substantial progress in natural language processing (NLP) research over the last two years and its commercial success, little effort has been devoted to adapting this capability to other significant languages, such as Hindi, Arabic, Portuguese, or Spanish. Obviously, catering to the entire human population with more than 6,500 languages is challenging. At the same time, supporting just 40 languages addresses the NLP needs of more than 60% of the human population. Figure 2 shows that, even across the most frequently used languages, the performance of language models varies tremendously. Bear in mind that this comparison is not perfect, as those languages do have different language entropy. More importantly, the research on most capable large-scale language models seems to be limited to only a handful of high resource languages (languages with a high number of documents available publicly), such as English or Chinese.  The situation is even more complex when you account for domain-specific languages (such as medical, technical, or legal jargon), where besides English only a few high-quality models exist. This is regretful as those domain-specific language models are currently transforming the way that clinicians, engineers, researchers or other experts access information. Unfortunately, there is a limited number of equivalent models outside of English. Fortunately, replicating the success of English-language models across other languages is no longer a research task but predominantly an engineering activity. It no longer requires inventing new models and training approaches, but instead systematic and iterative dataset engineering, model training, and its continuous validation. This does not mean that engineering those models is trivial. Because of the model and dataset sizes used in modern NLP, the training process requires a substantial amount of computing power. Secondly, to use large models, you must collect large textual datasets. Thirdly, because of the shared size of models used, new approaches to training and inference are required.NVIDIA has extensive experience not only in building large-scale language models (ranging from 1 billion to 175 billion parameters) but also in deploying them to production. The goal of this post is to share our knowledge around project organization, infrastructure requirements, and budgeting and to support projects in this area.As hypothesized in Deep Learning Scaling is Predictable, Empirically, the NLP model performance seems to follow a power law with respect to both the model size and the volume of data used for training. As you make models and datasets bigger, the performance continues to improve. The following diagram from Scaling Laws for Neural Language Models demonstrates not only that this relationship holds but more importantly, it holds across nine orders of magnitude of compute.In the NLP scaling law, despite the models at the far right reaching as much as 175 billion parameters (more than 500 times larger than BERT Large), this relationship does not show signs of stopping. This suggests that even further improvement can be expected from larger models. Indeed, switch transformers when scaled to 1.6 trillion parameters (roughly 5000x larger than BERT Large) continue to demonstrate the previously mentioned behavior. More importantly, the large NLP models seem to generate much more robust features capable of solving complex problems even without large-scale fine-tuning datasets. Figure 4 shows this capability across three orders of magnitude of models.Due to this capability, and despite the relatively high cost of their development, large NLP models are likely to not only continue to dominate the NLP processing landscape but also continue to grow, at least by another order of magnitude approaching trillions of parameters.This relationship between model size, dataset size, and model performance is not unique to NLP. I see the same behavior of automatic speech recognition and computer vision models, and across many other disciplines that are a backbone of conversational AI.At the same time, a limited amount of work was devoted to the development of both large-scale datasets and models for other languages. Indeed, the majority of the work that focuses on languages other than English take advantage of smaller models and less curated datasets. For example, NLP using subset from general datasets such as raw Common Crawl).Even less effort is devoted to supporting any of the following:The current status creates an opportunity for local companies willing to invest in model training to lead the development of NLP technologies in the region.Building large-scale language models is not trivial for many reasons. First, the large-scale datasets are not trivial to curate. In the raw format, they are actually quite easy to obtain. Second, the infrastructure required to train these huge models requires substantial systems knowledge to set up. Finally, they require extensive research expertise to train and optimize. What is less widely understood is that training such large models requires software engineering effort. Most interesting models are larger than the memory capacity of not only individual GPUs but also of many multi-GPU servers. The number of mathematical operations required to train them can also make training times unmanageable, measured in months even on fairly sizable systems. Approaches such as model and pipeline parallelism overcome some of those challenges. However, applying them in a naive way could lead to scaling issues, exacerbating an already long training time. Together with organizations such as Microsoft and Stanford University, NVIDIA has worked towards developing tools that streamline the development process of the largest language models and provide computational efficiency and scalability to allow cost-effective training. As a consequence, a wide range of tools abstracting the complexity of large model development are now available, including the following:As a result of those efforts, I’ve seen a substantial reduction in training times of large models. Indeed, the GPT-3 model with 175 billion parameters trained on 300 billion tokens using 1024 NVIDIA A100 Tensor Core GPUs can be trained today in 34 days (as shown in Efficient Large-Scale Language Model Training on GPU Clusters). Based on experimentation, NVIDIA estimates that 1 trillion parameter models can be trained in approximately 84 days with 3072 A100 GPUs. Despite the training cost of those models being high, it is not beyond reach of most large organizations. With further software advances, it is likely to reduce further.Because the development of large language models requires scalable infrastructure, NVIDIA has also consolidated knowledge from building the internal Selene cluster (used for NLP internal research and to deliver record-breaking performance in MLPerf training and inference benchmarks) into a fully packaged product called NVIDIA DGX SuperPOD. This cluster is more than just a system reference design. In fact, it can be bought in its entirety together with software and support of NVIDIA data scientists and applied researchers, similar to the NLP-focused deployment by Naver Clova). Such an approach has already had a substantial impact on the NLP landscape, as it enables organizations with extensive NLP expertise to scale out their efforts fast. More importantly, it enables organizations with limited systems, HPC, or large-scale NLP workload expertise to start iterating in weeks, rather than months or years. The ability to build large language models is just an academic achievement when it’s impossible to take advantage of the results of your work by deploying your models to production. The challenge of deploying models such as GPT-3 correlates to their sheer size, which exceeds the memory capacity of a GPU, and computational complexity. Both are factors that contribute to decreased throughput and high latency of inference. This is a widely understood problem and a range of tools and solutions currently exist to make serving the largest language models simple and cost-effective. NVIDIA Triton Inference Server is an open-source cloud and edge inferencing solution optimized for both CPUs and GPUs. It can be used to host distributed models effectively. To deploy a large model using pipeline parallelism, the model must be split into several parts, for example, manipulating the ONNX graph with tools such as ONNX Graph Surgeon. Each of the parts must be small enough to fit into the memory space of a single GPU.After the model is subdivided, it can be distributed across multiple GPUs without the need to develop any code. You create an NVIDIA Triton YAML configuration file defining how individual parts of the model should be connected.The traffic between individual model parts and their load balancing can be managed automatically by Triton Inference Server. The communication overheads are also kept to a minimum as Triton takes advantage of the latest NVIDIA NVSwitch and third-generation NVIDIA NVLink technology, providing 600 GB/sec GPU-to-GPU direct bandwidth, which is 10X  higher than PCIe gen4. This means that you can efficiently deploy not only medium-scale models of multi billions of parameters but even the largest of models, including GPT-3 with trillions of parameters.For more information, see Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime (GTC21 session).Beyond the ability to host trained large models, it is important to look at optimization techniques. Such techniques can reduce the memory footprint of the models, through quantization and pruning; substantially accelerate the execution; and reduce latency by optimizing memory access, taking advantage of TensorCores or sparsity acceleration.Utilities such as TensorRT provide a wide range of optimized kernels for execution of transformer-based architectures. They can automatically do half precision (FP16) or, in certain cases, INT8 quantization. TensorRT also supports quantization-aware training and provides early support for hardware-accelerated sparsity.The NVIDIA FasterTransformer library specializes in the inference of the transformer neural networks and can be used with models such as BERT or GPT-2/3. This library includes a tensor-parallel inference backend that provides the ability to do the inference of the huge GPT-3 models in parallel on multiple GPUs within the DGX A100 system. This enables you to reduce inference latency by as much as 1.2–3x, depending on model size. With FasterTransformer, you can deploy the largest of Megatron Models with a single line of code.The Microsoft DeepSpeed library has a number of features focused on inference, including support for Mixture-of-Quantization (MoQ), high-performance INT8 kernels, or DeepFusion.Thanks to all of those advances, large language models are no longer limited to academic research as they are making headway into commercial AI-based products.Correct sizing of the challenge is critical for the success of your NLP initiative. The amount of engineering and research staff needed as well as training and inference infrastructure significantly affects your business case. The following factors have significant impact on the overall cost of the development: After the fundamental business questions are addressed, it is possible to estimate the effort and compute required for their development. When you have an understanding of how good your model must be to allow for the product or service, it is possible to estimate the model size needed. The relationship between the performance of language models and the amount of data and model size is widely understood (Figure 9).After you understand the size of the model and dataset that you need, you can estimate the amount of infrastructure required and training time. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters. Furthermore, the scaling of large language models is superlinear, meaning that the training performance does not degrade with the increasing model size but actually increases (Figure 10).Here are the key factors to consider for initial infrastructure sizing:Large language models have appealing properties and will help expand the availability of NLP around the globe. They more performant across a wide number of NLP tasks, but they are also much more sample efficient. They are what are known as few-shot learners and in certain ways are easier to design, as their exact hyperparameter configuration seems unimportant in comparison to their size.  As a consequence, the NLP models are likely to continue to grow. I see empirical evidence justifying at least one if not two orders of magnitude of growth. Fortunately, the technology to build and deploy them to production has matured considerably. The software required to train them has also matured considerably and is broadly available, such as the NVIDIA open source Megatron-based implementation of GPT-3. Quality is continuing to improve, driving down the training times. The infrastructure required to train models in this space is also well understood and commercially available (DGX SuperPOD). It is now possible to deploy the largest of NLP models to production using tools such as Triton Inference Server.  As a consequence, big NLP models are in reach of everyone with the will to pursue them.NVIDIA actively supports customers in the scoping and delivery of large training and inference systems, as well as supporting them in establishing NLP training capability. If you are working towards building your NLP capability, reach out to your local NVIDIA account team. You can also join one of our Deep Learning Institute NLP classes. During the course, you learn how to work with modern NLP models, optimise them with TensorRT, and deploy for cost-effective production with Triton Inference Server. For more information, see any of the following NLP-related GTC presentations:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.61	At GTC ’21, experts presented a variety of technical talks to help people new to AI, or just those looking for tools to speed-up their AI development using the various components of the NGC catalog, including:Watch these on-demand sessions to learn how to build solutions in the cloud with NVIDIA AI software from NGC.Building a Text-to-Speech Service that Sounds Like YouThis session shows how to build a TTS model for expressive speech using pretrained models. The model is fine-tuned with speech samples and customized for the variability in speech performing style transfer from other speakers. The provided tools let developers create a model for their voice and style and make the TTS service sound like them!Analyzing Traffic Video Streams at ScaleThis session demonstrates how to use the Transfer Learning Toolkit and pretrained models to build computer vision models and run inference on over 1,000 live video feeds on a single AWS instance powered by NVIDIA A100 GPUs.Deploy Compute and Software Resources to Run AI/ML Applications in Azure Machine Learning with Just Two CommandsThis session shows how to building a taxi fare prediction application using RAPIDS and shows how to automatically set up a DASK cluster with multiple Azure virtual machines to support large datasets, mount data into the Dask scheduler and workers, deploy GPU-optimized AI software from the NGC catalog to train models, and then make taxi fare predictions.Build and Deploy AI Applications Faster on Azure Machine LearningThis session demonstrates the basics of Azure Machine Learning (AzureML) Platform, the benefits of using the NGC catalog, and how to leverage the NGC-AzureML Quick Launch Toolkit to build an end-to-end AI application in AzureML.If you’re building an AI solution from scratch or just want to replicate the use cases shown in the above sessions, start with the NGC catalog.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	At GTC ’21, experts presented a variety of technical talks to help people new to AI, or just those looking for tools to speed-up their AI development using the various components of the NGC catalog, including:Watch these on-demand sessions to learn how to build solutions in the cloud with NVIDIA AI software from NGC.Building a Text-to-Speech Service that Sounds Like YouThis session shows how to build a TTS model for expressive speech using pretrained models. The model is fine-tuned with speech samples and customized for the variability in speech performing style transfer from other speakers. The provided tools let developers create a model for their voice and style and make the TTS service sound like them!Analyzing Traffic Video Streams at ScaleThis session demonstrates how to use the Transfer Learning Toolkit and pretrained models to build computer vision models and run inference on over 1,000 live video feeds on a single AWS instance powered by NVIDIA A100 GPUs.Deploy Compute and Software Resources to Run AI/ML Applications in Azure Machine Learning with Just Two CommandsThis session shows how to building a taxi fare prediction application using RAPIDS and shows how to automatically set up a DASK cluster with multiple Azure virtual machines to support large datasets, mount data into the Dask scheduler and workers, deploy GPU-optimized AI software from the NGC catalog to train models, and then make taxi fare predictions.Build and Deploy AI Applications Faster on Azure Machine LearningThis session demonstrates the basics of Azure Machine Learning (AzureML) Platform, the benefits of using the NGC catalog, and how to leverage the NGC-AzureML Quick Launch Toolkit to build an end-to-end AI application in AzureML.If you’re building an AI solution from scratch or just want to replicate the use cases shown in the above sessions, start with the NGC catalog.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.62	The most commonly diagnosed cancer in the US today is skin cancer. There are three main variants: melanoma, basal cell carcinoma (BCC), and squamous cell carcinoma (SCC). Though melanoma only accounts for roughly 1% of all skin cancers, it is the most fatal, metastasizing rapidly without early detection and treatment. This makes early detection critical, as numerous studies show significantly better survival rates when detection is done in its earliest stages.The current diagnosis procedure is done through a visual examination by a dermatologist, followed by a biopsy to confirm any suspected pathology. This manual examination is dependent on human subjectivity and thus suffers from error at a concerning rate. When a primary care physician looks for skin cancer, their sensitivity, or ability to identify a patient with the disease correctly, is only 0.45, while a dermatologist has a sensitivity of 0.97.In recent years, the use of deep learning to perform medical diagnostics has become a quickly growing field. In this post, we discuss developing an end-to-end example of how deep learning could lead to an automated dermatology exam system free of human bias, using the recently announced NVIDIA Clara AGX development kit.This reference application is the pairing of two deep learning models:Figure 1 shows the workflow of the algorithm using a single video frame. The application can use a high-definition webcam or IP camera as input to the models, or even run on a previously captured video.This reference application was built using the NVIDIA Clara AGX development kit, a high-end performance workstation built with medical applications in mind. The system includes an RTX 6000 GPU, delivering 200+ INT8 AI TOPs of peak performance and 24 GB of VRAM, leaving plenty of overhead for running multiple models.In addition, the AGX platform offers support for high bandwidth sensors through 100G Ethernet and an NVIDIA ConnectX-6 network interface card (NIC). NVIDIA partners are currently using the NVIDIA Clara AGX development kit to develop applications in ultrasound, genomics, and endoscopy.The Clara AGX Developer Kit is currently available exclusively for members of the NVIDIA Clara Developer Partner Program. After you register, we’ll be in touch.We’ve provided a research prototype of a dermatology application, but what would it take to transform this into a real application?For more information, see the dermatology reference Docker container on NGC.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	NVIDIA and Mozilla are proud to announce the latest release of the Common Voice dataset, with over 13,000 hours of crowd-sourced speech data, and adding another 16 languages to the corpus. Common Voice is the world’s largest open data voice dataset and designed to democratize voice technology. It is used by researchers, academics, and developers around the world.  Contributors mobilize their own communities to donate speech data to the MCV public database, which anyone can then use to train voice-enabled technology.  As part of NVIDIA’s collaboration with Mozilla Common Voice, the models trained on this and other public datasets are made available for free via an open-source toolkit called NVIDIA NeMo. Highlights of this release include:Pretrained Models:NVIDIA has released multilingual speech recognition models in NGC for free as part of the partnership mission to democratize voice technology. NeMo is an open-source toolkit for researchers developing state-of-the-art conversational AI models. Researchers can further fine-tune these models on multilingual datasets. See an example in this notebook that fine tunes an English speech recognition model on the MCV Japanese dataset. Contribute Your Voice, and Validate Samples: The dataset relies on the amazing effort and contribution from many communities across the world. Take the time to feed back into the dataset by recording your voice and validating samples from other contributors: https://commonvoice.mozilla.org/speakYou can download the latest MCV dataset from https://commonvoice.mozilla.org/datasets, including the repo for full stats https://github.com/common-voice/cv-dataset/, and NVIDIA NeMo from NGC Catalog and GitHub.Dataset ‘Ask Me Anything’:August 4, 2021 from 3:00 – 4:00 p.m. UTC / 2:00 – 3:00 p.m. EDT / 11:00 a.m. – 12:00 p.m. PDT:In celebration of the dataset release, on August 4th Mozilla is hosting an AMA discussion with Lead Engineer Jenny Zhang. Jenny will be available to answer your questions live, to join and ask a question please use the following AMA discourse topic.   Read more > >Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.63	Relying on the capabilities of GPUs, a team from Facebook AI Research has developed a faster, more efficient way for AI to run similarity searches. The study, published in IEEE Transactions on Big Data, creates a deep learning algorithm capable of handling and comparing high-dimensional data from media that is notably faster, while just as accurate as previous techniques. In a world with an ever-growing supply of data, the work promises to ease both the compute power and time needed for processing large libraries.“The most straightforward technique for searching and indexing [high-dimensional data] is by brute-force comparison, whereby you need to check [each image] against every other image in the database. This is impractical for collections containing billions of vectors,” Jeff Johnson, study colead and a research engineer at Facebook, said in a press release.Containing millions of pixels and data points, every image and video creates billions of vectors. This large amount of data is valuable for analyzing, detecting, indexing, and comparing vectors. It is also problematic for calculating similarities of large libraries with traditional CPU algorithms that rely on several supercomputer components, slowing down overall computing time.Using only four GPUs with CUDA, the researchers designed an algorithm for GPUs to both host and analyze library image data points. The method also compresses the data, making it easier, and thus faster to analyze.  The new algorithm processed over 95 million high-dimensional images in 35 minutes. A graph of a billion vectors took less than 12 hours to compute. According to a comparison test in the study, handling the same database with a cluster of 128 CPU servers took 108.7 hours-about 8.5x longer.“By keeping computations purely on a GPU, we can take advantage of the much faster memory available on the accelerator, instead of dealing with the slower memories of CPU servers and even slower machine-to-machine network interconnects within a traditional supercomputer cluster,” said Johnson. The researchers state the methods are already being applied to a wide variety of tasks, including a language processing search for translations. Known as the Facebook AI Similarity Search library, the approach is open source for implementation, testing, and comparison. Read more >>>Read the full article in IEEE Transactions on Big Data >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Relying on the capabilities of GPUs, a team from Facebook AI Research has developed a faster, more efficient way for AI to run similarity searches. The study, published in IEEE Transactions on Big Data, creates a deep learning algorithm capable of handling and comparing high-dimensional data from media that is notably faster, while just as accurate as previous techniques. In a world with an ever-growing supply of data, the work promises to ease both the compute power and time needed for processing large libraries.“The most straightforward technique for searching and indexing [high-dimensional data] is by brute-force comparison, whereby you need to check [each image] against every other image in the database. This is impractical for collections containing billions of vectors,” Jeff Johnson, study colead and a research engineer at Facebook, said in a press release.Containing millions of pixels and data points, every image and video creates billions of vectors. This large amount of data is valuable for analyzing, detecting, indexing, and comparing vectors. It is also problematic for calculating similarities of large libraries with traditional CPU algorithms that rely on several supercomputer components, slowing down overall computing time.Using only four GPUs with CUDA, the researchers designed an algorithm for GPUs to both host and analyze library image data points. The method also compresses the data, making it easier, and thus faster to analyze.  The new algorithm processed over 95 million high-dimensional images in 35 minutes. A graph of a billion vectors took less than 12 hours to compute. According to a comparison test in the study, handling the same database with a cluster of 128 CPU servers took 108.7 hours-about 8.5x longer.“By keeping computations purely on a GPU, we can take advantage of the much faster memory available on the accelerator, instead of dealing with the slower memories of CPU servers and even slower machine-to-machine network interconnects within a traditional supercomputer cluster,” said Johnson. The researchers state the methods are already being applied to a wide variety of tasks, including a language processing search for translations. Known as the Facebook AI Similarity Search library, the approach is open source for implementation, testing, and comparison. Read more >>>Read the full article in IEEE Transactions on Big Data >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.64	Deep learning is revolutionizing the way that industries are delivering products and services. These services include object detection, classification, and segmentation for computer vision, and text extraction, classification, and summarization for language-based applications. These applications must run in real time.Most of the models are trained in floating-point 32-bit arithmetic to take advantage of a wider dynamic range. However, at inference, these models may take a longer time to predict results compared to reduced precision inference, causing some delay in the real-time responses, and affecting the user experience.It’s better in many cases to use reduced precision or 8-bit integer numbers. The challenge is that simply rounding the weights after training may result in a lower accuracy model, especially if the weights have a wide dynamic range. This post provides a simple introduction to quantization-aware training (QAT), and how to implement fake-quantization during training, and perform inference with NVIDIA TensorRT 8.0.Model quantization is a popular deep learning optimization method in which model data—both network parameters and activations—are converted from a floating-point representation to a lower-precision representation, typically using 8-bit integers. This has several benefits:Quantization has many benefits but the reduction in the precision of the parameters and data can easily hurt a model’s task accuracy. Consider that 32-bit floating-point can represent roughly 4 billion numbers in the interval [-3.4e38, 3.40e38]. This interval of representable numbers is also known as the dynamic-range. The distance between two neighboring representable numbers is the precision of the representation.Floating-point numbers are distributed nonuniformly in the dynamic range and about half of the representable floating-point numbers are in the interval [-1,1]. In other words, representable numbers in the [-1, 1] interval would have higher precision than numbers in [1, 2]. The high density of representable 32-bit floating-point numbers in [-1, 1] is helpful in deep learning models where parameters and data have most of their distribution mass around zero.Using an 8-bit integer representation, however, you can represent only 28 different values. These 256 values can be distributed uniformly or nonuniformly, for example, for higher precision around zero. All mainstream, deep-learning hardware and software chooses to use a uniform representation because it enables computing using high-throughput parallel or vectorized integer math pipelines.To convert the representation of a floating-point tensor () to an 8-bit representation (), a scale-factor is used to map the floating-point tensor’s dynamic-range to [-128, 127]:This is symmetric quantization because the dynamic-range is symmetric about the origin.  is a function that applies some rounding-policy to round rational numbers to integers; and  is a function that clips outliers that fall outside the [-128, 127] interval. TensorRT uses symmetric quantization to represent both activation data and model weights.At the top of Figure 1 is a diagram of an arbitrary floating-point tensor , depicted as a histogram of the distribution of its elements. We chose a symmetric range of coefficients to represent in the quantized tensor: [, ]. Here,  is the element with the largest absolute value to represent. To compute the quantization scale, divide the float-point dynamic-range into 256 equal parts:The method shown here to compute the scale uses the full range that you can represent with signed 8-bit integers: [-128, 127]. TensorRT Explicit Precision (Q/DQ) networks use this range when quantizing weights and activations.There is tension between the dynamic range chosen to represent using 8-bit integers and the error introduced by the rounding operation. A larger dynamic range means that more values from the original floating-point tensor get represented in the quantized tensor, but it also means using a lower precision and introducing a larger rounding error.Choosing a smaller dynamic range reduces the rounding error but introduce a clipping error. Floating-point values that are outside the dynamic range are clipped to the min/max value of the dynamic range.To address the effects of the loss of precision on the task accuracy, various quantization techniques have been developed. These techniques can be classified as belonging to one of two categories: post-training quantization (PTQ) or quantization-aware training (QAT).As the name suggests, PTQ is performed after a high-precision model has been trained. With PTQ, quantizing the weights is easy. You have access to the weight tensors and can measure their distributions. Quantizing the activations is more challenging because the activation distributions must be measured using real input data. To do this, the trained floating-point model is evaluated using a small dataset representative of the task’s real input data, and statistics about the interlayer activation distributions are collected. As a final step, the quantization scales of the model’s activation tensors are determined using one of several optimization objectives. This process is calibration and the representative dataset used is the calibration-dataset.Sometimes PTQ is not able to achieve acceptable task accuracy. This is when you might consider using QAT. The idea behind QAT is simple: you can improve the accuracy of quantized models if you include the quantization error in the training phase. It enables the network to adapt to the quantized weights and activations.There are various recipes to perform QAT, from starting with an untrained model to starting with a pretrained model. All recipes change the training regimen to include the quantization error in the training loss by inserting fake-quantization operations into the training graph to simulate the quantization of data and parameters. These operations are called ‘fake’ because they quantize the data, but then immediately dequantize the data so the operation’s compute remains in float-point precision. This trick adds quantization noise without changing much in the deep-learning framework.In the forward-pass, you fake-quantize the floating-point weights and activations and use these fake-quantized weights and activations to perform the layer’s operation. In the backward pass, you use the weights’ gradients to update the floating-point weights. To deal with the quantization gradient, which is zero almost everywhere except for points where it is undefined, you use the (straight-through estimator (STE), which passes the gradient as-is through the fake-quantization operator. When the QAT process is done, the fake-quantization layers hold the quantization scales that you use to quantize the weights and activations that the model is used for inference.PTQ is the more popular method of the two because it is simple and doesn’t involve the training pipeline, which also makes it the faster method. However, QAT almost always produces better accuracy, and sometimes this is the only acceptable method.TensorRT 8.0 supports INT8 models using two different processing modes. The first processing mode uses the TensorRT tensor dynamic-range API and also uses INT8 precision (8-bit signed integer) compute and data opportunistically to optimize inference latency.This mode is used when TensorRT performs the full PTQ calibration recipe and when TensorRT uses preconfigured tensor dynamic-ranges (Figure 3). The other TensorRT INT8 processing mode is used when processing floating-point ONNX networks with QuantizeLayer/DequantizeLayer layers and follows explicit quantization rules. For more information about the differences, see Explicit-Quantization vs. PTQ-Processing in the TensorRT Developer Guide.The TensorRT Quantization Toolkit for PyTorch compliments TensorRT by providing a convenient PyTorch library that helps produce optimizable QAT models. The toolkit provides an API to automatically or manually prepare a model for QAT or PTQ.At the core of the API is the TensorQuantizer module, which can quantize, fake-quantize, or collect statistics on a tensor. It is used together with QuantDescriptor, which describes how a tensor should be quantized. Layered on top of TensorQuantizer are quantized modules that are designed as drop-in replacements of PyTorch’s full-precision modules. These are convenience modules that use TensorQuantizer to fake-quantize or collect statistics on a module’s weights and inputs.The API supports the automatic conversion of PyTorch modules to their quantized versions. Conversion can also be done manually using the API, which allows for partial quantization in cases where you don’t want to quantize all modules. For example, some layers may be more sensitive to quantization and leaving them unquantized improves task accuracy.The TensorRT-specific recipe for QAT is described in detail in NVIDIA Quantization whitepaper, which includes a more rigorous discussion of the quantization methods and results from experiments comparing QAT and PTQ on various learning tasks.This section describes the classification-task quantization example included with the toolkit.The recommended toolkit recipe for QAT calls for starting with a pretrained model, as it’s been shown that starting from a pretrained model and fine-tuning leads to better accuracy and requires significantly fewer iterations. In this case, you load a pretrained ResNet50 model. The command-line arguments for running the example from the bash shell:The --data-dir argument points to the ImageNet (ILSVRC2012) dataset, which you must download separately. The --calibrator=histogram argument specifies that the model should be calibrated, using the histogram calibrator, before fine-tuning the model. The rest of the arguments, and many more, are documented in the example.The ResNet50 model is originally from Facebook’s Torchvision package, but because it includes some important changes (quantization of skip-connections), the network definition is included with the toolkit (resnet50_res). For more information, see Q/DQ Layer-Placement Recommendations.Here’s a brief overview of the code. For more information, see Quantizing ResNet50.The function prepare_model instantiates the data loaders and model as usual, but it also configures the quantization descriptors. Here’s an example:Instances of QuantDescriptor describe how to calibrate and quantize tensors by configuring the calibration method and axis of quantization. For each quantized operation (such as quant_nn.QuantConv2d), you configure the activations and weights in QuantDescriptor separately because they use different fake-quantization nodes.You then add fake-quantization nodes in the training graph. The following code (quant_modules.initialize) dynamically patches PyTorch code behind the scenes so that some of the torch.nn.module subclasses are replaced by their quantized counterparts, instantiates the model’s modules, and then reverts the dynamic patch (quant_modules.deactivate). For example, torch.nn.conv2d is replaced by pytorch_quantization.nn.QuantConv2d, which performs fake-quantization before performing the 2D convolution. The method quant_modules.initialize should be invoked before model instantiation.Next, you collect statistics (collect_stats) on the calibration data: feed calibration data to the model and collect activation distribution statistics in the form of a histogram for each layer to quantize. After you’ve collected the histogram data, calibrate the scales (calibrate_model) using one or more calibration algorithms (compute_amax).During calibration, try to determine the quantization scale of each layer, so that it optimizes some objective, such as the model accuracy. There are currently two calibrator classes:To determine the quality of the calibration method afterward, evaluate the model accuracy on your dataset. The toolkit makes it easy to compare the results of the four different calibration methods to discover the best method for a specific model. The toolkit can be extended with proprietary calibration algorithms. For more information, see the ResNet50 example notebook.If the model’s accuracy is satisfactory, you don’t have to proceed with QAT. You can export to ONNX and be done. That would be the PTQ recipe. TensorRT is given the ONNX model that has Q/DQ operators with quantization scales, and it optimizes the model for inference. So, this is a PTQ workflow that results in a Q/DQ ONNX model.To continue to the QAT phase, choose the best calibrated, quantized model. Use QAT to fine-tune for around 10% of the original training schedule with an annealing learning-rate schedule, and finally export to ONNX. For more information, see the Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation whitepaper.There are a couple of things to keep in mind when exporting to ONNX:When the model is finally exported to ONNX, the fake-quantization nodes are exported to ONNX as two separate ONNX operators: QuantizeLinear and DequantizeLinear (shown in Figure 5 as Q and DQ).At a high level, TensorRT processes ONNX models with Q/DQ operators similarly to how TensorRT processes any other ONNX model:Building Q/DQ networks in TensorRT does not require any special builder configuration, aside from enabling INT8, because it is automatically enabled when Q/DQ layers are detected in the network. The minimal command to build a Q/DQ network using the TensorRT sample application trtexec is as follows:TensorRT optimizes Q/DQ networks using a special mode referred to as explicit quantization, which is motivated by the requirements for network processing-predictability and control over the arithmetic precision used for network operation. Processing-predictability is the promise to maintain the arithmetic precision of the original model. The idea is that Q/DQ layers specify where precision transitions must happen and that all optimizations must preserve the arithmetic semantics of the original ONNX model.Contrasting TensorRT Q/DQ processing and plain TensorRT INT8 processing helps explain this better. In plain TensorRT, INT8 network tensors are assigned quantization scales, using the dynamic range API or through a calibration process. TensorRT treats the model as a floating-point model when applying the backend optimizations and uses INT8 as another tool to optimize layer execution time. If a layer runs faster in INT8, then it is configured to use INT8. Otherwise, FP32 or FP16 is used, whichever is faster. In this mode, TensorRT is optimizing for latency only, and you have little control over which operations are quantized.In contrast, in explicit quantization, Q/DQ layers specify where precision transitions must happen. The optimizer is not allowed to perform precision-conversions not dictated by the network. This is true even if such conversions increase layer precision (for example, choosing an FP16 implementation over an INT8 implementation) and even if such conversion results in a plan file that executes faster (for example, preferring INT8 over FP16 on V100 where INT8 is not accelerated by Tensor Cores).In explicit quantization, you have full control over precision transitions and the quantization is predictable. TensorRT still optimizes for performance but under the constraint of maintaining the original model’s arithmetic precision. Using the dynamic-range API on Q/DQ networks is not supported.The explicit quantization optimization passes operate in three phases:For more information about the main explicit quantization optimizations that TensorRT performs, see the TensorRT Developer Guide.The plan file created from building a TensorRT Q/DQ network contains quantized weights and operations and is ready to deploy. EfficientNet is one of the networks that requires QAT to maintain accuracy. The following chart compares PTQ to QAT.For more information, see the EfficientNet Quantization example on NVIDIA DeepLearningExamples.In this post, we briefly introduced basic quantization concepts and TensorRT’s quantization toolkit and then reviewed how TensorRT 8.0 processes Q/DQ networks. We did a quick walkthrough of the ResNet50 QAT example provided with the Quantization Toolkit.ResNet50 can be quantized using PTQ and doesn’t require QAT. EfficientNet, however, requires QAT to maintain accuracy. The EfficientNet B0 baseline floating-point Top1 accuracy is 77.4, while its PTQ Top1 accuracy is 33.9 and its QAT Top1 accuracy is 76.8.For more information, see the GTC 2021 session, Quantization Aware Training in PyTorch with TensorRT 8.0.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	 Read more >>>Read the full article in Applied Physics Reviews >>>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.65	Astrophysics researchers have long faced a tradeoff when simulating space— simulations could be either high-resolution or cover a large swath of the universe. With the help of generative adversarial networks, they can accomplish both at once.Carnegie Mellon University and University of California researchers developed a deep learning model that upgrades cosmological simulations from low to high resolution, allowing scientists to create a complex simulated universe within a day. These simulations are critical for researchers to unravel mysteries around galaxy formation, dark matter and dark energy. “Cosmological simulations need to cover a large volume for cosmological studies, while also requiring high resolution to resolve the small-scale galaxy formation physics, which would incur daunting computational challenges. said Yueying Ni, a Ph.D. candidate at Carnegie Mellon. “Our technique can be used as a powerful and promising tool to match those two requirements simultaneously by modeling the small-scale galaxy formation physics in large cosmological volumes.”  The team’s GAN model can take full-scale, low-resolution models and turn them into super-resolution simulations with up to 512 times as many particles. Though it was trained on data from only small areas of space, the model was able to replicate large-scale structures seen only in massive simulations. Published in PNAS, the journal of the National Academy of Sciences, the project used the hundreds of NVIDIA RTX GPUs on the Texas Advanced Computing Center’s Frontera system.   While existing methods would take over three weeks on a single processing core to create a detailed simulation of 134 million particles, the GPU-accelerated deep learning approach does it in just 36 minutes. And for simulations 1,000 times as large, the new method shrunk simulation time down from months on a dedicated supercomputer to 16 hours on a single GPU.This acceleration can help scientists run more simulations to predict how the universe would look in different scenarios. “With our previous simulations, we showed that we could simulate the universe to discover new and interesting physics, but only at small or low-res scales,” said Rupert Croft, physics professor at Carnegie Mellon. “By incorporating machine learning, the technology is able to catch up with our ideas.”Since the current neural networks focused on how gravity moves dark matter around over time, other phenomena such as supernovae and black holes were left out of the simulations. The team next plans to extend their methods to capture the forces responsible for these events. “The universe is the biggest data set there is,” said Scott Dodelson, head of the department of physics at Carnegie Mellon and director of the National Science Foundation Planning Institute for Artificial Intelligence in Physics. And “artificial intelligence is the key to understanding the universe and revealing new physics.” Read the full article in PNAS >> Read more >> Main image from TNG SimulationsHave a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The new Isaac simulation engine not only creates better photorealistic environments, but also streamlines synthetic data generation and domain randomization to build ground-truth datasets to train robots in applications from logistics and warehouses to factories of the future.NVIDIA Omniverse is the underlying foundation for NVIDIA’s simulators, including the Isaac platform — which now includes several new features. Discover the next level in simulation capabilities for robots with NVIDIA Isaac Sim open beta, available now. Built on the Omniverse platform, Isaac Sim is a robotics simulation application and synthetic data generation tool. It allows roboticists to train and test their robots more efficiently by providing a realistic simulation of the robot interacting with compelling environments that can expand coverage beyond what is possible in the real world.This release of Isaac Sim also adds improved multi-camera support and sensor capabilities, and a PTC Onshape CAD importer to make it easier to bring in 3D assets. These new features will expand the breadth of robots and environments that can be successfully modeled and deployed in every aspect: from design and development of the physical robot, then training the robot, to deploying in a “digital twin” in which the robot is simulated and tested in an accurate and photorealistic virtual environment.Summary of Key New FeaturesIsaac Sim Enables More Robotics SimulationDevelopers have long seen the benefits of having a powerful simulation environment for testing and training robots. But all too often, the simulators have had shortcomings which limited their adoption. Isaac Sim addresses these drawbacks with the benefits described below. Realistic Simulation In order to deliver realistic robotics simulations, Isaac Sim leverages the Omniverse platform’s powerful technologies including advanced GPU-enabled physics simulation with PhysX 5, photorealism with real-time ray and path tracing, and Material Definition Language (MDL) support for physically-based rendering.Modular, Breadth of ApplicationsIsaac Sim is built to address many of the most common robotics use cases including manipulation, autonomous navigation, and synthetic data generation for training data. Its modular design allows users to easily customize and extend the toolset to accommodate many applications and and environments.Seamless Connectivity and InteroperabilityIsaac Sim benefits from Omniverse Nucleus and Omniverse Connectors, enabling collaborative building, sharing, and importing of environments and robot models in Universal Scene Description (USD). Easily connect the robot’s brain to a virtual world through Isaac SDK and ROS/ROS2 interface, fully-featured Python scripting, plugins for importing robot and environment models.Synthetic Data Generation in Isaac Sim Bootstraps Machine LearningSynthetic Data Generation is an important tool that is increasingly used to train the perception models found in today’s robots. Getting real-world, properly labeled data is a time consuming and costly endeavor. But in the case of robotics, some of the required training data could be too difficult or dangerous to collect in the real world. This is especially true of robots that must operate in close proximity to humans.Isaac Sim has built-in support for a variety of sensor types that are important in training perception models. These sensors include RGB, depth, bounding boxes, and segmentation.In the open beta, we have the ability to output synthetic data in the KITTI format. This data can then be used directly with the NVIDIA Transfer Learning Toolkit to enhance model performance with use case-specific data.Domain RandomizationDomain Randomization varies the parameters that define a simulated scene, such as the lighting, color and texture of materials in the scene. One of the main objectives of domain randomization is to enhance the training of machine learning (ML) models by exposing the neural network to a wide variety of domain parameters in simulation. This will help the model to generalize well when it encounters real world scenarios. In effect, this technique helps teach models what to ignore.Isaac Sim supports the randomization of many different attributes that help define a given scene. With these capabilities, the ML engineers can ensure that the synthetic dataset contains sufficient diversity to drive robust model performance.Randomizable ParametersIn Isaac Sim open beta, we have enhanced the domain randomization capabilities by allowing the user  to define a region for randomization. Developers can now draw a box around the region in the scene that is to be randomized and the rest of the scene will remain static. More Information on Isaac SimCheck out the latest Isaac Sim GTC 2021 session, Sim-to-Real.Also, learn more about Isaac Sim by exploring the growing number of video tutorials.Learn more about using Isasac Sim to train your Jetbot by exploring these developer blogs:.Getting Started Join the thousands of developers who have worked with Isaac Sim across the robotics community via our early access program. Get started with the next step in robotics simulation by downloading Isaac Sim.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.66	AI can help banking firms better detect and prevent payment fraud and improve processes for anti-money laundering (AML) and know-your-customer (KYC) systems. With NVIDIA GPU-accelerated machine learning and deep learning platforms, data scientists can deliver results in days, instead of the weeks more traditional methods require. Learn how companies like PayPal and Datatonic are preventing fraud. The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free in order to get access to the tools and training necessary to build on NVIDIA’s technology platform here.  On-DemandHow to Accelerate Large-Scale Inference for a Real-World Fraud Detection System Using the Latest NVIDIA GPUs on Google Cloud PlatformSpeakers: Samantha Guerriero, Senior Machine Learning Engineer and Sepanda Pouryahya, Chief ScientistDive deep into the data science behind a successful fraud detection solution with both traditional and deep learning approaches. Get the latest insights on designing real-time machine learning systems on Google Cloud Platform following MLOps best practices.Auto-Tagging and Temporal Refresh Using Reinforcement Learning-Based Meta-Learning Frameworks for Real-Time Fraud DetectionSpeaker: Nitin Sharma, Distinguished Scientist, AI Research GroupDetecting fraud poses multiple challenges on account of systemic changes in fraud snapshots, but also due to temporal changes stemming from incremental product releases, feature creep, and variations in starkly heterogeneous traffic from buyers and sellers transacting over global ecommerce. Such context poses challenges associated with temporal stability of machine learning frameworks in the context of an ever-evolving ecosystem, making a strong case for periodic model retraining. BlogDetecting Financial Fraud Using GANs at Swedbank with Hopsworks and NVIDIA GPUsDiscover how Swedbank trains advanced deep learning models such as generative adversarial neural networks (GANs) using NVIDIA GPUs and Logical Clock’s Hopsworks as part of its fraud and anti-money laundering (AML) strategy.SDKHigh-Performance Data ScienceRun entire data science workflows with high-speed GPU compute and parallelize data loading, data manipulation, and machine learning for 50X faster end-to-end data science pipelines.Deep Learning Inference PlatformNVIDIA’s inference platform delivers the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services—in the cloud, in the data center, at the network’s edge, and more.Click here to view all of the Financial Services sessions and demos on NVIDIA On-Demand.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	JPEG 2000 (.jp2, .jpg2, .j2k) is an image compression standard defined by the Joint Photographers Expert Group (JPEG) as the more flexible successor to the still popular JPEG standard. Part 1 of the JPEG 2000 standard, which forms the core coding system, was first approved in August 2002. To date, the standard has expanded to 17 parts, covering areas like Motion JPEG2000 (Part 3) which extends the standard for video, extensions for three-dimensional data (Part 10), and so on.Features like mathematically lossless compression and large precision and higher dynamic range per component helped JPEG 2000 find adoption in digital cinema applications. JPEG 2000 is also widely used in digital pathology and geospatial imaging, where image dimensions exceed 4K but regions of interest (ROI) stay small.The JPEG 2000 feature set provides ample opportunities for GPU acceleration when compared to its predecessor, JPEG. Through GPU acceleration, images can be decoded in parallel and larger images can be processed quicker. nvJPEG2000 is a new library that accelerates the decoding of JPEG 2000 images on NVIDIA GPUs. It supports codec features commonly used in geospatial imaging, remote sensing, and digital pathology. Figure 1 overviews the decoding stages that nvJPEG2000 accelerates.The Tier1 Decode (entropy decode) stage is the most compute-intensive stage of the entire decode process. The entropy decode algorithm used in the legacy JPEG codec was serial in nature and was hard to parallelize.In JPEG 2000, the entropy decode stage is applied at a block-based granularity (typical block sizes are 64×64 and 32×32) that makes it possible to offload the entropy decode stage entirely to the GPU. For more information about the entropy decode process, see Section C of the JPEG 2000 Core coding system specification.The JPEG 2000 core coding system allows for two types of wavelet transforms (5-3 Reversible and 9-7 Irreversible), both of which benefit from GPU acceleration. For more information about the wavelet transforms, see Section F of the JPEG 2000 Core coding system specification.In this section, we concentrate on the new nvJPEG2000 API tailored for the geospatial domain, which enables decoding specific tiles within an image instead of decoding the full image. Imaging data captured by the European Space Agency’s Sentinel 2 satellites are stored as JPEG 2000 bitstreams. Sentinel 2 level 2A data downloaded from the Copernicus hub can be used with the nvJPEG2000 decoding examples. The imaging data has 12 bands or channels and each of them is stored as an independent JPEG 2000 bitstream. The image in Figure 2 is subdivided into 121 tiles. To speed up the decode of multitile images, a new API called nvjpeg2kDecodeTile has been added in nvJPEG2000 v 0.2, which enables you to decode each tile independently.For multitile images, decoding each tile sequentially would be suboptimal. The GitHub multitile decode sample demonstrates how to decode each tile on a separate cudaStream_t. By taking this approach, you can simultaneously decode multiple tiles on the GPU. Nsight Systems trace in Figure 3 shows the decoding of Sentinel 2 data set consisting of 12 bands. By using 10 CUDA streams, up to 10 tiles are being decoded in parallel at any point during the decode process.Table 1 shows performance data comparing a single stream and multiple streams on a GV100 GPU.Using 10 CUDA streams reduces the total decode time of the entire dataset by about 75% on a Quadro GV100 GPU. For more information, see the Accelerating Geospatial Remote Sensing Workflows Using NVIDIA SDKs [S32150] GTC’21 talk. It discusses geospatial image-processing workflows in more detail and the role nvJPEG2000 plays there.JPEG 2000 is used in digital pathology to store whole slide images (WSI). Figure 4 gives an overview of various deep learning techniques that can be applied to WSI. Deep learning models can be used to distinguish between cancerous and healthy cells. Image segmentation methods can be used to identify a tumor location in the WSI. For more information, see Deep neural network models for computational histopathology: A survey.Table 2 lists the key parameters and their commonly used values of a whole slide image (WSI) compressed using JPEG 2000.The image in question is large and it is not possible to decode the entire image at one time due to the amount of memory required. The size of the decode output is around 53 GB (92000×201712 * 3). This is excluding the decoder memory requirements.There are several approaches to handling such large images. In this post, we describe two of them:Both approaches can be easily performed using specific nvJPEG2000 APIs.The nvJPEG2000 library enables the decoding of a specific area of interest in an image supported as part of the  nvjpeg2kDecodeTile API. The following code example shows how to set the area of interest in terms of image coordinates. The nvjpeg2kDecodeParams_t type enables you to control the decode output settings, such as the area of interest to decode.For more information about how to partially decode an image with multiple tiles, see the Decode Tile Decode GitHub sample.The second approach to decode a large image is to decode the image at lower resolutions. The ability to decode only the lower resolutions is a benefit of JPEG 2000 using wavelet transforms. In Figure 5, wavelet transform is applied up to two levels, which gives you access to the image at three resolutions. By controlling how the inverse wavelet transform is applied, you decode only the lower resolutions of an image.The digital pathology image described in Table 2 has 12 resolutions. This information can be retrieved on a per-tile basis:The image has a size of 92000×201712 with 12 resolutions. If you choose to discard the four higher resolutions and decode the image up to eight resolutions, that means you can extract an image of size 5750×12574. By dropping four higher resolutions, you are scaling the result by a factor of 16.To show the performance improvement that decoding JPEG2000 on GPU brings, compare GPU-based nvJPEG2000 with CPU-based OpenJPEG.Figures 6 and 7 show the average speedup when decoding one image at a time. The following images are used in the measurements:The tables were compiled with OpenJPEG CPU Performance – Intel Xeon Gold 6240@2GHz 3.9GHz Turbo (Cascade Lake) HT On, Number of CPU threads per image=16.On NVIDIA Ampere Architecture GPUs such as NVIDIA RTX A6000, the speedup factor is more than 8x for decoding. This speedup is measured for single-image latency.Even higher speedups can be achieved by batching the decode of multiple images. Figures 8 and 9 compare the speed of decoding a 1920×1080 8-bit image with 444 chroma subsampling (Full HD) in both lossless and lossy modes respectively across multiple GPUs.Figures 8 and 9 demonstrate the benefits of batched decode using the nvJPEG2000 library. There’s a significant performance increase on GPUs with a large number of streaming multiprocessors (SMs), such as A100 and NVIDIA RTX A6000, than with smaller numbers of SMs, such as NVIDIA RTX 4000 and T4. By batching, you are making sure that the compute resources available are efficiently used.As observed from Figure 8, the decode speed on an NVIDIA RTX A6000 is 232 images per second for a batch size of 20. This equates to an additional 3x speed over batch size = 1, based on a benchmark image with a low compression ratio. The compressed bitstream is only about 3x smaller than the uncompressed image. At higher compression ratios, the speedup is faster.The following GitHub samples show how to achieve this speedup both at image and tile granularity:The nvJPEG2000 library accelerates the decoding of JPEG2000 images both in size and volume using NVIDIA GPUs by targeting specific image-processing tasks of interest. Decoding JPEG 2000 images using the nvJPEG2000 library can be as much as 8x faster on GPU (NVIDIA RTX A6000) than on CPU. A further speedup of 3x (24x faster than CPU) is achieved by batching the decode of multiple images.The simple nvJPEG2000 APIs make it easy to include in your applications and workflows. It is also integrated into the NVIDIA Data Loading Library (DALI), a data loading and preprocessing library to accelerate deep learning applications. Using nvJPEG2000 and DALI together makes it easy to use JPEG2000 images as part of deep learning training workflows.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.67	Human pose estimation is a popular computer vision task of estimating key points on a person’s body such as eyes, arms, and legs. This can help classify a person’s actions, such as standing, sitting, walking, lying down, jumping, and so on.Understanding the context of what a person might be doing in a scene has broad application across a wide range of industries. In a retail setting, this information can be used to understand customer behavior, enhance security, and provide richer analytics. In healthcare, this can be used to monitor patients and alert medical personnel if the patient needs immediate attention. On a factory floor, human pose can be used to identify if proper safety protocols are being followed.In general, this is a reliable approach in applications that require understanding of human activity and commonly used as one of the key components in more complex tasks such as gesture, tracking, anomaly detection, and so on.Open-source methods of developing pose estimation exist but are not optimal in terms of inference performance and are time consuming to integrate into production applications. With this post, we show you how to develop and deploy pose estimation models that are easy to use across device profiles, perform extremely well, and are highly accurate.Pose estimation has been integrated with the NVIDIA Transfer Learning Toolkit (TLT) 3.0 so that you can take advantage of all the TLT features, like model pruning and quantization, to create both an accurate and a high-performance model. After it’s trained, you can deploy this model for inference for real-time performance.This post series walks you through the steps of training, optimizing, deploying a real-time high performance pose estimation model. In part 1, you learn how to train a 2D pose estimation model using open-source COCO dataset. In part 2, you learn how to optimize the model for inference throughput and then deploy the model using TLT CV inference pipeline. We compare the trained model from TLT with other state-of-the-art models.In this section, we cover the following topics on training a 2D pose estimation model with TLT:The BodyPoseNet model aims to predict the skeleton for every person in a given input image, which consists of keypoints and the connections between them.The two commonly used approaches to pose estimation are top-down and bottom-up. A top-down approach typically uses an object detection network to localize the bounding boxes of all humans in a frame, and then uses a pose network to localize the body parts within that bounding box. A bottom-up approach, as the name suggests, builds the skeleton from bottom-up. It first detects all human body parts within a frame and then uses a methodology to group the parts that belong to a specific person.There are several reasons to adopt a bottom-up approach. One is higher inference performance. With a bottom-up approach, there is no need for a separate person detector, unlike top-down pose estimation methods. The compute does not scale linearly with the number of persons in the scene. This enables you to achieve real-time performance for crowded scenes as well. Moreover, bottom-up also has the advantage of having global context as the entire image is provided as input to the network. It can handle complex poses and crowding better.Given some of those reasons, this approach aims to achieve efficient single-shot, bottom-up pose estimation while also delivering competitive accuracy. The default model used in this post is a fully convolutional model and consists of a backbone network, an initial prediction stage which does a pixel-wise prediction of confidence maps (heatmap) and part-affinity fields (PAF) followed by multistage refinement (0 to N stages) on the initial predictions. This solution simplifies and abstracts much of the complexities of the bottom-up approach while allowing for the necessary knobs to be tuned for specific applications.PAFs are one way to represent association scores in a bottom-up approach. For more information, see Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. It consists of a set of 2D vector fields that encode the location and orientation of limbs. This, in association with the heatmap, is used to build up the skeleton during post-processing by performing a bipartite matching and associating body part candidates.NVIDIA TLT toolkit helps abstract away the AI/DL framework complexity and enables you to build production quality models faster, with no coding required. For more information about hardware and software requirements, setting up required dependencies, and installing the TLT launcher, see the TLT Quick Start Guide.Download the latest samples using the following command:You can find the sample notebook located at tlt_cv_samples:v1.1.0/bpnet, which also includes all the steps in detail.Set up env variables for cleaner command line commands. Update the following variable values:To run the TLT launcher, map the ~/tlt-experiments directory on the local machine to the Docker container using the ~/.tlt_mounts.json file. For more information, see TLT Launcher.Create the ~/.tlt_mounts.json file and update the following content inside:Make sure that the source directory paths to be mounted are valid. This mounts the path /home/<username>/tlt-experiments on the host machine to be the path /workspace/tlt-experiments inside the container. It also mounts the downloaded specs on the host machine to be the path /workspace/examples/bpnet/specs, /workspace/examples/bpnet/data_pose_config, and /workspace/examples/bpnet/model_pose_config inside the container.Make sure that you have installed the required dependencies by running the following command:To get started, set up an NGC account and then download the pretrained model. Currently, only the vgg19 backbone is supported.We use the COCO (common objects on context) 2017 dataset in this post as an example. Download the dataset and extract as per the instructions:Unzip the images directories into the $LOCAL_DATA_DIR directory and the annotations into $LOCAL_DATA_DIR/annotations.To prepare the data for training, you must generate segmentation masks to be used for masking the loss of unlabeled persons and tfrecords to feed to the training pipeline. The mask folder is based on the path provided in the coco_spec.json file. mask_root_dir_path directory is a relative path to root_directory_path, as are mask_root_dir_path and annotation_root_dir_path.To use this example with a custom dataset:For more information, see the following docs:The next step is to configure the spec file for training. The experiment spec file is essential, as it compiles all the necessary hyperparameters for achieving a good model. The specification file for BodyPoseNet training configures these components of the training pipe:You can find the default specification file at $SPECS_DIR/bpnet_train_m1_coco.yaml. We expand on each component of the specification file but we don’t cover all the parameters here. For more information, see Create a Train Experiment Configuration File.The top-level experiment configs include basic parameters for an experiment; for example, number of epochs, pretrained weights, whether to load the pretrained graph, and so on. An encrypted checkpoint is saved per the checkpoint_n_epoch value. Here’s a code example of some of the top-level configs.All the paths (checkpoint_dir and pretrained_weights) are internal to the Docker container. To verify correctness, check ~/.tlt_mounts.json. For more information about these parameters, see the Body Pose Trainer section.This section helps you with defining datapaths, image configuration, the target pose configuration, normalization parameters, and so on. The augmentation_config section provides some on-the-fly augmentation options. It supports basic spatial augmentations, such as flip, zoom, rotate, and translate, which can be configured before training experiments. The label_processor_config section provides the required parameters to configure the ground truth feature map generation.For more information about each parameter, see the Dataloader section.The BodyPoseNet model can be configured using the model option in the spec file. The following is a sample model config to instantiate a custom VGG19-backbone-based model.The number of total stages for pose estimation (stages of refinement + 1) in the network is captured by the stages param which takes any value >= 2. We recommend using the L1 regularizer when training a network before pruning, as L1 regularization makes it easier to prune the network weights. For more information about each parameter in the model, see the Model section.This section describes how to configure the optimizer and learning-rate schedule:The default base_learning_rate is set for a single-GPU training. To use multi-GPU training, you may have to modify the learning_rate value to get similar accuracy. In most cases, scaling up the learning rate by a factor of $NUM_GPUS would be a good start. For instance, if you are using two GPUs, use 2 * base_learning_rate used in one GPU setting, and if you are using four GPUs, use 4 * base_learning_rate. For more information about each parameter in the model, see the Optimizer section.After following the steps to generate TFRecords and masks and setting up a train specification file, you are now ready to start training the body pose estimation network. Use the following command to launch training:Training with more GPUs enables networks to ingest more data faster, saving you precious time during the development process. TLT supports multi-GPU training so that you can train the model with several GPUs in parallel. We recommend using four GPUs or more for training the model as one GPU might take several days to complete. The training time roughly decreases by a factor of $NUM_GPUS. Make sure that you update the learning rates accordingly, based on the linear scaling method described in the Optimizer section.BodyPoseNet supports restarting from checkpoint. In case the training job is killed prematurely, you may resume training from the last saved checkpoint by simply rerunning the same command. Make sure that you use the same number of GPUs when restarting the training.Start with configuring the inference and evaluation specification file. The following code example is a sample specification:The value of input_shape here can be different from the input_dims value used for training. The multi_scale_inference parameter enables multiscale refinement over the provided scales. Because you are using a model of stride 8, output_upsampling_factor is set to 8.To keep the evaluation consistent with bottom-up human pose estimation research, there are two modes and specification files to evaluate the model:There is another mode used primarily to verify against the final exported TRT models. You use this in later sections.The --model_filename argument overrides the model_path variable in the inference specification file.To evaluate the model, use the following command:Now that you’ve trained the model, run inference and verify the predictions. To verify the model visually with TLT, use the tlt bpnet inference command. The tool supports running inference on the .tlt model, as well as the TensorRT .engine model. It generates annotated images with skeleton rendered on them and serialized frame-by-frame keypoint labels and metadata in detections.json. For example, to run inference with a trained .tlt model, run the following command:Figure 1 shows an example of the original image and Figure 2 shows the output image with pose results rendered. As you can see, the model is robust to an image that is different from the COCO training data.In this post, you learned about training body pose models using the BodyPoseNet app in TLT. The post showed taking an open-source COCO dataset with a pretrained backbone from NGC to train a model with TLT. To optimize the trained model for inference and deployment, see Training and Optimizing the 2D Pose Estimation Model, Part 2.For more information, see the following resources:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.The following code example on the left is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown on the right.This increases code complexity in the application because the memory management code is separated out from the business logic. The problem is exacerbated when other libraries are involved. For example, consider the case where kernelA is launched by a library function instead:This is much harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when that function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory.Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. Consider the following code example:It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.The following example shows various valid usages.Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order.The CUDA driver uses memory pools to achieve the behavior of returning a pointer immediately.The stream-ordered memory allocator introduces the concept of memory pools to CUDA. A memory pool is a collection of previously allocated memory that can be reused for future allocations. In CUDA, a pool is represented by a cudaMemPool_t handle. Each device has a notion of a default pool whose handle can be queried using cudaDeviceGetDefaultMemPool.You can also explicitly create your own pools and either use them directly or set them as the current pool for a device and use them indirectly. Reasons for explicit pool creation include custom configuration, as described later in this post. When no explicitly created pool has been set as the current pool for a device, the default pool acts as the current pool.When called without an explicit pool argument, each call to cudaMallocAsync infers the device from the specified stream and attempts to allocate memory from that device’s current pool. If the pool has insufficient memory, the CUDA driver calls into the OS to allocate more memory. Each call to cudaFreeAsync returns memory to the pool, which is then available for re-use on subsequent cudaMallocAsync requests. Pools are managed by the CUDA driver, which means that applications can enable pool sharing between multiple libraries without those libraries having to coordinate with each other.If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.By default, unused memory accumulated in the pool is returned to the OS during the next synchronization operation on an event, stream, or device, as the following code example shows.Returning memory from the pool to the system can affect performance in some cases. Consider the following code example:By default, stream synchronization causes any pools associated with that stream’s device to release all unused memory back to the system. In this example, that would happen at the end of every iteration. As a result, there is no memory to reuse for the next cudaMallocAsync call and instead memory must be allocated through an expensive system call.To avoid this expensive reallocation, the application can configure a release threshold to enable unused memory to persist beyond the synchronization operation. The release threshold specifies the maximum amount of memory the pool caches. It releases all excess memory back to the OS during a synchronization operation.By default, the release threshold of a pool is zero. This means that allunused memory in the pool is released back to the OS during every synchronization operation. The following code example shows how to change the release threshold.Using a nonzero release threshold enables reusing memory from one iteration to the next. This requires only simple bookkeeping and makes the performance of cudaMallocAsync independent of the size of the allocation, which results in dramatically improved memory allocation performance (Figure 2).The pool threshold is just a hint. Memory in the pool can also be released implicitly by the CUDA driver to enable an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc or cuMemCreate could cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request.This is especially helpful in scenarios where an application makes use of multiple libraries, some of which use cudaMallocAsync and some that do not. By automatically freeing up unused pool memory, those libraries do not have to coordinate with each other to have their respective allocation requests succeed.There are limitations to when the CUDA driver automatically reassigns memory from a pool to unrelated allocation requests. For example, the application may be using a different interface, like Vulkan or DirectX, to access the GPU, or there may be more than one process using the GPU at the same time. Memory allocation requests in those contexts do not cause automatic freeing of unused pool memory. In such cases, the application may have to explicitly free unused memory in the pool, by invoking cudaMemPoolTrimTo.The bytesToKeep argument tells the CUDA driver how many bytes it can retain in the pool. Any unused memory that exceeds that size is released back to the OS. The stream parameter to cudaMallocAsync and cudaFreeAsync helps CUDA reuse memory efficiently and avoid expensive calls into the OS. Consider the following trivial code example.In this code example, ptr2 is allocated in stream order after ptr1 is freed. The ptr2 allocation could reuse some, or all, of the memory that was used for ptr1 without any synchronization, because kernelA and kernelB are launched in the same stream. So, stream-ordering semantics guarantee that kernelB cannot begin execution and access the memory until kernelA has completed. This way, the CUDA driver can help keep the memory footprint of the application low while also improving allocation performance.The CUDA driver can also follow dependencies between streams inserted through CUDA events, as shown in the following code example:As the CUDA driver is aware of the dependency between streams A and B, it can reuse the memory used by ptr1 for ptr2. The dependency chain between streams A and B can contain any number of streams, as shown in the following code example.If necessary, the application can disable this feature on a per-pool basis:The CUDA driver can also reuse memory opportunistically in the absence of explicit dependencies specified by the application. While such heuristics may help improve performance or avoid memory allocation failures, they can add nondeterminism to the application and so can be disabled on a per-pool basis. Consider the following code example:In this scenario, there are no explicit dependencies between streamA and streamB. However, the CUDA driver is aware of how far each stream has executed. If, on the second call to cudaMallocAsync in streamB, the CUDA driver determines that kernelA has finished execution on the GPU, then it can reuse some or all of the memory used by ptr1 for ptr2.If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.The application can disable these heuristics as follows:In part 1 of this series, we introduced the new API functions cudaMallocAsync and cudaFreeAsync , which enable memory allocation and deallocation to be stream-ordered operations. Use them to avoid expensive calls to the OS through memory pools maintained by the CUDA driver.In part 2 of this series, we share some benchmark results to show the benefits of stream-ordered memory allocation. We also provide a step-by-step recipe for modifying your existing applications to take full advantage of this advanced CUDA capability.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.68	Despite substantial progress in natural language processing (NLP) research over the last two years and its commercial success, little effort has been devoted to adapting this capability to other significant languages, such as Hindi, Arabic, Portuguese, or Spanish. Obviously, catering to the entire human population with more than 6,500 languages is challenging. At the same time, supporting just 40 languages addresses the NLP needs of more than 60% of the human population. Figure 2 shows that, even across the most frequently used languages, the performance of language models varies tremendously. Bear in mind that this comparison is not perfect, as those languages do have different language entropy. More importantly, the research on most capable large-scale language models seems to be limited to only a handful of high resource languages (languages with a high number of documents available publicly), such as English or Chinese.  The situation is even more complex when you account for domain-specific languages (such as medical, technical, or legal jargon), where besides English only a few high-quality models exist. This is regretful as those domain-specific language models are currently transforming the way that clinicians, engineers, researchers or other experts access information. Unfortunately, there is a limited number of equivalent models outside of English. Fortunately, replicating the success of English-language models across other languages is no longer a research task but predominantly an engineering activity. It no longer requires inventing new models and training approaches, but instead systematic and iterative dataset engineering, model training, and its continuous validation. This does not mean that engineering those models is trivial. Because of the model and dataset sizes used in modern NLP, the training process requires a substantial amount of computing power. Secondly, to use large models, you must collect large textual datasets. Thirdly, because of the shared size of models used, new approaches to training and inference are required.NVIDIA has extensive experience not only in building large-scale language models (ranging from 1 billion to 175 billion parameters) but also in deploying them to production. The goal of this post is to share our knowledge around project organization, infrastructure requirements, and budgeting and to support projects in this area.As hypothesized in Deep Learning Scaling is Predictable, Empirically, the NLP model performance seems to follow a power law with respect to both the model size and the volume of data used for training. As you make models and datasets bigger, the performance continues to improve. The following diagram from Scaling Laws for Neural Language Models demonstrates not only that this relationship holds but more importantly, it holds across nine orders of magnitude of compute.In the NLP scaling law, despite the models at the far right reaching as much as 175 billion parameters (more than 500 times larger than BERT Large), this relationship does not show signs of stopping. This suggests that even further improvement can be expected from larger models. Indeed, switch transformers when scaled to 1.6 trillion parameters (roughly 5000x larger than BERT Large) continue to demonstrate the previously mentioned behavior. More importantly, the large NLP models seem to generate much more robust features capable of solving complex problems even without large-scale fine-tuning datasets. Figure 4 shows this capability across three orders of magnitude of models.Due to this capability, and despite the relatively high cost of their development, large NLP models are likely to not only continue to dominate the NLP processing landscape but also continue to grow, at least by another order of magnitude approaching trillions of parameters.This relationship between model size, dataset size, and model performance is not unique to NLP. I see the same behavior of automatic speech recognition and computer vision models, and across many other disciplines that are a backbone of conversational AI.At the same time, a limited amount of work was devoted to the development of both large-scale datasets and models for other languages. Indeed, the majority of the work that focuses on languages other than English take advantage of smaller models and less curated datasets. For example, NLP using subset from general datasets such as raw Common Crawl).Even less effort is devoted to supporting any of the following:The current status creates an opportunity for local companies willing to invest in model training to lead the development of NLP technologies in the region.Building large-scale language models is not trivial for many reasons. First, the large-scale datasets are not trivial to curate. In the raw format, they are actually quite easy to obtain. Second, the infrastructure required to train these huge models requires substantial systems knowledge to set up. Finally, they require extensive research expertise to train and optimize. What is less widely understood is that training such large models requires software engineering effort. Most interesting models are larger than the memory capacity of not only individual GPUs but also of many multi-GPU servers. The number of mathematical operations required to train them can also make training times unmanageable, measured in months even on fairly sizable systems. Approaches such as model and pipeline parallelism overcome some of those challenges. However, applying them in a naive way could lead to scaling issues, exacerbating an already long training time. Together with organizations such as Microsoft and Stanford University, NVIDIA has worked towards developing tools that streamline the development process of the largest language models and provide computational efficiency and scalability to allow cost-effective training. As a consequence, a wide range of tools abstracting the complexity of large model development are now available, including the following:As a result of those efforts, I’ve seen a substantial reduction in training times of large models. Indeed, the GPT-3 model with 175 billion parameters trained on 300 billion tokens using 1024 NVIDIA A100 Tensor Core GPUs can be trained today in 34 days (as shown in Efficient Large-Scale Language Model Training on GPU Clusters). Based on experimentation, NVIDIA estimates that 1 trillion parameter models can be trained in approximately 84 days with 3072 A100 GPUs. Despite the training cost of those models being high, it is not beyond reach of most large organizations. With further software advances, it is likely to reduce further.Because the development of large language models requires scalable infrastructure, NVIDIA has also consolidated knowledge from building the internal Selene cluster (used for NLP internal research and to deliver record-breaking performance in MLPerf training and inference benchmarks) into a fully packaged product called NVIDIA DGX SuperPOD. This cluster is more than just a system reference design. In fact, it can be bought in its entirety together with software and support of NVIDIA data scientists and applied researchers, similar to the NLP-focused deployment by Naver Clova). Such an approach has already had a substantial impact on the NLP landscape, as it enables organizations with extensive NLP expertise to scale out their efforts fast. More importantly, it enables organizations with limited systems, HPC, or large-scale NLP workload expertise to start iterating in weeks, rather than months or years. The ability to build large language models is just an academic achievement when it’s impossible to take advantage of the results of your work by deploying your models to production. The challenge of deploying models such as GPT-3 correlates to their sheer size, which exceeds the memory capacity of a GPU, and computational complexity. Both are factors that contribute to decreased throughput and high latency of inference. This is a widely understood problem and a range of tools and solutions currently exist to make serving the largest language models simple and cost-effective. NVIDIA Triton Inference Server is an open-source cloud and edge inferencing solution optimized for both CPUs and GPUs. It can be used to host distributed models effectively. To deploy a large model using pipeline parallelism, the model must be split into several parts, for example, manipulating the ONNX graph with tools such as ONNX Graph Surgeon. Each of the parts must be small enough to fit into the memory space of a single GPU.After the model is subdivided, it can be distributed across multiple GPUs without the need to develop any code. You create an NVIDIA Triton YAML configuration file defining how individual parts of the model should be connected.The traffic between individual model parts and their load balancing can be managed automatically by Triton Inference Server. The communication overheads are also kept to a minimum as Triton takes advantage of the latest NVIDIA NVSwitch and third-generation NVIDIA NVLink technology, providing 600 GB/sec GPU-to-GPU direct bandwidth, which is 10X  higher than PCIe gen4. This means that you can efficiently deploy not only medium-scale models of multi billions of parameters but even the largest of models, including GPT-3 with trillions of parameters.For more information, see Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime (GTC21 session).Beyond the ability to host trained large models, it is important to look at optimization techniques. Such techniques can reduce the memory footprint of the models, through quantization and pruning; substantially accelerate the execution; and reduce latency by optimizing memory access, taking advantage of TensorCores or sparsity acceleration.Utilities such as TensorRT provide a wide range of optimized kernels for execution of transformer-based architectures. They can automatically do half precision (FP16) or, in certain cases, INT8 quantization. TensorRT also supports quantization-aware training and provides early support for hardware-accelerated sparsity.The NVIDIA FasterTransformer library specializes in the inference of the transformer neural networks and can be used with models such as BERT or GPT-2/3. This library includes a tensor-parallel inference backend that provides the ability to do the inference of the huge GPT-3 models in parallel on multiple GPUs within the DGX A100 system. This enables you to reduce inference latency by as much as 1.2–3x, depending on model size. With FasterTransformer, you can deploy the largest of Megatron Models with a single line of code.The Microsoft DeepSpeed library has a number of features focused on inference, including support for Mixture-of-Quantization (MoQ), high-performance INT8 kernels, or DeepFusion.Thanks to all of those advances, large language models are no longer limited to academic research as they are making headway into commercial AI-based products.Correct sizing of the challenge is critical for the success of your NLP initiative. The amount of engineering and research staff needed as well as training and inference infrastructure significantly affects your business case. The following factors have significant impact on the overall cost of the development: After the fundamental business questions are addressed, it is possible to estimate the effort and compute required for their development. When you have an understanding of how good your model must be to allow for the product or service, it is possible to estimate the model size needed. The relationship between the performance of language models and the amount of data and model size is widely understood (Figure 9).After you understand the size of the model and dataset that you need, you can estimate the amount of infrastructure required and training time. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters. Furthermore, the scaling of large language models is superlinear, meaning that the training performance does not degrade with the increasing model size but actually increases (Figure 10).Here are the key factors to consider for initial infrastructure sizing:Large language models have appealing properties and will help expand the availability of NLP around the globe. They more performant across a wide number of NLP tasks, but they are also much more sample efficient. They are what are known as few-shot learners and in certain ways are easier to design, as their exact hyperparameter configuration seems unimportant in comparison to their size.  As a consequence, the NLP models are likely to continue to grow. I see empirical evidence justifying at least one if not two orders of magnitude of growth. Fortunately, the technology to build and deploy them to production has matured considerably. The software required to train them has also matured considerably and is broadly available, such as the NVIDIA open source Megatron-based implementation of GPT-3. Quality is continuing to improve, driving down the training times. The infrastructure required to train models in this space is also well understood and commercially available (DGX SuperPOD). It is now possible to deploy the largest of NLP models to production using tools such as Triton Inference Server.  As a consequence, big NLP models are in reach of everyone with the will to pursue them.NVIDIA actively supports customers in the scoping and delivery of large training and inference systems, as well as supporting them in establishing NLP training capability. If you are working towards building your NLP capability, reach out to your local NVIDIA account team. You can also join one of our Deep Learning Institute NLP classes. During the course, you learn how to work with modern NLP models, optimise them with TensorRT, and deploy for cost-effective production with Triton Inference Server. For more information, see any of the following NLP-related GTC presentations:Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.69	Engineers, product developers and designers around the world attended GTC to experience the latest NVIDIA solutions that are accelerating interactive rendering and simulation workflows in real-time.We showcased the NVIDIA-powered AI technologies and features that have made creativity faster and easier for artists and designers worldwide. Industry luminaries joined us at GTC to share their vision for the future of AI and how current developers such as Autodesk, Adobe, Pixar, Bentley Systems and Siemens have integrated the AI technology into their most popular applications. All of these GTC sessions are now available through NVIDIA On-Demand, so learn more about AI and catch up on the latest advancements in professional content creation, from digital twins to GPU-accelerated production rendering. The developer resources listed below are exclusively available to NVIDIA Developer Program members. Join today for free to get access to the tools and training necessary to build on NVIDIA’s technology platform here. Modern AI 1980s-2021 and BeyondProfessor Jürgen Schmidhuber speaks about the past, present, and future of AI. Deep Learning DemystifiedLearn about the fundamentals of AI, high-level use cases and problem-solving methods. AI-Enabled Digital Twins for Resilient Infrastructure (Presented by Bentley Systems)Hear from Bentley Systems on how AI-enabled infrastructure digital twins help facilitate and support the decision-making process for engineers, operators, and other stakeholders. Production Rendering on GPU with Arnold (Presented by Autodesk)Get an exclusive peek on the latest GPU-accelerated developments coming to Arnold, Autodesk’s Academy Award-winning production renderer.Accelerating Machine Learning for Video SystemsSee how Adobe has automated the repetitive, time-consuming parts of content creation with machine learning and AI.Click here to watch all the AI for Graphics sessions from GTC 21. Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NGC team is hosting a webinar and live Q&A. Topics include how to use containers from the NGC catalog deployed from Google Cloud Marketplace to GKE, a managed Kubernetes service on Google Cloud, that easily builds, deploys, and runs AI solutions.Building a Computer Vision Service Using NVIDIA NGC and Google CloudAugust 25 at 10 a.m. PTOrganizations are using computer vision to improve the product experience, increase production, and drive operational efficiencies. But, building a solution requires large amounts of labeled data, the software and hardware infrastructure to train AI models, and the tools to run real-time inference that will scale with demand.With one click, NGC containers for AI can be deployed from Google Cloud Marketplace to GKE. This managed Kubernetes service on Google Cloud, makes it easy for enterprises to build, deploy, and run their AI solutions.By joining this webinar, you will learn:Register now >>> Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.70	NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models and make it easier to create new conversational AI models. NeMo is an open-source project, and we welcome contributions from the research community.The 1.0 update brings significant architectural, code quality, and documentation improvements as well as a plethora of new state-of-the-art neural networks and pretrained checkpoints in several languages. The best way to start with NeMo is by installing it in your regular PyTorch environment:NeMo is a PyTorch ecosystem project that relies heavily on two other projects from the ecosystem: PyTorch Lightning for training and Hydra for configuration management. You can also use NeMo models and modules within any PyTorch code.NeMo comes with three main collections: ASR, NLP, and TTS. They are collections of models and modules that are ready to be reused in your conversational AI experiments. Most importantly, for most of the models, we provide weights pretrained on various datasets using tens of thousands of GPU hours.The NeMo ASR collection is the most extensive collection with a lot to offer for researchers of all levels, from beginners to advanced. If you are new to deep learning for speech recognition, we recommend that you get started with an interactive notebook for both ASR and NeMo overview. If you are an experienced researcher looking to create your own model, you’ll find various ready-to-use building blocks:The NeMo ASR collection provides you with various types of ASR networks: Jasper, QuartzNet, CitriNet, and Conformer. With the NeMo 1.0 update, the CitriNet and Conformer models are the next flagship ASR models providing better accuracy on word-error-rate (WER) than Jasper and QuartzNet while maintaining similar or better efficiency.CitriNet is an improvement upon QuartzNet that uses several ideas originally introduced in ContextNet. It uses subword encoding through word piece tokenization and Squeeze-and-Excitation mechanism to obtain highly accurate audio transcripts while using a nonautoregressive, CTC-based decoding scheme for efficient inference.Conformer-CTC is a CTC-based variant of the Conformer model that uses CTC loss and decoding instead of RNN-T loss, making it a nonautoregressive model. This model combines self-attention and convolution modules to achieve the best of both worlds. The self-attention modules can learn the global interaction while the convolutions efficiently capture the local correlations.This model gives you an option to experiment with attention-based models. Due to the global context obtained by self-attention and squeeze-and-excitation mechanism, Conformer and CitriNet models have superior WER in offline scenarios.You can use Citrinet and Conformer models with CTC as well as RNN-T decoders.We spent tens of thousands of GPU hours training ASR models in various languages. In NeMo, we offer these checkpoints back to the community for free. As of this release, NeMo has ASR models in English, Spanish, Chinese, Catalonian, Italian, Russian, French, and Polish. Moreover, we partner with Mozilla to make more pretrained models available with the help of Mozilla Common Voice project.Finally, NeMo’s ASR collection contains reusable building blocks and pretrained models for various other important speech-based tasks such as: voice activity detection, speaker recognition, diarization, and voice command detection.Natural language processing (NLP) is essential for providing a great conversational AI experience. The NeMo NLP collection provides a set of pretrained models for typical NLP tasks such as question answering, punctuation and capitalization, named entity recognition, and neural machine translation. Hugging Face transformers have fueled many recent advances in NLP by providing a huge set of pretrained models and an easy-to-use experience for developers and researchers. NeMo is compatible with transformers in that most of the pretrained Hugging Face NLP models can be imported into NeMo. You may provide pretrained BERT-like checkpoints from transformers for the encoders of common tasks. The language models of the common tasks are initialized in default with the pretrained model from Hugging Face transformers. NeMo is also integrated with models trained by NVIDIA Megatron, allowing you to incorporate Megatron-based encoders into your question answering and neural machine translation models. NeMo can be used to fine-tune model-parallel models based on Megatron.In today’s globalized world, it has become important to communicate with people speaking different languages. A conversational AI system capable of converting source text from one language to another will be a powerful communication tool. NeMo 1.0 now supports neural machine translation (NMT) tasks with transformer-based models allowing you to quickly build an end-to-end language translation pipelines. This release includes pretrained NMT models for the following language pairs in both directions:Because tokenization is an extremely important part of NLP and NeMo supports most widely used tokenizers, such as HF tokenizers, SentencePiece, and YouTokenToMe.If humans can talk to computers, the computers should be able to talk back as well. Speech synthesis takes text as an input and generates humanized audio output. This is typically accomplished with two models: a spectrogram generator that generates spectrograms from text and a vocoder that generates audio from spectrogram. The NeMo TTS collection provides you with the following models:Here’s a simple example demonstrating how to use NeMo for prototyping a universal translator app. This app takes a Russian audio file and generates an English translation audio. You can play with it using the AudioTranslationSample.ipynb notebook.The best part of this example is that you can fine-tune all the models used here on your datasets. In-domain fine-tuning is a great way to improve the performance of your models on specific applications. The NeMo GitHub repo provides plenty of fine-tuning examples.NeMo models have a common look and feel, regardless of domain. They are configured, trained, and used in a similar fashion.An ability to run experiments and quickly test new ideas is key to successful research. With NeMo, you can speed up training by using the latest NVIDIA Tensor Cores and model parallel training features across many nodes and hundreds of GPUs. Much of this functionality is provided with the help of the PyTorch Lightning trainer, which has an intuitive and easy-to-use API.For speech recognition, language modeling, and machine translation, we provide high-performance web dataset-based data loaders. These data loaders can handle scaling to tens of thousands of hours of speech data to deliver high performance in massively distributed settings with thousands of GPUs.Proper preparation of training data and pre and post-processing are hugely important and often overlooked steps in all machine learning pipelines. NeMo 1.0 includes new features for dataset creation and speech data explorer.NeMo 1.0 includes important text processing features such as text normalization and inverse text normalization. Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step before training TTS models. It could also be used for preprocessing ASR training transcripts. Inverse text normalization (ITN) is a reverse operation and is often a part of the ASR post-processing pipeline. It is the task of converting the raw spoken output of the ASR model into its written form to improve text readability.For example, the normalized version of “It weighs 10 kg.” would be “It weighs 10 kilograms”.NeMo 1.0 release substantially improves overall quality and documentation. It adds support for new tasks such as neural machine translation and many new models pretrained in different languages. As a mature tool for ASR and TTS, it also adds new features for text normalization and denormalization, dataset creation based on CTC-segmentation and speech data explorer. These updates benefit researchers in academia and industry by making it easier for you to develop and train new conversational AI models.Many NeMo models can be exported to NVIDIA Riva for production deployment and high-performance inference. NVIDIA Riva is an accelerated SDK for building multimodal conversational AI services that delivers real-time performance on GPUs.We welcome external contributions! On the NVIDIA NeMo GitHub page, you can try out the examples, participate in community discussions, and take your models from research to production using NeMo and NVIDIA Riva.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models and make it easier to create new conversational AI models. NeMo is an open-source project, and we welcome contributions from the research community.The 1.0 update brings significant architectural, code quality, and documentation improvements as well as a plethora of new state-of-the-art neural networks and pretrained checkpoints in several languages. The best way to start with NeMo is by installing it in your regular PyTorch environment:NeMo is a PyTorch ecosystem project that relies heavily on two other projects from the ecosystem: PyTorch Lightning for training and Hydra for configuration management. You can also use NeMo models and modules within any PyTorch code.NeMo comes with three main collections: ASR, NLP, and TTS. They are collections of models and modules that are ready to be reused in your conversational AI experiments. Most importantly, for most of the models, we provide weights pretrained on various datasets using tens of thousands of GPU hours.The NeMo ASR collection is the most extensive collection with a lot to offer for researchers of all levels, from beginners to advanced. If you are new to deep learning for speech recognition, we recommend that you get started with an interactive notebook for both ASR and NeMo overview. If you are an experienced researcher looking to create your own model, you’ll find various ready-to-use building blocks:The NeMo ASR collection provides you with various types of ASR networks: Jasper, QuartzNet, CitriNet, and Conformer. With the NeMo 1.0 update, the CitriNet and Conformer models are the next flagship ASR models providing better accuracy on word-error-rate (WER) than Jasper and QuartzNet while maintaining similar or better efficiency.CitriNet is an improvement upon QuartzNet that uses several ideas originally introduced in ContextNet. It uses subword encoding through word piece tokenization and Squeeze-and-Excitation mechanism to obtain highly accurate audio transcripts while using a nonautoregressive, CTC-based decoding scheme for efficient inference.Conformer-CTC is a CTC-based variant of the Conformer model that uses CTC loss and decoding instead of RNN-T loss, making it a nonautoregressive model. This model combines self-attention and convolution modules to achieve the best of both worlds. The self-attention modules can learn the global interaction while the convolutions efficiently capture the local correlations.This model gives you an option to experiment with attention-based models. Due to the global context obtained by self-attention and squeeze-and-excitation mechanism, Conformer and CitriNet models have superior WER in offline scenarios.You can use Citrinet and Conformer models with CTC as well as RNN-T decoders.We spent tens of thousands of GPU hours training ASR models in various languages. In NeMo, we offer these checkpoints back to the community for free. As of this release, NeMo has ASR models in English, Spanish, Chinese, Catalonian, Italian, Russian, French, and Polish. Moreover, we partner with Mozilla to make more pretrained models available with the help of Mozilla Common Voice project.Finally, NeMo’s ASR collection contains reusable building blocks and pretrained models for various other important speech-based tasks such as: voice activity detection, speaker recognition, diarization, and voice command detection.Natural language processing (NLP) is essential for providing a great conversational AI experience. The NeMo NLP collection provides a set of pretrained models for typical NLP tasks such as question answering, punctuation and capitalization, named entity recognition, and neural machine translation. Hugging Face transformers have fueled many recent advances in NLP by providing a huge set of pretrained models and an easy-to-use experience for developers and researchers. NeMo is compatible with transformers in that most of the pretrained Hugging Face NLP models can be imported into NeMo. You may provide pretrained BERT-like checkpoints from transformers for the encoders of common tasks. The language models of the common tasks are initialized in default with the pretrained model from Hugging Face transformers. NeMo is also integrated with models trained by NVIDIA Megatron, allowing you to incorporate Megatron-based encoders into your question answering and neural machine translation models. NeMo can be used to fine-tune model-parallel models based on Megatron.In today’s globalized world, it has become important to communicate with people speaking different languages. A conversational AI system capable of converting source text from one language to another will be a powerful communication tool. NeMo 1.0 now supports neural machine translation (NMT) tasks with transformer-based models allowing you to quickly build an end-to-end language translation pipelines. This release includes pretrained NMT models for the following language pairs in both directions:Because tokenization is an extremely important part of NLP and NeMo supports most widely used tokenizers, such as HF tokenizers, SentencePiece, and YouTokenToMe.If humans can talk to computers, the computers should be able to talk back as well. Speech synthesis takes text as an input and generates humanized audio output. This is typically accomplished with two models: a spectrogram generator that generates spectrograms from text and a vocoder that generates audio from spectrogram. The NeMo TTS collection provides you with the following models:Here’s a simple example demonstrating how to use NeMo for prototyping a universal translator app. This app takes a Russian audio file and generates an English translation audio. You can play with it using the AudioTranslationSample.ipynb notebook.The best part of this example is that you can fine-tune all the models used here on your datasets. In-domain fine-tuning is a great way to improve the performance of your models on specific applications. The NeMo GitHub repo provides plenty of fine-tuning examples.NeMo models have a common look and feel, regardless of domain. They are configured, trained, and used in a similar fashion.An ability to run experiments and quickly test new ideas is key to successful research. With NeMo, you can speed up training by using the latest NVIDIA Tensor Cores and model parallel training features across many nodes and hundreds of GPUs. Much of this functionality is provided with the help of the PyTorch Lightning trainer, which has an intuitive and easy-to-use API.For speech recognition, language modeling, and machine translation, we provide high-performance web dataset-based data loaders. These data loaders can handle scaling to tens of thousands of hours of speech data to deliver high performance in massively distributed settings with thousands of GPUs.Proper preparation of training data and pre and post-processing are hugely important and often overlooked steps in all machine learning pipelines. NeMo 1.0 includes new features for dataset creation and speech data explorer.NeMo 1.0 includes important text processing features such as text normalization and inverse text normalization. Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step before training TTS models. It could also be used for preprocessing ASR training transcripts. Inverse text normalization (ITN) is a reverse operation and is often a part of the ASR post-processing pipeline. It is the task of converting the raw spoken output of the ASR model into its written form to improve text readability.For example, the normalized version of “It weighs 10 kg.” would be “It weighs 10 kilograms”.NeMo 1.0 release substantially improves overall quality and documentation. It adds support for new tasks such as neural machine translation and many new models pretrained in different languages. As a mature tool for ASR and TTS, it also adds new features for text normalization and denormalization, dataset creation based on CTC-segmentation and speech data explorer. These updates benefit researchers in academia and industry by making it easier for you to develop and train new conversational AI models.Many NeMo models can be exported to NVIDIA Riva for production deployment and high-performance inference. NVIDIA Riva is an accelerated SDK for building multimodal conversational AI services that delivers real-time performance on GPUs.We welcome external contributions! On the NVIDIA NeMo GitHub page, you can try out the examples, participate in community discussions, and take your models from research to production using NeMo and NVIDIA Riva.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.71	In NVIDIA Clara Train 4.0, we added homomorphic encryption (HE) tools for federated learning (FL). HE enables you to compute data while the data is still encrypted.In Clara Train 3.1, all clients used certified SSL channels to communicate their local model updates with the server. The SSL certificates are needed to establish trusted communication channels and are provided through a third party that runs the provisioning tool and securely distributes them to the hospitals. This secures the communication to the server, but the server can still see the raw model (unencrypted) updates to do aggregation.With Clara Train 4.0, the communication channels are still established using SSL certificates and the provisioning tool. However, each client optionally also receives additional keys to homomorphically encrypt their model updates before sending them to the server. The server doesn’t own a key and only sees the encrypted model updates.With HE, the server can aggregate these encrypted weights and then send the updated model back to the client. The clients can decrypt the model weights because they have the keys and can then continue with the next round of training (Figure 1).HE ensures that each client’s changes to the global model stays hidden by preventing the server from reverse-engineering the submitted weights and discovering any training data. This added security comes at a computational cost on the server. However, it can play an important role in healthcare in making sure that patient data stays secure at each hospital while still benefiting from using federated learning with other institutions.We implemented secure aggregation during FL with HE using the TenSEAL library by OpenMined, a convenient Python wrapper around Microsoft SEAL. Both libraries are available as open-source and provide an implementation of Homomorphic encryption for arithmetic of approximate numbers, also known as the CKKS scheme, which was proposed as a solution for encrypted machine learning.Our default uses the following HE setting, specified in Clara Train’s provisioning tool for FL:These settings are recommended and should work for most tasks but could be further optimized depending on your specific model architecture and machine learning problem. For more information about different settings, see this tutorial on the CKKS scheme and benchmarking.To compare the impact of HE to the overall training time and performance, we ran the following experiments. We chose SegResNet (a U-Net like architecture used to win the BraTS 2018 challenge) trained on the CT spleen segmentation task from the Medical Segmentation Decathlon.Each federated learning run was trained for 100 rounds with each client training for 10 local epochs on their local data on an NVIDIA V100 (server and clients are running on localhost). In each run, half of the clients each used half of the training data (16/16) and half of the validation data (5/4), respectively. We recorded the total training time and best average validation dice score of the global model. We show the relative increase added by HE in Table 1.There is a moderate increase in total training time of about 20% when encrypting the full model. This increase in training time is due to the added encryption and decryption steps and aggregation in homomorphically encrypted space. Our implementation enables you to reduce that extra time by only encrypting a subset of the model parameters, for example, all convolutional layers (“conv”). You could also encrypt just three of the key layers, such as the input, middle, and output layers.The added training time is also due to increased message sizes needed to send the encrypted model gradient updates, requiring longer upload times. For SegResNet, we observe an increase from 19 MB to 283 MB using the HE setting mentioned earlier (~15x increase).Next, we compare the performance of FL using up to 30 clients with the server running on AWS. For reference, we used an m5a.2xlarge with eight vCPUs, 32-GB memory, and up to 2,880 Gbps network bandwidth. We show the average encryption, decryption, and upload time, comparing raw compared to encrypted model gradients being uploaded in Figure 2 and Table 2. You can see the longer upload times due to the larger message sizes needed by HE.If you’re interested in learning more about how to set up FL with homomorphic encryption using Clara Train, we have a great Jupyter notebook on GitHub that walks you through the setup.HE can reduce model inversion or data leakage risks if there is a malicious or compromised server. However, your final models might still contain or memorize privacy-relevant information. That’s where differential privacy methods can be a useful addition to HE. Clara Train SDK implements the sparse vector technique (SVT) and partial model sharing that can help preserve privacy. For more information, see Privacy-preserving Federated Brain Tumour Segmentation. Keep in mind that there is a tradeoff between model performance and privacy protection.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	In NVIDIA Clara Train 4.0, we added homomorphic encryption (HE) tools for federated learning (FL). HE enables you to compute data while the data is still encrypted.In Clara Train 3.1, all clients used certified SSL channels to communicate their local model updates with the server. The SSL certificates are needed to establish trusted communication channels and are provided through a third party that runs the provisioning tool and securely distributes them to the hospitals. This secures the communication to the server, but the server can still see the raw model (unencrypted) updates to do aggregation.With Clara Train 4.0, the communication channels are still established using SSL certificates and the provisioning tool. However, each client optionally also receives additional keys to homomorphically encrypt their model updates before sending them to the server. The server doesn’t own a key and only sees the encrypted model updates.With HE, the server can aggregate these encrypted weights and then send the updated model back to the client. The clients can decrypt the model weights because they have the keys and can then continue with the next round of training (Figure 1).HE ensures that each client’s changes to the global model stays hidden by preventing the server from reverse-engineering the submitted weights and discovering any training data. This added security comes at a computational cost on the server. However, it can play an important role in healthcare in making sure that patient data stays secure at each hospital while still benefiting from using federated learning with other institutions.We implemented secure aggregation during FL with HE using the TenSEAL library by OpenMined, a convenient Python wrapper around Microsoft SEAL. Both libraries are available as open-source and provide an implementation of Homomorphic encryption for arithmetic of approximate numbers, also known as the CKKS scheme, which was proposed as a solution for encrypted machine learning.Our default uses the following HE setting, specified in Clara Train’s provisioning tool for FL:These settings are recommended and should work for most tasks but could be further optimized depending on your specific model architecture and machine learning problem. For more information about different settings, see this tutorial on the CKKS scheme and benchmarking.To compare the impact of HE to the overall training time and performance, we ran the following experiments. We chose SegResNet (a U-Net like architecture used to win the BraTS 2018 challenge) trained on the CT spleen segmentation task from the Medical Segmentation Decathlon.Each federated learning run was trained for 100 rounds with each client training for 10 local epochs on their local data on an NVIDIA V100 (server and clients are running on localhost). In each run, half of the clients each used half of the training data (16/16) and half of the validation data (5/4), respectively. We recorded the total training time and best average validation dice score of the global model. We show the relative increase added by HE in Table 1.There is a moderate increase in total training time of about 20% when encrypting the full model. This increase in training time is due to the added encryption and decryption steps and aggregation in homomorphically encrypted space. Our implementation enables you to reduce that extra time by only encrypting a subset of the model parameters, for example, all convolutional layers (“conv”). You could also encrypt just three of the key layers, such as the input, middle, and output layers.The added training time is also due to increased message sizes needed to send the encrypted model gradient updates, requiring longer upload times. For SegResNet, we observe an increase from 19 MB to 283 MB using the HE setting mentioned earlier (~15x increase).Next, we compare the performance of FL using up to 30 clients with the server running on AWS. For reference, we used an m5a.2xlarge with eight vCPUs, 32-GB memory, and up to 2,880 Gbps network bandwidth. We show the average encryption, decryption, and upload time, comparing raw compared to encrypted model gradients being uploaded in Figure 2 and Table 2. You can see the longer upload times due to the larger message sizes needed by HE.If you’re interested in learning more about how to set up FL with homomorphic encryption using Clara Train, we have a great Jupyter notebook on GitHub that walks you through the setup.HE can reduce model inversion or data leakage risks if there is a malicious or compromised server. However, your final models might still contain or memorize privacy-relevant information. That’s where differential privacy methods can be a useful addition to HE. Clara Train SDK implements the sparse vector technique (SVT) and partial model sharing that can help preserve privacy. For more information, see Privacy-preserving Federated Brain Tumour Segmentation. Keep in mind that there is a tradeoff between model performance and privacy protection.Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.72	Targeting areas populated with disease-carrying mosquitoes just got easier thanks to a new study. The research, recently published in IEEE Explore, uses deep learning to recognize tiger mosquitoes with near perfect accuracy from images taken by citizen scientists .“Identifying the mosquitoes is fundamental, as the diseases they transmit continue to be a major public health issue,” said lead author Gereziher Adhane.The study, from researchers at the Scene UNderstanding & Artificial Intelligence (SUnAI) research group, of the Universitat Oberta de Catalunya’s (UOC) Faculty of Computer Science, Multimedia and Telecommunications and of the eHealth Center, uses images from the Mosquito Alert app. Developed in Spain and currently expanding globally, the platform brings together citizens, entomologists, public health authorities, and mosquito-control services to reduce mosquito-borne diseases. Anyone in the world can upload geo-tagged images of mosquitoes to the app. Three expert entomologists inspect and validate the submitted images before they are added to the database, classified, and mapped. Travel and migration, along with climate change and urbanization, has broadened the range and habitat of mosquitoes. The quick identification of species such as the tiger—known to transmit dengue, Zika, chikungunya, and yellow fever—remains a key step in assisting relevant authorities to curb their spread. “This type of analysis depends largely on human expertise and requires the collaboration of professionals, is typically time-consuming, and is not cost-effective because of the possible rapid propagation of invasive species,” said Adhane. “This is where neural networks can play a role as a practical solution for controlling the spread of mosquitoes.”The research team developed a deep convolutional neural network that distinguishes between mosquito species. Starting with a pretrained model, they fine-tuned it using the hand-labeled Mosquito Alert dataset. Using NVIDIA GPUs and the cuDNN-accelerated PyTorch deep learning framework, the classification models were taught to pinpoint tiger mosquitoes based on identifiable morphological features such as white stripes on the legs, abdominal patches, head, and thorax shape.  Deep learning models typically rely on millions of samples. However, using only 6,378 images of both tiger and non-tiger mosquitoes from Mosquito Alert, the researchers were able to train the model with about 94% accuracy. “The neural network we have developed can perform as well or nearly as well as a human expert and the algorithm is sufficiently powerful to process massive amounts of images,” said Adhane.According to the researchers, as Mosquito Alert scales up, the study can be expanded to classify multiple species of mosquitoes and their breeding sites across the globe. “The model we have developed could be used in practical applications with small modifications to work with mobile apps. Using this trained network, it is possible to make predictions about images of mosquitoes taken using smartphones efficiently and in real time,” Adhane said. The GPU used in the research was a donation provided by the NVIDIA Academic Hardware Grant Program.Read the full article in IEEE Explore >>Read more >>  Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.	The NVIDIA NGC team is hosting a webinar with live Q&A to dive into our new Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey.NVIDIA NGC Jupyter Notebook Day: Building a 3D Medical Imaging Segmentation ModelThursday, July 22 at 9:00 AM PTImage segmentation deals with placing each pixel (or voxel in the case of 3D) of an image into specific classes that share common characteristics. In medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and more. However, building, training, and optimizing an accurate image segmentation AI model from scratch can be time consuming for novices and experts alike.By joining this webinar, you will learn:Register now >>Have a story to share? Submit an idea.Get the developer news feed straight to your inbox.
 |