Why You Should Leverage NVMe SSDs for Artificial Intelligence & Machine Learning Workloads

Posted on May 5, 2020 by rawee.k

Whether you run a large enterprise with a multimillion-dollar data center or a small shop with a few application servers in a server closet, you have likely seen artificial intelligence (AI) and machine learning (ML) use cases explode over the past few years. NVMe SSDs have proven to be a better solution for handling these workloads than drives built on older storage protocols.

In today’s hybrid world, data center leaders have had to become savvier and more effective in every aspect of how their IT infrastructure deals with these highly transactional workloads. They run data-hungry, next-gen applications powered by NVMe SSDs with high-performance object storage infrastructures to capture, store and analyze more data faster for users.

Delivering Low Latency & High Throughput for AI & ML Workloads

NVMe SSDs offer great benefits for specific AI use cases, like training a machine learning model. Machine learning involves two phases – training a model based on what is learned from the dataset, and running the model (inference). Training is the more resource-hungry of the two.

The datasets used for model training today can be huge – for example, MRI scans can reach terabytes each, and a learning system may use tens or hundreds of thousands of images. Even if the training itself runs from RAM, that memory has to be fed from non-volatile storage, which must sustain very high bandwidth. In addition, paging out old training data and bringing in new data has to happen as fast as possible so the GPUs are not left idle. That calls for low latency as well, and NVMe is the storage protocol built to deliver both high bandwidth and low latency.
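One common way to keep GPUs from waiting on storage is to read the next batches in the background while the current batch is being processed. The sketch below is a minimal illustration of that idea in plain Python, not a description of any particular framework's data loader. The `prefetch` function and its `depth` parameter are hypothetical names chosen for this example, and `batches` stands in for whatever iterator reads your training data from disk.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches`, reading up to `depth` items ahead.

    A background thread pulls from the (storage-bound) iterator and parks
    results in a bounded queue, so the consumer (the training step) can
    overlap its compute with the next reads.
    """
    buffer = queue.Queue(maxsize=depth)
    done = object()  # unique end-of-stream marker

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(done)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


if __name__ == "__main__":
    # Stand-in for a loader that reads batches from NVMe storage.
    for batch in prefetch(iter(range(5)), depth=2):
        print("training on batch", batch)
```

The faster and lower-latency the storage behind `batches`, the smaller `depth` needs to be to hide read time behind compute.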

However, there are limits on how many local NVMe drives a server can use. The ceiling is often set by how many PCIe lanes can actually be allocated to NVMe drives, since the GPUs/SoCs also require PCIe lanes. Additionally, for checkpointing, many local NVMe drives need to be synchronized to produce a usable snapshot.
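To make the PCIe lane constraint concrete, here is a rough back-of-the-envelope calculation. It assumes each GPU takes an x16 link and each NVMe drive an x4 link, which are typical but not universal. The function name, the `reserved_lanes` parameter and the 128-lane platform in the example are illustrative assumptions, not figures for any specific server.

```python
def max_local_nvme_drives(total_lanes, gpus, lanes_per_gpu=16,
                          lanes_per_drive=4, reserved_lanes=0):
    """Estimate how many NVMe drives fit in the PCIe lanes left after GPUs.

    `reserved_lanes` covers anything else on the bus (for example a
    network adapter). All defaults are assumptions for illustration.
    """
    free_lanes = total_lanes - gpus * lanes_per_gpu - reserved_lanes
    return max(free_lanes, 0) // lanes_per_drive


if __name__ == "__main__":
    # Hypothetical platform: 128 lanes, 4 GPUs, one x16 network adapter.
    print(max_local_nvme_drives(128, gpus=4, reserved_lanes=16))  # 12
    # Hypothetical platform with fewer lanes: the GPUs use all of them.
    print(max_local_nvme_drives(64, gpus=4))  # 0
```

Real systems can stretch these numbers with PCIe switches, at the cost of shared bandwidth, so treat the result as a starting point rather than a hard limit.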

NVMe Benefits for Artificial Intelligence & Machine Learning

Server solutions that leverage NVMe SSDs create an ideal storage platform for AI/ML workloads, especially when machine learning serves multiple applications. By applying NVMe-based data storage on solid-state drives, you can:

  • Create and Manage Larger Datasets – By separating storage capacity from the compute nodes, datasets for machine learning training can scale up to 1PB. As the dataset grows, additional NVMe storage can be brought online, increasing performance and avoiding the bottlenecks of legacy storage controllers.
  • Overcome Capacity Limitations of Local SSDs – The limited space for SSD media in GPU nodes restricts the ability to manage larger datasets. Networked NVMe storage, by contrast, lets NVMe volumes be dynamically provisioned over high-performance Ethernet or InfiniBand networks.
  • Accelerate Epoch Time of Machine Learning – NVMe SSDs eliminate the latency bottlenecks of older storage protocols, and high-performance NVMe-oF unleashes the parallelism inherent to the NVMe protocol, accelerating epoch time by as much as 10x. Every GPU node has direct, parallel access to the media at the lowest possible latency.
  • Improve Utilization of GPUs – Having GPUs sit idle because of slow access to data is costly. By offloading storage access to the idle CPUs and delivering storage performance at the speed of local SSDs, NVMe storage keeps the GPU nodes busy with fast access to data.
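The link between storage bandwidth, epoch time and GPU utilization can be approximated with simple arithmetic. The sketch below is a simplified model, not a benchmark: it assumes the whole dataset is read once per epoch and that reads overlap perfectly with compute, so an epoch takes as long as the slower of the two. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
def epoch_estimate(dataset_bytes, storage_bytes_per_sec, compute_seconds):
    """Return (gpu_utilization, epoch_seconds) for one training epoch.

    Simplified model: the full dataset is read once per epoch and I/O
    overlaps perfectly with compute, so the slower of the two sets the
    epoch time. Utilization is the share of that time the GPUs are busy.
    """
    io_seconds = dataset_bytes / storage_bytes_per_sec
    epoch_seconds = max(io_seconds, compute_seconds)
    return compute_seconds / epoch_seconds, epoch_seconds


if __name__ == "__main__":
    dataset = 10e12   # hypothetical 10 TB dataset
    compute = 3600    # hypothetical 1 hour of GPU compute per epoch
    for label, bandwidth in [("slower tier", 0.5e9), ("faster tier", 3e9)]:
        utilization, seconds = epoch_estimate(dataset, bandwidth, compute)
        print(f"{label}: {utilization:.0%} GPU utilization, "
              f"{seconds / 3600:.2f} h per epoch")
```

In this model, storage that delivers 0.5 GB/s leaves the GPUs busy only 18% of the time, while 3 GB/s is enough for compute, not storage, to set the epoch time. Real pipelines add decoding, augmentation and caching effects that this sketch ignores.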

Leveraging NVMe SSD Solutions for AI/ML

Following the launch of the new WD Ultrastar DC series NVMe SSDs with capacities from 800GB to 20.72TB, Pogo Linux server users will experience low latency and high throughput for high-transaction AI and ML applications from fast, dense and efficient NVMe SSD storage. In both single- and dual-processor rackmount server configurations, data center leaders will be able to accelerate access to critical data – ideal for read-intensive AI and ML workloads – with 96-layer 3D NAND and up to 1.2M random read IOPS.

We’d be excited to take a deeper dive into how to architect low-latency, high-throughput NVMe SSD storage at petabyte scale for AI & ML workloads. Schedule a time or give us a call at (888) 828-7646 to learn how we’ve helped businesses of all sizes with custom hardware configurations and exceptional technical support.