Accelerating AI Workflows by Eliminating Storage Bottlenecks
The biggest advantage of modern GPU computing is also creating its biggest challenge: GPUs have an amazing appetite for data. Current GPU servers can process tens of gigabytes of data per second. NVIDIA’s latest DGX-2™ system has as many as 16 GPUs but far too little local storage to keep them fed. The DGX-1 has a theoretical limit of 7.8GB/s bandwidth, but with only 4 SATA SSDs it is limited to about 2.2GB/s. Theoretically, it can process 2 million random IOPS, but local storage provides only 400K IOPS. The latest NVIDIA DGX-2 has 30TB (8 x 3.84TB) of local NVMe but is not optimized to use it efficiently. GPU servers from other vendors typically offer few PCIe lanes for local flash (NVMe or other), so even the lowest-latency local option on these servers is a severe bottleneck or simply provides too little capacity for the GPUs. Starving the GPUs with slow storage, or spending time copying data, wastes expensive GPU resources and hurts ROI.
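To make the size of that gap concrete, the short Python sketch below computes how much of the DGX-1's theoretical limits the local SSDs can cover, using only the four figures quoted above. The helper name `fed_fraction` and the constant names are illustrative choices, not part of any NVIDIA or Excelero tooling.

```python
# Figures quoted in the text for the DGX-1 (decimal units).
GPU_BW_GBPS = 7.8        # theoretical bandwidth limit, GB/s
LOCAL_BW_GBPS = 2.2      # approximate bandwidth of the 4 local SATA SSDs, GB/s
GPU_IOPS = 2_000_000     # random IOPS the system can theoretically process
LOCAL_IOPS = 400_000     # random IOPS the local storage provides


def fed_fraction(supplied: float, demanded: float) -> float:
    """Return the fraction of demand that the supply can satisfy, capped at 1.0.

    Hypothetical helper for illustration only.
    """
    if demanded <= 0:
        raise ValueError("demanded must be positive")
    return min(supplied / demanded, 1.0)


bw_fraction = fed_fraction(LOCAL_BW_GBPS, GPU_BW_GBPS)
iops_fraction = fed_fraction(LOCAL_IOPS, GPU_IOPS)

print(f"Bandwidth: local storage covers {bw_fraction:.0%} of the theoretical limit")
print(f"IOPS:      local storage covers {iops_fraction:.0%} of the theoretical limit")
```

By these numbers, local storage covers roughly 28% of the bandwidth limit and 20% of the IOPS limit, leaving most of the system's data-processing capacity idle.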
Fortunately, NVIDIA’s DGX nodes also have massive network connectivity: they can ingest as much as 48GB/s via 4-8 x 100Gb ports, and this plays a key part in the solution. Excelero’s NVMesh enables customers to maximize GPU utilization by combining the massive network connectivity of the DGX systems with the low latency and high IOPS and bandwidth of NVMe in a distributed, linearly scalable architecture. Scalable, disaggregated shared storage enables DGX platforms to work on huge data sets and reduce training time from weeks to days.
NVMesh for AI and ML workloads
Excelero’s NVMesh eliminates any compromise between performance and practicality, allowing GPU-optimized servers to access scalable, high-performance NVMe flash storage pools as if they were local flash. This approach ensures efficient use of both the GPUs themselves and the associated NVMe flash. The end result is higher ROI, easier workflow management and faster time to results.
- Shares storage resources across multiple GPU servers
- Eliminates the need to copy data locally
- Supports datasets larger than what fits inside the DGX