Part 3 of 3
In the first part of this series, we discussed the similarities and differences between HPC and AI with a special focus on the storage aspects. In the second part, a closer look at the 3 Phases of Deep Learning provided us with a better understanding of what the storage requirements for Deep Learning are, given that DL is generally considered to be one of the most important drivers for storage performance nowadays. In this third and final part, we now have all we need to design an optimal storage system for scalable Deep Learning.
Storage for Scalable Deep Learning Training
From the previous chapters, we already know the most important properties that an optimal storage solution for training should have and can now sum them up to validate that Excelero’s NVMesh AI storage solution can be used to meet them:
- Low latency and high read operations per second: This makes NVMe the ideal technology, as it provides both high access performance and affordable capacity at scale.
- Data protection: The training phase is business-critical for modern companies, which requires the storage system to be fault-tolerant. Excelero’s MeshProtect feature protects the data across multiple NVMe drives and across multiple servers with different options, including distributed erasure coding.
- Scalable capacity: Through Excelero’s elastic NVMe technology, additional drives or servers can be added seamlessly at any time to increase usable capacity or performance. As Excelero’s patented remote direct drive access (RDDA) technology enables access to remote NVMe drives at the same performance that an internal GPU server drive would have, the solution is not bound by the capacity limitations of internal drives of GPU servers, but still meets and typically exceeds the performance of internal drives.
- Scalable performance: Similar to the capacity scale approach, more drives and more servers can seamlessly be added to Excelero’s NVMesh storage to increase performance.
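Because capacity and performance both scale with the number of drives, a first sizing estimate is simply the larger of the two drive counts that the targets demand. The following is a minimal sketch of that arithmetic; the function name, parameters, and example figures are hypothetical and not taken from any Excelero specification, and the calculation ignores data-protection overhead and network limits.

```python
import math


def drives_needed(read_iops_target, capacity_target_tb,
                  iops_per_drive, capacity_per_drive_tb):
    """Return the smallest drive count that meets both targets.

    All figures are raw values supplied by the caller (hypothetical
    inputs); protection overhead and network limits are not modelled.
    """
    if iops_per_drive <= 0 or capacity_per_drive_tb <= 0:
        raise ValueError("per-drive figures must be positive")
    # Drives required to reach the read IOPS target
    for_iops = math.ceil(read_iops_target / iops_per_drive)
    # Drives required to reach the capacity target
    for_capacity = math.ceil(capacity_target_tb / capacity_per_drive_tb)
    # The stricter requirement decides; always use at least one drive
    return max(for_iops, for_capacity, 1)


# Hypothetical example figures, not vendor specifications
print(drives_needed(read_iops_target=4_000_000, capacity_target_tb=100,
                    iops_per_drive=500_000, capacity_per_drive_tb=4))
```

In this example the capacity target, not the IOPS target, determines the drive count, which shows why it helps to size both dimensions independently of the number of GPU servers.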
Given that Excelero’s NVMesh software-defined storage allows access to remote NVMe drives at the same performance that an internal GPU server drive would have, customers are free to use NVMe drives in dedicated servers. This makes it possible to design and scale the capacity and performance of the storage solution independently of the number of GPU servers, without the need for additional caching or data copying into the GPU servers.
At the same time, NVMesh protects the data from hardware failures and enables shared access to logical volumes from all GPU servers.
Some companies use this concept with NVMesh as the basis for parallel file systems like IBM Spectrum Scale or BeeGFS. But given the significant overhead of parallel file systems when working with lots of small files, and given that the training dataset is typically read-only (except for special occasions when new data is added), Excelero’s customers often prefer to avoid a parallel file system completely and meet their performance demands with less hardware.
In this setup without a parallel file system, the NVMesh logical block volumes that hold the training data are simply made available over the network directly at the GPU servers and mounted with a high-performance local file system like XFS. This enables shared read access to the same data from all GPU servers at a level of access performance far beyond what parallel file systems can achieve on the same hardware. It also reduces management effort and solution costs, as less hardware is required to meet the demands.
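As a rough illustration of the read-only mount described above, the sketch below only builds the mount command for one GPU server and prints it rather than executing it. The device path and mount point are hypothetical placeholders, not confirmed NVMesh naming. The `ro` and `norecovery` options are standard XFS mount options; `norecovery` skips log replay, which would otherwise require writing to the volume. Since a local file system such as XFS is not cluster-aware, the sketch assumes nobody writes to the volume while it is mounted on the GPU servers.

```python
import shlex


def build_ro_mount_command(device, mount_point):
    """Build (but do not run) a read-only XFS mount command."""
    # ro: the client never writes to the shared volume
    # norecovery: skip XFS log replay, which would need write access
    options = "ro,norecovery"
    return ["mount", "-t", "xfs", "-o", options, device, mount_point]


# Hypothetical device path and mount point, for illustration only
cmd = build_ro_mount_command("/dev/nvmesh/training_data", "/mnt/training_data")
print(shlex.join(cmd))
```

Running the same command on every GPU server gives each of them the same read-only view of the training data. After new data has been added to the volume, the clients would need to remount to see a consistent state.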
To address the rather moderate requirements for a shared writable file system to store the generated model, a simple highly available NFS service on the NVMesh servers can be added.
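On the training side, writing the generated model to such a shared file system can be as simple as the sketch below, which writes to a temporary file in the target directory and then renames it, so that other readers do not see a half-written model. The function name and paths are hypothetical. A rename within one directory replaces the target atomically on POSIX file systems; the exact behaviour over NFS can depend on the client and server, so treat this as an assumption to verify in your environment.

```python
import os
import tempfile


def save_checkpoint_atomically(data: bytes, target_path: str) -> None:
    """Write data to target_path via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temporary file in the same directory so the final
    # rename stays within one file system
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the data to stable storage
        os.replace(tmp_path, target_path)  # swap in the finished file
    except BaseException:
        # Remove the partial temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A training job would call this with the serialized model bytes and a path on the NFS mount, for example a hypothetical `/mnt/models/model.bin`.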
The result is a modern storage system that takes full advantage of the latest hardware technologies, while at the same time providing data protection and unmatched performance at arbitrary scale through Excelero’s software-defined storage and patented RDDA technology.
Efficient storage access in the training phase of Deep Learning is widely recognized as one of the biggest challenges for modern companies. The reason is often that solution approaches rely on traditional concepts from areas like High-Performance Computing. While there certainly is some overlap between HPC technologies and DL, Excelero has seen in many customer projects that DL training storage demands are very different from classic HPC storage demands. As a result, classic approaches often leave customers unable to scale their business-critical workloads as needed.
As shown above, a modern solution approach based on Excelero’s NVMesh, which takes the specific aspects of DL workloads into account, can easily overcome such limitations and allows companies to scale their training with full flexibility as needed.