Part 1 of 3
A Storage-Centric View on AI and Deep Learning
Similarities and Differences Between AI and HPC
Nowadays, solution providers tend to manage High Performance Computing (HPC) and Artificial Intelligence (AI) projects through the same divisions and the same people, making their former HPC experts responsible for AI customers as well. The reason is that AI workloads (especially Deep Learning) share some important concepts with HPC. Among them are the ability to manage multiple GPU systems as a cluster and to share those systems in a coordinated way among multiple data scientists. Both types of systems also typically require shared access to data at a high level of performance and communicate over a fast RDMA-enabled network. Especially in scientific research, classic HPC systems nowadays tend to have GPUs added to the compute nodes, making the same cluster suitable for both classic HPC and new AI workloads.
Taking a closer look at industry production AI systems, it turns out that there are very significant differences between AI and classic HPC workloads, which need to be taken into account before applying classic HPC concepts to AI solutions. One of them is the number of hosts. The HPC community has been working towards the first exascale systems (a compute cluster that provides an aggregate performance of 10^18 floating point operations per second) for many years now. The original assumption was that such systems would consist of tens of thousands or even hundreds of thousands of hosts. Consequently, the assumption for storage was that a single host would not have very high performance demands and that the storage system only needs to deliver high throughput when accessed by many clients in parallel, ideally as streaming reads or writes in large blocks.
The Rise of the Supercomputer in a Box
In Deep Learning training, systems typically have multiple GPUs per host, as in Nvidia DGX boxes or similar designs. A single DGX is already a supercomputer in a box, and a single DGX-2 already provides 2 PetaFLOPS, so 500 of them would make an ExaFLOP system – in contrast to the hundreds of thousands of hosts originally assumed for an exascale HPC system. This means the number of hosts (or the number of clients from the storage perspective) is much lower than is typical in HPC. Also, the storage access performance of a single client becomes much more relevant: we have seen people struggle to feed even a single DGX-2 with data fast enough to take advantage of the computational power that the box provides. Another very significant difference from classic HPC is the data access pattern. While HPC traditionally streams large files, the access pattern for Deep Learning training often consists of reading lots of small files, such as individual images, documents or small relevant fragments of larger files. This means that read latency and small read operations per second become the dominating factors in keeping the GPUs busy.
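The host-count contrast above can be sketched as a quick back-of-the-envelope calculation. The 2 PetaFLOPS figure is the one quoted for the DGX-2; the per-node figure for a classic CPU-based HPC node is an assumed round number for illustration only:

```python
# Back-of-the-envelope: hosts needed to reach one ExaFLOP.
EXAFLOP = 1e18            # target aggregate performance, in FLOPS
DGX2_FLOPS = 2e15         # 2 PetaFLOPS per DGX-2, as quoted above
HPC_NODE_FLOPS = 1e13     # ~10 TeraFLOPS per classic HPC node (assumed figure)

dgx2_hosts = EXAFLOP / DGX2_FLOPS
hpc_hosts = EXAFLOP / HPC_NODE_FLOPS

print(f"DGX-2 hosts needed: {dgx2_hosts:.0f}")        # 500
print(f"Classic HPC nodes needed: {hpc_hosts:.0f}")   # 100000
```

Two to three orders of magnitude fewer clients means each individual client must be able to pull data from storage at a correspondingly higher rate.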
Thus, the classic HPC storage solutions that are often suggested for AI systems can’t handle the demands of these systems, which led Excelero to design new concepts to meet these new demands.
Typical AI Customer Requirements
AI is a horizontal topic and thus crosses domains in all industries. Each of these domains can be very different from the others, and even within a single domain there are many different aspects to consider. Taking autonomous driving as a popular example, the systems have to deal with detecting static objects such as traffic signs, detecting variable or moving objects like humans and cars, and recognizing the kind of ground the car is currently on, all of which are taken into account together for complex decisions such as whether it is ok to accelerate now or whether the car should rather slow down.
To break things down into a more manageable domain that is typically very I/O intensive, we will focus on Deep Learning here, although the solutions shown may well be suitable for other areas with similar demands. This means the systems typically consist of hosts with multiple GPUs, have a fast RDMA-capable interconnect, and need to be able to process vast amounts of data in a short time. The performance-critical part is typically the read phase, as the training dataset is processed again and again to improve the resulting model. The reads are typically not bound by large streaming throughput, but rather by small-file access latency and small-file reads per second, as the size of input objects is relatively small and each access incurs the extra work of opening and closing a small file. This is one of the reasons why NVMe storage is the primary choice for Deep Learning systems: classic SATA or SAS drive based approaches are unable to fulfill the demands that can come from even a single DGX box. Write performance for such systems is often less critical and consists of two main factors: one is writing the generated model, whose size is rather small compared to the amount of processed input data; the other is the addition of more training objects, which requires the storage system to be able to grow seamlessly.
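To illustrate why per-file overhead matters, here is a minimal sketch (standard library only; file count and size are hypothetical values chosen for illustration) that reads the same total payload once as many small files and once as a single large file. The small-file path pays an open() and close() call per object, which is exactly the overhead that dominates Deep Learning read workloads:

```python
import os
import tempfile
import time

NUM_FILES = 500
FILE_SIZE = 4096  # 4 KiB per "training object" (assumed size for illustration)

with tempfile.TemporaryDirectory() as d:
    # Create many small files and one large file with the same total payload.
    for i in range(NUM_FILES):
        with open(os.path.join(d, f"obj_{i}.bin"), "wb") as f:
            f.write(os.urandom(FILE_SIZE))
    big = os.path.join(d, "big.bin")
    with open(big, "wb") as f:
        f.write(os.urandom(NUM_FILES * FILE_SIZE))

    # Many small reads: each access pays open() + read() + close().
    t0 = time.perf_counter()
    total_small = 0
    for i in range(NUM_FILES):
        with open(os.path.join(d, f"obj_{i}.bin"), "rb") as f:
            total_small += len(f.read())
    t_small = time.perf_counter() - t0

    # One streaming read of the same number of bytes.
    t0 = time.perf_counter()
    with open(big, "rb") as f:
        total_big = len(f.read())
    t_big = time.perf_counter() - t0

print(f"small files: {total_small} bytes in {t_small:.4f}s")
print(f"streaming:   {total_big} bytes in {t_big:.4f}s")
```

Absolute timings depend on the filesystem and page cache, so no specific numbers are claimed here; the point is that the small-file path adds per-object metadata operations on top of the data transfer, which is why operations per second and access latency, not streaming bandwidth, become the limiting metrics.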
In the second part, we will take a closer look at the storage requirements of the three phases of Deep Learning. In the third and final part, we will then design an optimal storage system for scalable Deep Learning.