Supercomputers have become increasingly important in computational science because they make it possible to run computationally intensive tasks in fields such as weather forecasting, space exploration, genetic research, climate research, oil and gas exploration, molecular modeling, quantum mechanics and physical simulations. Supercomputing labs provide researchers with the computational resources and expertise they need to perform their research at massive scale.
Supercomputing research centers are often funded by government agencies and therefore need to meet high levels of availability to ensure a high ROI for the supercomputer. One way to increase availability is to use a burst buffer for checkpointing. Excelero’s NVMesh enables supercomputing centers to build a petabyte-scale burst buffer that leverages high-performance NVMe.
NVMesh provides an extremely cost-effective way to achieve unheard-of burst buffer bandwidth: commodity flash drives and NVMesh software are added to the compute nodes, and the storage is shared across the existing low-latency network fabric. It provides redundancy without impacting target CPUs. There is no need for additional dedicated hardware or proprietary file system integrations, because storage is provisioned as a simple block device.
Burst Buffer Challenges
- More data: clusters keep growing, and so does the amount of memory per node
- Larger checkpoints take more time to complete
- Longer checkpointing means less computing and lower availability
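The relationship between checkpoint size, write bandwidth and availability can be made concrete with some simple arithmetic. The sketch below is only an illustration: the function names and every number in it (node count, memory per node, bandwidth, checkpoint interval) are hypothetical and are not measurements of NVMesh or of any real system. It also assumes the whole memory of every node is written and that the job does not compute while the checkpoint is in progress.

```python
def checkpoint_seconds(nodes, mem_per_node_gb, bandwidth_gb_s):
    """Time to write one full-memory checkpoint.

    Simplified model: every node dumps all of its memory and the storage
    sustains the given aggregate bandwidth (no metadata or contention cost).
    """
    return nodes * mem_per_node_gb / bandwidth_gb_s


def compute_fraction(interval_s, ckpt_s):
    """Share of wall-clock time spent computing when each compute interval
    is followed by one blocking checkpoint."""
    return interval_s / (interval_s + ckpt_s)


# Hypothetical cluster: 1,000 nodes with 128 GB of memory each,
# checkpointing after every hour of computation.
slow = checkpoint_seconds(1000, 128, bandwidth_gb_s=100)    # 1280 seconds
fast = checkpoint_seconds(1000, 128, bandwidth_gb_s=1000)   # 128 seconds

print(f"slow storage: {compute_fraction(3600, slow):.1%} of time computing")
print(f"fast storage: {compute_fraction(3600, fast):.1%} of time computing")
```

With these made-up figures, a tenfold increase in checkpoint bandwidth raises the share of time spent computing from roughly 74% to roughly 97%. Doubling the memory per node at a fixed bandwidth doubles the checkpoint time, which is the trend the list above describes.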
Excelero Is a Game Changer for Supercomputing
High-performance computing applications consist of complex sets of processes that sometimes run for weeks. If any one of these processes is interrupted, the results of the entire compute job can be lost. The problem becomes worse as supercomputers become larger and more powerful. Parallel computing applications therefore use checkpoint-restart, a technique that allows a compute job to be restarted from the most recently saved checkpoint after an interruption. Checkpoints are typically saved in a shared, parallel file system. But as clusters become larger and the amount of memory per node increases, each individual checkpoint becomes larger and either takes more time to complete or requires a higher-performance file system. While a system is checkpointing, it is not computing, which reduces the availability score of the system. Excelero’s NVMesh drastically shortens those periods of unavailability and enables supercomputing centers to maximize their availability score.
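To illustrate the checkpoint-restart idea described above, here is a minimal single-process sketch. It is a toy, not how any particular HPC application or NVMesh itself implements checkpointing: the function names, the JSON state format and the `fail_at` parameter (used only to simulate an interruption) are all assumptions made for this example. Real parallel jobs coordinate checkpoints across many processes and usually rely on a dedicated checkpointing library.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file, fsync, then rename over the old one,
    so an interruption mid-write never corrupts the previous checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the most recently saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, steps, checkpoint_every, fail_at=None):
    """Toy job: sum the integers 0..steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that raises an error at the given step
    to simulate an interrupted job.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run` is interrupted and then called again with the same checkpoint path, it resumes from the last saved step instead of starting over, so only the work done since that checkpoint is repeated. The time spent inside `save_checkpoint` is time the job is not computing, which is why the write speed of the checkpoint target matters.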
NVMesh Key Benefits
- Petabyte-scale unified pool of high-performance flash that retains the speeds and latencies of directly attached media
- Supports large-scale modeling, simulation, analysis and visualization
- Finish checkpointing faster and get back to running the job
Scale & Performance
- Delivers 99.8% of the local NVMe storage servers’ performance over the network to workstations
- Leverage the full performance of your NVMe flash at any scale, over the network
- Scale your performance and capacity linearly
- Optimize for high IOPS, high bandwidth or mixed workloads
- Supports current and future media workflows
- Maximize the utilization of your NVMe flash devices
- Choose hardware from any server, storage and network vendor
- Easy to manage and monitor, which reduces maintenance effort and TCO
- The block interface facilitates easy integration with existing storage software, such as GPFS or StorNext
- Choice of architecture: converged, disaggregated or mixed
- Mix different storage media types to optimize for cost, scale or performance
- Scale storage and compute separately, as needed