
Ensuring High-Availability for NVMesh Management

By Yaniv Romem · October 26, 2017

In the last few technical posts, we concentrated mostly on the client and target components: I/O initiators and storage servers. I think the time has come to introduce the management component and describe it in depth.

The management component’s roles comprise collecting system inventory, managing volume allocation policy, provisioning volumes, managing their life cycle and monitoring system health. Taking care of the volumes and disks in the system includes tasks such as creating new volumes, moving volume allocations between drives, migrating drives between targets and deleting volumes. The management system provides a GUI that serves as a central place for system administrators to observe the health and performance of the entire NVMesh. It also provides a RESTful API for automated management and integration with other management systems.

Without active management, NVMesh can continue to operate as long as no changes are introduced to the network and storage topology. For example, I/O to existing volumes will still be serviced, but volumes cannot be changed or created. As management is a critical component, we focused on two main issues in our version 1.2 release: availability and scalability. Availability means the system can survive a management node failure and still provide all of the management’s services. For management, scalability means handling any number of client and target nodes in large clusters. The main limiting factor is the number of open connections a single management machine can handle. To achieve both scalability and availability, we decided on an active-active architecture, which lets us strive for near-linear scalability. For example, in a setup with 1,000 target and client nodes, it would be undesirable to rely on a single management node to handle all connections.
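To make the connection-spreading idea concrete, here is a minimal sketch, in TypeScript, of how an endpoint might distribute its single management connection across an active-active cluster and fail over when its node becomes unreachable. The node list, class and method names are hypothetical illustrations, not the actual NVMesh client code.

```typescript
// Hypothetical management endpoints; in a real deployment these would be
// the addresses of the active-active management nodes.
const MANAGEMENT_NODES = ["mgmt-a:4000", "mgmt-b:4000", "mgmt-c:4000"];

class ManagementConnection {
  private index: number;

  constructor(clientId: number) {
    // Hashing the client id over the node list spreads open connections
    // roughly evenly, so no single management node has to hold all
    // 1,000+ client/target connections.
    this.index = clientId % MANAGEMENT_NODES.length;
  }

  get endpoint(): string {
    return MANAGEMENT_NODES[this.index];
  }

  // On failure, move to the next node; any management node can serve
  // the request because the cluster is active-active.
  failover(): string {
    this.index = (this.index + 1) % MANAGEMENT_NODES.length;
    return this.endpoint;
  }
}

// Usage: client 42 initially talks to one node and switches to another
// if that node goes down.
const conn = new ManagementConnection(42);
console.log(conn.endpoint);   // e.g. "mgmt-a:4000"
console.log(conn.failover()); // next node in the list
```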

There are two critical optimizations that we use in the management nodes. The first is maintaining a cache of “hard” calculations that are frequently needed. For instance, the total number of servers in the database and their health is cached, as is the total amount of free, allocated and redundant space currently available for volume allocation. Instead of summing the free space of all the drives each time someone requests this information, we compute it once and update the sum with every change. For example, when a new disk is introduced, we can add its capacity to the free-space counter without recalculating the rest of the disks in the system.
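The following sketch illustrates this incremental-aggregate pattern: one full recalculation at startup, then O(1) adjustments per change instead of an O(drives) scan per query. The interfaces and field names are illustrative assumptions, not NVMesh’s internal schema.

```typescript
interface Drive {
  id: string;
  capacity: number;  // bytes
  allocated: number; // bytes
}

class CapacityCache {
  private freeSpace = 0;

  // Full recalculation happens once, at startup (or after a resync).
  rebuild(drives: Drive[]): void {
    this.freeSpace = drives.reduce(
      (sum, d) => sum + (d.capacity - d.allocated), 0);
  }

  // Incremental updates: O(1) per change instead of O(drives) per query.
  onDriveAdded(d: Drive): void {
    this.freeSpace += d.capacity - d.allocated;
  }

  onDriveRemoved(d: Drive): void {
    this.freeSpace -= d.capacity - d.allocated;
  }

  onAllocation(bytes: number): void {
    this.freeSpace -= bytes;
  }

  // Answering "how much free space?" is now a constant-time read.
  get free(): number {
    return this.freeSpace;
  }
}
```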

The second critical optimization is working in an event-driven model. A consumer can register for changes to a specific portion of the data and will be notified only when that portion changes. For example, if a specific client is attached to a volume and that volume is changed, the client will be notified only about that specific volume. Of course, the management will only send that data to the consumers that registered for that event, i.e. updates for that specific volume. Working in such an event-driven model filters out irrelevant information from being broadcast globally, reducing management-to-endpoint communication.
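Here is a minimal sketch of that per-object registration idea, assuming a simple in-process pub/sub; the real NVMesh implementation is not shown here, and the names are hypothetical.

```typescript
type Listener = (volume: unknown) => void;

class VolumeEventBus {
  private listeners = new Map<string, Set<Listener>>();

  // A client attached to volume `volumeId` registers only for that
  // volume's events.
  subscribe(volumeId: string, fn: Listener): void {
    if (!this.listeners.has(volumeId)) {
      this.listeners.set(volumeId, new Set());
    }
    this.listeners.get(volumeId)!.add(fn);
  }

  // When a volume changes, only the consumers registered on that volume
  // are notified; everyone else is filtered out.
  publish(volumeId: string, volume: unknown): void {
    this.listeners.get(volumeId)?.forEach((fn) => fn(volume));
  }
}

// Usage: only the subscriber to "vol-7" sees updates to "vol-7".
const bus = new VolumeEventBus();
bus.subscribe("vol-7", (v) => console.log("vol-7 changed:", v));
bus.publish("vol-7", { state: "degraded" }); // delivered
bus.publish("vol-9", { state: "healthy" });  // no listeners, dropped
```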

Using an active-active architecture is not problem-free. The caches must be kept in sync between management machines. For example, when one management node detects a new disk, an event should be triggered so that all the other management nodes can update their summarized counters. There may also be a need to generate re-routing events: since we want to maintain the smallest number of connections to any single management machine, it is possible that a specific consumer registers for an event on management A, but the event is triggered on management B.
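A hedged sketch of that cross-management propagation, assuming each management node can message its peers (the transport is omitted): the node that detects a change applies it locally and fans it out, while receivers apply without re-broadcasting, so a consumer registered on node A still hears about an event that originated on node B. All names here are illustrative.

```typescript
interface ClusterEvent {
  type: "disk-added" | "volume-changed";
  payload: unknown;
  origin: string; // which management node produced the event
}

class ManagementNode {
  peers: ManagementNode[] = [];

  constructor(readonly name: string,
              private applyLocally: (e: ClusterEvent) => void) {}

  // Local detection path: apply the change, then fan out to peers so
  // their cached counters and registered consumers stay current.
  raise(e: ClusterEvent): void {
    this.applyLocally(e);
    for (const peer of this.peers) {
      peer.receive(e);
    }
  }

  // Remote path: apply only; the origin already notified everyone,
  // so we do not re-broadcast (this avoids event loops).
  receive(e: ClusterEvent): void {
    this.applyLocally(e);
  }
}

// Usage: B detects a new disk; A's counters get updated too.
const a = new ManagementNode("A", (e) => console.log("A applies", e.type));
const b = new ManagementNode("B", (e) => console.log("B applies", e.type));
a.peers = [b];
b.peers = [a];
b.raise({ type: "disk-added", payload: { capacity: 1e12 }, origin: "B" });
```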

Finally, there is the task of coordinating operations between different management nodes. If two management nodes try to allocate volumes at the same time, we must make sure that both aren’t using the same drive space.
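One common way to enforce this is an optimistic compare-and-set on a per-drive version number; the sketch below illustrates the idea, though the document does not specify NVMesh’s actual coordination mechanism, and the datastore here is a stand-in.

```typescript
interface DriveRecord {
  id: string;
  freeBytes: number;
  version: number; // bumped on every successful update
}

// Stand-in for the shared datastore that all management nodes write to.
const store = new Map<string, DriveRecord>();

// Succeeds only if no concurrent allocation touched the drive between
// the caller's read and this write. In a real system the check-and-write
// would be a single atomic conditional update in the datastore.
function tryAllocate(driveId: string, bytes: number,
                     expectedVersion: number): boolean {
  const rec = store.get(driveId);
  if (!rec || rec.version !== expectedVersion || rec.freeBytes < bytes) {
    return false; // lost the race (or not enough space): re-read and retry
  }
  store.set(driveId, {
    ...rec,
    freeBytes: rec.freeBytes - bytes,
    version: rec.version + 1,
  });
  return true;
}

// Managements A and B both read version 1 and race to allocate:
store.set("drive-3", { id: "drive-3", freeBytes: 100, version: 1 });
console.log(tryAllocate("drive-3", 60, 1)); // true: A wins
console.log(tryAllocate("drive-3", 60, 1)); // false: B must retry
```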

To summarize, implementing N-way active-active clustering of the management nodes, as a scale-out service for managing NVMesh nodes, was the main challenge for the management team in version 1.2 of NVMesh. We hope we have conveyed some of the issues encountered and the techniques used to overcome them.
