My previous post was an attempt to explain to non-techies why the Excelero vision is so ground-breaking. I described how we designed a storage architecture that avoids the storage controller bottleneck problem by using distributed algorithms. It was an interesting exercise albeit one of the most challenging in this job.
For this new post, I want to take a different approach and take a deeper dive into the Excelero technology and our product NVMesh. While I still want to keep things simple, I will have to increase the difficulty level just a notch.
So what is Excelero’s NVMesh?
The NVMesh storage architecture mainly consists of two software modules: clients and targets. A client is a server that runs an application and generates read and write requests. A target is a server with storage, typically flash devices. NVMesh abstracts those flash devices into a single pool, from which “virtual disks” (block volumes) can be allocated.
What’s really cool about NVMesh is that applications consume these “virtual disks” as if they were local storage: the operating system (OS) of the client is tricked into behaving as if each virtual disk is attached locally. This way, virtually any application is compatible with NVMesh.
Another great benefit of this architecture is that it combines both the capacity and the performance of the storage devices. Take, for example, a network with 1,000 flash devices. All these devices can be united into a single huge virtual disk with 1,000 times the capacity and performance of one device. But if a customer wants data protection and stores two copies of their data, that affects the usable capacity and the performance. Reads will still be 1,000 times faster than on one device, but writes will only be 500 times faster (as each data block/file is saved twice). By creating virtual disks, customers partition both capacity and performance.
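To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes perfectly linear scaling and identical devices; the function name `pool_estimate` and the 1 TB device size are made up for illustration and are not part of NVMesh.

```python
def pool_estimate(num_devices, copies=1, device_capacity_tb=1.0):
    """Idealized pool figures, relative to a single device.

    Assumes linear scaling: every device contributes equally, and each
    write must land on `copies` devices while a read needs only one.
    """
    if not 1 <= copies <= num_devices:
        raise ValueError("copies must be between 1 and num_devices")
    return {
        "usable_capacity_tb": num_devices * device_capacity_tb / copies,
        "read_speedup": float(num_devices),
        "write_speedup": num_devices / copies,
    }


unprotected = pool_estimate(1000)           # one copy of the data
mirrored = pool_estimate(1000, copies=2)    # two copies for protection

print("unprotected:", unprotected)
print("mirrored:   ", mirrored)
```

With two copies, the sketch gives a read speedup of 1,000 and a write speedup of 500, matching the example above, and the usable capacity is halved as well.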
So what? Why is this so hard to build?
In a storage infrastructure powered by NVMesh, both clients and targets run on many servers, spread across the network. Some servers can be just targets, some are just clients, others (often most) are both.
If all the modules could see a consistent (identical) structure of the network, creating and managing a pool of disks would be easy. The challenge, however, with using a distributed algorithm over the network is that there is no single network topology. Each server sees a different picture.
Continuing with our example of 1,000 SSD devices, one client may see only a subset of 890 devices, another sees a subset of 970 devices and yet another sees only 531. How can they even agree on what the pool is? How can they define IO routes over the network?
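A small sketch shows why this is awkward. The subset sizes come from the example above, but which device IDs each client sees is an arbitrary assumption made for illustration.

```python
# Device IDs 0..999 stand in for the 1,000 SSDs. The exact ranges below
# are hypothetical; only the subset sizes (890, 970, 531) come from the text.
view_1 = set(range(0, 890))      # 890 devices
view_2 = set(range(30, 1000))    # 970 devices
view_3 = set(range(400, 931))    # 531 devices

seen_by_all = view_1 & view_2 & view_3    # devices every client can see
seen_by_some = view_1 | view_2 | view_3   # devices at least one client can see

print(len(view_1), len(view_2), len(view_3))
print("seen by all clients:", len(seen_by_all))
print("seen by at least one:", len(seen_by_some))
```

In this made-up layout, only 490 devices are visible to all three clients even though every one of the 1,000 devices is visible to someone, so no single client's view can simply be declared "the pool".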
Let me explain how Excelero solves this with a few more examples:
Disagreement of 3 servers in the network: Server A and B cannot communicate. A thinks that server B is dead, because it does not answer A’s request, while B thinks A is dead. Server C may know that they are both alive, but disconnected from each other. Each server has a different point of view on the network (sees a different topology) and those views are inconsistent with each other.
- A declares B as dead and says that all the data on B’s disks is inaccessible and possibly outdated.
- B declares A as dead and invalidates its data. Now we have disagreement in the network.
- One client renames a file stored on A from a.txt to a1.txt, while another client deletes this file on B. So what is the outcome?
This is a very common issue with distributed algorithms. We call this a “split brain” scenario: two or more agents of the distributed algorithm engage in mutually inconsistent behavior. Once split brain occurs, data corruption is almost inevitable. This problem does not occur in systems with a centralized traffic controller because they have a single point of view on the network (but with limited scalability and performance as a result).
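The three-server scenario can be replayed in a few lines. This is only a toy model of the failure, under the assumption that each server applies an update to the replicas it believes are alive; it says nothing about how NVMesh itself works, and all names are hypothetical.

```python
SERVERS = {"A", "B", "C"}
BROKEN_LINKS = {frozenset({"A", "B"})}   # A and B cannot communicate


def can_reach(src, dst):
    return frozenset({src, dst}) not in BROKEN_LINKS


def local_view(server):
    """Servers that `server` believes are alive (those that answer it)."""
    return {p for p in SERVERS if p == server or can_reach(server, p)}


views = {s: local_view(s) for s in sorted(SERVERS)}

# The same directory is replicated on A and on B.
replicas = {"A": {"a.txt"}, "B": {"a.txt"}}


def apply_update(issuing_server, update):
    """Apply `update` only to replicas the issuing server thinks are alive."""
    for holder, files in replicas.items():
        if holder in views[issuing_server]:
            update(files)


def rename_a(files):
    if "a.txt" in files:
        files.remove("a.txt")
        files.add("a1.txt")


def delete_a(files):
    files.discard("a.txt")


apply_update("A", rename_a)   # one client renames the file via A
apply_update("B", delete_a)   # another client deletes it via B

print(views)
print(replicas)
```

After both updates, the replica on A holds `a1.txt` while the replica on B holds nothing, and neither side knows the other disagrees.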
Disagreement of 2 clients about the disks: Client A is connected to both disks (D1 and D2), while client B is disconnected from D2. Both clients want to edit, i.e. read and write, the same document, and both want it to be backed up if possible, i.e. written to both disks, so that the file is not lost if one disk dies.
- Client B inserts a new paragraph into the document and writes it to D1.
- At the same time, client A reads the document from D2 and misses the new paragraph.
This is a severe error, which in general leads to lost files, corrupted documents and other such hazards. The corruption occurred because in distributed algorithms each node has a distinct point of view and there is no consensus on how the network looks. Moreover, because there is no traffic controller, and clients do not communicate with each other directly, no one can detect that such corruption occurred. When a client writes data to a disk, no one except this client knows that the data was written.
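Here is the same two-disk scenario as a minimal sketch. It assumes a naive rule where a client simply writes to whichever disks it can reach; the names `reachable`, `write` and `read` are invented for the illustration.

```python
# Both disks start with the same copy of the document.
disks = {"D1": "intro", "D2": "intro"}

# Client A sees both disks; client B has lost its connection to D2.
reachable = {"A": ["D1", "D2"], "B": ["D1"]}


def write(client, text):
    """Naive write: update every disk this client can reach."""
    for disk in reachable[client]:
        disks[disk] = text
    return list(reachable[client])


def read(client, disk):
    if disk not in reachable[client]:
        raise ConnectionError(f"{client} cannot reach {disk}")
    return disks[disk]


updated = write("B", "intro + new paragraph")   # lands on D1 only
seen_by_a = read("A", "D2")                      # A happens to read D2

print("B updated:", updated)
print("A read:", seen_by_a)
```

Client A gets the old text back without any error, and nothing in the system records that D1 and D2 now differ.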
One may think of using synchronization or a locking mechanism to let clients A and B communicate with each other, so they are aware that they see different pictures. However, this is not a real solution because A and B can disagree about the location of the synchronization mechanism. Much like they see different versions of the same file, they can see different versions of the same synchronization mechanism. In other words, for guaranteed redundancy the synchronization mechanism must itself be distributed, or else it becomes a bottleneck and a single point of failure.
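The locking problem can be sketched the same way. Assume, purely for illustration, that the lock is replicated on two hypothetical servers `L1` and `L2`, and that each client talks to the replica it believes is the live one.

```python
class LockServer:
    """A trivial lock: the first client to ask becomes the holder."""

    def __init__(self, name):
        self.name = name
        self.holder = None

    def try_acquire(self, client):
        if self.holder is None:
            self.holder = client
        return self.holder == client


lock_replicas = {"L1": LockServer("L1"), "L2": LockServer("L2")}

# Each client has its own opinion about where the lock currently lives,
# for example because B believes L1 is dead.
believed_lock_location = {"A": "L1", "B": "L2"}


def acquire(client):
    server = lock_replicas[believed_lock_location[client]]
    return server.try_acquire(client)


a_has_lock = acquire("A")
b_has_lock = acquire("B")

print("A holds the lock:", a_has_lock)
print("B holds the lock:", b_has_lock)
```

Both clients believe they hold the lock at the same time, so the lock protects nothing.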
Excelero’s solution implements smart algorithms to avoid data corruption, split brains, etc., without enlisting a traffic controller or any other centralized mechanism. This is done by running a protocol that all clients cooperate on. Making this protocol robust, especially when disks fail or networks get disrupted, is the hard part.