I highly recommend reading my previous posts (Excelero for non storage folks, and Excelero for storage folks) before proceeding to this post. In those posts, I described how we designed a storage architecture that avoids the storage controller bottleneck problem by using distributed algorithms.
As I explained in the previous posts, the NVMesh storage architecture consists mainly of two software modules: clients and targets. A client is a server that runs an application and generates read and write requests. A target is a server with storage, typically flash devices. NVMesh clients and targets communicate with each other, identify usable resources (SSDs, smart NICs, RAM, etc.) and use those resources to run a distributed storage solution. All flash disks are pooled together, and logical disks (block volumes) are allocated from that pool. Logical disks expose the same API as physical disks (the block storage API, ioctls) and additionally provide data services such as redundancy, striping, thin provisioning, and more.
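To make the "same API as physical disks" point concrete, here is a minimal sketch of how an application could read and write a logical volume through the ordinary block interface. The device path is hypothetical, and a real application would typically open the device with O_DIRECT and use aligned buffers; this is just an illustration of the idea that a logical volume is addressed like any block device.

```python
import os

# Hypothetical device path: an NVMesh logical volume appears to the OS as a
# regular block device, so the standard block API applies unchanged.
DEV = "/dev/hypothetical_volume"

def write_block(fd: int, lba: int, data: bytes, block_size: int = 4096) -> None:
    """Write one logical block at the given logical block address."""
    assert len(data) == block_size
    os.pwrite(fd, data, lba * block_size)

def read_block(fd: int, lba: int, block_size: int = 4096) -> bytes:
    """Read one logical block back from the same address."""
    return os.pread(fd, block_size, lba * block_size)
```

The point is that nothing in this code knows whether `DEV` is a local NVMe drive or a distributed volume striped across many targets.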
Key Challenges in distributed storage
- Algorithms are executed entirely by the clients, and clients do not communicate with each other directly. In a typical distributed algorithm, agents communicate with each other, or at least elect a representative that synchronizes them.
- The control path operates in "suggestion-only" mode toward the data path and never takes it over. Example: a suggestion that a RAID-6 volume has returned from degraded mode. Our architecture strives to optimize the data path, which flows cleanly without interference. The control path runs in the background and suggests changes. Eventually the data path will comply with those suggestions, but not in real time, and each client will do so on its own schedule.
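The suggestion-only relationship can be sketched as a toy model (this is an assumed design for illustration, not actual NVMesh code): the data path always routes I/O using its current map, while control-path suggestions are parked on the side and adopted whenever the client gets around to it.

```python
import threading

class Client:
    """Toy client: the data path reads its current volume map; the control
    path only publishes suggestions, adopted on the client's own schedule."""

    def __init__(self, volume_map):
        self._active_map = volume_map   # used by the data path right now
        self._suggested_map = None      # latest control-path suggestion
        self._lock = threading.Lock()

    def suggest(self, new_map):
        # Control path: record a suggestion; never touch the active map.
        with self._lock:
            self._suggested_map = new_map

    def adopt_suggestion(self):
        # Housekeeping: adopt the suggestion when convenient (e.g. between
        # I/Os). Each client runs this on its own schedule, so clients may
        # temporarily operate with different maps.
        with self._lock:
            if self._suggested_map is not None:
                self._active_map = self._suggested_map
                self._suggested_map = None

    def read(self, lba):
        # Data path: route the I/O using the current map, with no waiting
        # on the control path.
        disk = self._active_map[lba % len(self._active_map)]
        return f"read lba {lba} from {disk}"
```

Note that between `suggest()` and `adopt_suggestion()` the client keeps issuing I/O against the old map, which is exactly the "not in real time" behavior described above.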
Examples of some implications of the distributed model
- It is perfectly valid for a few clients to disagree about which disks are alive or dead. Example: client A thinks that there is only one copy of the data, while client B knows that the data is mirrored across two disks. Both clients coexist and issue IO together.
- Almost every distributed system has a mutual exclusion (locking) mechanism to ensure that a specific set of actions is taken atomically (either all actions are taken or none, as a single transaction). The description of such a mechanism is delivered via the control path, but the lock itself is used in the data path (which does not always comply with the control path). The implication is striking: clients may even disagree about where the locking mechanism resides and how to use it.
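One common way to build such a lock without any client-to-client messaging is an atomic compare-and-swap performed on the target itself (a single atomic network operation on real hardware, e.g. over RDMA). The sketch below is a toy model, not the actual NVMesh protocol; note that if a stale control path pointed two clients at different lock offsets, they would not exclude each other, which is exactly the kind of disagreement described above.

```python
class TargetDisk:
    """Toy target: a byte array plus a compare-and-swap primitive standing
    in for an atomic operation on a real target (assumed model)."""

    def __init__(self, size):
        self.data = bytearray(size)

    def compare_and_swap(self, offset, expected, new):
        # On real hardware this check-and-write is one atomic operation.
        if self.data[offset] == expected:
            self.data[offset] = new
            return True
        return False

def try_lock(disk, offset):
    """Acquire the on-disk lock the control path told us about: flip the
    lock byte 0 -> 1 atomically; failure means another client holds it."""
    return disk.compare_and_swap(offset, 0, 1)

def unlock(disk, offset):
    # Release by clearing the lock byte.
    disk.data[offset] = 0
```

The lock's location (`offset`) is exactly the piece of information that arrives via the control path, so two clients with different control-path views can legitimately disagree about where the lock lives.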
Networking issues vs. Storage issues
I am sure that readers who are familiar with networking raised an eyebrow. Most of the difficulties described above are also present in networking/web solutions, so why reinvent the wheel? Why not integrate existing networking solutions into NVMesh?
It might appear that distributed storage deals mainly with communication and synchronization problems, so reusing the solutions that have been successfully implemented in pure, storage-less networking (distributed video chats, multiplayer games) seems like a good idea.
I will try to give a quick answer: in pure networking, corner-case error handling is typically done via a kind of "reboot" mechanism, and does not really fix the problem. For example, if the code reaches a buggy (inconsistent) state, a packet can simply be dropped, and it will be retransmitted by higher-layer software. A video frame can be skipped; in a multiplayer game, a TCP socket can be closed and reopened; and so on. I am not saying that all problems are solved that way, but in many cases this is a legitimate fallback, especially when overcoming bugs in the code.
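The retransmission fallback can be condensed into a few lines (an illustrative sketch, not any particular protocol stack): on failure, forget the broken state and simply try again.

```python
def send_with_retry(send_once, payload, max_attempts=5):
    """The networking "reboot" pattern in miniature: if a send fails (the
    packet is dropped), retransmit; a higher layer hides the problem."""
    for attempt in range(max_attempts):
        if send_once(payload):
            return attempt + 1  # number of attempts it took
    raise TimeoutError("gave up after retries")
```

The key property is that no state from the failed attempts needs to be repaired; each retry starts clean, which is precisely what storage cannot afford once data has been partially written.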
This approach comes down to an important postulate: "In storage-less networking, when there is no connection, the existing technical problem never grows, and on reconnection it resolves itself."
An example will make this easier to understand: when the connection is lost, users are kicked out of their Skype conference, but they do not lose data. When the session resumes, users are reconnected to the chat. In a different scenario, when the connection to VISA fails, a credit card authorization cannot be processed. At that moment the customer may not be able to purchase a new pair of shoes, but previous purchases are not affected and the bank account will not get corrupted, so the problem does not escalate.
In storage, however, data is expected to be persistent, so the "reboot" approach is not viable. You cannot just drop a paragraph out of a document because two users edited it simultaneously and the algorithm failed to resolve who did what. Moreover, rebooting typically makes the problem even harder to solve.
Let me give another example: your local bank saves the balance of your bank account replicated on 5 disks, each in a different city (so a failure of up to 4 disks, 4 power failures, or 4 floods will still leave the data accessible). You tried to buy a pair of shoes, meanwhile your spouse withdrew money from an ATM, and your mobile phone carrier charged you for a monthly service, all at exactly the same time. A bug in the storage algorithm messed up 3 of those transactions, and now the 5 replicas hold 5 different balances. It would not help to restart the connection, nor to reboot the entire system. The balance is corrupted and will not fix itself. The more time passes, the older the logs get, and each reboot may erase additional traces of transactions from the servers' RAM. A possible solution might be saving restore points and maintaining a journal of financial transactions (so even if the cached balance gets corrupted, we can roll back to a safe point and replay the transactions).
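The restore-point-plus-journal idea can be sketched in a few lines (purely illustrative; a real system would add durability, ordering, and journal truncation):

```python
class Account:
    """Toy write-ahead journal: append each transaction before applying it,
    so a corrupted cached balance can be rebuilt by replay."""

    def __init__(self, opening_balance):
        self.opening_balance = opening_balance  # the restore point
        self.journal = []                       # durable, append-only log
        self.balance = opening_balance          # fast cached copy (may get corrupted)

    def apply(self, amount):
        self.journal.append(amount)  # journal first...
        self.balance += amount       # ...then update the cached balance

    def recover(self):
        # Do not trust the (possibly corrupted) cached balance: roll back
        # to the restore point and replay the journal.
        self.balance = self.opening_balance + sum(self.journal)
        return self.balance
```

Even if a bug scribbles over `balance`, the journal preserves enough history to reconstruct it, which is exactly the escape hatch that a plain "reboot" cannot provide.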
While in a pure networking solution (like a multiplayer game) you are willing to accept that once in a while there will be a short disconnection, in storage you will never accept that once in a while … your money is lost. So in storage, all the corner cases and bugs in the code must be addressed and hot-fixed, while in pure networking they can be solved by a reboot.
Disclaimer: In no way do I imply that storage problems are more difficult than networking issues, just that they are different.
Networking vs. Storage Responsibility
Another difference between storage and networking is responsibility. Typically, each software layer is responsible only for its own bugs.
Let’s use another example: a multiplayer game stops and reports that there is no connection, and you see that the Wi-Fi is off (say, due to a bug in the router firmware). You restart the router and continue playing without blaming the game’s protocols for kicking you out.
Or a different example: a few charged particles from a solar flare create a ripple in a video frame. You just shrug your shoulders.
But storage requirements are different: users expect 100% resiliency. When developing software-defined storage, you must handle bugs in your own software, but a bug in device firmware, cosmic radiation altering an SSD cell, or an error in the network are all considered your bugs as well. The customer expects maximum availability (five nines or more) and does not care who created a problem: all issues accumulate in the same unavailability bucket.
As a final example: you stored an important photo in the cloud and it disappeared. You will blame the provider, right? But what might actually have happened is that your data was written to a disk in winter, while you tried to access it in summer. The difference in room temperature affected the impedance and voltage thresholds of the SSD, the electrical charge was misinterpreted by the disk’s controller, and the result was data loss. As a user, you don’t care about any of that. You will hold your storage provider accountable; they are in charge of handling all these problems with intelligent software that makes backups and uses journals, restore points, and hashing mechanisms to detect errors.
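The hashing mechanism mentioned above can be sketched as checksum-on-write, verify-on-read with fallback to another replica (an illustrative scheme, not any specific product's implementation):

```python
import hashlib

def store(replica, key, data):
    """Store the data together with its SHA-256 digest."""
    replica[key] = (hashlib.sha256(data).digest(), data)

def load(replicas, key):
    """Return data from the first replica whose checksum still matches,
    so a bit-flipped copy is detected and silently skipped."""
    for replica in replicas:
        digest, data = replica[key]
        if hashlib.sha256(data).digest() == digest:
            return data
    raise IOError("all replicas corrupted")
```

With this scheme, a single misread cell in one replica changes its checksum, the bad copy is rejected on read, and the user never learns that winter and summer disagreed about their photo.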