A typical I/O model is one in which clients (a.k.a. hosts or initiators) issue I/O, while targets (the servers holding the physical disks) implement the data services. If you have read our previous blog posts, you already know that Excelero takes a different approach: the full load of data services is shifted to client-side distributed algorithms.
In fact, when a client issues an I/O, it is possible that no other entity in the entire system knows that this I/O occurred. In other words, the CPU core that executes the I/O is the only one aware of it. This decentralized I/O approach completely eliminates some of the fundamental bottlenecks that prevent other architectures from hyper-scaling.
However, this model also imposes a tough constraint (a seemingly fatal contradiction):
1. Detecting and debugging data corruption bugs requires a central brain in the system (logs that describe who did what, how and why).
2. Our system does not have any central brain. An evil client that corrupted the data might crash and lose all of its local logs, effectively nullifying the history.

Consider the following scenario:
- Client C0 writes a valid data block to the volume.
- Client C1 joins the dark side and decides to perform a corrupted I/O. It may write different data blocks to the two legs of a RAID-1 mirror, calculate incorrect parity for a block that is part of an erasure-coded area, or inject a write hole. No other entity knows that this has happened.
- Some time later, client C2 reads the data, detects the discrepancy and starts a rebuild (tries to fix it). Unfortunately, dark forces intervene again and disconnect C2 from the network (or crash it completely) in the middle of the rebuild.
- 42 days later, C0 reads the data and sees a total mess. By now it is probably impossible to reconstruct which client caused the corruption and why; even if some logs exist, they are partial and outdated.
Luckily, we happen to have a solution. There is always one central place, even in a completely decentralized system: the data itself. It is persistently written on the disks, and its ACID I/O model is exactly what we expect from a logged history.
So to debug the system and verify that it works under all odd circumstances, we use the data itself as a log in the following way:
- Our data services have a special debugging mode which is not accessible to the customer, but is activated by our QA during tests.
- In this mode, the actual on-disk blocks are larger than the block size seen by the operating system. For example, on a volume with a 512-byte sector size, each data block is embedded in a 2048-byte on-disk block, leaving 1536 bytes of spare space.
- The rest of the free space inside each block is used to inject internal data describing the state of the client which does the I/O, a summary of the I/O and various other debug info tidbits.
- With careful planning, it is possible to store a history of I/O to each block, and not just the last write. A naive approach would require a read-modify-write to append the history to the metadata, but we have an elegant solution that works with simple writes (no read or modify) with very high probability.
- Data blocks are written asynchronously by all clients, but always atomically with the metadata.
- All clients run a distributed I/O generation pattern which relies on a few basic, yet remarkable, algorithms from linear algebra and field theory. The I/O is pseudo-random, but each client can validate other clients' data without ever communicating with them.
- When a data integrity problem is detected, we might not need any standard dmesg/journalctl records to understand what happened. The data blocks on the SSDs are enough (a kind of “bare flash” debugging).
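To make the enlarged-block idea concrete, here is a minimal Python sketch of such a debug block. The field names, sizes and checksum choice (`client_id`, a generation counter, an opcode and a CRC32) are illustrative assumptions, not NVMesh's actual on-disk format:

```python
import struct
import zlib

LOGICAL_BLOCK = 512    # block size as seen by the operating system
PHYSICAL_BLOCK = 2048  # enlarged on-disk block in debug mode
SPARE = PHYSICAL_BLOCK - LOGICAL_BLOCK  # 1536 bytes for debug metadata

# Hypothetical metadata record: client id, monotonic generation counter,
# and an opcode summarizing the I/O. "<" means little-endian, no padding.
META_HEADER = struct.Struct("<IQB")

def pack_debug_block(data: bytes, client_id: int, generation: int, opcode: int) -> bytes:
    """Embed a logical block plus debug metadata into one on-disk block."""
    assert len(data) == LOGICAL_BLOCK
    header = META_HEADER.pack(client_id, generation, opcode)
    crc = zlib.crc32(data + header)  # guards data and metadata together
    payload = data + header + struct.pack("<I", crc)
    return payload.ljust(PHYSICAL_BLOCK, b"\0")  # pad the rest of the spare space

def unpack_debug_block(block: bytes):
    """Recover the data and metadata, raising if either was corrupted."""
    data = block[:LOGICAL_BLOCK]
    header = block[LOGICAL_BLOCK:LOGICAL_BLOCK + META_HEADER.size]
    (crc,) = struct.unpack_from("<I", block, LOGICAL_BLOCK + META_HEADER.size)
    if zlib.crc32(data + header) != crc:
        raise ValueError("debug metadata corrupted")
    client_id, generation, opcode = META_HEADER.unpack(header)
    return data, client_id, generation, opcode
```

Because data and metadata live in the same on-disk block, a single SSD write lands them atomically, which is exactly the property the list above relies on.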
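The blind-write history trick can be approximated generically (this is a sketch of the general idea, not necessarily Excelero's actual scheme): reserve a fixed number of history slots in the spare area and let each write overwrite one pseudo-randomly chosen slot. With enough slots, recent records rarely collide, so a recent history survives with high probability, using plain writes only:

```python
import hashlib

HISTORY_SLOTS = 16  # number of fixed-size history slots in the spare area

def slot_for(generation: int) -> int:
    """Pick a pseudo-random slot per write; no read of prior contents needed."""
    digest = hashlib.sha256(str(generation).encode()).digest()
    return int.from_bytes(digest[:4], "little") % HISTORY_SLOTS

def record_write(spare: list, generation: int, summary: str) -> None:
    """A blind write: overwrite one slot, never read-modify-write the block."""
    spare[slot_for(generation)] = (generation, summary)

def recent_history(spare: list) -> list:
    """Reconstruct the surviving history, newest first."""
    return sorted((s for s in spare if s is not None), reverse=True)
```

The trade-off is probabilistic: two recent writes may hash to the same slot and the older record is lost, but the expected surviving history is long enough for forensics without paying for a read-modify-write on every I/O.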
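One generic way to get self-validating pseudo-random I/O (a simple hash-based stand-in for the linear-algebra scheme, which the post does not detail) is to derive each block's contents deterministically from its identity, so any client can recompute what any other client should have written without ever talking to it:

```python
import hashlib

BLOCK_SIZE = 512

def expected_block(client_id: int, block_index: int, generation: int) -> bytes:
    """Deterministically derive a block's expected contents from its identity.

    Any client that knows (client_id, block_index, generation) can recompute
    the block locally, without communicating with the writer.
    """
    out = b""
    counter = 0
    while len(out) < BLOCK_SIZE:
        seed = f"{client_id}:{block_index}:{generation}:{counter}".encode()
        out += hashlib.sha256(seed).digest()  # 32 bytes per round
        counter += 1
    return out[:BLOCK_SIZE]

def validate(block: bytes, client_id: int, block_index: int, generation: int) -> bool:
    """Check a block read from disk against its recomputed expected contents."""
    return block == expected_block(client_id, block_index, generation)
```

Any mismatch immediately flags a corrupted write, and combined with the metadata embedded in the spare space, it identifies which client's I/O went wrong.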
Now let us return to our earlier example of C0 reading a total mess instead of the expected data. C0 can now reconstruct the history from the spare space inside each block, and deduce that C2 was not to blame for the mess; rather, C1 is the root of all evil, because it issued I/O before getting permission to do so (its internal state was invalid for that I/O).
In conclusion: our approach of injecting the history of the data into the data itself allows us to debug a completely distributed algorithm almost as if it were centralized. The data supplies us with a free ACID logging system.
Important note: the mechanism described here is present only in special debug builds, designed to analyze corruptions quickly during the development phase. When deploying NVMesh at customer sites we use a different set of diagnostic tools, dictated by the nature of performance-optimized production code (a great subject for a future post on this blog). Those production diagnostic logs and data-history records supply only a fraction of the information gathered by a debug build, so debugging customer issues is still possible, but it is more difficult and time-consuming, and less suitable for agile R&D development cycles.
By: Daniel Herman-Shmulyan, Data Service Team Leader (LinkedIn)