Excelero Engineering

The Advantage of Simple Replication with NVMe Disks

December 4, 2017

Any IT data professional knows what causes them the greatest grief. It’s the loss of customer data. The customer is mad, your management is mad, and it’s “your” fault.

The first-level solution to protect against data loss is replication. This doesn’t solve the problem – it merely makes it less likely. This is because after one disk fails, the second disk can fail before the replication is restored. This paper focuses on the independent failures of two disks in the same replication set (the terms ‘replication set’ and ‘mirror set’ can be used interchangeably).

RAID (Redundant Array of [Inexpensive, Independent] Disks) is a common industry term. Research at Berkeley (https://www2.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf) in 1987 defined a number of terms. The ones used in this paper are:

  • RAID 0: Striping the data across multiple disks. A number of blocks are stored on the first disk, then a number on the second, and so on until the cycle repeats starting with the first disk again. There is no redundancy, and the failure of any disk in the RAID 0 set results in the loss of data.
  • RAID 1: The replication of data between two disks. Each disk contains a full copy of the data.
  • RAID 10: Not in the original Berkeley paper, but an industry-standard term. A combination of RAID 1 and RAID 0. Striping the data across multiple RAID 1 sets.
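The RAID 10 layout described above can be sketched as a small address-mapping function. The chunk size and round-robin placement here are hypothetical choices for illustration, not any particular product’s layout:

```python
def raid10_locate(block, num_pairs, chunk_blocks=8):
    """Map a logical block number to (mirror pair, per-disk offset).

    Hypothetical layout: 'chunk_blocks' consecutive blocks go to pair 0,
    the next chunk to pair 1, and so on (RAID 0 striping across RAID 1
    pairs). RAID 1 means the same offset is written to both disks of
    the chosen pair.
    """
    chunk = block // chunk_blocks        # which chunk of the volume
    pair = chunk % num_pairs             # RAID 0: round-robin over pairs
    stripe = chunk // num_pairs          # full stripes preceding this chunk
    offset = stripe * chunk_blocks + block % chunk_blocks
    return pair, offset

# With 4 pairs and 8-block chunks: blocks 0-7 land on pair 0,
# blocks 8-15 on pair 1, and so on.
print(raid10_locate(0, 4))    # (0, 0)
print(raid10_locate(8, 4))    # (1, 0)
print(raid10_locate(33, 4))   # (0, 9)
```

A pure RAID 1 set is the degenerate case `num_pairs=1`, where every block maps to the single pair.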

Each RAID 1 or RAID 10 replication set is presented to hosts as a single block storage volume. It looks like a regular disk to the hosts.

In the market today are two interesting approaches:

  1. Use one pair of disks (RAID 1), or, if a larger host volume is necessary, stripe across several pairs of disks (RAID 10), to create a replication set.
  2. Use a collection of N disks, and stripe N/2 replication sets across all the disks. Each disk has a small unused space that is used to quickly restore replication when a disk fails.  Every RAID 1 replication set has blocks on every disk in the collection of N disks. This approach has a number of vendor-specific terms. This paper will refer to it generically as a collection of disks, or a collection.
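Approach (2) can be sketched by assigning each chunk of a mirror set to a different disk pair, so that the set eventually touches every disk in the collection. Real products use cleverer, vendor-specific placement; cycling through all possible pairs is just an illustrative stand-in:

```python
import itertools
from collections import Counter

def declustered_layout(num_disks, num_chunks):
    """Hypothetical declustering: assign each chunk of one mirror set to
    a disk pair, cycling through every possible pair so the set spreads
    over the whole collection."""
    pairs = itertools.cycle(itertools.combinations(range(num_disks), 2))
    return [next(pairs) for _ in range(num_chunks)]

# One mirror set of 30 chunks on a 6-disk collection:
layout = declustered_layout(6, 30)
used = Counter(disk for pair in layout for disk in pair)
print(sorted(used))   # [0, 1, 2, 3, 4, 5] -- the set has blocks on every disk
```

This is the property the paper relies on later: with this kind of layout, every RAID 1 set has blocks on every disk in the collection.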

Approach (1) is the “traditional” method. It has the simplest algorithms, requiring a write to both disks and a read from either disk. In the event of a disk failure, the data of the surviving disk is copied to a replacement disk, to restore the redundancy of the RAID 1 set. This is the approach that Excelero has chosen to take with the new NVMe SSDs.

Approach (2) was developed to deal with the latency issue of rotating magnetic disks. Multiple hosts performing I/O operations on the collection will access all of the drives, and the I/O latency is spread across all the drives, instead of just two drives. Even with a small set of blocks or a host volume seeing a high rate of I/O (a “hot spot”), the effects are spread out over the collection.

With the advent of Solid State Disks (SSDs), the need for distributing hot spots is reduced. SSDs do not require mechanical motion of a disk head or a platter rotation to access data, making the I/O latency predictable and low. NVMe disks that are directly connected to the server-internal PCIe bus have a faster transfer rate and lower latency than early SSDs, further reducing the need for using collections.

If the capacity of 50 disks is needed, approach (1) pairs 100 disks into 50 pairs of two disks each. If a single drive fails, only one RAID 1 set is affected, and the other 49 sets are still redundant. Recovering data on the affected set requires copying the surviving disk to a replacement disk.

For approach (2), 100 disks are used in the collection, and each of the 50 RAID 1 sets is striped across all 100 drives. If a single drive fails, all RAID 1 sets are affected. Recovering redundancy does not require a replacement disk; the redundancy is restored by reading from a disk and writing to the unused space on another disk. The restoration requires reading from the 99 surviving drives, and writing to the same 99 drives. This restore runs 99 times faster than approach (1), in the absence of other limits.
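The rebuild speedup can be illustrated with back-of-the-envelope arithmetic. The disk size and per-disk copy rate below are made-up round numbers, not measurements:

```python
def rebuild_hours(disk_tb, copy_gbps, parallel_streams=1):
    """Approximate time to restore redundancy, ignoring other limits.

    Approach (1): one surviving disk copies to one replacement
    (parallel_streams=1). Approach (2): the same volume of data is
    read and rewritten across the surviving disks in parallel.
    """
    seconds = disk_tb * 1000 / copy_gbps / parallel_streams
    return seconds / 3600

# Illustrative: an 8 TB disk at 1 GB/s per-disk copy rate.
print(f"approach (1): {rebuild_hours(8, 1.0):.2f} h")                      # ~2.22 h
print(f"approach (2): {rebuild_hours(8, 1.0, parallel_streams=99):.4f} h") # ~0.0224 h
```

The factor of 99 between the two results is exactly the "99 times faster" claim above, and it holds only while no shared limit (network, interface) caps the parallel streams.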

If a single disk fails in approach (1), only the failure of the surviving disk in the RAID 1 set would cause the loss of data. Failure of any of the other 98 disks before recovery is complete would cause another RAID 1 set to lose redundancy, but no data would be lost. The chance of data loss is (<failure rate of a disk>) * (<recovery time of a disk>). Only one in 99 disk failures would cause data loss.

If a single disk fails in approach (2), every RAID 1 set would have parts where there is only one copy. Every set is affected. Failure of any of the surviving disks causes data loss. The chance of data loss is (<failure rate of a disk> * 99) * (<recovery time of a disk> / 99). Loss of any disk would cause data loss, but the recovery is 99 times faster.

In the case of completely independent disk failures, the chance of data loss is the same between approach (1) and approach (2). The key phrases in reaching this conclusion are “in the absence of other limits” and “In the case of completely independent disk failures”.
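The cancellation argued above can be checked numerically. The failure rate and rebuild time below are illustrative placeholders, not vendor figures:

```python
import math

failure_rate = 1e-5   # per-disk failures per hour (illustrative)
base_rebuild = 10.0   # hours to copy one whole disk, approach (1)
survivors = 99

# Approach (1): only the mirror partner is critical during the rebuild.
p1 = failure_rate * base_rebuild

# Approach (2): all 99 survivors are critical, but the rebuild window
# is 99 times shorter.
p2 = (failure_rate * survivors) * (base_rebuild / survivors)

print(math.isclose(p1, p2))   # True: the two factors of 99 cancel
```

The factor of 99 appears once in the numerator (more critical disks) and once in the denominator (shorter exposure window), so under fully independent failures the two approaches tie.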

Such limits include disk interface speed, network speed, and similar shared bottlenecks. Reaching any one of them while rebuilding a collection at 99 times speed makes data loss more likely with approach (2) than with approach (1).

There are other things besides disks that can fail: the software doing the mirroring, the server containing the disks, the connections between servers, and so on. The disks of a pair, or of a larger data domain, need to be placed far enough apart that a single failure of one of these other components does not cause data loss. When one of these components fails, it will likely take more than one disk offline, for some amount of time. In our example of 100 disks, with two disks per server, approach (1) would require two servers to fail, containing the same two members of a RAID 1 set, before access to the data is lost. In approach (2), a single server failure will very likely cause the collection to lose access to data, because every RAID 1 set has blocks on every disk. The probability of complete data loss or availability loss is therefore much higher for approach (2) than for approach (1), at a ratio depending upon the failure rates of the common components (e.g. servers).
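The server-failure asymmetry can be made concrete with a small model. The placements below are hypothetical (a fully declustered collection versus pairs whose members sit on different servers), chosen only to illustrate the argument:

```python
import itertools

def collection_loses_data(num_disks, failed_disks):
    """Approach (2), fully declustered: assume the mirror set's chunks
    cover every possible disk pair. Losing both disks of one server then
    loses both copies of whichever chunk lived only on that pair."""
    failed = set(failed_disks)
    return any(set(pair) <= failed
               for pair in itertools.combinations(range(num_disks), 2))

def paired_loses_data(num_pairs, failed_disks):
    """Approach (1), hypothetical placement: pair i mirrors disks i and
    num_pairs + i, kept on different servers, so one server failure
    never removes both members of the same pair."""
    failed = set(failed_disks)
    return any({i, num_pairs + i} <= failed for i in range(num_pairs))

# One server, holding disks 0 and 1, fails:
print(collection_loses_data(100, [0, 1]))   # True
print(paired_loses_data(50, [0, 1]))        # False
```

In the declustered model some chunk is mirrored only on the two failed disks, so the whole collection loses data; with server-aware pairing, each affected pair still has a surviving member.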

The conclusion to be drawn from this paper is that approach (1) (Excelero’s approach) has superior data-loss and data-availability characteristics compared to approach (2).

About the author:

Jim is in the CTO’s office at Excelero. He is responsible for researching possible future features for NVMesh, investigating issues, and coding. Jim is a storage veteran from DDN where he was a Storage Architect, developing features for various storage products. He works from his home in Colorado Springs, Colorado, USA.