[Ed. note: Originally written by Jesse Off, April 19, 2012]
Intro to DoubleStore
DoubleStore is a software layer unique to Technologic Systems that provides RAID-like filesystem redundancy on SD flash media. It is currently implemented on the TS-7520 bootrom and Linux SD card drivers. Using DoubleStore, robust file storage can be achieved either with two separate cards or on a single card by splitting its capacity in half. Furthermore, TS has put the ability to boot from a DoubleStore SD card in the low-level TS-BOOTROM startup firmware, thereby giving the benefit of a fault-tolerant bootup of the Linux kernel and initrd on TS boards.
In order to realize the goal of a highly reliable embedded system flash data store implemented on top of SD media, it is important to recognize some facts about SD media that TS has encountered in the years since implementing its first FPGA SD card controller logic:
- Unlike hard drives suffering head crashes, SD cards rarely fail as a whole; instead, they exhibit bit corruption or lack of programmability on a sector-by-sector basis.
- Even though there are CRC checks on data transfers to and from SD cards, and the SD card is able to report that a requested read or write has failed, we have rarely found an SD card that properly reports media failures to the controller hardware. Most instead prefer to report every operation as a success.
- The commoditization and miniaturization of flash over the years has reduced reliability in general and allowed for not only sub-standard SD card designs but also counterfeits.
- Unlike previous generations of CompactFlash type devices, the available Industrial Grade SD cards are few and very expensive, and some of those tested even have internal controller bugs worse than their lower grade and cheaper counterparts (presumably because of their low production volume).
- SD manufacturers do die shrinks and change the underlying low-level flash chips and internal controllers without notice and without changing any outwardly apparent packaging. As a result, previous success with a card brand is no guarantee of future success.
Strategies adopted in high-end server devices to deal with hard drive failure presume certain types of failure modes not common on SD media. In these systems, a variety of RAID schemes (RAID1, RAID4, RAID5) can recreate data when one device fails, with "failure" defined as a controller command timeout or an IO error response from a device that has detected its own mechanical failure or sector corruption. This approach does not work with SD media because the card's controller often does not cooperate in detecting its own failure modes and will simply return corrupt data or fail to perform a write that it reported as successful.
High-end server installations using disk drives are not immune to this type of silent corruption either. However, the probabilities of these types of undetectable corruptions are so low compared to the inevitable failure of spinning media that they are largely ignored. Indeed, very few operating systems (Linux included) even have the ability to detect, let alone correct, silent bit corruptions returned by the low-level block storage device. To properly handle this requires a very significant paradigm shift in filesystem design that would require not only checksums on both data and metadata, but also a way to invoke recovery mechanisms to fix and scrub any corrupted sectors detected.
At the time of this writing, only two filesystem designs are even close to being able to deal with this, and neither of them is appropriate on an embedded system. One of them is the experimental Linux filesystem "BTRFS". This filesystem is inappropriate not only because of its still experimental status, but also because it does not checksum the entirety of the filesystem metadata and has no automatic mechanism to recover data from even the failed checksums it does keep on data sectors. The other filesystem is Oracle/Sun's "ZFS", which exists as a native filesystem on the Solaris OS but is not well supported in Linux. Also, although ZFS has all the necessary features (data and metadata checksumming with redundant copies), it is very heavyweight and inappropriate for a RAM and CPU limited embedded system.
How it works:
Rather than design a new filesystem, TS has created a layer on top of the SD block device called DoubleStore. DoubleStore solves both the detection and the correction of silent data corruption on SD media using CRCs, and allows tried and tested filesystem architectures such as EXT2 to be placed on top of it.
To handle the CRC storage, the block device stores 4.5 KB for every 4 KB of real data. After every 512 bytes of data in this 4.5 KB block are 64 bytes of out-of-band (OOB) data where a CRC32, sector number, and other DoubleStore metadata are stored. The CRC32 can be used to detect bit corruption, and the sector number is used to detect SD failure modes where correct data is written to the wrong sector. When a corruption is detected, a copy of the sector is retrieved either from a different spot on the same SD card or, on hardware platforms with multiple SD card slots, from a separate SD card altogether. After recovering the data from the fallback sector, the correct data is rewritten back to the original sectors. TS has found that several SD failure modes are transient, and rewriting/scrubbing the sector often permanently corrects it.
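The detection and recovery steps described above can be sketched in a few lines of Python. This is only an illustration, not the DoubleStore implementation: the OOB field order and sizes, the function names, and the use of dictionaries to stand in for SD media are all assumptions made for the example. Only the 512-byte data size, the 64-byte OOB size, and the presence of a CRC32 and a sector number come from the text.

```python
import struct
import zlib

DATA_SIZE = 512  # payload bytes per sector (from the text)
OOB_SIZE = 64    # out-of-band bytes per sector (from the text)

# Hypothetical OOB layout: CRC32 (4 bytes), sector number (4 bytes), then
# 56 bytes of zero padding standing in for the other DoubleStore metadata.
OOB_FORMAT = "<II56x"


def pack_sector(sector_no, data):
    """Return the 576-byte on-media record for one 512-byte sector."""
    if len(data) != DATA_SIZE:
        raise ValueError("sector payload must be exactly 512 bytes")
    return data + struct.pack(OOB_FORMAT, zlib.crc32(data), sector_no)


def check_sector(expected_sector_no, record):
    """True if the record holds intact data that belongs at this sector."""
    data, oob = record[:DATA_SIZE], record[DATA_SIZE:]
    crc, sector_no = struct.unpack(OOB_FORMAT, oob)
    return crc == zlib.crc32(data) and sector_no == expected_sector_no


def read_with_fallback(sector_no, primary, fallback):
    """Read one sector, scrubbing the primary from the fallback if needed."""
    record = primary[sector_no]
    if check_sector(sector_no, record):
        return record[:DATA_SIZE]
    backup = fallback[sector_no]
    if not check_sector(sector_no, backup):
        raise IOError("sector %d is corrupt in both copies" % sector_no)
    primary[sector_no] = backup  # scrub: rewrite the known-good copy
    return backup[:DATA_SIZE]
```

The sector number check is what catches a misdirected write: a record whose CRC32 is perfectly valid still fails `check_sector` if it was written to the wrong location.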
It is worth noting that the DoubleStore storage scheme significantly reduces write speed and cuts usable SD card capacity to about 44% of the raw capacity, since data plus CRC is written twice. Read speed is also reduced to about 89% of the maximum. However, this is often of little consequence, as SD media is very inexpensive and good embedded applications tend to limit flash writes anyway to prevent wearout. In an embedded system, it is usually much more important that data is safe from corruption.
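These percentages follow directly from the 512 + 64 byte sector layout and the two stored copies. The short calculation below reproduces them. It assumes that the number of bytes moved is the only cost, which ignores any per-command latency of a real card.

```python
from fractions import Fraction

DATA_BYTES = 512  # payload per sector (from the text)
OOB_BYTES = 64    # out-of-band metadata per sector (from the text)
COPIES = 2        # primary plus fallback

stored_per_sector = DATA_BYTES + OOB_BYTES                  # 576 bytes on media
read_efficiency = Fraction(DATA_BYTES, stored_per_sector)   # 8/9
capacity_fraction = read_efficiency / COPIES                # 4/9
write_amplification = 1 / capacity_fraction                 # 9/4

print(f"read efficiency:     {float(read_efficiency):.1%}")        # 88.9%
print(f"usable capacity:     {float(capacity_fraction):.1%}")      # 44.4%
print(f"write amplification: {float(write_amplification):.2f}x")   # 2.25x
```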
Every successful bootup starts a background resilvering thread in sdctl. This process checks every sector and confirms that the failover sectors are also good and up to date. After any unsafe shutdown, it is possible that a sector write completed on the primary SD card but not on the failover card. This is likely benign, since the primary is consulted first to satisfy read requests. However, if a corruption occurs at some point in the future and a sector from the failover is called upon, the recovered data will be obsolete. This must be avoided, and the resilvering process is the solution.
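A minimal sketch of such a resilvering pass is shown below. It is not the sdctl code: the `(data, crc)` record tuples, the list-based "cards", the function names, and the rule that a valid primary always overrides a differing failover copy are assumptions chosen to mirror the description above.

```python
import zlib


def make_record(data):
    """Pair a payload with its CRC32, standing in for data plus OOB."""
    return (data, zlib.crc32(data))


def is_valid(record):
    data, crc = record
    return zlib.crc32(data) == crc


def resilver(primary, fallback):
    """Audit every sector and bring both copies back into agreement."""
    stats = {"fallback_rewritten": 0, "primary_rewritten": 0, "unrecoverable": 0}
    for n in range(len(primary)):
        primary_ok = is_valid(primary[n])
        fallback_ok = is_valid(fallback[n])
        if primary_ok:
            # The primary is authoritative; a bad or stale failover copy
            # is overwritten so a later recovery never returns old data.
            if not fallback_ok or fallback[n] != primary[n]:
                fallback[n] = primary[n]
                stats["fallback_rewritten"] += 1
        elif fallback_ok:
            primary[n] = fallback[n]  # scrub the primary from the failover
            stats["primary_rewritten"] += 1
        else:
            stats["unrecoverable"] += 1  # both copies failed their CRC
    return stats
```

In this sketch a sector whose failover copy is valid but different from the primary is treated as stale, which corresponds to the unsafe-shutdown case where only the primary write completed.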
The other benefit of automatic resilvering on bootup is to detect any failed or failing cards as quickly as possible. With the CRC mechanism, it is possible to identify sectors which have been clobbered during power-down as a result of an incomplete or botched internal SD flash erase/write or wear-level operation. When this happens, the damage is not guaranteed to be limited to the sectors being rewritten at the time of the unsafe shutdown. This violates several assumptions made by filesystem designers, so the entire card becomes suspect despite the best efforts of a modern filesystem. In some ways, the optimizations these modern hard drive filesystems use to skip lengthy reverification of on-disk data structures actually allow serious corruptions to go undetected. Where old filesystems exhaustively, and slowly, check every important data structure, the new journaling filesystems check only a small subset in order to speed up boot time.
Although little can be done about the nature of SD, the resilvering process can be used to detect some of the telltale problem signs of an imperfect SD card controller, notify an operator, and attempt recovery to improve the chances of a successful fsck. When performing resilvering, certain types of detected failures result in permanently tainting the SD card. A "tainted" SD card will blink its red LED and also report this status in "sdctl --status".
Since the resilvering process visits and audits every sector on the SD card on each bootup, it can discover the highest write sequence number stored in the DoubleStore metadata. This write sequence number is a 64-bit value that is incremented and stored in each sector as it is written. This effectively allows the media to keep a "running total" of the number of flash writes to the entire card across reboots and power failures. Knowing this number gives insight into the age of an SD card and how close it may be to reaching the end of its service life in rewrite cycles. This feature is also useful to the embedded system designer, since the real write load is not always apparent when working at a very high level with Linux and Linux services.
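The idea of recovering a running write total from per-sector stamps can be illustrated as follows. The list of stamps standing in for the OOB field, the function names, and the convention that a never-written sector holds 0 are assumptions for the example, not details of the DoubleStore format.

```python
def stamp_write(stamps, sector_no, last_seq):
    """Record one sector write and return the new global sequence number."""
    new_seq = last_seq + 1
    stamps[sector_no] = new_seq  # stored alongside the sector data
    return new_seq


def highest_sequence(stamps):
    """Recover the running write total by scanning every sector's stamp."""
    return max(stamps, default=0)


# Five writes spread over a four-sector "card".
stamps = [0, 0, 0, 0]
seq = 0
for sector_no in (0, 1, 1, 3, 1):
    seq = stamp_write(stamps, sector_no, seq)

# After a reboot nothing is held in RAM; a full scan restores the counter.
print(highest_sequence(stamps))  # 5
```

Because the counter lives on the media itself, no separate bookkeeping file is needed, and the total survives power failures as long as at least one recently written sector remains readable.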
TS boards that make use of DoubleStore by default still retain the ability to use standard formatted SD cards. DoubleStore cards are detected automatically.
Advantages:
- Redundancy: data stored using DoubleStore is stored twice. Should a CRC failure be detected, the fallback store is consulted and restored automatically.
- Ultra-reliable bootup: DoubleStore isn't just used for the filesystem; we also use it to store a kernel and initial ramdisk image.
- File-system flexibility: since DoubleStore operates at the block level, any filesystem supported by the kernel can be used.
- Diagnostics: DoubleStore keeps a count of all data written to the card so you know when to expect your flash to wear out. Also, if DoubleStore catches a card behaving strangely or experiencing silent data corruption, it will blink an LED on the board letting an operator know a card should be replaced.
- Automatic Health Checks: every bootup starts a low priority background verification of all sectors on the card so errors can be discovered and fixed before being required by applications.
- Self-healing: Any time a sector is found corrupt and successfully restored, it will be automatically rewritten or "scrubbed". Also, if a new, blank replacement fallback SD is inserted, it will automatically be rebuilt from the primary.
Disadvantages:
- Reduced write speed performance: Everything must be written twice, along with additional metadata containing CRCs, sequence numbers, etc., so write performance is a little worse than 2.25x slower (two copies of 4.5 KB for every 4 KB of data).
- Storage capacity: Using DoubleStore requires 2.25x the capacity to store data because of the additional overhead of fallback data storage and CRCs.
- Resilvering CPU usage: On every bootup the board will likely be busy for up to half an hour as it verifies every sector of data. This is done in the background at the lowest priority possible, but it may still affect the startup performance of some applications.