We tested a sample of 40 4GB microSD cards from SanDisk to failure. They ran constantly, writing 24/7 as fast as the cards would write.
[Image: 39 TS-7553 Ready to run!]

Our standard stress test uses our DoubleStore layer (described here) and splits the disk in half. Pseudo-random data is written contiguously, one megabyte or so at a time, to the first half and then again to the second half. The pseudo-random data is written together with both a CRC32 and its sector number. After the entire disk has been written, the data is read back: its CRC32 is checked, its sector address is verified, and the copy on the first half of the disk is verified to match the copy on the second half bit for bit, in a process known by DoubleStore as "resilvering". If any check trips, the test fails. Note that this test is run at the block layer with no filesystem present, and in the case of the TS-7552 boards, Linux drivers and kernel layers are also not involved: the test application talks to the card hardware directly. There is no termination condition for the test -- it runs forever until failure or wear-out.
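
The write and verify passes described above can be sketched roughly as follows. This is only an illustration, not the actual DoubleStore or sdctl code: the 512-byte sector layout, the function names, the in-memory `bytearray` standing in for the card, and the choice to store the sector number relative to the start of each half (so that both copies can match bit for bit) are all assumptions.

```python
import random
import struct
import zlib

SECTOR = 512           # assumed sector size
PAYLOAD = SECTOR - 8   # leaves room for a 4-byte sector number and a 4-byte CRC32


def make_sector(sector_no: int, rng: random.Random) -> bytes:
    """Build one sector: pseudo-random payload + sector number + CRC32."""
    payload = bytes(rng.getrandbits(8) for _ in range(PAYLOAD))
    body = payload + struct.pack("<I", sector_no)
    return body + struct.pack("<I", zlib.crc32(body))


def write_pass(disk: bytearray, seed: int) -> None:
    """Write the same pseudo-random stream to the first half, then the second."""
    half = len(disk) // SECTOR // 2
    for base in (0, half):
        rng = random.Random(seed)  # same seed, so both halves get identical data
        for n in range(half):
            start = (base + n) * SECTOR
            disk[start:start + SECTOR] = make_sector(n, rng)


def verify_pass(disk: bytearray) -> bool:
    """Check CRC32, sector number, and that both halves match bit for bit."""
    half = len(disk) // SECTOR // 2
    for n in range(half):
        first = bytes(disk[n * SECTOR:(n + 1) * SECTOR])
        second = bytes(disk[(half + n) * SECTOR:(half + n + 1) * SECTOR])
        body, stored_crc = first[:-4], struct.unpack("<I", first[-4:])[0]
        if zlib.crc32(body) != stored_crc:
            return False                      # payload corrupted
        if struct.unpack("<I", body[-4:])[0] != n:
            return False                      # data landed at the wrong address
        if first != second:
            return False                      # the two copies disagree
    return True


if __name__ == "__main__":
    disk = bytearray(SECTOR * 64)  # tiny stand-in for a real card
    write_pass(disk, seed=1)
    print("clean disk verifies:", verify_pass(disk))
    disk[-1] ^= 0xFF               # simulate a single corrupted byte
    print("corrupted disk verifies:", verify_pass(disk))
```

In the real test this loop would simply repeat with fresh data until a check trips.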


An interesting way to look at wear-out is to look at CPU utilization. These boards spent months of CPU time writing data as fast and contiguously as possible, and the "sdctl" process has all of that time attributed to it in "ps". Now, these little 200MHz embedded boards can't write data as fast as modern SD cards could accept it, but assuming one has created an embedded app that is viable performance-wise, you could extrapolate the MTBF (mean time between failures) of these boards by running your app through its paces and watching CPU utilization. In some ways a slow processor helps SD cards last longer, because it can't write data as quickly and Linux has to rely more on the RAM buffer cache.
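
As a rough illustration of that extrapolation, the arithmetic might look like the sketch below. It assumes that card wear scales linearly with the CPU time attributed to "sdctl", which is only a crude proxy, and the function name and the numbers are hypothetical rather than measured results.

```python
def estimated_days_to_wearout(stress_cpu_days: float, app_cpu_fraction: float) -> float:
    """Scale the CPU time the stress test accumulated before failure by the
    fraction of CPU time sdctl uses under your application's workload.

    stress_cpu_days:  CPU-days attributed to sdctl when the card failed
    app_cpu_fraction: sdctl CPU utilization under your app, in (0, 1]
    """
    if not 0 < app_cpu_fraction <= 1:
        raise ValueError("app_cpu_fraction must be in (0, 1]")
    return stress_cpu_days / app_cpu_fraction


if __name__ == "__main__":
    # Hypothetical inputs: 90 CPU-days to failure, sdctl at 12.5% under the app.
    print(estimated_days_to_wearout(90, 0.125), "days")
```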


One thing about these SD card tests is that many times the root cause of the corruption seems to be a unique interaction of factors. In the past, we've had SD cards that failed in our SD controller hardware but not in a USB to SD adapter, and vice versa. Some testing on Transcend Industrial Grade cards done a couple of years ago showed exactly that: the corruptions could not be reproduced on a USB to SD reader. We speculated about and tested as many things as we could think of that might be different, even though we knew we were following the SD card specification. The SD specification is large and complicated, and since I'm going to be showing this thread to an engineer from ATP, I thought I'd mention some of the ways our SD protocol exchanges may differ from those of a typical USB-SD reader, as an aid for any of the SD card manufacturers who care about reliability and SD spec conformance:

  1. We make extensive use of the allowance in the spec for SD clock gating. Most of the time when the card is left idle by the OS, it is actually just "parked" with the clock stopped in the middle of an SD READ_MULTIPLE or WRITE_MULTIPLE command, waiting for further requests from the OS. When there is a request for a noncontiguous sector, we terminate the READ/WRITE_MULTIPLE and send an ABORT command followed by another READ/WRITE_MULTIPLE to the new address.
  2. The max speed the SD clock runs at is much lower than 50MHz for high speed cards and 25MHz for low speed. On the TS-7552, it is 37.5MHz and 18.75MHz. We meet spec-required setup and hold times with plenty of margin, but maybe not as much margin as a dedicated ASIC since our logic is FPGA based.
  3. The clocking out of the SD card command and argument is done in 32-bit "squirts" as CPU bus cycles are written.
  4. The spec requires 8 clocks after a command or response. We give exactly 8 and stop the clock. No more, no less. Many USB-SD readers leave the clock on all the time and effectively give thousands of extra idle clocks beyond what is called for in the spec.
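
The "parked" multi-block behavior from item 1 can be modeled as a small state machine. This is a toy model only, not the FPGA or sdctl logic: the `ParkedCard` class and its log strings are hypothetical, and it assumes the ABORT mentioned above corresponds to CMD12 (STOP_TRANSMISSION), with CMD18 and CMD25 being the SD READ_MULTIPLE_BLOCK and WRITE_MULTIPLE_BLOCK commands.

```python
from dataclasses import dataclass, field
from typing import List, Optional

COMMANDS = {
    "read": "CMD18 READ_MULTIPLE_BLOCK",
    "write": "CMD25 WRITE_MULTIPLE_BLOCK",
}


@dataclass
class ParkedCard:
    """Toy model of a card left parked mid-transfer with the clock stopped."""
    op: Optional[str] = None   # open transfer: "read", "write", or None
    next_sector: int = 0       # sector the open transfer would touch next
    log: List[str] = field(default_factory=list)

    def request(self, op: str, sector: int, count: int = 1) -> None:
        contiguous = self.op == op and sector == self.next_sector
        if not contiguous:
            if self.op is not None:
                # Noncontiguous request: end the open multi-block transfer.
                self.log.append("CMD12 STOP_TRANSMISSION")
            self.log.append(f"{COMMANDS[op]} @ sector {sector}")
            self.op = op
        # Contiguous requests just restart the clock and move more data.
        self.log.append(f"clock {count} sector(s), then stop the clock")
        self.next_sector = sector + count


if __name__ == "__main__":
    card = ParkedCard()
    card.request("write", 100, 8)   # opens a multi-block write
    card.request("write", 108, 8)   # contiguous: no new command needed
    card.request("write", 500, 1)   # noncontiguous: stop, then start again
    print("\n".join(card.log))
```

The point of the model is that a run of contiguous requests costs only one command, which is why disabling the multi-block commands hurts throughput so much.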


In one instance a couple of years ago, we could only get a particular brand of SD card (Transcend Industrial grade) to work reliably if we never used the SD READ/WRITE_MULTIPLE commands and only used the single-sector READ/WRITE commands (the sdctl --nomultiwrite option). This significantly compromised that particular vendor's SD performance (something like 10x slower!), to the point that we now recommend that customers not use Transcend Industrial cards.


One thing that seems to be common is that SD cards do not like power brownouts. Several cards have permanently destroyed themselves after a precisely timed power disconnection. It is possible that the ATP card we just reproduced corruption on had a power brownout sometime in the past. Since then, we have rewritten every sector multiple times over, but I suppose it is possible that the card experienced some silent, permanent damage as a result.


The final result of this testing is reported in the graph below, with each card being marked on the first failure:

Last failure at 13 Terabytes!  What a trooper!