
Backblaze have been frequently posting notes on their experience with running large numbers of hard drives in their storage service. Some salient points:
* “Enterprise” drives, while more expensive than “consumer” ones, are not significantly more reliable. (This has been borne out by other studies.)
* Keep your drives running all the time, or switch them off when not in use? “Our vote is to keep them running.”
* They do use SMART to keep an eye on their drives. Other studies I read reported that SMART only picked up a minority (around 30%) of failures, from which I concluded that it was not worth using. I suppose it’s different for these guys if it saves on costs to replace a drive *before* it actually starts reporting uncorrectable errors...

<https://www.backblaze.com/blog/hard-drive-stats-faq/>

Hi Lawrence

Oh man, I could write pages on this subject. I’ll try to keep it short though:

Backblaze’s analysis is only really useful for home users and people in markets similar to Backblaze’s (low-workload cloud storage). I equate these because most home users aren’t going to be doing significant workloads on their drives, and nor does Backblaze. BB’s workload is so light it can’t really tax any drive from a workload POV, so their enterprise-vs-consumer analysis is not meaningful. This isn’t something they make clear at all.

What the Backblaze posts do achieve for me is a good cross-vendor study of drive reliability for home use. They also suggest I should seriously consider using HGST enterprise drives at work, based on the relative reliability of the HGST consumer drives compared to the others. If they could just stick to that, their posts would be great and would be doing a lot of good work. But they keep referring to their attempt at comparing consumer vs enterprise drives, and their methodology just doesn’t stack up. So they don’t even begin to come close to convincing me that I should start using consumer drives instead of enterprise ones.

If the other studies you refer to are the google and Schroeder ones from 2007, then they aren’t really relevant any more. I can go into this in more detail if you like, but I’ll leave it for now. I’m not aware of any other publicly reported large-scale disk drive health studies, so if you have something else then I’m keen to see it.

SMART is also absolutely useful. The google study pointed this out in 2007, and its conclusion rings true now: SMART is great at telling you when a drive is failing or starting to fail. Some of the SMART indicators are very useful for this, for example the reallocated sector count: if this is increasing over time, then the drive is reallocating more sectors due to physical damage, and therefore the drive is failing.
So SMART is fine for telling you a drive is *not healthy*. On the other hand, SMART is no use at telling you a drive is *healthy*: A clean bill of health today, according to SMART, doesn’t mean the drive won’t catastrophically fail tomorrow. This is the conclusion that the google paper drew, but a lot of people seem to misinterpret it.
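Daniel’s reallocated-sector point boils down to reading one number out of the SMART attribute table. A minimal sketch of doing that against smartctl-style output (`smartctl -A /dev/sdX`); the `smart_raw_value` helper and the sample text are illustrative, not from any real tool:

```python
# Sketch: pull one raw attribute value out of smartctl-style "-A" output.
# The helper name and the sample table below are hypothetical.
def smart_raw_value(smartctl_output, attribute_name):
    """Return the RAW_VALUE column for a named SMART attribute, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute_name:
            return int(fields[9])  # RAW_VALUE is the last (tenth) column
    return None

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""

print(smart_raw_value(sample, "Reallocated_Sector_Ct"))  # 12
```

Per Daniel’s point, what matters is whether that raw value grows between runs, not its absolute value on any single day.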

On Sun, 26 Apr 2015 14:01:10 +1200, Daniel Lawson wrote:
So SMART is fine for telling you a drive is *not healthy*. On the other hand, SMART is no use at telling you a drive is *healthy*: A clean bill of health today, according to SMART, doesn’t mean the drive won’t catastrophically fail tomorrow. This is the conclusion that the google paper drew, but a lot of people seem to misinterpret it.
This is why I don’t bother with SMART. Sit down and work out the maths: relying on SMART greatly increases your rate of replacement of drives, without a corresponding increase in the reliability of your data.

On Sun, Apr 26, 2015 at 02:54:37PM +1200, Lawrence D'Oliveiro wrote:
On Sun, 26 Apr 2015 14:01:10 +1200, Daniel Lawson wrote:
So SMART is fine for telling you a drive is *not healthy*. On the other hand, SMART is no use at telling you a drive is *healthy*: A clean bill of health today, according to SMART, doesn’t mean the drive won’t catastrophically fail tomorrow. This is the conclusion that the google paper drew, but a lot of people seem to misinterpret it.
This is why I don’t bother with SMART. Sit down and work out the maths: relying on SMART greatly increases your rate of replacement of drives, without a corresponding increase in the reliability of your data.
Yes, that is what I take from Daniel's statements above: SMART is very specific but not very sensitive at detecting impending hard drive failure, which defeats its purpose.

Cheers, Michael.

So SMART is fine for telling you a drive is *not healthy*. On the other hand, SMART is no use at telling you a drive is *healthy*: A clean bill of health today, according to SMART, doesn’t mean the drive won’t catastrophically fail tomorrow. This is the conclusion that the google paper drew, but a lot of people seem to misinterpret it.
This is why I don’t bother with SMART. Sit down and work out the maths: relying on SMART greatly increases your rate of replacement of drives, without a corresponding increase in the reliability of your data.
I’m struggling to understand your reasoning here. If you don’t ever check your drives’ SMART data, you will absolutely have more unexpected disk failures, quite simply because the failures you could have caught early by proactive SMART monitoring weren’t caught early. This might result in multiple drive failures in a single RAID set, which might result in data loss. Of course, you’re backed up (right? right), so there’s no loss in data *reliability*. But there is a loss in *availability*.

SMART is not a perfect tool, but when it tells you something is going wrong, you should pay attention. I’ll take imperfect health prediction and being able to proactively replace drives I know are failing over operating completely blind any day.

On Tue, 28 Apr 2015 21:07:38 +1200, Daniel Lawson wrote:
So SMART is fine for telling you a drive is *not healthy*. On the other hand, SMART is no use at telling you a drive is *healthy*: A clean bill of health today, according to SMART, doesn’t mean the drive won’t catastrophically fail tomorrow. This is the conclusion that the google paper drew, but a lot of people seem to misinterpret it.
This is why I don’t bother with SMART. Sit down and work out the maths: relying on SMART greatly increases your rate of replacement of drives, without a corresponding increase in the reliability of your data.
I’m struggling to understand your reasoning here.
It’s a well-known phenomenon in the mathematics of probability, known as the base-rate fallacy. Remember that people’s intuitions about probability are notoriously misleading. That’s why you have to actually do the maths.
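Lawrence’s “do the maths” can be made concrete with the standard base-rate arithmetic. The 30% sensitivity echoes the “around 30% of failures” figure quoted earlier in the thread; the base rate and false-positive rate here are assumptions for illustration only:

```python
# Base-rate arithmetic for SMART warnings. All three rates are
# illustrative assumptions, not measured figures.
base_rate   = 0.05  # fraction of drives that fail in a given year
sensitivity = 0.30  # P(SMART warns | drive will fail)
false_pos   = 0.01  # P(SMART warns | drive is actually fine)

# Total probability of seeing a warning, then Bayes for the
# probability that a warned drive was actually going to fail.
p_warn = sensitivity * base_rate + false_pos * (1 - base_rate)
ppv = (sensitivity * base_rate) / p_warn

print(f"P(warning) = {p_warn:.4f}")  # 0.0245
print(f"PPV        = {ppv:.3f}")     # 0.612
```

With these (made-up) numbers, nearly 4 in 10 drives replaced on a SMART warning would have been fine, which is the sort of result intuition tends to miss.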

This is why I don’t bother with SMART. Sit down and work out the maths: relying on SMART greatly increases your rate of replacement of drives, without a corresponding increase in the reliability of your data.
I’m struggling to understand your reasoning here.
It’s a well-known phenomenon in the mathematics of probability, known as the base-rate fallacy. Remember that people’s intuitions about probability are notoriously misleading. That’s why you have to actually do the maths.
I see. Although I’m still not sure it applies here, if you follow my earlier comments: pay attention to SMART only when it tells you bad things are happening. Drives that are in good condition and are not failing quite simply do *not* have increasing SMART counters (at least not the relevant counters, like reallocated sector count, uncorrectable errors, etc.). Drives that do have these counts increasing are quite simply going to fail, and the rate at which they fail is very closely correlated with the rate at which these counters increase. There’s no base-rate fallacy here, because any drive that is showing increasing counters is a problem drive.

I re-read your first email on this subject and you even acknowledged that Backblaze make the same point I am, but you don’t put any weight on avoiding downtime. That’s up to you - it’s not how I’d approach it though :)
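Daniel’s “only increasing counters matter” rule amounts to diffing two snapshots. A minimal sketch, where the snapshot dicts and the `failing_counters` helper are hypothetical (collect the real values however you like, e.g. with smartctl or smartd):

```python
# Sketch of the "only increasing counters matter" rule: compare two
# per-drive snapshots of raw SMART counters and flag any that grew.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")

def failing_counters(previous, current):
    """Return {name: (old, new)} for watched counters that increased."""
    return {name: (previous.get(name, 0), current.get(name, 0))
            for name in WATCHED
            if current.get(name, 0) > previous.get(name, 0)}

last_week = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0}
today     = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0}

print(failing_counters(last_week, today))  # {'Reallocated_Sector_Ct': (0, 8)}
```

A non-empty result means a drive to watch or replace; an empty result says nothing about long-term health, which is the distinction the whole thread turns on.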

On Tue, 28 Apr 2015 23:03:33 +1200, Daniel Lawson wrote:
Drives that are in a good condition and are not failing quite simply do *not* have increasing SMART counters ...
Interpreting SMART that way is a statistical predictive thing. Like any statistical prediction, you will have some percentage of false positives: the counters go up to some high value that you interpret as anomalous, but the drive continues to operate fine for something close to the usual life span.
I re-read your first email on this subject and you even acknowledged that backblaze make the same point I am, but you don’t put any weight on avoiding downtime.
I assume there are already systems in place for coping with failed drives--whether RAID or something more advanced like btrfs/ZFS, or some storage-management scheme built on top of that, whatever. In a situation with thousands of drives, you will be continually having failed drives somewhere, so you have to be able to keep operating with that. The effort and expense comes in actually replacing them. So anything that increases this expense, without actually improving the reliability of things, is not going to be welcome.
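Lawrence’s expense argument can be put in rough numbers too. Every rate below is invented for illustration; only the shape of the comparison matters:

```python
# Back-of-envelope replacement counts: swap drives only on failure,
# vs. swap on every SMART warning. All rates are made up.
drives = 1000
afr    = 0.05  # annual failure rate
fpr    = 0.01  # healthy drives flagged by SMART per year (assumed)

replace_on_failure = drives * afr
# Preemptive: warned drives (true and false positives) get swapped,
# and the failures SMART missed still get swapped when they die.
replace_on_warning = drives * afr + drives * (1 - afr) * fpr

print(f"replace on failure: {replace_on_failure:.1f}/year")  # 50.0
print(f"replace on warning: {replace_on_warning:.1f}/year")  # 59.5
```

Both strategies end up replacing the same failing drives; the extra swaps are exactly the false positives, which is the added expense without added data reliability.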
participants (3)
- Daniel Lawson
- Lawrence D'Oliveiro
- Michael Cree