Analyzing SMART hard drive reports
SMART testing has been around for a while and isn’t very well understood. I can’t say that I’m an expert on it, but I’ve come to hold a few firm opinions about it. First off, if a live hard drive fails to complete a SMART self-test, it’s time for me to retire that drive. It may seem extravagant, but I would rather not spend loads of time recovering data off a potentially failing drive. It’s true that SMART won’t necessarily raise a red flag before a significant data loss event happens. In fact, I’ve seen drives report that they’ve passed the health test in spite of some serious disk problems. Here are a few things I think you ought to look out for:
The parameter that I really like to keep an eye on is Reallocated_Sector_Ct. This shows how many disk sectors have had to be remapped to spare sectors over the lifetime of the disk. If your disk is at 0, that’s a pretty good sign; if the number increases each time you check it, that’s a very bad sign.
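As a rough illustration of that rule of thumb, here is a minimal Python sketch that classifies a series of Reallocated_Sector_Ct raw readings. The function name, the labels and the sample numbers are all made up for the example; it assumes you have already collected the raw counts yourself, oldest first.

```python
def reallocation_trend(raw_counts):
    """Classify a chronological list of Reallocated_Sector_Ct raw values.

    Hypothetical helper: the labels are just for illustration.
    """
    if not raw_counts:
        return "no data"
    # Any increase between consecutive readings means sectors are still being remapped.
    pairs = zip(raw_counts, raw_counts[1:])
    if any(later > earlier for earlier, later in pairs):
        return "growing"
    # No increase seen: zero is the good case, non-zero is at least holding steady.
    if raw_counts[-1] == 0:
        return "clean"
    return "stable"


# Made-up readings from five successive checks, oldest first.
history = [0, 0, 8, 24, 56]
print(reallocation_trend(history))
```

A drive that comes back "growing" is the one I would start planning to replace, whatever the overall health verdict says.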
Along with it, I like to see how close the following stats are to their “thresholds” and how much they fluctuate between runs:
Raw_Read_Error_Rate
Seek_Error_Rate
Hardware_ECC_Recovered
Offline_Uncorrectable
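To make "fluctuation between runs" concrete, here is a small Python sketch that compares the readings of those four attributes from two runs. The function name and the sample numbers are hypothetical, and it assumes you have already pulled the normalized readings into plain dictionaries.

```python
WATCHED = (
    "Raw_Read_Error_Rate",
    "Seek_Error_Rate",
    "Hardware_ECC_Recovered",
    "Offline_Uncorrectable",
)


def fluctuation(previous, current, watched=WATCHED):
    """Return {attribute: change in reading} for watched attributes present in both runs."""
    return {
        name: current[name] - previous[name]
        for name in watched
        if name in previous and name in current
    }


# Made-up readings from two runs of the same drive.
run_1 = {"Raw_Read_Error_Rate": 117, "Seek_Error_Rate": 75, "Offline_Uncorrectable": 100}
run_2 = {"Raw_Read_Error_Rate": 112, "Seek_Error_Rate": 75, "Offline_Uncorrectable": 98}

for name, delta in fluctuation(run_1, run_2).items():
    print(f"{name}: {delta:+d}")
```

Attributes missing from either run are simply skipped, since not every drive reports every attribute.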
Temperature can be a KILLER of drives, so if SMART reports that your drive is constantly near or over the failure threshold there, you should look into better cooling and line up a replacement drive just in case, as the lifespan of the disk will likely be shorter.
For all of the SMART stats, you’ll see value, worst and threshold. Threshold is the level the value has to reach before that parameter is declared to be in “failure”, worst is the most severe reading recorded so far, and value is the current reading on that continuum. These are normalized numbers that generally count down toward the threshold as things get worse, so a lower value is usually the more severe one. It’s worth looking at those numbers as a whole to see just HOW close to the threshold some of the stats are.
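Here is a minimal Python sketch of that "how close is it?" check. It assumes the attribute table looks like the one printed by smartctl -A from smartmontools (columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH and so on). The sample text is made up for illustration, and the headroom helper is a hypothetical name, not part of any tool.

```python
# Made-up sample in the layout of `smartctl -A` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123456789
  5 Reallocated_Sector_Ct   0x0033   038   038   036    Pre-fail  Always       -       2544
194 Temperature_Celsius     0x0022   041   035   000    Old_age   Always       -       41
"""


def headroom(report):
    """Map each attribute name to its value, worst, thresh and (value - thresh)."""
    result = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header line and any blank or unrelated lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        value, worst, thresh = int(fields[3]), int(fields[4]), int(fields[5])
        result[fields[1]] = {
            "value": value,
            "worst": worst,
            "thresh": thresh,
            "headroom": value - thresh,
        }
    return result


# List attributes with the least headroom first.
stats = headroom(SAMPLE)
for name in sorted(stats, key=lambda n: stats[n]["headroom"]):
    print(name, stats[name])
```

In this made-up sample, Reallocated_Sector_Ct sits only 2 points above its threshold, which is the kind of thing a passing overall health verdict can hide.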
For instance, I recall a drive that I was asked to look at. The system was exhibiting peculiar freezes and crashes. SMART reported the drive to be healthy, but the reallocated sector count increased each time I looked at the SMART stats. I was asked if we could get an RMA replacement on that fact alone, and I suggested that we only had to wait a few more tests to see it truly fail. (Their concern was that it was still claiming to be HEALTHY in the overall SMART test in spite of rapidly approaching its threshold value.) Sure enough, after one or two more tests it finally hit the threshold and FAILED the health assessment.