If you’ve been using computers with mechanical hard drives long enough, you know the sound. Tick. Tick. Tick. It’s annoying on its own, terrifying if you don’t have backups.
That’s what I suddenly started hearing yesterday morning as I browsed Twitter. Tick. Tick. Tick. Once a second, like a clock. But I don’t have a clock in my office/studio. I walked around the room to isolate where the tick was coming from. Was it a neighbor hammering on something outside, a computer fan with bad bearings or something stuck in its blades, or the dreaded ticking of a hard drive about to fail?
It was coming from my server computer. I tapped the top of the case, which usually clears up the sound if it’s a fan. Tick. Tick. Tick. Hmmm. Sounds a lot like a hard drive. I’ve got five hard drives inside that computer case, plus an SSD. I ran a couple of Linux commands and quickly isolated it to the hard drive Linux knows as /dev/sdd.
Ah, that’s good news and bad news.
Good news first. The command that isolated the ticking sound actually stopped the ticking, so the possibly-failing hard drive is working again. At least for now.
The wonderful good news is that this hard drive is a member of a RAID 1 array along with its twin /dev/sdc. RAID 1 duplicates data across all the drives in the array, so the failure of one drive won’t cause the loss of any data. As long as the other drive in the array doesn’t fail at the same time, all is well data-wise. But check out what I have to say later about bad news.
It’s also good news that the failing drive was not part of my other RAID array. The other array on /dev/sde and /dev/sdf is much larger, much more active, and contains my local backups of what I consider to be all my most important data. More on that later.
I found which drive was ticking using a bash script I wrote a few years ago to check temperatures of the CPU and hard drives in my server. Here it is. If you want to try this on your own Linux box, you may need to tweak the sed commands to match what you want to see from your specific hardware, and you will need to change the list of disk devices to match your drives.
# get all core temps and fan speeds. first "fan1:" is not important for my hardware.
sensors | sed -n '/Core/p;/fan.:.*min/p'
for disk in sda sdb sdc sdd sde sdf
do
    # Matches "Temperature_Celsius" and "Airflow_Temperature_Celsius", displays first found
    sudo smartctl -a /dev/$disk | sed -n "s! (Min/Max.*!!;s!.*Temperature.*\(..$\)!/dev/$disk: \1!p;n"
done
The script normally runs instantly, pausing only to prompt for my sudo password if necessary. This time, it hung for 20 or 30 seconds while trying to display the temperature of /dev/sdd. Once it un-hung, it displayed the temperature and the ticking stopped, so whatever the smartctl command did to probe the SMART information from that device kicked it back to life.
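If you would rather not rely on noticing a hang by eye, one option is to time each probe. This is only a minimal sketch and not part of the script above: the helper names elapsed_seconds and probe_speed and the 5-second threshold are made up for illustration, and it assumes a `date` that supports `+%s`.

```shell
# Print how many whole seconds a command took to run (hypothetical helper).
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Print "slow (Ns)" if the command took at least LIMIT seconds, else "ok (Ns)".
# Usage: probe_speed LIMIT command [args...]
probe_speed() {
    local limit took
    limit="$1"; shift
    took=$(elapsed_seconds "$@")
    if [ "$took" -ge "$limit" ]; then
        echo "slow (${took}s)"
    else
        echo "ok (${took}s)"
    fi
}

# Example (commented out because it needs real drives and root):
# sudo -v   # cache the sudo password first so typing it is not counted as drive time
# for disk in sda sdb sdc sdd sde sdf; do
#     echo "/dev/$disk: $(probe_speed 5 sudo smartctl -a /dev/$disk)"
# done
```

A healthy drive should come back as "ok" almost immediately, while a drive that stalls the way /dev/sdd did would show up as "slow" along with how long it took.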
In case you don’t know, SMART is a technology built into hard drives that provides measurements and tests for hard drive health. Some of the measures can alert you that a drive is starting to fail, giving you some time to copy the data elsewhere before the drive fails completely.
More good news: The complete output of smartctl -a /dev/sdd shows that the disk passes SMART tests. So SMART thinks the drive is healthy, at least healthy enough that there is no impending data loss. I also ran smartctl -a /dev/sdc on the other identical drive in the RAID array. Since both drives are identical and have been paired in RAID 1 since they were born, any significant differences in performance or reliability should show up as different values in the SMART statistics.
Here are the big differences:
- Seek_Error_Rate: 125 on this drive, 0 on the other drive
- Calibration_Retry_Count: 6042 on this drive, 8 on the other drive
- Hardware_ECC_Recovered: 573 on this drive, 282 on the other drive
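Picking out differences like these by eye gets tedious, so here is a rough sketch of one way to automate the comparison. It assumes the usual ten-column attribute table printed by smartctl -A, with the raw value in the last column, and smart_diff is a hypothetical helper name, not something from smartmontools.

```shell
# Compare two saved `smartctl -A` reports and print each attribute whose
# raw value differs, as "NAME: first vs second".
# Assumption: attribute rows start with a numeric ID and have the raw value in column 10.
smart_diff() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            # Rows from the first file: remember the raw value by attribute name.
            if (FNR == NR) { first[$2] = $10; next }
            # Rows from the second file: report any mismatch.
            if (($2 in first) && first[$2] != $10)
                print $2 ": " first[$2] " vs " $10
        }
    ' "$1" "$2"
}

# Example (commented out because it needs real drives and root):
# sudo smartctl -A /dev/sdd > sdd.txt
# sudo smartctl -A /dev/sdc > sdc.txt
# smart_diff sdd.txt sdc.txt
```

Expect some noise. Temperature readings and similar counters will differ a little even between healthy twins, so this only narrows down where to look.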
I also ran a SMART self-test using smartctl -t short /dev/sdd. The self-test brought back the ticking sound, and the test failed with a servo/seek failure message. That all makes sense. Definitely a hardware failure.
A status check of the RAID array using mdadm -D /dev/md75 turned out ok. No data loss on either drive. Not surprising since it’s been about a week since I wrote anything to that array.
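For a quick check across every array at once, the kernel also reports array status in /proc/mdstat, where a healthy two-drive mirror shows [UU] and a mirror with a missing member shows something like [U_]. This is a different approach from the mdadm -D check above. The sketch below, with the made-up helper name degraded_arrays, prints the name of any array that has a missing member.

```shell
# Read /proc/mdstat-style text on stdin and print the name of each md array
# whose member status string (for example [U_]) contains an underscore,
# which marks a missing or failed member.
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                  # remember the array this block belongs to
        match($0, /\[[U_]+\]/) {             # find the [UU] / [U_] status string
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
        }
    '
}

# Example:
# degraded_arrays < /proc/mdstat
```

No output means every array has all its members. Keep in mind that a drive can tick and fail self-tests while md still counts it as an active member, so a clean result here does not mean the drives are healthy.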
Also in the good news category is that this turned out to be an interesting real world test of RAID. Before I knew which drive was failing, I examined some data on each RAID array. The data was there and immediately accessible. So the presence of a failing drive in the array did not prevent or delay access to data in the array. It’s always comforting to see something like this work in a real world use case!
Now for the bad news.
First, the data on this pair of drives isn’t backed up. Actually that’s not quite true because it is the backup, although it’s not backed up to the cloud like the data in my other RAID array. That’s not really so bad in this situation because of what I keep on this array. The only use for this pair of drives is as a backup repository of several hundred gigabytes of files I’ve downloaded that would be inconvenient to re-download. This includes original zip files of digital assets from some courses I’m taking, files I’ve purchased from various digital content providers, some of which may not be available to re-download, some large ISO files I might not want to re-download if I need them, that sort of thing. Many of these files are just the backup copies of files that are sitting in the Downloads folder on my main computer. So if I totally lost the files on this RAID array, I could recover most of them by running a simple backup command to synchronize them from my main computer. Most of the rest I could re-download from the Internet. Some of them I might not be able to find, but it would only hamper, not prevent, my ability to reconstruct certain assets if I had a total failure of both my computers at the same time.
“But don’t you use Crashplan to back up all your files to the cloud?” you say. Well, most but not all. I back up to the cloud all the files that are my own creation: photos, photo art, music compositions, spreadsheets, text documents, etc., as well as various configuration files created by software I use. I also back up the unzipped versions of digital assets that I acquired through courses and purchases. What I do not back up are several hundred gigabytes of downloaded zip and ISO files that are large and potentially re-downloadable. It would take a couple of months of uploading to back those files up to the Crashplan cloud. If I lost them and needed them again, it would probably be quicker to download the original zip files again than to spend the time backing them up now. I also do not back up many gigabytes of cache files and preview files created by Photoshop and Lightroom that would tie up bandwidth for no reason, but that’s a subject for another post. I’m just trying to be efficient with my time and slow Internet bandwidth.
Now for the actual bad news. The two drives in this RAID array are so identical that their serial numbers are merely one digit apart. I bought them together in 2008 so they are nearly 9 years old. These are old drives, so if one dies of old age (which is likely what’s happening), with identical usage profiles the other can’t be too far behind. I’ve experienced other drives in the same manufacturing lot failing within days of each other, so I would not be surprised if the other drive in this pair starts ticking sometime in the next couple of weeks.
“Why are you using 9 year old drives in your backup server, you nut case?” you say. Hey, don’t be so rude! That’s a long story I won’t totally get into, but it involves a sweep of the house for old electronics to take to the recycler, and a decision to put these two drives that have been unused for a couple of years back to use rather than sending them out to pasture. I figured the extra capacity gained by moving the low-risk archive files off my main RAID array would extend its life by 6 to 12 months, delaying the need to buy another drive for the main RAID. Well, it was just one week ago that I put those two old drives into the server. That decision didn’t last long!
I just ordered another 3 TB hard drive to add to the other, newer RAID array, which will revert to being the only RAID array. The two drives in it now are only 2½ years old, mere toddlers compared to their aging aunt and uncle, and I was able to order the exact same drive as the other two. I’ll be converting that array from RAID 1 to RAID 5, which will double the capacity and extend its life by 3 or 4 years before I need to expand it again. Meanwhile, after I wipe them, those two old drives will go back into the recycle pile where they should have stayed.
Moral of the story: Don’t use 9 year old hard drives to store important data. But if you do, make sure you use RAID, keep backup copies, and stop using them as soon as one drive fails.
Update: After giving it some more thought, I decided to convert the storage array to RAID 10 instead of RAID 5. I ordered yet another 3 TB drive, so I added two drives instead of one. When it was all done I ended up with the same amount of storage as I would have had with RAID 5, at the cost of one more drive. The main trade-off is that RAID 10 survives more failure scenarios without data loss than RAID 5. I felt better paying a little extra to keep my backup data even safer!
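The capacity arithmetic behind those choices is simple enough to sketch. The function below is a back-of-the-envelope calculator, not an mdadm feature. The name raid_usable_tb is made up, and it assumes identical drives, whole-terabyte sizes, and a RAID 10 layout that keeps two copies of everything on an even number of drives.

```shell
# Print the usable capacity in TB for a RAID array of identical drives.
# Usage: raid_usable_tb LEVEL DRIVES SIZE_TB
raid_usable_tb() {
    local level drives size
    level="$1"; drives="$2"; size="$3"
    case "$level" in
        1)  echo "$size" ;;                          # every drive holds the same data
        5)  echo $(( ($drives - 1) * $size )) ;;     # one drive's worth goes to parity
        10) echo $(( $drives / 2 * $size )) ;;       # half the drives hold mirror copies
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

raid_usable_tb 1 2 3     # the array as it was: two 3 TB drives in RAID 1
raid_usable_tb 5 3 3     # the original plan: three drives in RAID 5
raid_usable_tb 10 4 3    # the final choice: four drives in RAID 10
```

With 3 TB drives this prints 3, 6, and 6. RAID 5 on three drives doubles the capacity of the two-drive mirror, and RAID 10 on four drives lands on the same 6 TB.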