Disk failures are very common in a storage environment, and as storage administrators we come across this situation quite often; how often depends on how many disks your storage systems have: the more disks you manage, the more often you will face it.
I have written this post with RAID-DP and FC-AL disks in mind, because RAID-DP is always better than RAID4 and we don't use SCSI loops. Due to its design, RAID-DP protects against a double disk failure within a single RAID group; that is, you will not lose data even if two disks in the same RG fail at the same time or one after another.
Like any other storage system, ONTAP uses a disk from the spare pool to rebuild the data from the surviving disks as soon as it encounters a failed disk, and sends an AutoSupport message to NetApp for parts replacement. Once the AutoSupport is received, NetApp initiates the RMA process and the part gets delivered to the address listed for that system in NetApp's records. When the disk arrives you either replace it yourself or ask a NetApp engineer to come onsite and change it; either way, as soon as you replace the disk the system finds the new working disk and adds it to the spare pool.
Now wasn't that pretty simple and straightforward? Oh yes, because we are using software-based disk ownership and disk auto-assignment is turned on. Much like your baby catching a cold, calling up the GP himself and getting it cured rather than asking you to take care of him. But what if there are more complications?
Now let's cover the other things that can get in the way, and the complications that can arise.
Scenario 1:
I have replaced my drive and the light shows green or amber, but 'sysconfig -r' still shows the drive as broken.
Sometimes we face this problem because the system was not able to label the disk properly, or because the replacement disk itself is not good. The first thing to try is labelling the disk correctly; if that doesn't work, try replacing it with another (known good) disk, and if that too doesn't work, just contact NetApp and follow their guidelines.
To relabel the disk from "BROKEN" to "SPARE", first note down the broken disk ID, which you can get from "aggr status -r". Then go to advanced mode with "priv set advanced" and run "disk unfail <disk_id>". At this stage the filer will throw some 3-4 errors on the console, syslog or SNMP traps, depending on how you have it configured, but this was the final step and the disk should now be good, which you can confirm with "disk show -v" for detailed status or with "sysconfig -r". Give it a few seconds to recognise the changed status of the disk if the change doesn't show up at first.
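Just to illustrate, assuming the broken disk turns out to be 0a.23 (a made-up ID; use whatever "aggr status -r" reports on your system), the console sequence looks roughly like this:

    filer> aggr status -r                # note the disk ID in the "Broken disks" section
    filer> priv set advanced
    filer*> disk unfail 0a.23            # expect a handful of RAID/disk warnings here
    filer*> disk show -v                 # detailed ownership/status; the disk should no longer be broken
    filer*> sysconfig -r                 # 0a.23 should now appear under spare disks
    filer*> priv set admin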
Scenario 2:
Two disks have failed in the same RAID group and I don't have any spare disk in my system.
In this case you are really in big trouble, because you should always have at least one spare disk available in your system; in fact NetApp recommends a 1:28 ratio, i.e. one spare for every 28 disks. In a dual disk failure you have a very high chance of losing your data if yet another disk goes while you are rebuilding the data onto a spare or while you are waiting for new disks to arrive.
So always have a minimum of two spare disks available in your system. One spare is also fine and the system will not complain about it, but if you leave the system with only one spare the Maintenance Center will not work and the system will not scan any disk for potential failures.
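A quick way to keep an eye on this is something like the following (the raid.min_spare_count option is a 7-Mode option; check that it exists on your release before relying on it):

    filer> aggr status -s                    # lists the current spare disks
    filer> options raid.min_spare_count      # how many spares ONTAP wants before it starts warning
    filer> options raid.min_spare_count 2    # warn whenever fewer than two spares are available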
Coming back to the situation above, a dual disk failure with no spares available: the best bet is to ring NetApp and get the failed disks replaced ASAP, or, if you think you are losing your patience, select a disk of the same type from another healthy system, do a "disk fail" on it, pull it out and insert it in place of the failed disk on the troubled system.
After adding the disk to the other filer, if it shows a partial/failed volume, make sure the volume reported as partial/failed really belongs to the newly inserted disk by using the "vol status -v" and "vol status -r" commands; if so, just destroy that volume with "vol destroy" and then zero out the disk with "disk zero spares".
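As a rough sketch (volume and disk names here are placeholders, and the exact flow, in particular whether you need to offline the volume first, may vary by release, so double-check on your system):

    filer> vol status -r                 # which volume is claiming the transplanted disk?
    filer> vol status -v oldvol          # "oldvol" stands for the partial/failed volume reported above
    filer> vol offline oldvol            # if it is not already offline/failed
    filer> vol destroy oldvol            # answer the confirmation prompt
    filer> disk zero spares              # zeroes all non-zeroed spares in the background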
This exercise will not take more than 15 minutes (except for the disk zeroing, which depends on your disk type and capacity), and you will end up with a single disk failure on two systems, each of which can survive another disk failure. But what if you don't do that and keep running your system with a dual disk failure? The system will shut itself down after 24 hours; yes, it will shut down by itself, without any failover, to get your attention. There is an option to control how long your system keeps running after a disk failure, but I think 24 hours is a good value and you shouldn't increase or decrease it unless you think you don't care about the data sitting there or about anyone accessing it.
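For reference, the option in question is raid.timeout (in hours, default 24); I'd leave it alone, but you can at least check what it is set to:

    filer> options raid.timeout          # current value, in hours
    filer> options raid.timeout 24       # only if you really must set it back to the default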
Scenario 3:
My drive failed but there is no disk with an amber light
A number of times this happens because the disk electronics have failed and the system can no longer recognise the disk as part of the loop. In this situation you first have to find the disk name. There are a couple of methods to identify which disk has failed.
a) "sysconfig -r" — look at the broken disk list
b) From the AutoSupport message, check for the failed disk ID
c) "fcadmin device_map" — look for a disk showing "XXX" or a "BYP" (bypassed) entry
d) In /etc/messages, look for failed or bypassed disk warnings, which include the disk ID
Once you have identified the failed disk ID, run "disk fail <disk_name>" and check whether the amber light comes on. If not, use "blink_on <disk_name>" in advanced mode to turn on the disk's LED, or, if that fails, turn on the adjacent disks' lights with the same blink_on command so you can identify the disk correctly. Alternatively, you can use the led_on command instead of blink_on to turn on the LEDs of the disks adjacent to the defective one rather than its red LED.
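To put that together, assuming 0b.43 is the culprit (again a made-up ID), it would look something like:

    filer> disk fail 0b.43               # mark the disk failed; its amber LED should come on
    filer> priv set advanced
    filer*> blink_on 0b.43               # if the LED didn't light, blink the disk's own LED
    filer*> blink_on 0b.42               # or blink the neighbours (0b.42/0b.44) to locate it by elimination
    filer*> led_on 0b.42                 # led_on/led_off work the same way for the adjacent disks
    filer*> priv set admin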
If you use the auto-assign function, the system will assign the replacement disk to the spare pool automatically; otherwise use the "disk assign <disk_name>" command to assign the disk to the system.
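And once the replacement is in place, roughly (disk ID again being a placeholder):

    filer> options disk.auto_assign      # is auto-assignment on or off?
    filer> disk show -n                  # lists any disks that are still unowned
    filer> disk assign 0b.43             # manual assignment if auto-assign is off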
Scenario 4:
Disk LED remains orange after replacing failed disk
This error occurs because you were in a hurry and didn't give the system enough time to recognise the change. When the failed disk is removed from its slot, the disk LED remains lit until the enclosure services process notices and corrects it, which generally takes around 30 seconds after removing the failed disk.
As you have already pulled the disk, the better option is to use the led_off command from advanced mode, or, if that doesn't work because the system believes the LED is off when it is actually on, simply turn the LED on and then back off again using the "led_on <disk_name>" and "led_off <disk_name>" commands.
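In practice (placeholder disk ID once more):

    filer> priv set advanced
    filer*> led_off 0b.43                # usually enough on its own
    filer*> led_on 0b.43                 # if not, cycle it: on...
    filer*> led_off 0b.43                # ...and off again, so ONTAP's idea of the LED state matches reality
    filer*> priv set admin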
Scenario 5:
Disk reconstruction failed
There can be a number of reasons for a RAID reconstruction to fail on the new disk, including an enclosure access error, a file-system disk not responding or missing, a spare disk not responding or missing, or something else; however, the most common reason for this failure is outdated firmware on the newly inserted disk.
Check whether the newly inserted disk has the same firmware version as the other disks; if not, update the firmware on the newly inserted disk first and the reconstruction should then finish successfully.
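A quick way to check and fix this on 7-Mode (assuming the matching firmware file has already been downloaded from NetApp and copied to /etc/disk_fw):

    filer> sysconfig -v                  # shows each disk with its firmware revision
    filer> storage show disk -a          # another view that includes the firmware rev
    filer> disk_fw_update                # updates any disk whose firmware file is present in /etc/disk_fw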
Scenario 6:
Disk reconstruction stuck at 0% or failed to start
This might be an error, or it might be due to a limitation in ONTAP, namely that no more than two reconstructions can run at the same time. The error you might find at times occurs because the RAID group was in a degraded state and the system went through an unclean shutdown, so the parity is marked inconsistent and needs to be recomputed after boot. However, since parity recomputation requires all data disks in the RAID group to be present, and we already have a failed disk in the RG, the aggregate will be marked WAFL_inconsistent. You can confirm this condition with the "aggr status -r" command.
If this is the case then you have to run wafliron, by issuing "aggr wafliron start <aggrname>" while in advanced mode. Make sure you contact NetApp before starting wafliron, as it will unmount all the volumes hosted on the aggregate until the first phase of checks is complete. The time wafliron takes to finish the first phase depends on many variables, such as the size of the volumes/aggregate/RG, the number of files, snapshots and LUNs, and lots of other things, so you can't predict how long it will take; it might be one hour or it might be 4-5 hours. So if you are going to run wafliron, contact NetApp first.
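For completeness, the sequence looks roughly like this ("aggr1" is a placeholder aggregate name, and again, only do this under NetApp's guidance):

    filer> priv set advanced
    filer*> aggr status aggr1 -r         # confirm the aggregate is marked wafl inconsistent
    filer*> aggr wafliron start aggr1    # volumes on aggr1 stay unmounted until the first phase completes
    filer*> priv set admin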