Friday, August 28, 2009
NetApp command line shortcuts
Saturday, August 22, 2009
Failed disk replacement in NetApp
Disk failures are very common in a storage environment, and as storage administrators we come across this situation regularly; how often depends on how many disks your storage systems have. The more disks you manage, the more often you will see a failure.
I have written this post with RAID-DP and FC-AL disks in mind, because RAID-DP is always a better choice than RAID4 and we don't use SCSI loops. By design, RAID-DP protects against a double disk failure within a single RAID group, which means you will not lose data even if two disks in the same RG fail at the same time or one after the other.
Like any other storage system, ONTAP uses a disk from the spare pool to rebuild the data from the surviving disks as soon as it encounters a failed disk, and it sends an AutoSupport message to NetApp for parts replacement. Once the AutoSupport is received, NetApp initiates the RMA process and the part gets delivered to the address listed for that system in NetApp's records. When the disk arrives you can change it yourself or ask a NetApp engineer to come onsite and change it; either way, as soon as you replace the disk the system finds the new working disk and adds it to the spare pool.
Now wasn't that pretty simple and straightforward? Yes, because we are using software-based disk ownership and disk auto-assignment is turned on. It's a bit like your baby catching a cold, calling the GP himself and getting it cured rather than asking you to take care of him. But what if there are complications?
Below I will cover the things that can get in the way and the complications you may run into.
Scenario 1:
I have replaced my drive and the light shows green or amber, but "sysconfig -r" still shows the drive as broken.
Sometimes we face this problem because either the system was not able to label the disk properly or the replacement disk itself is not good. The first thing to try is to label the disk correctly; if that doesn't work, try another known-good disk, and if that still doesn't work, contact NetApp and follow their guidelines.
To relabel the disk from "BROKEN" to "SPARE", first note down the broken disk ID, which you can get from "aggr status -r", then go to advanced mode with "priv set advanced" and run "disk unfail <disk_id>".
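A minimal console session might look like the following sketch (the disk ID 0a.23 is purely illustrative; substitute the ID of your broken disk):
filer> aggr status -r             # note the ID of the disk listed under the broken disks
filer> priv set advanced          # disk unfail is an advanced-mode command
filer*> disk unfail 0a.23         # relabel the disk; some ONTAP releases need "disk unfail -s" to return it to the spare pool
filer*> priv set                  # back to admin mode
filer> sysconfig -r               # confirm the disk now appears as a spare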
Scenario 2:
Two disks have failed in the same RAID group and I don't have any spare disks in my system.
In this case you are really in big trouble, because you should always have at least one spare disk available in your system; NetApp actually recommends a 1:28 ratio, i.e. one spare for every 28 disks. With a dual disk failure you have a very high chance of losing your data if another disk goes while you are rebuilding onto a spare or while you are waiting for new disks to arrive.
So always keep a minimum of two spare disks in your system. One spare is also fine and the system will not complain, but with only one spare the Maintenance Center will not work and the system will not scan disks for potential failures.
Coming back to the situation above, where you have a dual disk failure and no spares available: the best bet is to ring NetApp and get the failed disks replaced ASAP. If you feel you are losing your patience, select a disk of the same type from another healthy system, do a "disk fail" on it, pull it out and swap it with the failed disk in the troubled system.
After adding the disk to the other filer, if it reports a partial/failed volume, confirm with "vol status -v" and "vol status -r" that the volume reported as partial/failed belongs to the newly inserted disk; if so, destroy that volume with "vol destroy" and then zero out the disk with "disk zero spares".
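As a rough sketch of the borrow-a-disk approach (filer names, disk IDs and the volume name are purely illustrative):
filerA> disk fail 0b.41           # pre-fail the donor disk on the healthy system so it can be pulled
(swap the pulled donor disk with the failed disk in the troubled system)
filerB> vol status -v             # check whether a partial/failed volume is being reported
filerB> vol status -r             # confirm it belongs to the newly inserted disk
filerB> vol offline oldvol        # take the leftover volume offline if it isn't already
filerB> vol destroy oldvol        # destroy the foreign volume carried in on that disk
filerB> disk zero spares          # zero the disk so it becomes a usable spare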
This exercise will not take more than 15 minutes (excluding disk zeroing, which depends on your disk type and capacity), and you end up with a single disk failure in two systems, each of which can survive another disk failure. But what if you don't do this and keep running the system with a dual disk failure? The system will shut itself down after 24 hours; yes, it will shut down without any failover, just to get your attention. There is an option (raid.timeout) to control how long the system keeps running in a degraded state, but I think 24 hours is a good value and you shouldn't increase or decrease it unless you really don't care about the data sitting there and the people accessing it.
Scenario 3:
My drive failed but no disk is showing an amber light.
This usually happens because the disk's electronics have failed and the system can no longer recognize it at all. In this situation you first have to identify the disk name. There are a couple of ways to find out which disk has failed:
a) "sysconfig -r" – look at the broken disk list
b) Check the AutoSupport message for the failed disk ID
c) "fcadmin device_map" – look for a disk showing "xxx" or "BYP"
d) Look in /etc/messages for a failed or bypassed disk warning, which includes the disk ID
Once you have identified the failed disk ID, run "disk fail <disk_id>" on it and then physically replace the disk.
If you use the auto-assign function, the system will assign the replacement disk to the spare pool automatically; otherwise use "disk assign <disk_id>" to assign ownership manually.
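A minimal sketch of that workflow (disk IDs are purely illustrative):
filer> fcadmin device_map         # look for a disk flagged xxx or BYP
filer> disk fail 0a.17            # administratively fail the suspect disk
(physically replace the disk)
filer> disk show -n               # list unowned disks if auto-assign is turned off
filer> disk assign 0a.17          # assign ownership of the replacement disk
filer> sysconfig -r               # confirm it now shows up as a spare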
Scenario 4:
Disk LED remains orange after replacing failed disk
This happens because you were in a hurry and didn't give the system enough time to recognize the change. When a failed disk is removed from its slot, the disk LED remains lit until Enclosure Services notices and corrects it, which generally takes around 30 seconds after the failed disk is removed.
As you have already swapped the disk, use the "led_off" command from advanced mode. If that doesn't work because the system believes the LED is already off when it is actually on, simply turn the LED on and then back off again using "led_on <disk_id>" followed by "led_off <disk_id>".
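For example, with an illustrative disk ID:
filer> priv set advanced          # led_on and led_off are advanced-mode commands
filer*> led_on 0a.29              # turn the LED on so ONTAP's view matches the actual state
filer*> led_off 0a.29             # then turn it back off
filer*> priv set                  # back to admin mode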
Scenario 5:
Disk reconstruction failed
There are a number of issues that can cause RAID reconstruction onto the new disk to fail, including an enclosure access error, a file system disk not responding or missing, or a spare disk not responding or missing; however, the most common reason is outdated firmware on the newly inserted disk.
Check whether the newly inserted disk has the same firmware revision as the other disks; if not, update the firmware on the new disk first and the reconstruction should then finish successfully.
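A quick way to compare and update firmware, assuming the current firmware files have already been copied to the filer's /etc/disk_fw directory:
filer> sysconfig -a               # detailed hardware listing; note each disk's firmware revision
filer> disk_fw_update             # updates any disk running firmware older than what is in /etc/disk_fw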
Scenario 6:
Disk reconstruction stuck at 0% or failed to start
This might be a genuine error, or simply a limitation in ONTAP: no more than two reconstructions can run at the same time. One error you may hit is when a RAID group was in a degraded state and the system went through an unclean shutdown; parity is then marked inconsistent and needs to be recomputed after boot. But because parity recomputation requires all data disks in the RAID group to be present, and we already have a failed disk in the RG, the aggregate will be marked WAFL_inconsistent. You can confirm this condition with the "aggr status -r" command.
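To check where a reconstruction stands, or whether the aggregate has been flagged, something along these lines (the aggregate name aggr0 is just an example):
filer> aggr status -r aggr0       # per-disk view; reconstructing disks show a percent-complete figure
filer> aggr status aggr0          # summary states such as degraded, reconstruct or wafl inconsistent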
Thursday, August 20, 2009
NetApp NFS mount for Sun Solaris 10 (64 bit)
Mount options
rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp
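For example, a one-off mount and the equivalent /etc/vfstab entry might look like the following (the filer name filer1, export path and mount point are placeholders):
mount -F nfs -o rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp filer1:/vol/oradata /u01/oradata
filer1:/vol/oradata - /u01/oradata nfs - yes rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp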
Kernel Tuning
On Solaris 10 the following kernel parameters should be set to the values shown, or higher; where a parameter has been replaced by a Solaris 10 resource control, set the resource control instead.
Parameter | Replaced by (resource control) | Recommended minimum value
noexec_user_stack | NA | 1
semsys:seminfo_semmni | project.max-sem-ids | 100
semsys:seminfo_semmns | NA | 1024
semsys:seminfo_semmsl | project.max-sem-nsems | 256
semsys:seminfo_semvmx | NA | 32767
shmsys:shminfo_shmmax | project.max-shm-memory | 4294967296
shmsys:shminfo_shmmni | project.max-shm-ids | 100
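As a sketch of how these might be applied (the project name user.oracle is only an example), the parameters with no resource-control replacement still go into /etc/system and need a reboot:
* Solaris 10 /etc/system entries
set noexec_user_stack=1
set semsys:seminfo_semmns=1024
set semsys:seminfo_semvmx=32767
The remaining parameters are set as resource controls on the project that runs the workload:
projmod -s -K "project.max-sem-ids=(privileged,100,deny)" user.oracle
projmod -s -K "project.max-sem-nsems=(privileged,256,deny)" user.oracle
projmod -s -K "project.max-shm-memory=(privileged,4294967296,deny)" user.oracle
projmod -s -K "project.max-shm-ids=(privileged,100,deny)" user.oracle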
Solaris file descriptors
rlim_fd_cur – "Soft" limit on the number of file descriptors (and sockets) that a single process can have open
rlim_fd_max – "Hard" limit on the number of file descriptors (and sockets) that a single process can have open
Setting both values to 1024 is strongly recommended to avoid database crashes caused by the Solaris host running out of file descriptors.
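These two are still plain /etc/system tunables on Solaris 10 and take effect after a reboot:
* "soft" per-process file descriptor limit
set rlim_fd_cur=1024
* "hard" per-process file descriptor limit
set rlim_fd_max=1024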
Network Settings
Parameter | Value | Details
/dev/tcp tcp_recv_hiwat | 65535 | Increases the TCP receive buffer (high watermark)
/dev/tcp tcp_xmit_hiwat | 65535 | Increases the TCP transmit buffer (high watermark)
/dev/ge adv_pauseTX | 1 | Enables transmit flow control
/dev/ge adv_pauseRX | 1 | Enables receive flow control
/dev/ge adv_1000fdx_cap | 1 | Forces full duplex on GbE ports
sq_max_size – Sets the maximum number of messages allowed on each IP queue (STREAMS synchronized queue). Increasing this value improves network performance. A safe value is 25 for every 64 MB of physical memory, up to a maximum of 100; the parameter can be tuned by starting at 25 and incrementing by 10 until network performance peaks.
nstrpush – Determines the maximum number of modules that can be pushed onto a stream; it should be set to 9.
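As a sketch, the TCP and driver parameters can be set at runtime with ndd (these do not persist across reboots unless scripted), while the STREAMS tunables go into /etc/system:
ndd -set /dev/tcp tcp_recv_hiwat 65535
ndd -set /dev/tcp tcp_xmit_hiwat 65535
# on systems with several ge interfaces, select the instance first, e.g. ndd -set /dev/ge instance 0
ndd -set /dev/ge adv_pauseTX 1
ndd -set /dev/ge adv_pauseRX 1
ndd -set /dev/ge adv_1000fdx_cap 1
* /etc/system entries (reboot required); 100 assumes at least 256 MB of RAM per the guideline above
set sq_max_size=100
set nstrpush=9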
References
NetApp Technical Reports TR-3633, TR-3496, TR-3322
NetApp Knowledge Base Article 7518
Thursday, August 6, 2009
NetApp Active/Active vs. Active/Passive (Stretch MetroCluster) solution
Monday, August 3, 2009
NetApp NFS mount for Red Hat Linux 5.2
Parameter | Value | Description | Purpose
net.core.rmem_default | 262144 | Default TCP receive window size (default buffer size) | Improves network performance for IP-based protocols
net.core.rmem_max | 16777216 | Maximum TCP receive window size (maximum buffer size) | Improves network performance for IP-based protocols
net.core.wmem_default | 262144 | Default TCP send window size (default buffer size) | Improves network performance for IP-based protocols
net.core.wmem_max | 16777216 | Maximum TCP send window size (maximum buffer size) | Improves network performance for IP-based protocols
net.ipv4.tcp_rmem | 4096 262144 16777216 | Autotuning for the TCP receive window (min, default, max; default and max are overridden by rmem_default/rmem_max) | Improves network performance for IP-based protocols
net.ipv4.tcp_wmem | 4096 262144 16777216 | Autotuning for the TCP send window (min, default, max; default and max are overridden by wmem_default/wmem_max) | Improves network performance for IP-based protocols
net.ipv4.tcp_window_scaling | 1 | TCP window scaling, allows TCP windows larger than 65536 bytes | Enabled by default (value 1); make sure it does not get disabled (value 0)
net.ipv4.tcp_syncookies | 0 | Disables generation of SYN (crypto) cookies | Helps reduce CPU overhead
net.ipv4.tcp_timestamps | 0 | Disables the RTTM feature introduced in RFC 1323 | Helps reduce CPU overhead and avoids adding a 10-byte timestamp option to the TCP header
net.ipv4.tcp_sack | 0 | Disables selective ACK | Helps reduce CPU overhead
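On Red Hat 5.2 these can be applied by appending them to /etc/sysctl.conf and reloading; a sketch:
# /etc/sysctl.conf additions
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
# apply without a reboot
sysctl -p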
References:
NetApp Technical Reports TR-3700, TR-3183, TR-3369
NetApp Knowledge Base Article 7518
Jumbo Frames in NetApp
This is just another of the ten thousand posts talking about jumbo frames. A week back, while designing my new NetApp environment, I was looking into jumbo frames, so here are some details I have collected from different places. I still have to test how much benefit I get in my environment, but I am posting this in case it is useful for someone else.
Jumbo frames
Jumbo frames are Ethernet frames whose MTU is larger than the IEEE standard of 1500 bytes. There is a lot of variation here, and anything from 1500 up to around 12000 bytes can be configured and called a jumbo frame. However, most of the industry uses an MTU of 9000, both because it is supported by most device manufacturers and because it matches the datagram size of common protocols such as NFS, where a datagram is 8400 bytes: an Ethernet frame size of 9018 bytes can carry a single NFS datagram in one Ethernet packet and still stay comfortably within standard Ethernet bit error rates.
Since NetApp supports a maximum MTU size of 9192, I have taken 9000 as the MTU size in this post.
Benefits:
- Less CPU overhead, because the system has to do less header processing; this matters because in VIF mode the TOE on NetApp cards is disabled.
- A 9000-byte frame carries six times the payload of a standard 1500-byte frame, so the larger frame size leads to higher throughput.
- Some NetApp tests show up to a 30% increase in network throughput, and other vendors have achieved more than 60%.
Considerations:
- To use jumbo frames, the client systems, the intermediate switches/routers and the NetApp devices must all be configured to handle large frames (see the example after this list).
- Only interfaces operating at 1000 Mbps or higher are currently supported for jumbo frame configuration on NetApp systems.
- The client's TCP window size should be at least two times the MTU size minus 40, and at most the highest value the storage system supports; typically the maximum value a client's TCP window can be set to is 65,535.
- If the storage system is configured for jumbo frames and the client is not, communication between them takes place at the client's frame size.
- A UDP client's MTU size and the storage system's MTU size must match, because UDP clients do not communicate their MTU size.
- All the interfaces in a vif must have the same MTU size.
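As an illustration only (the interface names e0a and eth0 and the target address are placeholders), enabling a 9000-byte MTU end to end might look like this; on the filer the same option should also go on the ifconfig line in /etc/rc so it survives a reboot:
filer> ifconfig e0a mtusize 9000
On a Linux client:
ifconfig eth0 mtu 9000
Verify the path from the client with a non-fragmenting ping (Linux syntax):
ping -M do -s 8972 filer_address      # 8972 bytes of payload + 28 bytes of IP/ICMP headers = 9000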
Suggestions:
Looking at the different performance tests carried out by NetApp, it is clear that all of them have jumbo frames enabled to achieve higher throughput, and their various best practice documents also call for using jumbo frames.
References for further reading
Optimizing Oracle on NFS - NetApp White Paper
CIFS Best Practices - NetApp Technical Report
iSCSI Performance Options - NetApp Technical Report
Oracle 10g Performance on Solaris 10 - NetApp Technical Report
Ethernet Jumbo Frames - Chelsio Communications White Paper
Gigabit Ethernet Jumbo Frames - WareOnEarth Communications
Extended Frame Sizes for Next Generation Ethernets - Alteon Networks White Paper
Boosting Server-to-Server Gigabit Throughput with Jumbo Frames- HP White Paper