My journey with NetApp: 2009/08

Friday, August 28, 2009

NetApp command line shortcuts

Just a few commands which I use frequently while on console.

CTRL+W = Deletes the word before the cursor

CTRL+R = Rewrites the entire line you have entered

CTRL+U = Deletes the whole line

CTRL+A = Go to start of the line

CTRL+E = Go to end of the line

CTRL+K = Deletes all text after the cursor

There are a few more, but I find the arrow keys easier than pressing these sequences:

CTRL+F = Right arrow

CTRL+B = Left arrow

CTRL+P = Up arrow

CTRL+N = Down arrow

CTRL+I = Tab key

Am I missing anything else?

Saturday, August 22, 2009

Failed disk replacement in NetApp

Disk failures are very common in a storage environment, and as storage administrators we come across them often. How often depends on how many disks your storage systems have: the more disks you manage, the more often you see a failure.

I have written this post with RAID-DP and FC-AL disks in mind, because RAID-DP is always better than RAID4 and we don’t use SCSI loops. By design, RAID-DP protects against a double disk failure in a single RAID group. In other words, you will not lose data even if two disks in a single RG fail, whether at the same time or one after another.

Like any other storage system, ONTAP takes a disk from the spare pool and rebuilds the data from the surviving disks as soon as it encounters a failed disk, and it sends an AutoSupport message to NetApp for a parts replacement. Once NetApp receives the AutoSupport, they initiate the RMA process and the part is delivered to the address listed for the failed system in NetApp's records. When the disk arrives, you either change it yourself or ask a NetApp engineer to come onsite and change it. Either way, as soon as you replace the disk your system finds the new working disk and adds it to the spare pool.

Now wasn’t that pretty simple and straightforward? Oh yes, because we are using software-based disk ownership and disk auto-assignment is turned on. It is much like your baby having a cold and calling the GP himself to get it cured rather than asking you to take care of him. But what if there are complications?

Next I will cover the other things that can get in the way, and the complications that come with them.

Scenario 1:

I have replaced my drive and the light shows green or amber, but ‘sysconfig -r' still shows the drive as broken?

Sometimes we face this problem because the system was not able to label the disk properly, or because the replacement disk itself is not good. The first thing to try is labelling the disk correctly. If that doesn’t work, try replacing it with another disk, preferably a known good one. If that doesn’t work either, contact NetApp and follow their guidelines.

To relabel the disk from "BROKEN" to "SPARE", first note down the broken disk ID, which you can get from “aggr status -r". Then go to advanced mode with “priv set advanced” and run “disk unfail ” against that disk ID. At this stage your filer will throw three or four errors on the console, in syslog or as SNMP traps, depending on how you have configured it. That was the final step, and the disk should now be good, which you can confirm with “disk show ” for detailed status or with “sysconfig -r”. If the status change doesn’t show at first, give the system a few seconds to recognize it.

Scenario 2:

Two disks have failed in the same RAID group and I don’t have any spare disks in my system.

In this case you are in big trouble, because you should always have at least one spare disk available in your system; NetApp recommends a 1:28 ratio, i.e. one spare for every 28 disks. With a dual disk failure you have a very high chance of losing data if another disk goes while you are rebuilding onto a spare disk or while you are waiting for new disks to arrive.

So always keep a minimum of two spare disks available in your system. One spare is also fine and the system will not complain about it, but with only one spare the Maintenance Center will not work and the system will not scan any disk for potential failure.
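The spare-count rule can be turned into a quick check. This is only a sketch: the function name recommended_spares and its parameters are made up for illustration, and it simply combines the 1:28 ratio with the two-spare minimum mentioned in this post, so check NetApp's current guidance for your own platform.

```python
import math


def recommended_spares(total_disks, disks_per_spare=28, minimum=2):
    """Return a spare count from the 1:28 ratio, never below the minimum.

    Both defaults come from the text of this post; they are not read
    from any NetApp tool or API.
    """
    if total_disks <= 0:
        return 0
    by_ratio = math.ceil(total_disks / disks_per_spare)
    return max(minimum, by_ratio)


# Example: a few hypothetical system sizes.
for disks in (14, 56, 57, 168):
    print(disks, "disks ->", recommended_spares(disks), "spares")
```

For small systems the two-spare minimum dominates; the ratio only starts to matter once you go past 56 disks.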

Coming back to the situation above, a dual disk failure with no spares available, your best bet is to ring NetApp and get the failed disks replaced ASAP. If you think you are losing your patience, select a disk of the same type from another healthy system, do a disk fail on it, remove it, and use it to replace the failed disk on the affected system.

After adding the disk to the other filer, it may show a partial/failed volume. Make sure the volume reported as partial/failed belongs to the newly inserted disk by using the “vol status -v” and “vol status -r" commands. If it does, destroy the volume with the “vol destroy” command and then zero out the disk with “disk zero spares”.

This exercise will not take more than 15 minutes (except for disk zeroing, which depends on your disk type and capacity), and you will end up with a single disk failure on each of two systems, both of which can survive another disk failure. But what if you don't do that and keep running your system with a dual disk failure? Your system will shut itself down after 24 hours; yes, it will shut down by itself, without any failover, to get your attention. There is a registry setting that controls how long your system should run after a disk failure, but I think 24 hours is a good value, and you shouldn’t increase or decrease it unless you don’t care about the data sitting there or anyone accessing it.

Scenario 3:

My drive failed but there is no disk with amber lights

This often happens because the disk's electronics have failed and the system can no longer recognize the disk as part of it. In this situation you first have to find out the disk name. There are a couple of methods to find out which disk has failed:

a) Run “sysconfig -r” and look at the broken disk list

b) Check the AutoSupport message for the failed disk ID

c) Run "fcadmin device_map" and look for a disk marked xxx or “BYP”

d) Look in /etc/messages for a failed or bypassed disk warning, which gives the disk ID

Once you have identified the failed disk ID, run “disk fail ” and check whether you see an amber light. If not, use “blink_on ” in advanced mode to turn on the disk LED, or if that fails, turn on the adjacent disks' lights with the same blink_on command so you can identify the disk correctly. Alternatively, you can use the led_on command instead of blink_on to turn on the LEDs of the disks adjacent to the defective disk rather than its red LED.

If you use the auto-assign function, the system will assign the replacement disk to the spare pool automatically; otherwise use the “disk assign ” command to assign the disk to the system.

Scenario 4:

Disk LED remains orange after replacing failed disk

This happens because you were in a hurry and didn't give the system enough time to recognize the change. When the failed disk is removed from its slot, the disk LED remains lit until Enclosure Services notices and corrects it, which generally takes around 30 seconds after the failed disk is removed.

As you have already done it, use the led_off command from advanced mode. If that doesn’t work because the system believes the LED is off when it is actually on, simply turn the LED on and then off again using the “led_on ” and then “led_off ” commands.

Scenario 5:

Disk reconstruction failed

A RAID reconstruction on a new disk can fail for a number of reasons, including an enclosure access error, a file system disk not responding or missing, a spare disk not responding or missing, or something else. The most common reason, however, is outdated firmware on the newly inserted disk.

Check whether the newly inserted disk has the same firmware as the other disks. If not, update the firmware on the new disk first, and the reconstruction should then finish successfully.

Scenario 6:

Disk reconstruction stuck at 0% or failed to start

This might be an error, or it might be due to a limitation in ONTAP: no more than two reconstructions should be running at the same time. The error you may find at times arises because the RAID group was in a degraded state and the system went through an unclean shutdown, so parity is marked inconsistent and needs to be recomputed after boot. However, parity recomputation requires all data disks in the RAID group to be present, and we already have a failed disk in the RG, so the aggregate is marked WAFL_inconsistent. You can confirm this condition with the “aggr status -r" command.

If this is the case, you have to run wafliron by giving the command “aggr wafliron start ” while in advanced mode. Make sure you contact NetApp before starting wafliron, as it unmounts all the volumes hosted in the aggregate until the first phase of tests has completed. The time wafliron takes to complete the first phase depends on many variables, such as the size of the volume/aggregate/RG and the number of files/snapshots/LUNs, so you can’t predict how long it will take; it might be 1 hour or it might be 4-5 hours. So if you are running wafliron, contact NetApp first.

Thursday, August 20, 2009

NetApp NFS mount for Sun Solaris 10 (64 bit)

In this post I have tried to cover mount options and other Solaris settings for higher NFS throughput. It leans towards 64-bit; these settings apply to 32-bit as well, but as far as I can remember a few extra settings, such as super caching, come into play for the 32-bit version. I compiled this list long ago, and it is still very handy when I get complaints about low performance. For further details, see the references section.

Mount options

rw,bg,hard,nointr,rsize=32768,wsize=32768,vers=3,proto=tcp

Kernel Tuning

Parameter	Replaced by (Resource Control)	Recommended Minimum Value
noexec_user_stack	NA	1
semsys:seminfo_semmni	project.max-sem-ids	100
semsys:seminfo_semmns	NA	1024
semsys:seminfo_semmsl	project.max-sem-nsems	256
semsys:seminfo_semvmx	NA	32767
shmsys:shminfo_shmmax	project.max-shm-memory	4294967296
shmsys:shminfo_shmmni	project.max-shm-ids	100

On Solaris 10, the kernel parameters listed above should be set to the value shown, or higher.

Solaris file descriptors

rlim_fd_cur – "Soft" limit on the number of file descriptors (and sockets) that a single process can have open

rlim_fd_max – "Hard" limit on the number of file descriptors (and sockets) that a single process can have open

Setting these values to 1024 is strongly recommended to avoid database crashes resulting from Solaris resource deprivation.

Network Settings

Parameter	Value	Details
/dev/tcp tcp_recv_hiwat	65,535	increases TCP receive buffer
/dev/tcp tcp_xmit_hiwat	65,535	increases TCP transmit buffer
/dev/ge adv_pauseTX	1	Enables transmit flow control
/dev/ge adv_pauseRX	1	Enables receive flow control
/dev/ge adv_1000fdx_cap	1	forces full duplex for GBE ports
/dev/tcp tcp_xmit_hiwat	65536	Increases TCP transmit high watermark
/dev/tcp tcp_recv_hiwat	65536	Increases TCP receive high watermark

sq_max_size – Sets the maximum number of messages allowed for each IP queue (STREAMS synchronized queue). Increasing this value improves network performance. A safe value for this parameter is 25 for each 64MB of physical memory in a Solaris system up to a maximum value of 100. The parameter can be optimized by starting at 25 and incrementing by 10 until network performance reaches a peak.
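The sizing rule for sq_max_size is easy to work out by hand, but here is a small sketch of it anyway. The function names are my own, and the floor of 25 for machines with less than 64MB is an assumption on my part (the rule itself only gives the per-64MB step and the cap of 100).

```python
def safe_sq_max_size(physical_mem_mb):
    """25 per full 64 MB of physical memory, capped at 100.

    The lower bound of 25 is an assumption for very small systems;
    the post only states the step size and the maximum.
    """
    units = physical_mem_mb // 64
    return max(25, min(100, 25 * units))


def tuning_candidates(start=25, step=10, ceiling=100):
    """Values to try in turn: start at 25 and go up in steps of 10."""
    return list(range(start, ceiling + 1, step))


print(safe_sq_max_size(128))   # 128 MB of RAM
print(safe_sq_max_size(8192))  # anything from 256 MB upward hits the cap
print(tuning_candidates())
```

Since the cap is reached at 256MB, on any current server the "safe value" is simply 100, and the step-by-step search is only useful if you want to measure where performance actually peaks.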

nstrpush – Determines the maximum number of modules that can be pushed onto a stream and should be set to 9.

References

NetApp Technical Reports tr-3633, tr-3496, tr-3322

NetApp Knowledge Base Article 7518

Thursday, August 6, 2009

NetApp Active/Active vs. Active/Passive (Stretch MetroCluster) solution

Active / Active Controller Configuration

In this configuration both systems are connected to each other’s disks and have a heartbeat connection through the NVRAM card. If one controller fails, the other controller takes over the failed controller's load and keeps operations going, as it has a connection to the failed controller’s disk shelves.

Further details of Active / Active cluster best practices can be found in TR-3450

Active / Passive (Stretch MetroCluster) Configuration

The diagram shows an active/active MetroCluster; the same design applies to an active/passive MetroCluster, except that one node in the cluster holds only a mirror of the primary system's data.

In this configuration the primary and secondary systems can be up to 500 m apart (up to 100 km with Fabric MetroCluster), and all the primary system's data is mirrored to the secondary system with SyncMirror. If the primary system fails, all connections are automatically switched over to the remote copy. This provides an additional level of protection against failures such as the loss of a whole disk shelf or multiple failures at the same time; however, it requires another copy of the same data and exactly the same hardware configuration on the secondary node.

Please note that a cluster interconnect (CI) on the NVRAM card is required for a cluster configuration; however, the 3170 offers a new architecture that incorporates a dual-controller design with the cluster interconnect on the backplane. For this reason, the FCVI card that is normally used for CI in a Fabric MetroCluster configuration must also be used for a 31xx Stretch configuration.

Further details of MetroCluster design and implementation can be found in TR-3548

Minimizing downtime with cluster

Although a cluster configuration protects against unwanted downtime, a small disruption can be noticed on the network while takeover/giveback is happening. It lasts less than 90 seconds in most environments, and the NAS network stays alive with a few “not responding” errors on clients.

A few points related to this are given below:

CIFS: leads to a loss of session to the clients, and possible loss of data. However clients will reconnect the session by themselves if system comes up before the timeout window.

NFS hard mounts: clients will continue to attempt reconnection indefinitely, therefore controller reboot does not affect clients unless the application issuing the request times out waiting for NFS responses. Consequently, it may be appropriate to compensate by extending the application timeout window.

NFS soft mounts: client processes continue reconnection attempts until the timeout limit is reached. While soft mounts may reduce the possibility of client instability during failover, they expose applications to the potential for silent data corruption, so are only advised in cases where client responsiveness is more important than data integrity. If TCP soft mounts are not possible, reduce the risk of UDP soft mounts by specifying long retransmission timeout values and a relatively large number of retries in the mount options (i.e., timeo=30, retrans=10).

FTP, NDMP, HTTP, backups, restores: state is lost and the operation must be retried by the client.

Applications (for example, Oracle®, Exchange): application-specific. Generally, if timeout-based, application parameters can be tuned to increase timeout intervals to exceed Data ONTAP reboot time as a means of avoiding application disruption.

Monday, August 3, 2009

NetApp NFS mount for Red Hat Linux 5.2

Just another post from my mails, where I have collected some best practices for mounting NFS shares in RHEL.

Automounter

An automounter can cause a lot of network chatter, so it is best to disable the automounter on your client and set up static mounts before taking a network trace. Automounters depend on the availability of several network infrastructure services. If any of these services is not reliable or performs poorly, it can adversely affect the performance and availability of your NFS clients. When diagnosing an NFS client problem, triple-check your automounter configuration first. It is often wise to disable the automounter before drilling into client problem diagnosis.

LINUX KERNEL TUNING FOR KNFS

sunrpc.tcp_slot_table_entries = 128

Increasing this parameter from the default of 16 to the maximum of 128 increases the number of in-flight Remote Procedure Calls (I/Os). Be sure to edit /etc/init.d/netfs to call /sbin/sysctl -p in the first line of the script so that sunrpc.tcp_slot_table_entries is set before NFS mounts any file systems. If NFS mounts the file systems before this parameter is set, the default value of 16 will be in force.

Mount options

rw,bg,hard,intr,rsize=32768,wsize=32768,vers=3,proto=tcp,timeo=600,retrans=2
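When I get a complaint about a slow client, the first thing to compare is what the client actually mounted with against the option string above. Here is a rough sketch of that comparison; the helper names and the sample "actual" string are made up for illustration. Note that timeo is expressed in tenths of a second, so timeo=600 means 60 seconds.

```python
RECOMMENDED = ("rw,bg,hard,intr,rsize=32768,wsize=32768,"
               "vers=3,proto=tcp,timeo=600,retrans=2")


def parse_mount_options(text):
    """Turn 'a,b=1' into {'a': True, 'b': '1'}."""
    options = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def differences(actual_text, recommended_text=RECOMMENDED):
    """Map each recommended option that is missing or different
    to a (recommended, actual) pair. Extra options are ignored."""
    actual = parse_mount_options(actual_text)
    wanted = parse_mount_options(recommended_text)
    return {key: (value, actual.get(key))
            for key, value in wanted.items()
            if actual.get(key) != value}


# Hypothetical option string as it might be copied from a client.
actual = "rw,bg,soft,rsize=8192,wsize=32768,vers=3,proto=tcp,timeo=600,retrans=2"
for name, (want, have) in differences(actual).items():
    print(name, "recommended:", want, "actual:", have)
```

The check only looks at the recommended keys, so an unwanted extra such as soft shows up indirectly as hard being missing.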

Kernel Tuning

Most modern Linux distributions contain a file called /etc/sysctl.conf where you can add changes such as this so they will be executed after every system reboot. Add these lines to your /etc/sysctl.conf file on your client systems:

Parameter	Value	Description	Benefit
net.core.rmem_default	262144	Default TCP receive window size (default buffer size)	Improves network performance for IP-based protocols
net.core.rmem_max	16777216	Max. TCP receive window size (max. buffer size)	Improves network performance for IP-based protocols
net.core.wmem_default	262144	Default TCP send window size (default buffer size)	Improves network performance for IP-based protocols
net.core.wmem_max	16777216	Max. TCP send window size (max. buffer size)	Improves network performance for IP-based protocols
net.ipv4.tcp_rmem	4096 262144 16777216	Autotuning for TCP receive window size (default and max. values are overridden by rmem_default and rmem_max)	Improves network performance for IP-based protocols
net.ipv4.tcp_wmem	4096 262144 16777216	Autotuning for TCP send window size (default and max. values are overridden by wmem_default and wmem_max)	Improves network performance for IP-based protocols
net.ipv4.tcp_window_scaling	1	TCP scaling, allows a TCP window size greater than 65536 to be used	Enabled by default (value 1); make sure that it doesn't get disabled (value 0)
net.ipv4.tcp_syncookies	0	Disables generation of SYN (crypto) cookies	Helps to reduce CPU overhead
net.ipv4.tcp_timestamps	0	Disables the new RTTM feature introduced in RFC-1323	Helps to reduce CPU overhead; prevents adding 10-byte overhead to the TCP header
net.ipv4.tcp_sack	0	Disables selective ack	Helps to reduce CPU overhead
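The table gives parameters and values, while /etc/sysctl.conf wants one "name = value" line per setting. As a convenience, here is a small sketch that prints the lines in that form; the SETTINGS list just repeats the values from the table and the function name is my own.

```python
# Values copied from the table in this post.
SETTINGS = [
    ("net.core.rmem_default", "262144"),
    ("net.core.rmem_max", "16777216"),
    ("net.core.wmem_default", "262144"),
    ("net.core.wmem_max", "16777216"),
    ("net.ipv4.tcp_rmem", "4096 262144 16777216"),
    ("net.ipv4.tcp_wmem", "4096 262144 16777216"),
    ("net.ipv4.tcp_window_scaling", "1"),
    ("net.ipv4.tcp_syncookies", "0"),
    ("net.ipv4.tcp_timestamps", "0"),
    ("net.ipv4.tcp_sack", "0"),
]


def render_sysctl_conf(settings):
    """Return the settings as 'name = value' lines for sysctl.conf."""
    return "\n".join(f"{name} = {value}" for name, value in settings)


print(render_sysctl_conf(SETTINGS))
```

Paste the output at the end of /etc/sysctl.conf on a test client first and measure before rolling it out, since these are starting points rather than guaranteed wins.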

References:

NetApp whitepapers tr-3700, tr-3183, tr-3369

NetApp Knowledge Base Article 7518

Jumbo Frames in NetApp

This is just another one of ten thousand posts about jumbo frames. A week back, while designing my new NetApp environment, I was looking into jumbo frames, so here are some details I have collected from different places. I still have to test and see how much benefit I get in my environment, but I am posting this thinking it might be useful to someone else.

Jumbo frames

Jumbo frames are Ethernet frames with an MTU larger than the IEEE standard of 1500 bytes. There are many variations, and anything configured from 1500 to 12000 can be called a jumbo frame. However, most of the industry uses an MTU of 9000 for jumbo frames, because most device manufacturers support it and because of the memory page size limit of common protocols like NFS, in which the datagram size is 8400 bytes. An Ethernet frame size of 9018 can therefore accommodate a single NFS datagram in one Ethernet packet while staying comfortably within the standard Ethernet bit error rates.

As NetApp supports a maximum MTU size of 9192, I have taken 9000 as the MTU size in this paper.

Benefits:

Less CPU overhead, as the system has less header processing to do; this matters because TOE on NetApp cards is disabled in VIF mode.
A 9000-byte frame is six times larger than a stock frame with a 1500 MTU, so the larger frame size leads to higher throughput.
Some tests at NetApp show up to a 30% increase in network throughput, and other vendors have achieved more than 60%.
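To put numbers on the frame sizes, here is a back-of-the-envelope sketch. It assumes a plain Ethernet header (14 bytes) plus the 4-byte frame check sequence, and 40 bytes of IPv4 and TCP headers with no options; VLAN tags and TCP options would change the figures slightly. The function names are mine.

```python
import math

ETH_HEADER = 14       # destination MAC + source MAC + EtherType
ETH_FCS = 4           # frame check sequence (CRC)
IP_TCP_HEADERS = 40   # 20-byte IPv4 header + 20-byte TCP header, no options


def frame_size(mtu):
    """Size of the Ethernet frame that carries a full-MTU packet."""
    return mtu + ETH_HEADER + ETH_FCS


def frames_needed(payload_bytes, mtu):
    """How many full-size TCP segments it takes to move the payload."""
    per_frame = mtu - IP_TCP_HEADERS
    return math.ceil(payload_bytes / per_frame)


one_mib = 1024 * 1024
print(frame_size(1500), frame_size(9000))
print(frames_needed(one_mib, 1500), frames_needed(one_mib, 9000))
```

Moving 1 MiB takes 719 frames at an MTU of 1500 and 118 at 9000, which is where the reduction in per-frame header processing comes from.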

Considerations:

To use jumbo frames, the client systems, the intermediate switches/routers and the NetApp devices should all be configured to process large frames.
Any interface operating at 1000 Mbps or above is currently supported on NetApp systems for jumbo frame configuration.
The client’s TCP window size should be two times the MTU size, minus 40, and the maximum can be the highest value the storage system supports. Typically, the maximum value that can be set for the client's TCP window is 65,535.
If the storage system is configured to support jumbo frames and the client is not, communication between them occurs at the client’s frame size.
For UDP clients, the client’s MTU size and the storage system’s MTU size should match, as UDP clients do not communicate their MTU size.
All the interfaces in a vif must have the same MTU size.
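The TCP window guideline works out as follows. This sketch reads "two times the MTU size, minus 40" as (2 x MTU) - 40, which is my interpretation of the wording, and uses 65,535 as the typical client maximum; the function names are illustrative only.

```python
MAX_CLIENT_WINDOW = 65535  # typical maximum for the client's TCP window


def min_tcp_window(mtu):
    """Guideline window size: two times the MTU, minus 40 (assumed reading)."""
    return 2 * mtu - 40


def window_ok(window, mtu, maximum=MAX_CLIENT_WINDOW):
    """True if the window sits between the guideline value and the maximum."""
    return min_tcp_window(mtu) <= window <= maximum


print(min_tcp_window(1500))      # standard frames
print(min_tcp_window(9000))      # jumbo frames
print(window_ok(32768, 9000))
```

With an MTU of 9000 the guideline value is 17,960 bytes, so a client still running with a 16 KB window would be undersized for jumbo frames.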

Suggestions:

Looking at the different performance tests carried out by NetApp, it is clear that all of them have jumbo frames enabled to achieve higher throughput, and even their various best practices call for using jumbo frames.

Although jumbo frames are a good option for increasing network throughput, they need proper testing and validation; some tests from different implementations have shown no performance increase, or even network errors. For example, if an intermediate IP network doesn’t support extended frames, the IP device may have to fragment the frame, which puts additional load on it. In the worst case, if the DON'T_FRAGMENT bit is set in the IP header of the packets, the router will drop the packets instead of fragmenting them, and the sending station will get the message “ICMP DESTINATION UNREACHABLE - FRAGMENTATION NEEDED”.
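The fragment-or-drop behaviour can be sketched as a small decision function. This is a simplified model, not router code: it assumes IPv4 with a 20-byte header and no options, and relies on the fact that every fragment except the last must carry a multiple of 8 bytes of payload. The function name and return values are my own.

```python
import math

IP_HEADER = 20  # IPv4 header without options (assumption)


def forward(packet_len, next_hop_mtu, dont_fragment):
    """Return (action, packets_sent) for one IP packet at a router."""
    if packet_len <= next_hop_mtu:
        return ("forward", 1)
    if dont_fragment:
        # Packet is dropped; the sender gets
        # ICMP destination unreachable, fragmentation needed.
        return ("drop", 0)
    # Payload per fragment, rounded down to a multiple of 8 bytes.
    per_fragment = (next_hop_mtu - IP_HEADER) // 8 * 8
    payload = packet_len - IP_HEADER
    return ("fragment", math.ceil(payload / per_fragment))


print(forward(9000, 9000, True))    # path supports jumbo frames
print(forward(9000, 1500, False))   # router has to fragment
print(forward(9000, 1500, True))    # DF set: dropped
```

A single 9000-byte packet crossing a 1500-MTU hop turns into 7 fragments, each with its own header, which is why an unprepared path can end up slower than not using jumbo frames at all.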

References for further reading

Optimizing Oracle on NFS - NetApp White Paper

CIFS Best Practices - NetApp Technical Report

iSCSI Performance Options - NetApp Technical Report

Oracle 10g Performance on Solaris 10 - NetApp Technical Report

Ethernet Jumbo Frames - Chelsio Communications White Paper

Gigabit Ethernet Jumbo Frames - WareOnEarth Communications

Extended Frame Sizes for Next Generation Ethernets - Alteon Networks White Paper

Boosting Server-to-Server Gigabit Throughput with Jumbo Frames- HP White Paper