Thursday, August 6, 2009

NetApp Active/Active vs. Active/Passive (Stretch MetroCluster) solution

Active / Active Controller Configuration


In this configuration, both systems are connected to each other's disk shelves and maintain a heartbeat connection through the NVRAM card. If one controller fails, the other controller takes over the failed controller's workload and keeps operations running, since it has a connection to the failed controller's disk shelves.
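
As a minimal sketch, a planned takeover and giveback on a 7-mode Data ONTAP pair can be checked and exercised with the cf commands (7-mode syntax assumed; exact options and output vary by release):

filer1> cf status      # confirm controller failover is enabled and the partner is up
filer1> cf takeover    # manually take over the partner's resources
filer1> cf giveback    # hand the resources back once the partner is healthy again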

Further details of Active/Active cluster best practices can be found in TR-3450.

Active / Passive (Stretch MetroCluster) Configuration

The diagram above shows an active/active MetroCluster; however, the same design applies to an active/passive MetroCluster, except that one node in the cluster holds only a mirror of the primary system's data.

In this configuration, the primary and secondary systems can be up to 500 m apart (up to 100 km with Fabric MetroCluster), and all primary system data is mirrored to the secondary system with SyncMirror. In the event of a primary system failure, all connections automatically switch over to the remote copy. This provides an additional level of protection against failures such as the loss of an entire disk shelf or multiple simultaneous failures; however, it requires a second copy of the same data and an identical hardware configuration on the secondary node.
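
As a rough illustration (7-mode syntax assumed; aggregate names and disk counts are placeholders), the SyncMirror and site-failover side of this looks like:

filer1> aggr create aggr1 -m 20    # create a SyncMirror-mirrored aggregate (two plexes of 10 disks)
filer1> aggr mirror aggr2          # or add a mirror plex to an existing aggregate
filer1> cf forcetakeover -d        # declare a site disaster and force takeover onto the surviving node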

Please note that a cluster interconnect (CI) on the NVRAM card is required for a cluster configuration; however, the 3170 offers a new architecture that incorporates a dual-controller design with the cluster interconnect on the backplane. For this reason, the FCVI card that is normally used for the CI in a Fabric MetroCluster configuration must also be used for a 31xx Stretch configuration.
Further details of MetroCluster design and implementation can be found in TR-3548.

Minimizing downtime with cluster

Although a cluster configuration protects against unplanned downtime, a small disruption can be noticed on the network while a takeover/giveback is in progress. In most environments this takes less than approximately 90 seconds, and the NAS network stays alive with only a few "not responding" errors on clients.
A few related points are given below:

CIFS: takeover leads to a loss of session for clients and possible loss of data. However, clients will re-establish the session on their own if the system comes back up within the timeout window.

NFS hard mounts: clients continue to attempt reconnection indefinitely, so a controller reboot does not affect clients unless the application issuing the request times out while waiting for NFS responses. Consequently, it may be appropriate to compensate by extending the application timeout window.

NFS soft mounts: client processes continue reconnection attempts until the timeout limit is reached. While soft mounts may reduce the possibility of client instability during failover, they expose applications to the potential for silent data corruption, so they are advised only where client responsiveness is more important than data integrity. If TCP soft mounts are not possible, reduce the risk of UDP soft mounts by specifying a long retransmission timeout value and a relatively large number of retries in the mount options (e.g., timeo=30, retrans=10); see the example mount commands after this list.

FTP, NDMP, HTTP, backups, restores: state is lost and the operation must be retried by the client.

Applications (for example, Oracle®, Exchange): the behavior is application-specific. Generally, if the application is timeout-based, its parameters can be tuned so that the timeout intervals exceed the Data ONTAP reboot time, avoiding application disruption.
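
For the NFS cases above, a minimal sketch of the corresponding client mount options (filer name, export path, and mount point are placeholders; the soft-mount values echo the timeo=30/retrans=10 suggestion):

# Hard TCP mount: the client retries indefinitely and rides through a takeover/giveback
mount -o hard,intr,proto=tcp,vers=3 filer:/vol/vol1 /mnt/vol1

# UDP soft mount with a long retransmission timeout and a large retry count, if soft mounts are unavoidable
mount -o soft,proto=udp,vers=3,timeo=30,retrans=10 filer:/vol/vol1 /mnt/vol1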
