Friday, December 3, 2010

How to restore data from an aggregate snapshot

Today one of our users found himself in wet pants when he noticed his robocopy job had overwritten a folder rather than appending new data to it. Panicked, he ran to me looking for any tape or snapshot backup of his original data, which unfortunately wasn't there, as he had previously confirmed that the data didn't need any kind of protection.

Now there was only one place left where I could recover the data from: aggregate-level snapshots. I looked at the aggregate snapshots and saw they went back to the time when his data was still in place. Knowing that data deleted from a volume stays locked in the aggregate's snapshots, I felt good that I had done the right thing by keeping some space reserved for aggregate-level snapshots, which no one had ever advocated.


The next step was to recover the data, but the problem was that if I reverted the aggregate using 'snap restore -A', all the volumes in that aggregate would be reverted, which would be an even bigger problem. So I had to go a different way: use the aggregate copy function to copy the aggregate's snapshot to an empty aggregate and then restore the data from there.


Here’s the cookbook for this.


Pre-checks:


  • The volume you lost data from is a flexible volume 
  • Identify an empty aggregate that can be used as the destination (it could be on another controller) 
  • Make sure the destination aggregate is equal to or larger than the source aggregate 
  • /etc/hosts.equiv has an entry for the filer you want to copy data to and /etc/hosts has its IP address added; when copying on the same controller, the loopback address (127.0.0.1) should be added to the /etc/hosts file and the local filer name should be in the hosts.equiv file 
  • The name of the aggregate snapshot you want to copy 

Example:

Let's say the volume we lost data from is 'vol1', the aggregate holding this volume is 'aggr_source', the aggregate snapshot that has the lost data is 'hourly.1', and the empty aggregate we will be copying the data to is 'aggr_destination'.


Execution:


  • Restrict the destination aggregate using 'aggr restrict aggr_destination' 
  • Start the aggregate data copy using 'aggr copy start -s hourly.1 aggr_source aggr_destination' 
  • Once the copy is complete, bring the aggregate online using 'aggr online aggr_destination' 
  • If you have done the copy on the same controller, the system will rename the volume 'vol1' in 'aggr_destination' to 'vol1(1)' 
  • Now export the volume or LUN, and all your lost data is available again (see the console sketch below).
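
Put together, the session would look roughly like this. This is a sketch using the example names above, not verbatim console output, and the 'filer>' prompt is a placeholder:

filer> aggr restrict aggr_destination
filer> aggr copy start -s hourly.1 aggr_source aggr_destination
filer> aggr copy status                  (wait here until the copy completes)
filer> aggr online aggr_destination
filer> vol status                        (on the same controller the copied volume shows up as 'vol1(1)')
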
So here's the answer to another popular question: why do I need to reserve space for aggregate-level snapshots? Do you have the answer now?

Tuesday, October 12, 2010

Most destructive command in Ontap

There are some commands that shake me when I run them, or even when I am close to them, but I never thought I could bring my filer so close to death just by mistyping a command.

Yes, indeed, I did it by typing 'gbd' rather than 'dbg'. The two are so close to each other that my buttery fingers didn't realize I had mistyped, and by the time I did realize, it was too late. Sigh!

A little background on this 'gbd' command.

This command lives in diag mode, as does 'dbg'; however, whereas 'dbg' captures filer debug info to the console or a file, 'gbd' sends a kill signal to all the processors, which stops all work on the filer, and everything just hangs. The only way to recover your filer is to hard reboot it, either via the RLM or by physically pressing the power button.

I don't understand why the brilliant NetApp engineers made it so simple. Why couldn't they use a command like 'use_this_to_kill_your_system' or something like that? I swear no one would ever type it.

Anyway, I did it, and I admit I should have checked before hitting return, which I didn't. But guess what: I was lucky enough not to do it on a prod system, and this test/dev system had only a bunch of NFS clients connected to it, which made the hang nearly invisible to the client systems thanks to the nature of the NFS protocol.

Which other command do you think shouldn't be so easy, along with this one?

How to do host/user/group or netgroup lookups from the filer

Often we want to do an nslookup for a host, or an NIS/LDAP lookup for a user or group, for troubleshooting purposes. If you have a Unix system handy you can do it from there, but what if you suspect the results are not the same as what your filer is getting?

If you are troubleshooting a CIFS issue, you are in luck with the command 'cifs lookup'; however, if you are dealing with a DNS or NFS issue, you are out of luck unless you go into advanced mode. Yes, once inside advanced mode you get access to a lot of other commands, including one very nifty command, 'getXXbyYY', which is incredibly useful but hidden from the view of admins for some strange reason. Really, I am not sure why NetApp thinks this shouldn't be available to the end user; every time I troubleshoot I feel the need for it, and in no way do I see it making any sort of change on the filer.

Anyway, here's the command. Though the command suggests using "man na_getXXbyYY" for additional info, I couldn't locate that man page on my systems, so I use:

test1*> getXXbyYY help
usage: getXXbyYY
Where sub-command is one of
gethostbyname_r
gethostbyaddr_r
netgrp
getspwbyname_r
getpwbyname_r
getpwbyuid_r
getgrbyname
getgrbygid
getgrlist

For more information, try 'man na_getXXbyYY'

Please remember this command is not available in admin mode, and the search order depends on your /etc/nsswitch.conf entry, so before you start thinking it isn't working as expected, check these two things first.

Though all the subcommands are self-explanatory, I have added a small description for each of them.

gethostbyname_r - Resolves a host name to an IP address using the configured DNS server, same as nslookup
gethostbyaddr_r - Resolves an IP address to a host name using the configured DNS server, same as a reverse lookup
netgrp - Checks whether a given host is a member of a given netgroup, from LDAP/files/NIS
getspwbyname_r - Displays user information from the shadow file
getpwbyname_r - Displays user information, including the encrypted password, from LDAP/files/NIS
getpwbyuid_r - Same as above, except you provide a UID rather than a user name
getgrbyname - Displays group name and GID from LDAP/files/NIS
getgrbygid - Same as above, except you provide a GID rather than a group name
getgrlist - Shows a given user's GIDs from LDAP/files/NIS

Examples:

test1*> getXXbyYY gethostbyname_r landinghost1
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY gethostbyaddr_r 10.21.242.7
name: landinghost1
aliases:
addresses: 10.21.242.7

test1*> getXXbyYY netgrp support-group testhost1
client testhost1 is in netgroup support-group

test1*> getXXbyYY getpwbyname_r root
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getpwbyuid_r 0
pw_name = root
pw_passwd = _J9..gsxiYTAHEtV3Qnk
pw_uid = 0, pw_gid = 1
pw_gecos =
pw_dir = /
pw_shell =

test1*> getXXbyYY getgrbyname was
name = was
gid = 10826

test1*> getXXbyYY getgrbygid 10826
name = was
gid = 10826

test1*> getXXbyYY getgrlist wasadmin
pw_name = wasadmin
Groups: 10826

Sunday, August 29, 2010

Execute commands from a file on Ontap

Occasionally we want to quickly or periodically run a set of pre-defined commands on filers: when we are making a small change to the network configuration and want to minimize the network downtime, or when we are creating a volume and know that snap reserve, snap schedule, dedupe, autosize or something else needs to be changed after every volume creation. If you are executing commands from a Unix terminal, a better way would be to keep all the commands in a text file and do something like

bash> for i in vol1
>do
>snap sched $i 0
>snap reserve $i 0
>blah $i
>blah $i
>done

and everything would be done. But imagine how you would do this through the console. Unfortunately Ontap doesn't support any kind of scripting, not even a for loop, so we have to run each and every command either by typing it on the console or by copy-pasting from a text file on our desktop.

However, there is a better way: use notepad to create the set of commands exactly as you would execute them on the console, in the correct order, copy the file over to the filer, and use the 'source' command to execute every line in it. I know it's not such a brilliant idea, as you still have to copy and paste everything into a file on the filer, but it's a wee bit better than executing each and every command on the console.
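
For instance, a minimal sketch of the whole round trip; the file name '/etc/newvol_cmds' and the volume name are placeholders I made up, and wrfile is terminated with Ctrl-C:

filer> wrfile /etc/newvol_cmds
snap sched vol1 0
snap reserve vol1 0
vol autosize vol1 on
(press Ctrl-C to finish wrfile)
filer> source /etc/newvol_cmds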

Suppose you have to re-run /etc/rc: you can either use 'rdfile /etc/rc' to print everything on the console and then copy and execute each line, or just run 'source /etc/rc' and let it run all the commands for you. You can also use 'source -v /etc/rc' to print the commands on the console without executing them, just to check whether there are any junk characters or unwanted commands inside the file. As a precaution, be sure all the commands are valid and correct, because if a command fails, source doesn't stop there; it just moves on to the next command in the list.

Use it, and I am sure you will like it the next time you are making a change on a filer that needs ten different commands to be executed.

Saturday, August 28, 2010

How to check unplanned downtime detail for a NetApp filer

Every now and then someone asks us what the uptime of a system is, and we just type 'uptime' on the system console to get the detail instantly.


This is a really handy command to find out when the system was last rebooted and how many operations per protocol it has served since then. Wouldn't our life be a little easier if managers were satisfied with this detail? Alas! That doesn't happen: they ask us for all the details since we acquired the system, or since the 1st of January, and then we go back to the Excel sheet or PPT we created as part of the monthly report to pull the data.


How about if we could get the same information from the system with just one command, wouldn't that be cool? Fortunately, we have the little-known command 'availtime' right inside Ontap, which does exactly that, as if it were created specifically with our bosses in mind.


HOST02*> availtime full
Service statistics as of Sat Aug 28 18:07:33 BST 2010
 System  (UP). First recorded 68824252 secs ago on Mon Jun 23 04:16:41 BST 2008
         Planned   downs 31, downtime 6781737 secs, longest 6771328, Tue Sep  9 15:07:33 BST 2008
         Uptime counting unplanned downtime: 100.00%; counting total downtime:  90.14%
 NFS     (UP). First recorded 68824242 secs ago on Mon Jun 23 04:16:51 BST 2008
         Planned   downs 43, downtime 6849318 secs, longest 6839978, Wed Sep 10 10:11:43 BST 2008
         Uptime counting unplanned downtime: 100.00%; counting total downtime:  90.04%
 CIFS    (UP). First recorded 61969859 secs ago on Wed Sep 10 12:16:34 BST 2008
         Planned   downs 35, downtime 17166 secs, longest 7351, Thu Jul 30 13:52:25 BST 2009
         Uptime counting unplanned downtime: 100.00%; counting total downtime:  99.97%
 HTTP    (UP). First recorded 47876362 secs ago on Fri Feb 20 14:08:11 GMT 2009
         Planned   downs 8, downtime 235 secs, longest 53, Wed Jan 20 14:10:18 GMT 2010
         Unplanned downs 16, downtime 4915 secs, longest 3800, Mon Jul 27 16:01:02 BST 2009
         Uptime counting unplanned downtime:  99.98%; counting total downtime:  99.98%
 FCP     (DOWN). First recorded 68817797 secs ago on Mon Jun 23 06:04:16 BST 2008
         Planned   downs 17, downtime 44988443 secs, longest 38209631, Sat Aug 28 18:07:33 BST 2010
         Unplanned downs 6, downtime 78 secs, longest 21, Fri Feb 20 15:24:44 GMT 2009
         Uptime counting unplanned downtime:  99.99%; counting total downtime:  34.62%
 iSCSI   (DOWN). First recorded 61970687 secs ago on Wed Sep 10 12:02:46 BST 2008
         Planned   downs 21, downtime 38211244 secs, longest 36389556, Sat Aug 28 18:07:33 BST 2010
         Uptime counting unplanned downtime: 100.00%; counting total downtime:  38.33% 



I am not sure why NetApp has kept this command in advanced mode, but once you know it, I bet you will not refrain from going into advanced mode next time to see how much unscheduled downtime you have had since the last reset.


A shorter version of this command is just 'availtime'. It shows the same information as 'availtime full', but it truncates the output, denoting Planned with P and Unplanned with U, which is very handy if you want to parse it in a script.

HOST04*> availtime
Service statistics as of Sat Aug 28 18:07:33 BST 2010
 System  (UP). First recorded (20667804) on Wed Sep 23 09:35:49 GMT 2009
         P  5, 496, 139, Fri Dec 11 15:58:19 GMT 2009
         U  1, 1605, 1605, Wed Mar 31 17:01:41 GMT 2010
 CIFS    (UP). First recorded (20666589) on Wed Sep 23 09:56:04 GMT 2009
         P  7, 825, 646, Thu Jan 21 19:08:03 GMT 2010
         U  1, 77, 77, Wed Mar 31 16:34:54 GMT 2010
 HTTP    (UP). First recorded (20664731) on Wed Sep 23 10:27:02 GMT 2009
         P  3, 51, 22, Thu Jan 21 19:17:25 GMT 2010
         U  4, 203, 96, Thu Jan 21 19:08:03 GMT 2010
 FCP     (UP). First recorded (20477735) on Fri Sep 25 14:23:38 GMT 2009
         P  3, 126, 92, Thu Jan 21 19:07:57 GMT 2010
         U  4, 108, 76, Wed Mar 31 16:34:53 GMT 2010


To reset the statistics, use the 'reset' switch, which zeroes out all the counters. Make sure you have recorded the statistics before you reset them, as once the counters are reset you will no longer be able to get details of system uptime since the system was built. So you may want to do this only after you have acquired a new system, finished all the configuration, and it's time for it to start serving user requests.
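
As described above, the reset itself is a single advanced-mode command; a sketch, with no output expected on success:

HOST04*> availtime reset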

Saturday, June 26, 2010

Operations Manager Efficiency Plugin

After the release of DFM 3.8.1, NetApp released a nice little plugin for DFM called the 'Operations Manager Storage Efficiency Dashboard Plugin'. Quite a long name, but it's good: it cleverly uses the DFM database to pull storage utilization and presents the information in a nice flash-based webpage.

It's useful when you have to show higher management the current storage utilization and the savings that came from NetApp thin provisioning, dedupe, FlexClone and other features, and it goes very well with NetApp's storage efficiency mantra. The best part is that after you install the plugin you don't have to do anything, and you can access it from anywhere on the network without installing any software. However, there isn't a simple way to reach the page even once you are inside the OM webpage, as there is no link pointing to the dashboard, so you have to remember the location to access it later, or, for people like me, bookmark it in your browser.

The most common problem with it arises from a lack of foresight in how the plugin was created. Here's what I mean. Usually we install the DFM server on C:\ and move all the perfdata, DB, script folders and other bits and pieces to a different drive for easy backup or, in the case of a cluster, for the clustering setup, and here the script falls apart. The script expects that it is sitting in its default location with the web folder right next to it, and acts accordingly, whereas in the real situation the web folder is on C:\ and the script is on some other volume.

There isn't any way to rectify this behaviour of the script or the web server, as the Apache running on DFM can't be configured to use any folder other than the one sitting inside the installation directory (AFAIK), and no switches are provided in the script to tell it the location of the original web folder where it needs to copy its content.

So, in a nutshell, even though the script executes and copies all the files required for showing the dashboard, it's useless unless you figure out by yourself what's going wrong and why the page is not showing in your browser.

Overcoming this limitation is easy enough for folks in a Unix environment, as creating an alias to the original web folder makes everything work fine, but for Windows folks like me, creating a shortcut doesn't work.

So here’s the way to correct the problem.

Download the plugin from the NOW ToolChest. Extract the zip and edit the file 'package.xml', changing the string "dfmeff.exe" to "dfmeff.bat". Next, create a new batch file called "dfmeff.bat" with the contents below.

@echo off
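rem Run the original plugin binary first, then copy the generated web content
rem into the web folder that the DFM Apache actually serves. The paths below
rem are examples; adjust both to match your own installation.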
D:\DFM\script-plugins\dfmeff\dfmeff.exe
xcopy D:\DFM\web\*.* "C:\Program Files\NetApp\DataFabric\DFM\web" /Q/I/Y/R

Obviously you have to change the paths to match your installation, but once you have created the batch file and added its reference in the xml file, you are good to go: just zip it again using any zip software and use the new zip file as the plugin source for installation in DFM.

Update:
Just noticed a video showing the features of the plugin on the NetApp community site: http://communities.netapp.com/videos/1209

Monday, June 7, 2010

Which is faster, NDMPcopy or vol copy?

After posting my last post I got a few mails asking: of ndmpcopy and vol copy, which one would be faster?

If speed were the only criterion, then vol copy, because it copies blocks directly from disk without going through the file system; however, I think it is best suited to migrating a whole volume (a usage sketch follows its pros and cons below).

Pros
  • CPU usage can be throttled
  • Source volume snapshots can be copied
  • Up to four copy operations can run simultaneously
  • Once started it goes to the background and you can use the console for other purposes 

Cons
  • Destination can't be a root volume
  • Destination volume has to be restricted
  • All data in the destination volume will be overwritten
  • Destination volume size must be equal to or bigger than the source
  • A single file or directory cannot be specified for the copy operation
  • Both volumes must be of the same type, traditional or flexible
  • If data is copied between two filers, each filer must have the other filer's entry in its /etc/hosts.equiv file and the loopback address for itself in its /etc/hosts file 
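
As a rough sketch of a run between two filers, using made-up filer, volume and snapshot names:

filer2> vol restrict vol_dst
filer1> vol copy start -s nightly.0 vol_src filer2:vol_dst
filer1> vol copy status                 (watch the progress)
filer2> vol online vol_dst              (once the copy completes)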

However, for copying data between two filers for a test or any other purpose, ndmpcopy is more suitable because it gives you additional control and fewer restrictions, which is very useful (again, a sketch follows the lists below).

Pros
  • Little or no CPU overhead
  • Incremental copy is supported
  • No limitation on volume size and type
  • No need to take destination volume offline
  • A single file or directory can also be specified
  • No file fragmentation on the destination volume, as all data is copied sequentially from the source volume, so the data layout is improved
  • No configuration is required between the two filers; a username and password are used for authentication

Cons
  • Snapshots can't be copied from the source
  • The console is not available while the copy operation is running, so no multiple ndmpcopy operations
  • If lots of small files have to be copied, the copy operation will be slower 
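
Again a sketch, with made-up host names, paths and credentials; '-sa' and '-da' pass the source and destination authentication mentioned above, and a later incremental pass would add '-l 1':

filer1> ndmpcopy -sa root:password -da root:password filer1:/vol/vol1/projects filer2:/vol/vol2/projects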

So, as you have seen, both work well; however, one can't replace the other, and each has its use for different purposes.

Monday, May 24, 2010

How to copy files in Ontap

As soon as someone asks this question we all say 'use ndmpcopy', but what if you don't have any network adapters configured, will ndmpcopy work?

No. ndmpcopy is very useful if you want to copy a file or a whole volume; however, one thing very few people know is that it doesn't work if you don't have the loopback adapter configured, because ndmpcopy passes all its data through the lo adapter, so it depends not only on lo's availability but also on its speed. So how do you copy the data if lo is not available?

The answer is simple: use dd, just an old-fashioned Unix command that does a lot of things. Not only can it copy a file given the full pathname, you can even use a block number and disk number, and the best part is that the syntax is simple: 'if' for the source and 'of' for the destination.

It can be used not only for copying files around the system; in fact you can use it for testing I/O and for copying files out of a snapshot, and the command works regardless of permissions.

A little note: if you are afraid of going into advanced or diagnostic mode, better keep using rdfile and wrfile, because dd is not available in admin mode, so you have to go into advanced mode to use it.

Here’s the syntax of this command.

dd [ [if=file] | [din=disknum bin=blocknum] ] [ [of=file] | [dout=disknum bout=blocknum] ] count=number_of_blocks

Another note: if you are using count, make sure you use a multiple of 4, because the WAFL block size is 4k.

Example:

sim1> priv set advanced
sim1*> dd if=/vol/vol0/.snapshot/hourly.2/etc/snapmirror.conf of=/vol/vol0/etc/snapmirror.conf1
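
If you only need the beginning of a file, a count-limited copy might look like this; a sketch with made-up file names, keeping count a multiple of 4 per the note above:

sim1*> dd if=/vol/vol0/etc/messages of=/vol/vol0/etc/messages.head count=4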

Friday, April 30, 2010

I2P (inode to pathname) in Ontap


A few days ago I became curious to know what I2P is and what would happen if you turned it off. So I started hunting the NetApp site and Google for information; however, I couldn't find much. As is natural, once you have struck out with the vendor site and Google, you start asking your social network, and then, bingo! I was able to get some information on it.

So, the first question: what the heck is this I2P?
It's a feature in Ontap 7.1 and later that maps each inode number and file name to a relative path, to speed up certain operations. As we all know, every file, directory, soft link, hard link or other metadata file has an inode associated with it, so each inode goes through this process: every file/directory gets 8 bits added to its metadata by Ontap, whereas every hard link takes a 12-byte penalty, and this happens every time you create, delete or rename a file, directory or link.

OK, but why do I need this?
As far as its usage goes, there are some well-known applications for it, like fpolicy, virus scan and file auditing, which need to know the full path of a requested file, while some are Ontap-specific features: in a mixed volume it informs NFS clients of any changes made by CIFS clients. One important place where you will see a difference is the dump command, as having the full path available for each file makes it much faster and more efficient. Some grey areas also exist that Ontap uses for its internal work, but that's all covered deep under NetApp's IP protection policy, so I couldn't get any info on it.

Now, how do you get information about this from your system?
If you look closely you will see that vol options includes an option 'no_i2p' (default off, and only in 7.1 and later) to enable or disable the i2p feature. If you go into advanced mode you can see a few more commands related to it, like 'inodepath', which shows the i2p information stored in a given inode, while the 'wafl scan status' command shows running i2p scans, which can be aborted with 'wafl scan abort scan_id'; you can also change the scan speed with 'wafl scan speed new_speed' after listing the current speed with 'wafl scan speed'. A short sketch of these follows.
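
Here is how those commands fit together in advanced mode; a sketch where the volume name, speed value and scan id are placeholders:

sim1*> vol options vol1 no_i2p on       (turns i2p off for vol1; note the option is negated)
sim1*> wafl scan status                 (list running scans and their ids)
sim1*> wafl scan speed                  (show the current scan speed)
sim1*> wafl scan speed 100              (set a new scan speed)
sim1*> wafl scan abort 1                ('1' stands in for a real scan id)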

Having this information pushed me to think: OK, most of the volumes on my systems are NFS-only, they don't need any virus scan, and we use neither dump nor fpolicy nor any other such feature, so why not turn it off and get some extra juice out of the system? But speaking with the chaps, it turned out that turning it off wouldn't be a good idea, though they were also not sure what would break if you did; and as it has very little performance impact, it is better left untouched.

And yes, it's true it has very little performance impact, as in general we don't do so much metadata modification that the i2p workload would hurt the system. However, when you upgrade a system from an earlier release to the 7.1 or later family, it gets very busy creating i2p information for every file/directory/link and may run at high utilization for quite some time. At that point you may want to use these commands to quickly pull your system back to a normal state: let the scan run at a slow speed, or one volume at a time, or stop it completely if you want to revert the system to a pre-7.1 release.

I think if I get some time I would like to do extensive testing and see what comes out; if anyone else knows more, please share your knowledge.

Wednesday, January 13, 2010

Defragmentation in NetApp


Usually we face this problem with our PCs, and then we defrag our volumes, clear temp files and what not; most of the time that solves the problem, though not fully, but yes, it gets better.

In NetApp we don't have to deal with a fragmented registry or temp files, but due to the nature of the WAFL file system a volume gets fragmented very soon, soon after you start overwriting or start deleting and adding data. So what do you do then?

Well, the answer is very simple: use the 'reallocate' command. Yes, this is NetApp's defrag tool, built right into the Ontap OS.

First you have to turn reallocation on with the 'reallocate on' command; the same command turns it off with the off switch.

It can be used not only on volumes; in fact you can run it on a file, a LUN or an aggregate itself. However, I should warn you that optimizing a LUN may not give you any performance benefit, or may even make things worse, as Ontap has no clue what is inside the LUN or what its file system layout looks like.

If you want to run the reallocation only once, you should use the -f or -o switch; however, if you want Ontap to keep track of your file system and optimize the data when it feels necessary, you should control it with the -i switch or schedule it with the 'reallocate schedule' command.

To check the current optimization level of a volume, use 'reallocate measure -o <pathname>', or if you feel adventurous use 'wafl scan measure_layout <pathname>' in advanced mode, though I don't suggest using the wafl set of commands in general; but yes, sometimes you want to do something different.

This command is pretty straightforward and harmless (except for the extra load on the CPU and disks), so you can play with it, but you should always consider using the -p switch for volumes that have snapshots and/or snapmirror enabled, to keep the snapshot size small. A short sketch follows.
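
Putting the pieces above together; a sketch with a made-up volume name, and check na_reallocate for the exact schedule string format before trusting the last line:

filer> reallocate on
filer> reallocate measure -o /vol/vol1                 (one-time layout check)
filer> reallocate start -f -p /vol/vol1                (one-time full pass, snapshot-friendly)
filer> reallocate schedule -s "0 23 * 6" /vol/vol1     (recurring run, here weekly)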

Saturday, January 9, 2010

How to get the list of domain users added to a filer without fiddling with SIDs

There have been numerous times when I wanted to see an AD user's permissions on a filer, but just locating that user on the system took me a lot of time. Why? Because Ontap shows domain users added to the system in SID format rather than by their names, which is very annoying: when it dumps the SIDs on screen, we have to use the 'cifs lookup' command to hunt through that bunch of SIDs for the user we are looking for.


So here's a handy little Unix one-liner to see the list of all AD users added on a filer in username format rather than as SIDs.


I have already set up password-less login to the filer, so I haven't added the username and password fields; if you haven't done that, add your login credentials after the name of the filer in the command below.


rsh <filername> useradmin domainuser list -g Administrators | sed 's/^S/rsh <filername> cifs lookup S/' | sh


The sed turns each SID returned by 'useradmin domainuser list' into a 'cifs lookup' command, and the trailing 'sh' runs those lookups, so what you get back is the AD users added to the Administrators group (replace <filername> with your filer's name). If you want to see users from any other group, replace 'Administrators' with that group's name.
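
For example, listing a hypothetical 'Backup Operators' group would look like this; note the quoting once the group name contains a space:

rsh <filername> useradmin domainuser list -g 'Backup Operators' | sed 's/^S/rsh <filername> cifs lookup S/' | sh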