My journey with NetApp: pNFS

As is my nature, I was lazing around over the weekend when I noticed the newly published best practice document on pNFS and ONTAP 8.1 Cluster-Mode. I soon realized that this is what c-mode is made for: parallelism. How? Let’s see.

First, what is Parallel NFS, or ‘pNFS’ for short?

pNFS is an extension introduced in NFS version 4.1 that adds support for parallelism to the existing NFS version 4.

Well, that was the shortest answer I could write; here’s a little more detail.

pNFS is part of NFS 4.1, the second minor version of NFS version 4, which adds support for sessions and directory delegations along with parallelism. The idea of using SAN filesystem architecture and parallelism in NFS originated with Gary Grider of Los Alamos National Lab and Lee Ward of Sandia National Lab. It was later presented to the Internet Engineering Task Force (IETF) in a 2004 problem statement by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas, Brent Welch of Panasas, and Peter Corbett of NetApp. In 2005 the NFSv4 working group of the IETF took up the drafts, and in 2006 the work was folded into the 4.1 minor version draft. It is published as RFC 5661, which describes NFS version 4.1 with parallel support, and RFC 5662, which details the protocol definition code. pNFS is not limited to files: support for block data (RFC 5663) and object-based data (RFC 5664) was also added, so it is now possible to access not only file but also object (OSD) and block (FC/iSCSI) based storage over NFS.

pNFS, being an open standard, does not require additional proprietary software or drivers on the client. The different varieties of NFS can therefore coexist, and supported NFS clients can mount the same file system over NFSv3, NFSv4, and NFSv4.1/pNFS.

It is widely accepted by the industry and jointly developed by Panasas, NetApp, Sun, EMC, IBM, UMich/CITI and many more. However, at the time of writing only Fedora 16 with kernel 2.6.39 supports all three layout types (blocks, files and objects), whereas RHEL 6.2 supports only the files layout.

Now the question comes: why was it needed?

We all love NFS for its simplicity. From the time it was designed by Sun in the era of 10Mb Ethernet, it scaled well to 100Mb and then gigabit Ethernet. With the advent of 10Gb and 40Gb links, however, a protocol designed around a single stream could not scale any further. The industry has already tried TOE Ethernet cards, link aggregation and bigger boxes, but that wasn’t sufficient to use the bandwidth and CPU power we have available now. So what was left? Parallel NFS.

So, what’s so different from earlier versions of NFS?

pNFS is not much different from its ancestors; it just separates metadata from data. In traditional NFS versions 3, 4, and 4.1, metadata and data share the same I/O path. With pNFS, metadata and data travel on different I/O paths: the metadata server handles all the metadata activity from the client, while the data servers provide a direct path for data access.

    +-----------+
    |+-----------+                                 +-----------+
    ||+-----------+                                |           |
    |||           |       NFSv4.1 + pNFS           | Metadata  |
    +||  Clients  |<-----------Metadata----------->|  Server   |
     +|           |                                |           |
      +-----------+                                |           |
           |||                                     +-----------+
      Data |||                                           |
           |||                                           |
           ||| Storage        +-----------+              |
           ||| Protocol       |+-----------+             |
           ||+----------------||+-----------+  Control   |
           |+-----------------|||           |    Protocol|
           +------------------+||   Data    |------------+
                               +|  Server   |
                                +-----------+

Figure 1: pNFS Architecture

As a result, in a clustered storage system with multiple nodes you mount only a directory or the root namespace, but you get direct access to data from each node. The metadata server (MDS) handles all non-data traffic such as GETATTR, SETATTR, ACCESS, LOOKUP, and so on. Data servers (DSs) store file data and respond directly to client read and write requests. A control protocol provides synchronization between the metadata server and the data servers.
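To make the split concrete, here is a minimal Python sketch of how a client might decide which server to talk to for a given operation. It is only an illustration, not how any real client is coded: the `route` function and the two sets are hypothetical names, and LAYOUTGET (the NFSv4.1 operation a client uses to ask the metadata server for a layout) is added to the list even though it is not mentioned above. A real client can also fall back to doing reads and writes through the metadata server, which this sketch ignores.

```python
# Illustrative only: which server a pNFS client would contact per operation.
# The sets and the route() helper are hypothetical, not a real client API.
METADATA_OPS = {"GETATTR", "SETATTR", "ACCESS", "LOOKUP", "LAYOUTGET"}
DATA_OPS = {"READ", "WRITE"}


def route(op):
    """Return 'MDS' for metadata operations and 'DS' for data operations."""
    name = op.upper()
    if name in DATA_OPS:
        return "DS"
    if name in METADATA_OPS:
        return "MDS"
    raise ValueError("unknown operation: " + op)


for op in ("LOOKUP", "LAYOUTGET", "READ", "WRITE"):
    print(op, "->", route(op))
```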

Ok, but where does pNFS add value?

Large files and high numbers of concurrent users. In parallel computing on a multi-node cluster, a single job gets divided amongst n nodes, and when the job arrives at the compute nodes they all try to access the same data from one storage location, which soon becomes a bottleneck. With pNFS, multiple storage system nodes respond with parts of the file in parallel, increasing the aggregate bandwidth and lowering the latency. Similarly, when many small files are accessed by a large number of concurrent users, a single storage system can get choked; with pNFS, all the storage system nodes hosting the files share the user load.

Great, but how does all this work?

In principle, pNFS uses the same parallelism as RAID-0, but at a different level. Just as RAID-0 spreads data across multiple disks for faster response, pNFS spreads one file or filesystem across multiple nodes in a clustered storage array for faster response.
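The RAID-0 analogy boils down to simple arithmetic. The sketch below assumes plain round-robin striping with a fixed stripe unit; the `locate` function, the 64 KB stripe unit and the four-node cluster are made-up values for illustration, and a real layout may distribute data differently.

```python
from typing import Tuple


def locate(offset: int, stripe_unit: int, node_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (node index, stripe number).

    Assumes simple round-robin striping, as in RAID-0.
    """
    if offset < 0 or stripe_unit <= 0 or node_count <= 0:
        raise ValueError("offset must be >= 0; stripe_unit and node_count > 0")
    stripe = offset // stripe_unit      # which stripe the byte falls in
    return stripe % node_count, stripe  # stripes rotate across the nodes


STRIPE_UNIT = 64 * 1024  # hypothetical stripe unit in bytes
NODES = 4                # hypothetical number of cluster nodes

for off in (0, 65536, 131072, 262144):
    node, stripe = locate(off, STRIPE_UNIT, NODES)
    print("offset", off, "-> stripe", stripe, "on node", node)
```

With four nodes, consecutive stripes land on nodes 0, 1, 2, 3 and then wrap back to node 0, so a large sequential read keeps every node busy at once.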

For example, when a client sends a read request for a file to the storage system, the storage system replies with the file metadata along with the layout details: node addresses, data location, and striping information. From the layout, the client knows which cluster nodes hold parts of the file. The client then contacts all those data nodes directly and simultaneously, and each node replies with the parts of the file it holds, which the client then assembles.
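The read flow above can be simulated in a few lines of Python. This is a toy in-memory model under stated assumptions, not a real NFS client: the `DataServer`, `MetadataServer` and `pnfs_read` names are hypothetical, the layout is just a list of (stripe number, server index) pairs, and the 4-byte stripe unit is chosen only to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_UNIT = 4  # bytes; deliberately tiny, purely for illustration


class DataServer:
    """Toy data server: holds stripes keyed by (filename, stripe number)."""

    def __init__(self, name):
        self.name = name
        self.stripes = {}

    def read(self, filename, stripe_no):
        return self.stripes[(filename, stripe_no)]


class MetadataServer:
    """Toy metadata server: places stripes round-robin and hands out layouts."""

    def __init__(self, servers):
        self.servers = servers
        self.layouts = {}  # filename -> list of (stripe number, server index)

    def store(self, filename, data):
        layout = []
        for stripe_no, start in enumerate(range(0, len(data), STRIPE_UNIT)):
            idx = stripe_no % len(self.servers)
            chunk = data[start:start + STRIPE_UNIT]
            self.servers[idx].stripes[(filename, stripe_no)] = chunk
            layout.append((stripe_no, idx))
        self.layouts[filename] = layout

    def get_layout(self, filename):
        return self.layouts[filename]


def pnfs_read(mds, servers, filename):
    """Fetch the layout from the MDS, then read all stripes in parallel."""
    layout = mds.get_layout(filename)  # metadata path

    def fetch(entry):
        stripe_no, idx = entry
        return stripe_no, servers[idx].read(filename, stripe_no)  # data path

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        parts = list(pool.map(fetch, layout))
    # Reassemble the stripes in file order.
    return b"".join(chunk for _, chunk in sorted(parts))


servers = [DataServer("node%d" % i) for i in range(3)]
mds = MetadataServer(servers)
payload = b"parallel NFS spreads data across nodes"
mds.store("demo.txt", payload)
result = pnfs_read(mds, servers, "demo.txt")
print(result == payload)
```

The thread pool stands in for the client talking to several data nodes at the same time; only the layout lookup touches the metadata server.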

I think that’s enough for now; the next post will cover NetApp’s pNFS implementation in detail.

Further reading for scholars:

http://www.pnfs.com/

http://datatracker.ietf.org/wg/nfsv4/

http://www.pdl.cmu.edu/pNFS/index.shtml

http://www.citi.umich.edu/projects/asci/pnfs/linux/

http://www.pnfs.com/docs/LISA-11-pNFS-BoF-final.pdf

http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-00.txt

http://www.pdl.cmu.edu/pNFS/archive/gibson-pnfs-problem-statement.html