As is my nature, I was lazing around over the weekend when I noticed the newly published best-practice document on pNFS and ONTAP 8.1 Cluster-Mode. I soon realized this is what c-mode is made for: parallelism. How? Let's see.
First, let's start with the basics: what is Parallel NFS, or 'pNFS' for short?
pNFS is an extension introduced in NFS version 4.1 that adds support for parallelism to the existing NFS version 4 protocol. Well, that was the shortest answer I could write; here's a little more detail on it.
pNFS is part of NFS 4.1, the minor version of NFS version 4 that adds support for sessions and directory delegation, along with parallelism. The idea of bringing SAN file system architecture and parallelism to NFS originated with Gary Grider of Los Alamos National Lab and Lee Ward of Sandia National Lab; it was later presented in a problem statement to the Internet Engineering Task Force (IETF) in 2004 by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas, Brent Welch of Panasas, and Peter Corbett of NetApp. In 2005 the NFSv4 working group of the IETF took up the drafts, and in 2006 the work was folded into the 4.1 minor version draft. It is published under RFC 5661, which describes NFS version 4.1 with parallel support, and RFC 5662, which contains the protocol definition (XDR description). pNFS is not limited to files either: support for block data (RFC 5663) and object-based data (RFC 5664) has also been added, so it's now possible to access not only file-based but also object-based (OSD) and block-based (FC/iSCSI) storage over NFS.
pNFS, being an open standard, does not require additional proprietary software or drivers on the client to enable it. Therefore, the different varieties of NFS can coexist at the same time, and supported NFS clients can mount the same file system over NFSv3, NFSv4, and NFSv4.1/pNFS.
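Just to illustrate that coexistence, here's a tiny sketch of how a Linux client might mount the same export three ways; the filer name, export path, and mount points are made up for the example, and in real life you'd simply run the equivalent mount commands from a shell.

# Illustrative sketch only: mount the same (hypothetical) export over
# NFSv3, NFSv4, and NFSv4.1 (which is where pNFS lives).
# Filer name, export path, and mount points are made up for this example.
import subprocess

EXPORT = "filer1:/vol/data"          # hypothetical clustered ONTAP export

mounts = [
    ("vers=3",   "/mnt/data_v3"),    # classic NFSv3
    ("vers=4",   "/mnt/data_v4"),    # NFSv4.0
    ("vers=4.1", "/mnt/data_pnfs"),  # NFSv4.1; pNFS is used if the server offers it
]

for opts, mountpoint in mounts:
    # Same export, three protocol versions side by side.
    subprocess.run(["mount", "-t", "nfs", "-o", opts, EXPORT, mountpoint],
                   check=True)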
It is widely accepted by the industry and has been jointly developed by Panasas, NetApp, Sun, EMC, IBM, UMich/CITI, and many more; however, at the time of writing only Fedora 16 with kernel 2.6.39 supports all three layout types (blocks, files, and objects), whereas RHEL 6.2 supports only the files layout.
Now the question comes: what was the need for it?
We all love NFS for its simplicity, and from the time it was designed by Sun in the era of 10Mb Ethernet it scaled well to 100Mb and then Gigabit Ethernet; however, with the advent of 10Gb and 40Gb links, a protocol designed around a single stream couldn't scale any further. The industry had already tried TOE Ethernet cards, link aggregation, and bigger boxes, but that wasn't sufficient to utilize the bandwidth and CPU power we have available now. So what was left to deal with it? Parallel NFS.
So, what's so different from earlier versions of NFS?
pNFS is not much different from its ancestors; it just separates metadata from data. Unlike traditional NFS versions 3, 4, and 4.1, where metadata and data share the same I/O path, with pNFS metadata and data travel on different I/O paths. A metadata server handles all the metadata activity from the client, while the data servers provide a direct path for data access.
    +-----------+
    |+-----------+                                 +-----------+
    ||+-----------+                                |           |
    |||           |         NFSv4.1 + pNFS         | Metadata  |
    +||  Clients  |<----------Metadata------------>|  Server   |
     +|           |                                |           |
      +-----------+                                +-----------+
           |||                                           |
           |||                                           |
           ||| Storage        +-----------+              |
           ||| Protocol       |+-----------+             |
           ||+----------------||+-----------+  Control   |
           |+-----------------|||           |  Protocol  |
           +------------------+||   Data    |------------+
                               +|  Server   |
                                +-----------+

                     Figure 1: pNFS Architecture
As a result of this, in a clustered storage system with multiple nodes you mount only a directory or root namespace, yet you get direct access to data from each node. The metadata server (MDS) handles all non-data traffic such as GETATTRs, SETATTRs, ACCESS, LOOKUPs, and so on. Data servers (DSs) store file data and respond directly to client read and write requests. A control protocol is used to provide synchronization between the metadata server and the data servers.
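To make the split concrete, here's a small sketch of my own (not how any real client is coded) showing which role conceptually serves which NFSv4.1 operation.

# Conceptual illustration of the pNFS metadata/data split.
# The operation names come from the NFSv4.1 spec; the routing logic and
# names below are a made-up sketch for this post, not real client code.

METADATA_OPS = {"GETATTR", "SETATTR", "ACCESS", "LOOKUP", "OPEN", "CLOSE"}
DATA_OPS = {"READ", "WRITE"}

def route(op: str) -> str:
    """Return which server role conceptually handles a given operation."""
    if op in METADATA_OPS:
        return "metadata server (MDS)"
    if op in DATA_OPS:
        return "data server (DS) named in the file's layout"
    return "MDS (default for anything not covered by a layout)"

for op in ["LOOKUP", "GETATTR", "READ", "WRITE"]:
    print(f"{op:8s} -> {route(op)}")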
Ok, but where does
pNFS add value?
Large files and a high number of concurrent users. With the advent of parallel computing on multi-node clusters, a single job gets divided amongst n nodes, and when the job arrives at the computational nodes they all try to access the same data from one storage location, which soon becomes a bottleneck. With pNFS, multiple storage system nodes respond with parts of the file in parallel, increasing the aggregate bandwidth and lowering the latency. Likewise, when many small files are accessed by a large number of concurrent users, a single storage system can get choked, whereas with pNFS all the storage system nodes hosting the files share the user load.
Great, but how does all this work?
In principle, pNFS uses the same kind of parallelism as RAID-0, just at a different level. Just as in RAID-0 data is striped across multiple disks for faster response, in pNFS a file or file system is spread across multiple nodes of a clustered storage array for faster response.
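To stretch the analogy a bit, here's a minimal sketch of RAID-0-style striping arithmetic applied to cluster nodes; the 1 MiB stripe unit and the four node names are arbitrary values I picked for the example, not anything mandated by pNFS or ONTAP.

# Illustrative striping arithmetic only: map a byte offset in a file to
# the cluster node that would hold it under simple RAID-0-style striping.
# STRIPE_UNIT and NODES are arbitrary example values.

STRIPE_UNIT = 1 << 20        # 1 MiB per stripe unit (example value)
NODES = ["node1", "node2", "node3", "node4"]

def node_for_offset(offset: int) -> str:
    """Which node holds the stripe unit containing this byte offset?"""
    stripe_index = offset // STRIPE_UNIT
    return NODES[stripe_index % len(NODES)]

for off in (0, 1 << 20, 5 << 20, 42 << 20):
    print(f"offset {off:>10d} -> {node_for_offset(off)}")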
For example, when a client sends a read request for a file to the storage system, the storage system replies with the file metadata along with the layout details, which describe the node addresses, data locations, and striping information. After getting the layout, the client knows the list of cluster nodes holding parts of the file. The client then contacts all of those data nodes simultaneously, each node replies with the parts of the file it has, and the client assembles them.
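Conceptually, the client-side flow looks something like the sketch below. This is only a pseudo-client to show the idea (a real pNFS client lives in the kernel and speaks NFSv4.1 LAYOUTGET and READ on the wire); get_layout() and read_stripe() are hypothetical placeholders, not real APIs.

# Conceptual sketch of a pNFS read, not a real client: fetch the layout
# from the metadata server, read stripes from the data servers in
# parallel, then reassemble the file in order.
# get_layout() and read_stripe() are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def get_layout(mds, path):
    """Pretend LAYOUTGET: returns [(data_server, offset, length), ...]."""
    raise NotImplementedError("placeholder for the real NFSv4.1 exchange")

def read_stripe(data_server, path, offset, length) -> bytes:
    """Pretend READ sent directly to one data server."""
    raise NotImplementedError("placeholder for the real NFSv4.1 exchange")

def pnfs_read(mds, path) -> bytes:
    layout = get_layout(mds, path)          # metadata path: MDS only
    with ThreadPoolExecutor() as pool:      # data path: all DSs at once
        futures = [pool.submit(read_stripe, ds, path, off, length)
                   for ds, off, length in layout]
    # Reassemble the stripes in layout order into one byte string.
    return b"".join(f.result() for f in futures)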
I think that's enough for now; the next post will detail NetApp's implementation of pNFS.