Distributed File Systems

  • Sun Network File System (NFS)
  • Server crash recovery
    • design of network protocol

Distributed Systems

  • Client/Server
    • One Server
  • Replicated Servers
    • Many servers

How Is It Different from a “Local” System?

  • machines crash
  • network loses packets
  • performance: latency, bandwidth
  • resource sharing policies

NFS

  • Basics
  • Protocol
  • from protocol to FS API
  • idempotency: key to failure handling
  • performance: caching

Server Crashes: How to Handle?

  • lead to unavailability
  • key idea: when there is a problem => retry
  • File handle has 3 parts:
    • <volume #, inode #, generation #>
    • volume: which fs?
    • inode: which file?
    • generation: incremented when the inode # is reused after a delete, so a handle to the old file can be detected as stale
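
To make the role of the generation number concrete, here is a minimal sketch in C. The struct layout and the names `file_handle`, `inode_slot`, `fh_is_current`, `slot_delete`, and `slot_reuse` are assumptions made for illustration; a real NFS file handle is opaque to the client, and the server's actual bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the three-part handle from the notes. */
typedef struct {
    uint32_t volume;      /* which file system */
    uint32_t inode;       /* which file within that file system */
    uint32_t generation;  /* distinguishes successive files using one inode # */
} file_handle;

/* Illustrative server-side record for one inode slot. */
typedef struct {
    uint32_t generation;
    bool     in_use;
} inode_slot;

/* A handle is usable only if the slot is live and the generations match. */
bool fh_is_current(const file_handle *fh, const inode_slot *slot) {
    assert(fh != NULL && slot != NULL);
    return slot->in_use && slot->generation == fh->generation;
}

/* Deleting a file frees the slot and bumps the generation. */
void slot_delete(inode_slot *slot) {
    assert(slot != NULL);
    slot->in_use = false;
    slot->generation++;
}

/* Reusing the slot for a new file hands out a handle with the new generation. */
file_handle slot_reuse(inode_slot *slot, uint32_t volume, uint32_t inode) {
    assert(slot != NULL);
    slot->in_use = true;
    file_handle fh = { volume, inode, slot->generation };
    return fh;
}
```

A client still holding a handle from before the delete fails the `fh_is_current` check even though the volume and inode numbers are unchanged, which is the case the generation number exists to catch.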

Protocol:

  • each request has all the info needed to complete the operation (“statelessness”)

NFS Protocol

  • read(file handle, offset, size)
    • return error code, data
  • write(fh, data, offset, size)
    • return error code
  • create(parent file handle, name)
  • lookup(parent fh, name)
    • return file handle of name
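
The read and write calls above can be sketched as a toy in-memory server in C. Everything here is an assumption for illustration: `toy_file` stands in for whatever the file handle refers to, and `toy_read`, `toy_write`, and the `TOY_*` codes are made-up names, not real NFS status values. The point is that the offset arrives with every request, so the server keeps no per-client file position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FILE_CAP 64

/* Toy server-side file: bytes plus a length (hypothetical). */
typedef struct {
    char   data[FILE_CAP];
    size_t len;
} toy_file;

/* Status codes for this sketch only. */
enum { TOY_OK = 0, TOY_ERR_RANGE = -1 };

/* read(fh, offset, size): returns bytes copied, or a negative error code.
 * The server remembers nothing between calls; the request says where to read. */
long toy_read(const toy_file *f, size_t offset, size_t size, char *out) {
    assert(f != NULL && out != NULL);
    if (offset > f->len) return TOY_ERR_RANGE;
    size_t n = f->len - offset;
    if (n > size) n = size;
    memcpy(out, f->data + offset, n);
    return (long)n;
}

/* write(fh, data, offset, size): returns an error code. */
int toy_write(toy_file *f, const char *data, size_t offset, size_t size) {
    assert(f != NULL && data != NULL);
    if (offset > FILE_CAP || size > FILE_CAP - offset) return TOY_ERR_RANGE;
    /* Zero-fill any gap between the old end of file and the write offset. */
    if (offset > f->len) memset(f->data + f->len, 0, offset - f->len);
    memcpy(f->data + offset, data, size);
    if (offset + size > f->len) f->len = offset + size;
    return TOY_OK;
}
```

Because each call names an explicit offset, repeating the same `toy_write` leaves the file in the same state as issuing it once.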

Example

  • open file + read it
  • int fd = open("/a/b.txt", O_RDONLY);
    read(fd, buffer, size);
    
  • assume: client has root directory file handle
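  • open:
    • lookup(root fh, "a")
      => a's fh
    • lookup(a's fh, "b.txt")
      => b.txt's fh
    • return fd (client maps the fd to b.txt's fh)
  • read:
    • read(b.txt's fh, offset, size)
      => data (client tracks the offset)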
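
The component-by-component lookup can be sketched in C as follows. This is a toy under stated assumptions: handles are small integers, the server's directories are a fixed table, and `toy_lookup` and `toy_resolve` are hypothetical names standing in for the lookup RPC and the client-side path walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy directory table: (parent handle, name) -> child handle.
 * Handles are small integers here; real NFS handles are opaque. */
typedef struct {
    int         parent;
    const char *name;
    int         child;
} dir_entry;

enum { ROOT_FH = 1, NO_FH = -1 };

static const dir_entry TABLE[] = {
    { ROOT_FH, "a",     2 },
    { 2,       "b.txt", 3 },
};

/* lookup(parent fh, name): one simulated request to the server. */
int toy_lookup(int parent_fh, const char *name) {
    for (size_t i = 0; i < sizeof TABLE / sizeof TABLE[0]; i++)
        if (TABLE[i].parent == parent_fh && strcmp(TABLE[i].name, name) == 0)
            return TABLE[i].child;
    return NO_FH;
}

/* Client side: resolve an absolute path one component at a time,
 * starting from the root handle the client is assumed to hold. */
int toy_resolve(const char *path) {
    assert(path != NULL);
    char copy[256];
    if (strlen(path) >= sizeof copy) return NO_FH;
    strcpy(copy, path);               /* strtok modifies its input */
    int fh = ROOT_FH;
    for (char *part = strtok(copy, "/"); part != NULL && fh != NO_FH;
         part = strtok(NULL, "/"))
        fh = toy_lookup(fh, part);
    return fh;
}
```

Calling `toy_resolve("/a/b.txt")` issues two lookups, one per path component, and ends with the handle for b.txt.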
    

Crashes

  • Client request lost
  • Server reply lost
  • Server down
  • Uniform approach
    • timeout, retry (wait for a little while)
    • Property: idempotency
      • doing an operation N times has the same effect as doing it once
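
A small C sketch of timeout-and-retry over an idempotent operation. The simulated server, the lost-reply counter, and the names `toy_server`, `send_write_once`, and `write_with_retry` are all assumptions for illustration; no real network or timer is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated server state: one value that a write request overwrites. */
typedef struct {
    int value;          /* current contents */
    int applied;        /* how many times the server executed the request */
    int drop_replies;   /* number of replies the network will lose */
} toy_server;

enum { RPC_OK = 0, RPC_TIMEOUT = -1 };

/* One attempt: the server applies the write, but the reply may be lost.
 * To the client a lost reply looks the same as a lost request or a dead server. */
int send_write_once(toy_server *s, int new_value) {
    assert(s != NULL);
    s->value = new_value;   /* overwrite: applying it again changes nothing */
    s->applied++;
    if (s->drop_replies > 0) {
        s->drop_replies--;
        return RPC_TIMEOUT;
    }
    return RPC_OK;
}

/* Uniform recovery: on timeout, send the same request again. */
int write_with_retry(toy_server *s, int new_value, int max_attempts) {
    assert(s != NULL && max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (send_write_once(s, new_value) == RPC_OK)
            return RPC_OK;
        /* A real client would wait for a timeout interval before retrying. */
    }
    return RPC_TIMEOUT;
}
```

With two lost replies the server executes the write three times, yet the stored value is the same as if it had run once. An operation such as "append" or "increment" would not have this property, and blind retry would corrupt the result.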

Cache

  • Problems
    • Staleness
    • Visibility
  • Example - Staleness
    • T=1 => C1 reads A and places it in its cache
    • T=2 => C2 writes A'
    • T=3 => C1 reads A from its cache? (stale: the server now has A')
  • Example - Visibility
    • C1 buffers its write of A' locally (nobody else knows about it)
  • Solution - Staleness
    • Check with the server whether the data has changed before using the cached version
  • Solution - Visibility
    • “flush on close”
      • all dirty data written to server on close
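
Both fixes can be sketched together in C. This is a toy under stated assumptions: a per-block version counter stands in for whatever attribute the client compares with the server, and `toy_server_block`, `toy_cache`, `cached_read`, `cached_write`, and `cached_close` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Server copy of one block plus a version that changes on every write. */
typedef struct {
    int      data;
    unsigned version;
} toy_server_block;

/* Client cache entry for that block. */
typedef struct {
    int      data;
    unsigned version;   /* server version when the copy was fetched */
    bool     valid;
    bool     dirty;     /* written locally, not yet sent to the server */
} toy_cache;

/* Staleness fix: validate the cached copy against the server before use. */
int cached_read(toy_cache *c, const toy_server_block *srv) {
    assert(c != NULL && srv != NULL);
    if (c->dirty) return c->data;   /* our own unflushed write wins locally */
    if (!c->valid || c->version != srv->version) {
        c->data = srv->data;        /* refetch: cached copy missing or stale */
        c->version = srv->version;
        c->valid = true;
    }
    return c->data;
}

/* Writes are buffered locally, so other clients cannot see them yet. */
void cached_write(toy_cache *c, int value) {
    assert(c != NULL);
    c->data = value;
    c->valid = true;
    c->dirty = true;
}

/* Visibility fix: flush dirty data to the server on close. */
void cached_close(toy_cache *c, toy_server_block *srv) {
    assert(c != NULL && srv != NULL);
    if (c->dirty) {
        srv->data = c->data;
        srv->version++;
        c->version = srv->version;
        c->dirty = false;
    }
}
```

In this sketch a second client keeps seeing the old value until the writer calls `cached_close`; after the flush, its next `cached_read` notices the version change and refetches.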