Hello!

CS537 11/30

Preview FS API open, read, write, close, link, unlink, mkdir, rmdir, etc. Implementation Simple Implementation On-disk data structures Access methods --------------------------------------------- | | D A T A | -------------------------------------------- close up first part: -------------------------------------------------------------- |Super block|Inode BM|Data BM| I N O D E S | -------------------------------------------------------------- math: true BM = bitmap Superblock = meta-info about FS Inode: (~256 bytes) ------------- | type | | size | | #of blocks| (size smaller than block size, sparse file) | owner | | permission| | addresses | | of data blocks (direct pointer) ------------- Questions: Access?...

CS537 11/9

Brief Memory Volatile, not persistent Poweroff => info lost Persistence ----- -------- | CPU | <----------> | Memory | ----- | -------- ------ math: true net <---- | I/O |------> Hard drive, PCI | Chip | SATA SSD ------ | | USB mice, keyboards Today: Device interaction (I/O) How? Efficient? Specific Device: Hard Drive Future: Device up => OS: File Systems Generic Device: Hardware interface (generalization) Command Register (What to do) Data Register Status register How does OS read and write these?...

CS537 12/14

Today Flash-based SSD Solid State => Circuitry persistent storage basic characteristics from “row” flash => storage device Basics NAND Flash “trap” charge, encode bit(s) Bank/Chip operations read (page) (~2KB, 4KB) --------------------------------- | |page| | | --------------------------------- | block | erase (block) (~256KB) program (page) to write erase entire block then, you can program each page within block exactly once if you want to overwrite, must erase/prog again Performance read 10s of microseconds much faster then hard drive erase a few milliseconds program 100s of microseconds Reliability “wear out” once you erase/program block “too many” times it stops working “to many”: 10k ~ 100k times SSD: Block-level Device API: read, write block Map to read, erase/program Parallel storage device |------------------------ | ------------ ------ | | |Controller| |DRAM| | | ------------ ------ | | --- --- | | |F| |F| ....

CS537 12/2

Today File Systems implementation Static on-disk structures ---------------------------------- | (1) | inode | Data | ---------------------------------- (1) superblock, inode bitmap, data bitmap Performance Problem Example: new file creation (1 block) creat("/foo") Reads Writes x x inode bitmap x x data bitmap x x inode (foo) x x inode (root dir) x x data (root dir) x data (foo) lots of small, random I/O (Bad) Solutions caching (reads) write buffering (writes) doesn’t remove write in general Dynamic: Log-structured File System Disk --------------------------------------- |write|write|write|....

CS537 12/7

Today Crashes What? Why for FS? Crash (Unexpected interruption) Power loss OS bug hard reset Example file append -> file exists add block to it File System Image --------------------------------------------------- |S|IB|DB| inodes | Data blocks | --------------------------------------------------- S: Super block IB: Inode bitmap DB: Data bitmap Need to change: DB, Inode, Data block Possible write ordering: ------> D(1) I(2) DB D(3) DB(4) I I(5) DB(6) D I D DB DB I D DB D I lose data inconsistency lose data inconsistency (Space leak) inconsistency consistent (garbage) Solution Lazy: File system checker Eager: Write-ahead loggin (journaling) File System Checker (fsck) run checker after reboot (crash) scan entire file system -> SLOWt find inconsistencies fix them Write-ahead Logging (WAL) [Journaling] Idea: Before update, record some info Then do update If crash use info to recover -------------------------------------------- |S|Journal|IB|DB| inodes | Data blocks | -------------------------------------------- Protocol for update Assumptions: 512-bytes sector write is atomic (all or none) if issue many writes, happen in any order Example: File append (update DB, I, D) (memory) | -------------------------------------------------- (disk) | Journal Tb|DB|I|D|Te Tb: transaction begin Te: transaction end Split step 1 1a write Tb, contents (wait to complete) 1b write Te (512 bytes) Step 2 update in place (checkpointing) Problem: too slow (1/2 speed for data intensive workloads) Metadata-only Journaling write data only once how/when to write data