- Values for each single column are stored contiguously
- CPU faster than IO => Use CPU to save disk bandwidth
- Compress data (e.g. 1 => Wisconsin, 2 => Texas)
- Densepack values (e.g. pack N values each K bits into N*K bits)
- C-Store physically stores a collection of columns, each sorted on some attribute
- can store overlapping projections => improves performance and redundancy

- Insert: Write to WS, batch move to RS
- Delete: marked in RS, purge later
- Update: insert + delete
- Read: historical mode
- query select a timestamp
- return the correct answer as of that timestamp
Data Model #
- Logically still tables
- Store projections
- Tuples in a projection are stored column-wise
- sorted on the same sort key
- Horizontally partitioned => each segment is associated with a key range

- When Query:
- Join multiple segments

- Key: Storage key
- RS: index in the column
- WS: stored as int, larger than the largest in RS
- SID: Segment ID
- Colocate join-index with EMP3(Sender) and partitioned in the same way
RS #
- Use multiple different encoding tricks
- Storage key is not stored (Index)
WS #
- No encoding
- Store storage key
Update #
- An insert is represented as a collection of new objects in WS
- All inserts corresponding to a single logical record have the same storage key
- Keys in the WS will be consistent with RS storage keys because we set the initial value of this counter to be one larger than the largest key in RS
Snapshot Isolation
Insertion vector #
- contains for each record the epoch in which the record was inserted
Deleted record vector #
- contains the delete time for each tuple
Tuple Mover #
- Find segment with modify time at or before LWM (Low Water Mark lowest time user can execute queries)
- if deleted => discard
- else => move to RS
- Write to new RS’ segment
Questions #
- Type 3 encoding save space, not int
- 1:1 mapping between RS and WS?
comments powered by