LeanStore

Related Work Anti-Caching Mark each tuple as cold or hot, move cold tuples to cold storage 8 byte overhead per tuple cannot move indexes to cold storage Siberia Classify each tuple as cold or hot Indexes only cover hot tuples Bloom filter to access cold tuples complex Swapping No overhead (hardware support) database lose control Hardware-assisted Access Tracking Modify OS kernel => still has control over page eviction Cannot handle index structures Building Blocks Pointer Swizzling Traditional buffer manager: always use page identifier lookup hash table to find page Swizzling Use virtual address directly if in memory Use page identifier for disk Efficient Page Replacement Traditional: LRU, Second chance New: Randomly select 10% of page, put into a queue, mark as cooling If access, back to hot If reach to queue front, evict Leanstore Use a global lock to protect cooling and in-flight I/O data structures FIFO Queue: Might be a FIFO linked list, hash table points to entry in queue I/O: acquire global lock -> acquire OS I/O frame lock -> release global lcok Traditional: Lock per page Reading page acquire a lock, so other thread will not swap it out New epoch-based: Global epoch regularly increase thread set local epoch to global before reading Check min(all local epoch) before evicting a page Comments HDD/SSD

September 17, 2022

Buffer Managment

Buffer Management Buffer Management Strategies (berkeley.edu) Algorithm Domain Separation Algorithms Categorize pages into different domains e.g. one domain for non-leaf b-tree one for leaf Limitation Static: Can’t dynamic adjust domain, (importance might be different for different queries) Split into 2 domains a new domain A page and all page in domain A might be more important than any page in domain B, yet it will replace a page in domain A Does not prevent interference among competing users Need separate mechanism to prevent thrashing “New” Algorithm Buffer pool is subdivided and allocated on a per-relation basis Each active relation assigned a resident set that is initially empty, resident sets of relations are linked in a priority list....

September 15, 2022

Radix Join

Memory access cost Simulating simple scan Linear scan memory, ptr += stride size every time If stride cannot fit in cache => slower, worse for newer machine Data Structures Goal: Reduce stride width Solution: Vertical Fragmentation Algorithm Select Hash table -> Cache unfriendly B-Tree -> Better GroupBy/Join Sort/merge -> Random ops in sort -> Cache unfriendly Hash -> need to keep hash table in L2 cache Radix join radix-join(L, R, H [how many cluster]): radix-cluster(L, H) radix-cluster(R, H) FOREACH cluster IN [1....

September 11, 2022

Join Processing

Background Large memory (~$\sqrt{|Relation blocks|}$) Uniprocessor Notation: Relation: $R$ , $S$, $|R| \leq |S|$ Memory: $M$ Fudge factor $F$, (A hash table of $R$ occupies $|R| \times F$ blocks) Algorithms Sort-Merge Join Produce sorted runs of $R$ & $S$ Do $n$-way merge, output join result while merging (Choose $n$ to be as big as possible, ideally n == number of runs) Hash Based Hash smaller relation $R$ on the joining attr Scan $S$ and find match Need to partition if hash table of $R$ does not fit in memory Simple hash join algorithm Choose a range of hash value Build hash table for $R$ if it hash into the range Scan $S$ and output matched Choose another range and repeat GRACE Partition hash values so that $R$ will be partition into equal size (each approx main memory size) Scan $R$ and partition into equal size Scan $S$ and partition (probably not equal size) for each $i$, build hash table for $R_i$ , scan $S_i$ and output matches Hybrid Select $B$ (Approx to the number of steps for simple hashing), parition hash values so that $R$ will be splitted into $B$ equal size partition Scan $R$, if belongs to first partition, build hash table, else write to disk (obviously buffer first) Scan $S$, if belongs to first partition, probe hash table and output if match, else write to disk for each $i$, build hash table for $R_i$ , scan $S_i$ and output matches Partition Overflow Hash table of a single partition too large to fit in memory Partition overflow on disk Scan big partition and partition again Partition overflow in memory Reassign some hash values to other partition Memory management Issue with all real memory Memory manager cannot assign appropriate memory: To low => slow To high => new process can’t run Hotset + virtual memory Issue: LRU policy does not play well with join operations To-be-scan partition might be swap out before the partition which were just completed Solution: Mark page as throw immediately Question A run will be 2 * | M | blocks long

September 10, 2022