Radix Join

Memory access cost
- Simulating a simple scan: linearly scan memory, advancing `ptr += stride` each step
- If the stride cannot fit in cache => slower; the penalty is worse on newer machines

Data structures
- Goal: reduce the stride width
- Solution: vertical fragmentation

Algorithm selection
- Hash table -> cache unfriendly
- B-Tree -> better for GroupBy/Join
- Sort/merge -> random ops during the sort -> cache unfriendly
- Hash -> need to keep the hash table in the L2 cache

Radix join

```
radix-join(L, R, H):        -- H: how many clusters
    radix-cluster(L, H)
    radix-cluster(R, H)
    FOREACH cluster IN [1....
```
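The pseudocode above is cut off in the source note. A minimal sketch of the idea it describes — cluster both inputs on the low radix bits of the key, then hash-join matching clusters pairwise so each build table stays small — might look like this (one clustering pass; names and tuple layout are illustrative, not from the note):

```python
def radix_cluster(rel, num_bits):
    """Partition (key, payload) tuples into 2**num_bits clusters by the key's low radix bits."""
    clusters = [[] for _ in range(1 << num_bits)]
    mask = (1 << num_bits) - 1
    for key, payload in rel:
        clusters[key & mask].append((key, payload))
    return clusters

def radix_join(left, right, num_bits):
    """Cluster both inputs, then hash-join corresponding clusters one pair at a time."""
    l_clusters = radix_cluster(left, num_bits)
    r_clusters = radix_cluster(right, num_bits)
    out = []
    for lc, rc in zip(l_clusters, r_clusters):
        table = {}
        for key, payload in lc:        # build on one (hopefully cache-resident) cluster
            table.setdefault(key, []).append(payload)
        for key, payload in rc:        # probe with the matching cluster only
            for l_payload in table.get(key, []):
                out.append((key, l_payload, payload))
    return out
```

Because tuples with different low bits can never join, only matching cluster pairs need to be compared, which is what keeps each build table inside the cache.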

September 11, 2022

Join Processing

Background
- Large memory (~$\sqrt{|Relation blocks|}$)
- Uniprocessor
- Notation: relations $R$, $S$ with $|R| \leq |S|$; memory $M$; fudge factor $F$ (a hash table of $R$ occupies $|R| \times F$ blocks)

Algorithms
- Sort-Merge Join
  - Produce sorted runs of $R$ and $S$
  - Do an $n$-way merge, outputting join results while merging (choose $n$ as big as possible, ideally $n$ = number of runs)
- Hash based
  - Hash the smaller relation $R$ on the join attribute
  - Scan $S$ and find matches
  - Need to partition if the hash table of $R$ does not fit in memory
  - Simple hash join
    - Choose a range of hash values
    - Build a hash table for the tuples of $R$ that hash into the range
    - Scan $S$ and output matches
    - Choose another range and repeat
  - GRACE
    - Partition the hash values so that $R$ is split into equal-size partitions (each approximately main-memory size)
    - Scan $R$ and partition it into those equal sizes
    - Scan $S$ and partition it (probably not equal sizes)
    - For each $i$, build a hash table for $R_i$, scan $S_i$, and output matches
  - Hybrid
    - Select $B$ (approximately the number of passes simple hashing would take); partition the hash values so that $R$ is split into $B$ equal-size partitions
    - Scan $R$: if a tuple belongs to the first partition, add it to an in-memory hash table; else write it to disk (buffered, obviously)
    - Scan $S$: if a tuple belongs to the first partition, probe the hash table and output matches; else write it to disk
    - For each remaining $i$, build a hash table for $R_i$, scan $S_i$, and output matches

Partition overflow
- The hash table of a single partition is too large to fit in memory
- Overflow on disk: scan the big partition and partition it again
- Overflow in memory: reassign some hash values to other partitions

Memory management
- Issue with all-real memory: the memory manager cannot assign the appropriate amount — too low => slow; too high => new processes can't run
- Hotset + virtual memory
  - Issue: the LRU policy does not play well with join operations — a to-be-scanned partition might be swapped out before the partition that was just completed
  - Solution: mark finished pages as "throw immediately" so they are evicted first

Question
- A run will be $2 \times |M|$ blocks long (this is the expected run length from replacement selection)
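The hybrid hash join steps above can be sketched as follows. This is a toy illustration, not the paper's implementation: plain Python lists stand in for disk files, `B` is the partition count, and partition 0 is the one kept in memory during the first pass.

```python
from collections import defaultdict

def hybrid_hash_join(R, S, B, key=lambda t: t[0]):
    """Hybrid hash join sketch: partition 0 of R is joined in memory during
    the first pass; partitions 1..B-1 are 'spilled' and joined later."""
    part = lambda t: hash(key(t)) % B
    # Pass over R: build the in-memory table for partition 0, spill the rest.
    table = defaultdict(list)
    r_spill = [[] for _ in range(B)]
    for r in R:
        if part(r) == 0:
            table[key(r)].append(r)
        else:
            r_spill[part(r)].append(r)
    # Pass over S: probe partition 0 immediately, spill the rest.
    out = []
    s_spill = [[] for _ in range(B)]
    for s in S:
        if part(s) == 0:
            out.extend((r, s) for r in table[key(s)])
        else:
            s_spill[part(s)].append(s)
    # Remaining partitions: classic build/probe, one partition at a time.
    for i in range(1, B):
        table = defaultdict(list)
        for r in r_spill[i]:
            table[key(r)].append(r)
        for s in s_spill[i]:
            out.extend((r, s) for r in table[key(s)])
    return out
```

Setting `B = 1` degenerates to a pure in-memory hash join; GRACE corresponds to spilling every partition and doing no joining on the first pass.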

September 10, 2022

Hoisting

Hoisting - MDN Web Docs Glossary: Definitions of Web-related terms | MDN (mozilla.org)

September 8, 2022

Bigtable

Wide applicability, scalability, high performance, high availability

Data model
- Rows
  - Every read or write of data under a single row key is atomic
  - Rows are sorted in lexicographic order
  - Tablet: a row range
- Columns
  - Grouped into column families; column key = `family:qualifier`
  - Access control at the family level
- Timestamp
  - Store multiple versions in one cell

Implementation
- Master
  - Assigns tablets to tablet servers
  - Detects the addition and expiration of tablet servers
  - Load balancing
  - Garbage collection
- Tablet location
  - Read a Chubby file -> get the location of the root tablet
  - Root tablet: contains all METADATA tablet locations
  - METADATA table
    - Row key: (tablet's table id, end row)
    - Data: location of the data tablet, plus other metadata
- Tablet assignment
  - Tablet servers create a file and take a lock in Chubby
  - The master detects whether a tablet server is alive by asking it
  - On timeout, or if a server reports a lost lock, the master tries to lock that server's file
    - If the lock succeeds => Chubby is up, the server is down => delete the server file; the server commits suicide
  - If the master can't reach Chubby => the master kills itself
  - On master boot: scan Chubby and ask every server, to learn the tablet <=> server mapping
  - Servers split a tablet => commit the split to the METADATA table => notify the master
- Tablet serving
  - memtable => new updates
  - SSTables => old updates

Optimizations
- Locality groups: one SSTable for each locality group (set of column families)
- Compression
- Caching
- Bloom filters: reduce SSTable accesses by asking whether the data is in that SSTable
- Commit log: each tablet server appends mutations for all of its tablets to a single physical log file
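The Bloom filter optimization above works because the filter can say "definitely not in this SSTable" without touching disk; only a positive answer forces a read. A minimal sketch (sizes and hash scheme are illustrative, not Bigtable's actual parameters):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives.
    Used here to model skipping an SSTable when a row/column pair is absent."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))
```

A read path would consult `might_contain("row:family:qualifier")` for each SSTable and only open the ones that answer true.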

April 23, 2022

Dynamo

Assumptions
- KV store
- Small objects (< 1 MB)
- Weak consistency

Considerations
- Always writeable
- Conflict resolution happens during reads
  - Applications can implement their own merge policy
  - Or use a simple policy the data store provides (last write wins)

Architecture
- Interface
  - `get(key) -> (Array[object], context)` — one object, or many conflicting objects
  - `put(key, context, object)` — context: vector clock and other metadata
- Partitioning
  - Each node hashes to a location on a ring
  - Data hashes to another location on the ring; rotate clockwise and find the nearest node
  - Improvement: each node hashes to many virtual nodes on the ring
- Replication
  - Replicate at the $N-1$ clockwise successor nodes
  - Preference list: the list of nodes responsible for storing a particular key
- Versioning
  - Vector clocks
  - Clock compression: remove the oldest clock entry when the size exceeds a threshold
- get()/put()
  - Set $R$, $W$
  - Write goes to a coordinator (nearest node, or via a load balancer); the coordinator sends to the top $N$ nodes in the preference list, waits for $W-1$ replies ($W$ writes in total), then returns
  - Read from $N$ nodes, wait for $R$ responses
- Failures
  - When a server is down, send its data to the next healthy node with a hint
  - The backup sends the data back to the previously failed node once it recovers
- Membership change
  - Requires manual action
  - Each node contacts a peer at random every second to sync membership/partition changes
  - Detect failure locally ?...
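The partitioning and replication rules above — node hashes to a ring position, a key goes to its first clockwise successor, and the next $N-1$ distinct successors complete the preference list — can be sketched as follows. This is a toy model (MD5-based placement, no virtual nodes); the names are illustrative.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string to a point on the ring (0 .. 2**32 - 1)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

class Ring:
    """Consistent-hashing sketch: a key is owned by the first node clockwise
    from its hash; the key's preference list is the next N distinct nodes."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [p for p, _ in self.points]

    def preference_list(self, key):
        idx = bisect.bisect_right(self.positions, ring_hash(key))
        out = []
        for i in range(len(self.points)):
            node = self.points[(idx + i) % len(self.points)][1]
            if node not in out:          # skip duplicates (matters with virtual nodes)
                out.append(node)
            if len(out) == self.replicas:
                break
        return out
```

Adding virtual nodes would mean inserting several `(ring_hash(f"{node}#{i}"), node)` points per physical node, which evens out the load without changing the lookup logic.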

April 15, 2022