In this dissertation, we explore how to leverage emerging hardware to improve the performance of data analytics frameworks.
Thesis and Contributions
The irregular memory access pattern of tree traversal hinders the exploitation of parallelism, since GPUs are designed for deterministic memory accesses. The B3-tree consistently outperforms the wB+-tree, one of the state-of-the-art variants of the B+-tree, by a large margin, and shows performance comparable to the semi-persistent FPTree.
Thesis Organization
EclipseMR: Distributed and Parallel Task Processing with Consistent Hashing
MapReuse [18] is an in-memory MapReduce framework with a delta engine that detects input data similarity, reuses available cached intermediate results, and computes only the new portion of the input data. ReStore [19] is another framework that stores intermediate results generated by map tasks in HDFS so that they can be reused by subsequent tasks.
B3-tree: Byte-Addressable Binary B-Tree for Persistent Memory
Both NV-Tree and FPTree use selective persistence, storing internal pages in volatile DRAM but leaf pages in persistent memory. Therefore, the internal pages of NV-Tree and FPTree can be lost in the event of a system failure.
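To make the selective-persistence split concrete, the following is a minimal C++ sketch, assuming a hypothetical pm_alloc() persistent-memory allocator as a stand-in for a PMDK-style pool allocator; it illustrates the idea, not the NV-Tree or FPTree implementation.

```cpp
// A sketch of selective persistence: internal nodes live in volatile
// DRAM and are rebuilt from the persistent leaf list after a crash,
// so only leaf nodes must be updated failure-atomically.
#include <cstdint>
#include <cstdlib>

constexpr int FANOUT = 16;

struct LeafNode {                    // allocated in persistent memory
    uint64_t keys[FANOUT];
    uint64_t values[FANOUT];
    LeafNode* next;                  // sibling link used for recovery
    uint32_t count;
};

struct InternalNode {                // allocated in volatile DRAM
    uint64_t keys[FANOUT];
    void* children[FANOUT + 1];      // internal nodes or leaves
    uint32_t count;
};

// Stand-in for a persistent-memory pool allocator; rounds the size
// up because aligned_alloc requires a multiple of the alignment.
void* pm_alloc(size_t n) {
    return std::aligned_alloc(64, (n + 63) & ~size_t(63));
}

LeafNode* new_leaf() {
    return static_cast<LeafNode*>(pm_alloc(sizeof(LeafNode)));
}
InternalNode* new_internal() { return new InternalNode{}; }  // DRAM only
```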
Write-Optimized Dynamic Hashing for Persistent Memory
The downside of extendible hashing is that each hash table reference requires an extra access to the directory. Say we have a top-level hash table that holds 100 records and a bottom-level stash that holds 50 records.
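The extra directory access can be seen in a minimal sketch of an extendible-hashing lookup; the Segment and Directory types and the MSB-based indexing are illustrative assumptions, not the exact layout used here.

```cpp
// A sketch of an extendible-hashing lookup: every reference costs one
// access to the directory plus one access to the segment (bucket).
#include <cstdint>
#include <vector>

constexpr size_t BUCKETS_PER_SEGMENT = 1024;

struct Segment {
    uint64_t buckets[BUCKETS_PER_SEGMENT];  // simplified: one slot per bucket
    uint32_t local_depth;
};

struct Directory {
    std::vector<Segment*> entries;   // size == 1u << global_depth
    uint32_t global_depth;           // assumed >= 1 here

    uint64_t* lookup(uint64_t hash) {
        Segment* s = entries[hash >> (64 - global_depth)];  // access 1: directory
        return &s->buckets[hash % BUCKETS_PER_SEGMENT];     // access 2: segment
    }
};
```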
Parallel Tree Traversal for Nearest Neighbor Query on the GPU
Data parallelism contrasts with task parallelism in that a large number of GPU processing units perform the same task on different parts of an index. Therefore, each GPU processing unit accesses different parts of global memory, and the number of tree nodes accessed varies across processing units.
EclipseMR
In the DHT file system, each server manages its own routing table, called the finger table, which contains information about m peer servers. If zero-hop routing is enabled, a server resolves the owner of a hash key locally and sends the request directly to it; otherwise, the request is forwarded to another server that owns the hash key, as in the classic DHT routing algorithm [68].
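As an illustration, here is a minimal sketch of owner resolution on a consistent-hash ring for the zero-hop case, where a server caches a full view of the ring; the Ring type is a hypothetical name, not the EclipseMR implementation.

```cpp
// A sketch of zero-hop routing over a consistent-hash ring: with a
// full view of the ring, any server resolves the owner of a hash key
// locally. With only a finger table of m peers, the request would be
// forwarded toward the owner instead.
#include <cstdint>
#include <map>
#include <string>

struct Ring {
    std::map<uint64_t, std::string> servers;  // server hash -> address

    // Owner of a key: the first server clockwise from the key's hash.
    const std::string& owner(uint64_t key_hash) const {
        auto it = servers.lower_bound(key_hash);
        if (it == servers.end()) it = servers.begin();  // wrap around
        return it->second;
    }
};
```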
Distributed In-Memory Cache
Reduce tasks read these intermediate results from oCache or the DHT file system using the hash keys (step 6). In contrast, Hadoop stores intermediate results on the local disks of the servers where the map tasks are executed.
Evaluation
Although we store intermediate results to disk, they can also be stored in oCache for future reuse. Note that we store the intermediate results on the reducer side, not on the map side. If intermediate results are not available, map tasks look up cached input data blocks in oCache for reuse.
For such applications, EclipseMR allows the iteration output to be stored in oCache or the DHT file system instead of the intermediate results. Unlike Hadoop or Spark, EclipseMR decides where to run reduce tasks based on the hash keys of the intermediate results. Instead, EclipseMR lets each mapper pipeline the intermediate results to the DHT file system in a decentralized manner as they are generated.
Based on the hash keys of the intermediate results, each map task stores the intermediate results in a memory buffer for each hash key range.
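A minimal sketch of this per-range buffering might look as follows; the Partitioner type and emit() are hypothetical names, and std::hash stands in for whatever hash function the framework actually uses.

```cpp
// A sketch of map-side buffering: each emitted key/value pair goes to
// the in-memory buffer of the hash key range (reducer) it belongs to.
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Partitioner {
    size_t num_ranges;  // number of hash key ranges (reducers)
    std::vector<std::vector<std::pair<std::string, std::string>>> buffers;

    explicit Partitioner(size_t n) : num_ranges(n), buffers(n) {}

    void emit(const std::string& key, const std::string& value) {
        size_t range = std::hash<std::string>{}(key) % num_ranges;
        buffers[range].emplace_back(key, value);
    }
    // Each buffer is then pipelined to the server responsible for its
    // hash key range as the map task produces output.
};
```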
B3-tree: Byte-Addressable Binary B-Tree
The fundamental challenge in designing a persistent memory data structure is ensuring failure atomicity during the transition from one consistent state to another. To this end, it is necessary to rely on fine-grained memory management with memory fence and cache line flush instructions. Each B3-tree page stores a binary search tree (BST) representation of sorted keys and child page pointers.
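For illustration, the following is a minimal sketch of these primitives on x86, assuming CLWB support (compile with -mclwb); on older hardware CLFLUSH would be used instead. The persist() and atomic_update() helpers are hypothetical names.

```cpp
// A sketch of the persistence primitives: an aligned 8-byte store is
// failure-atomic on x86, so flushing the cache line and fencing is
// enough to move between consistent states without logging.
#include <atomic>
#include <cstdint>
#include <immintrin.h>  // _mm_clwb / _mm_sfence

inline void persist(const void* addr) {
    _mm_clwb(addr);   // write the cache line back to persistent memory
    _mm_sfence();     // order the flush before later stores
}

// Publish a new 8-byte value: a crash leaves either the old value or
// the new one, never a torn mix.
inline void atomic_update(std::atomic<uint64_t>* slot, uint64_t val) {
    slot->store(val, std::memory_order_release);
    persist(slot);
}
```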
In a B3-tree page, however, a child pointer is either the memory address of an external B3-tree page or the index of a binary node within the same page; that is, in-page pointers are not memory addresses but indexes of BST nodes. Note that in our B3-tree every BST node has two children, because BST nodes are only created when a B3-tree page is split. For example, if Page 3 splits, a new BST node is created pointing to Page 3 and the newly split page.
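A possible encoding of such dual-purpose child pointers is sketched below, using a low tag bit to distinguish an in-page BST-node index from an external page address; the actual bit layout in the B3-tree may differ.

```cpp
// A sketch of a dual-purpose child pointer: the low bit tags whether
// the value is an in-page BST node index or the address of another
// B3-tree page (page addresses are at least 2-byte aligned, so the
// low bit is free).
#include <cstdint>

struct BSTNode {
    uint64_t key;
    uint64_t left;   // tagged child: BST-node index or page address
    uint64_t right;
};

inline uint64_t tag_index(uint32_t idx)  { return (uint64_t(idx) << 1) | 1u; }
inline bool     is_index(uint64_t child) { return (child & 1u) != 0; }
inline uint32_t index_of(uint64_t child) { return uint32_t(child >> 1); }
```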
Failure-Atomic B3-tree Node Update
First, we describe how to make the deletion of a BST node in an internal B3-tree page failure-atomic. A child of a BST node can be either another BST node in the same page or another B3-tree page. Note that the deletion of a BST node is triggered when we delete a B3-tree page.
In this case, we make their parent BST node point to the merged page. We can make transactions follow the right sibling pointer if the parent BST node has the same left and right child pointers. Then we atomically replace the right child pointer of BST node p with the address of the new BST nodes in the parent page.
Finally, iv) we replace BST node (q) in the parent page with a new BST node (q' in the example), as shown in Figure 11(d).
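The key ordering constraint, persisting the new node before publishing it, can be sketched as follows; Node, persist_range(), and replace_child() are illustrative simplifications of the B3-tree structures, again assuming x86 CLWB support.

```cpp
// A sketch of the ordering: the new BST node q' is made durable
// before the parent's child pointer is switched to it with a single
// 8-byte atomic store, so a crash at any point leaves a consistent tree.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

struct Node { uint64_t key; Node* left; Node* right; };

// Flush the cache lines covering [addr, addr + len) and fence.
inline void persist_range(const void* addr, size_t len) {
    auto base = reinterpret_cast<uintptr_t>(addr);
    for (uintptr_t p = base & ~uintptr_t(63); p < base + len; p += 64)
        _mm_clwb(reinterpret_cast<const void*>(p));
    _mm_sfence();
}

void replace_child(std::atomic<Node*>* parent_slot, Node* q_prime) {
    persist_range(q_prime, sizeof(Node));       // q' durable before visible
    parent_slot->store(q_prime, std::memory_order_release);
    persist_range(parent_slot, sizeof(Node*));  // make the swap durable
}
```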
Evaluation
B3-tree shows a slightly higher number of cache line accesses than wB+-tree. However, the average number of cache line flushes is much smaller than that of wB+-tree. In contrast, the number of comparison operations in B3-tree leaf pages is higher than in the FAST and FAIR B-tree.
As we increase the write latency of PM, the FAST and FAIR B-tree suffers from a larger number of cache line flushes caused by shift operations. B3-tree does not perform expensive logging, and it incurs fewer cache line flushes than FPTree. In Figure 18, we measure the insert and search throughput while varying the number of concurrent threads. We observe that B3-tree shows higher insertion throughput than the FAST and FAIR B-tree when the number of threads is less than 16.
Cacheline-Conscious Extendible Hashing
Therefore, the segment splitting algorithm of extendible hashing performs several in-place updates to the directory and copy-on-writes. In the next step, we update the directory entries for the new segment 4, as shown in Figure 21(b). Segment merging performs the same operations as segment splitting, but in reverse order.
Once we increase the local depth of the split segment in the directory entry, the migrated keys (the crossed-out keys in Figure 21(b)) will be considered invalid by subsequent transactions. As shown in Figure 23, with the MSB bits, the directory entries allow us to keep track of the segment split history as a binary buddy tree, where each node in the tree represents a segment. Note that the local depth of each segment and the global depth determine the segment's stride in the directory, i.e., how many consecutive directory entries point to the segment.
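The stride rule and the directory updates on a split can be sketched as follows; the Directory and Segment types are simplifications, assuming MSB-based directory indexing so that each segment occupies consecutive entries.

```cpp
// A sketch of the stride rule: a segment with local depth l occupies
// 2^(G - l) consecutive directory entries under MSB indexing. On a
// split, the upper half of those entries is redirected to the new
// segment and both segments take local depth l + 1.
#include <cstdint>
#include <vector>

struct Segment { uint32_t local_depth; /* buckets omitted */ };

struct Directory {
    std::vector<Segment*> entries;   // size == 1u << global_depth
    uint32_t global_depth;

    void split(uint32_t dir_idx, Segment* old_seg, Segment* new_seg) {
        uint32_t stride = 1u << (global_depth - old_seg->local_depth);
        uint32_t first = dir_idx & ~(stride - 1);   // segment's first entry
        for (uint32_t i = first + stride / 2; i < first + stride; ++i)
            entries[i] = new_seg;                   // redirect upper half
        old_seg->local_depth += 1;
        new_seg->local_depth = old_seg->local_depth;
    }
};
```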
Therefore, a lock-free query accesses the old split segment until we replace the directory pointer with that of the new segment.
Parallel Scan and Backtrack for kNN Query
In step 1, we search for the leaf node closest to the query point and update the pruning distance while calculating the MAXDIST of the child nodes. If node K does not update the kNN points or the pruning distance, in step 6 we retrieve the parent node E of K from global memory and discard the child nodes (L in the example) that are farther than the pruning distance. If the MINDIST of M is less than the pruning distance, step 7 retrieves M and step 8 continues to scan its child nodes.
If node N does not update the kNN points, we move to its parent node F, which has no child nodes within the pruning distance. If node C has no child nodes within the pruning distance, we move up to the root node and complete the search. Otherwise, as in the example, we visit the leftmost leaf node G because it is within the pruning distance.
If the parent node has unvisited child nodes within the pruning distance, we choose and revisit the leftmost node among the unvisited child nodes.
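The following sequential C++ sketch captures this scan-and-backtrack logic, which the GPU version executes with many processing units per node; MINDIST is simplified to one-dimensional bounding spheres, and the node layout is an assumption for illustration.

```cpp
// A sequential sketch of scan-and-backtrack kNN search; the GPU
// version scans the branches of each node in parallel.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <queue>
#include <vector>

struct Node {
    Node* parent = nullptr;
    std::vector<Node*> children;   // empty for a leaf
    std::vector<double> points;    // leaf entries (1-D for brevity)
    double center = 0, radius = 0; // bounding sphere
    bool visited = false;
};

double mindist(const Node* n, double q) {
    return std::max(0.0, std::fabs(q - n->center) - n->radius);
}

void knn(Node* root, double q, size_t k) {
    std::priority_queue<double> best;  // max-heap of the k best distances
    double prune = INFINITY;           // pruning distance
    for (Node* cur = root; cur != nullptr; ) {
        cur->visited = true;
        if (cur->children.empty()) {           // leaf: scan all entries
            for (double p : cur->points) {
                double d = std::fabs(q - p);
                if (best.size() < k) best.push(d);
                else if (d < best.top()) { best.pop(); best.push(d); }
            }
            if (best.size() == k) prune = best.top();
            cur = cur->parent;                 // backtrack
        } else {
            // Leftmost unvisited child within the pruning distance,
            // or backtrack to the parent if none remains.
            Node* next = nullptr;
            for (Node* c : cur->children)
                if (!c->visited && mindist(c, q) < prune) { next = c; break; }
            cur = next ? next : cur->parent;
        }
    }
}
```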
Bottom-up Construction of SS-tree
From q, we calculate the distances between q and the rest of the points in parallel and choose the farthest point r. Since the node utilization of the bottom-up constructed SS-tree is higher than that of the classic SS-tree, it has fewer tree nodes, resulting in shorter search paths. We run the experiments using NVIDIA K40 GPUs for the bottom-up SS-tree implementations, and we set the degree of an SS-tree node to 128 so that each processing unit in a streaming multiprocessor processes four branches, i.e., 32 branches are processed in parallel.
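The farthest-point step mentioned above can be sketched as follows (sequentially here; on the GPU the distance computations run in parallel); the Pt type and farthest() are illustrative names.

```cpp
// A sketch of the farthest-point step: pick a point q, compute the
// distance from q to every other point, and keep the farthest point r.
#include <cstddef>
#include <vector>

struct Pt { float x, y; };

float dist2(Pt a, Pt b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;  // squared distance is enough for argmax
}

// Index of the point farthest from points[q].
size_t farthest(const std::vector<Pt>& points, size_t q) {
    size_t r = q;
    float best = 0.0f;
    for (size_t i = 0; i < points.size(); ++i) {
        float d = dist2(points[q], points[i]);
        if (d > best) { best = d; r = i; }
    }
    return r;
}
```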
We also compare the search performance of a bottom-up constructed SS-tree on the GPU with that of a top-down constructed SR-tree on the CPU. Since the two index structures have different node sizes, we also compare search performance in terms of the number of bytes accessed. However, since the SS-tree on the GPU does not maintain a traversal stack for backtracking, we let it use additional parent links so that it can go back to parent nodes.
In four dimensions, the bottom-up SS-tree (Hilbert) accesses about 16 times more tree nodes than the bottom-up SS-tree (k-means), resulting in a 7.1 times slower average query response time.
Experiments
S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
K. Shen, S. Park, and M. Zhu, "Journaling of Journal Is (Almost) Free," in Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST), 2014.
Y. Won, J. Jung, G. Choi, J. Oh, S. Son, J. Hwang, and S. Cho, "Barrier-Enabled IO Stack for Flash Storage," in Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST), 2018.
F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2002.
D. A. White and R. Jain, "Similarity Indexing with the SS-tree," in Proceedings of the 12th International Conference on Data Engineering (ICDE), 1996.
D. Hwang, W.-H. Kim, Y. Won, and B. Nam, "Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree," in Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST), 2018.
H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight Persistent Memory," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
K. A. Ross, "Efficient Hash Probes on Modern Processors," in Proceedings of the 23rd International Conference on Data Engineering (ICDE), 2007.
I. Kamel and C. Faloutsos, "On Packing R-trees," in Proceedings of the Second International Conference on Information and Knowledge Management (CIKM), 1993.