Write-Optimal Radix Tree: A Deterministic Indexing Structure for Persistent Memory Storage Systems

In this paper, we argue that the radix tree may be better suited for PM indexing than B-tree variants. This is because the structure of the radix tree is determined by the prefixes of the inserted keys, so it requires neither tree rebalancing operations nor node-granularity updates. As another contribution, we present three radix tree variants, namely WORT (Write Optimal Radix Tree), WOART (Write Optimal Adaptive Radix Tree), and ART+CoW.

We present radix tree variants that are optimal for PM in the sense that consistency is always guaranteed by a single 8-byte failure-atomic write, without additional logging or CoW copies. The first contribution of this paper is to show both the suitability and the limitations of the radix tree. Because the structure of the radix tree is determined by the prefixes of the inserted keys, the radix tree does not require key comparisons.

This limitation of the radix tree leads us to the second contribution of this paper. For the second radix tree variant, we present the Write Optimal Adaptive Radix Tree (WOART), which is a PM extension of ART. Through an extensive performance study using synthetic workloads, we show that for insertion and search, the radix tree variants we propose outperform recent persistent B-tree-based indexes such as NVTree, wB+Tree, and FPTree.

In Section 3, we first review the radix tree to help understand the main contributions of our work.

Background and Motivation

Consistency in Persistent Memory

Persistent B+-Trees

CLFLUSH instructions as many times as the number of records in a tree node to sort the records. NVTree [20] reduces the number of expensive memory fence and cache line flush instructions by employing an append-only update strategy. Due to this strategy and the fact that only the leaf nodes are kept in PM, NVTree requires only two cache line flushes, one for the entry and the other for the entry count, resulting in improved performance.

This has two consequences: first, the leaf nodes remain unsorted, and second, internal nodes may be lost upon system failure, although they can be trivially reconstructed from the leaf nodes in PM. NVTree also requires all internal nodes to be stored in contiguous memory blocks to take advantage of cache locality, and within a large memory block, internal nodes are located by offsets instead of pointers. As a result, each split of the parent of a leaf node leads to the reconstruction of all internal nodes.

FPTree: FPTree is another persistent index that keeps internal nodes in volatile memory while leaf nodes are kept in PM [15]. By storing the internal nodes in volatile memory, FPTree leverages hardware transactional memory to efficiently handle concurrency of internal node accesses. By scanning the fingerprints first before a query looks for keys, FPTree reduces the number of key accesses and thus the cache miss ratio.
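The fingerprint idea can be illustrated with a minimal sketch. It assumes that a fingerprint is a one-byte hash of the key stored in a small contiguous array at the head of the leaf; the leaf capacity, the hash function, and the names `fp_leaf`, `fingerprint`, `leaf_put`, and `leaf_get` are hypothetical and are not taken from FPTree itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_CAP 8                 /* hypothetical leaf capacity */

typedef struct {
    uint8_t  fp[LEAF_CAP];         /* one-byte fingerprints, stored contiguously */
    uint8_t  used[LEAF_CAP];       /* 1 if the entry is valid */
    uint64_t key[LEAF_CAP];        /* unsorted keys */
    void    *val[LEAF_CAP];
} fp_leaf;

/* Hypothetical hash: any function that spreads keys over 256 values works. */
static uint8_t fingerprint(uint64_t key)
{
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return (uint8_t)key;
}

/* Store the entry in the first free position; -1 if the leaf is full. */
static int leaf_put(fp_leaf *l, uint64_t key, void *val)
{
    assert(l != NULL);
    for (size_t i = 0; i < LEAF_CAP; i++) {
        if (!l->used[i]) {
            l->key[i]  = key;
            l->val[i]  = val;
            l->fp[i]   = fingerprint(key);
            l->used[i] = 1;
            return 0;
        }
    }
    return -1;
}

/* Scan fingerprints first; a key is only read when its fingerprint matches.
 * *key_probes counts how many keys were actually compared. */
static void *leaf_get(const fp_leaf *l, uint64_t key, unsigned *key_probes)
{
    uint8_t f = fingerprint(key);
    for (size_t i = 0; i < LEAF_CAP; i++) {
        if (!l->used[i] || l->fp[i] != f)
            continue;              /* skipped without touching key[i] */
        (*key_probes)++;
        if (l->key[i] == key)
            return l->val[i];
    }
    return NULL;
}
```

With a reasonable hash, most lookups compare only the one key whose fingerprint matches, which is the source of the reduced key accesses described above.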

Although FPTree shows superior performance over NVTree, FPTree also requires reconstruction of internal nodes when a system crashes. Since the entries within a node need to be kept in sorted order, wB+Tree proposes to sort them via a slot array, which adds a level of indirection to the actual keys and pointers. If only the slot array is used, the number of cache line flushes decreases to two.
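The slot-array indirection can be sketched as follows. This is a simplified illustration rather than wB+Tree's actual node layout: the capacity of seven entries, the convention that `slot[0]` holds the entry count, and the names `slot_node`, `sn_insert`, and `sn_search` are all assumptions made for the example. Entries are appended unsorted, and only the 8-byte slot array encodes the sorted order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAP 7   /* hypothetical: 1 count byte + 7 index bytes = 8-byte slot array */

typedef struct {
    uint8_t slot[8];     /* slot[0] = entry count, slot[1..] = entry indices in key order */
    int64_t key[CAP];    /* entries are stored in arrival order, not sorted */
    void   *val[CAP];
} slot_node;

static int sn_insert(slot_node *n, int64_t key, void *val)
{
    uint8_t cnt = n->slot[0];
    assert(cnt <= CAP);
    if (cnt == CAP)
        return -1;                       /* node full: a split would be needed */

    /* Step 1: append the entry; with no deletions, index cnt is free.
     * The first cache line flush would go here. */
    n->key[cnt] = key;
    n->val[cnt] = val;

    /* Step 2: build the new slot array in a private copy. */
    uint8_t next[8];
    memcpy(next, n->slot, sizeof next);
    unsigned pos = 1;
    while (pos <= cnt && n->key[next[pos]] < key)
        pos++;
    memmove(&next[pos + 1], &next[pos], (size_t)(cnt + 1 - pos));
    next[pos] = cnt;
    next[0]   = (uint8_t)(cnt + 1);

    /* Step 3: publish it. memcpy stands in for one 8-byte atomic store;
     * the second cache line flush would go here. */
    memcpy(n->slot, next, sizeof next);
    return 0;
}

/* Binary search over slot positions 1..count; each probe is indirected. */
static void *sn_search(const slot_node *n, int64_t key)
{
    unsigned lo = 1, hi = n->slot[0];
    while (lo <= hi) {
        unsigned mid = (lo + hi) / 2;
        int64_t k = n->key[n->slot[mid]];
        if (k == key)
            return n->val[n->slot[mid]];
        if (k < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return NULL;
}
```

The two marked flush points correspond to the two cache line flushes mentioned above, and the extra array lookup in `sn_search` is the cost of the indirection.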

Although the number of cache line flushes is dramatically reduced compared to CDDS B-tree, wB+Tree bears the overhead of indirection. These persistent B+-tree variants typically restructure the B+-tree to be append-only, leaving the keys unsorted to reduce the number of cache line flushes associated with logging, and use the 8-byte failure-atomic write as a commit mark for updates. This raises a question: "Why do we stick with the B+-tree even though its basic features, key ordering and tree rebalancing, create problems when it is used in PM?" We therefore felt the need for an alternative to the B+-tree.

Radix Trees for PM

Radix Tree

[Figure labels: Radix tree; Pointer array]

Failure Atomic Write in Radix Tree

In this section, we describe the small changes we made to make the radix tree PM-efficient. With the changes we propose, the radix tree remains consistent across system failures without requiring any logging or replication mechanism, as long as the 8-byte failure atomicity assumption holds. Only two simple changes need to be made, which we describe below.

The first change is to ensure that the writes that modify pointers are performed in a certain order. Given a key, we traverse the path using the partial keys until we find a pointer value that is NULL. For 3,884, since the first partial key is 0000, we follow the leftmost child pointer from the root node.

At this level (depth 1), we find that the next partial key is 1111 and that its pointer is NULL. After finding the child pointer to be NULL, we create a new node and keep doing this until the leaf node is created. The first modification we propose is to make the operation that replaces the first NULL pointer (if a sequence of nodes is created) with the address of the next-level node the last operation.
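The traversal in this walk-through can be sketched in C. The sketch assumes 16-bit keys split into 4-bit partial keys, which is what the example suggests (3,884 is 0000 1111 0010 1100 in binary); the names `rnode`, `partial_key`, and `first_null_slot` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_BITS   16   /* assumption: 16-bit keys, as in the walk-through */
#define CHUNK_BITS 4    /* assumption: 4-bit partial keys, 16 children per node */
#define FANOUT     (1u << CHUNK_BITS)
#define MAX_DEPTH  (KEY_BITS / CHUNK_BITS)

typedef struct rnode {
    struct rnode *child[FANOUT];   /* pointer array indexed by partial key */
} rnode;

/* Partial key used at a given depth (depth 0 is the most significant chunk). */
static unsigned partial_key(uint16_t key, unsigned depth)
{
    assert(depth < MAX_DEPTH);
    unsigned shift = KEY_BITS - CHUNK_BITS * (depth + 1);
    return (key >> shift) & (FANOUT - 1);
}

/* Walk down from the root and return the address of the first NULL child
 * pointer on the key's path, reporting its depth through depth_out.
 * Returns NULL if every pointer on the path is already set. */
static rnode **first_null_slot(rnode *root, uint16_t key, unsigned *depth_out)
{
    rnode *n = root;
    for (unsigned d = 0; d < MAX_DEPTH; d++) {
        rnode **slot = &n->child[partial_key(key, d)];
        if (*slot == NULL) {
            *depth_out = d;
            return slot;
        }
        n = *slot;
    }
    return NULL;
}
```

For key 3,884, `partial_key` yields 0000 at depth 0 and 1111 at depth 1, so the slot returned for a tree that already contains the leftmost child of the root is entry 15 of that child. This is the pointer that must be written last.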

Because the pointer assignment instruction is an 8-byte atomic write, no form of logging is needed in the radix tree. However, we do need to call a few memory access serialization instructions to enforce failure atomicity. For example, if 8,209 is indexed as shown in Figure 3.1, we must first call the memory fence and cache line flush instructions to ensure that all nodes leading to the leaf node, including the leaf node itself, are written to PM.

Only then do we change the very first NULL pointer in the root, i.e., element 2.
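The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: keys are 16 bits with 4-bit partial keys, `calloc` stands in for a PM allocator, and `persist`, `pm_insert`, and `pm_lookup` are hypothetical names. On x86-64 the sketch uses the `_mm_clflush` and `_mm_mfence` intrinsics; elsewhere it falls back to a plain fence as a placeholder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__x86_64__)
#include <emmintrin.h>             /* _mm_clflush, _mm_mfence */
#endif

#define CACHE_LINE 64
#define FANOUT     16              /* assumption: 4-bit partial keys */
#define LEVELS     4               /* assumption: 16-bit keys */

typedef struct pnode {
    _Atomic(void *) child[FANOUT]; /* child nodes, or values at the last level */
} pnode;

/* Flush every cache line covering [addr, addr + len), fenced on both sides. */
static void persist(const void *addr, size_t len)
{
#if defined(__x86_64__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    _mm_mfence();
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_mfence();
#else
    (void)addr; (void)len;
    atomic_thread_fence(memory_order_seq_cst);  /* placeholder on other targets */
#endif
}

static unsigned chunk(uint16_t key, unsigned depth)
{
    assert(depth < LEVELS);
    return (key >> (4 * (LEVELS - 1 - depth))) & 0xF;
}

/* Returns 0 on success, -1 if allocation fails. */
static int pm_insert(pnode *root, uint16_t key, void *value)
{
    pnode *n = root;
    unsigned d = 0;
    /* Follow existing nodes until the first NULL pointer or the last level. */
    while (d < LEVELS - 1) {
        void *next = atomic_load_explicit(&n->child[chunk(key, d)],
                                          memory_order_acquire);
        if (next == NULL)
            break;
        n = next;
        d++;
    }
    /* Build the missing nodes bottom-up. They are not reachable from the
     * tree yet, so a crash here only leaks them. */
    void *head = value;
    for (unsigned lvl = LEVELS - 1; lvl > d; lvl--) {
        pnode *fresh = calloc(1, sizeof *fresh);   /* stand-in for a PM allocator */
        if (fresh == NULL)
            return -1;             /* sketch only: nodes built so far are leaked */
        atomic_init(&fresh->child[chunk(key, lvl)], head);
        persist(fresh, sizeof *fresh);
        head = fresh;
    }
    /* Commit point: one 8-byte store, performed last, then persisted. */
    atomic_store_explicit(&n->child[chunk(key, d)], head, memory_order_release);
    persist(&n->child[chunk(key, d)], sizeof n->child[0]);
    return 0;
}

static void *pm_lookup(pnode *root, uint16_t key)
{
    void *p = root;
    for (unsigned d = 0; d < LEVELS && p != NULL; d++)
        p = atomic_load_explicit(&((pnode *)p)->child[chunk(key, d)],
                                 memory_order_acquire);
    return p;
}
```

Inserting 8,209 (0010 0000 0001 0001) into an empty tree builds three new nodes, persists each of them, and only then stores into entry 2 of the root. If the system fails before that final store, the tree is unchanged; if it fails after, the whole path is already durable.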

[Figure labels: Path compression header (unsigned char depth; unsigned char PrefixLen; Prefix array); Prefix length; Node depth]

Radix Tree Path Compression

[Figure labels: Header bytes; Old parent pointer change; New parent allocation; Prefix array update]

WORT: Write Optimal Radix Tree
WOART: Write Optimal Adaptive Radix Tree
ART with Copy-on-Write
Experimental Environment
Performance Evaluation

Insertion Performance
Search Performance
Range Query Performance
Space Consumption
Experiments with Memcached

Summary and Conclusion

However, we find that the radix tree can tolerate this temporary inconsistent state by storing the depth in each node and skipping the comparison of the partial key that is currently being propagated to the new parent node. Poor node utilization and high memory consumption have been pointed out as the main drawbacks of the radix tree. Finally, we take advantage of the fact that the partial key size in a radix tree node is only one byte, that NODE4 stores only four keys per node, and that the total size of the slot array is only four bytes.
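One plausible reading of the NODE4 observation is that four one-byte partial keys and a four-byte slot array fit together in a single 8-byte word, so both can be replaced with one failure-atomic store. The sketch below illustrates that reading only; the field layout, the `N4_EMPTY` marker, and the names `node4`, `n4_init`, `n4_insert`, and `n4_find` are hypothetical, and flush calls are omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define N4_EMPTY 0xFF              /* hypothetical marker for an unused slot */

typedef struct {
    unsigned char partial[4];      /* one-byte partial keys, kept in sorted order */
    unsigned char slot[4];         /* slot[i]: index into ptr[] for partial[i] */
} node4_hdr;

static_assert(sizeof(node4_hdr) == 8, "header must fit in one 8-byte word");

typedef struct {
    _Atomic uint64_t hdr;          /* packed node4_hdr, replaced in one store */
    void *ptr[4];                  /* child pointers, filled in arrival order */
} node4;

static void n4_init(node4 *n)
{
    node4_hdr h;
    uint64_t w;
    memset(h.partial, 0, sizeof h.partial);
    memset(h.slot, N4_EMPTY, sizeof h.slot);
    memcpy(&w, &h, sizeof w);
    atomic_init(&n->hdr, w);
    memset(n->ptr, 0, sizeof n->ptr);
}

/* Returns 0 on success, -1 if the node is full or the key already exists. */
static int n4_insert(node4 *n, unsigned char pk, void *child)
{
    uint64_t w = atomic_load(&n->hdr);
    node4_hdr h;
    memcpy(&h, &w, sizeof h);

    int cnt = 0;
    while (cnt < 4 && h.slot[cnt] != N4_EMPTY)
        cnt++;
    if (cnt == 4)
        return -1;                 /* full: the node would have to grow */

    int pos = 0;
    while (pos < cnt && h.partial[pos] < pk)
        pos++;
    if (pos < cnt && h.partial[pos] == pk)
        return -1;

    /* Step 1: write the pointer into a position no header entry refers to. */
    n->ptr[cnt] = child;

    /* Step 2: shift keys and slots in a private copy of the header. */
    for (int i = cnt; i > pos; i--) {
        h.partial[i] = h.partial[i - 1];
        h.slot[i]    = h.slot[i - 1];
    }
    h.partial[pos] = pk;
    h.slot[pos]    = (unsigned char)cnt;

    /* Step 3: a single 8-byte store makes both arrays visible at once. */
    memcpy(&w, &h, sizeof w);
    atomic_store(&n->hdr, w);
    return 0;
}

static void *n4_find(node4 *n, unsigned char pk)
{
    uint64_t w = atomic_load(&n->hdr);
    node4_hdr h;
    memcpy(&h, &w, sizeof h);
    for (int i = 0; i < 4 && h.slot[i] != N4_EMPTY; i++)
        if (h.partial[i] == pk)
            return n->ptr[h.slot[i]];
    return NULL;
}
```

Because readers always see either the old header or the new one, the multi-byte shift in step 2 never exposes a half-updated node, even though several bytes change.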

Then the partial key and pointer values are placed in the empty entry position of the parallel arrays (lines 3-8). In the sparse key distribution, keys are uniformly distributed, so they share a common prefix only in the top levels of the tree structure. In this section, we evaluate the three proposed radix tree variants against the latest persistent indexing structures, namely wB+Tree, NVTree, and FPTree.

We see that the overall performance is inversely proportional to the number of LLC misses and the depth of the tree. Note that the depth of the tree is slightly higher for the radix tree variants. The reason there are fewer LLC misses is that the radix tree can traverse the tree structure without performing any key comparisons, resulting in less cache pollution.

Therefore, the B-tree variants access the full extent of the node more often, resulting in more LLC misses. This is due to the depth of the tree, as WORT has the greatest depth. However, we note that the PM variants of B-trees also do not keep keys sorted, in order to reduce the overhead of ensuring consistency, which sacrifices one of the key properties of B+-trees.

We also replace the bucket lock mechanism of the hash table with a global tree lock mechanism.