
Figure 10: Normalized Job Execution Time with Block Coalescing (x-axis: WordCount, Grep, and Sort; y-axis: normalized execution time; compared configurations: default-128MB, default-1GB, max-coalescing, and assorted-coalescing).

For WordCount, the maximum coalescing scheme shows the fastest performance, and the default Hadoop with 1 GB blocks shows comparable performance. The assorted coalescing scheme outperforms the default Hadoop with 128 MB HDFS blocks by 10%, but it is slightly slower than the maximum coalescing scheme because of its larger container overhead.

For the Grep application, the assorted coalescing scheme shows the fastest performance. Since Grep is a computation-intensive job that often suffers from load imbalance, as we discussed, the maximum coalescing scheme and the default Hadoop with 1 GB HDFS blocks show even worse performance than the default Hadoop with 128 MB blocks. The assorted coalescing scheme, however, reduces the container overhead while achieving good load balancing, which explains its 7% performance improvement over the default Hadoop.

For the Sort application, the assorted coalescing scheme also outperforms the default Hadoop by at least 14% because it reduces the container overhead while overlapping the map and reduce phases using small input splits.

We also evaluate the job scheduling fairness of the assorted coalescing scheme in the experiments shown in Figure 8c. The results show that the assorted coalescing scheme achieves better fairness than the maximum coalescing scheme. With good load balancing and reduced container overhead, the assorted coalescing scheme improves the job execution time by 36.3% on average in Figure 8.

of containers and the container overhead. Our experimental study shows that the assorted block coalescing scheme reduces the container overhead while achieving good load balancing and job scheduling fairness, without impairing the degree of overlap between the map and reduce phases.
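To make the scheme concrete, the following is a minimal sketch of how an assorted coalescing policy could assign input split sizes; the constants, the 128 MB block granularity used as the unit, and the makeAssortedSplits function are illustrative assumptions rather than the implementation evaluated above. Most blocks are coalesced into large splits to reduce the number of containers, while a fraction of single-block splits is kept so that map tasks finish at staggered times, preserving load balancing and the overlap of the map and reduce phases.

#include <cstddef>
#include <vector>

// Hypothetical sketch of an "assorted" block coalescing policy.
// Given the number of 128 MB HDFS blocks in a file, it emits a mix of
// large coalesced splits (to cut container overhead) and small splits
// (to preserve load balancing and map/reduce overlap).
// All constants below are illustrative, not taken from this work.
std::vector<std::size_t> makeAssortedSplits(std::size_t numBlocks) {
    const std::size_t kLargeSplitBlocks = 8;   // e.g., coalesce 8 x 128 MB = 1 GB
    const double kSmallFraction = 0.2;         // keep ~20% of blocks as small splits

    std::size_t smallBlocks = static_cast<std::size_t>(numBlocks * kSmallFraction);
    std::size_t largeBlocks = numBlocks - smallBlocks;

    std::vector<std::size_t> splitSizesInBlocks;
    // Large, coalesced splits first: few containers, low per-container overhead.
    while (largeBlocks >= kLargeSplitBlocks) {
        splitSizesInBlocks.push_back(kLargeSplitBlocks);
        largeBlocks -= kLargeSplitBlocks;
    }
    if (largeBlocks > 0) splitSizesInBlocks.push_back(largeBlocks);
    // Small single-block splits last: they fill scheduling gaps near the end
    // of the map phase and let the reduce phase start earlier.
    for (std::size_t i = 0; i < smallBlocks; ++i) splitSizesInBlocks.push_back(1);
    return splitSizesInBlocks;
}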

V Braiding SkipLists for Byte-addressable Persistent LSM-Tree

With the advent of Intel's Optane DC Persistent Memory (DCPM), which offers a higher capacity than DRAM and higher performance than SSDs, various data structures such as B+trees [15,17,21,22] and hash tables [19,53] have been redesigned to exploit the device characteristics. Although byte-addressable persistent B+trees show impressive read/write performance in PM [17,37], the B+tree is sub-optimal for write-intensive workloads. That is, although random access to PM has much lower latency than disks, sequential access to PM is still much faster because it benefits from locality and hardware prefetchers [17]. However, a B+tree requires each query to traverse the tree structure to find a leaf node in a non-sequential fashion, and its performance degrades as the index size grows.
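As a generic illustration of this pointer-chasing access pattern (a simplified sketch, not any of the persistent B+trees cited above), a point lookup follows child pointers from the root down to a leaf, touching one node per level at unrelated memory addresses:

#include <cstdint>

// Generic in-memory B+tree node layout, for illustration only.
struct BTreeNode {
    bool isLeaf;
    int numKeys;
    uint64_t keys[16];
    void* children[17];   // child pointers (inner node) or record pointers (leaf)
};

// A point lookup walks one node per level; each hop dereferences a pointer
// to a node that is typically not adjacent in memory, so the hardware
// prefetcher cannot help, and the number of hops grows with the index size.
void* lookup(BTreeNode* root, uint64_t key) {
    BTreeNode* node = root;
    while (!node->isLeaf) {
        int i = 0;
        while (i < node->numKeys && key >= node->keys[i]) ++i;
        node = static_cast<BTreeNode*>(node->children[i]);
    }
    for (int i = 0; i < node->numKeys; ++i)
        if (node->keys[i] == key) return node->children[i];
    return nullptr;   // key not found
}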

The Log-Structured Merge-tree (LSM-tree) is widely used in various NoSQL systems including LevelDB [4], RocksDB [5], HBase [6], and Cassandra [7]. The LSM-tree is a data structure designed for a high volume of inserts or updates. Since the most efficient operation on block device storage, such as SSDs and HDDs, is sequential access, the LSM-tree stores data in an append-only fashion, as in transactional logging, so that slow random access can be avoided. Nonetheless, to provide indexed access to the huge amount of accumulated data, the LSM-tree performs background compactions that organize the data into a tree-like structure.

To transform random writes into sequential writes, LSM-trees stack small indexing structures, such as B+trees, SkipLists, or sorted arrays called SSTables, in multiple layers. By layering multiple indexing components, the LSM-tree sorts a small set of new writes in a volatile buffer and persists them sequentially to avoid random I/O. The persisted sorted runs are gradually merged into larger upper-level components by a compaction process, which runs in the background so that it does not affect the write latency of foreground threads. As a result, the LSM-tree exhibits significantly lower write latency than the B+tree, which makes LSM-trees suitable for write-intensive workloads.
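The following toy model sketches this write path under simplifying assumptions (single-threaded, in-memory sorted runs instead of SSTable files, no compaction thread); it only illustrates the buffering and sequential flushing described above, not the implementation of any cited system.

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model of an LSM-tree write path, for illustration only.
class ToyLSM {
public:
    void put(const std::string& key, const std::string& value) {
        memtable_[key] = value;                 // sorted, volatile write buffer
        if (memtable_.size() >= kMemtableLimit)
            flush();                            // spill as an immutable sorted run
    }

private:
    void flush() {
        // The memtable is already sorted, so persisting it is a purely
        // sequential write; real systems write it as an SSTable file or,
        // on PM, as a persistent sorted structure.
        std::vector<std::pair<std::string, std::string>> run(
            memtable_.begin(), memtable_.end());
        sortedRuns_.push_back(std::move(run));  // new level-0 component
        memtable_.clear();
        // A background compaction thread would merge sortedRuns_ into
        // larger, higher-level components to keep reads efficient.
    }

    static constexpr std::size_t kMemtableLimit = 1024;
    std::map<std::string, std::string> memtable_;
    std::vector<std::vector<std::pair<std::string, std::string>>> sortedRuns_;
};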

However, the LSM-tree often suffers from the so-called write stall problem. That is, although the compaction thread runs in the background, it may block foreground write operations from inserting new records into the in-memory buffer if there are too many SSTables to be compacted.

It has been widely reported that the background compactions frequently block foreground write operations, which not only hurts the throughput but also significantly increases the tail latency of queries [32–34].
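A simplified sketch of the underlying mechanism is shown below; the thresholds and names are assumptions modeled after the level-0 slowdown and stop triggers commonly used in LevelDB-style engines, not the code of any particular system. Foreground writers check how many level-0 runs await compaction and either delay themselves or block until the compaction thread catches up.

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Simplified write-admission logic illustrating a write stall.
class WriteThrottle {
public:
    void beforeWrite() {
        std::unique_lock<std::mutex> lock(mu_);
        if (level0Runs_ >= kSlowdownTrigger && level0Runs_ < kStopTrigger) {
            // Soft stall: delay the writer to let compaction catch up.
            lock.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            lock.lock();
        }
        // Hard stall: block until compaction reduces the backlog.
        cv_.wait(lock, [this] { return level0Runs_ < kStopTrigger; });
    }

    void onRunFlushed()   { std::lock_guard<std::mutex> g(mu_); ++level0Runs_; }
    void onRunCompacted() { { std::lock_guard<std::mutex> g(mu_); --level0Runs_; } cv_.notify_all(); }

private:
    static constexpr std::size_t kSlowdownTrigger = 8;   // illustrative threshold
    static constexpr std::size_t kStopTrigger = 12;      // illustrative threshold
    std::size_t level0Runs_ = 0;
    std::mutex mu_;
    std::condition_variable cv_;
};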

The write stall problem occurs because of write amplification and the high latency of block device storage, which make the compaction thread too slow to keep up with foreground writes. Recently, numerous research efforts have tried to address this problem and improve various aspects of LSM-trees [27,32–41,54–56]. However, these previous works focus on the performance of LSM-trees on block device storage.

Given the high performance, byte-addressability, and persistency of PM, recent studies have redesigned LSM-trees to explore possible resolutions of the write amplification and write stall problems while maintaining the advantages of LSM-trees [57–59]. However, we find that the performance of these state-of-the-art PM-based key-value stores is far from satisfactory, and they do not sufficiently exploit the desirable properties of PM. Our performance study shows that the read performance of NoveLSM [58] is not significantly faster than that of PM-based LevelDB. SLM-DB [37] also has performance problems with random writes and sequential reads.

In this work, we design and implement ZipperDB, a novel key-value store that employs log-structured persistent SkipLists. Owing to its simple structure, the SkipList makes it relatively easy to leverage the byte-addressability of PM and to guarantee failure atomicity. Besides, SkipLists are widely used in concurrent production environments [60], as highly efficient lock-based and lock-free SkipList implementations are available [60–62].
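To illustrate why the structure lends itself to failure atomicity on PM, the sketch below shows a base-level insert under simplifying assumptions; it is not ZipperDB's actual code. The new node is made durable first and is then published by a single 8-byte pointer store that is persisted with libpmem's pmem_persist; the hypothetical pm_alloc allocator and the recovery of upper-level index pointers, which can be rebuilt after a crash, are omitted.

#include <cstddef>
#include <cstdint>
#include <libpmem.h>   // pmem_persist() from PMDK's libpmem

// Sketch of a failure-atomic insert into the base level of a persistent
// SkipList. pm_alloc() is a placeholder for a persistent allocator
// (e.g., libpmemobj); crash recovery of upper-level pointers is omitted.
struct Node {
    uint64_t key;
    uint64_t value;
    Node*    next;       // base-level successor pointer (8 bytes)
};

extern Node* pm_alloc(std::size_t size);   // hypothetical PM allocator

void insertAfter(Node* pred, uint64_t key, uint64_t value) {
    // 1. Allocate and initialize the new node in PM, then make it durable.
    Node* node = pm_alloc(sizeof(Node));
    node->key = key;
    node->value = value;
    node->next = pred->next;
    pmem_persist(node, sizeof(Node));

    // 2. Publish the node with a single 8-byte pointer store and persist it.
    //    The store is atomic on x86, so after a crash the list is either
    //    in the old state or the new state, never a corrupted one.
    pred->next = node;
    pmem_persist(&pred->next, sizeof(Node*));
}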

The main contributions of this work are as follows. First, ZipperDB employs persistent SkipLists at all levels of the LSM-tree to take advantage of the byte-addressability of PM. With persistent SkipLists, we propose a zipper compaction algorithm that merge-sorts two adjacent levels of SkipLists in place without blocking read operations. With the in-place merge sort, the zipper compaction algorithm effectively resolves the write amplification problem. Second, we implement ZipperDB using PMDK and evaluate its performance on Optane DCPMM. In our performance study, ZipperDB shows up to 3.8x higher random write throughput and 6.2x higher random read throughput than the state-of-the-art NoveLSM. Also, ZipperDB shows up to 4.7x higher random write throughput than SLM-DB. Such high performance gains come from the reduction of the write amplification factor by 1/7.
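The sketch below conveys only the in-place merge idea; it is based on the high-level description above, not on ZipperDB's published zipper compaction, and it omits persistence ordering, upper-level index pointers, and synchronization with concurrent writers. Each node of the smaller, newer SkipList is spliced into the base level of the next-level SkipList with a single pointer store, so concurrent readers always traverse a consistent list.

#include <cstdint>

// Base-level node of a (persistent) SkipList; upper index levels omitted.
struct Node {
    uint64_t key;
    Node*    next;
};

// Conceptual in-place merge of the base level of a smaller, newer component
// `upper` into the larger next-level SkipList starting at the head sentinel
// `lowerHead`. Both lists are sorted by key. Each splice is a single 8-byte
// pointer store; the real algorithm must additionally persist each update
// in a crash-consistent order.
void zipperMerge(Node* upper, Node* lowerHead) {
    Node* pos = lowerHead;                  // merge cursor in the lower level
    while (upper != nullptr) {
        Node* next = upper->next;           // detach the next node to splice
        while (pos->next != nullptr && pos->next->key < upper->key)
            pos = pos->next;                // advance to the insertion point
        upper->next = pos->next;            // link the spliced node forward
        pos->next = upper;                  // publish with one pointer store
        pos = upper;                        // keys are sorted, so keep going
        upper = next;
    }
}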