In this paper, we focus on reducing the overhead of the WAL when metadata is processed with an LSM-based KV store. Among the various attempts to process metadata efficiently, one approach that has been pursued consistently is to store metadata in an LSM-based KV store [7-11].
RocksDB
PM
Features of Metadata
To manage metadata effectively, these features must be considered carefully and the metadata must be handled in a way that suits them. For example, DM-WriteCache stores the dirty bits of blocks as metadata and writes them to storage asynchronously, because updating metadata synchronously lies on the critical path of writes and therefore significantly degrades write performance.
Metadata operations occur very frequently: among all operations in a file system, 50~80% access metadata, which is a very large share [1-4]. In systems that must serve many users and many files, such as distributed file systems, metadata operations occur even more frequently, becoming a bottleneck for the entire system and limiting scalability. Moreover, because metadata operations account for such a large share of all operations while the metadata itself is small, considerable write amplification (WA) occurs, which causes significant overhead.
In addition, since metadata contains critical information, maintaining its consistency is essential, and doing so adds considerable cost.
Using an LSM-based KV Store for Processing Metadata
Experiments were performed for sequential writes and random writes, and the memtable was tested with two of the four implementations provided by RocksDB. For sequential writes with the Skiplist memtable, the buffered WAL was about 750 times faster than synchronizing the WAL on every operation, and disabling the WAL entirely was more than 2,000 times faster. In other words, synchronizing the WAL on every operation incurs significant overhead.
For random writes, synchronizing the WAL on every operation likewise showed a large performance gap. With the vector memtable, synchronizing the WAL per operation again performed poorly in both sequential and random writes, and disabling the WAL was about three times faster than the buffered WAL. In an LSM-based KV store, frequent compaction occurs because all SST files must be kept sorted at every level except L0.
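For reference, the three WAL configurations compared above can be selected through RocksDB's standard options. The minimal C++ sketch below illustrates those settings; the database path and keys are placeholders, not the actual benchmark setup.

```cpp
#include <cassert>

#include <rocksdb/db.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Memtable implementation: SkipListFactory (the default) or VectorRepFactory,
  // two of the memtable types RocksDB provides.
  options.memtable_factory.reset(new rocksdb::VectorRepFactory());
  // The vector memtable does not support concurrent inserts.
  options.allow_concurrent_memtable_write = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_test", &db);  // placeholder path
  assert(s.ok());

  // (1) Synchronous WAL: every write is fsync'ed -- the slowest configuration.
  rocksdb::WriteOptions sync_wal;
  sync_wal.sync = true;

  // (2) Buffered WAL: the log is written but not fsync'ed on each operation.
  rocksdb::WriteOptions buffered_wal;  // defaults: sync = false, disableWAL = false

  // (3) No WAL at all: fastest, but memtable contents are lost on a crash.
  rocksdb::WriteOptions no_wal;
  no_wal.disableWAL = true;

  db->Put(sync_wal, "key1", "value1");
  db->Put(buffered_wal, "key2", "value2");
  db->Put(no_wal, "key3", "value3");

  delete db;
  return 0;
}
```

In the proposed system, the persistent memtable is what makes the third configuration safe, since durability no longer depends on the WAL.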
In this chapter, we propose a new LSM-based KV store structure that solves the remaining problems of processing metadata with an LSM-based KV store and handles metadata more efficiently.
Overall Architecture
Both the MFR and the persistent memtable are placed in the PM region, while the SSTables are stored on storage such as an SSD or HDD. The accompanying figure shows the structure of the chunk, tag, and slot, as well as the relationship between the MFR and the persistent memtable. Because the memtable resides in PM and is itself persistent, the WAL can be turned off, preventing the storage-media wear caused by WAL writes.
However, a drawback of the persistent memtable is that reads are slower than in the existing structure: PM has higher read latency than DRAM, so replacing the memtable with a persistent memtable lowers read performance. By introducing a persistent cache, the MFR, on top of the persistent memtable, we aim to compensate for this lower read performance and to achieve better write performance as well.
The reason is that if the MFR were allocated in a volatile medium such as DRAM, the consistency and durability of the data placed in the MFR would not be guaranteed, and the same problems as with the existing RocksDB memtable would reappear.
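As a rough, hypothetical illustration of this layout (not the actual implementation), the PM region could be exposed as a single DAX-mapped file and split between the MFR and the persistent memtable. The path, the MFR size, the structure names, and the use of plain mmap are assumptions; only the 64 MB memtable size comes from the evaluation setup described later.

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical split of one PM (DAX-mapped) region: the MFR cache at the
// front, the persistent memtable behind it; SSTables stay on SSD/HDD.
constexpr size_t kMfrBytes      = 16ull << 20;  // 16 MB for the MFR (assumed)
constexpr size_t kMemtableBytes = 64ull << 20;  // 64 MB memtable, as in the evaluation

struct PmLayout {
  uint8_t* mfr_base;       // start of the MFR slots/tags/chunks
  uint8_t* memtable_base;  // start of the persistent memtable
};

// Maps the PM-backed file (e.g. a file on a DAX filesystem) and carves it
// into the two sub-regions. Error handling is elided for brevity.
PmLayout MapPmRegion(const char* pm_path) {
  int fd = open(pm_path, O_RDWR);
  void* base = mmap(nullptr, kMfrBytes + kMemtableBytes,
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  auto* bytes = static_cast<uint8_t*>(base);
  return {bytes, bytes + kMfrBytes};
}
```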
MFR: Design Choice & Architecture
Additionally, read and write performance varies depending on which data structure is used for the memtable; by using the MFR we aim to provide good read and write performance regardless of the memtable structure underneath. To provide better read and write performance than a persistent memtable alone, the MFR uses simple, static indexing and a two-way array structure. The valid bit indicates whether the value in the corresponding chunk is valid, and the new bit indicates which of the upper and lower chunks was written most recently.
The dirty bit indicates that the data in the corresponding chunk has not yet been flushed to the memtable. Since the MFR key must be 7 bytes, a KV pair whose key is 7 bytes or shorter can use its key directly as the MFR key; otherwise, a separate conversion algorithm may be required. The key determines the position of the slot, i.e., the slot index, through a hash function or a modular operation.
When a conflict occurs, one of the upper and lower chunks must be flushed into the memtable below.
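A minimal sketch of one possible layout for these slots, tags, and chunks is shown below. The field widths, the chunk size, the slot count, and the modular index function are assumptions for illustration; only the 7-byte key, the two chunks per slot, and the valid/new/dirty bits come from the description above (the "new" bit is named fresh here because new is a C++ keyword).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// One tag per chunk: a 7-byte key plus the valid / new / dirty flags
// described above, packed into a single byte here for illustration.
struct MfrTag {
  uint8_t key[7];      // MFR key; longer user keys need a conversion step
  uint8_t valid : 1;   // chunk holds a valid value
  uint8_t fresh : 1;   // "new" bit: most recently written of the two chunks
  uint8_t dirty : 1;   // not yet flushed to the persistent memtable
  uint8_t _pad  : 5;
};

// Two-way slot: an upper and a lower chunk, each with its own tag.
struct MfrSlot {
  MfrTag  tag[2];
  uint8_t chunk[2][248];  // fixed-size value chunks (size is an assumption)
};

constexpr size_t kNumSlots = 1 << 16;  // illustrative MFR capacity

// Simple static indexing: the 7-byte key is interpreted as an integer and
// reduced modulo the number of slots (a hash function could be used instead).
size_t SlotIndex(const uint8_t key[7]) {
  uint64_t k = 0;
  std::memcpy(&k, key, 7);  // pack the 7 key bytes into an integer
  return k % kNumSlots;
}
```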
Mechanism
When data with the same key as an already-written chunk arrives, the new bit of the previously written chunk is changed to false and the new data is written to the other chunk; after writing, both the new bit and the valid bit of the newly written chunk are set to true. In another case, the data of the chunk whose key differs from the incoming key is flushed to the memtable, and both the valid bit and the new bit of that chunk are set to false. In a further case, the data of the chunk whose new bit is false is flushed to the memtable first, and the valid bit of that chunk is then set to false.
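Reusing the hypothetical MfrSlot/MfrTag layout from the earlier sketch, one plausible reading of this write path looks as follows; FlushToMemtable is a placeholder for the flush into the persistent memtable, and the exact case handling may differ from the actual design.

```cpp
#include <cstddef>
#include <cstring>

// Placeholder for the eviction of one chunk into the persistent memtable.
void FlushToMemtable(const MfrTag& tag, const uint8_t* chunk);

// Hypothetical write path for one slot. Same key: the older copy's "new" bit
// is cleared and the value is written into the other chunk. Different key:
// the chunk holding unflushed data for another key is evicted first.
// len is assumed to be <= sizeof(slot.chunk[0]).
void MfrPut(MfrSlot& slot, const uint8_t key[7], const uint8_t* value, size_t len) {
  int cur = slot.tag[0].fresh ? 0 : 1;  // chunk written most recently
  int other = 1 - cur;

  // If the chunk we are about to overwrite holds unflushed data for a
  // different key, evict it into the persistent memtable first.
  if (slot.tag[other].valid && slot.tag[other].dirty &&
      std::memcmp(slot.tag[other].key, key, 7) != 0) {
    FlushToMemtable(slot.tag[other], slot.chunk[other]);
    slot.tag[other].valid = 0;
    slot.tag[other].dirty = 0;
  }

  // Clear the "new" bit of the previously written chunk, then write the new
  // data into the other chunk and mark it valid, new, and dirty.
  slot.tag[cur].fresh = 0;
  std::memcpy(slot.tag[other].key, key, 7);
  std::memcpy(slot.chunk[other], value, len);
  slot.tag[other].valid = 1;
  slot.tag[other].fresh = 1;
  slot.tag[other].dirty = 1;
}
```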
Consistency & Recovery
In this case, it can be divided into two sub-cases according to the data requested to be written. Case 2-1 is the situation where the incoming data has the same key as data in a chunk that has already been written. In this case, it can be divided into three sub-cases according to the incoming data. Case 3-1 is the case where the incoming data has the same key as a chunk whose new bit is false. This state is identical to step 1 of Case 2-1, so the data is processed by the same procedure as from step 1 of Case 2-1. Case 3-2 is the situation where the incoming data has the same key as the chunk whose new bit is true.
In this case, the state is the same as the initial state of Case 2-1, so the data can be processed with the same mechanism as Case 2-1. In another case, the state is the same as the initial state of Case 2-2, so the subsequent steps can be handled in the same way as Case 2-2. From the memtable downward, the structure is the same as in the existing RocksDB, so those parts are handled by the existing mechanism.
If a system crash occurs while data is being flushed from the MFR to the persistent memtable and the flush does not complete, the dirty bit of the corresponding tag remains true.
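Under the same assumed layout, recovery can then be a single scan over the MFR that replays any eviction whose dirty bit is still set. This is a hedged sketch of that idea, not the exact recovery procedure; MfrSlot, MfrTag, and FlushToMemtable are from the earlier sketches.

```cpp
#include <cstddef>

// Hypothetical post-crash scan over the MFR. A non-fresh chunk whose dirty
// bit is still set was being evicted when the crash hit, so its flush is
// simply replayed; replaying is safe because the flush writes the same key
// into the memtable again. Freshly written dirty chunks stay cached and will
// be flushed later, exactly as before the crash.
void MfrRecover(MfrSlot* slots, size_t num_slots) {
  for (size_t i = 0; i < num_slots; ++i) {
    for (int c = 0; c < 2; ++c) {
      MfrTag& t = slots[i].tag[c];
      if (t.valid && t.dirty && !t.fresh) {
        FlushToMemtable(t, slots[i].chunk[c]);
        t.dirty = 0;
        t.valid = 0;
      }
    }
  }
}
```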
Environment Setup
Basic Performance
In sequential writes, MFR performs best, followed by vector, and the two are similar. Compared to the other benchmark sets, sequential write is the most favorable case for writes; the performance is nevertheless slightly lower than MFR, presumably because the total data size exceeds the 64 MB memtable size, so memtable flushes triggered in the background degraded performance. In the case of vector, on the other hand, performance was so poor that no results could be obtained even after a day of execution.
When random writes occur in the MFR, requests that map to the same slot may arrive within a short period of time. Vector, on the other hand, handles sequential and random writes with the same mechanism, simply appending data at the end, so its performance is the same in both cases. When random reads follow random writes, the overall performance is lower than when random reads follow sequential writes, but the trend of the results is the same.
That is, MFR performs about 4.2× better than RocksDB with Skiplist, while RocksDB with vector is so slow that results could not be obtained even after more than a day. (Figure: (a) sequential write & random read; (b) random write & random read.)
Various Value Sizes
Sync
For example, in the case of Ceph, the type of metadata can be distinguished by its key. Depending on the conversion method, KV pairs may be stored in the MFR in a partially sorted form based on the key. However, the overhead of the WAL used to guarantee data consistency in an LSM-based KV store is too great.
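Since the MFR key is fixed at 7 bytes, keys such as Ceph's prefixed metadata keys need a conversion step when they are longer. One possible, purely illustrative scheme keeps a short prefix so that entries stay grouped by metadata type and folds the rest into the remaining bytes; the 4-byte prefix length and the FNV-1a style hash are assumptions, not the paper's conversion algorithm.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical conversion of an arbitrary-length key (e.g. a Ceph metadata
// key with a type prefix) into the 7-byte MFR key.
std::array<uint8_t, 7> ToMfrKey(const std::string& user_key) {
  std::array<uint8_t, 7> out{};
  if (user_key.size() <= 7) {  // short keys can be used as-is
    for (size_t i = 0; i < user_key.size(); ++i)
      out[i] = static_cast<uint8_t>(user_key[i]);
    return out;
  }
  // Longer keys: keep a 4-byte prefix (preserves grouping by metadata type)
  // and fold the tail into the remaining 3 bytes with an FNV-1a style hash.
  for (size_t i = 0; i < 4; ++i) out[i] = static_cast<uint8_t>(user_key[i]);
  uint32_t h = 2166136261u;
  for (size_t i = 4; i < user_key.size(); ++i)
    h = (h ^ static_cast<uint8_t>(user_key[i])) * 16777619u;
  out[4] = static_cast<uint8_t>(h);
  out[5] = static_cast<uint8_t>(h >> 8);
  out[6] = static_cast<uint8_t>(h >> 16);
  return out;
}
```

How much of the original key is preserved versus hashed determines how "sorted" the entries in the MFR appear, which is the trade-off mentioned above.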
When the WAL is synchronized on every operation to ensure consistency in the existing RocksDB, the proposed system performs about 20∼43× better.

References
… Arpaci-Dusseau, "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14).
… Patt, "Metadata Update Performance in File Systems," in The First Symposium on Operating Systems Design and Implementation (OSDI 94).
… Grider, "MDHIM: A Parallel Key/Value Framework for HPC," in 7th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 15).
… Available: https://www.usenix.org/conference/fast20/presentation/yang
[15] db_bench, https://github.com/facebook/rocksdb/wiki/Benchmarking-tools.
… Amvrosiadis, "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution," in Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 19).