Lifetime of Flash-based SSDs

This document presents a policy that classifies data on solid state drives (SSDs) by lifetime in order to reduce the write amplification of flash-based storage. The recently proposed multi-stream technique is limited in that the data streams must be provided by the host. The policy presented here instead exploits the data movement patterns of flash memory blocks, so that data can be classified effectively with only negligible overhead.

It is implemented at the storage device level, eliminating the need for any modification on the host. Since flash memory performs out-of-place updates, garbage collection (GC), which causes additional writes, is unavoidable. Several studies have sought to reduce the additional writes caused by GC, for example through hot/cold or semantic classification [1-6].
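The additional writes caused by GC are commonly quantified by the write amplification factor (WAF). The expression below is the usual definition, written with symbols introduced here for illustration: $W_{\mathrm{host}}$ is the number of pages written by the host and $W_{\mathrm{GC}}$ is the number of pages copied internally by GC.

```latex
\[
  \mathrm{WAF} \;=\; \frac{W_{\mathrm{host}} + W_{\mathrm{GC}}}{W_{\mathrm{host}}}
\]
```

As an illustrative case, if the host writes 100 pages and GC copies 50 valid pages while reclaiming blocks, then WAF = (100 + 50) / 100 = 1.5. A WAF of 1 means that GC copied nothing.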

We achieve this by simply keeping track of the migration count, which is the number of copy operations performed on data. This technique can be implemented at the device firmware level and requires no change to the host system. We show that segregating data into separate blocks based on our policy reduces internal data copies and thus improves performance and lifetime of the device.

We compare MiDA with multi-stream [5,6], an advanced technique that has been shown to reduce the GC cost of SSD by classifying streams.

Flash Memory

Greedy GC Algorithm

Flash Cell Lifespan

Data hotness classification

Multi-stream

SSDs support only a small number of streams due to device hardware limitations. Virtual streams with similar life expectancy are therefore mapped to the same physical stream via a clustering algorithm such as k-means.
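The mapping from many virtual streams to a few physical streams can be illustrated with a small one-dimensional k-means. This is only a sketch under stated assumptions: the function `kmeans_1d`, the stream names, and the lifetime estimates are hypothetical, and the quantile-based initialization is chosen here for determinism rather than taken from any cited work.

```python
def kmeans_1d(values, k, iters=50):
    """Cluster scalar values into k groups with plain Lloyd-style k-means."""
    ordered = sorted(values)
    # Deterministic initialization: pick k evenly spaced quantiles.
    centroids = [ordered[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical lifetime estimates (in writes) for eight virtual streams.
virtual_lifetimes = {
    "vs0": 10, "vs1": 12, "vs2": 11,
    "vs3": 500, "vs4": 520,
    "vs5": 9000, "vs6": 9500, "vs7": 8800,
}
labels, _ = kmeans_1d(list(virtual_lifetimes.values()), k=3)
# Virtual stream name -> physical stream index.
physical_stream = dict(zip(virtual_lifetimes, labels))
```

With three physical streams, the short-lived, medium-lived, and long-lived virtual streams each end up sharing one physical stream.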

Mitigating GC overhead

In this section we present MiDA (Migration count based Data Age Classification), the data classification technique we propose. Similar to the multi-stream approach, the goal of MiDA is to cluster data of the same (or similar) age in the same block. We achieve this goal by tracking the number of migrations of all data, which we define as the number of copy operations performed on each of these data.

Since this technique can be implemented at the device firmware level and does not require any modifications to the host system, it is easy to deploy. Through experiments, we find that the technique is effective in reducing internal data copies and thus in improving device performance and lifetime.

Migration Count and Separation Group

The graph shows the average lifetime of data with the same migration count when it was invalidated. For example, for all data with a migration count of 20, we see that the average number of writes that occurred between the initial write and the invalidation, that is, the lifetime of the data, is roughly 11 million writes. Each group is associated with a corresponding number of migrations and all data is assigned to a group accordingly.

Since the migration count indicates how cold the data is, by placing colder data together, we can predict that fewer pages in the block will be invalidated.
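The idea of assigning data to groups by migration count can be sketched as follows. This is a minimal illustration, not the firmware implementation: the names `MidaPlacer` and `separation_group` are hypothetical, and two details are assumptions made here, namely that the last group absorbs every migration count at or above `n_groups - 1`, and that a host write resets the count to zero because it creates fresh data.

```python
def separation_group(migration_count, n_groups):
    """Assumed mapping: one group per migration count, with the last group
    taking every count >= n_groups - 1."""
    return min(migration_count, n_groups - 1)


class MidaPlacer:
    """Tracks per-page migration counts and picks the group to write into."""

    def __init__(self, n_groups):
        self.n_groups = n_groups
        self.migration_count = {}  # logical page number -> GC copies so far
        # One open block per separation group, modelled as a list of page numbers.
        self.open_block = [[] for _ in range(n_groups)]

    def host_write(self, lpn):
        # Assumption: freshly written data has not been migrated yet.
        self.migration_count[lpn] = 0
        return self._place(lpn)

    def gc_copy(self, lpn):
        # GC found this page still valid and copies it: one more migration.
        self.migration_count[lpn] += 1
        return self._place(lpn)

    def _place(self, lpn):
        group = separation_group(self.migration_count[lpn], self.n_groups)
        self.open_block[group].append(lpn)
        return group


placer = MidaPlacer(n_groups=3)
history = [placer.host_write(7), placer.gc_copy(7), placer.gc_copy(7), placer.gc_copy(7)]
```

In this sketch a page that keeps surviving GC moves toward higher groups, so pages that have proven cold share blocks with other cold pages.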

MiDA n

In such cases, different approaches to selecting a separation group based on the migration rate may be more feasible.

Implementation Details

For our evaluation experiments, we use two environments: an FTL simulator that we developed and FEMU [22]. By default, FEMU selects the oldest block (the block with the longest elapsed time since it was written) as the GC victim. All experiments start after a sequential write to the entire disk space that fills every page, so that GC occurs from the start of the test load.
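The difference between choosing the oldest block and the greedy policy, which picks the block with the fewest valid pages, can be sketched as below. The `Block` fields and function names are hypothetical and do not reflect FEMU's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    valid_pages: int  # pages GC would have to copy out before erasing
    written_at: int   # logical timestamp of when the block was filled


def pick_oldest(blocks):
    """Oldest-first victim selection: longest elapsed time since writing."""
    return min(blocks, key=lambda b: b.written_at)


def pick_greedy(blocks):
    """Greedy victim selection: fewest valid pages, hence fewest copies."""
    return min(blocks, key=lambda b: b.valid_pages)


candidates = [
    Block(block_id=0, valid_pages=40, written_at=5),
    Block(block_id=1, valid_pages=3, written_at=9),
    Block(block_id=2, valid_pages=60, written_at=1),
]
```

Here the oldest-first policy reclaims block 2 and must copy 60 pages, while the greedy policy reclaims block 1 and copies only 3.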

All workloads are generated by running the benchmarks on a real SSD device while recording the I/O traces, which are then replayed in the experiments using nvme-cli [26]. In the Zipfian workload, the write rate of each page on the device is inversely proportional to that page's rank. While this is not the worst case for MiDA, we expect it to be less effective.
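A write trace with the Zipfian property described above, where a page's write rate is inversely proportional to its rank, can be generated along the following lines. This is a sketch for illustration only; the function names, the seed, and the trace size are assumptions and not the generator used in the experiments.

```python
import random
from itertools import accumulate


def zipfian_weights(num_pages):
    """Weight of the page at rank r (1-based) is 1 / r."""
    return [1.0 / rank for rank in range(1, num_pages + 1)]


def zipfian_writes(num_pages, num_writes, seed=0):
    """Return a list of page indices; index 0 is the hottest page (rank 1)."""
    rng = random.Random(seed)
    cumulative = list(accumulate(zipfian_weights(num_pages)))
    return rng.choices(range(num_pages), cum_weights=cumulative, k=num_writes)


trace = zipfian_writes(num_pages=100, num_writes=10_000)
```

The page at rank 1 is written about ten times as often as the page at rank 10, so a few pages receive most of the updates.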

To observe the effects of our policy in more realistic environments, we use workload A of the YCSB benchmark [28], a write-intensive benchmark among the YCSB workloads, on MySQL and RocksDB, and the TPC-C benchmark [29] on MySQL to generate the workload. We also apply the multi-stream technique to the real-world workloads and compare its results with those of MiDA. We expect that multi-stream, especially when supported by the application, will yield the best result among the different hotness classification methods.

We therefore compare MiDA only with multi-stream here and leave the comparison with other techniques for future work. First, we allocate a separate stream for the sequential write that precedes the workload to get a better result. For workloads generated with RocksDB, this is straightforward because RocksDB has natively supported multi-stream since version 5.10.2.

Legacy RocksDB assigns Stream 1 to the WAL file, Stream 2 to the Level0 SSTable, and Stream 3 to the Level1 SSTable. MySQL, however, does not yet support multi-stream, so we use the fcntl call to allocate streams on a per-file basis. In the case of YCSB there was no choice, as it uses only one table for the entire workload.
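One way Linux exposes per-file lifetime hints through fcntl is the `F_SET_RW_HINT` command (kernel 4.13 and later). The sketch below shows how such a hint might be set; whether the experiments used exactly this command is an assumption, the helper names `pack_hint` and `set_write_hint` are hypothetical, and whether a hint reaches the device as a stream identifier depends on the kernel version and driver support.

```python
import struct

# Constants from the Linux UAPI header <linux/fcntl.h>.
F_LINUX_SPECIFIC_BASE = 1024
F_SET_RW_HINT = F_LINUX_SPECIFIC_BASE + 12

RWH_WRITE_LIFE_NOT_SET = 0
RWH_WRITE_LIFE_NONE = 1
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_MEDIUM = 3
RWH_WRITE_LIFE_LONG = 4
RWH_WRITE_LIFE_EXTREME = 5


def pack_hint(hint):
    """The kernel expects a pointer to a 64-bit unsigned integer."""
    return struct.pack("=Q", hint)


def set_write_hint(fd, hint):
    """Attach a write-lifetime hint to the inode behind an open descriptor."""
    import fcntl  # imported lazily: available on Unix-like systems only
    fcntl.fcntl(fd, F_SET_RW_HINT, pack_hint(hint))
```

A caller would open each data file and invoke, for example, `set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT)` for files expected to be overwritten soon; the call raises `OSError` on kernels that do not support the command.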

Analysis of Separation Group Effect

In the experiment on the Zipfian workload in Figure 5, using just two separation groups gave more than 66%. This is crucial, since two separation groups can be maintained with a conventional bitmap and still give a drastic performance boost. Applying MiDA2 to the mentioned workload gives a 52% decrease in WAF, which is a better result than multi-stream (denoted as MS in the figure), and WAF continues to decrease with more separation groups.

Similar to the experiments in Section 6.1, we examine how the number of separation groups affects the throughput of an SSD. Again, two separation groups are enough for MiDA to perform well, already exceeding the multi-stream throughput. We did not expect this result, especially for the workload created by RocksDB, which supports multi-stream natively.

For completeness, Figures 16 to 19 contrast the results of the multi-stream and MiDA experiments for the two MySQL workloads.

Extending a state-of-the-art NVMe SSD emulator, we show the effectiveness of MiDA for various I/O workloads compared to the vanilla and multi-stream schemes.

Kuo, “An Adaptive Stripe Architecture for Flash Memory Storage Systems of Embedded Systems,” in Proceedings.

Sundararaman, “Don't stack your log on my log,” in 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14).
Balakrishnan, “Autostream: Automatic stream management for multi-streamed ssds,” in Proceedings of the 10th ACM International Systems and Storage Conference, ser.
Handling flash streams in the file system,” in 16th USENIX Conference on File and Storage Technologies (FAST 18).
Kim, “vstream: Virtual Stream Management for Multi-Streamed SSDs,” in 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18).
Cho, “F2fs: A New File System for Flash Storage,” in 13th USENIX Conference on File and Storage Technologies (FAST 15).
Xie, “Remap-ssd: Securely and efficiently using SSD address remapping to eliminate duplicate writes,” in 19th USENIX Conference on File and Storage Technologies (FAST 21).

Available: https://www.usenix.org/conference/fast21/presentation/zhou
[26] “nvme-cli,” https://github.com/linux-nvme/nvme-cli.
Sears, “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM Symposium on Cloud Computing, ser.