
A Study on the Effect of Multithreading and Concurrency Level on PM and DRAM Hybrid


Testing the performance of the PM-based tiered memory system with the newly implemented MCAM policy yielded a 38% improvement over the existing system's policy. According to the microbenchmark we ran, local DRAM access shows 2.5 times better performance than remote access. With the MCAM policy in place, more migrations were observed than under the existing policy.

To analyze the results with respect to the number of migrations, we additionally implement a forced migration policy that demotes pages at a fixed ratio and compare its performance with that of the MCAM policy.

Native Linux Memory Management

Persistent Memory

As this result shows, DRAM outperforms PM by 6.6× in read bandwidth and by 2.7× in read latency.
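
A minimal sketch of the kind of microbenchmark that produces such numbers is shown below. It times a sequential strided read (bandwidth) and a dependent pointer chase (latency) over one buffer; the buffer size, stride, and iteration counts are illustrative choices, not the thesis's actual benchmark, and the buffer would be placed on a DRAM node or on a PM node exposed as a NUMA node (e.g. with numactl --membind) to compare the two tiers.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

#define BUF_SIZE   (1UL << 30)           /* 1 GiB working set (illustrative)  */
#define STRIDE     64                    /* one cache line per access         */
#define CHASE_OPS  (10 * 1000 * 1000UL)  /* dependent loads for latency       */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    unsigned char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    memset(buf, 1, BUF_SIZE);            /* fault all pages in before timing  */

    /* Sequential strided read: approximates read bandwidth. */
    volatile uint64_t sink = 0;
    double t0 = now_sec();
    for (size_t off = 0; off < BUF_SIZE; off += STRIDE)
        sink += buf[off];
    double bw_gbs = (double)BUF_SIZE / (now_sec() - t0) / 1e9;

    /* Pointer chase over a single random cycle of cache-line slots:
     * every load depends on the previous one, approximating read latency. */
    size_t nslots = BUF_SIZE / STRIDE;
    size_t step = STRIDE / sizeof(size_t);
    size_t *next = (size_t *)buf;
    for (size_t i = 0; i < nslots; i++)
        next[i * step] = i * step;
    srand(7);
    for (size_t i = nslots - 1; i > 0; i--) {   /* Sattolo shuffle: one cycle */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i * step];
        next[i * step] = next[j * step];
        next[j * step] = tmp;
    }
    size_t pos = 0;
    t0 = now_sec();
    for (unsigned long i = 0; i < CHASE_OPS; i++)
        pos = next[pos];
    double lat_ns = (now_sec() - t0) / CHASE_OPS * 1e9;

    printf("read bandwidth ~ %.2f GB/s, read latency ~ %.1f ns (pos=%zu, sink=%llu)\n",
           bw_gbs, lat_ns, pos, (unsigned long long)sink);
    free(buf);
    return 0;
}
```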

Tiered Memory System

However, with the advent of tiered memory systems, systems containing memories with different performance characteristics appeared, and research on page migration for such systems followed [8-14]. In a tiered memory system, performance varies greatly depending on the type of memory and on which NUMA node it resides. Existing work migrates the most frequently accessed pages by maintaining a history of recent accesses to each page.
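
As a rough illustration of such history-based tracking (the structure and threshold below are hypothetical, not taken from any cited system), each page can carry a small bitmap of recent scan periods, and the pages with the highest counts become promotion candidates:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-page access history: a shift register of the last
 * 8 scan periods (not the exact structure of any cited system). */
struct page_hist {
    uint8_t recent;              /* bit set => accessed in that scan period */
};

/* Called once per scan period for every tracked page. */
static void hist_update(struct page_hist *h, bool accessed)
{
    h->recent = (uint8_t)((h->recent << 1) | (accessed ? 1u : 0u));
}

/* Hotness = number of recent periods with an access; the hottest pages
 * become candidates for migration to the fast tier. */
static int hist_hotness(const struct page_hist *h)
{
    return __builtin_popcount(h->recent);
}

int main(void)
{
    struct page_hist hot = { 0 }, cold = { 0 };
    for (int period = 0; period < 8; period++) {
        hist_update(&hot, true);             /* touched every period */
        hist_update(&cold, period == 0);     /* touched only once    */
    }
    printf("hot page hotness=%d, cold page hotness=%d\n",
           hist_hotness(&hot), hist_hotness(&cold));
    /* A policy would promote pages whose hotness exceeds a threshold. */
    return 0;
}
```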

Because page migration is itself an overhead operation, it is important to perform it efficiently. Therefore, there are studies that optimize Linux's inefficient migration path through multithreading and concurrent execution. Because page migration is essentially a memory copy, poorly chosen migrations can even degrade performance.
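
The sketch below illustrates the general idea of parallelizing the copy step of a migration with a pool of threads, each copying a slice of a page batch. The batch size, thread count, and structure names are assumptions made for illustration and do not reproduce any particular kernel patch:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define NPAGES     1024          /* pages in one migration batch (illustrative) */
#define NTHREADS   4             /* copy threads (illustrative)                 */

struct copy_job {
    unsigned char (*src)[PAGE_SIZE];
    unsigned char (*dst)[PAGE_SIZE];
    int first, count;            /* slice of the batch handled by this thread   */
};

static void *copy_worker(void *arg)
{
    struct copy_job *job = arg;
    /* Each worker copies its own contiguous slice of the page batch,
     * so the per-page copies proceed concurrently. */
    for (int i = job->first; i < job->first + job->count; i++)
        memcpy(job->dst[i], job->src[i], PAGE_SIZE);
    return NULL;
}

int main(void)
{
    unsigned char (*src)[PAGE_SIZE] = malloc((size_t)NPAGES * PAGE_SIZE);
    unsigned char (*dst)[PAGE_SIZE] = malloc((size_t)NPAGES * PAGE_SIZE);
    if (!src || !dst) return 1;
    memset(src, 0xab, (size_t)NPAGES * PAGE_SIZE);

    pthread_t tid[NTHREADS];
    struct copy_job job[NTHREADS];
    int per = NPAGES / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        job[t] = (struct copy_job){ src, dst, t * per,
                                    (t == NTHREADS - 1) ? NPAGES - t * per : per };
        pthread_create(&tid[t], NULL, copy_worker, &job[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("copied %d pages with %d threads, dst[0][0]=0x%x\n",
           NPAGES, NTHREADS, dst[0][0]);
    free(src); free(dst);
    return 0;
}
```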

In a specific workload, locality is clearly visible and frequently accessed pages are easy to distinguish, so an optimal page set can be formed by migrating those pages. However, for a workload that accesses all pages equally, even if the most recently accessed pages are migrated into fast memory, other pages will be accessed soon afterwards, so the effect of migration can be negligible. Therefore, adjusting the migration frequency with such workloads in mind is also an important research area.

AutoNUMA

The existing Linux kernel provides the AutoNUMA mechanism to increase local access in NUMA systems.

AutoTiering

In addition, as a further optimization, AutoTiering modifies the existing behavior, under which no page migration occurs once all of the fast memory is in use, so that less frequently accessed pages are demoted from fast memory to slow memory and frequently accessed pages are promoted from slow memory to fast memory. Furthermore, a technique is added that performs migration asynchronously, using space reserved for the page allocations that occur on demand during migration.

Nimble Page Management

Thermostat

AMP

In this study, LFU and random policies are added alongside LRU, and performance is measured under different workloads, showing that no single policy performs best on all of them. The system therefore applies the most appropriate migration policy to each of the workloads running concurrently and in parallel.
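
A small sketch of this per-workload selection idea follows; the policy set, the workloads, and the hit-ratio numbers are dummy values chosen for illustration, not measurements from AMP or from this thesis:

```c
#include <stdio.h>

/* Illustrative per-workload policy selection in the spirit of AMP. */
enum mig_policy { POLICY_LRU, POLICY_LFU, POLICY_RANDOM, POLICY_COUNT };
static const char *policy_name[POLICY_COUNT] = { "LRU", "LFU", "random" };

#define NWORKLOADS 3
/* Pretend profiling result: fast-tier hit ratio per (workload, policy). */
static const double hit_ratio[NWORKLOADS][POLICY_COUNT] = {
    { 0.82, 0.71, 0.40 },   /* workload 0: strong recency   -> LRU wins   */
    { 0.55, 0.88, 0.41 },   /* workload 1: skewed frequency -> LFU wins   */
    { 0.34, 0.35, 0.33 },   /* workload 2: uniform access   -> ~no winner */
};

/* Pick, for each concurrently running workload, whichever policy
 * profiled best for it; different workloads can get different policies. */
static enum mig_policy choose_policy(int w)
{
    enum mig_policy best = POLICY_LRU;
    for (enum mig_policy p = POLICY_LRU + 1; p < POLICY_COUNT; p++)
        if (hit_ratio[w][p] > hit_ratio[w][best])
            best = p;
    return best;
}

int main(void)
{
    for (int w = 0; w < NWORKLOADS; w++)
        printf("workload %d -> %s\n", w, policy_name[choose_policy(w)]);
    return 0;
}
```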

Cori

Parallel Processing Applications on Persistent Memory

Lack of Optimization of State of the Art Systems

In this section, we describe the design of the tiered memory system and the newly added Multithreading and Concurrency Aware Migration (MCAM) policy.

Page Migration System

Multithreading and Concurrency Aware Migration

The NUMA Protection Maker is a module that wakes up periodically at a predetermined interval and applies protection to a range of PTEs of the running process. If the process accesses a protected PTE, a fault occurs, and this fault is used to decide whether the page should be migrated. The NUMA Protection Maker also updates the shared page history by identifying each page's access status and its multithreading and concurrency levels.
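
A rough userspace analogue of this mechanism (not the kernel module itself) can be sketched with mprotect() and a SIGSEGV handler: a scanner periodically revokes access to a region, and the fault raised on the next access reveals that the page was touched, after which its permissions are restored:

```c
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 16
static long page_size;
static unsigned char *region;
static volatile sig_atomic_t faulted_page = -1;   /* index of last touched page */

/* Fault handler: the userspace stand-in for a NUMA hinting fault. Record
 * which page was accessed and restore its permissions so the access replays. */
static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    faulted_page = (sig_atomic_t)((addr - (uintptr_t)region) / page_size);
    mprotect((void *)(addr & ~(uintptr_t)(page_size - 1)),
             page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    struct sigaction sa = { 0 };
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    for (int round = 0; round < 3; round++) {
        /* "NUMA Protection Maker" step: periodically revoke access. */
        mprotect(region, REGION_PAGES * page_size, PROT_NONE);

        /* The running process touches one page; the resulting fault is
         * how the scanner learns that this page is being accessed. */
        region[(round % REGION_PAGES) * page_size] = 1;

        printf("round %d: access detected on page %d\n",
               round, (int)faulted_page);
        sleep(1);    /* stand-in for the scanner's wake-up period */
    }
    return 0;
}
```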

Because the LSP list manages pages by shared level, the page to demote in the upper tier can be found quickly. When a promotion occurs and the upper-tier memory is full, the MCAM Fault Handler can simply select the page to demote from the LSP list. It then checks the shared level of the faulting page and determines whether migration should be performed, as sketched below.
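
The following sketch illustrates this victim-selection path, including the shared-level comparison described in the next paragraph; the structure and function names are assumptions made for illustration, not the thesis's actual data structures:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_SHARED_LEVEL 8

/* Illustrative page descriptor; field names are hypothetical. */
struct page_desc {
    int shared_level;            /* multithreading/concurrency-derived level */
    struct page_desc *next;
};

/* lsp[l] holds the fast-tier pages currently at shared level l. */
static struct page_desc *lsp[MAX_SHARED_LEVEL + 1];

static void lsp_insert(struct page_desc *p)
{
    p->next = lsp[p->shared_level];
    lsp[p->shared_level] = p;
}

/* Cheapest victim: the head of the lowest non-empty level list. */
static struct page_desc *lsp_lowest(void)
{
    for (int l = 0; l <= MAX_SHARED_LEVEL; l++)
        if (lsp[l])
            return lsp[l];
    return NULL;
}

/* Fault-handler decision when the fast tier is full: promote the faulting
 * page only if it is more shared than the least-shared resident page,
 * demoting that resident page to make room. */
static bool promote_if_hotter(struct page_desc *faulting)
{
    struct page_desc *victim = lsp_lowest();
    if (!victim || faulting->shared_level <= victim->shared_level)
        return false;            /* keep the faulting page in the slow tier */
    /* demote(victim); promote(faulting); -- the migration itself is omitted */
    return true;
}

int main(void)
{
    struct page_desc resident   = { .shared_level = 1 };
    struct page_desc hot_fault  = { .shared_level = 5 };
    struct page_desc cold_fault = { .shared_level = 0 };
    lsp_insert(&resident);
    printf("hot fault promoted?  %d\n", promote_if_hotter(&hot_fault));
    printf("cold fault promoted? %d\n", promote_if_hotter(&cold_fault));
    return 0;
}
```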

If the faulting page's shared level is high and the page must be promoted to the upper tier while the upper tier is full, the page with the lowest shared level in the upper-tier LSP list is looked up and its shared level is compared with that of the faulting page. The following section summarizes the implementation of the base system and of the MCAM policy in detail: first the implementation of AutoTiering, the base system to which the newly added MCAM policy is applied, is explained, and then the implementation of the MCAM policy is described.

Figure 4: MCAM policy overview

Base System

The biggest flaw of the OPM policy is that a demotion must be performed on demand before the page being promoted can proceed. This on-demand demotion has the disadvantage that the promotion cannot complete while the demotion is in progress. Therefore, to overcome this shortcoming, some space in the fast tier is reserved for allocating promoted pages.

When a promotion occurs, the page is first placed in the reserved space so that the application can continue quickly, and the corresponding demotion is processed asynchronously; this avoids the long delay that the existing OPM method can incur when the application accesses a specific page.
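
A simplified skeleton of this flow is sketched below; the pool size, helper names, and the use of a single worker thread are assumptions for illustration, not the actual AutoTiering implementation. Promotion takes a slot from the reserved pool and returns immediately, while the matching demotion is queued to a background worker that later returns the slot to the pool:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define RESERVED_PAGES 64        /* reserved fast-tier slots (illustrative) */

static int reserved_free = RESERVED_PAGES;   /* free slots in the reserved pool */
static int pending_demotions = 0;            /* demotions queued for the worker */
static bool stop_worker = false;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work = PTHREAD_COND_INITIALIZER;

/* Fast path, called from the fault handler: place the promoted page in a
 * reserved slot and hand the demotion to the background worker, so the
 * faulting application is not blocked on a page copy. */
static bool promote_fast_path(void)
{
    bool ok = false;
    pthread_mutex_lock(&lock);
    if (reserved_free > 0) {
        reserved_free--;          /* promoted page occupies a reserved slot */
        pending_demotions++;      /* a victim must be demoted eventually    */
        pthread_cond_signal(&work);
        ok = true;
    }
    pthread_mutex_unlock(&lock);
    return ok;                    /* false: pool exhausted, caller must wait */
}

/* Background worker: performs the deferred demotions and returns the
 * freed slots to the reserved pool. */
static void *demotion_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!stop_worker) {
        while (pending_demotions == 0 && !stop_worker)
            pthread_cond_wait(&work, &lock);
        while (pending_demotions > 0) {
            pending_demotions--;
            pthread_mutex_unlock(&lock);
            usleep(1000);         /* stand-in for the actual page copy       */
            pthread_mutex_lock(&lock);
            reserved_free++;      /* slot becomes available again            */
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, demotion_worker, NULL);
    for (int i = 0; i < 100; i++)
        while (!promote_fast_path())
            usleep(2000);         /* pool temporarily empty: wait for worker */
    pthread_mutex_lock(&lock);
    stop_worker = true;
    pthread_cond_signal(&work);
    pthread_mutex_unlock(&lock);
    pthread_join(tid, NULL);
    printf("reserved slots free at exit: %d\n", reserved_free);
    return 0;
}
```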

MCAM Policy Implementation

Multithreading and Concurrency Level Awareness

The threshold can easily be changed from user level through the proc file system, so it can be adjusted dynamically according to the multithreading and concurrency level of the current system. To recap the overall flow: protection is periodically applied to the PTEs of the currently running process, and at the same time the shared bit is set based on whether each page is accessed and on its multithreading and concurrency levels.
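
A minimal sketch of this decision logic is given below. The proc path is hypothetical, and whether the two levels are combined or compared separately against the threshold is not specified here, so the sketch simply checks each against the same threshold:

```c
#include <stdbool.h>
#include <stdio.h>

/* Read a runtime-tunable threshold from a proc entry; the path below is
 * hypothetical, and the fallback is used when the entry does not exist. */
static int read_threshold(const char *path, int fallback)
{
    int value = fallback;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &value) != 1)
            value = fallback;
        fclose(f);
    }
    return value;
}

/* Set the shared bit when the page's observed multithreading or
 * concurrency level reaches the threshold; such pages are preferred
 * for promotion and avoided as demotion victims. */
static bool page_is_shared(int multithreading_level, int concurrency_level,
                           int threshold)
{
    return multithreading_level >= threshold || concurrency_level >= threshold;
}

int main(void)
{
    int threshold = read_threshold("/proc/sys/vm/mcam_shared_threshold", 2);
    printf("threshold=%d, level (1,1) -> %d, level (4,3) -> %d\n",
           threshold,
           page_is_shared(1, 1, threshold),
           page_is_shared(4, 3, threshold));
    return 0;
}
```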

When a protected page of the process is accessed, a NUMA hinting fault occurs, the shared history is checked, and the migration is performed. This chapter analyzes the performance of the newly introduced Multithreading and Concurrency Aware Migration (MCAM) policy. In addition, it presents the verification process, in which several additional policies are used to confirm the experimental results, together with a performance analysis in terms of the number of demotions.

Experimental Setup

Policies Used in Experiment

The MCAM policy sets the shared bit by comparing the multithreading and concurrency level against a configured threshold. FD (Force Demotion) is a policy that forces demotion at a fixed rate when a NUMA hinting fault occurs, regardless of access history. FD-N% in the experimental results therefore means that N% of all NUMA hinting faults generate a migration.
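
A minimal sketch of FD-N% behavior follows (the helper name is illustrative): on every NUMA hinting fault, demote with probability N/100, ignoring the access history entirely:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative FD-N% policy: on each NUMA hinting fault, force a
 * demotion with probability ratio_percent/100, ignoring access history. */
static bool fd_should_demote(int ratio_percent)
{
    return (rand() % 100) < ratio_percent;
}

int main(void)
{
    srand(1);
    int faults = 1000000, demotions = 0;
    int ratio_percent = 25;                   /* FD-25% */
    for (int i = 0; i < faults; i++)
        if (fd_should_demote(ratio_percent))
            demotions++;                      /* demote_page(...) in the real system */
    printf("FD-%d%%: %d demotions out of %d faults (%.1f%%)\n",
           ratio_percent, demotions, faults, 100.0 * demotions / faults);
    return 0;
}
```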

CMCAM: the CMCAM (Conservative Multithreading and Concurrency Aware Migration) policy uses AutoTiering's OPM policy as is, except that when a page is demoted it first selects a page with a high multithreading and concurrency level as the victim.

Application Benchmarks

In the next section, to analyze the number of demotions, we additionally implement the FD policy, which demotes probabilistically when a NUMA hinting fault occurs without consulting the access history. Using this FD (Force Demotion) policy, which demotes a fixed percentage of all NUMA hinting faults regardless of access history, we examine how the number of demotions affects performance. What this experiment confirms is that the number of demotions can have a major impact on performance.

We can see that the FD policy with a 25% demotion ratio performs 34% better than the 100% demotion case, which has the lowest performance. The graph shows the number of demotions for the OPM, MCAM, FD-0%, and FD-25% policies. From the experimental results, it can be seen that OPM performs significantly fewer demotions than the other policies.

On the other hand, with MCAM, demotions occur on 7.08% of all faults. The number of demotions under the MCAM policy is similar to that of the FD-6% and FD-8% policies, while the existing OPM performs very few migrations. The graph in Figure 11 also shows the result of the FD policy, which keeps demoting without considering the access history.

Figure 8: Graph500 FD policy performance according to probability of demotion

Limitation

Future Work

Previous studies have referred to this memory environment as a tiered memory environment, and recent studies have been conducted to optimize performance in it. While earlier work has suggested ways to optimize performance in hybrid tiered memory environments, in this thesis we consider the effect of multithreading and concurrency level, which to our knowledge has not been considered elsewhere.

References

Zwaenepoel, "Exploiting NVM in large-scale graph analytics," in Proceedings of the 3rd Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, 2015.
Peter, "HeMem: Scalable tiered memory management for big data applications and real NVM," in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, 2021.
Bhattacharjee, "Nimble page management for tiered memory systems," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019.
Wenisch, "Thermostat: Application-transparent page management for two-tiered main memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017.
Zhang, "HiNUMA: NUMA-aware data placement and migration in hybrid memory systems," in 2019 IEEE 37th International Conference on Computer Design (ICCD), 2019.
Zhang, "Adaptive page migration policy with huge pages in tiered memory systems," IEEE Transactions on Computers.
Denis, "PIOMan: A pthread-based multithreaded communication engine," in 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015.

Figures

Figure 1: Tiered memory system overview
Figure 2: Latency and bandwidth on tiered memory system
Figure 3: Performance according to the number of threads in PM and DRAM
Figure 4: MCAM policy overview
