2.2 Cache Architectures in Chip Multiprocessors
2.2.2 NUCA for CCMP
Implementing the basic SNUCA design for CCMP is straightforward, but improving its performance has long been a topic of research interest [45]. We refer to the SNUCA architecture for CCMPs as C-SNUCA. The LLC banks of C-SNUCA use the same fixed mapping policy as single-core SNUCA. The additional mechanism required here is a coherence protocol to maintain consistency among the different L1 caches, which are private to each core; MESI-CMP [46] is an example of such a protocol. Since multiple cores share the LLC, partitioning the LLC storage among the cores based on their workload demand is another area of research.
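As a minimal sketch of this fixed mapping (the bank count, block size, and function name below are illustrative assumptions, not values taken from [45]), the bank holding a block in C-SNUCA is a pure function of the block address, so no bank search is ever needed:

    NUM_BANKS = 16          # assumed number of LLC banks
    BLOCK_BITS = 6          # assumed 64-byte cache blocks

    def snuca_bank(addr: int) -> int:
        # The low-order index bits of the block address select the bank,
        # so any core can locate a block without searching.
        return (addr >> BLOCK_BITS) % NUM_BANKS

    print(snuca_bank(0x1F40))   # -> 13 (same result on every core)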
Figure 2.8: C-DNUCA proposed in [1].
Section 2.1.4 has already introduced the basic cache partitioning techniques; some recent works on cache partitioning are [36, 25, 42].
DNUCA outperforms SNUCA for single-core processors [17], but extending it to CMPs poses some basic difficulties. As the size of the LLC in a CMP grows, the number of banks grows with it, so the bankset search time becomes a major challenge. Block migration may also cause a block to ping-pong between two cores. While the migration issue can be handled relatively easily, efficient bankset search remains an open area of research [5]. Several contributions already exist for implementing DNUCA-based CCMPs; however, most of these works suffer from the overhead of a fairly complex bankset search mechanism [5]. We refer to the DNUCA-based CCMP architecture as C-DNUCA.
2.2.2.1 Innovations in DNUCA-based CCMPs
The first CMP-based NUCA architecture was proposed by Beckmann and Wood [1]. They considered a CCMP architecture with eight cores. The banks are divided into multiple regions so that a block can migrate to a region closer to the requesting core. There are 16 regions: eight local regions (one per core), four center regions, and the remaining four inter regions. Initial placement is somewhat random (based on the block's tag bits).
From there, a block is allowed to migrate gradually across regions based on the cores that access it: a block moves from the local region of one core to the local region of another through the inter and center regions. To prevent a block from ping-ponging among multiple regions, migration is allowed only after a number of successive requests from the same core. Figure 2.8 shows the architecture proposed in [1].
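The following is a minimal sketch of such ping-pong-resistant gradual migration. The linear region chain, the threshold value, and all names are simplifying assumptions rather than the exact policy of [1]:

    MIGRATION_THRESHOLD = 4   # assumed number of successive hits required

    # Regions ordered as a path from core A's local region to core B's:
    REGIONS = ["local_A", "inter_A", "center", "inter_B", "local_B"]

    class BlockState:
        def __init__(self, region_idx: int):
            self.region_idx = region_idx      # where the block currently lives
            self.last_core = None             # core that issued the last hit
            self.streak = 0                   # successive hits from that core

        def on_hit(self, core: str) -> None:
            # Restart the streak whenever a different core touches the block.
            if core != self.last_core:
                self.last_core, self.streak = core, 0
            self.streak += 1
            # Migrate one region toward the requester only after a streak,
            # which prevents two cores from ping-ponging the block.
            if self.streak >= MIGRATION_THRESHOLD:
                target = 0 if core == "A" else len(REGIONS) - 1
                if self.region_idx != target:
                    self.region_idx += 1 if target > self.region_idx else -1
                self.streak = 0

    blk = BlockState(region_idx=2)            # block starts in a center region
    for _ in range(4):
        blk.on_hit("B")
    print(REGIONS[blk.region_idx])            # -> inter_B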
The authors report that most workloads contain only a small amount of shared data, but that this data is accessed very frequently. The migration rule therefore causes most of the shared data to settle in the center regions, which are far from all the cores. Conventional RC-based wires used as on-chip interconnects are slow [47] and cannot reach those center regions quickly.
Hence, to reduce access latency, the authors use high-speed transmission lines [47] to connect each core directly to the center regions. Transmission lines can transfer data at close to the speed of light; their details are outside the scope of this discussion. The most significant problem with this architecture is the difficulty of locating a block. In a DNUCA that distributes the ways across the banks, the exact location of a block cannot be determined statically, which means every request must search a set of banks to find the block (assuming a hit). The authors use a multicast-based mechanism for bankset search.
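A multicast bankset search can be sketched as probing every bank of the bankset in parallel and letting the (at most one) holder respond; the set-based bank model below is an illustrative assumption:

    def multicast_search(bankset: list[set[int]], tag: int):
        # Probe all banks of the bankset "in parallel"; at most one bank
        # holds the block, so at most one probe hits. The cost is that
        # every bank is disturbed on every search.
        hits = [i for i, bank in enumerate(bankset) if tag in bank]
        return hits[0] if hits else None      # None models an LLC miss

    bankset = [{0x12}, {0x34}, set(), {0x56}]
    print(multicast_search(bankset, 0x34))    # -> 1 (index of the hit bank)
    print(multicast_search(bankset, 0x99))    # -> None (miss)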
Another work, [18], appeared shortly after [1]. Its experiments found that C-SNUCA has a longer access time on average, while C-DNUCA performs better than C-SNUCA when migration is allowed; however, the authors also note that block searching in C-DNUCA is a major bottleneck. To reduce the need to access multiple banks, they implemented a distributed set of replicated partial tags: the partial tags for each block are stored at the top and bottom of every bankset (column). Such partial tags reduce the number of banks that must be searched, but the additional storage they require creates a significant hardware overhead. The tag stores must also be updated on every on-chip change, and a coherence mechanism has to be implemented for them.
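The following sketch illustrates how such partial-tag filtering narrows the search; the 8-bit partial-tag width and all structure names are assumptions, not values from [18]:

    PARTIAL_BITS = 8                          # assumed partial-tag width

    def partial(tag: int) -> int:
        return tag & ((1 << PARTIAL_BITS) - 1)

    # partial_tags[i] holds the partial tags of all blocks in bank i.
    def candidate_banks(partial_tags: list[set[int]], tag: int) -> list[int]:
        # Only banks whose partial store matches need to be probed; a
        # partial match can still be a false positive, so the full tag is
        # checked at the bank itself.
        p = partial(tag)
        return [i for i, s in enumerate(partial_tags) if p in s]

    partial_tags = [{partial(0x1A2)}, {partial(0x3B4)}, set()]
    print(candidate_banks(partial_tags, 0x3B4))   # -> [1]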
Re-NUCA [48] allows limited replication for shared blocks accessed by processors on opposite sides of the chip. In particular, their solution allows at most two independent copies of the same block to be stored in the shared LLC, each migrating towards the closest side of the LLC. An efficient data search algorithm for C-DNUCA, called HK-NUCA, has been proposed in [9]. In HK-NUCA each block has a statically mapped home bank whose purpose is to know which other NUCA banks hold at least one of the data blocks that it manages. This home-bank-based mechanism significantly reduces the search time, the network traffic, and the miss rate. DR-SNUCA [49] and ESP-NUCA [50] are proposed to reduce the energy overhead of NUCA-based architectures.
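A rough sketch of this home-bank style of lookup is shown below, assuming the home bank simply keeps a set of the banks that currently hold its blocks; the mapping function and structure names are illustrative, not HK-NUCA's exact organization:

    NUM_BANKS = 8

    def home_bank(addr: int) -> int:
        return addr % NUM_BANKS               # assumed static home mapping

    # presence[h] = banks known by home bank h to hold blocks it manages.
    presence = {h: set() for h in range(NUM_BANKS)}

    def on_insert(addr: int, bank: int) -> None:
        presence[home_bank(addr)].add(bank)   # home learns the holder

    def lookup(addr: int) -> set[int]:
        # One hop to the home bank replaces a broadcast over all banks:
        # only the banks in the returned set need to be probed.
        return presence[home_bank(addr)]

    on_insert(0x42, bank=5)
    print(lookup(0x42))                       # -> {5}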
Chishti et al. proposed an alternative to DNUCA called NuRAPID [10], which achieves better performance than DNUCA with lower complexity. The original NuRAPID was proposed for single-core architectures. The main contribution of the work is to decouple the tag and data arrays: the tags are separated from the data and kept in a special tag store near the processing core. This tag store holds the tag address of each block together with a forward pointer to the location where the data resides; the data themselves are stored in the banks. Keeping the tag store close to the core minimizes tag lookup time, and the forward pointer allows a block to be accessed directly in its bank without searching any bank. The forward pointers also allow data to be placed at any location in a bank irrespective of the set index, and this flexibility lets more of the heavily used blocks be placed in the closer banks. The cost of this flexibility is that data blocks must maintain reverse pointers identifying their entries in the tag array, so that the corresponding tag can be updated when a block moves.
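The forward/reverse pointer mechanism can be sketched as follows; the data structures and field names are illustrative assumptions, not NuRAPID's actual organization:

    tag_store = {}                        # tag -> (bank, frame): forward pointers
    banks = [dict() for _ in range(4)]    # per bank: frame -> (tag, data)

    def insert(tag: int, data: str, bank: int, frame: int) -> None:
        tag_store[tag] = (bank, frame)
        banks[bank][frame] = (tag, data)  # stored tag acts as reverse pointer

    def access(tag: int) -> str:
        bank, frame = tag_store[tag]      # direct access, no bank search
        return banks[bank][frame][1]

    def move(tag: int, new_bank: int, new_frame: int) -> None:
        old_bank, old_frame = tag_store[tag]
        rev_tag, data = banks[old_bank].pop(old_frame)
        banks[new_bank][new_frame] = (rev_tag, data)
        # The reverse pointer tells us which tag entry to update:
        tag_store[rev_tag] = (new_bank, new_frame)

    insert(0x7, "payload", bank=3, frame=0)
    move(0x7, new_bank=0, new_frame=5)    # promote block to a closer bank
    print(access(0x7))                    # -> payload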
The CMP extension of NuRAPID is proposed in [19]. In this design each core has its own tag store. Although it performs better than the corresponding DNUCA, managing multiple tag stores is a major drawback: the stores duplicate the same content and must remain consistent with every change in the cache, so a coherence mechanism has to be implemented for them as well.
A shared/private hybrid LLC is proposed in [51]. The LLC is distributed into multiple banks as in C-DNUCA, but each bank reserves some ways as private to its local core. Blocks are initially placed into the private ways of the requesting core's local bank; upon eviction, a block is placed into the shared ways of some bank. A prediction-based method is used in [52] to minimize bankset search overheads: it uses a Bloom filter [53] to track the possible location of a block in each bank. The design reduces the storage overhead by a factor of 10 compared to the partial-tag structure, but the Bloom filter introduces a non-trivial number of false positives.
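A sketch of such Bloom-filter-based location prediction is shown below; the filter size and hash functions are illustrative assumptions, not those of [52]:

    FILTER_BITS = 64                      # assumed filter size per bank

    def hashes(tag: int):
        # Two simple hash functions; a real design would choose better ones.
        return (tag % FILTER_BITS, (tag * 31 + 7) % FILTER_BITS)

    class BloomFilter:
        def __init__(self):
            self.bits = 0
        def add(self, tag: int) -> None:
            for h in hashes(tag):
                self.bits |= 1 << h
        def maybe_contains(self, tag: int) -> bool:
            # False positives are possible; false negatives are not.
            return all(self.bits >> h & 1 for h in hashes(tag))

    filters = [BloomFilter() for _ in range(4)]   # one filter per bank
    filters[2].add(0xABC)
    probe = [i for i, f in enumerate(filters) if f.maybe_contains(0xABC)]
    print(probe)    # includes bank 2; may also include false-positive banks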
Some other proposals for C-DNUCA are [54, 55, 56]. In [57], DNUCA migration is performed earlier, based on prediction, to reduce the hit time. In summary, even after many innovative designs, minimizing the cost of bankset searching remains a challenging task. A better placement policy can reduce the total number of bank accesses per bankset search; Chapters 6 and 7 use this idea for better block placement.
Most of the DNUCA designs discussed above are designed purely for CCMP and cannot be applied directly to TCMP (more details are discussed in the next section). Considering the scalability issues of CCMP and the growing demand for TCMP [5], current researchers give more importance to TCMP than to CCMP [58, 59, 60, 61]. The next section discusses TCMP and its NUCA designs.