3.1 Computer Architecture Simulators
3.1.2 GEMS: An Overview
GEMS [126] is constructed from three major modules: Ruby, Garnet [137] and Orion [25]. Ruby models the entire memory system of a CMP based architecture, where each component, such as an L1 cache, an L2 bank, a memory bank or a directory, is modeled individually. Each of these components is called a "machine" in Ruby. Every machine has its own unique identifier, called machineID, which is used to identify it during on-chip communication. The machines in a CMP communicate with each other through the underlying NoC, which is managed by Garnet. Any CMP based architecture that can be modeled in Ruby uses Garnet for establishing on-chip communication. Garnet simulates realistic message transfers through the NoC and supports different topologies, which allows the NoC to be simulated with a variety of options. Another simulator, Orion [25], is attached to Garnet for modeling the energy consumed by the underlying NoC. The Garnet version used in our simulations follows X-Y routing. The on-chip communication cost, from both the power and performance perspectives, is considered in all the experimental analyses done in this thesis.
The requested blocks (load, store, fetch) from the Simics processors are passed to the Ruby module of GEMS. The very first level of cache, i.e. L1, in Ruby detects early whether the requested block is a hit or a miss. If a hit is detected, Simics continues its execution; in case of a miss, the request from the issuing core is stalled until GEMS completes its simulation of the detected cache miss. The timing-dependent functional simulation in Simics is thus driven by Ruby, which controls when Simics moves its simulation forward.
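The hit/miss handshake above can be sketched as follows. This is a toy model of the stall behaviour only (the class, field names and fixed miss penalty are hypothetical, not actual Ruby/Simics code):

```python
class L1Front:
    """Toy L1 front-end: on a hit the core continues immediately;
    on a miss the core is stalled while the miss is serviced."""

    def __init__(self, miss_penalty=20):
        self.lines = set()            # block addresses currently cached
        self.miss_penalty = miss_penalty

    def access(self, block_addr):
        """Return the number of cycles the issuing core must stall."""
        if block_addr in self.lines:
            return 0                  # hit: Simics continues at once
        self.lines.add(block_addr)    # miss: block fetched and installed
        return self.miss_penalty      # core stalls for this long

l1 = L1Front()
assert l1.access(0x40) == 20          # cold miss: core stalls
assert l1.access(0x40) == 0           # subsequent hit: no stall
```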
A sequencer is attached to each L1 cache to manage the requests from the corresponding core. In a CMP, multiple L1s and L2s are available and can handle cache requests concurrently. A controller is attached to each machine, and all functionalities of a machine are performed by its own controller.
Chapter 3. Simulation Framework 62
GEMS provides a domain-specific language called SLICC for modeling such controllers, which manage all the operations of the machine itself and its communications with the others. Coherence management, on the other hand, is a major concern in CMP based cache structures that use shared caches; hence, a coherence protocol has to be implemented. Different modules of the controllers manage the coherence protocol through message passing among themselves, and these messages are passed through the NoC (modeled by Garnet). SLICC is further responsible for the design of the coherence protocol, as it is a combined task of all the controllers.
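The coherence protocol referred to above (MESI in our designs, see Section 3.1.2.1) can be summarised as a table of state transitions driven by local and remote events. The sketch below covers only the stable states for one cache line; the real MESI-CMP protocol in GEMS adds many transient states for in-flight messages:

```python
# Minimal MESI transition table for a single cache line, as seen by
# the local controller (illustrative only; event names are our own).
MESI = {
    ("I", "local_read"):   "S",   # fetch a shared copy
    ("I", "local_write"):  "M",   # fetch with ownership
    ("S", "local_write"):  "M",   # upgrade, other sharers invalidated
    ("S", "remote_write"): "I",   # another core writes: invalidate
    ("E", "local_write"):  "M",   # silent upgrade, no NoC traffic
    ("E", "remote_read"):  "S",   # downgrade to shared
    ("M", "remote_read"):  "S",   # write back, then share
    ("M", "remote_write"): "I",   # write back, then invalidate
}

def next_state(state, event):
    """Apply one coherence event; unlisted pairs leave the state unchanged."""
    return MESI.get((state, event), state)
```

Each transition that involves a remote machine corresponds to one or more messages crossing the NoC, which is exactly the traffic Garnet models.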
3.1.2.1 CMP Architectures Supported by GEMS
GEMS supports an SNUCA based TCMP/CCMP cache architecture which is very robust and can be configured with a variety of parameters, such as cache size, number of banks, number of tiles, cache access latency, miss penalty, hit time, network protocols, etc.
All of these parameters can be modified by simply changing their respective values in a separate configuration file. Apart from these, many other parameters can also be reset according to the demands of the configuration, such as the number of virtual networks, block size, cache associativity, replacement policy, network flit size, etc. For coherence management, we use the MESI protocol in our proposed and baseline designs, which is termed the "MESI-CMP protocol" in GEMS. For experimenting with different network topologies, changes can also be made in the network configuration file of Garnet.
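For illustration, a fragment of such a configuration file might look as follows (the parameter names and values here are indicative only; the exact keys in the GEMS configuration files may differ):

```
# Illustrative cache/network parameters (names indicative only)
g_NUM_PROCESSORS: 16
L1_CACHE_ASSOC: 4
L1_CACHE_SIZE_KB: 32
L2_CACHE_ASSOC: 8
L2_CACHE_SIZE_KB: 512
NUMBER_OF_VIRTUAL_NETWORKS: 5
RUBY_BLOCK_SIZE_BYTES: 64
PROTOCOL: MESI_CMP
```

Since the simulator reads these values at start-up, a parameter sweep (e.g. over cache sizes or bank counts) only requires editing this file and rerunning, with no recompilation.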
GEMS supports the baseline cache architecture on which we implemented our Dynamic Cache Reconfiguration (DiCeR) techniques. We made certain changes at the protocol level to manage the associated coherence mechanism. The code written in SLICC has been modified to implement the proposed architectures, as our proposals could not be realised by simply changing the parameters. Furthermore, other internal structures of the memory system were also modified to provide the necessary support. Finally, a recompilation of GEMS constructs the new architecture. The migration of cache blocks during bank turn-off/on increases the workload on the NoC, which is also simulated in Garnet. In addition, the forwarding of requests after cache reconfiguration is also taken care of with the involvement of Garnet. The integration of Orion also simulates the corresponding changes in NoC power consumption.
3.1.2.2 Result Analysis
The GEMS-Simics integration can run real workloads on the simulated (modeled) CMP architecture. During execution, the simulation framework records plenty of information. Some of the important information recorded by GEMS during execution is:
• Total cycles executed: This metric is the summation of the cycles executed by all the cores. Note that GEMS also records the executed cycles of each individual core.
• Total instructions executed: As with the above metric, both the individual and the summed executed instruction counts are recorded.
• Total L1 accesses/misses: GEMS records the total as well as the individual access/miss counts for each of the L1 banks, which are private to their respective cores.
• Total L2 accesses/misses: This is the same as the L1 accesses/misses, but for the LLC (L2). The bank-wise distribution is not provided in the original GEMS but can easily be implemented on demand.
• Average network latency: Garnet outputs this parameter, which is the average number of cycles required for a message to traverse the NoC.
• NoC energy consumption: Total energy consumed by the NoC during exe- cution (produced by Orion).
Apart from these, GEMS also provides: cycles per instruction (CPI), instructions per cycle (IPC), average memory access latency (including on-chip/off-chip communication time), average link utilisation, misses per kilo instructions (MPKI), etc.
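These derived metrics follow directly from the raw counts listed above. A sketch of the arithmetic, using hypothetical numbers rather than measured results:

```python
def derived_metrics(cycles, instructions, misses):
    """Compute CPI, IPC and MPKI from the raw counts GEMS records."""
    cpi = cycles / instructions
    ipc = instructions / cycles
    mpki = misses * 1000 / instructions   # misses per kilo instructions
    return cpi, ipc, mpki

cpi, ipc, mpki = derived_metrics(cycles=2_000_000,
                                 instructions=1_000_000,
                                 misses=5_000)
# cpi = 2.0, ipc = 0.5, mpki = 5.0
```

Note that MPKI normalises misses by the instruction count rather than by accesses, which makes it comparable across workloads with different memory intensities.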
A special module, named Profiler, embedded in GEMS manages all the results during execution. The Profiler can be initialised at any point during execution, so that results can be recorded for any particular time window of the execution. Further, the Profiler can also be modified (or reset on-the-fly) if required.
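The effect of resetting the Profiler on-the-fly can be sketched as follows: clearing the counters at a chosen point makes all subsequent statistics cover only the region of interest (e.g. after warm-up). The class and its fields are a hypothetical illustration, not the GEMS Profiler API:

```python
class Profiler:
    """Toy profiler: counters can be cleared mid-run so that the
    recorded statistics cover only a chosen execution window."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear all counters; measurement restarts from this point."""
        self.l2_accesses = 0
        self.l2_misses = 0

    def record_access(self, miss):
        self.l2_accesses += 1
        if miss:
            self.l2_misses += 1

prof = Profiler()
prof.record_access(miss=True)    # warm-up phase, later discarded
prof.reset()                     # start of the region of interest
prof.record_access(miss=False)
prof.record_access(miss=True)
# prof.l2_accesses == 2, prof.l2_misses == 1
```

Excluding warm-up in this way avoids cold-cache misses inflating the reported miss rates.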