Hard to estimate the switching activity at higher levels
Logic primitive based estimation
By Landman and Rabaey
DSP circuits based on high-level description of the system
DSP circuits composed of logic primitives
adders, comparators, multipliers, etc.
Should estimate the switching activity at the output of such logic primitives for the given input signal probabilities and switching activities
Can be used for high-level synthesis
Using high-level stochastic behaviors such as mean, variance, etc.
Follows the gate-level power estimation technique
But considers larger blocks rather than gates
Direct relationship between bit level probabilities and word level statistics
For three major DSP applications
Speech, music and image
Signal and transition probabilities follow a similar pattern
Can be represented by a piecewise-linear curve
Low order bits: uncorrelated in space and time
Signal probability: 0.5 / switching activity: 0.25
Expected result
LSB switches from 0 to 1 / 1 to 0 based on whether it is even or odd
Higher order bits: complete dependence
Represent sign extensions
Two breakpoints: BP0, BP1
Represents signal probabilities and transition activities with statistical parameters
Mean: μ
Variance: σ²
Lag-one correlation coefficient: ρ1
ρ1 = cov[X_t, X_{t+1}] / σ²
BP0 for signal probability (given empirically)
Signifying the end of the low-order bits
BP0 for the switching activity (given empirically)
Signifying the end of the low-order bits
What about MSBs?
Depending on the signal distribution
For signal probabilities
PMSBs = F1(μ/σ)
For switching activities
AMSBs = F01(μ/σ, ρ1)
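The link between word-level statistics and bit-level behavior can be checked numerically. The following is a minimal sketch, not the Landman and Rabaey model itself: it assumes a synthetic zero-mean Gaussian AR(1) signal in 16-bit two's complement, and it counts switching activity as the fraction of 0 to 1 transitions so that an uncorrelated bit gives 0.25. The function names, the word width, and the signal parameters are all illustrative assumptions.

```python
import math
import random

def word_stats(samples):
    """Return (mean, variance, lag-one correlation) of a sample sequence."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    cov = sum((samples[t] - mean) * (samples[t + 1] - mean)
              for t in range(n - 1)) / (n - 1)
    return mean, var, cov / var

def bit_stats(samples, width=16):
    """Per-bit signal probability and 0->1 switching activity."""
    mask = (1 << width) - 1
    # Masking a Python int yields its two's complement bit pattern.
    words = [x & mask for x in samples]
    prob, act = [], []
    for b in range(width):
        bits = [(w >> b) & 1 for w in words]
        prob.append(sum(bits) / len(bits))
        rises = sum(bits[t] == 0 and bits[t + 1] == 1
                    for t in range(len(bits) - 1))
        act.append(rises / (len(bits) - 1))
    return prob, act

def ar1_signal(n, sigma, rho, seed=0):
    """Synthetic zero-mean Gaussian AR(1) sequence, rounded to integers."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigma)
    out = []
    for _ in range(n):
        out.append(int(round(x)))
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, sigma)
    return out

# Assumed test signal: sigma = 1000, lag-one correlation 0.9.
samples = ar1_signal(20000, sigma=1000.0, rho=0.9)
mean, var, rho1 = word_stats(samples)
prob, act = bit_stats(samples)
print("LSB:", round(prob[0], 3), round(act[0], 3))
print("MSB:", round(prob[15], 3), round(act[15], 3))
```

With this signal the low-order bits should sit near 0.5 / 0.25, while the top bits behave identically to each other (sign extension) and switch less often because successive samples are correlated.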
Many architectural/system decisions have huge impact on power and performance
Often need feedback at the early-stage of a design
Pre-RTL, pre-circuit analysis
Run-time, system-level feedback control
Application/dynamic run-time characteristics allow dynamic scaling for power reduction
Architectural simulators
Functional: trace-driven, exec-driven (interpreters, direct execution)
Performance: inst. schedulers, cycle timers
(In the original figure, shaded parts indicate SimpleScalar)
Arch Sim
Instruction accurate
Programmer’s view
Correctness of ISA
Used for software development
Performance simulator
uArch Sim
Cycle accurate
Architect’s view
Performance improvement
Used for architecture exploration
Using trace for simulation
Trace was already obtained
No feedback to trace
Execution-driven simulator
Using program for simulation
Generating stream dynamically
Pros.
Accuracy by reflecting dynamic circumstances
Cons.
Time-consuming to implement and run
Schedules instructions based on resource availability
Instructions processed one at a time
Less detailed
Cycle timers
Tracks the microarchitectural state in every cycle
Many instructions in various stages at any time
Detailed
C code is compiled with libraries
Binary is fed into simulators
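[Figure: tool chain. Fortran code is translated by F2C into C code; GCC compiles C code into assembly code; GAS assembles it into object files; GLD links the object files with Libf77.a, Libm.a, and Libc.a into executables, which are fed into the simulators. Bin Utils handle the object files and executables.]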
Simulator suite (performance decreases and detail increases down the list)
Sim-Fast: 420 lines, functional, 4+ MIPS
Sim-Safe: 350 lines, functional w/ checks
Sim-Profile: 900 lines, functional, lot of stats
Sim-Cache / Sim-Cheetah / Sim-BPred: <1000 lines, functional, cache stats, pred stats
Sim-Outorder: 3900 lines, performance, OoO issue, branch pred, mis-spec, ALUs, cache, TLB, 200+ KIPS
I/O is implemented
Via SYSCALL instruction
Decode system call
Copy arguments into the simulator memory
Perform the system call on the host (real machine)
Copy results into the simulated memory
[Figure: I/O simulation. The simulated program calls write(fd, p, 4); the simulator takes the args in, performs sys_write(fd, p, 4) on the host, and passes the results out.]
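The SYSCALL steps listed above can be sketched as a small proxy routine. This is an illustration only, not SimpleScalar's code: the `SimulatedMachine` class, the register names, and the syscall number `SYS_WRITE = 4` are hypothetical choices for this sketch. Only `os.write`, `os.read`, `os.pipe`, and `os.close` are real host calls.

```python
import os

SYS_WRITE = 4  # hypothetical syscall number used only in this sketch

class SimulatedMachine:
    """Toy simulated machine: flat memory plus a few registers."""
    def __init__(self, mem_size=1024):
        self.mem = bytearray(mem_size)
        self.regs = {"v0": 0, "a0": 0, "a1": 0, "a2": 0}

def do_syscall(m):
    # 1. Decode the system call from a register.
    code = m.regs["v0"]
    if code != SYS_WRITE:
        raise NotImplementedError("syscall %d not modeled" % code)
    fd, ptr, length = m.regs["a0"], m.regs["a1"], m.regs["a2"]
    # 2. Copy the arguments out of simulated memory into a host buffer.
    buf = bytes(m.mem[ptr:ptr + length])
    # 3. Perform the system call on the host (real machine).
    result = os.write(fd, buf)
    # 4. Return the result to the simulated program. For write() this is
    #    just a return register; a read() would also copy data into m.mem.
    m.regs["v0"] = result

# Usage: the simulated program "calls" write(fd, p, 4) on a host pipe.
r, w = os.pipe()
m = SimulatedMachine()
m.mem[100:104] = b"ping"
m.regs.update(v0=SYS_WRITE, a0=w, a1=100, a2=4)
do_syscall(m)
os.close(w)
data = os.read(r, 16)
os.close(r)
```

Running the call on the host keeps the simulator small, at the cost of tying simulated I/O behavior to the host operating system.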
A single file describes ISAs
ss.def
Flexible
Existing commercial ISA : MIPS, ARM
[Figure: example memory hierarchy with L1 D-cache, L1 I-cache, local bus, D-TLB, I-TLB, TLB bus, global bus, L2 TLB, L2 cache, on-chip banks, off-chip banks, and network interface.]
-cache: define L1D:1024:32:2:1:1:vipt:0:2:1:Onbus:Globalbus
Fields, in order: name, # sets, block size, associativity, replacement, hit latency, translation, prefetch, # resources, resource code, resource names
-bus: define Onbus:32:8:1:1:0:2:1:L2:Onbank
Fields, in order: name, width, width, cycle ratio, arbitration, inf. b/w, # resources, resource code, resource names
-bank: define Onbank:20:0
Fields, in order: name, access penalty, banking code
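To make the `-cache: define` string concrete, here is a minimal parsing sketch. The field order is an assumption read from the labels on the slide, the key names are made up for this example, and the block size is assumed to be in bytes. It is not the simulator's own option parser.

```python
# Assumed field order, read from the slide labels (not from simulator source).
CACHE_FIELDS = ["name", "nsets", "block_size", "assoc", "replacement",
                "hit_latency", "translation", "prefetch",
                "n_resources", "resource_code"]

def parse_cache_define(spec):
    """Split a colon-separated cache definition into named fields."""
    parts = spec.split(":")
    fixed = parts[:len(CACHE_FIELDS)]
    names = parts[len(CACHE_FIELDS):]
    cfg = dict(zip(CACHE_FIELDS, fixed))
    for key in ("nsets", "block_size", "assoc", "hit_latency", "n_resources"):
        cfg[key] = int(cfg[key])
    if len(names) != cfg["n_resources"]:
        raise ValueError("expected %d resource names" % cfg["n_resources"])
    cfg["resource_names"] = names
    # Capacity = sets x block size x associativity (block size assumed bytes).
    cfg["capacity_bytes"] = cfg["nsets"] * cfg["block_size"] * cfg["assoc"]
    return cfg

cfg = parse_cache_define("L1D:1024:32:2:1:1:vipt:0:2:1:Onbus:Globalbus")
print(cfg["name"], cfg["capacity_bytes"], cfg["resource_names"])
```

Under these assumptions the example line describes a 2-way, 64 KB L1 data cache attached to two resources, Onbus and Globalbus.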
Accelerate simulation speed
At least more than 100%
Catch the program behavior that occurs in the later part of simulation
Reduce simulation error
It is too time-consuming to go to other phases
Estimates CPU power consumption
Processor core
On-chip cache
Instruction cache
Data cache
On-chip bus
Clock-generation/Control logic/FP units
SimpleScalar Interface
SimpleScalar provides a simulation environment
Out-of-order processors with 5-stage pipelines
Array structure
I cache, D cache, cache tag array
Register files
Branch predictors
Instruction window, load/store queue
Fully Associative CAM
Instruction window/reorder buffer wakeup logic
Load/store order checks
TLBs
Combinational Logic and Wires
Functional units
Clocking
Clock buffer, clock wires
Dynamic power consumption of a CMOS microprocessor
$P_d = C V_{DD}^2 A F$
C is the load capacitance
VDD is the supply voltage
A is the activity factor
F is the operating frequency
C of each categorized power model is calculated
Each unit can be reduced into stages
Each stage is modeled as an RC circuit
Selective clock-gating is considered
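The dynamic power relation and the effect of clock gating can be sketched as follows. The numeric values are arbitrary, and `idle_fraction` (the share of full power a gated unit still draws when idle) is an assumed parameter of this sketch rather than a value from the slides.

```python
def dynamic_power(c_load, vdd, activity, freq):
    """P_d = C * VDD^2 * A * F, in watts for SI inputs."""
    return c_load * vdd ** 2 * activity * freq

def gated_unit_power(c_load, vdd, freq, active_cycles, total_cycles,
                     idle_fraction=0.0):
    """Average power of a unit that is clock-gated when not accessed.

    idle_fraction is an assumption: 0.0 means a perfectly gated unit.
    """
    full = dynamic_power(c_load, vdd, 1.0, freq)
    duty = active_cycles / total_cycles
    return full * (duty + (1.0 - duty) * idle_fraction)

# Arbitrary example: 1 nF effective capacitance, 2 V, 100 MHz.
p_always_on = dynamic_power(1e-9, 2.0, 0.5, 1e8)
p_gated = gated_unit_power(1e-9, 2.0, 1e8, active_cycles=250,
                           total_cycles=1000)
print(p_always_on, p_gated)
```

Because power scales with the square of VDD, lowering the supply voltage has a larger effect than an equal relative change in capacitance, activity, or frequency.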
Model capacitance vs. physical schematics
Within 6-11%
Relative power consumption by structure
Average difference is 10.7%
Relative validation
Max power consumption
Absolute validation
Simulation speed
80K instructions per second
cf. 105K instructions per second for SimpleScalar alone
Pipeline stage: power (units accessed); performance questions
Fetch: I-cache, Bpred; cache hit? Bpred lookup
Dispatch: rename table, inst. window, reg. file; inst. window full?
Issue/Execute: inst. window, reg. file, ALU, D-cache, load/st Q; dependencies satisfied? resources?
Writeback/Commit: result bus, reg. file, Bpred; commit bandwidth?
Parameterized register file power model
Hardware parameters: number of entries, data width of entries, # read ports, # write ports
Run-time inputs: number of active ports, bitline activity
Output: power estimate
Instead of a performance simulator, performance counters can be used
Speedup of simulation time
Real input trace
Eliminates the input trace error
Otherwise, the input trace error might incur severe errors to the output
Integration with power simulator
Getting input traces from a real machine
The collected trace is used for the power simulator
In this case, the specifications of the real machine and of the power simulator should be the same
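[Figure: ISA and μ-Arch design trade-offs. Other factors: compilability, code density, chip area, verification cost. Performance depends on N, I, f; power depends on Ed, Pl; the combined metric is marked "?".]
N: dynamic instr. count
I: architectural speed, IPC
f: max frequency
Ed: average switching energy
Pl: leakage power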
Execution-driven, cycle-accurate RT level energy estimation tool
Estimates CPU power consumption
Processor core
On-chip cache
Instruction cache
Data cache
On-chip bus
Clock-generation/control logic/FP units are not included
Integrated to the SimpleScalar architectural simulator for the integer subset of the instruction set of SimpleScalar
Dynamic power: the main source of CMOS microprocessor power consumption
$P_d = C V_{DD}^2 A F$
C is the load capacitance
VDD is the supply voltage
A is the activity factor
F is the operating frequency
SimplePower is based on input transitions rather than input statistics
Switch capacitance table
Cache simulator
Modified Dinero III
Bus simulator
Snoops the I cache/D cache address/data bus (on-chip bus)
Switch capacitance table
Bit-dependent functional unit
Bit-independent functional unit
Table size problem (too large table) is solved by a clustering algorithm
If a clustering algorithm is not appropriate
Analytical modeling (e.g., memory)
Partitioning a module into smaller submodules (e.g., adder, subtractor, shifter, etc.)
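A transition-based switch capacitance table can be illustrated with a toy 2-bit adder. This is only a sketch: the table is filled from a made-up cost model (a fixed capacitance per toggled input or output bit) instead of circuit-level characterization, and `VDD`, `C_NODE`, and the 1/2 factor in the energy expression are assumptions of the example.

```python
VDD = 3.3        # assumed supply voltage (V)
C_NODE = 10e-15  # assumed capacitance per toggled node (F), illustrative only

def adder2(a, b):
    """2-bit adder with carry-out: three output bits."""
    return (a + b) & 0x7

def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def build_table():
    """Switched capacitance for every (previous, current) input pair."""
    table = {}
    for pa in range(4):
        for pb in range(4):
            for ca in range(4):
                for cb in range(4):
                    toggles = (hamming(pa, ca) + hamming(pb, cb)
                               + hamming(adder2(pa, pb), adder2(ca, cb)))
                    table[((pa, pb), (ca, cb))] = toggles * C_NODE
    return table

def energy(table, inputs):
    """Sum the energy over consecutive input transitions."""
    total = 0.0
    for prev, curr in zip(inputs, inputs[1:]):
        total += 0.5 * table[(prev, curr)] * VDD ** 2
    return total

table = build_table()
trace = [(0, 0), (1, 1), (1, 1), (3, 2)]
print(len(table), energy(table, trace))
```

Even this 4-input-bit unit needs 256 entries, since a unit with n input bits has 2^n × 2^n input pairs. That growth is the table size problem that clustering, analytical modeling, or partitioning is meant to address.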
Comparison between SimplePower and HSPICE
32*32 5-port register file energy estimation
Within 10-15% accuracy of circuit-level estimation
Simulation Speed
Much less than 0.1sec for each input transition
cf. 556.42 sec for HSPICE circuit-level simulation
A transaction level simulator equipped with power models
Based on MaxSim from ARM-ESL
Three major modeling blocks and their accuracy
Simulation speed: ~200KCPS
Blocks Accuracy (vs. gate level)
ARM926EJ-S 93%
AXI 95%
IP blocks 80%
Simple mW/MHz model does not work
Separate Core and Cache power states
Cache power states
Sum of power values of SRAM generated by memory compiler for the following combinations
Modules: Tag / Data / Fill buffer
Mode: Read / Write
Access pattern: Sequential / Non-sequential / Fill buffer
AXI: AMBA 3.0 from ARM
PL300: A crossbar switch implementing AXI
Modeling strategy
Activate single master and single slave
Characterize power for each component
Active / standby power
For eight cases: Read / Write * Burst length (1, 2, 4, 8)
Consider multiple masters and slaves
Consider the coupling effect by using a linear regression model
Systematically varies # of masters and slaves
In estimation phase
Power is estimated based on the above model cycle by cycle
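The regression step in the modeling strategy above can be sketched with an ordinary least-squares fit. The data here is synthetic: it is generated from made-up coefficients purely to show that the fit recovers them, and it does not represent measured PL300 power. The model form (a base term plus one term per master and per slave) is also an assumption of this sketch.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(xs, ys):
    """Least-squares fit of y = w0 + w1*x1 + w2*x2 via normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)

# Synthetic characterization runs: vary masters and slaves systematically.
TRUE_BASE, TRUE_PER_MASTER, TRUE_PER_SLAVE = 2.0, 0.5, 0.3  # made-up values
configs = [(nm, ns) for nm in (1, 2, 3) for ns in (1, 2, 3)]
power = [TRUE_BASE + TRUE_PER_MASTER * nm + TRUE_PER_SLAVE * ns
         for nm, ns in configs]

w = fit_linear(configs, power)
print([round(v, 6) for v in w])
```

In the estimation phase, the fitted coefficients would be evaluated each cycle with the number of active masters and slaves in that cycle.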
Use RTL power estimation
ViP model: cycle-accurate model translated from RTL
RTL power macro model: based on representative states
Can estimate the power dissipated by the entire system (peak as well as average)
Can identify power critical region for the improvement
Measurement-based tool
Measures the energy consumption of entire embedded system
HP3458A DMM
625Hz sampling frequency
Linux operating system
PC (program counter), PID (process id)
However, the energy consumption of each individual system component cannot be measured
Originally integrated into the IBM 701C laptop
Currently integrated into the Itsy pocket computer with a small set of kernel modifications
[Figure: profiling setup. Profiling computer: apps and system monitor, producing PC/PID samples. Data collection computer: energy monitor, producing correlated current levels. DMM on the power source, connected over the HP-IB bus, with a trigger signal.]
Post processing
Process level energy analysis
Function level energy analysis
[Figure: post-processing computer. The energy analyzer combines correlated current levels, PC/PID samples, and symbol tables into an energy profile.]
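The post-processing step can be sketched as a join between the PC/PID samples and the current samples taken at the same instants. This is a simplified illustration, not the tool's implementation: the constant supply voltage `V_SUPPLY`, the sample data, and the symbol names are all made up for the example.

```python
import bisect
from collections import defaultdict

SAMPLE_HZ = 625  # sampling frequency quoted for the DMM
V_SUPPLY = 5.0   # assumed constant supply voltage in volts

def energy_profile(samples, currents, key, v_supply=V_SUPPLY, hz=SAMPLE_HZ):
    """Attribute each current sample's energy (V * I * dt) to key(sample)."""
    dt = 1.0 / hz
    profile = defaultdict(float)
    for sample, amps in zip(samples, currents):
        profile[key(sample)] += v_supply * amps * dt
    return dict(profile)

def function_lookup(symbols):
    """Build a key function mapping a (pc, pid) sample to a function name."""
    symbols = sorted(symbols)
    starts = [start for start, _ in symbols]
    def lookup(sample):
        pc, _pid = sample
        i = bisect.bisect_right(starts, pc) - 1
        return symbols[i][1] if i >= 0 else "<unknown>"
    return lookup

# Made-up samples: (pc, pid) pairs and the current (A) seen at each one.
samples = [(0x1000, 1), (0x1010, 1), (0x2000, 2), (0x1004, 1)]
currents = [0.2, 0.2, 0.4, 0.2]
symbols = [(0x1000, "decode"), (0x2000, "render")]

by_process = energy_profile(samples, currents, key=lambda s: s[1])
by_function = energy_profile(samples, currents, key=function_lookup(symbols))
print(by_process, by_function)
```

The same accumulation serves both the process-level and the function-level analysis; only the key that each sample is grouped by changes.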
Unknown
Model validation
Measurement-based, so no model exists
Profiling speed
625Hz real-time profiling
Drawn current
PC and PID
Post processing Speed
Unknown
Measurement-based tool
Measures the energy consumption of Itsy pocket computer
DAQ (Data Acquisition Equipment)
Rsense, Vsense
W = P*t = Vsupply*Vsense/Rsense*t
The energy consumption of a pocket computer is markedly different from that of a laptop computer
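The energy relation W = Vsupply × Vsense / Rsense × t can be applied sample by sample. The sketch below assumes a constant supply voltage and evenly spaced Vsense samples. The 3.1 V supply and the 2 mV readings are made-up example values, while the 0.02 Ohm sense resistor and 5 kHz rate follow the figures quoted for this setup.

```python
def energy_joules(vsense_samples, v_supply, r_sense, sample_hz):
    """Sum V_supply * (V_sense / R_sense) * dt over all samples."""
    dt = 1.0 / sample_hz
    return sum(v_supply * (vs / r_sense) * dt for vs in vsense_samples)

# One second of samples at 5 kHz with a steady 2 mV across the resistor,
# i.e. 0.1 A drawn from an assumed 3.1 V supply.
vsense = [0.002] * 5000
e = energy_joules(vsense, v_supply=3.1, r_sense=0.02, sample_hz=5000)
print(e)  # about 0.31 J
```

A small sense resistor keeps the voltage drop seen by the device low, which is why the measured Vsense needs amplification before it is sampled.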
Hardware
StrongARM SA-1100 processor
16MB DRAM, 4MB Flash memory
Audio CODEC, microphone, speaker
LCD display, touch screen
Serial, IrDA, USB IO, pushbuttons
Software
Runs Linux
MIDI
MPEG video playback
Text to speech, speech to text
[Figure: measurement setup. A sense resistor Rsense (about 0.02 Ohm) sits between the supply voltage and the Itsy pocket computer; a differential amplifier feeds Vsense to the DAQ equipment.]
Peak-to-Peak voltage error 0.15mV
Model validation
Measurement-based, so no model exists
Profiling Speed
5KHz real-time profiling
Vsense
Postprocessing Speed
Unknown
Web-based SW energy profiling tool
Input: C source code, optimization options, operating frequency
Output: energy consumption
Works for StrongARM SA-1100 and Hitachi SH-4 microprocessors