
High-level Power Estimation


Academic year: 2024


(1)


Naehyuck Chang

Dept. of EECS/CSE

Seoul National University

[email protected]

(2)

Hard to estimate the switching activity at higher levels

Logic primitive based estimation

By Landman and Rabaey

DSP circuits based on high-level description of the system

DSP circuits composed of logic primitives

adders, comparators, multipliers, etc.

Should estimate the switching activity at the output of such logic primitives for the given input signal probabilities and switching activities

Can be used for high-level synthesis


(3)

Using high-level stochastic behaviors such as mean, variance, etc.

Follows the gate-level power estimation technique

But considers larger blocks rather than gates

Direct relationship between bit level probabilities and word level statistics

For three major DSP applications

Speech, music and image

Signal and transition probabilities follow a similar pattern

Can be represented by a piecewise linear curve


(4)

Low order bits: uncorrelated in space and time

Signal probability: 0.5 / switching activity: 0.25

Expected result

LSB switches from 0 to 1 / 1 to 0 based on whether it is even or odd

Higher order bits: complete dependence

Represent sign extensions
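The bit-level behavior described above can be checked numerically. The sketch below is only an illustration, not the authors' tool: it assumes a zero-mean AR(1) Gaussian word stream in 16-bit two's complement, and the names `bit_statistics` and `ar1_gaussian` as well as the parameter values (σ = 1000, ρ = 0.9) are hypothetical choices made for this example.

```python
import math
import random

def bit_statistics(words, width):
    """Return per-bit signal probability and 0->1 transition activity."""
    mask = (1 << width) - 1
    ones = [0] * width
    rises = [0] * width
    prev = None
    for w in words:
        u = w & mask  # two's-complement encoding of a signed integer
        for b in range(width):
            bit = (u >> b) & 1
            ones[b] += bit
            if prev is not None and bit == 1 and ((prev >> b) & 1) == 0:
                rises[b] += 1
        prev = u
    n = len(words)
    prob = [c / n for c in ones]
    act = [c / (n - 1) for c in rises]
    return prob, act

def ar1_gaussian(n, sigma, rho, seed=0):
    """Zero-mean AR(1) Gaussian stream (assumed signal model for this sketch)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigma)
    out = []
    for _ in range(n):
        out.append(int(round(x)))
        x = rho * x + rng.gauss(0.0, sigma * math.sqrt(1.0 - rho * rho))
    return out

words = ar1_gaussian(20000, sigma=1000.0, rho=0.9)
prob, act = bit_statistics(words, width=16)
for b in (0, 8, 15):
    print(f"bit {b:2d}: signal probability {prob[b]:.3f}, 0->1 activity {act[b]:.3f}")
```

Under these assumptions the LSB should come out close to probability 0.5 and activity 0.25, while the top bits move together as sign extension and switch less often because of the temporal correlation.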


(5)

Two breakpoints: BP0, BP1

Represents signal probabilities and transition activities with statistical parameters

Mean: μ

Variance: σ²

Lag-one correlation coefficient: ρ1

ρ1 = cov[Xt, Xt+1] / σ²

BP0 for signal probability (given empirically)

Signifying the end of the low-order bits
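As a minimal sketch of how the three word-level parameters could be estimated from a sample stream: the function name `word_statistics` is hypothetical, and the normalization (population variance, lag-one covariance averaged over the n-1 adjacent pairs) is an assumption of this example rather than something the slides specify.

```python
def word_statistics(xs):
    """Estimate mean, variance, and lag-one correlation of a word stream."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    # Lag-one covariance: average product of deviations of adjacent samples.
    cov1 = sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1)) / (n - 1)
    rho1 = cov1 / var
    return mean, var, rho1

mean, var, rho1 = word_statistics([1, 2, 3, 4, 5])
print(mean, var, rho1)
```

A slowly varying stream such as the ramp above yields a positive ρ1; a stream that alternates around its mean would yield a negative one.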

(6)

BP0 for the switching activity (given empirically)

Signifying the end of the low-order bits

What about MSBs?

Depending on the signal distribution

For signal probabilities

PMSBs = F1(μ/σ)

For switching activities

AMSBs = F01(μ/σ, ρ1)

(7)

Many architectural/system decisions have huge impact on power and performance

Often need feedback at the early-stage of a design

Pre-RTL, pre-circuit analysis

Run-time, system-level feedback control

Application/dynamic run-time characteristics allow dynamic scaling for power reduction



(9)

Architectural simulators

    Functional

        Trace-driven

        Exec-driven

            Interpreters

            Direct execution

    Performance

        Inst. schedulers

        Cycle timers

(Shaded parts of the original figure indicate SimpleScalar)

(10)

Arch Sim

Instruction accurate

Programmer’s view

Correctness of ISA

Used for software development

Performance simulator

uArch Sim

Cycle accurate

Architect’s view

Performance improvement

Used for architecture exploration


(11)

Using trace for simulation

Trace was already obtained

No feedback to trace

Execution-driven simulator

Using program for simulation

Generating stream dynamically

Pros.

Accuracy by reflecting dynamic circumstances

Cons.

Time-consuming to implement and run

(12)

Schedules instructions based on resource availability

Instructions processed one at a time

Less detailed

Cycle timers

Tracks the microarchitectural state in every cycle

Many instructions are in various stages at any time

Detailed

(13)

C code is compiled with libraries

Binary is fed into simulators

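[Figure: tool chain. Fortran code is translated by F2C into C code; GCC compiles C code into assembly code; GAS assembles it into object files; GLD links the object files with libf77.a, libm.a, and libc.a into executables, which are fed to the simulators. Bin Utils operate on the object files and executables.]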

(14)
(15)

Simulator suite, ordered from highest performance to highest detail:

Sim-Fast: 420 lines, functional, 4+ MIPS

Sim-Safe: 350 lines, functional with checks

Sim-Profile: 900 lines, functional, lots of statistics

Sim-Cache / Sim-Cheetah / Sim-BPred: <1000 lines, functional, cache statistics, predictor statistics

Sim-Outorder: 3900 lines, performance, OoO issue, branch prediction, mis-speculation, ALUs, cache, TLB, 200+ KIPS

(16)

I/O is implemented

Via SYSCALL instruction

Decode system call

Copy arguments into the simulator memory

Perform the system call on the host (real machine)

Copy results into the simulated memory

[Figure: I/O simulation. The simulated program calls write(fd, p, 4); the arguments pass into the simulator, which calls sys_write(fd, p, 4) on the host and passes the results back out.]
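The decode / copy / execute / return sequence for a proxied system call can be sketched as follows. This is a hedged illustration, not SimpleScalar source code: the `SimulatedMemory` class, the register names `v0`, `a0`, `a1`, `a2`, and the syscall number `SYS_WRITE = 4` are all hypothetical stand-ins.

```python
import os

class SimulatedMemory:
    """Flat byte-addressable memory of the simulated program (hypothetical)."""
    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, addr, n):
        return bytes(self.data[addr:addr + n])

    def write(self, addr, payload):
        self.data[addr:addr + len(payload)] = payload

SYS_WRITE = 4  # hypothetical syscall number for this sketch

def handle_syscall(mem, regs):
    """Proxy one simulated system call to the host machine."""
    if regs["v0"] == SYS_WRITE:                  # decode the system call
        fd, p, n = regs["a0"], regs["a1"], regs["a2"]
        buf = mem.read(p, n)                     # copy arguments out of simulated memory
        regs["v0"] = os.write(fd, buf)           # perform the call on the host, return result
    else:
        raise NotImplementedError(f"syscall {regs['v0']} is not modeled")

# Usage: the simulated program executes write(fd, p, 4) on a host pipe.
r, w = os.pipe()
mem = SimulatedMemory(1024)
mem.write(0x100, b"ping")
regs = {"v0": SYS_WRITE, "a0": w, "a1": 0x100, "a2": 4}
handle_syscall(mem, regs)
received = os.read(r, 4)
os.close(r)
os.close(w)
print(received, regs["v0"])
```

A call that returns data, such as read, would add the last step from the list: copying the host result buffer back into simulated memory.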

(17)
(18)

A single file describes ISAs

ss.def

Flexible

Existing commercial ISA : MIPS, ARM

(19)

Memory hierarchy components: L1 D-cache, L1 I-cache, local bus, D-TLB, I-TLB, TLB bus, global bus, L2 TLB, L2 cache, on-chip banks, off-chip banks, network interface

-cache: define L1D:1024:32:2:1:1:vipt:0:2:1:Onbus:Globalbus

Fields, in order: name, # sets, block size, associativity, replacement, hit latency, translation, prefetch, # resources, resource code, resource names

-bus: define Onbus:32:8:1:1:0:2:1:L2:Onbank

Field labels in the figure: name, width (labeled twice), cycle ratio, arbitration, inf. b/w, # resources, resource code, resource names

-bank: define Onbank:20:0

Field labels in the figure: name, access penalty, banking code
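To make the colon-separated cache definition concrete, here is a small parsing sketch. It assumes the field order name, # sets, block size, associativity, replacement, hit latency, translation, prefetch, # resources, resource code, resource names, as read off the figure labels, and it assumes the block size is in bytes. `CacheConfig` and `parse_cache_define` are hypothetical names, not part of the simulator.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CacheConfig:
    name: str
    num_sets: int
    block_size: int        # assumed to be in bytes
    associativity: int
    replacement: str
    hit_latency: int
    translation: str
    prefetch: str
    resource_code: str
    resource_names: List[str]

def parse_cache_define(spec):
    """Split a '-cache: define' argument into named fields (assumed order)."""
    f = [tok.strip() for tok in spec.split(":")]
    if len(f) < 10:
        raise ValueError("too few fields in cache definition")
    n_res = int(f[8])
    if len(f) != 10 + n_res:
        raise ValueError("resource count does not match the listed resource names")
    return CacheConfig(
        name=f[0], num_sets=int(f[1]), block_size=int(f[2]),
        associativity=int(f[3]), replacement=f[4], hit_latency=int(f[5]),
        translation=f[6], prefetch=f[7], resource_code=f[9],
        resource_names=f[10:],
    )

cfg = parse_cache_define("L1D:1024:32:2:1:1:vipt:0:2:1:Onbus:Globalbus")
capacity = cfg.num_sets * cfg.block_size * cfg.associativity
print(cfg.name, cfg.resource_names, capacity)
```

Under the byte-sized block assumption, the example line describes a 2-way cache of 1024 × 32 × 2 = 65536 bytes attached to two resources.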

(20)
(21)

Accelerate simulation speed

At least more than 100%

Catch the program behavior that occurs in the later part of simulation

Reduce simulation error

(22)


(23)


(24)

It is too time-consuming to go to other phases


(25)


(26)

Estimates CPU power consumption

Processor core

On-chip cache

Instruction cache

Data cache

On-chip bus

Clock-generation/Control logic/FP units

SimpleScalar Interface

SimpleScalar provides a simulation environment

Out-of-order processors with 5-stage pipelines

(27)

Array structure

I cache, D cache, cache tag array

Register files

Branch predictors

Instruction window, load/store queue

Fully Associative CAM

Instruction window/reorder buffer wakeup logic

Load/store order checks

TLBs

Combinational Logic and Wires

Functional units

Clocking

Clock buffer, clock wires


(28)

microprocessor

Pd = C · VDD² · A · F

C is load capacitance

VDD is supply voltage

A is an activity factor

F is the operating frequency

C of each categorized power model is calculated

Each unit can be reduced into stages

Each stage is modeled as an RC circuit

Selective clock-gating is considered
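A worked instance of the dynamic power formula, as a minimal sketch. The function name `dynamic_power` and all numeric values are illustrative placeholders, not figures from any real processor.

```python
def dynamic_power(c_load, vdd, activity, freq):
    """Dynamic power Pd = C * VDD^2 * A * F, in watts for SI inputs."""
    return c_load * vdd ** 2 * activity * freq

# Placeholder values: 1 nF switched capacitance, 2.0 V, activity 0.5, 100 MHz.
p_full = dynamic_power(c_load=1e-9, vdd=2.0, activity=0.5, freq=100e6)
# Halving the supply voltage alone cuts dynamic power to one quarter.
p_half_vdd = dynamic_power(c_load=1e-9, vdd=1.0, activity=0.5, freq=100e6)
print(p_full, p_half_vdd)
```

The quadratic dependence on VDD is why voltage scaling is usually a stronger lever than frequency scaling alone; clock gating acts through the activity factor A.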


(29)

Model capacitance vs. physical schematics

Within 6-11%

Relative power consumption by structure

Average difference is 10.7%

Relative validation

Max power consumption

Absolute validation

Simulation speed

80K instructions per second

cf. 105K instructions per second for SimpleScalar


(30)

Pipeline stage: power (units accessed) / performance (condition checked)

Fetch: I-cache, Bpred / Cache hit? Bpred lookup

Dispatch: Rename table, Inst. window, Reg. file / Inst. window full?

Issue/Execute: Inst. window, Reg. file, ALU, D-cache, Load/St Q / Dependencies satisfied? Resources?

Writeback/Commit: Result bus, Reg. file, Bpred / Commit bandwidth?

(31)

Register files

[Figure: parameterized register file power model. Hardware parameters: number of entries, data width of entries, # read ports, # write ports. Run-time inputs: number of active ports, bitline activity. Output: power estimate.]


(32)

Instead of a performance simulator, performance counters can be used

Speedup of simulation time

Real input trace

Eliminates the input trace error

Otherwise, the input trace error might cause severe errors in the output

Integration with power simulator

Getting input traces from a real machine

The collected trace is used for the power simulator

In this case, the specifications of the real machine and the power simulator should be the same

(33)

[Figure: compilability, code density, ISA, μ-Arch, chip area, and verification cost, related to N, I, f, Ed, and Pl and to a combined performance/power metric ("?")]

N: dynamic instruction count

I: architectural speed (IPC)

f: max frequency

Ed: average switching energy

Pl: leakage power

(34)

Execution-driven, cycle-accurate RT level energy estimation tool

Estimates CPU power consumption

Processor core

On-chip cache

Instruction cache

Data cache

On-chip bus

Clock-generation/control logic/FP units are not included

Integrated to the SimpleScalar architectural simulator for the integer subset of the instruction set of SimpleScalar


(35)


(36)

Dynamic power: the main source of power consumption in CMOS microprocessors

Pd = C · VDD² · A · F

C is load capacitance

VDD is supply voltage

A is an activity factor

F is the operating frequency

SimplePower is based on input transitions rather than input statistics


Switch capacitance table

(37)

Cache simulator

Modified Dinero III

Bus simulator

Snoops the I cache/D cache address/data bus (on-chip bus)

Switch capacitance table

Bit-dependent functional unit

Bit-independent functional unit

Table size problem (too large table) is solved by a clustering algorithm

If a clustering algorithm is not appropriate

Analytical modeling (e.g., memory)

Partitioning a module into smaller submodules (e.g., adder, subtracter, shifter)
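A transition-based lookup of switched capacitance can be sketched as below. This is not SimplePower's actual table: in a real tool the entries would come from circuit-level characterization, whereas here `build_table` fills them with a placeholder rule (capacitance proportional to the number of toggled input bits). The per-transition energy C · VDD² follows the convention of the dynamic power formula used in these slides; all names and values are hypothetical.

```python
def build_table(width, c_per_bit):
    """Placeholder switch capacitance table indexed by (previous, current) input."""
    size = 1 << width
    return {
        (p, c): c_per_bit * bin(p ^ c).count("1")
        for p in range(size)
        for c in range(size)
    }

def energy(table, inputs, vdd):
    """Sum per-transition energy C_switched * VDD^2 over an input sequence."""
    total = 0.0
    for prev, cur in zip(inputs, inputs[1:]):
        total += table[(prev, cur)] * vdd ** 2
    return total

table = build_table(width=2, c_per_bit=1e-12)   # 2-bit unit, 1 pF per toggled bit
e = energy(table, [0b00, 0b11, 0b11, 0b01], vdd=1.0)
print(len(table), e)
```

The table grows as 4^width entries, which illustrates the table size problem mentioned above and why clustering, analytical models, or partitioning into submodules become necessary for wide units.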


(38)

Comparison between SimplePower and HSPICE

32*32 5-port register file energy estimation

Within 10-15% accuracy of circuit-level estimation

Simulation Speed

Much less than 0.1 s for each input transition

cf. 556.42 s for HSPICE circuit-level simulation


(39)

A transaction level simulator equipped with power models

Based on MaxSim from ARM-ESL

Three major modeling blocks and their accuracy (vs. gate level)

ARM926EJ-S: 93%

AXI: 95%

IP blocks: 80%

Simulation speed: ~200 KCPS

(40)

Simple mW/MHz model does not work

(41)

Separate Core and Cache power states

Cache power states

Sum of power values of SRAM generated by memory compiler for the following combinations

Modules: Tag / Data / Fill buffer

Mode: Read / Write

Access pattern: Sequential / Non-sequential / Fill buffer

(42)


(43)

AXI: AMBA 3.0 from ARM

PL300: A crossbar switch implementing AXI

Modeling strategy

Activate single master and single slave

Characterize power for each component

Active / standby power

For eight cases: (read / write) × burst length (1, 2, 4, 8)

Consider multiple masters and slaves

Consider the coupling effect by using a linear regression model

Systematically varies # of masters and slaves

In estimation phase

Power is estimated based on the above model cycle by cycle

(44)


(45)

Use RTL power estimation

ViP model: cycle-accurate model translated from RTL

RTL power macro model: based on representative states

(46)

Can estimate the power dissipated by the entire system (peak as well as average)

Can identify power critical region for the improvement

(47)

Measurement-based tool

Measures the energy consumption of entire embedded system

HP3458A DMM

625Hz sampling frequency

Linux operating system

PC (program counter), PID (process id)

However, the energy consumption of each system component cannot be measured

Originally integrated to the IBM 701C laptop

Currently integrated to the Itsy pocket computer with a small set of kernel modifications


(48)

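[Figure: data collection. The profiling computer runs the applications and the System Monitor, which records PC/PID samples. The DMM measures the current drawn from the power source; it is linked to the profiling computer by a trigger line and to the data collection computer over the HP-IB bus, where the Energy Monitor records the correlated current levels.]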


(49)

Post processing

Process level energy analysis

Function level energy analysis

[Figure: post-processing. On the postprocessing computer, the Energy Analyzer combines the PC/PID samples, the correlated current levels, and the symbol tables into an energy profile.]
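Process-level energy analysis from correlated samples can be sketched as follows, assuming each current sample has already been paired with the PID that was running when it was taken. The 625 Hz rate comes from the slides; the function name `energy_profile`, the 5.0 V supply, and the sample values are made-up placeholders.

```python
from collections import defaultdict

def energy_profile(samples, voltage, sample_hz):
    """Attribute energy (joules) to each PID: E = V * I * dt per sample."""
    dt = 1.0 / sample_hz
    profile = defaultdict(float)
    for pid, current in samples:
        profile[pid] += voltage * current * dt
    return dict(profile)

# Placeholder (PID, current in amperes) pairs; not measured data.
samples = [(101, 0.50), (101, 0.50), (202, 0.25), (101, 0.50)]
profile = energy_profile(samples, voltage=5.0, sample_hz=625)
print(profile)
```

Function-level analysis would follow the same pattern, with the sampled PC mapped to a function name through the symbol tables instead of grouping by PID.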


(50)

Unknown

Model validation

Measurement-based, so no model exists

Profiling speed

625Hz real-time profiling

Drawn current

PC and PID

Post processing Speed

Unknown


(51)

Measurement-based tool

Measures the energy consumption of Itsy pocket computer

DAQ (Data Acquisition Equipment)

Rsense, Vsense

W = P · t = Vsupply · (Vsense / Rsense) · t

The energy consumption of a pocket computer is markedly different from that of a laptop computer
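The energy formula can be applied sample by sample to a sense-voltage trace. In this sketch the 0.02 Ω sense resistor and the 5 kHz sampling rate are taken from the slides, while the function name `energy_from_vsense`, the 3.0 V supply, and the constant 2 mV trace are placeholders chosen for illustration.

```python
def energy_from_vsense(vsense_samples, v_supply, r_sense, sample_hz):
    """Energy in joules: sum of Vsupply * (Vsense / Rsense) * dt over all samples."""
    dt = 1.0 / sample_hz
    return sum(v_supply * (vs / r_sense) * dt for vs in vsense_samples)

# One second of a constant 2 mV sense voltage (placeholder trace).
trace = [0.002] * 5000
e = energy_from_vsense(trace, v_supply=3.0, r_sense=0.02, sample_hz=5000)
print(e)
```

With these numbers the current is 0.002 / 0.02 = 0.1 A, the power is 0.3 W, and one second of samples integrates to 0.3 J.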


(52)


(53)

Hardware

StrongARM SA-1100 processor

16MB DRAM, 4MB Flash memory

Audio CODEC, microphone, speaker

LCD display, touch screen

Serial, IrDA, USB IO, pushbuttons

Software

Runs Linux

MIDI

MPEG video playback

Text to speech, speech to text


(54)

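[Figure: measurement setup. A sense resistor Rsense (about 0.02 Ω) sits between the supply voltage and the Itsy pocket computer; the voltage across it, Vsense, is fed through a differential amplifier to the DAQ equipment.]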


(55)

Peak-to-Peak voltage error 0.15mV

Model validation

Measurement-based, so no model exists

Profiling Speed

5KHz real-time profiling

Vsense

Postprocessing Speed

Unknown


(56)

Web-based software energy profiling tool

Input: C source code, optimization options, operating frequency

Output: energy consumption

Works for the StrongARM SA-1100 and Hitachi SH-4 microprocessors


(57)


Architecture
