Low Power Systems Design - Seoul National University

(1)

Introduction


• Why low power design?

– Increasing demand on performance and integrity of VLSI circuits

– Popularity of portable devices

– Energy consumption in huge number of electronic devices and datacenters

• Low power design at higher levels of abstraction

– Faster design space exploration

– Wider view

– Higher power reduction

– Less cost increase

(2)

Introduction

– Opportunities for power reduction at every level of abstraction

Level of abstraction   Power reduction   Techniques
System                 50-90%            algorithms, HW-SW tradeoffs, supply voltage scaling, bus encoding
Architecture           40-70%            scheduling, resource binding, operand swapping
Register-Transfer      30-50%            clock gating, operand isolation, pre-computation, dynamic operand interchange, FSM encoding
Gate/Logic             20-30%            technology mapping, don’t care optimization, de-glitching
Transistor             10-20%            transistor sizing
Physical               5-10%             interconnect capacitance reduction, clock-tree synthesis

(3)

Introduction

– Power dissipation in CMOS circuits

• Dynamic power dissipation (dominant)

• Short-circuit power dissipation

• Leakage power dissipation

– Dynamic power dissipation

$P_{dynamic} = C_{eff} V_{dd}^2 f_{clk} = \alpha C_{phy} V_{dd}^2 f_{clk}$

$C_{eff}$: effective (switched) capacitance

$f_{clk}$: clock frequency

$\alpha$: switching activity

$V_{dd}$: supply voltage

$C_{phy}$: physical capacitance

(4)

Physical/Transistor/Gate-Level Design


• Interconnect capacitance reduction

– Signals having high switching activity are assigned short wires

• Clock-tree synthesis

– Clock is a major source of dynamic power dissipation

– The clock of the 200 MHz DEC Alpha chip drives a 3,250 pF load at a 3.3 V supply voltage => 7 W (30% of the total power)

– Clock skews must be controlled within tolerable values

[Figure: single-driver scheme versus distributed-buffers scheme (preferred)]
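The 7 W clock figure follows directly from the dynamic power relation P_dynamic = C_eff V_dd² f_clk. The sketch below only re-does that arithmetic; the function name `dynamic_power` is illustrative and not part of the lecture material.

```python
def dynamic_power(c_eff_farads, v_dd_volts, f_clk_hertz):
    """Dynamic power in watts: P = C_eff * V_dd^2 * f_clk."""
    return c_eff_farads * v_dd_volts ** 2 * f_clk_hertz


# Clock network figures quoted for the DEC Alpha chip:
# 3,250 pF load, 3.3 V supply, 200 MHz clock.
clock_power = dynamic_power(3250e-12, 3.3, 200e6)
print(f"clock power = {clock_power:.2f} W")
```

The result is close to 7 W, which matches the value on the slide.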

(5)

• Transistor sizing

– Compute the slack at each gate

– Sizes of the transistors in the gate are reduced until the slack becomes zero

– Reduced size => reduced capacitance => reduced power

– Critical path is not affected

– Path balancing => reduced glitch => reduced power

(6)

• Technology mapping

– V. Tiwari, P. Ashar, and S. Malik, “Technology mapping for low power,” Proc. of Design Automation Conference, pp. 74-79, June 1993

– Hide nodes with high switching activity inside the gates where they drive smaller load capacitances

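[Figure: two mappings of the same logic network, with nodes labeled H and L; the H-labeled nodes are hidden inside gates in the low-power mapping]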

(7)

• De-glitching

– Glitches consume 10%-40% of the dynamic power in typical combinational logic circuits

– Path balancing

• Add unit-delay buffers selectively such that the delays of all paths can be made equal

[Figure: 4-bit ripple-carry adder built from four full adders (FA) with inputs A₀-A₃ and B₀-B₃, carries C₀-C₄, and sums S₀-S₃, annotated with signal values]
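Path balancing can be illustrated with a small calculation: for each gate, every input is padded with unit-delay buffers until it arrives with the latest input. This is a minimal sketch; the function name and the arrival times (a carry that arrives two unit delays after the operands) are assumptions for illustration.

```python
def buffers_for_balancing(arrival_times):
    """Unit-delay buffers to add on each input so that all inputs
    arrive together with the latest one."""
    latest = max(arrival_times.values())
    return {name: latest - t for name, t in arrival_times.items()}


# Hypothetical arrival times (in unit delays) at the third full adder
# of a ripple-carry chain: operands arrive at 0, the carry at 2.
fa2_inputs = {"A2": 0, "B2": 0, "C2": 2}
balanced = buffers_for_balancing(fa2_inputs)
print(balanced)
```

In this sketch the operand inputs each receive two buffers and the carry input receives none, so the adder stage sees all inputs change at the same time.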

(8)

RTL Design


• Clock gating

– Disable clocks to idle parts of the circuit

– Saves clock power and the power consumed by changes of register values

[Figure: clock-gating circuits showing registers, a MUX (inputs 0 and 1), combinational logic, and a flip-flop (F/F) with data, clock, and control signals]

(9)

RTL Design

• Operand isolation

– Exploit output don’t cares of large circuit blocks in unused clock cycles

– Insert latches before the circuit blocks to reduce circuit activity

[Figure: operand-isolation circuit with registers, logic, a multiplier, an adder, a latch, a 0/1 selection, and F/F clock and control signals]

(10)

RTL Design

• Pre-computation

– Pre-compute the results of subsequent pipeline stages

[Figure: pre-computation architecture with registers, combinational logic, pre-computation logic, a 0/1 selection, and the F/F clock]

(11)

RTL Design

– Comparator example

[Figure: comparator (A>B) with registers, a MUX (inputs 0 and 1), combinational logic, and pre-computation on A[MSB] and B[MSB]]
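The comparator example can be mimicked in software: when the two most significant bits differ, A>B is already decided, so the registers holding the lower bits need not be loaded in that cycle. The sketch below assumes unsigned operands and a hypothetical width of 8 bits; the function name and the random workload are illustrative only.

```python
import random

WIDTH = 8  # hypothetical operand width


def compare_with_precomputation(a, b, width=WIDTH):
    """Return (a > b, lower_bits_needed) for unsigned operands."""
    a_msb = (a >> (width - 1)) & 1
    b_msb = (b >> (width - 1)) & 1
    if a_msb != b_msb:
        # MSBs differ: the result is known from the MSBs alone,
        # so the lower-bit registers can keep their old values.
        return a_msb == 1, False
    mask = (1 << (width - 1)) - 1
    return (a & mask) > (b & mask), True


rng = random.Random(0)
trials = 10_000
skipped = 0
for _ in range(trials):
    a, b = rng.randrange(1 << WIDTH), rng.randrange(1 << WIDTH)
    result, needed = compare_with_precomputation(a, b)
    assert result == (a > b)
    skipped += not needed
print(f"lower bits disabled in {skipped / trials:.0%} of cycles")
```

For uniformly random operands the lower bits are disabled in roughly half of the cycles. Real input distributions may behave differently.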

(12)

Architecture-Level Design


• Supply voltage reduction

– Quadratic effect of voltage scaling on power

5V --> 3.3V => about 56% (roughly 60%) power reduction

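– Supply voltage reduction => increased latency

$P_{dynamic} = C_{eff} V_{dd}^2 f_{clk}$

$E_{dynamic/cycle} = C_{eff} V_{dd}^2$

$T_g = \frac{K V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$

[Figure: energy versus V_dd and delay versus V_dd, with V_dd ranging from 1 to 5]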

(13)

– Perform optimizing transformations to meet the throughput constraint even with voltage reduction

– Concurrency-increasing transformations (increased hardware cost) => critical path reduction

– Loop unrolling, pipelining, retiming, algebraic transformation, module selection

• A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.W. Brodersen, “Optimizing power using transformations,” IEEE Tr. on CAD/ICAS, pp. 12-31, Jan. 1995

– $Y_N = A Y_{N-1} + X_N$ --> $Y_N = A^2 Y_{N-2} + A X_{N-1} + X_N$, $Y_{N-1} = A Y_{N-2} + X_{N-1}$

[Figure: data-flow graphs of the original recurrence (delay D) and of the transformed recurrence (delay 2D)]

(14)

[Figure: data-flow graphs of the three implementations compared below]

Implementation                              C_eff   Voltage   Throughput   Power
Original (delay D)                          1       5         1            25
Transformed (delay 2D)                      1.5     3.7       1            20
Transformed (delay 2D) with extra delay D   1.5     2.9       1            12.5
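The transformed recurrence produces the same output sequence as the original one, two samples per iteration. The power figures follow from C_eff · V² at equal throughput. The sketch below checks both points; the function names, the test signal, and the coefficient value are assumptions for illustration.

```python
def iir_original(x, a, y_init=0.0):
    """Y_N = A*Y_{N-1} + X_N, one output per iteration."""
    y, out = y_init, []
    for sample in x:
        y = a * y + sample
        out.append(y)
    return out


def iir_unrolled(x, a, y_init=0.0):
    """Two outputs per iteration; len(x) is assumed to be even."""
    y_nm2, out = y_init, []
    a2 = a * a  # A^2 is a constant computed once
    for i in range(0, len(x), 2):
        y_nm1 = a * y_nm2 + x[i]                 # Y_{N-1}
        y_n = a2 * y_nm2 + a * x[i] + x[i + 1]   # Y_N
        out += [y_nm1, y_n]
        y_nm2 = y_n
    return out


def relative_power(c_eff, v_dd):
    """Power at fixed throughput, up to a constant: C_eff * V_dd^2."""
    return c_eff * v_dd ** 2


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reference = iir_original(signal, 0.5)
transformed = iir_unrolled(signal, 0.5)
powers = [relative_power(1, 5), relative_power(1.5, 3.7), relative_power(1.5, 2.9)]
print(reference == transformed, [round(p, 1) for p in powers])
```

The computed powers are close to the 25, 20, and 12.5 quoted above. The small differences come from rounding of the voltages.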

(15)

• Reduction of effective capacitance

– R. Mehra, L.M. Guerra, and J.M. Rabaey, “Low power architectural synthesis and the impact of exploiting locality,” Journal of VLSI Signal Processing, 1996

– Buses consume 5-40% of the total power

– Reducing access to global resources through clustering

[Figure: additions clustered onto Adder1 and Adder2, contrasting global data transfers with local data transfers]

(16)

• Switching activity reduction

– Increasing data correlation through operand sharing

• Operations sharing an operand also share a resource

• Actively increase the chance of operand sharing through loop interchange, operand reordering, loop unrolling, and loop folding

– Loop interchange

Before:

    for i
      for j
        for k
          for l
            a = f(k, l)
            b = f(i, j, k, l)
            c(i, j) = a - b

After:

    for k
      for l
        a = f(k, l)
        for i
          for j
            b = f(i, j, k, l)
            c(i, j) = a - b
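The benefit of the interchange can be made concrete by counting how often the operand `a` changes at the input of the subtractor. The sketch below uses hypothetical loop bounds and a stand-in function f(k, l) that returns a distinct value for every (k, l) pair. It counts value changes only and does not model power.

```python
def count_operand_changes(order, I=2, J=2, K=3, L=3):
    """Count how often operand a = f(k, l) changes between iterations."""
    def f(k, l):
        return k * L + l  # stand-in: distinct value per (k, l)

    if order == "ijkl":  # original nesting
        iters = ((k, l) for i in range(I) for j in range(J)
                 for k in range(K) for l in range(L))
    else:                # interchanged nesting: k, l outermost
        iters = ((k, l) for k in range(K) for l in range(L)
                 for i in range(I) for j in range(J))

    changes, prev = 0, None
    for k, l in iters:
        a = f(k, l)
        if prev is not None and a != prev:
            changes += 1
        prev = a
    return changes


before = count_operand_changes("ijkl")
after = count_operand_changes("klij")
print(before, after)
```

With these bounds the operand changes on almost every iteration before the interchange, but only when (k, l) advances afterwards.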

(17)

– Scheduling and binding

• E. Musoll and J. Cortadella, “Scheduling and resource binding for low power,” Proc. of Int’l Symp. on System Synthesis, pp. 104-109, Apr. 1995

• Resource sharing by sibling operations

• Operations sharing the same operand are scheduled in control steps as close as possible (they are given higher priority in list scheduling)

• After functional unit binding, bind registers such that useless power is reduced (no change of inputs to idle functional units)

[Figure: traditional versus modified schedule of the multiplications involving n1-n5, showing an idle functional unit]

(18)

System-Level Design


• System-level power optimization

[Figure: target system with a processor core, an ASIC, on-chip instruction memory, on-chip data memory, interface circuits, a codec, and off-chip memory (RAM, ROM)]

– System specification

– Low-power HW-SW partitioning

– Power estimation/simulation

– Low-power compilation, memory mapping, instruction compaction

– VSP, power-conscious scheduling, OSPM

– Bus coding, interface exploration

(19)

Bus Encoding


• Reduce the number of transitions on high-capacitance, multi-bit buses by encoding the signals

• Example

– Bus-invert coding

• M.R. Stan and W.P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49-58, Mar. 1995

– Unencoded, on the high-capacitance bus: 00110001 --> 01001100 (6 toggles)

– Bus-invert coded, with an extra invert line: 00110001 0 --> 10110011 1 (3 toggles)
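The toggle counts in the example can be reproduced with a few lines of code. This is a minimal sketch of bus-invert coding, assuming an 8-bit bus and the usual rule of inverting when more than half of the lines would toggle. The function names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(prev_lines, data, width=8):
    """Return (lines, invert) to drive, given the previous data lines."""
    mask = (1 << width) - 1
    if hamming(prev_lines, data) > width // 2:
        return (~data) & mask, 1  # send the complement, raise invert
    return data, 0


def bus_invert_decode(lines, invert, width=8):
    """Recover the original data word at the receiver."""
    mask = (1 << width) - 1
    return (~lines) & mask if invert else lines


prev_lines, prev_inv = 0b00110001, 0
data = 0b01001100
plain_toggles = hamming(prev_lines, data)
lines, inv = bus_invert_encode(prev_lines, data)
coded_toggles = hamming(prev_lines, lines) + (prev_inv != inv)
print(plain_toggles, format(lines, "08b"), inv, coded_toggles)
```

The toggle count for the coded bus includes the transition on the invert line itself.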

(20)

Dynamic Voltage Scaling

Dynamic power dissipation

$P_{dynamic} = C_{eff} V_{dd}^2 f_{clk}$

Gate delay by the α-power model

$T_g = \frac{K V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$, so $f_{clk} = \frac{(V_{dd} - V_{th})^{\alpha}}{K_f V_{dd}}$

Energy per cycle

$E_{per\_cycle} = C_{eff} V_{dd}^2$

Energy consumed by a task that takes n cycles

$E_{task} = n C_{eff} V_{dd}^2$

=> not a function of time but a function of # cycles (switchings)

[Figure: performance versus time from 0 to the deadline, comparing full speed followed by shutdown with low speed until the deadline]

Full speed at $(V_{dd}, f_{clk})$: $E_{task} = n C_{eff} V_{dd}^2$

Low speed at $(V_{dd}/2, f_{clk}/2)$: $E_{task} = n C_{eff} V_{dd}^2 / 4$
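Because task energy depends on the number of cycles and on the square of the supply voltage, running at half the voltage and half the frequency uses a quarter of the energy and takes twice as long. The sketch below checks this. The cycle count, capacitance, voltage, and frequency are hypothetical values chosen only to make the arithmetic concrete.

```python
def task_energy(n_cycles, c_eff, v_dd):
    """Energy of a task: E_task = n * C_eff * V_dd^2."""
    return n_cycles * c_eff * v_dd ** 2


def task_time(n_cycles, f_clk):
    """Execution time of a task that takes n cycles."""
    return n_cycles / f_clk


# Hypothetical task and processor parameters.
n, c_eff, v_dd, f_clk = 1_000_000, 1e-9, 3.3, 100e6

full_energy = task_energy(n, c_eff, v_dd)
low_energy = task_energy(n, c_eff, v_dd / 2)
full_time = task_time(n, f_clk)
low_time = task_time(n, f_clk / 2)
print(low_energy / full_energy, low_time / full_time)
```

The low-speed option only pays off when the longer execution time still meets the deadline, which is what the scheduling slides address.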

(21)

• DVS on a Microprocessor System

– T. Pering, T. Burd, and R. Brodersen, “Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System,” in Power Driven Microarchitecture Workshop in conjunction with ISCA98, June 1998

– System block diagram (ARM8 architecture)

[Figure: block diagram with a processor core, a unified cache, SRAM banks, and an I/O bridge, distinguishing DVS components from fixed-voltage components]

(22)

– System energy breakdown

Benchmark   Miss Rate   Idle Time   Bus Activity
AUDIO       0.23%       67%         0.35%
MPEG        1.7%        22%         14%
UI          0.62%       95%         0.52%

(23)

Real-Time Scheduling on a VSP


• Y. Shin and K. Choi, “Power conscious fixed priority scheduling for hard real-time systems,” Proc. of Design Automation Conf., pp. 134-139, June 1999

• Two methods for power reduction in processors

– Power-down mode

– VSP (Variable Speed Processor)

– Proposed method:

• Combine the two methods to obtain power savings for real-time systems

How to exploit these features? => Scheduling

(24)

• Priority-based preemptive scheduling

– Simple to implement

– Many analytical methods for schedulability analysis

– Fixed (static) priority (RMS, DMS) => LPFPS (Low Power Fixed Priority Scheduling)

– Dynamic priority => LPEDF

• Implementation of priority-based preemptive scheduling

– Active task, Run Q, Delay Q

(25)

[Figure: timeline from 0 to 300 showing the active task, the Run Q, and the Delay Q]

– Run Q is empty

– The processor can be slowed down until time 200, which is min(deadline of the active task, next arrival time of Delay Q.head)

(26)

[Figure: variation of execution time [Ernst 97], shown as BCET/WCET (0.0 to 1.0) for the benchmarks 3D-image, diesel, fft, bsort, smooth, blue, check-data, whetstone, and line]

The chance for speed control increases as the variation of execution time increases.

(27)

[Figure: timeline from 0 to 300 showing the active task, the Run Q, and the Delay Q]

– All the tasks reside in the Delay Q

– We can bring the processor into the power-down mode because the processor will be idle until time 200
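The two cases above (slow down when only the active task remains, power down when every task is waiting in the Delay Q) can be written as one decision function. This is a minimal sketch under stated assumptions, not the published LPFPS algorithm: the function name, its arguments, and the linear speed ratio are illustrative, and the numbers in the usage lines are made up.

```python
def lpfps_decision(now, active_remaining, active_deadline,
                   run_q_empty, next_arrival):
    """Choose a processor mode at a scheduling point.

    active_remaining: remaining worst-case execution time of the active
    task at full speed, or None when there is no active task.
    """
    if active_remaining is None and run_q_empty:
        # Every task is in the Delay Q: sleep until the next arrival.
        return ("power_down", next_arrival)
    if active_remaining is not None and run_q_empty:
        # Only the active task is runnable: stretch it up to the limit.
        limit = min(active_deadline, next_arrival)
        available = limit - now
        if available > active_remaining:
            return ("slow_down", active_remaining / available)
    return ("full_speed", 1.0)


# Illustrative numbers: 20 time units of work left at time 160,
# deadline and next arrival both at time 200.
print(lpfps_decision(160, 20, 200, True, 200))
print(lpfps_decision(100, None, None, True, 200))
```

The returned ratio is the fraction of full speed that still finishes the remaining worst-case work by the limit. A real variable speed processor would round it up to the nearest supported frequency.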

(28)

– VSP

• NOP: 20% power consumption compared to typical instructions

• Power-down mode: 5% power consumption of fully active mode with 10 cycles delay

• Frequency: 100 MHz to 8 MHz with 1 MHz step

• Voltage: 3.3 V to 1.1 V

– Experimental procedure

• Control BCET: 0.1*WCET ~ 1.0*WCET

• Execution time: random variable following a Normal distribution with m=(BCET+WCET)/2, σ=(WCET-BCET)/6

• Run 3 times for each method and take average
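The execution-time model in the procedure above can be sketched with the standard library. Clamping each sample to the interval [BCET, WCET] is an assumption added here so that no sample falls outside the stated bounds. The WCET value and the seed are arbitrary.

```python
import random


def sample_execution_time(bcet, wcet, rng):
    """Draw one execution time from N(m, sigma) with
    m = (BCET + WCET) / 2 and sigma = (WCET - BCET) / 6."""
    mean = (bcet + wcet) / 2
    sigma = (wcet - bcet) / 6
    # Assumption: clamp rare outliers into [BCET, WCET].
    return min(wcet, max(bcet, rng.gauss(mean, sigma)))


rng = random.Random(42)  # arbitrary seed for repeatability
wcet = 10.0              # hypothetical worst-case execution time
for ratio in (0.1, 0.5, 1.0):
    bcet = ratio * wcet
    samples = [sample_execution_time(bcet, wcet, rng) for _ in range(1000)]
    print(ratio, round(sum(samples) / len(samples), 2))
```

With σ set to one sixth of the range, the interval [BCET, WCET] covers three standard deviations on each side of the mean, so clamping affects only a small share of the samples.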

(29)

• Experimental results

[Figure: power reduction (%, 0 to 60) versus BCET/WCET (0.1 to 1.0) for FPS+power_down and LPFPS; panels: Avionics, INS, Flight control, CNC]