• Tidak ada hasil yang ditemukan

Profiling

N/A
N/A
Protected

Academic year: 2024

Membagikan "Profiling"

Copied!
27
0
0

Teks penuh

(1)

Chapter 6 Chapter 6

Profiling

Profiling

(2)

Simplified assembly instruction set Simplified assembly instruction set

design flow g

Design of

assembly Instruction set

Application assembly

instruction set architecture

Instruction set benchmarking Application

profiling

(3)

Contents

• What is profiling and why profiling

• Static profiling

– Coarse profiling and fine profiling

• Dynamic profiling Dynamic profiling

– Instrumentation and dynamic profiling

• Evaluate profiling quality

(4)

ASIP design: start from applications

• Divided ASIP design into two steps:

– 1: understand the functions of the application and its requirements on performance. q p

– 2: find ways to minimize costs while meeting requirements on flexibility and performance.

requirements on flexibility and performance.

• Code profiling analyzes the application

d t t d it ti

code structure and measures its execution time and memory cost.

(5)

Ways to understand applications

Reading Implementation Reading

Source code profiling

Reference assembly code profiling Implementation

proposal Design of the

Reference assembly code profiling

Benchmarking of designed assembly instruction set instruction set

Release the

instruction set Level of

instruction set Level of

understanding Project progress

(6)

Why profiling?

• Algorithm computing cost

• Memory access cost

• The cost of running program flow controlThe cost of running program flow control

• The execution cost for hardware and timing resource management

resource management

• Cost to serve asynchronous tasks

• Synchronization and communication as well other cost of parallelizationp

(7)

What to profile

• Algorithm computing cost

• Memory access cost

• The cost of running program flow controlThe cost of running program flow control

• The execution cost for hardware and timing resource management

resource management

• Cost to serve asynchronous tasks

• Synchronization and communication as well other cost of parallelizationp

(8)

To support instruction set design, profiling should identify:

• The most MIPS consuming functions that should be accelerated

be accelerated

Th f l i f i h

• The most frequently appearing functions that should be accelerated

• Datapath architecture satisfying questions 1 and 2

• Memory and bus architecture supporting questions 1 and 2 with the minimum on-chip memory cost

(9)

Static and dynamic profiling

• Static profiling is given by analyzing the source code instead of running it

– WCET analysisWCET analysis

• Dynamic profiling is performed by executing the source program and accumulating the p g g

execution time

(10)

Profiling process Profiling process

1. Decide the scope of a profiler: not easy 2. Analyze source code (static profiling).

3 Prepare and configure the profiling tool 3. Prepare and configure the profiling tool

(instrumentation).

4. Run dynamic profiling.

5 Analyze the results of static/dynamic 5. Analyze the results of static/dynamic

profiling.

(11)

Source code profile Assembly language profile

Efficiency Accuracy Accuracy

MATLAB C/C++ IR ASM

Dynamic profiling at IR level is possible if IR simulator is available

i h b i l d

Memory accesses might be excessively counted

(12)

Static profiling in 4 steps

1. Generate CFG (Control Flow Graph), identify

h b i bl k h l

the basic blocks are the lowest component.

2. Perform cost annotation on each branch of the CFG.

3. Identify cost accumulation based on annotations y on all branches of the CFG.

4 Analyze the result and find the cost of the true 4. Analyze the result and find the cost of the true

critical path.

(13)

i i d i

• For instruction set design, expose

computation cost in a BB (Basic blocks)

• Counting operations in a BB including

+ * / and %

+, , *, /, and %.

– logic operations &&, ||, and !.

– Bitwise operations ~, ^ , &, | , << , and >>

– Relation operators >,p , >=,, < ,<=,, , ==, and !=,

• Memory accesses and communication should

b l fil d

be also profiled

(14)

Start

B1 B3

a11

12

a13

15 16

Basic block 1:

10 MAC

20 ADD / SUB

p le

B2 a12

a21

a15 a16

20 ADD / SUB 15 scaling (shift)

x am p

B4 a22

24 a36 a23

a55

n e x

B7

B5 B6

a24

a31 a32 a41 a42 a51

Fine-grained a55

profiling

A n

B7

B8 B6 B9

B10 B11

a34 a33

a52

B8 B11

B10

a43 a54

a35 a37

a53

B12

(15)

Coarse-grained profiling Fine-grained profiling

Basic block Total Arithmetic operations Logic / Shift MAC Basic block Total Arithmetic operations Logic / Shift MAC

B1 45 20 15 10

B2 12 10 2 0

B2 12 10 2 0

B3 3 2 1 0

B4 140 10 30 100

B4 140 10 30 100

B5 8 4 2 2

B6 145 25 0 120

B6 145 25 0 120

B7 122 2 10 110

B8 53 50 3 0

B9 14 10 2 2

B10 214 12 2 200

B11 11 8 1 2

B12 6 5 1 0

(16)

Load/store and move Load/store and move

d h h hi

• Depend very much on the target architecture

– Instruction set – Register file

Data memory – Data memory

• Architecture independent load/store

– Input variables of a procedure – Output of the procedureOutput of the procedure

– Changed global variables

Hidd l d/ t ill d ti bl

• Hidden load/store: spilled vatiables

(17)

Coarse-grained Static Profiling

• Coarse-grained: for architecture design

– fine-grained: for instruction set design

• Identify

• Identify

– Time-consuming BBs

– Frequently executed BBs

– BBs that use large part of program memory – BBs that use large part of data memory

(18)

An example: IEEE802.11a/g: p g appearance of BB × cost of BB

Tasks Algorithms MOPS Data types

Viterbi decoding Viterbi decoding algorithm 1700 Short integer Receiving Filterg Complex FIR convolutionp 960 Complex p De-interleaving Permutation algorithm 260 Bit

Track fast fading FFT and vector diff 160 Complex

F di i F d i LMS 200 C l

Fading compensation F-domain LMS 200 Complex

Sampling phase offset FFT / phase decision 100 Complex

Demapping Decision FSM 80 Integer

Descrambling Polynomial algorithm 80 Bit

Packet detection Autocorrelation 80 Complex

Energ detection A tocorrelation 80 Comple

Energy detection Autocorrelation 80 Complex

IFFT/FFT 64 points FFT and IFFT 60 Complex

Payload Rotor Vector product 40 Complex

Normalization 1/x and vector product 40 Complex

TOTAL 3840

(19)

dynamic profiling y p g

• The purpose of instrumentation of a coarse- grained profiling is to understand the

software system and the structure of the y code at a higher level.

• It identifies the critical paths the scope of

• It identifies the critical paths, the scope of the algorithms, the general execution

behaviors of kernel algorithms, and the use of hardware in general.g

(20)

Instrumentation for Coarse grained Instrumentation for Coarse-grained

dynamic profiling y p g

• Probes or instrumentations are inserted to:

– Check the frequency to reach each program mode (behavior of the high-level CFG).( g )

– Check the cycle cost of each basic block.

Check the frequency of invocation of a basic – Check the frequency of invocation of a basic

block (low level paths).

(21)

dynamic profiling y p g

• To get the knowledge of appearance of instructions inside the basic blocks

• To help decide:To help decide:

– Which appears frequently

hi h h ld b l d

– Which should be accelerated

(22)

Example: Fine-grained dynamic profiling p g y p g

Standards G.723 6.3kb/s

Operations in a voice frame Dynamic Static

16 bits Fractional MAC 222373 264810

16 bits Fractional MAC 222373 264810

16 bits Integer MAC 58565 74768

32 bits arithmetic left shift 8550 10242

16 bits absolute value 6975 8945

32 bits absolute value 6937 9068

16 bits multiplication with 16 bits rounded result 3308 3478

32 bits arithmetic right shift 2625 4075

16 bits multiplication 32 bits result 2126 7076

32 bits subtraction 638 841

32 bits addition 557 682

32 bits normalization 378 782

32 bits by 16 bits multiplication 339 1406

16 bits multiplication 16 bits result 329 7544

TOTAL main arithmetic operations in a frame 313772 393968

The frame length (milliseconds) 30 30

TOTAL main arithmetic operations per voice sample 1307 1642

TOTAL arithmetic MOPS 10.46 13.14

(23)

Hidden cost: Shall be estimated

• To prepare and terminate a subroutine call

• The cost of preparing and running HW (DMA and other IO transactions)

( )

• The cost of the top-level management of a program

program

• The cost of thread and interrupt handling

Th f ll li i

• The extra cost of parallelization

(synchronization, communication, etc.)

(24)

Hidden cost: Shall be estimated

• Hidden computing costs cannot be exposed

b f l i bl l l d

before analyzing assembly level codes.

• When HWMR (HW multiplexing ratio) is low, ( p g ) , the hidden cost could be a big trouble (?)

• Check the example: Running 802 11a and

• Check the example: Running 802.11a and Bluetooth in parallel in the book.

– Bluetooth data packet processing : 1 microsecond – WLAN 802.11a processing: 4 microsecondsp g

– EDF scheduling for two tasks (250MHz) Thread switching: 50 cycles

– Thread switching: 50 cycles – 100/(250x4) = 10%

(25)

A h i h h d d i

• A technique to share a hardware device or a functional unit (a full adder, a multiplier, a register file, etc.) for different purposes at different times.

• Hardware multiplexing can be implemented

ith b ft b fi i th

either by software or by configuring the hardware

• HWMR:

– machine clock rate / signal sampling rate (?) – machine clock rate / signal sampling rate (?)

(26)

HW lti l i

HW multiplexing

Input of function 1

The HW functional unit to be multiplexed

Input of function 2 Output

p Input of function 3

Control signal g

(27)

• The quantitative design of an assembly instruction set

• The quantitative design of an assembly instruction set requires deep understanding of applications.

P fili i t d t d li ti

• Profiling is a way to understand applications.

• Static profiling: Code analysis

• Dynamic profiling: Running code using carefully selected stimuli

selected stimuli.

Referensi

Dokumen terkait

The scope of this project is developed an algorithm to identify the characters of plate number by using the MATLAB software, while the hardware is using personal computer to running

Since the cost of setting up and running an Internet Business is so low, and you can automate their operations with software, hardware and ingenious business concepts, you can set

6) Klik pada Tombol : “aux ➔ memory”, untuk melihat tampilan dari isi memory (Random Access Memory). 2) Ketikkan Code Program berikut di Editor emu8086, dengan Meng-Klik Tombol

The Windows Runtime (WinRT), a new application model for running Metro style apps, provides access to the new features of the Operating System (OS) and the native hardware

Therefore, with PDM signaling, the multi-drop memory interfaces with high capacity are feasible without increasing the power and hardware cost.. Keywords—Inter-symbol interference,

Title Publisher Year of Publication 1 Service Flow Management with Multi-objective Constraints in Heterogeneous Computing ICCES 2018 2018 2 Service Flow Using Genetic Algorithm

Skills: cognitive - explain the basic concepts of computer hardware and software Phsycomotoric - explain and create simple algorithm schemes using flow charts Competences: Students able

Topics No.of Hours 1 Program execution: Program, Compilation, Object files, Function call and return, Address space, Data and its representation 4 2 Computer organization: Memory,