Digital Signal Processing Filtering with GPU
Sajid Anwar, Wonyong Sung
School of Electrical Engineering
Seoul National University, Gwanak-gu, Seoul 151-744, Korea
[email protected], wysung@snu.ac.kr
Abstract— This work implements digital filter kernels on the GTX 260 GPU. Various FIR and IIR kernels are implemented. IIR filters, due to their recursive nature, depend on previously computed output samples. This dependency problem can be solved by computing the particular and homogeneous solutions separately. Compared to a CPU-based implementation, experimental results show speed improvements of 3 to 40 times for the IIR and FIR filter kernels, respectively.
Keywords— FIR, IIR, GPU, CUDA, Homogeneous Solution, Particular Solution, GPGPU

I. INTRODUCTION
The Graphics Processing Unit (GPU) is responsible for manipulating and displaying graphical data. Present GPUs have hundreds of processing cores and are thus excellent at parallel processing, with each multiprocessor able to keep hundreds of threads in flight. This huge processing capability has led to General Purpose Computing on GPU (GPGPU). Traditionally, however, GPGPU limited the developer to standardized graphics APIs such as OpenGL and DirectX. With the advent of the Compute Unified Device Architecture (CUDA), these limitations have vanished [2]. This work implements both recursive and non-recursive filters on the GPU with CUDA.
In contrast to FIR filters, recursive filters depend on previous output samples; a technique is used to reduce this dependency time. The Nvidia GTX 260, with 192 cores, is the implementation platform.
Sections II and III discuss the FIR and IIR kernels and their CUDA implementations. Section IV discusses the mapping of our suggested solution onto CUDA threads and blocks. Section V concludes.
II. FIR FILTER KERNELS
FIR filters are inherently parallel and are easily implemented with CUDA. A 16-tap FIR filter is implemented. The fast shared memory on the GPU is utilized to save valuable cycles. Profiling results are given in Table 3; the measured time includes memory transfers as well.
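For reference, each output of the 16-tap filter is an independent inner product, which is what makes the kernel easy to parallelize. A minimal sequential C sketch of the computation (our own illustration, not the paper's kernel code; the function name and zero-initial-state convention are assumptions) follows:

```c
#include <stddef.h>

#define TAPS 16

/* Direct-form FIR: y[n] = sum_{k=0}^{TAPS-1} h[k] * x[n-k], assuming
 * zero samples before x[0]. Each y[n] depends only on the input, so a
 * GPU can assign every output (or a chunk of outputs) to its own thread. */
void fir16(const float h[TAPS], const float *x, float *y, size_t len)
{
    for (size_t n = 0; n < len; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < TAPS && (size_t)k <= n; ++k)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

On the GPU, the outer loop is what gets distributed across threads, with the coefficients and reused input samples staged in shared memory.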
III. IIR FILTER KERNELS
A first-order recursive filter is shown in Fig. 1. It is evident from the figure that the current output depends on the current input sample and the previous output sample. This dependence on the previous output sample reflects the inherently sequential nature of this category of filters. The dependency time can be reduced by decomposing the computation into two separate solutions, homogeneous and particular [1]. Figure 2 shows this decomposition process. It can be observed that the homogeneous solution depends on the initial condition while the particular solution does not, so the particular solution can be computed in parallel. The next section discusses how this scheme is mapped onto the GPU.
Figure 1 First Order IIR Filter
For the filter of Fig. 1, y[n+1] = a·y[n] + x[n+1]. Repeated substitution gives

  y[n+1] = a·y[n] + x[n+1]
  y[n+2] = a·y[n+1] + x[n+2] = a^2·y[n] + a·x[n+1] + x[n+2]
  similarly
  y[n+3] = a^3·y[n] + a^2·x[n+1] + a·x[n+2] + x[n+3]
  ...
  y[n+p] = a^p·y[n] + a^(p-1)·x[n+1] + a^(p-2)·x[n+2] + ... + x[n+p]

Here a^p·y[n] is the homogeneous solution, which requires the initial condition y[n], while the remaining sum a^(p-1)·x[n+1] + ... + x[n+p] is the particular solution, which depends only on the input.
Figure 2 Decomposition Process

IV. MAPPING ONTO GPU
The particular solution is computed in parallel by a total of L blocks. Each block contains M threads, and each thread processes N input samples sequentially. All threads run in parallel. Table 2 and Fig. 3 show this process. Threads are synchronized at their finishing points. Once this is done, the thread-level initial condition is propagated as depicted in Fig. 4. This propagation is conducted sequentially. At the end of step 2 (Fig. 4), each block contains its particular solution.
Table 2: Mapping of the input data structure to CUDA blocks and threads

              0th thread                  1st thread                  ...  (M-1)th thread
0th block     x[0,0,0] to x[0,0,N-1]      x[0,1,0] to x[0,1,N-1]      ...  x[0,M-1,0] to x[0,M-1,N-1]
1st block     x[1,0,0] to x[1,0,N-1]      x[1,1,0] to x[1,1,N-1]      ...  x[1,M-1,0] to x[1,M-1,N-1]
...
(L-1)th block x[L-1,0,0] to x[L-1,0,N-1]  x[L-1,1,0] to x[L-1,1,N-1]  ...  x[L-1,M-1,0] to x[L-1,M-1,N-1]

Figure 3 Particular Solution by each block (Step 1)

Figure 4 Propagating Particular Solution (Step 2)

The final step is to propagate the block-level initial condition and sum up the results. This means block-level synchronization is needed at this point. Blocks intercommunicate by transferring data through device global memory. Figure 5 shows the timing analysis; the time for one output sample can be reduced by increasing L, the number of blocks. Furthermore, a higher number of threads per block reduces the number of blocks required and hence the expense of block-level synchronization. P represents the number of processor cores. Table 3 shows the profiling results. It is believed that these results can be further improved with intelligent utilization of the memory hierarchy.
Figure 5 Time per Output Sample

Table 3 Performance Results

Category     Input (million)   GPU time (μs)   CPU time (μs)   Ratio
16-Tap FIR   1                 2654            103861          39.13
IIR          2.5               51134           130030          2.5
IIR          4.9               94431           282200          3

V. CONCLUSION
This work explored the implementation of digital filters on a GPU. Recursive filters are efficiently implemented by decomposing the solution into particular and homogeneous parts. It is observed that block-level synchronization is expensive and can be reduced by increasing thread-level parallelism. Further, the analysis showed that increasing the number of threads per block reduces the time per output sample. The GPU-based implementation proved to be 3 to 40 times more efficient than the alternative CPU implementation. Future work is to efficiently utilize the memory hierarchy for recursive filters.

ACKNOWLEDGEMENT
This work was supported in part by the Ministry of Education and Human Resources Development (MOEHRD) under the Brain Korea 21 (BK21) project, and in part by the Higher Education Commission (HEC), Pakistan.

REFERENCES
[1] W. Sung and S. K. Mitra, "Efficient Multi-Processor Implementation of Recursive Digital Filters," ICASSP 1986.
[2] Dr. Dobb's, "Supercomputing for the masses." [Available Online] http://www.ddj.com/hpc-high-performance-computing/207402986