Digital Signal Processing Filtering with GPU
Sajid Anwar, Wonyong Sung
School of Electrical Engineering
Seoul National University, Gwanak-gu, Seoul 151-744, Korea
[email protected], wysung@snu.ac.kr
Abstract— This work implements digital filter kernels on the GTX 260 GPU. Various FIR and IIR kernels are implemented. IIR filters, due to their recursive nature, depend on previously computed output samples. This dependency problem can be solved by computing the particular and homogeneous solutions separately. Compared to a CPU-based implementation, experimental results show speed improvements of 3 to 40 times for the IIR and FIR filter kernels, respectively.
Keywords— FIR, IIR, GPU, CUDA, Homogeneous Solution, Particular Solution, GPGPU

I. INTRODUCTION
The Graphics Processing Unit (GPU) is responsible for manipulating and displaying graphical data. Present GPUs have hundreds of processing cores and are thus excellent at parallel processing, with each multiprocessor able to keep hundreds of threads in flight. This huge processing capability has led to General Purpose Computing on GPU (GPGPU). Traditionally, however, GPGPU limited the developer to standardized graphics APIs such as OpenGL and DirectX. With the advent of the Compute Unified Device Architecture (CUDA), these limitations have vanished [2]. This work implements both recursive and non-recursive filters on the GPU with CUDA.
In contrast to FIR filters, recursive filters depend on previous output samples; a technique is used to reduce this dependency time. The Nvidia GTX 260, with 192 cores, is the implementation platform.
Sections II and III discuss the FIR and IIR kernels and their CUDA implementations. Section IV discusses the mapping of our suggested solution onto CUDA threads and blocks. Section V concludes.
II. FIR FILTER KERNELS
FIR filters are inherently parallel and are easily implemented with CUDA. A 16-tap FIR filter is implemented. The fast shared memory on the GPU is utilized to save valuable cycles. Profiling results are given in Table 3; the measured time includes memory transfers as well.
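For reference, each output of the 16-tap filter is an independent inner product, which is what makes the kernel easy to parallelize. A minimal sequential C sketch of the computation (our own illustration, not the paper's kernel code; the function name and zero-initial-state convention are assumptions) follows:

```c
#include <stddef.h>

#define TAPS 16

/* Direct-form FIR: y[n] = sum_{k=0}^{TAPS-1} h[k] * x[n-k], assuming
 * zero samples before x[0]. Each y[n] depends only on the input, so a
 * GPU can assign every output (or a chunk of outputs) to its own thread. */
void fir16(const float h[TAPS], const float *x, float *y, size_t len)
{
    for (size_t n = 0; n < len; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < TAPS && (size_t)k <= n; ++k)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

On the GPU, the outer loop is what gets distributed across threads, with the coefficients and reused input samples staged in shared memory.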
III. IIR FILTER KERNELS
A first-order recursive filter is shown in Fig. 1. It is evident from the figure that the current output depends on the current input sample and the previous output sample. This dependence on the previous output sample reflects the inherently sequential nature of this category of filters. The dependency time can be reduced by decomposing the computation into two separate solutions, homogeneous and particular [1]. Figure 2 shows this decomposition process. It can be observed that the homogeneous solution depends on the initial condition while the particular solution does not, so the particular solution can be computed in parallel. The next section discusses how this scheme is mapped onto the GPU.
Figure 1 First Order IIR Filter
For the filter of Fig. 1, y[n+1] = a·y[n] + x[n+1]. Repeated substitution gives

  y[n+1] = a·y[n] + x[n+1]
  y[n+2] = a·y[n+1] + x[n+2] = a^2·y[n] + a·x[n+1] + x[n+2]
  similarly
  y[n+3] = a^3·y[n] + a^2·x[n+1] + a·x[n+2] + x[n+3]
  ...
  y[n+p] = a^p·y[n] + a^(p-1)·x[n+1] + a^(p-2)·x[n+2] + ... + x[n+p]

Here a^p·y[n] is the homogeneous solution, which requires the initial condition y[n], while the remaining sum a^(p-1)·x[n+1] + ... + x[n+p] is the particular solution, which depends only on the input.
Figure 2 Decomposition Process

IV. MAPPING ONTO GPU
The particular solution is computed in parallel by a total of L blocks. Each block contains M threads, and each thread processes N input samples sequentially. All threads run in parallel. Table 2 and Fig. 3 show this process. Threads are synchronized at their finishing points. Once this is done, the thread-level initial condition is propagated as depicted in Fig. 4. This propagation is conducted sequentially. At the end of step 2 (Fig. 4), each block contains its particular solution.
Table 2: Mapping of the input data structure to CUDA blocks and threads

              0th thread                  1st thread                  ...  (M-1)th thread
0th block     x[0,0,0] to x[0,0,N-1]      x[0,1,0] to x[0,1,N-1]      ...  x[0,M-1,0] to x[0,M-1,N-1]
1st block     x[1,0,0] to x[1,0,N-1]      x[1,1,0] to x[1,1,N-1]      ...  x[1,M-1,0] to x[1,M-1,N-1]
...
(L-1)th block x[L-1,0,0] to x[L-1,0,N-1]  x[L-1,1,0] to x[L-1,1,N-1]  ...  x[L-1,M-1,0] to x[L-1,M-1,N-1]

Figure 3 Particular Solution by each block (Step 1)

Figure 4 Propagating Particular Solution (Step 2)

The final step is to propagate the block-level initial condition and sum up the results. This means block-level synchronization is needed at this point. Blocks intercommunicate by transferring data through device global memory. Figure 5 shows the timing analysis; the time for one output sample can be reduced by increasing L, the number of blocks. Furthermore, a higher number of threads per block reduces the number of blocks required and hence the expense of block-level synchronization. P represents the number of processor cores. Table 3 shows the profiling results. It is believed that these results can be further improved with intelligent utilization of the memory hierarchy.
Figure 5 Time per Output Sample

Table 3 Performance Results

Category     Input (million)   GPU time (μs)   CPU time (μs)   Ratio
16-Tap FIR   1                 2654            103861          39.13
IIR          2.5               51134           130030          2.5
IIR          4.9               94431           282200          3

V. CONCLUSION
This work explored the implementation of digital filters on a GPU. Recursive filters are efficiently implemented by decomposing the solution into particular and homogeneous parts. It is observed that block-level synchronization is expensive and can be reduced by increasing thread-level parallelism. Further, the analysis showed that increasing the number of threads per block reduces the time per output sample. The GPU-based implementation proved to be 3 to 40 times more efficient than the alternative CPU implementation. Future work is to efficiently utilize the memory hierarchy for recursive filters.

ACKNOWLEDGEMENT
This work was supported in part by the Ministry of Education and Human Resources Development (MOEHRD) under the Brain Korea 21 (BK21) project, and in part by the Higher Education Commission (HEC), Pakistan.

REFERENCES
[1] W. Sung and S. K. Mitra, "Efficient Multi-Processor Implementation of Recursive Digital Filters," ICASSP 1986.
[2] Dr. Dobb's, "Supercomputing for the masses." [Available Online] http://www.ddj.com/hpc-high-performance-computing/207402986