ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)
A Performance Study of GPU, FPGA, DSP, and Multicore Processors for Embedded Vision Systems
1M. S. Chelva, 2S. V. Halse
1Research Scholar, SVMV-SRTM University, Nanded, Maharashtra, India
2Karnataka States Women’s University Vijaypur, Karnataka, India
Abstract- The emerging trend of embedded vision is gaining momentum nowadays. Multi-function embedded vision requires highly specialized, parallel architectures in order to control constraining factors such as cost, size, performance, power consumption, throughput and latency. To carry out such computationally intensive operations, several processor technologies, such as the Graphics Processing Unit (GPU), the Field Programmable Gate Array (FPGA), Digital Signal Processors (DSPs) and the general purpose CPU, offer different degrees of parallelism. These systems find applications in surveillance, advanced driver assistance systems, safety, medical equipment, automotive systems, etc. Such diversified applications have increased the complexity of finding an appropriate solution and have promoted a core research area within the computer science and engineering discipline. In this paper, we present a comparative study of these processor technologies and outline solution approaches for various image processing algorithms in real-time embedded systems.
Keywords- Field Programmable Gate Array (FPGA), Digital Signal Processor (DSP), Graphics Processing Unit (GPU), Normalized Cross Correlation (NCC), Real Time Image Processing (RTIP).
I. INTRODUCTION
Technology in the computer vision domain is maturing very fast. The use of visual input sensors such as cameras with embedded systems is referred to as embedded vision technology. Computer vision demands parallelism and intense computation. The goal of computer vision is to automatically describe a given scene by analyzing sensed images of that scene. The sensed data can be a single image taken from a single camera, multiple images of the scene taken with multiple cameras, or images taken with the same camera over a period of time and from different angles. Computer vision forms a core research area within computer science and is an engineering discipline as well.
These systems find wide use in society for surveillance, advanced driver assistance systems, safety, medical equipment, automotive systems, etc. These applications have increased system complexity, and the wide range of applications has promoted research on cost-efficient, low-power and high-performance processing modules. Implementing vision functionality demands parallel processing or specialized hardware architectures to make the system work in real-time applications. The two common classes are Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). Typically, architectures for the lower-level vision algorithms tend to be of the SIMD class, since parallelism at lower levels is more obvious than in high-level algorithms. Another classification is based on the type of hardware used: application-specific processors versus general-purpose hardware. For the last few decades, computing architectures have been biased towards greater parallelism, and numerous studies have shown that such architectures can accelerate applications by orders of magnitude compared to sequential software [14] [15] [16]. For this purpose, we discuss the various platforms below.
The aim of this paper is to compare the performance of Graphics Processing Unit (GPU), multicore processor and Field Programmable Gate Array (FPGA) implementations of well-known algorithms in embedded systems. Algorithms such as normalized cross-correlation and Finite Impulse Response (FIR) filters are especially interesting.
II. PLATFORM OVERVIEW
Some applications have accuracy as the most important criterion; in others, computational effort or power requirement is the main issue. A variety of platforms is therefore available for image processing applications, and selecting the right technology for a given application is a critical task. Application-specific systems deliver high performance with little flexibility, while general-purpose processors provide flexibility with average performance. In layman's terms, FPGAs are the answer when power consumption is the limiting factor, while GPUs are the answer when cost is the limiting factor. GPU, FPGA and DSP are the three processor technologies considered here. General purpose processors are developed to perform numerous operations, but they are not suitable for intensive image processing applications because of their serial processing blocks. GPU architectures, in contrast, are highly parallel: they contain a large number of cores designed to handle many tasks simultaneously.
GPUs are flexible and easy to program using high-level languages and APIs which abstract away hardware details, and they are now also being developed for portable electronic devices to meet embedded requirements.
GPUs are inexpensive, commodity parallel devices with huge market penetration. They are also employed as powerful coprocessors for a large number of applications; they can perform image rotation, translation and accelerated video decoding, and can accelerate CPU algorithms. In embedded systems,
Digital Signal Processors (DSPs) and Field-Programmable Gate Arrays (FPGAs) have been the leading processor technologies. The FPGA is basically a reconfigurable device, unlike a processor. Custom reconfigurable architectures are used to process high frame rates for high-definition images and have enabled efficient real-time processing [22], while maintaining a low power budget in comparison to competing GPU and CPU solutions. Thirdly, Digital Signal Processors (DSPs) have long been a first choice in image processing applications. DSPs offer single-cycle multiply-and-accumulate operations, in addition to parallel processing capabilities and integrated memory blocks.
DSPs are very attractive for embedded automotive applications since they offer a good price-to-performance ratio.
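As a concrete illustration of this multiply-accumulate (MAC) pattern, a minimal FIR filter sketch in plain C is given below; the function name, buffer layout and tap count are our own illustrative choices and are not taken from any of the cited works. On a DSP, each iteration of the inner loop typically maps onto a single-cycle MAC instruction.

#include <stddef.h>

/* Direct-form FIR filter: y[n] = sum_k h[k] * x[n-k].
 * On a DSP, each iteration of the inner loop maps naturally
 * onto a single-cycle multiply-accumulate instruction. */
void fir_filter(const float *x, float *y, size_t n,
                const float *h, size_t taps)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= i; ++k) {
            acc += h[k] * x[i - k];   /* multiply-accumulate */
        }
        y[i] = acc;
    }
}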
Figure 1: Flexibility vs. Specialization
Application development with hardware description languages (HDLs) such as VHDL or Verilog involves numerous challenges, which limits the potential impact of reconfigurable computing with FPGAs in high performance computing. The Open Computing Language (OpenCL) is a newer alternative that reduces these productivity issues. OpenCL is a parallel programming standard that enables developers to write applications in a C-based language and target a variety of heterogeneous platforms, including CPUs, GPUs, DSPs and, most recently, FPGAs. In the case of FPGAs, OpenCL acts as a high-level synthesis tool for HDL development [1]. Using OpenCL, a designer can work with a reconfigurable computing platform while avoiding RTL code as well as platform-specific tools and libraries. The first commercial framework for OpenCL on FPGAs is the Altera SDK for OpenCL. The Altera Offline Compiler (AOC) exploits an application's wide parallelism using SIMD data types and simple compiler directives.
The tool further optimizes the hardware design by pipelining the datapaths to harness the deep parallelism available in the application. All of these features help alleviate the specialized and complex training previously needed for hardware design. OpenCL provides a higher level of abstraction to the programmer and includes development tools that postpone costly hardware compilations until the end of the design process. The results in [1] show that the productivity advantage of OpenCL comes at the cost of increased resource usage: the VHDL designs achieved a more efficient use of resources (59% to 70% less logic) while maintaining similar timing constraints (255 MHz < fmax < 325 MHz) [1].
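As a minimal illustration of the programming model described above, the sketch below shows a simple OpenCL C kernel for a per-pixel operation; the kernel name and arguments are hypothetical and are not taken from [1]. The same kernel source can be compiled for a GPU by a vendor runtime or turned into a pipelined datapath on an FPGA by an offline compiler such as the AOC.

/* Minimal OpenCL C kernel: per-pixel saturating brightness adjustment.
 * Each work-item processes one pixel, exposing wide data parallelism
 * that a GPU maps onto its cores and an FPGA compiler maps onto a
 * pipelined datapath. Kernel and argument names are illustrative. */
__kernel void brighten(__global const uchar *src,
                       __global uchar *dst,
                       const int num_pixels,
                       const int offset)
{
    int i = get_global_id(0);             /* global work-item index */
    if (i < num_pixels) {
        int v = (int)src[i] + offset;     /* add brightness offset  */
        dst[i] = (uchar)clamp(v, 0, 255); /* saturate to 8 bits     */
    }
}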
Previous work has considered the following standard applications to compare the performance of different processing platforms such as FPGAs, GPUs, and CPUs.
Cross-Correlation: Wang and Wang [6] discuss the application of cross-correlation in image processing. To fulfil the real-time requirements of target recognition based on image matching, two efficient FPGA-based parallel architectures were proposed to accelerate normalized cross-correlation computation. In these two architectures, several approaches were proposed to reduce logic resource usage and computation time. The two architectures can be applied in different situations according to the resources (logic, DSP blocks or memory) available on the FPGA chip used. Function and timing simulation with Quartus II 8.0 and practical experiments in target recognition showed that these architectures work well and effectively improve the performance of the target recognition system.
2-D Convolution: Russo et al. [4] show that 2-D convolution is one of the key operations used in image processing. With the constant need to increase performance in high-end applications on parallel architectures such as GPUs and FPGAs, there is a need to compare these architectures in terms of performance under various scenarios. Mathematically, convolution is the sum of products of weighted coefficients with the input function:
h(x) * f(x) = \sum_{k=-n}^{n} h(k)\, f(x - k)
The convolution operation can be extended to 2D as shown below,
h(x, y) * f(x, y) = \sum_{s=-n}^{n} \sum_{t=-m}^{m} h(s, t)\, f(x - s, y - t)
A limitation of convolution is related to the boundaries of the input image, where the mask does not fully overlap with the image. The 2-D convolution was implemented in each of the following languages: CUDA for GPUs and Verilog for FPGAs [4]. In addition, the same algorithm was implemented in MATLAB, using predefined operations, and in C on a regular x86 quad-core processor. Performance measures such as execution time and clock ratio were taken. Overall, it was possible to achieve a CUDA speedup of 200 times in comparison to C, 70 times in comparison to MATLAB and 20 times in comparison to the FPGA [4]. The 2-D convolution study also highlights one issue of using GPUs for video processing: the lack of an internal storage mechanism to keep previously accessed data for further processing. The FPGA is more efficient in this respect, as previously accessed data can be held locally on-chip. In the case of the CPU, the fixed instruction set makes it less flexible.
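For reference, a naive sequential C implementation of the 2-D convolution above is sketched below; this is the style of baseline CPU code against which the GPU and FPGA versions in [4] are compared, but the function and parameter names are ours and border pixels are simply left at zero rather than padded.

/* Naive 2-D convolution of an H x W image with a
 * (2n+1) x (2m+1) mask. Border pixels where the mask does not
 * fully overlap the image are left at zero, matching the boundary
 * limitation discussed in the text. */
void conv2d(const float *f, float *out, int H, int W,
            const float *h, int n, int m)
{
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            if (y >= n && y < H - n && x >= m && x < W - m) {
                for (int s = -n; s <= n; ++s)
                    for (int t = -m; t <= m; ++t)
                        acc += h[(s + n) * (2 * m + 1) + (t + m)]
                             * f[(y - s) * W + (x - t)];
            }
            out[y * W + x] = acc;
        }
    }
}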
Che et al. [7] present a performance study of three applications on an FPGA, a GPU and a multicore CPU system. The first application, Gaussian elimination, uses advanced memory access patterns to access the rows and columns of a matrix and computes the result for all variables in the linear system in a row-by-row fashion; the values produced in each iteration should therefore be computed in parallel. The second application, DES, is a cryptographic algorithm that makes heavy use of bit-wise operations. It is a symmetric-key block cipher published by the National Institute of Standards and Technology (NIST) and encrypts and decrypts data in blocks of 64 bits; this application requires bit-level parallelism. The third application, the Needleman-Wunsch (NW) algorithm, is used in bioinformatics for sequence alignment and was one of the first applications of dynamic programming to biological sequence comparison. A matrix is built from a pair of sequences and filled with scores, where each score is the value of the maximum weighted path ending at that cell; a trace-back step then produces the alignment. The three applications were implemented on a GPU and an FPGA and compared with a multicore CPU, using commercial products from the GPU and FPGA markets: an NVIDIA GeForce 8800 GTX with CUDA 1.1, a Xilinx Virtex-II Pro FPGA (130 nm process) clocked at 100 MHz, and an Intel Xeon processor. The GPU code was developed using NVIDIA's CUDA API and the FPGA code in VHDL using Xilinx tools (v9.1). The results were compared in terms of cycle counts and related metrics.
In the case of Gaussian elimination, for any input size, the FPGA and GPU showed better performance than the CPU thanks to parallel computation; the main limitation of the FPGA was the complexity of programming in VHDL [7]. In the case of DES encryption, the GPU lacks support for some important bit-wise operations, which leads to underutilization of the processor. In the third, memory-intensive application, the NW algorithm, the GPU takes far more cycles than the FPGA: for a 64×64 input size, the GPU takes 3.0×10^5 cycles while the FPGA takes only 2.0×10^4 cycles [7]. As the input size increases, however, the difference between GPU and FPGA shrinks.
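To clarify the data dependencies that make the NW algorithm memory-intensive, the sketch below fills the scoring matrix in plain C under illustrative match/mismatch/gap scores (these are our assumptions, not the parameters used in [7]); each cell depends on its left, upper and upper-left neighbours, so only cells on the same anti-diagonal can be computed in parallel.

#include <stdlib.h>

/* Needleman-Wunsch scoring-matrix fill for sequences a (length la)
 * and b (length lb). Scoring values are illustrative. */
#define MATCH     1
#define MISMATCH -1
#define GAP      -1

static int max3(int x, int y, int z)
{
    int m = x > y ? x : y;
    return m > z ? m : z;
}

int *nw_fill(const char *a, int la, const char *b, int lb)
{
    int w = lb + 1;
    int *S = malloc((size_t)(la + 1) * w * sizeof *S);

    for (int j = 0; j <= lb; ++j) S[j] = j * GAP;            /* first row */
    for (int i = 1; i <= la; ++i) {
        S[i * w] = i * GAP;                                  /* first col */
        for (int j = 1; j <= lb; ++j) {
            int sub = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            S[i * w + j] = max3(S[(i - 1) * w + (j - 1)] + sub, /* diag */
                                S[(i - 1) * w + j] + GAP,       /* up   */
                                S[i * w + (j - 1)] + GAP);      /* left */
        }
    }
    return S;  /* caller frees; a traceback over S yields the alignment */
}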
Stereo Matching: Li Zhou et al. [2] discuss stereo vision, which means the extraction of 3D information from digital input images. 3D stereo finds many applications in entertainment, information transfer, stereo and feature tracking, industrial informatics, three-dimensional video processing, intelligent robots, medical image processing and automated systems. It is also used in robotics to extract information about the position of 3D objects.
The common ground of stereo vision systems is to model three-dimensional (3D) space and to render 3D objects using depth information, which is the most important element of stereo vision systems [17]. Stereo matching can be carried out using various algorithms, such as the Belief Propagation (BP) algorithm, the Graph Cut (GC) algorithm and the Dynamic Programming (DP) algorithm [2]. Stereo matching is one of the most important fields of research concerning depth information processing. Stereo matching quality is limited by processing capability, huge computation and algorithm complexity, high processing bandwidth requirements and algorithm accuracy. Software accelerators rely on parallel optimization on CPUs, DSPs and GPUs, while hardware accelerators are based on FPGAs or ASICs. The CPU has the highest flexibility for stereo matching algorithms, but offers only limited acceleration for real-time calculation of dense disparity maps because it lacks specific acceleration units. The DSP has better signal processing capability thanks to better data processing architectures, lower cost and lower power consumption than the CPU; although it can reach reasonable stereo matching performance, it has inherent disadvantages such as data word alignment and bandwidth/throughput issues, so high quality algorithms are seldom realized on a DSP system and are usually limited to window-based algorithms [8]. GPU stereo matching implementations have therefore drawn attention and obtained desirable speedups, thanks to the GPU's hundreds of processing elements and high-level software development platforms. GPU, FPGA and ASIC implementations are suitable for real-time embedded stereo matching applications because of their low power consumption, low cost and high performance; the limiting features of FPGAs and ASICs are long development time and less processing flexibility compared with the GPU. Over the years, several stereo matching algorithms have been proposed, which are grouped into global and local methods [8].
Graph Cut (GC) and Belief Propagation (BP) are two well-known global methods. Although global methods can reach a high quality level at VGA@30 frames per second (fps) [18] [19], they remain hard to use in real-time, high-resolution applications because of their computational complexity. The local approach is based on colour or intensity patterns within a finite window to determine the disparity [8]. With powerful multimedia accelerators, high system clock frequencies, optimized cache usage and interconnections, the multicore processor is one effective direction for increasing stereo matching performance [20]; however, increasing the number of processing elements leads to high power consumption, and there is no linear relationship between the number of processor cores and the processing performance. As a result, the GPU architecture appeared. The authors conclude two points. First, there is optimization potential both in the stereo matching algorithms themselves and in their software or hardware implementations, in terms of speed, parallelism, data bandwidth, memory storage, etc. Second, GPU, FPGA and ASIC designs are the future research trends for real-time embedded stereo vision systems because of their highly parallel processing capabilities and powerful, specialized calculation components: the GPU has more programming flexibility and powerful computation capability, while the FPGA and ASIC offer high performance, lower power consumption and lower cost.
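To make the local, window-based approach concrete, a plain C sketch of sum-of-absolute-differences (SAD) block matching is given below; the window radius and disparity range are illustrative parameters and the function is ours, not an implementation from [2] or [8].

#include <stdlib.h>

/* Local stereo matching by SAD block matching.
 * For each pixel of the left image, search the right image over
 * disparities 0..max_disp-1 and pick the disparity whose
 * (2r+1) x (2r+1) window gives the lowest sum of absolute
 * differences. Window radius r and max_disp are illustrative. */
void sad_stereo(const unsigned char *left, const unsigned char *right,
                unsigned char *disp, int H, int W, int r, int max_disp)
{
    for (int y = r; y < H - r; ++y) {
        for (int x = r; x < W - r; ++x) {
            int best_d = 0;
            long best_cost = -1;
            for (int d = 0; d < max_disp && x - d >= r; ++d) {
                long cost = 0;
                for (int dy = -r; dy <= r; ++dy)
                    for (int dx = -r; dx <= r; ++dx)
                        cost += labs((long)left[(y + dy) * W + (x + dx)]
                                   - (long)right[(y + dy) * W + (x - d + dx)]);
                if (best_cost < 0 || cost < best_cost) {
                    best_cost = cost;
                    best_d = d;
                }
            }
            disp[y * W + x] = (unsigned char)best_d;
        }
    }
}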
Normalized Cross-Correlation: Applications of normalized cross-correlation techniques include medical imaging, face recognition, motion tracking, target recognition, satellite image monitoring, etc. The required computation time can be estimated, but a huge number of operations is needed for the implementation, so parallelism is essential. In template matching, the template image being searched for is moved across the input image.
The output gives a strong response in areas where the input image has a high correlation with the template image. This is known as correlation, and it is given by [3]:
\sum_{u,v} I(u, v)\, T(x + u, y + v)
where
I is the image matrix,
T is the template image matrix.
Simply correlating the two images, however, is not sufficient. Sometimes the output response can be strong even though the object being searched for is not present, or vice versa. To solve this problem, normalization is applied after correlation [3]:
R(x, y) = \frac{\sum_{u,v} \left( I(u, v) - \bar{I}_{x,y} \right) \left( T(x + u, y + v) - \bar{T} \right)}{\sqrt{\sum_{u,v} \left( I(u, v) - \bar{I}_{x,y} \right)^{2} \cdot \sum_{u,v} \left( T(x + u, y + v) - \bar{T} \right)^{2}}}
where \bar{I}_{x,y} is the mean of the image values in the window at offset (x, y) and \bar{T} is the mean of the template.
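A direct, sequential C realization of R(x, y) for a single template position is sketched below as a reference only; the function and variable names are ours, and the loops simply evaluate the sums of the formula above (holding the template fixed and taking the image window at offset (x, y), which yields the same score). The FPGA architectures discussed next restructure exactly these multiply-accumulate sums for parallel evaluation.

#include <math.h>

/* Normalized cross-correlation score R(x, y) for one position of a
 * tw x th template T over image I (image width iw). Returns a value
 * in [-1, 1]; the denominator terms are the energies of the window
 * and of the template. Names are illustrative. */
float ncc_at(const float *I, int iw, const float *T, int tw, int th,
             int x, int y)
{
    const int n = tw * th;
    float mean_i = 0.0f, mean_t = 0.0f;

    for (int v = 0; v < th; ++v)
        for (int u = 0; u < tw; ++u) {
            mean_i += I[(y + v) * iw + (x + u)];
            mean_t += T[v * tw + u];
        }
    mean_i /= n;
    mean_t /= n;

    float num = 0.0f, den_i = 0.0f, den_t = 0.0f;
    for (int v = 0; v < th; ++v)
        for (int u = 0; u < tw; ++u) {
            float di = I[(y + v) * iw + (x + u)] - mean_i;
            float dt = T[v * tw + u] - mean_t;
            num   += di * dt;   /* numerator MACs  */
            den_i += di * di;   /* window energy   */
            den_t += dt * dt;   /* template energy */
        }
    float den = sqrtf(den_i * den_t);
    return den > 0.0f ? num / den : 0.0f;
}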
The work in [3] describes a straightforward way of computing normalized cross-correlation on an FPGA. Normalized cross-correlation requires a large number of multiply-accumulate (MAC) operations with little control logic. Although normalized cross-correlation is relatively simple and highly accurate, the computational effort required is not affordable in many applications, especially those with real-time requirements [25]. It is therefore very well suited to FPGA-based implementation [23] [24].
Performance and area results are given for each architecture. A major point of the architecture is the parallel computation of several multiplications and additions when computing the correlation between the scene image and the template. Another proposed element is a memory architecture with row buffering and multiplexers, used to achieve parallelism of the arithmetic operations. For a scene image of size 512 × 512 and a template image of size 80 × 80, a clock frequency of 70 MHz and a latency of 224 milliseconds are achieved. The implementations were done on a GPU and on FPGAs, with the Xilinx 7 series FPGAs considered. Functional simulation is straightforward using VHDL simulation software, while the CARMA development kit from SECO was used for GPU software development and testing. It was concluded that, for larger images, a more powerful FPGA must be used, and that the high performance and relative ease of software development of GPUs has to be weighed against their higher power consumption in real-world projects [3].
III. ANALYSIS AND DISCUSSION
The platforms discussed above can be evaluated with respect to development effort, cost, power consumption and performance for image processing applications. The importance of these parameters varies with the requirements of a particular application: some applications need to be cost-efficient, others power-efficient.
Some of the constraints are discussed below:
i). Power consumption
FPGA-based implementations offer significantly lower power consumption than GPU solutions. The GPU development kit also needs a cooling fan, unlike the FPGA kit, which uses only passive cooling; this increases the GPU board's physical area.
ii). Development time
In an FPGA flow, as code and IP cores are added, simulation and synthesis time increases significantly, thereby increasing the development time [3]. On GPUs, development is very fast when using existing libraries such as OpenCV along with high-level languages such as CUDA and C++.
iii). Parallelism
One of the advantages of platforms like FPGAs and GPUs is the architectural parallelism of these devices.
On the other hand, they typically have lower clock frequencies than conventional CPUs. This means that, in order to achieve a speed-up, the parallelism of the device must be exploited by identifying and taking advantage of parallelism in the problem to be solved.
iv). Code Translation Effort:
Porting code to the GPU via OpenCL was straightforward, while implementation on the FPGA was much more of a hassle. The only real issue for the OpenCL implementation was finding OpenCL versions of proprietary CPU-optimized software libraries. For the FPGA implementation, it is not possible to directly port the available algorithm; after spending considerable time searching for approximated filter kernels, this approach proved unfeasible. Moreover, in addition to the tedious VHDL coding of the FPGA implementation itself, a lot of effort was spent getting the CPU-FPGA interface to work [11].
v). Integration
Interfacing is a difficult job in the case of the FPGA, and the overall processing speed of the FPGA platform was severely constrained by that interface. For GPU hardware, no integration issues were observed, thanks to seamless hardware integration and the fact that OpenCL was developed for hybrid CPU/GPU platforms [11].
vi). Physical space
FPGA-based implementations are efficient in terms of board space, while the computational density of GPU solutions is far worse: a GPU board has a large PCB and sometimes incorporates large cooling fans, and may not even fit in a standard desktop case.
In terms of physical space, therefore, FPGAs are clearly better than GPUs.
Criterion               | FPGA                                    | Multicore Processor            | GPU
Parallelism             | Parallel                                | Sequential                     | Limited parallelism
Real-time processing    | Extremely fast                          | Varies with data dependencies  | Fast
Floating-point support  | Not good at floating-point operations   | Versatile                      | Excellent at floating-point operations
Programming             | HDL programming (complicated)           | Easy to program (C++)          | OpenCV libraries (easy to program)
Interfacing             | Software interfacing needed             | No special interfacing needed  | No special interfacing needed
Flexibility             | Comparatively less flexible, but better performance | -                  | Programming flexibility
Table 1: Comparison of FPGA, multicore CPU and GPU
IV. CONCLUSION
In this paper, we have considered two platforms, FPGAs and GPUs, and compared them with conventional multicore CPUs. Many factors are discussed, including hardware features, application performance, programmability and overhead, and, more importantly, how to trade off among these factors. The platforms discussed above are advantageous in different respects. The hardware flexibility offered by FPGAs is better than that of GPU-based platforms, but when development time matters we may opt for GPUs because of libraries such as OpenCV and the available high-level languages. FPGAs, however, are the better solution when physical space and power consumption are the main concerns.
REFERENCES
[1] K. Hill, S. Craciun, A. George and H. Lam, "Comparative analysis of OpenCL vs. HDL with image processing kernels on Stratix-V FPGA", 2015.
[2] L. Zhou, T. Sun, Y. Zhan and J. Wang, "Software and hardware implementation of stereo matching", International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 7, no. 4, 2014, pp. 37-56.
[3] E. Fykse, "Performance comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems", 2013.
[4] L. M. Russo, E. C. Pedrino, E. Kato and V. O. Roda, "Image convolution processing: a GPU versus FPGA comparison", in Proc. VIII Southern Conference on Programmable Logic (SPL), pp. 1-6, IEEE, 2012.
[5] J. Fowers, G. Brown, P. Cooke and G. Stitt, "A performance and energy comparison of FPGAs, GPUs and multicores for sliding-window applications", FPGA'12, February 2012, California, USA, ACM 978-1-4503-1155-7/12-02.
[6] X. Wang and X. Wang, "FPGA based parallel architectures for normalized cross-correlation", in Proc. First International Conference on Information Science and Engineering, IEEE, 2009.
[7] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, "Accelerating compute-intensive applications with GPUs and FPGAs", in Proc. Symposium on Application Specific Processors (SASP), pp. 101-107, IEEE, 2008.
[8] M. Z. Brown, D. Burschka and G. D. Hager, "Advances in computational stereo", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, 2003, pp. 993-1008.
[9] B. De Ruijsscher et al., "FPGA accelerator for real-time skin segmentation", in Proc. 2006 IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia, IEEE Computer Society, 2006.
[10] M. A. Vega-Rodriguez, J. M. Sanchez-Perez and J. A. Gomez-Pulido, "Real time image processing with reconfigurable hardware", in Proc. 8th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2001), vol. 1, IEEE, 2001.
[11] ARM Ltd., "ARM920T technical reference manual".
[12] L. Struyf, S. De Beugher, D. H. Van Uytsel, F. Kanters and T. Goedeme, "The battle of the giants: a case study of GPU vs FPGA optimization for real-time image processing".
[13] EP9302 reference manual.
[14] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance comparison of FPGA, GPU and CPU in image processing", in Proc. International Conference on Field Programmable Logic and Applications (FPL), 2009, pp. 126-131.
[15] Z. K. Baker, M. B. Gokhale and J. L. Tripp, "Matched filter computation on FPGA, Cell and GPU", in Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2007, pp. 207-218.
[16] J. Chase, B. Nelson, Z. Wei and D.-J. Lee, "Real-time optical flow calculations on FPGA and GPU architectures: a comparison study", in Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2008, pp. 173-182.
[17] S. H. Lee and S. Sharma, "Real time disparity estimation algorithm for stereo camera systems", IEEE Transactions on Consumer Electronics, vol. 57, no. 3, 2011, pp. 1018-1027.
[18] C. K. Liang, C. C. Cheng, Y. C. Lai, L. G. Chen and H. H. Chen, "Hardware efficient belief propagation", in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 80-87, Miami, FL, USA.
[19] Y. C. Tseng and T. S. Chang, "Architecture design of belief propagation for real-time disparity estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 11, 2010, pp. 1555-1565.
[20] S. Gotchev, G. B. Akar, T. Capin, D. Strohmeier and A. Boev, "Three-dimensional media for mobile devices", Proceedings of the IEEE, vol. 99, no. 4, 2011, pp. 708-742.
[21] J. G. Proakis and D. K. Manolakis, Digital Signal Processing, Prentice Hall, 4th edition, April 2006, ISBN 0131873741.
[22] B. A. Draper, "Accelerated image processing on FPGAs", IEEE Transactions on Image Processing, vol. 12, no. 12, December 2003, pp. 1543-1551.
[23] A. Lindoso and L. Entrena, "High performance FPGA-based image correlation", Journal of Real-Time Image Processing, vol. 2, 2007, pp. 223-233.
[24] Virtex-4 Family Overview, Xilinx Inc., http://www.xilinx.com, 2004.
[25] M. Cavadini, M. Wosnitza and G. Troster, "Multiprocessor system for high resolution image correlation in real time", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 3, 2001, pp. 439-449.
[26] P. Aschwanden, "Experimental comparison of correlation methods in image processing", Ph.D. dissertation, ETH Zürich, Switzerland, 1993.
[27] X. Sun, X. Mei, S. H. Jiao, M. Zhou and H. Wang, "Stereo matching with reliable disparity propagation", in Proc. 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), May 16-19, 2011, pp. 132-139, Hangzhou, China.