Communication-aware Mapping of Stream Graphs for Multi-GPU Platforms. Submitted to the UNIST Graduate School in partial fulfillment of the requirements for the degree of Master of Science. Thesis committee: Woongki Baek (member 1), Wonki Jeong (member 2).
Stream graphs provide a natural way to represent many applications in the multimedia and DSP domains. Although the exposed parallelism of stream graphs is relatively easy to map to GPGPUs, very large stream graphs, as well as the question of how to best utilize multi-GPU platforms to achieve scalable performance, present a major challenge to stream graph mapping.
In this paper, we present a highly scalable GPGPU mapping technique for large stream graphs. Our experimental results on a real GPU platform demonstrate that our technique can produce significantly better performance than the current state of the art in both single-GPU and multi-GPU cases. One way to mitigate such problems is to use a higher level of abstraction, such as stream graphs.
A stream graph consists of filters (also called actors) and channels that represent the flow of data between them, and can provide a more intuitive and natural representation for stream-processing applications such as multimedia and DSP. In this paper, we present a new partitioning-based GPU mapping methodology for stream graphs, with the following contributions. Our experimental results on a real hardware platform show that our technique can generate scalable performance for up to 4 GPUs, achieves a 3.2x speedup on average for large stream graphs compared to 1-GPU mapping, and can generate mappings that are far better, as measured in speedup over single-partition mapping, than those of the previous work [7].
After reviewing the background and related work in Chapter II, we describe our multi-GPU mapping in Chapter III.
BACKGROUND AND RELATED WORK
Background
The next level of memory is the global memory, which is visible to all threads in the grid. However, the large number of iterations per filter also means that each filter produces a large amount of data to be consumed by the dependent filters, making it necessary to use the global GPU memory for communication between the filters. To avoid this global memory bottleneck, the second approach creates a single GPU kernel from the entire stream graph, as illustrated in Figure 2.1(c).
Communication between filters can then be done via shared memory (SM), which is very efficient. Global memory access is still required for the primary input and output of the graph, which can be performed by a dedicated group of threads called data transfer threads. To hide the global memory access latency, it is essential to run the data transfer threads concurrently with the compute threads via double buffering.
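For illustration only, a kernel organized along these lines might look as follows. This is a sketch under assumed parameters, not the generated code of this thesis: the thread counts, tile size, and the trivial filter body are placeholders, and the epilogue is omitted.

// Minimal sketch: one block runs a partition. The first NUM_COMPUTE threads
// execute the filters out of shared memory while the remaining NUM_TRANSFER
// threads stream the primary input/output through global memory into the
// other half of a double buffer.
#define NUM_COMPUTE  192     // compute threads (illustrative values)
#define NUM_TRANSFER  64     // data transfer threads
#define TILE         256     // items consumed/produced per execution

__device__ void run_filters(const float* in, float* out, int tid)
{
    // stand-in for the partition's filter code: each compute thread
    // processes a strided slice of the current tile
    for (int i = tid; i < TILE; i += NUM_COMPUTE)
        out[i] = 2.0f * in[i];
}

// launched as partition_kernel<<<1, NUM_COMPUTE + NUM_TRANSFER>>>(...)
__global__ void partition_kernel(const float* g_in, float* g_out, int num_execs)
{
    __shared__ float buf_in [2][TILE];
    __shared__ float buf_out[2][TILE];
    int tid = threadIdx.x;

    // prologue: data transfer threads load tile 0
    if (tid >= NUM_COMPUTE)
        for (int i = tid - NUM_COMPUTE; i < TILE; i += NUM_TRANSFER)
            buf_in[0][i] = g_in[i];
    __syncthreads();

    for (int e = 0; e < num_execs; ++e) {
        int cur = e & 1;                       // buffer owned by compute threads
        if (tid < NUM_COMPUTE) {
            run_filters(buf_in[cur], buf_out[cur], tid);
        } else {
            for (int i = tid - NUM_COMPUTE; i < TILE; i += NUM_TRANSFER) {
                if (e + 1 < num_execs)         // prefetch the next tile
                    buf_in[1 - cur][i] = g_in[(e + 1) * TILE + i];
                if (e > 0)                     // write back the previous tile
                    g_out[(e - 1) * TILE + i] = buf_out[1 - cur][i];
            }
        }
        __syncthreads();                       // swap buffer roles
    }
    // epilogue (write-back of the last tile) omitted for brevity
}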
The existence of two types of threads, the cap on the total number of threads per SM, and the limited shared memory size create an interesting optimization problem: all of these parameters, namely the number of compute threads per execution, the number of executions per SM, and the number of data transfer threads, must be determined simultaneously for the best result, which further adds to the challenge.
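A minimal sketch of how such a joint choice could be made is shown below, assuming a brute-force search over the parameter space against a purely illustrative cost model; the limits, constants, and estimate_time function are our own placeholders, not the thesis's estimation engine.

// Illustrative host-side sketch: jointly pick (compute threads per execution,
// executions per kernel, data transfer threads) under assumed thread and
// shared memory limits, minimizing a stand-in per-kernel time estimate.
#include <cstdio>
#include <cfloat>

struct KernelConfig { int cthreads, execs, dthreads; };

const int MAX_THREADS    = 1024;       // max threads per block (illustrative)
const int MAX_SMEM       = 48 * 1024;  // shared memory per block in bytes
const int BYTES_PER_EXEC = 6 * 1024;   // assumed buffer need of one execution

// Stand-in cost model: compute and data transfer overlap via double buffering.
double estimate_time(const KernelConfig& c)
{
    double t_comp = 1e6 / (c.cthreads * c.execs);   // more parallelism, faster compute
    double t_dt   = 4e5 / c.dthreads;               // more transfer threads, faster I/O
    return t_comp > t_dt ? t_comp : t_dt;
}

KernelConfig pick_config()
{
    KernelConfig best{0, 0, 0};
    double best_t = DBL_MAX;
    for (int execs = 1; execs * BYTES_PER_EXEC <= MAX_SMEM; ++execs)
        for (int ct = 32; ct <= MAX_THREADS; ct += 32)             // warp-sized steps
            for (int dt = 32; ct * execs + dt <= MAX_THREADS; dt += 32) {
                KernelConfig c{ct, execs, dt};
                double t = estimate_time(c);
                if (t < best_t) { best_t = t; best = c; }
            }
    return best;
}

int main()
{
    KernelConfig c = pick_config();
    std::printf("compute=%d execs=%d transfer=%d\n", c.cthreads, c.execs, c.dthreads);
    return 0;
}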
Related Work
To complicate matters further, using multiple compute threads per execution of a stream graph can result in higher performance if filters have activation rates greater than one, which is often the case (the activation rate is the number of times a filter must be activated to match the input and output rates). Our technique is based on the second approach, but significantly extends it to address the new challenges of multi-GPU mapping. Stream graphs have also been mapped to other architectures, including the Raw processor [2], the STI Cell processor [1,8], and FPGAs [3].
These techniques often focus on exploiting task-level parallelism and balancing workloads across multiple processors. To make workload balancing between multiple processors easier, stateless filter splitting (fission) is used [3,8]. Although some of them also target multi-core architectures, none of them explicitly models and optimizes to minimize the impact of inter-processor communication.
MAPPING STREAM GRAPHS TO MULTI-GPU PLATFORMS
Our Mapping Flow
Finding the best stream graph partitioning for multi-GPU mapping is a very difficult problem. Since the most important factor affecting GPU performance is the use of shared memory, the previous work [7] uses a partitioning heuristic that merges filters until the shared memory requirement is violated. We go a step further and explicitly optimize to reduce the total runtime of the partitions as predicted by our performance estimation engine.
First, we generate partitions that use less shared memory by exploiting the structure of stream graphs. Third, we minimize the number of partitions that are IO-bound rather than compute-bound. Regarding the first strategy, our heuristic takes advantage of the fact that stream graphs are constructed using pipeline and split/join operators.
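For illustration, a hierarchical stream-graph representation along these lines might be sketched as follows; the type names and the shared memory accounting are our own simplifications, not the thesis's implementation.

// Rough sketch of a hierarchical stream graph built from filters composed by
// pipeline and split/join operators (names and cost accounting are ours).
#include <vector>
#include <memory>

struct Node {
    virtual ~Node() = default;
    virtual int smem_bytes() const = 0;          // shared memory requirement
};

struct Filter : Node {
    int work;                                    // computation per activation
    int pop, push;                               // items consumed/produced
    int buf_bytes;                               // buffer space it needs
    int smem_bytes() const override { return buf_bytes; }
};

struct Pipeline : Node {                         // children execute in sequence
    std::vector<std::unique_ptr<Node>> stages;
    int smem_bytes() const override {
        // stages can reuse much of each other's buffer space; each internal
        // channel needs only one buffer shared by its two endpoint filters
        // (purely illustrative model, not the thesis's exact accounting)
        int m = 0;
        for (auto& s : stages) m = s->smem_bytes() > m ? s->smem_bytes() : m;
        return m + 1024 * (stages.empty() ? 0 : (int)stages.size() - 1);
    }
};

struct SplitJoin : Node {                        // children run on disjoint data
    std::vector<std::unique_ptr<Node>> branches;
    int smem_bytes() const override {            // branches coexist, so their
        int total = 0;                           // requirements add up
        for (auto& b : branches) total += b->smem_bytes();
        return total;
    }
};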
If a pipeline operator is used, the shared memory requirement of the pipeline is not much higher than that of the filters that make up the pipeline, as shown in Figure 3.2(a). As for the third strategy, we consider a partition p to be compute-bound (or IO-bound) if its computation time estimate, Tcomp(p), is greater (or less) than its data transfer time estimate, Tdt(p). The first phase merges the filters inside the pipelines, the second phase merges the partitions outside the pipelines, and the third phase merges both types of partitions.
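In code form, the classification (and the resulting runtime estimate under double buffering) could be sketched as follows, with Tcomp(p) and Tdt(p) assumed to be precomputed by the estimation engine.

// Sketch of the compute-/IO-bound classification in terms of the two estimates.
struct Partition {
    double t_comp;   // Tcomp(p): estimated computation time
    double t_dt;     // Tdt(p):  estimated data transfer time
};

inline bool compute_bound(const Partition& p) { return p.t_comp > p.t_dt; }
inline bool io_bound     (const Partition& p) { return p.t_comp < p.t_dt; }

// With double buffering the two phases overlap, so a partition's estimated
// runtime is dominated by whichever side is the bottleneck (an assumption
// consistent with the classification above, not the thesis's exact model).
inline double estimated_time(const Partition& p)
{
    return p.t_comp > p.t_dt ? p.t_comp : p.t_dt;
}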
The last phase tries to merge two neighboring partitions at the same time, as well as all filters into a single partition. The last condition implies that the merged partition must not violate the shared memory size constraint. The first phase works by creating a partition that includes only the first pipeline element, and iteratively expanding it to include the neighboring node until it no longer meets the merge criteria.
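A simplified sketch of this expansion step is given below; the Stage fields, the shared memory limit, and the stand-in cost model are our assumptions rather than the actual merge criteria.

// Sketch of the first phase: grow a partition along a pipeline, adding the
// next stage only while the merge criteria (shared memory limit, no predicted
// slowdown) remain satisfied.
#include <vector>

struct Stage { int smem_bytes; double t_comp, t_dt; };

const int SMEM_LIMIT = 48 * 1024;    // shared memory per block (illustrative)

double est_time(const std::vector<Stage>& p)
{
    // stand-in estimate: compute and data transfer overlap via double buffering
    double tc = 0, td = 0;
    for (const Stage& s : p) { tc += s.t_comp; td += s.t_dt; }
    return tc > td ? tc : td;
}

std::vector<std::vector<Stage>> partition_pipeline(const std::vector<Stage>& pipe)
{
    std::vector<std::vector<Stage>> parts;
    std::vector<Stage> cur;
    int smem = 0;
    for (const Stage& s : pipe) {
        std::vector<Stage> trial = cur;
        trial.push_back(s);
        bool fits   = smem + s.smem_bytes <= SMEM_LIMIT;
        bool faster = est_time(trial) <= est_time(cur) + est_time({s});
        if (cur.empty() || (fits && faster)) {        // keep growing the partition
            cur = trial;
            smem += s.smem_bytes;
        } else {                                      // close it, start a new one
            parts.push_back(cur);
            cur = {s};
            smem = s.smem_bytes;
        }
    }
    if (!cur.empty()) parts.push_back(cur);
    return parts;
}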
The second phase (lines 13-20) generates partitions outside the innermost pipelines, i.e., between nodes that do not yet belong to any partition. To steer the partitions toward being compute-bound, we prioritize merges (i) between two IO-bound partitions, then (ii) between an IO-bound and a compute-bound partition, and finally (iii) between two compute-bound partitions, in that order. The last phase tries to merge (1) two neighboring partitions at the same time, which can reduce the execution time even when merging only one of them does not, and (2) all nodes at once, which helps ensure that our multi-partition solution is no worse than the single-partition solution.
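The priority ordering of candidate merges could be realized roughly as follows; this is an illustrative sketch, not the thesis's implementation.

// Candidate merges between neighboring partitions are tried IO/IO first,
// then IO/compute, then compute/compute, steering partitions toward being
// compute-bound.
#include <algorithm>
#include <vector>

struct Part { double t_comp, t_dt; };
struct Candidate { int a, b; };                   // indices of two neighboring partitions

inline bool io_bound(const Part& p) { return p.t_dt > p.t_comp; }

// lower rank = tried earlier
int rank(const Candidate& c, const std::vector<Part>& parts)
{
    int io = io_bound(parts[c.a]) + io_bound(parts[c.b]);
    return 2 - io;   // both IO-bound: 0, mixed: 1, both compute-bound: 2
}

void order_candidates(std::vector<Candidate>& cands, const std::vector<Part>& parts)
{
    std::stable_sort(cands.begin(), cands.end(),
        [&](const Candidate& x, const Candidate& y) {
            return rank(x, parts) < rank(y, parts);
        });
}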
Communication-aware Mapping
This is because merging two partitions often reduces the total data transfer time (thus making IO-bound partitions compute-bound), thanks to the buffer shared between them. Also, to balance the partition sizes, we favor merging the partitions with smaller workloads.
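Purely as an illustration of the idea (not the thesis's mapping algorithm), a communication-aware assignment could walk the partition chain in order and cut it into roughly equal-workload segments, so that partitions sharing a buffer tend to stay on the same GPU.

// Illustrative only: assign a chain of partitions to num_gpus devices by
// accumulating roughly equal workload per GPU; consecutive partitions, which
// communicate through shared buffers, tend to land on the same device.
#include <vector>
#include <cstddef>

std::vector<int> assign_to_gpus(const std::vector<double>& work, int num_gpus)
{
    double total = 0;
    for (double w : work) total += w;
    double per_gpu = total / num_gpus;

    std::vector<int> gpu_of(work.size());
    int    gpu = 0;
    double acc = 0;
    for (std::size_t i = 0; i < work.size(); ++i) {
        if (acc >= per_gpu && gpu < num_gpus - 1) {   // current GPU has its share,
            ++gpu; acc = 0;                           // cut the chain here
        }
        gpu_of[i] = gpu;
        acc += work[i];
    }
    return gpu_of;   // inter-GPU transfers occur only at the cut points
}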
GPU Performance Estimation Engine
Creating a GPU kernel requires the determination of several key parameters, including the number of compute and data transfer threads and the number of executions per kernel. Even the same kernel code can show very different performance if, for example, a different number of threads is allocated.
EVALUATION
We use all eight applications evaluated in the previous work [7], where one can also find the descriptions of the applications and the size parameter N.
After the partitions are assigned to GPUs and CUDA kernels are generated (one partition corresponds to one kernel), we compile and run them on the 4-GPU machine and collect the actual runtime information of the CUDA kernels using the Nvidia profiler. (Only kernel execution time is included, excluding kernel launch time.) Thus, in the actual measurements, all kernels of an application run together, while our performance estimation engine considers each partition separately. Apart from the number of GPUs, everything else is the same in the four cases. Partitioning results: First, the number of partitions generated for each application for different N values is shown on the x-axis of Figure 4.2. These numbers are almost always greater than or equal to those of the previous work [7], which is not surprising, as we use a stricter set of merge criteria.
Second, it is interesting that the increase in the number of partitions is not uniform across the applications. Third, we can define the kernel-count ratio as the number of partitions produced by our technique relative to that of the previous work [7]. Even so, our 1-GPU performance is not necessarily inferior to that of the previous work, as shown in Figure 4.3.
For compute-bound applications, our multi-GPU mapping yields solutions that far outperform those of the previous work, regardless of the parameter N. Note that we do not claim these numbers as measured performance differences on the same GPU; rather, we provide them as our best estimate of the expected performance differences between the two algorithms. First, the multi-GPU code generated by the previous work is identical regardless of the target GPU, as long as the shared memory size is the same.
The preliminary results are shown in Table 5.1, in which we report the running time of the improved version versus the original version when Bitonic and FFT are mapped to 1 GPU. At the same time, finding a good partitioning of a stream graph is a challenge, partly due to the uncertainty in the subsequent mapping step. Although our heuristic is limited by its greedy nature, this is alleviated somewhat by exploiting the structure of the stream graphs.
Our experimental results on a real hardware platform show that our technique can generate scalable performance for up to 4 GPUs, achieves a 3.2x speedup on average for large stream graphs compared to 1-GPU mapping, and can generate mappings that are far better than the state of the art for compute-bound applications.
In Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '09, pages 210-220, Washington, DC, USA, 2009.