The period of the schedule (denoted by P) is the time that elapses between two consecutive data sets entering the system. Consider first the execution of the application when it is sequentially mapped onto the fastest processor, P3 (see Figure 1(c)); the right-hand side of the figure illustrates the processing of the first data set (and the beginning of the second one).
This adds a level of parallelism to the execution of the application, called data parallelism. When a task is replicated, it is common to impose constraints on how the data sets are distributed among its replicas. The most common objective in pipeline scheduling is to maximize the system throughput, that is, the number of data sets processed per unit of time.
Remember that this is equivalent to minimizing the period, which is the inverse of the throughput. The structure of the precedence constraints (the Structure column) can be a single chain (C), a structured graph such as a tree or series-parallel graph (S), or an arbitrary DAG (D).
FORMULATING THE SCHEDULING PROBLEM
Compacting the Problem
To do so, it may be necessary to associate with each communication a startup time and the fraction of the bandwidth it uses (multiport model). Compiler techniques such as Lamport hyperplane vectors or space-time unimodular transformations [Wolfe 1989; Darte et al.] can also be used to construct such schedules. The cyclic schedule is constructed from the elementary schedule, that is, the detailed schedule of a single data set.
If task t_i is executed on processor j at time s_i in the elementary schedule, then, in the cyclic schedule, the instance of t_i for data set x starts at time s_i + (x − 1)P on the same processor j. The elementary schedule is thus a compact representation of the global cyclic schedule, from which the actual start time of each task instance, for each data set, is easily derived at runtime. Intuitively, it is sufficient to check the data sets released during the last few periods to ensure that no processor runs two tasks simultaneously and that no communication link is used twice at the same time.
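As an illustration, the start time of any task instance can be derived from the elementary schedule with one line of arithmetic. A minimal sketch follows; the elementary schedule, task names, and period below are illustrative assumptions, not values from the paper:

```python
# Deriving the cyclic schedule from the elementary schedule.
# The elementary schedule gives, for each task t_i, its processor and
# start time s_i; all concrete values below are illustrative assumptions.
elementary = {
    "t1": ("P1", 0.0),
    "t2": ("P2", 2.0),
    "t3": ("P3", 5.0),
}
P = 3.0  # period of the schedule (assumed)

def start(task, x):
    """Processor and start time of `task` for data set x (x >= 1):
    the instance runs at s_i + (x - 1) * P on the same processor."""
    proc, s_i = elementary[task]
    return proc, s_i + (x - 1) * P

print(start("t2", 4))  # ('P2', 11.0): instance of t2 for the 4th data set
```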
A natural extension of cyclic schedules are periodic schedules, which repeat their operation every K data sets [Levner et al.]. For instance, in 120 time units, P1 can process 40 data sets, P2 can process 24 data sets, and P3 can process 15 data sets; using replication therefore yields a periodic schedule with K = 79 and throughput T = 79/120, about twice the throughput of the best cyclic schedule. Note, however, that this increase in throughput comes at a price: the use of replication may make the throughput very difficult to compute.
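The numbers above follow from simple arithmetic over one repeating pattern. A minimal sketch, assuming per-data-set processing times of 3, 5, and 8 time units for P1, P2, and P3 (consistent with the counts of 40, 24, and 15 data sets in 120 time units):

```python
from math import lcm

# Assumed per-data-set processing times (in time units), consistent with
# "P1 processes 40, P2 processes 24, and P3 processes 15 data sets in 120".
times = {"P1": 3, "P2": 5, "P3": 8}

hyper = lcm(*times.values())                        # 120: length of the pattern
counts = {p: hyper // t for p, t in times.items()}  # {'P1': 40, 'P2': 24, 'P3': 15}
K = sum(counts.values())                            # 79 data sets per pattern
print(K, K / hyper)                                 # 79, ~0.658 = T = 79/120
```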
The throughput becomes hard to compute because the work rate of the entire system is no longer determined by a single critical resource, but rather by a cycle of dependent operations involving multiple computational units and communication links (see Benoit et al. [2009a] for details). Periodic schedules are represented compactly by a schedule specifying the execution of K data sets, similar to the elementary schedule of a cyclic schedule. Other common compact representations give only the fraction of time each processor spends executing each task [Beaumont et al.].
However, reconstructing the actual schedule from such a representation involves advanced graph-theoretic concepts and can be difficult to use in practice, although it can be done in polynomial time [Beaumont et al.].
Examples
A solution of the example of Figure 5 on two processors, using the intervals {t1, t2, t3} and {t4}, is shown in Figure 6, where the boxes represent the tasks and the data sets are identified by color. Given a mapping and a processor allocation, a schedule achieving optimal latency can be obtained by greedily scheduling the tasks in chain order. However, this may come at the price of a throughput degradation, since idle times may appear in the schedule.
Furthermore, the elementary schedule can be executed with a period equal to the length of the longest task, leading to a cyclic schedule with optimal throughput (for that processor allocation). If the node is a parallel node, then the latency is the maximum of the latencies of its children: L(P(l, r)) = max(L(l), L(r)). As in the case of chains, a core schedule is obtained by scheduling all the tasks consecutively, without regard to the dependencies.
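The parallel rule quoted above extends naturally to a recursion over the decomposition tree of a series-parallel graph. A minimal sketch, assuming in addition a series rule that sums the latencies of the children, L(S(l, r)) = L(l) + L(r) (the node encoding below is an illustrative assumption):

```python
# Latency of a series-parallel graph, as a recursion over its decomposition
# tree. Leaves carry task latencies; "S" nodes compose in series (latencies
# add, assumed), "P" nodes compose in parallel (latency is the max, as in
# the text: L(P(l, r)) = max(L(l), L(r))).

def latency(node):
    kind = node[0]
    if kind == "leaf":
        return node[1]                              # latency of a single task
    _, left, right = node
    if kind == "S":
        return latency(left) + latency(right)       # series rule (assumption)
    if kind == "P":
        return max(latency(left), latency(right))   # parallel rule from the text
    raise ValueError(f"unknown node kind: {kind}")

# Example: t1 in series with (t2 parallel t3), latencies 2, 3, 5:
g = ("S", ("leaf", 2), ("P", ("leaf", 3), ("leaf", 5)))
print(latency(g))  # 2 + max(3, 5) = 7
```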
With such a core schedule, we get the optimal period (for this allocation), equal to the load of the busiest processor. Executing a data set then takes as many periods as the depth of the precedence task graph. Convex clustering schedules are also called processor-ordered schedules, because the graph of inter-processor communications induced by such schedules is acyclic [Guinand et al.].
The tasks of a given processor can be serialized and executed without idle time (provided their execution starts after all their input data have arrived). Two processors that are independent in the inter-processor communication graph can execute their tasks on a given data set in the same period, in any order, without violating the precedence constraints. Such a construction leads to the optimal throughput for a given convex clustered mapping of tasks to processors, and to a latency L ≤ xP ≤ mP, where x is the length of the longest chain in the inter-processor communication graph and m is the number of processors.
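The latency bound amounts to a longest-path computation on the (acyclic) inter-processor communication graph. A minimal sketch, where the adjacency list and the period are illustrative assumptions:

```python
from functools import lru_cache

# Acyclic inter-processor communication graph (assumed example):
# an edge u -> v means processor u sends data to processor v.
succ = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": ["P4"], "P4": []}
P = 3.0  # period of the schedule (assumed)

@lru_cache(maxsize=None)
def chain_len(u):
    """Number of processors on the longest chain starting at u."""
    return 1 + max((chain_len(v) for v in succ[u]), default=0)

x = max(chain_len(u) for u in succ)      # longest chain: here P1 -> P2 -> P4
print(f"latency bound: L <= {x} * P = {x * P}")  # x = 3, bound = 9.0
```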
A schedule with replication can be seen as a pipelined execution of an unrolled version of the graph. Such a schedule has a throughput of T = K/P, and its latency must be computed as the maximum latency over the data sets of the elementary schedule. Each task receives its data from one of the replicas of each of its predecessors.
Complexity
When only the throughput matters (and not the latency), it is enough to ensure that no network link is overloaded. A periodic schedule can then be reconstructed explicitly using the model described earlier, by treating each network link as a processor. The complexity of problems on linear graphs has been widely studied, since the linear case underlies the general DAG case and most of the structured-graph cases [Benoit and Robert; Benoit et al.].
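A minimal sketch of this feasibility test, where a target period P is achievable only if every processor and every link carries at most P time units of work per data set (the assignment and load values below are illustrative assumptions):

```python
# Feasibility of a target period P: no processor and no network link
# may carry more than P time units of work per data set.
# All values below are illustrative assumptions.
proc_load = {"P1": 2.5, "P2": 3.0}   # computation time per data set
link_load = {("P1", "P2"): 1.5}      # communication time per data set

def feasible(P):
    loads = list(proc_load.values()) + list(link_load.values())
    return all(load <= P for load in loads)

print(feasible(3.0))  # True: the busiest resource (P2) exactly matches P
print(feasible(2.0))  # False: P1 and P2 are overloaded
```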
The large number of variants of these scheduling problems makes the complexity results difficult to apprehend. We summarize in Tables II and III the complexity results for the period and latency optimization problems, which apply to all communication models. For the period minimization problem, the reader can refer to Benoit and Robert [2008] for the variants without replication, and to Benoit and Robert [2010] for the others (results marked (rep.)).
For the latency minimization problem, we report results without data parallelism; otherwise, the problem becomes NP-hard as soon as processors have different speeds (related model), even without communication costs [Benoit and Robert 2010].
SOLVING PROBLEMS
- Scheduling a Chain of Tasks with Interval Mappings
- Scheduling a Chain of Tasks with General Mappings
- Structured Application Graphs
- Scheduling with Replication
- General Method to Optimize the Throughput
- General Method to Optimize Throughput and Latency
The split point is chosen so as to minimize the period of the new solution. The other two heuristics, BSL and BSC, perform a binary search on the period of the solution. At each step, BSL assigns the beginning of the chain of tasks to the processor that can perform the most computation while respecting the target period.
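A minimal sketch of such a binary search on the period, in the spirit of BSL; the greedy assignment below, which repeatedly gives the longest feasible prefix of the chain to the fastest remaining processor, is an illustrative reconstruction (task weights and speeds are assumptions), not the authors' exact heuristic:

```python
# Binary search on the target period P for mapping a chain of tasks onto
# heterogeneous processors (illustrative reconstruction, BSL-style).
tasks = [4.0, 2.0, 3.0, 6.0, 1.0]   # computation weights along the chain (assumed)
speeds = [2.0, 1.0, 1.0]            # processor speeds (assumed)

def feasible(P):
    """Greedily give each processor the longest prefix of the remaining
    chain it can process within period P, fastest processor first."""
    remaining = tasks[:]
    for s in sorted(speeds, reverse=True):
        budget = s * P              # work this processor can do per period
        done = 0.0
        while remaining and done + remaining[0] <= budget:
            done += remaining.pop(0)
        if not remaining:
            return True
    return False

lo, hi = 0.0, sum(tasks) / max(speeds)   # hi: fastest processor does everything
for _ in range(50):                      # binary search to fixed precision
    mid = (lo + hi) / 2
    if feasible(mid):
        hi = mid
    else:
        lo = mid
print(f"smallest feasible period ~ {hi:.3f}")
```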
Simulations show that the performance of these heuristics is within 30% of the optimal general mapping (obtained from an integer linear programming formulation). Subhlok and Vondran [1995] address the problem of scheduling a chain of tasks on homogeneous processors to optimize the application throughput, without computation/communication overlap. This result can be extended by adding a latency dimension to the dynamic program, allowing latency optimization under a throughput constraint [Subhlok and Vondran 1996].
Without replication and without communication costs, throughput optimization on homogeneous processors is NP-complete, by reduction from 3-PARTITION. The throughput constraint is enforced by setting the latency to infinity for processor assignments that do not respect it. Evaluating the latency of a node for a given number of processors requires O(m) computations, and there are 2n − 1 ∈ O(n) nodes of the tree decomposition to evaluate for each possible number of processors.
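The dynamic program of Subhlok and Vondran [1995] is not reproduced here; the sketch below shows the standard bottleneck chain-partitioning recurrence that such approaches build on, assuming homogeneous processors and no communication costs (the task weights are illustrative):

```python
from functools import lru_cache

# Partition a chain of tasks into at most p contiguous intervals so as to
# minimize the maximum interval load, i.e., the period on homogeneous
# processors when communication costs are ignored.
tasks = [4.0, 2.0, 3.0, 6.0, 1.0]    # illustrative task weights
prefix = [0.0]
for w in tasks:
    prefix.append(prefix[-1] + w)     # prefix sums for O(1) interval loads

def interval(i, j):
    """Load of the interval holding tasks i..j-1 (0-indexed, j excluded)."""
    return prefix[j] - prefix[i]

@lru_cache(maxsize=None)
def period(j, k):
    """Best achievable period for the first j tasks using at most k intervals."""
    if j == 0:
        return 0.0
    if k == 1:
        return interval(0, j)
    # the last interval is tasks i..j-1; the rest uses k - 1 intervals
    return min(max(period(i, k - 1), interval(i, j)) for i in range(1, j + 1))

print(period(len(tasks), 2))  # best split is {4,2,3} | {6,1}: period 9.0
```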
(Recall that a schedule is processor-ordered if the communication graph between processors is acyclic.) The algorithm first builds a schedule by grouping several tasks together to reduce the size of the graph. Since the authors are only interested in throughput, the problem reduces to a mapping of tasks onto processors, and the throughput of a solution is given by the load of the busiest processor or communication link.
For each mapping change considered, the throughput is estimated by scheduling computations and communications using a greedy algorithm.
CONCLUSION AND FUTURE WORK