
Task and Message Co-scheduling Strategies in Real-time Cyber-Physical Systems

Academic year: 2023

Full text

I conform to the norms and guidelines given in the Ethical Code of Conduct of the Institute. To the best of my knowledge, this work has not been submitted elsewhere for the award of a degree.

7.5 ASAP and ALAP times of message nodes in PTGs G1 and G2 (shown in Table 7.2 & Table 7.3)

Related Work

Although optimal scheduling solutions can provide significantly better performance than sub-optimal heuristic solutions, computing such optimal solutions can become prohibitively expensive for large problem sizes.

Challenges

Energy consumption in real-time systems has become an important issue with the increase in the number of processing elements. Scheduling schemes for real-time systems must optimize energy consumption while satisfying other constraints such as timing, resource, and precedence constraints.

Objectives

Scheduling schemes for safety-critical real-time systems must be able to efficiently utilize the available resources of the underlying platform in order to satisfy the resource constraints associated with the real-time task set. Design and implementation of QoS-adaptive scheduling mechanisms for real-time systems modeled as PTGs on a fully connected heterogeneous multiprocessor system.

Summary of work done

Although G-SAQA follows an intuitive scheduling flow, it considers only the global slack (the deadline minus the PEFT makespan) when upgrading the service levels of tasks in the PTG. Based on the selected task-to-processor mapping, the parent message nodes of each task node are assigned to the appropriate buses.

Organization of the Thesis

Application Layer

  • Real-time Task Model

Computation time or execution time (ei) is the time the processor takes to complete the computation of the task without interruption. Static and dynamic task systems: In a static task system, the tasks to be executed on the platform are completely defined before the system is started.

Real-time Scheduler

Arbitrary Deadlines: Task deadlines can be shorter than, equal to, or longer than their periods. Hyperperiod (H): Given a static task system, H represents the minimum time interval after which the schedule repeats.
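For a static task system, the hyperperiod is simply the least common multiple (LCM) of the task periods. A minimal sketch with illustrative periods (not taken from the thesis's task sets):

```python
from math import lcm  # Python 3.9+

def hyperperiod(periods):
    """Hyperperiod H = LCM of all task periods; the schedule repeats every H."""
    return lcm(*periods)

# Example: periods of 4, 6 and 10 time units repeat together every 60 units.
print(hyperperiod([4, 6, 10]))  # → 60
```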

Hardware Platform

Shared-bus-based multiprocessor system: In this system, processors are connected through a shared-bus-based communication network. For any given processor, task execution and communication with other processors can proceed simultaneously without any contention.

Figure 2.2: Fully connected multiprocessor system

Types of Task Constraints

From the figure, it can be observed that task T1 has no predecessor and can start its execution immediately. It can also be seen that T2, T3 and T4 can only start executing after their predecessor task node T1 has completed its execution.
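The readiness rule just described — a task may start only once all its predecessors have completed — is exactly a topological ordering of the PTG. A minimal sketch over a four-node graph shaped like the description of Figure 2.4 (the edge set here is assumed from the text, not copied from the figure):

```python
from collections import deque

def topological_order(succ, n):
    """Kahn's algorithm: a task becomes ready once all its predecessors finish."""
    indeg = {t: 0 for t in range(1, n + 1)}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    ready = deque(t for t in indeg if indeg[t] == 0)  # tasks with no predecessor
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

# T1 precedes T2, T3 and T4, as in the text's description.
print(topological_order({1: [2, 3, 4]}, 4))  # → [1, 2, 3, 4]
```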

Figure 2.4: A Precedence-constrained Task Graph (PTG)

Classification of Real-Time Scheduling Algorithms

An algorithm is said to be optimal if it minimizes a given cost function defined over the set of tasks. An algorithm is called a heuristic if it is guided by a heuristic function in making its scheduling decisions.

A Discussion with Motivational Examples

Tables 2.5 and 2.3 show the task execution time and message transmission time, respectively. Tables 2.6 and 2.7 show the execution and communication times of tasks and messages, respectively, in PTG G1 and G2.

Figure 2.5: Gantt chart: Independent tasks (Table 2.1); Homogeneous processors

Multiprocessor Scheduling - A Brief Survey

An Overview of HEFT & PEFT

Heterogeneous Earliest Finish Time (HEFT) [6] is a list-based heuristic scheduling algorithm for executing PTGs on a heterogeneous, fully connected, distributed multiprocessor platform, with the objective of minimizing the overall schedule length (makespan). It operates in two phases: (1) computing a priority list of tasks, which governs the order in which tasks are considered for processor assignment, and (2) determining the most suitable processor for each task in terms of minimizing the overall makespan.
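HEFT's first phase is conventionally based on the upward rank of each task: its average execution cost plus the maximum, over its successors, of communication cost plus the successor's rank; tasks are then ordered by decreasing rank. A minimal sketch with illustrative costs (not the thesis's data):

```python
def upward_rank(task, avg_exec, succ, comm, memo=None):
    """rank_u(Ti) = avg_exec(Ti) + max over successors Tj of (comm(i,j) + rank_u(Tj))."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    children = succ.get(task, [])
    tail = max((comm.get((task, c), 0) + upward_rank(c, avg_exec, succ, comm, memo)
                for c in children), default=0)
    memo[task] = avg_exec[task] + tail
    return memo[task]

# Illustrative 3-task chain T1 -> T2 -> T3.
avg_exec = {1: 10, 2: 5, 3: 7}
succ = {1: [2], 2: [3]}
comm = {(1, 2): 2, (2, 3): 1}
ranks = {t: upward_rank(t, avg_exec, succ, comm) for t in avg_exec}
priority = sorted(avg_exec, key=lambda t: -ranks[t])
print(ranks, priority)  # ranks: {1: 25, 2: 13, 3: 7}; priority: [1, 2, 3]
```

Sorting by decreasing rank guarantees that every task appears after all of its predecessors in the list.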

Summary

  • An Optimal Solution Approach (MMCKP-DP)
  • Accurate Low Overhead Level Allocator (ALOLA)
  • Example: Service-level Assignment
  • Offline Schedule Generation

ALOLA Time Complexity: The complexity associated with assigning the minimum service level to all tasks and computing their cost values in lines 1-3 is O(n). The total remaining computing and communication capacities after assigning the minimum service level sl_i1 to each task Ti become (p − Σ_{i=1..n} wt_i1) = 0.6 and (b …
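The remaining-capacity bookkeeping above can be sketched directly; the weights below are illustrative stand-ins chosen so the arithmetic mirrors the 0.6 figure quoted in the text, not the thesis's actual example values:

```python
def remaining_capacity(total, weights):
    """Capacity left after reserving each task's minimum service-level weight."""
    return total - sum(weights)

# Illustrative: total normalized processing capacity p = 1.0 and
# minimum-level computation weights wt_i1 for three tasks.
p = 1.0
wt_min = [0.1, 0.15, 0.15]
print(remaining_capacity(p, wt_min))  # → 0.6
```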

Figure 3.1: Pipelined message transmission and execution over a synchronized system of homogeneous processors and buses

Experiments and Results

Data Generation Framework

The weights (wt_ij or wm_ij) for the non-base service levels (from level 2 onward) of the tasks are assigned uniform random values bounded between 110% and 120% of the weights (wt_i(j−1) or wm_i(j−1)) of their immediately lower service levels. Reward values were randomly selected from 20 to 200, ensuring that the reward value of a task at a given service level is higher than its reward value at any lower service level.
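The weight-generation rule described above — each level's weight drawn uniformly from 110%–120% of the weight of the level below it — can be sketched as follows (function name and base weight are illustrative):

```python
import random

def gen_level_weights(base_weight, n_levels, rng):
    """Each level's weight is 1.10x-1.20x the weight of the level below it."""
    weights = [base_weight]
    for _ in range(n_levels - 1):
        weights.append(weights[-1] * rng.uniform(1.10, 1.20))
    return weights

rng = random.Random(42)
w = gen_level_weights(10.0, 4, rng)
# Weights grow monotonically; each is within 110%-120% of the previous level.
print([round(x, 2) for x in w])
```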

QoS Measurements

Results (Category 1): Figure 3.4 presents the N R plots obtained for both ALOLA and MMCKP-DP on systems consisting of 60 tasks and 8 processors, with the processor utilization with respect to minimum service levels, P U (refer Equation 3.12), varying between 0.6 and 1. This is because the probability of reaching higher service levels decreases as P U increases.

Time Measurements: Results

This is also expected because the time complexity of ALOLA is directly proportional to the number of tasks and service levels. It can be seen that speedups increase with the number of tasks, processors and/or buses.

Figure 3.4: Processor Utilization (P U ) Vs. N R

Case Study: Flight Management System

In comparison, the complexity of ALOLA shows significantly lower sensitivity to the number of tasks and service levels (refer Section 3.1.2). The running time of ALOLA therefore does not vary significantly with changes in the number of processors and/or buses, because it is only indirectly sensitive to them.

Figure 3.6: ALOLA: Processor Utilization (P U ) Vs. Running Time

Applicability Considerations

However, for larger systems where devices cannot be connected to all buses, ALOLA cannot be applied directly and must be adapted accordingly. A separate ALOLA scheduler can then be deployed for each subgroup of buses and the processors connected to that subgroup.

Table 3.5: Task set of FMS [7]

Summary

To denote the in-degree and out-degree of the task node Ti, we use the notations indeg(Ti) and outdeg(Ti), respectively. To indicate the predecessors and successors of the task node Ti, we use the notations pred(Ti) and succ(Ti), respectively.

Earliest/Latest Start Times for PTG Nodes

It can be seen from the figures that the N R values decrease with the increase in the number of tasks. From the figure it can be seen that the normalized rewards (N R) decrease with the increase in the number of processors.

Integer Linear Programming (ILP) Formulation: ILP - Service-level Allocation with Timing Constraints (ILP-SATC)

Unique Start Time Constraint

This means that each task node Ti must start its execution at exactly one service level, at a unique time, and on a single processor Pr.
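Read this way, the constraint says that over all service levels, processors, and time steps, exactly one binary start variable per task is set. A minimal checker sketch over a binary decision tensor (the variable layout below is illustrative, not the thesis's own ILP notation):

```python
def satisfies_unique_start(X):
    """X[i][l][r][t] in {0,1}; task i must have exactly one 1 across all (l, r, t)."""
    return all(sum(x for level in Xi for row in level for x in row) == 1 for Xi in X)

# One task, 2 service levels, 2 processors, 3 time steps: it starts at
# service level 0, on processor 1, at time step 2 -- and nowhere else.
X = [[[[0, 0, 0], [0, 0, 1]],
      [[0, 0, 0], [0, 0, 0]]]]
print(satisfies_unique_start(X))  # → True
```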

Resource Constraint

Dependency Constraint

Linearization of Non-linear Term

Deadline Constraint

Objective Function

Complexity Analysis

It can be seen that speedups increase with the number of nodes, processors and/or buses. It can be seen that the running time of CC-TMS increases monotonically as the number of nodes increases.

Figure 4.3: ILP-SATC: Schedule for G (in Figure 4.2) depicted as a Gantt chart

ILP Formulation: ILP - Service-level Allocation with Non-overlapping Constraints (ILP-SANC)

Unique Resource Assignment

Unique Quality-level Assignment

Dependency Constraint

It can be seen that when both tasks Ti and Tj are executed on the same processor Pr, Z_ijr becomes 1, causing the term containing m_ij to vanish. The term (S_i + Σ_{l=1..|SL_i|} Σ_{r=1..p} U_ilr · e_ilr) captures the absolute completion time of Ti at service level sl_il when it is assigned to Pr.

Linearization of Non-linear Term

Non-overlapping Constraints

Deadline Constraint

Objective Function

Complexity Analysis

Experimental Evaluation

|SLi| = 3; (4) the execution time e_i1r of each task node at its base service level on processor Pr is taken randomly from a uniform distribution in the range 10 ms to 30 ms. Specifically, the x% of tasks with the highest reward per unit execution time (RPE) were selected for service-level improvement.

Figure 4.5: (a) Gaussian Elimination [1], (b) Epigenomics [2]

Case Study: Adaptive Cruise Controller

PEFT is run on the PTG and the resulting normalized reward (N R) and makespan values are noted. Message transmission avoidance: a message is not transmitted when a task node is scheduled on the same processor as its successor task node.

Table 4.6: Computation time (in ms) of task nodes

Summary

An extensive set of simulation-based experiments was conducted to evaluate the performance of both ILPs. Specifically, we assume that tasks on different processors communicate by transferring data from the source processor over a fully connected network to the local memory of the receiving processor.

Heuristic Algorithms

Global Slack Aware Quality-level Allocator (G-SAQA)

After extraction, G-SAQA checks whether it is possible to improve the service level of Ti using the available global slack g. The service-level upgrade process takes O(Σ_{i=1..n} |SL_i| × n × log n) (lines 15 to 23), which includes the O(n) required to update the start and finish times of all descendant nodes of Ti (line 21).
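The global-slack-driven upgrade loop can be sketched greedily: compute the slack as the deadline minus the PEFT makespan, then raise a task's service level only while the extra execution time of the upgrade still fits in the remaining slack. All numbers below are illustrative, not taken from the thesis:

```python
def gsaqa_upgrade(slack, extra_costs):
    """Greedily upgrade tasks' service levels while the extra execution
    time of each upgrade still fits in the remaining global slack."""
    upgraded = []
    for task, cost in extra_costs:          # tasks in priority order
        if cost <= slack:
            slack -= cost
            upgraded.append(task)
    return upgraded, slack

deadline, peft_makespan = 100, 88
slack = deadline - peft_makespan            # global slack = 12
ups, left = gsaqa_upgrade(slack, [("T4", 5), ("T2", 4), ("T7", 6)])
print(ups, left)  # T7's upgrade (cost 6) no longer fits after T4 and T2 consume 9.
```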

Figure 5.3: Assignment PTG G 0 obtained from PTG G and the PEFT schedule

Total Slack Aware Quality-level Allocator (T-SAQA)

The task T4 with the highest key value (currently at the root of the max-heap) is removed from the heap and its service level is upgraded from sl41 to sl42, incurring an additional computation requirement. Then T2, now the highest-key task, is extracted from the heap and its service level is upgraded from sl21 to sl22.
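Python's heapq is a min-heap, so the max-heap of upgrade keys described above can be sketched by negating the keys; the key values here are illustrative:

```python
import heapq

# Illustrative upgrade keys: a higher key means a more profitable upgrade.
keys = {"T4": 9.0, "T2": 7.5, "T3": 4.2}
heap = [(-k, t) for t, k in keys.items()]
heapq.heapify(heap)

order = []
while heap:
    negk, task = heapq.heappop(heap)  # pops the task with the largest key
    order.append(task)
print(order)  # → ['T4', 'T2', 'T3']
```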

Experimental Evaluation

Performance evaluation using benchmark Precedence-constrained

Performance evaluation using randomly generated PTGs

On the other hand, although the total system capacity increases with the increase in the number of processors, the capacity of individual processors remains the same. In addition, it can be observed that the structure of the PTG (number of task nodes and their interdependencies) used as input to the experiment still remains the same as the number of processors is increased.

Figure 5.11: Effect of varying #tasks, #processors and heterogeneity

Case Study: Traction Controller

Similar to the trend of the results obtained in the experiments section (Section 5.3), we observe that the optimal solution ILP-SANC yields significantly higher rewards (348). However, the overall runtime overhead associated with ILP-SANC is about ~10^6 times higher than that of the heuristic algorithms.

Table 5.6: Computation time (in ms) of task nodes in Traction Control PTG

Summary

The obtained results show that both heuristic schemes (G-SAQA and T-SAQA) are ~10^6 times faster than the optimal strategy ILP-SANC. Each element c_kr ∈ CT captures the communication time associated with message Mk (vertex V_{n+k}) on bus Br (resource R_{p+r}).

Figure 6.1: Platform Model ρ

Earliest/Latest Start Times for PTG Nodes

However, this does not lead to a reduction in the number of message nodes and ultimately a reduction in CC-TMS performance. In comparison, the complexity of CC-TMS exhibits significantly lower sensitivity to the number of tasks, processors, and buses.

ILP Formulation: ILP with Explicit Time Reduced (ILP-ETR)

Unique Start Time Constraints

This means that each task node Ti must start its execution at a unique time step t on a single processing element Pr. Similarly, each message node Mk must have a unique start time if Mk is actually transmitted on a bus (see Assumption 5).

Linearization of Non-linear Term

Resource Constraints

Dependency Constraints

It can be noted that by setting C to a large enough value, the constraint in Equation 6.11 is trivially satisfied when both Ti and Tj are assigned to the same processing element (Yk = 1). It is worth noting that when Yk = 1, the constraint imposed by Equation 6.2 enforces Σ_{r=p+1..p+b} Σ_t …
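The big-M device used here is standard: multiplying a binary indicator by a sufficiently large constant C switches a constraint off. A minimal numeric sketch (variable names and the communication delay are illustrative, not the thesis's formulation):

```python
C = 10**6  # a sufficiently large constant ("big-M")

def dependency_holds(start_j, finish_i, same_processor):
    """start_j >= finish_i + comm_delay must hold unless both tasks share a
    processor; adding C * indicator makes the constraint trivially true then."""
    comm_delay = 3
    relaxed = C if same_processor else 0
    return start_j + relaxed >= finish_i + comm_delay

# On different processors the communication delay must be respected...
print(dependency_holds(start_j=10, finish_i=9, same_processor=False))  # → False
# ...but with Yk = 1 (same processor) the constraint is trivially satisfied.
print(dependency_holds(start_j=10, finish_i=9, same_processor=True))   # → True
```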

Deadline Constraint

Then the task node Tj (= succ(Mk)) should only start its execution after the completion of Mk.

Objective Function

Complexity Analysis

A Gaussian elimination task graph contains ((χ² + χ − 2)/2) task nodes and (χ² − χ − 1) message nodes when the number of equations to solve is χ. Here the number of processors (p) is varied from 2 to 8, the matrix size χ (Gaussian elimination) between 3 and 6, and the number of parallel branches.
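The node-count formulas for the Gaussian-elimination PTG can be checked directly over the matrix sizes used in the experiments:

```python
def gauss_elim_nodes(chi):
    """Task and message node counts of a Gaussian-elimination task graph
    with chi equations, per the formulas in the text."""
    tasks = (chi**2 + chi - 2) // 2
    messages = chi**2 - chi - 1
    return tasks, messages

for chi in range(3, 7):  # matrix sizes 3..6, as in the experiments
    print(chi, gauss_elim_nodes(chi))
# For example, chi = 5 gives 14 task nodes and 19 message nodes.
```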

Table 6.4: Complexity of ILP-ETR

ILP Formulation: ILP with Non-overlapping Constraints (ILP-NC)

Unique Resource Assignment

It can be noted that Σ_{r=1..p} Z_ijr = 1 when both the predecessor (Ti) and successor (Tj) task nodes of message node Mk are assigned to the same processing element Pr, forcing the LHS of Equation 6.16 to become 0.

Dependency Constraints

Non-overlapping Constraints

Here, constraint 6.20 enforces non-overlap between the executions of Ti and Tj when Ti starts before Tj. Similarly, the term (C ∗ α_ij) in Equation 6.21 vanishes when Tj starts before Ti, and makes the constraint trivially satisfied otherwise.

Linearization of Non-linear Term

Deadline Constraint

Objective Function

Complexity Analysis

The reason can be attributed to the high complexity of the ILP-NC solution together with its high sensitivity to the number of nodes, processors and buses. By comparing the structures of these two benchmark PTGs, it can be observed that the ratio of the number of message nodes to the number of task nodes is higher for Stencil compared to Laplace.

Figure 6.4: The schedule for the PTG (Figure 6.2a) using ILP-NC

Heuristic: Contention Cognizant Task and Message Scheduler (CC-TMS)

Earliest Start and Finish Time

EST(Ti, Pr) of a task Ti on processor Pr specifies the earliest time at which Ti can start on Pr. EST(Mk, Br) = max{availability[Br], AFT(Ti)} (6.35), where availability[Br] indicates the earliest time at which Br becomes available for message transmission and AFT(Ti) is the actual finish time of the predecessor task Ti of Mk.
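The EST rule for a message node — it can start neither before its bus frees up nor before its predecessor task finishes — can be sketched as:

```python
def est_message(bus_available, aft_pred):
    """EST(Mk, Br) = max(availability[Br], AFT(Ti)) for predecessor task Ti."""
    return max(bus_available, aft_pred)

# Illustrative: bus B1 frees at t=14 but the producing task finishes at t=17,
# so the message can start no earlier than t=17.
print(est_message(bus_available=14, aft_pred=17))  # → 17
```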

Co-scheduling Tasks and Messages

Output: the schedule of task and message nodes (the start time of each node and the assignment of tasks/messages to processors/buses). Let MsgPriorityList_i be the list consisting of all predecessor message nodes of task Ti, arranged in non-ascending order.

Complexity Analysis

Example

Experimental Evaluation

Experiment-2: Variation of the number of processors: The results of this experiment are shown in Figure 6.7. We see that for any given number of parallel branches (Epigenomics) or fixed matrix size (Gaussian elimination, Laplace, Stencil), the makespan ratio decreases as the number of processors increases.

Case Study: Traction Controller

We employed ILP-NC and CC-TMS to generate schedules for the PTG of TC with the aim of minimizing the makespan. However, the runtime overhead associated with the ILP is approximately ~10^5 times higher than that of CC-TMS.

Table 6.9: Execution times of task nodes (in µs)

Summary

On the other hand, CC-TMS (Figure 6.15) needs 50 µs to generate its schedule, which has a makespan of 1895 µs. So while the ILP produces highly efficient schedules, which can be critical for resource-constrained embedded systems, the CC-TMS algorithm is far more scalable and quickly produces reasonably good solutions, which matters when rapid design iterations are required.

Earliest/Latest Start Times for PTG Nodes

ILP Formulation: ILP for Energy-aware Scheduling (ILP-ES)

Unique Start Time Constraints

Resource Constraints

Dependency Constraints

Deadline Constraint

Objective Function

Proposed Scheme

Task Priority Generator (TPG)

Task and Message Co-scheduler (TMC)

Complexity Analysis of Task and Message Co-scheduler (TMC)

Experimental Evaluation

Case Study

Summary

Figures

Figure 1.1: Taxonomy; Highlighted boxes represent scope of the thesis
Figure 2.6: Gantt chart: Independent tasks (Table 2.1) with associated messages (Table 2.2);
Figure 2.7: PTG with task and message nodes
Figure 2.8: Gantt chart: PTG (Figure 2.7) with tasks (Table 2.1) and message (Table 2.3) nodes; Homogeneous processors
