
PARALLELISM INTRODUCTION

People have recognized for a long time that in most instances two people can accomplish a task faster than one, and three people can accomplish it even faster. The way that this has been implemented in practice has varied. In offices, folders would be refiled faster if they were split among a group of workers. Assembly lines speed up a process, because if one person does the same task over and over, that person can do it more quickly because he or she doesn't need to take time to change tools. Bucket brigades were discovered when people realized that more buckets of water could be moved if, instead of having people running back and forth, they stood in a line and just passed the buckets back and forth.

When we talk about parallel algorithms and programming, we see very similar concepts. There are multitasking systems, where each processor does the same task with different data. There are pipelined systems, where each processor does just one step of the task of decoding and executing a program instruction, passing the results on to another processor, which does the next step.

Dataflow systems set up a series of processors to carry out a task or calculation, and then the input data is passed from processor to processor in the calculation of the result.

This chapter is an introduction to the concept of parallel algorithms. Due to the complex nature of parallel algorithms and programming, to cover these ideas completely would at least double the size of this book. We begin with an overview of some of the general concepts related to the structure of parallel computer systems and then look at parallel algorithms for some of the problems we have considered in Chapters 2 through 6. The parallel algorithms presented will not always be the best parallel option but instead will give you an idea of how the problem could be solved in a parallel manner. The amount of detail that would be necessary to always present the most efficient parallel algorithm is well beyond this text.

7.1.1 Computer System Categories

Computer systems can be divided into four main categories. To understand these, you need to think of how a program runs in a slightly different way.

From the perspective of the main processor in a computer, the program arrives in a stream of instructions that have to be decoded and then executed. The data can also be seen as arriving in a stream. Our four categories are then based on whether there is one or multiple streams of instructions and data.

Single Instruction Single Data

Single instruction single data (SISD) is the classic single processor model that includes all early computers and many modern computers. In this case, there is one processor that can carry out one program instruction at a time. This processor can work with only one set of data at a time as well. These sequential systems exhibit no parallelism, as will be seen in comparison with the other three categories.

Single Instruction Multiple Data

In single instruction multiple data (SIMD) machines, there is some number of processors all doing the exact same operation but on different data. SIMD machines are sometimes referred to as vector processors because their operation is well suited to doing vector operations, where each processor gets a different element of the vector and after one instruction cycle, the entire vector has been handled. For example, adding two vectors together requires that each of the elements be added. The first element of the resulting vector is the sum of the first elements of the two input vectors, and the second element of the result is the sum of the second elements of the input vectors. In our SIMD machine, the instruction given to each processor would be an add, and each processor would be given one pair of values from the two input vectors. After this one instruction cycle, the entire result would be available. Notice that if the vector has N elements, a SISD machine would take N cycles doing one add per cycle, whereas a SIMD machine with at least N processors can do the addition in one instruction cycle.
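To make the lockstep behavior concrete, here is a minimal sketch that simulates the SIMD vector addition above, with one goroutine standing in for each processor. The use of Go and all names here are illustrative, not part of the machine model.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	a := []int{1, 2, 3, 4}
	b := []int{10, 20, 30, 40}
	sum := make([]int, len(a))

	var wg sync.WaitGroup
	for i := range a {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Every "processor" executes the same instruction (an add),
			// each on its own pair of elements.
			sum[i] = a[i] + b[i]
		}(i)
	}
	wg.Wait() // one "instruction cycle": all N adds complete together

	fmt.Println(sum) // [11 22 33 44]
}
```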

Multiple Instruction Single Data

The option of having different operations all applied to the same data may seem strange, because there are not many programs where you need the results of taking a single data value and squaring it, multiplying it by 2, subtracting 10 from it, and so on. But if we begin to think of this process from a different perspective, we see that finding if a number is prime can be improved with this type of machine.¹ If we have N processors, in one cycle we can determine if a number between 1 and N² is prime with a multiple instruction single data (MISD) machine, because if X is not prime, it will have a factor that is less than or equal to √X. To find out if X ≤ N² is prime, we have the first processor divide by 2, the second divide by 3, the third divide by 4, and so on up to processor K − 1, which divides by K, where K = ⌈√X⌉. If any of these processors finds that it can divide evenly by the number it is given, X is not prime. So in one operation, each of the processors does its division and we have the result. On a sequential machine, you should see that a simple solution to this problem would take at least ⌈√X⌉ − 1 passes through a loop, doing a division each time.

¹ Recall that a prime number is one that is only evenly divisible by itself and the number 1. So, for example, 17 is a prime number because the only numbers between 1 and 17 that divide it evenly are 1 and 17.
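As a rough simulation of this scheme, the sketch below assigns each trial divisor to its own goroutine, so all of the divisions of X can happen concurrently. The names are our own, and the divisors run only up to ⌊√X⌋, which is enough to detect any factor.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// isPrime simulates the MISD test: one "processor" per trial divisor,
// all dividing x in the same parallel step.
func isPrime(x int) bool {
	if x < 2 {
		return false
	}
	k := int(math.Sqrt(float64(x))) // any factor of a composite x is <= sqrt(x)
	var composite atomic.Bool
	var wg sync.WaitGroup
	for d := 2; d <= k; d++ {
		wg.Add(1)
		go func(d int) {
			defer wg.Done()
			if x%d == 0 { // this processor's division came out even
				composite.Store(true)
			}
		}(d)
	}
	wg.Wait() // the single parallel "operation" is complete
	return !composite.Load()
}

func main() {
	fmt.Println(isPrime(17)) // true
	fmt.Println(isPrime(91)) // false (7 * 13)
}
```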

Multiple Instruction Multiple Data

Our final category is the most flexible of the options. In this case, we have multiple processors, each of which is capable of carrying out a different instruction. We also have multiple data streams, so that each processor can work on independent data sets. In practice, this means that a multiple instruction multiple data (MIMD) system could be running different programs on each processor or different parts of the same program or the vector operations we saw for the SIMD configuration. This category includes most of the modern attempts at parallelism, including clusters of computers and multiprocessor systems.

7.1.2 Parallel Architectures

There are two main issues in the architecture of parallel computer systems: How are memory and processors connected, and how do the processors communicate? These issues will be used when we discuss algorithms because some parallel options are best suited to one or another of these configurations.

Loosely versus Tightly Coupled Machines

In a loosely coupled machine, each of the processors has its own memory, and communication between processors occurs across "network" cables. This is the architecture of computer clusters, where each element of the cluster is a complete computer system that could function on its own. Parallelism is achieved by the way that tasks are assigned to each of the computers in the cluster by a central controlling computer.

In a tightly coupled machine, each of the processors shares a centralized memory. Communication between the processors is done by one processor writing information into memory and then one or more processors reading that information back out. An example of this communication will be given in Section 7.3.
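The sketch below imitates that write-then-read pattern in miniature, assuming we let goroutines play the processors and an ordinary variable play the centralized memory; the condition variable stands in for whatever synchronization a real tightly coupled machine would provide.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var shared int // the centralized memory cell
	var mu sync.Mutex
	ready := sync.NewCond(&mu)
	written := false

	go func() { // "processor" A: writes its result into shared memory
		mu.Lock()
		shared = 42
		written = true
		ready.Signal()
		mu.Unlock()
	}()

	// "processor" B (here, main): reads the information back out
	mu.Lock()
	for !written {
		ready.Wait()
	}
	fmt.Println("read from shared memory:", shared)
	mu.Unlock()
}
```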

Processor Communication

In a loosely coupled machine, we said that the processors communicate over cables or wires. We now look at some of the configurations that are possible for these processors and wires. At one extreme is a fully connected network, where each processor has a connection to every other processor. At the other extreme is a linear network, where the processors are laid out in a line, and each processor is connected to the two immediately adjacent (except for the two ends, which have only one adjacent processor). In a fully connected network, information can flow quickly between processors, but at a high cost for the extensive amount of wiring that is necessary. In a linear network, information travels more slowly because it must be passed from processor to processor until it reaches its destination, and a single point of failure disrupts the information flow. This is not surprising if you recall our discussion of biconnected components in Chapter 6.

An alternative to a linear network that improves reliability is a ring network, which is like a linear network, except that the first and last nodes in the line are also connected. Information can now travel more quickly because it will only need to be passed through at most one-half of the processors. Notice in Fig. 7.1 that a message from node 1 to node 5 would have to pass through three intermediate nodes, whereas in the ring network of Fig. 7.2, that message could now get there by passing through just node 6.

[FIGURE 7.1 A fully connected and a linear network configuration]

[FIGURE 7.2 A ring network configuration]
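That difference in travel distance can be captured in a small formula: between nodes i and j, a message in a linear network crosses |i − j| links, while a ring lets it take the shorter way around. A sketch, where the node numbering and function names are ours:

```go
package main

import "fmt"

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// linearHops: links a message crosses between nodes i and j in a line.
func linearHops(i, j int) int { return abs(i - j) }

// ringHops: a ring of n nodes lets the message take the shorter direction.
func ringHops(i, j, n int) int {
	d := abs(i - j)
	if n-d < d {
		return n - d
	}
	return d
}

func main() {
	// Nodes 1..6 as in Figs. 7.1 and 7.2: from node 1 to node 5, the
	// line needs 4 links (3 intermediate nodes), but the ring needs
	// only 2 links (passing through just node 6).
	fmt.Println(linearHops(1, 5))  // 4
	fmt.Println(ringHops(1, 5, 6)) // 2
}
```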

In a mesh network (see Fig. 7.3), the processors are laid out in a two-dimensional grid, and connections are made to those nodes that are adjacent either horizontally or vertically. Information now has even more ways to travel through the network, and the network is more reliable, but this is achieved at the cost of more complicated wiring.

[FIGURE 7.3 A mesh network]

There are other possible configurations that are not important to our discussion. Those include a tree network, where the processors are connected like a binary tree, and a hypercube, which is an expansion of a mesh network to three or more dimensions.

7.1.3 Principles for Parallelism Analysis

There are two new concepts that we encounter when dealing with the analysis of a parallel algorithm: speed up and cost. The speed up of a parallel algorithm is the factor by which it is faster than an equivalent optimal sequential algorithm. For example, we saw that the optimal time for a sorting algorithm was O(N lg N). If we have a parallel sorting algorithm that is of order O(N), we have achieved a speed up of O(lg N).

The second issue that we must consider is the cost of the parallel algorithm, which we will define as the time of the algorithm multiplied by the number of processors used. In our example, if the parallel sorting algorithm of O(N) required that the number of processors be the same as the number of data values, the cost would be O(N²). This means that the parallel sorting algorithm would be more expensive because the cost of a one-processor sequential sorting algorithm is the same as its run time of O(N lg N).
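As a quick numeric illustration of both definitions, the toy sketch below plugs the orders of growth in as if they were exact operation counts; the numbers are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	for _, n := range []float64{1 << 10, 1 << 20} {
		seqTime := n * math.Log2(n) // optimal sequential sort: N lg N
		parTime := n                // the hypothetical parallel sort: N
		procs := n                  // one processor per data value

		// speed up = sequential time / parallel time = lg N
		// parallel cost = processors * parallel time = N * N = N^2
		fmt.Printf("N = %8.0f  speed up = %3.0f  parallel cost = %14.0f  sequential cost = %9.0f\n",
			n, seqTime/parTime, procs*parTime, seqTime)
	}
}
```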

A related issue is the scalability of the problem. If our only option for a parallel sort requires that we have the same number of processors as input values, we will find that this algorithm is not usable for a list of any significant size. This would be a problem, because our sequential sort algorithm has no such size restrictions. In general, we will be interested in parallel solutions where the number of processors is significantly less than the potential size of the input and where the algorithm can handle an increase in the size of the input without an increase in the number of processors.

7.1.4 Exercises

1. A problem you are working on needs the series of numbers that are the summations of numbers from a set. More specifically, for the set {s1, s2, s3, . . ., sN} you need the sums s1 + s2, s1 + s2 + s3, s1 + s2 + s3 + s4, . . ., s1 + s2 + s3 + . . . + sN. Design a method that you could use to solve this problem using parallelism.

2. Another network configuration is a star network, where there is one central processor, and every other processor is just connected to this central one. Draw a picture of a star network that has seven processors. This section discussed some advantages and disadvantages of a linear network. Using that discussion as the basis, what do you see as some of the advantages and disadvantages of a star network?