

5.2 Logic Architecture

5.2.1 Processing Element Design

5.2.1.1 Support for Node Decomposition

The decomposition transform (analyzed in Section 7.2) impacts PE design by allowing PEs to have small memories and by relying on dataflow-style synchronization between PE subcomponents. If PEs can have small memories then we can allocate a large number of PEs to each FPGA, with minimal memory resources per PE. When a node is assigned to a PE it uses one word of Edge Memory for each of its predecessor edges and one word of Send Memory for each of its pointers to a successor edge (Figure 5.9). This prevents PEs with small memories from supporting high-degree nodes. Many graphs have a small number of high-degree nodes, so small PEs could severely restrict the set of runnable graphs. In our benchmark graphs, the degree of the largest node relative to average node degree varies from 1.3 to 760, with a mean of 185. By performing decomposition, memory requirements per PE are reduced by 10 times, averaged across benchmark applications and graphs (Section 7.2).
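The sketch below illustrates this memory accounting; it is not the thesis implementation, and the function name and the example capacity are illustrative. It simply counts one Edge Memory word per predecessor edge and one Send Memory word per successor-edge pointer, which is why an undecomposed high-degree node can exceed a small PE memory on its own.

```python
# A minimal sketch (not the thesis logic) of per-PE memory accounting:
# one Edge Memory word per predecessor edge and one Send Memory word per
# successor-edge pointer, summed over the nodes resident on one PE.

def pe_memory_words(resident_nodes):
    """resident_nodes: list of (in_degree, out_degree) pairs."""
    edge_words = sum(d_in for d_in, _ in resident_nodes)
    send_words = sum(d_out for _, d_out in resident_nodes)
    return edge_words, send_words

# A single degree-760 node alone needs 760 Edge Memory words, which would
# overflow a hypothetical PE whose Edge Memory holds only a few hundred
# words; after decomposition each fanin or fanout node has bounded degree.
print(pe_memory_words([(760, 4)]))   # -> (760, 4)
```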

Computation cost in a PE is one cycle per edge, so breaking up large nodes allows load-balancing of operations as well as data. The speedup from decomposition is 2.6, as explained in Section 7.2.

Figure 5.11: An original node with 4 input messages and 4 output messages is decomposed into 2 fanin nodes, 1 root node, and 2 fanout nodes. The global barrier synchronization between graph-steps cuts operations at the original node. For the decomposed case, the global barrier synchronization only cuts operations at leaves of the fanin tree. A sparse subset of the arrows with dashed lines carries messages on each graph-step. All arrows with solid lines carry one message on each graph-step.

Figure 1.5 shows how decomposition breaks each original node into a root node with a tree of fanin nodes and a tree of fanout nodes. Each fanout node simply copies its input message to send to other fanout nodes or edges. Fanin nodes must perform GraphStep's node reduce operation to combine multiple messages into one message. We are only allowed to replace a large node with a tree of fanin nodes because GraphStep reduce operators are commutative and associative.
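The following sketch shows why commutativity and associativity are what make the fanin-tree replacement legal; the arity limit k and the function name are illustrative, not parameters from the thesis. Regrouping the inputs into per-fanin-node partial reduces cannot change the final reduced value.

```python
# Sketch: reducing a node's input messages the way a fanin tree of arity <= k
# would, with each group of at most k values acting as one fanin node's local
# reduce. Because the operator is commutative and associative, the tree-shaped
# grouping yields the same result as reducing all inputs at the original node.
from functools import reduce

def tree_reduce(op, msgs, k):
    level = list(msgs)
    while len(level) > 1:
        level = [reduce(op, level[i:i + k]) for i in range(0, len(level), k)]
    return level[0]

msgs = [5, 1, 9, 3, 7, 2, 6, 4]
assert tree_reduce(max, msgs, k=2) == reduce(max, msgs)   # same answer
```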

Figure 5.11 shows how decomposition relates to barrier synchronization and which PE operators (Figure 5.7) occur in which types of decomposed nodes in our PE design. After decomposition only fanin-tree leaves wait for the barrier synchronization before sending messages. Other nodes are dataflow synchronized and send messages once they get all their input messages. We make the number of messages into each dataflow-synchronized node a constant so the node can count inputs to detect when it has received all inputs. Each fanout node only gets one message, so it only needs to count to one. Without decomposition, all edges are sparsely active, so their destination nodes do not know the number of messages they will receive on each input. To make this number constant, edges internal to a fanin tree are densely active. Dense activation of fanin-tree messages requires that each fanin-tree leaf send a message on each graph-step. At the beginning of a graph-step, each PE iterates over its fanin-tree leaves to send messages that are stored from the previous graph-step's node reduce, and sends nil messages for nodes which have no stored message.
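A small sketch of this input-counting dataflow synchronization follows; the class and method names are illustrative, not the PE's actual logic. A dataflow-synchronized node holds a constant expected input count and fires when that many tokens have arrived; because fanin-tree edges are densely active, nil tokens still advance the count.

```python
# Sketch of dataflow synchronization by input counting (names illustrative).
NIL = object()   # the nil message a fanin-tree leaf sends when nothing is stored

class DataflowNode:
    def __init__(self, expected_inputs, reduce_op):
        self.expected = expected_inputs   # a constant, fixed when the graph is loaded
        self.reduce_op = reduce_op
        self.count = 0
        self.acc = NIL

    def receive(self, token):
        """Invoked on token arrival; returns the output token once all expected
        inputs have arrived this graph-step, otherwise None."""
        self.count += 1
        if token is not NIL:
            self.acc = token if self.acc is NIL else self.reduce_op(self.acc, token)
        if self.count == self.expected:
            out, self.count, self.acc = self.acc, 0, NIL
            return out                    # NIL if every input this graph-step was nil
        return None

# A fanout node is the degenerate case: expected_inputs == 1, and it just copies.
fanin = DataflowNode(expected_inputs=3, reduce_op=min)
assert fanin.receive(7) is None
assert fanin.receive(NIL) is None
assert fanin.receive(4) == 4
```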

We place the barrier immediately after node reduce operators to minimize memory requirements for token storage. Tokens or messages that are cut by the barrier must be stored when they arrive at the end of a graph-step so they can be used at the beginning of the next graph-step. If the barrier were before node reduce operators, then memory would need to be reserved to store one value for each edge. We instead store the results of node reduce operators, since they combine multiple edge values into one value. These tokens are stored in Node Reduce Memory (Figure 5.7). This node reduce state is not visible to the GRAPAL program, so moving node reduce operations across the barrier does not change semantics.
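The storage argument can be made concrete with a small sketch; the function name and the example in-degrees are hypothetical. Placing the barrier after node reduce means buffering one reduced value per node (in Node Reduce Memory) instead of one value per cut edge.

```python
# Sketch: words that must be buffered across the graph-step boundary for the
# two barrier placements discussed above (illustrative, not the PE's logic).
def barrier_storage_words(in_degrees, barrier_after_reduce):
    if barrier_after_reduce:
        return len(in_degrees)   # one reduced value per node (Node Reduce Memory)
    return sum(in_degrees)       # one stored value per predecessor edge

in_degrees = [3, 8, 1, 5]        # hypothetical in-degrees of one PE's nodes
print(barrier_storage_words(in_degrees, barrier_after_reduce=False))  # 17
print(barrier_storage_words(in_degrees, barrier_after_reduce=True))   # 4
```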

One alternative logic design would perform one global synchronization for each stage of the fanin and fanout trees, so that all edges are sparsely active. Another alternative logic design would perform no global synchronizations and make all edges densely active. In Section 7.3 we show that our mixed barrier, dataflow-synchronized architecture has 1.7 times the performance of the fully sparse style, and 4 times the performance of the fully dense style.

To make our decomposition scheme work, extra information is added to PE memories and the logic is customized. Each node and edge object needs to store its decomposition-type so the logic knows how to synchronize it and knows which operations to fire on it.
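One way to picture the decomposition-type tag is as a small enumeration stored with each object, as in the sketch below; the particular tag names and the mapping to operators are illustrative, not the thesis's actual encoding.

```python
# Sketch of a per-object decomposition-type tag (tag set is illustrative).
from enum import Enum

class DecompType(Enum):
    FANIN_LEAF = 0   # waits for the global barrier, performs node reduce
    FANIN = 1        # dataflow synchronized, performs node reduce
    ROOT = 2         # dataflow synchronized, performs the node update and send
    FANOUT = 3       # dataflow synchronized, copies its single input message

def waits_for_barrier(tag: DecompType) -> bool:
    return tag is DecompType.FANIN_LEAF
```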

The need to support dataflow-synchronized nodes motivated us to design the PE as a set of self-synchronized, token-passing components. If all operations were synchronized by a barrier, then a central locus of control could iterate over nodes and edges, invoking all actions in a pre-scheduled manner. To handle messages and tokens arriving at any time, a PE's node and edge operators in Figure 5.7 are instead invoked by token arrival.
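A minimal sketch of this token-driven invocation style follows; the attribute and function names are assumptions for illustration, not Figure 5.7's actual operator set. The point is only that there is no central schedule: each arriving token selects the object it addresses and fires the operator matching that object's decomposition-type.

```python
# Sketch of token-driven operator invocation (all names are illustrative).
def pe_event_loop(incoming_tokens, objects, operator_for):
    """Fire the matching operator for each arriving token, in arrival order."""
    for token in incoming_tokens:          # tokens may arrive in any order
        obj = objects[token.dest]          # the node or edge object addressed
        handler = operator_for[obj.decomp_type]
        yield from handler(obj, token)     # emitted tokens go on to other operators/PEs
```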

Application       Wmsg   Wflit   ⌈Wmsg/Wflit⌉
ConceptNet          30     30       1.0
Bellman-Ford        40     30       2.0
Push-Relabel        90     30       4.0
Spatial Router      40     30       2.0

Table 5.3: This table shows the datapath widths of messages, widths of flits, and flits per message for each application.