With the advent of big data and social networks, large-scale graph processing has become a popular research topic. Recently, an optimization technique called Gorder was proposed to improve the performance of in-memory graph processing. However, because Gorder is designed only for algorithms with a specific I/O pattern, it is not suitable for some other graph algorithms.
In addition, the algorithmic complexity of N.Order is lower than that of Gorder, so N.Order is cheaper to apply.
Introduction
On the other hand, graph ordering is a general approach that improves the performance of graph algorithms while keeping their implementations and data structures unchanged. Gorder, the state of the art, was proposed as a pioneering study on improving performance through graph ordering [8]. Third, we model the relationship between graph ordering and algorithm performance and verify the model.
In Chapter 4, we model and verify the relationship between graph ordering and performance to determine which graph ordering is better.
Graph Ordering
Efficient I/O requests are crucial because most of the processing time in a disk-based graph engine is spent on I/O. In addition, graph ordering must complete quickly, because it is a cost paid before graph processing begins. The last consideration is the diversity of graph algorithms: it is very difficult to find a graph ordering suitable for all algorithms, because each algorithm has different I/O request patterns and graph characteristics.
To find a better graph ordering, we need to understand the disk access patterns of graph algorithms and discover graph layouts that suit their processing behavior. The next section discusses the behavior and characteristics of graph algorithms on disk-based graph engines.
Characteristics of Graph Algorithms
Process of Graph Computation
Graph Algorithms on Disk-Based Graph Engines
In contrast, in the case of TC, the peak rate and the number of I/O requests are proportional, but the number of I/O requests is very small relative to the rate, so assigning new IDs in rank order is not a good idea. The differences in algorithm characteristics shown in Figure 4 lead to differences in the page cache hit ratio.
Because the same data is accessed multiple times, the cache hit ratio is high. Finding an optimal graph ordering that reflects these different behaviors is very difficult, because the workload varies widely across graph algorithms. Gorder, a state-of-the-art graph ordering technique, assigns node IDs using the scoring function in Equation (1).
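A minimal sketch of this kind of window-based greedy assignment follows; it is an illustration under assumptions, not Gorder's exact implementation. The score of a candidate vertex is taken as the sum, over the last w placed vertices, of a direct-edge term and a common-neighbor term (a simplification of Gorder's scoring function); the function name and graph representation are hypothetical.

```python
def greedy_order(neighbors, w=5):
    """Sketch of a Gorder-style greedy ordering.

    neighbors: dict mapping each vertex to its set of adjacent vertices.
    The next vertex is the one scoring highest against a sliding window
    of the w most recently placed vertices.
    """
    def score(u, v):
        s = 1 if v in neighbors[u] else 0       # direct-edge term
        s += len(neighbors[u] & neighbors[v])   # common-neighbor term
        return s

    remaining = set(neighbors)
    order = [remaining.pop()]                   # arbitrary start vertex
    while remaining:
        window = order[-w:]                     # sliding window of size w
        best = max(remaining,
                   key=lambda v: sum(score(u, v) for u in window))
        order.append(best)
        remaining.remove(best)
    return order
```

Note that each step scans all remaining vertices, which hints at why this style of ordering is expensive compared to a simple propagation-based assignment.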
After the first five vertices have been assigned, Gorder must select the sixth vertex according to the scoring result. This is a restricted ordering, suitable only for highly localized graph algorithms such as TC and CC that repeatedly issue I/O requests at the granularity of a single node. In Figure 6, for BFS processing it is better to preferentially assign vertex B, which is directly adjacent to vertex 1, as the sixth vertex rather than vertex A.
In other words, it is more appropriate to assign IDs in a way that propagates to neighboring vertices rather than by a scoring scheme. Therefore, in this paper we propose Neighbor Order (N.Order), a new graph ordering that assigns IDs to neighboring vertices first.
Classification of Graph Algorithms
In addition, once the vertex function has been called on one vertex, the next vertex to execute generally tends to be an activated vertex with a high locality score. Outside this category, some algorithms execute vertices in ID order regardless of the graph topology: algorithms such as PageRank and Personalized PageRank typically run linearly because vertices are executed in ID order.
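The two execution patterns described above can be contrasted with a small sketch (function names and the graph representation are hypothetical): a frontier-driven algorithm visits vertices in propagation order, while an ID-order algorithm sweeps every vertex linearly regardless of topology.

```python
from collections import deque

def frontier_execution(neighbors, source):
    """Neighborhood-first pattern: vertices execute in the order the
    computation propagates from the source (BFS-like)."""
    seen, q, trace = {source}, deque([source]), []
    while q:
        u = q.popleft()
        trace.append(u)                 # the vertex function runs here
        for v in neighbors[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return trace

def linear_execution(neighbors):
    """ID-order pattern (e.g. PageRank): every vertex executes once per
    iteration in ascending ID order, ignoring topology."""
    return sorted(neighbors)
```

This is why an ordering that helps one pattern (placing neighbors consecutively) may not help the other at all.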
Performance Modeling
- The Modeling for Neighborhood-First Algorithms
- The Modeling for Locality-First Algorithms
- Combination
- Verification
In this chapter, we model graph algorithms with Locality-First characteristics, such as triangle counting and clustering coefficient. This modeling essentially uses the locality computation used in Gorder. In Equation (5), the locality between two nodes is computed from two factors.
First, if two nodes are adjacent, there is a high probability that one node will be executed right after the other. Second, if two nodes have a common neighbor, the two related nodes may be executed together after the common neighbor is executed. In Equation (6), we adopt the sliding window model with window size w (w > 0) used in Gorder.
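A rough sketch of this locality measure, assuming the model simply counts the two factors above for every pair of vertices whose positions in the ordering fall within the window (the function name is hypothetical and this is not the exact form of Equations (5)-(6)):

```python
def locality_cost(order, neighbors, w=3):
    """Sliding-window locality sketch: for each pair of vertices within
    distance w in the ordering, count a direct edge (factor 1) plus
    their number of common neighbors (factor 2). A higher value means
    the ordering captures more locality."""
    total = 0
    for i, u in enumerate(order):
        for v in order[max(0, i - w):i]:        # pairs inside the window
            if v in neighbors[u]:
                total += 1                       # factor 1: adjacency
            total += len(neighbors[u] & neighbors[v])  # factor 2: common neighbors
    return total
```

Comparing this value across candidate orderings of the same graph gives a cheap proxy for how well a Locality-First algorithm will perform.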
This graph ordering has very low preprocessing cost, shows very high performance for Neighborhood-First algorithms, and achieves reasonable performance even for Locality-First algorithms. In this chapter we verify the modeling carried out in the previous chapter by comparing the costs estimated by the model with the actual execution times of the algorithms.
In the linear regression analysis, an R² value close to 1 indicates a strong correlation, while a value close to 0 indicates no correlation. Most of the R² values in Figure 8 are around 0.8, indicating that the modeling results and the measured performance are highly correlated.
Neighborhood Ordering
The intuition is that if two nodes share a common neighbor, their Jaccard similarity will be higher than that of two randomly selected nodes. The number of shared neighbors is much larger when two nodes share the same parent node than in the other cases. The first parent vertex is selected arbitrarily, and its neighboring vertices are sequentially assigned target IDs.
When all target vertices of a parent have been assigned, the next parent vertex is randomly selected and the ID assignment is repeated.
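The procedure above can be sketched as follows, under one plausible reading in which already-assigned vertices become parents in ID order (effectively a BFS-style ID assignment), falling back to an arbitrary unassigned vertex for disconnected components; the function name and graph representation are hypothetical:

```python
from collections import deque

def neighbor_order(neighbors):
    """N.Order sketch: pick a parent vertex, give its not-yet-assigned
    neighbors consecutive new IDs, then move on to the next parent.

    neighbors: dict mapping each vertex to its set of adjacent vertices.
    Returns a dict mapping old vertex -> new ID.
    """
    new_id = {}
    nxt = 0
    q = deque()
    for start in neighbors:             # fallback start per component
        if start in new_id:
            continue
        new_id[start] = nxt; nxt += 1
        q.append(start)
        while q:
            parent = q.popleft()
            for v in neighbors[parent]:  # neighbors get consecutive IDs
                if v not in new_id:
                    new_id[v] = nxt; nxt += 1
                    q.append(v)
    return new_id
```

Because each vertex and edge is touched a constant number of times, this runs in linear time, which is what makes N.Order's preprocessing so much cheaper than Gorder's scoring-based search.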
Evaluation
Execution Time
For the Neighborhood-First algorithms, N.Order, proposed in this paper, showed the highest performance, improving by about a factor of two over Random. For the Locality-First algorithms (TC, CC), Gorder, which reflects locality, performs best. For PR and PPR, there is no performance improvement, due to the limitation discussed in Section 3.1.
The impact of graph ordering is more significant when HDDs are used as external storage. The page cache size is set to 10% of the input graph size. Execution with neighbor ordering is up to an order of magnitude faster than with random ordering, and twice as fast as with Gorder.
Since a hard disk is slower than an SSD and its random access performance is much worse, the influence of graph ordering is more pronounced on a hard disk. In summary, N.Order improves the performance of graph algorithms running on disk-based graph engines by a factor of two on SSDs and by a factor of ten on HDDs.
I/O Request Size
Pre-processing Overhead
Conclusion
Related Works
GraphX transforms graph processing operations in a node-centric model into data flow operations such as grouping, mapping, and reduction [25]. To ease the difficulty of programming large-scale graph analysis, SociaLite supports a declarative query language based on Datalog [26]. It should be easy to support declarative graph processing with the optimizations we've proposed in this paper.
Ligra is a lightweight framework that provides low-level primitives for implementing graph processing systems [30]. While Gorder is designed for main-memory graph systems, neighborhood ordering is designed to optimize the performance of disk-based graph engines.
Parallel graph processing: Is random ordering always the worst?
We suggest B.Gorder (Balanced Gorder), a simple ordering scheme that combines randomization with Gorder instead of modifying Gorder itself. To test whether B.Gorder has a load-balancing effect, we measured the per-thread execution time when running PageRank in a multi-threaded environment. With Gorder, per-thread runtimes deviated by a factor of 2.5 to 13.
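One plausible way to combine randomization with Gorder, sketched below purely as an illustration (the paper does not specify this construction): keep Gorder's locality intact within fixed-size blocks of the ordering, but shuffle the blocks so that heavy and light regions of the graph spread evenly across threads.

```python
import random

def b_gorder(gorder_sequence, num_blocks=8, seed=0):
    """Hypothetical B.Gorder sketch: split a Gorder-produced vertex
    sequence into blocks (preserving intra-block locality) and shuffle
    the blocks (the randomization step, for load balancing)."""
    n = len(gorder_sequence)
    size = (n + num_blocks - 1) // num_blocks
    blocks = [gorder_sequence[i:i + size] for i in range(0, n, size)]
    random.Random(seed).shuffle(blocks)
    return [v for block in blocks for v in block]
```

The trade-off is that shuffling sacrifices some cross-block locality in exchange for evening out per-thread work.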
We also measured the performance of three key algorithms to gauge the impact of load balancing on overall performance. Comparing with Figure 17, the worse the performance, the more severe the load imbalance. Random ordering can sometimes be a good choice when the graph processing engine does not implement load balancing and the imbalance is severe.
In conclusion, the claim that "random order is always the worst order" does not always hold.
Toward zero pre-processing costs
G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing," Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.
Y. Low et al. (with J. Hellerstein), "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud," Proceedings of the 38th International Conference on Very Large Databases.
K. Vora, G. Xu, and R. Gupta, "Load the Edge You Need: A Generic I/O Optimization for Disk-Based Graph Processing," Proceedings of the 2016 USENIX Annual Technical Conference.
X. Zhu, W. Han, and W. Chen, "GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning," Proceedings of the 2015 USENIX Annual Technical Conference.
J. Gonzalez et al. (with M. Franklin), "GraphX: Graph Processing in a Distributed Dataflow Framework," 11th USENIX Symposium on Operating Systems Design and Implementation.
J. Wang, M. Balazinska, and D. Halperin, "Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines," Proceedings of the 41st International Conference on Very Large Databases.
J. Shun and G. Blelloch, "Ligra: A Lightweight Graph Processing Framework for Shared Memory," Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.