Visualization of Very Large Graphs


ISSN (Print) : 2319 – 2526, Volume-2, Issue-4, 2013



Tanya Sheryl & D. Thenmozhi SSN College Of Engineering

E-mail : [email protected] & [email protected]

Abstract – Visual depictions of graphs are external representations that exploit human visual processing to reduce the cognitive load of many tasks that require an understanding of global or local structure. Graph visualization makes these global and local structures of the represented data easy to identify. One of its major issues is scalability: as the size of the graph increases, the nodes are drawn so close to each other that the display becomes cluttered. In previous approaches, graph visualization was limited to datasets of a few thousand nodes, and even the visualization of these small graphs took considerable time and effort. This work focuses on the issues and challenges faced during the visualization of very large graphs and devises a suitable mechanism to address scalability and reduce visual clutter. It involves finding the inherent knowledge within the graph and visualizing it through an efficient layout pattern.

Index Terms—Graph Visualization, Hierarchical Edges, Non-hierarchical Edges, Edge Bundling.

I. INTRODUCTION

A graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are called vertices, and the connecting links are called edges. Graphs are one of the most commonly used information structures.

Graphing is a pictorial way of representing relationships between various quantities, parameters, or measurable variables. Graph Visualization is a way of displaying the graph in a suitable layout such that the user is able to understand the meaning of the graph from this pictorial representation.

The main aim of visualization is to allow the user to see structure in a graphed data set that is difficult or impossible to see in its raw form. For this purpose, we use clever presentation techniques in order to assimilate the complex set of information contained within the graph. Graph visualization has many areas of application. A simple application would be a file hierarchy on a computer system, which can be represented as a tree (a special type of graph). Graph visualization helps users to easily navigate through the file hierarchy in order to find a particular file. Other areas of application include social networking, biochemical pathways, genetic maps, website maps, object-oriented systems (class browsers), data structures (compiler data structures in particular) and real-time systems (state transition diagrams, Petri nets) [6].

There are many graph visualization tools and software packages currently available. GraphViz, Gephi, aiSee, and Ubigraph are some of the freely available ones, but the major drawback of these tools is that they do not scale. Many of them work efficiently for graphs of a few thousand nodes but fail to produce good results for larger graphs comprising hundreds of thousands (lakhs) of nodes. In addition, they take a very long time to process the graph, so the rendering process cannot be completed in real time.

There are many issues that come up during the visualization process. The major one is the issue of scalability where in addition to the large number of nodes, each node is also highly dimensional.

Representing all this information in a very limited display area is tedious, as it can lead to overlap and clutter, a state in which it is impossible to differentiate between two adjacent nodes. Label readability also suffers, hindering the user's ability to understand the graph data and perform many tasks. Niche Works [2] and H3Viewer [3] are among the few systems that can handle thousands of nodes, but even these become insufficient when visualizing many real-world graphs. Another major issue is the sparsity of large graphs. Many real-world large graphs, such as social networks, web graphs, and communication network data sets, are sparse: the number of edges is very small compared to the maximum number of edges possible. As a result, many of these graphs contain disconnected components. To visualize such disconnected graphs, we first have to use algorithms that deal with the connectivity of the graph, which may further increase the complexity of the problem. Most of the currently available algorithms therefore assume that each graph data set is a complete (that is, connected) graph in itself, although such graphs are only a subset of the graphs encountered, which reduces the generality of these visualization algorithms. Ambiguity is another major issue faced during visualization.

In a very large graph, there may be multiple paths between a pair of vertices. Hence, while processing the graph, it is difficult to determine the correct path. Predictability is another issue that has to be taken care of: two different runs of the same algorithm on the same data set should produce the same visualization layout [4].

II. APPROACH

Any graph data set contains edges, and each edge is either a hierarchical edge or a non-hierarchical edge. Non-hierarchical edges are also called adjacency edges. The hierarchical information of a graph can be displayed using any of the tree visualization mechanisms. Simply adding the non-hierarchical edges on top of this leads to clutter. To avoid this, we include both the hierarchical and the non-hierarchical edges in one graph but represent them differently, so that the two can be clearly told apart. The main issue in this overall process is to distinguish between the hierarchical and the non-hierarchical edges.

For determining the hierarchical part, we can either convert the graph into a tree or use a clustering algorithm. The latter is too complex and takes a considerable amount of time for very large graphs.

Hence we choose the first method and convert the graph into a tree. The conversion only identifies the hierarchical edges; the structure of the tree is still unknown. We then identify the hierarchical and non-hierarchical structures. The hierarchical structure is displayed using any of the tree visualization mechanisms, and the non-hierarchical relations are depicted using links or curves. Hierarchical edge bundling is performed on these links in order to reduce visual clutter.

Fig. 1. Architecture Diagram

A. Converting Graph into Tree

In the edge set of a graph, there may be any number of paths from one node to another. This introduces a lot of ambiguity, and to eliminate it we convert the initial connected graph into a tree. As a result, a large number of edges are removed from the initial edge set. Once the conversion is done, all further processing is carried out on the new, smaller edge set, which reduces the processing time. Several standard algorithms can convert a graph into a tree, such as Breadth First Search (BFS) and Minimum Spanning Tree (MST) algorithms. In the new edge set, there is exactly one path between any two nodes of the tree, so the ambiguity is eliminated, and the number of edges is exactly one less than the number of vertices.
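As an illustration of the conversion step, the following is a minimal Python sketch of a BFS spanning tree. It assumes a connected, undirected graph stored as an adjacency dictionary; the function name `bfs_spanning_tree` and the small example graph are hypothetical and not taken from the implementation described in this paper.

```python
from collections import deque


def bfs_spanning_tree(adjacency, start):
    """Return the tree edges discovered by a breadth-first search from `start`.

    `adjacency` maps each node to a list of its neighbours (undirected graph,
    assumed connected for this sketch).
    """
    visited = {start}
    tree_edges = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                # The edge that first reaches a node becomes a tree edge.
                tree_edges.append((node, neighbour))
                queue.append(neighbour)
    return tree_edges


# Hypothetical example: a 4-cycle with one chord (4 vertices, 5 edges).
graph = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}
tree = bfs_spanning_tree(graph, 0)
print(tree)  # [(0, 1), (0, 2), (0, 3)]
```

The resulting edge list has one edge fewer than the number of vertices, as stated above, and the two edges that were not selected are candidates for the non-hierarchical edge set.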

B. Identifying Hierarchical and Non-hierarchical Edges

Once the edge set of the tree is obtained, the next step is to identify the structure of the tree. To identify the hierarchical edges, the first step is to determine the root of the tree, for which there are several mechanisms. In the first approach, the node with the maximum degree, that is, the highest number of neighbouring nodes, is taken as the root. In the second approach, the longest path in the tree is determined and its middle node is taken as the root. A third approach is to remove all the leaf nodes and repeat this procedure until only one or two nodes are left. In the case of very large graphs, the second and third methods are time consuming. Once the root is obtained, the neighbouring nodes of the root are taken as its child nodes, which in turn are the parents of their remaining neighbouring nodes and act as the roots of the subsequent sub-trees. Thus, the entire tree hierarchy can be found recursively.
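The first root-selection approach and the construction of the hierarchy can be sketched as follows. This is a minimal Python sketch under stated assumptions: the tree is given as a list of undirected edges, ties for the maximum degree are broken by first occurrence, and the traversal is iterative rather than recursive so that very deep trees do not hit Python's recursion limit. The name `build_hierarchy` is hypothetical.

```python
from collections import defaultdict, deque


def build_hierarchy(tree_edges):
    """Pick the maximum-degree node as root and orient the tree away from it."""
    adjacency = defaultdict(list)
    for u, v in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    # First approach from the text: the node with the most neighbours is the root.
    root = max(adjacency, key=lambda n: len(adjacency[n]))

    parent = {root: None}
    children = defaultdict(list)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in parent:  # not yet placed in the hierarchy
                parent[neighbour] = node
                children[node].append(neighbour)
                queue.append(neighbour)

    # Leaf nodes are the nodes that have no children.
    leaves = [n for n in parent if n not in children]
    return root, parent, dict(children), leaves


# Hypothetical tree edge set.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
root, parent, children, leaves = build_hierarchy(edges)
print(root, children, leaves)
```

In this example node 1 has three neighbours and becomes the root; node 3 is then the root of the sub-tree containing node 4.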

The adjacency relationships are drawn only between the leaf nodes. To identify the non-hierarchical edges, there are two mechanisms. In the first method, a similarity measure is used to determine the similarity between any two nodes. Similarity between nodes can be determined using either structure-based or content-based similarities. The structure-based approach uses the linkage information of the graph, while the content-based approach depends on the underlying graph data and the attributes of each node. In this paper, we deal only with structure-based similarity measures. The higher the similarity between a pair of vertices, the higher the adjacency. Many measures can be used for computing the similarity between two nodes, such as the minimum coefficient, Jaccard's coefficient, Pearson's coefficient, and distance measures like Euclidean distance and cosine distance. In the case of very large graphs, computing a distance measure for each pair of vertices is tedious and time consuming, so we do not use distance-based similarity measures here. Two vertices of a graph are called structurally equivalent, or similar, if they share the same neighbors. Thus, the similarity of vertices can be expressed by a generalization of the number of common neighbors.

Identifying the most suitable metric for calculating the similarity between nodes is very important when dealing with very large graphs, as the time taken for visualization may differ drastically. Low similarity indicates that the relation between the nodes is hierarchical in nature; the higher the similarity, the more non-hierarchical it is. The minimum coefficient between two nodes is the ratio of the number of mutual neighbors of the two nodes to the minimum of the cardinalities of their neighbor sets:

$$MC(u, v) = \frac{|NgB(u) \cap NgB(v)|}{\min(|NgB(u)|, |NgB(v)|)} \qquad (1)$$

where,

NgB : nearest neighbors

u, v : node index, u, v ∈ {0, 1, …, N-1}

N : number of nodes

The Jaccard's Coefficient can be calculated using the formula:

$$J(u, v) = \frac{|NgB(u) \cap NgB(v)|}{|NgB(u) \cup NgB(v)|} \qquad (2)$$
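The two coefficients named above can be computed directly from neighbor sets. The following Python sketch follows the verbal definition of the minimum coefficient given in the text and the standard definition of the Jaccard coefficient (common neighbors divided by the union of the neighbor sets). The function names, the example graph, and the convention of returning 0 for nodes without neighbors are assumptions made for illustration.

```python
def minimum_coefficient(neighbours, u, v):
    """Common neighbours divided by the size of the smaller neighbour set."""
    common = len(neighbours[u] & neighbours[v])
    smallest = min(len(neighbours[u]), len(neighbours[v]))
    # Assumption: isolated nodes get similarity 0 instead of a division error.
    return common / smallest if smallest else 0.0


def jaccard_coefficient(neighbours, u, v):
    """Common neighbours divided by the size of the union of neighbour sets."""
    union = len(neighbours[u] | neighbours[v])
    return len(neighbours[u] & neighbours[v]) / union if union else 0.0


# Hypothetical graph, stored as node -> set of neighbours.
neighbours = {
    "a": {"c", "d", "e"},
    "b": {"d", "e"},
    "c": {"a"},
    "d": {"a", "b"},
    "e": {"a", "b"},
}

print(minimum_coefficient(neighbours, "a", "b"))  # 2 common / min(3, 2) = 1.0
print(jaccard_coefficient(neighbours, "a", "b"))  # 2 common / 3 in the union
print(minimum_coefficient(neighbours, "a", "c"))  # no common neighbours = 0.0
```

Both functions only touch the neighbor sets of the two nodes involved, which is why they are cheaper than pairwise distance measures on very large graphs.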

The similarity measures take values in the range [0, 1]. If the value of the similarity metric is zero, the two nodes have no common neighbours, which indicates minimum similarity, and the edge is considered hierarchical. The nodes that do not have any child nodes are taken as leaf nodes. The adjacency relationships are drawn only between the leaf nodes. To identify the non-hierarchical edges, the distance between each pair of leaf nodes is determined. If the distance is below some threshold, the pair is considered to be connected by an adjacency edge.

In this method, the number of non-hierarchical edges explodes to a very high value compared to the original edge set of the graph. This explosion of edges makes the graph difficult to visualize, and hence we use the second approach. In the second method, we consider the initial edge set of the original graph and the new edge set of the tree. The difference of these two edge sets gives the removed edges, which are taken as the non-hierarchical edge set. Using this method, we get a reasonable number of non-hierarchical edges, neither too small nor too large compared to the original edge set.
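The second method amounts to a set difference between the two edge sets. A minimal Python sketch, assuming undirected edges stored as pairs (so that `(u, v)` and `(v, u)` denote the same edge); the function name and the example data are hypothetical:

```python
def non_hierarchical_edges(graph_edges, tree_edges):
    """Return the graph edges that were removed when the tree was built."""
    # frozenset makes the comparison independent of the order of the endpoints.
    tree_set = {frozenset(edge) for edge in tree_edges}
    return [edge for edge in graph_edges if frozenset(edge) not in tree_set]


# Hypothetical example: original edge set and the tree edge set derived from it.
graph_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
tree_edges = [(0, 1), (0, 2), (0, 3)]

adjacency_edges = non_hierarchical_edges(graph_edges, tree_edges)
print(adjacency_edges)  # [(1, 2), (2, 3)]
```

Since a spanning tree of a connected graph with N vertices has N-1 edges, this method always yields exactly the original edge count minus N-1 non-hierarchical edges.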

C. Tree Visualization Techniques

Basically there are two kinds of tree visualization techniques for representing hierarchical structures: (1) node-link diagrams and (2) space-filling mechanisms.

Node-link diagrams have a visible graphical edge from each parent to its children. One of the most well-known node-link tree visualizations is the rooted tree, with the root at the top and the leaves at the bottom. The radial layout is another example, where nodes are placed on concentric circles according to their depth in the tree. Each subtree is placed over a sector of the circle such that adjacent sectors do not overlap. The focus node or root node is placed at the center of the display and the other nodes are laid out around it. Immediate neighbors of the focus lie on the smallest inner ring, their neighbors lie on the second smallest ring, and so on. The balloon layout is also a node-link representation, in which sibling subtrees are enclosed in circles attached to the parent node. The major drawback of the radial and balloon layouts is that space is not used efficiently.
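The radial placement described above can be sketched in a few lines of Python. This sketch assumes one common variant in which each subtree receives an angular sector proportional to its number of leaves and each node is drawn at the middle of its sector; the paper does not state how sectors are sized, so that rule, the function name `radial_layout`, and the `ring_spacing` parameter are assumptions.

```python
import math


def radial_layout(children, root, ring_spacing=1.0):
    """Place nodes on concentric rings by depth, one angular sector per subtree."""
    # Visit nodes parent-first, then count leaves bottom-up (no recursion).
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    leaf_count = {}
    for node in reversed(order):
        kids = children.get(node, [])
        leaf_count[node] = sum(leaf_count[k] for k in kids) if kids else 1

    positions = {}
    # Each entry: (node, depth, sector start angle, sector end angle).
    stack = [(root, 0, 0.0, 2 * math.pi)]
    while stack:
        node, depth, start, end = stack.pop()
        angle = (start + end) / 2          # middle of the node's own sector
        radius = depth * ring_spacing      # ring index equals tree depth
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
        cursor = start
        for kid in children.get(node, []):
            # Split the parent's sector among children by leaf count,
            # so that sibling sectors never overlap.
            width = (end - start) * leaf_count[kid] / leaf_count[node]
            stack.append((kid, depth + 1, cursor, cursor + width))
            cursor += width
    return positions


# Hypothetical hierarchy: node -> list of children.
children = {"r": ["a", "b"], "a": ["a1", "a2"]}
positions = radial_layout(children, "r")
for name, (x, y) in positions.items():
    print(name, round(x, 3), round(y, 3))
```

The root lands at the origin, its children on the first ring, and its grandchildren on the second ring, which matches the ring structure described in the text.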

In contrast, the treemap layout is a space-filling method that completely fills the display area. Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is assigned a rectangle, which is then recursively subdivided to represent sub-branches.

Compared to all these methods, the Radial Layout produces the best results, as the maximum number of edges and nodes can be visualized effectively using this method.

D. Hierarchical Edge Bundling

Hierarchical edge bundling is a flexible approach used to bundle the adjacency edges together in order to reduce clutter [1]. The adjacency edges are depicted using B-spline curves. If an adjacency edge has to be drawn between two nodes, the path along the hierarchy between these two nodes is taken as the control polygon of the curve depicting the relationship. In order to reduce the ambiguity caused by this method, a bundling parameter is used to control the bundling effect. The bundling parameter β ranges from 0 to 1, where 0 indicates low bundling strength and 1 indicates high bundling strength.
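The construction of the control polygon and the effect of β can be sketched as follows. The straightening rule used here, which moves each control point towards the straight line between the two endpoints by a factor of (1 - β), is the one given in Holten's paper [1]; it is not spelled out in this paper, so treat it as an assumption about the method used. The function names and the example coordinates are hypothetical, and the evaluation of the B-spline itself is omitted.

```python
def hierarchy_path(parent, source, target):
    """Nodes on the tree path source -> lowest common ancestor -> target."""
    chain = []
    node = source
    while node is not None:          # walk from the source up to the root
        chain.append(node)
        node = parent[node]
    index_in_chain = {n: i for i, n in enumerate(chain)}

    tail = []
    node = target
    while node not in index_in_chain:  # walk up until the two paths meet
        tail.append(node)
        node = parent[node]
    return chain[: index_in_chain[node] + 1] + tail[::-1]


def bundled_control_points(points, beta):
    """Blend a control polygon with the straight line between its endpoints.

    beta = 1 keeps the hierarchy path (strong bundling);
    beta = 0 collapses it onto a straight line (no bundling).
    """
    n = len(points)
    if n < 2:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    result = []
    for i, (x, y) in enumerate(points):
        t = i / (n - 1)
        sx, sy = x0 + t * (x1 - x0), y0 + t * (y1 - y0)  # point on the line
        result.append((beta * x + (1 - beta) * sx, beta * y + (1 - beta) * sy))
    return result


# Hypothetical hierarchy and layout positions.
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "b1": "b"}
position = {"a1": (-2, 0), "a": (-1, 1), "r": (0, 2), "b": (1, 1), "b1": (2, 0)}

path = hierarchy_path(parent, "a1", "b1")
polygon = [position[n] for n in path]
print(path)                                   # ['a1', 'a', 'r', 'b', 'b1']
print(bundled_control_points(polygon, 0.5))
```

With β = 0.5 the control point at the root moves halfway from its layout position towards the straight line joining the two leaves, which loosens the bundle and makes individual edges easier to follow.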

III. RESULTS

Fig. 2. Input Edge set

Fig. 2 shows a snapshot of the input graph set, which consists of a set of edges. The graph is then converted into a tree using the MST, BFS, or DFS algorithm. The MST and BFS algorithms produce somewhat similar results. When a very large graph is converted into a tree using BFS or MST, the height of the tree is considerably small, whereas the height of the tree produced using DFS is very large, so it cannot be used in many of the visualization layouts. Once the tree edge set is produced, the structure of the tree is identified. The node with the maximum number of neighbor nodes is taken as the root node, and all other nodes of the subtrees are then found recursively. We then find the hierarchical and non-hierarchical edges. The non-hierarchical edges are drawn only between the leaf nodes; this is done to avoid clutter. An Inverted Radial Layout is used for the visualization of hierarchical edges. It is an improved version of the Radial Layout, created to achieve maximum visual clarity. The non-hierarchical edges are then bundled together and displayed at the centre of the layout, as shown in Fig. 3. Among all the visualization layout patterns, the Inverted Radial Layout is selected since it can display the maximum number of nodes on the screen with minimum clutter. In addition, it facilitates the addition of non-hierarchical edges at the centre.

Fig. 3. Inverted Radial Layout

IV. CONCLUSION AND FUTURE WORK

Visualization of graphs can thus be achieved in real time even for very large sets containing around fifty thousand to one hundred thousand (one lakh) nodes. If the input graph is even bigger, this work can be extended by finding groups of similar nodes and replacing each group with its most prominent node. In this way, the size of the graph is reduced first, and the same algorithm is then applied. This work can also be extended to graphs where each node contains several attributes in addition to the node label. Nowadays many real-world graphs, such as social networks, dependency graphs in software engineering, website hyperlinks, and collaboration networks, are dynamic in nature.

Dynamic graphs are time dependent and keep evolving with time (nodes may be added or deleted). The algorithm can also be extended to deal with these kinds of graphs. Other improvements that can be included are navigation within the graph, including zooming and panning techniques.

V. REFERENCES

[1] D. Holten, "Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data", IEEE Transactions on Visualization and Computer Graphics, Vol. 12, no. 5, September/October 2006.

[2] G.J. Wills, "Niche Works - Interactive Visualization of Very Large Graphs", Proceedings of the Symposium on Graph Drawing GD '97, Springer-Verlag, pp. 403–415, 1998.

[3] T. Munzner, "Drawing Large Graphs with H3Viewer and Site Manager", Proceedings of the Symposium on Graph Drawing GD '98, Springer-Verlag, pp. 384–393, 1998.

[4] K. Misue, P. Eades, W. Lai, and K. Sugiyama, "Layout Adjustment and the Mental Map", Journal of Visual Languages and Computing, Vol. 6, pp. 183–210, 1995.

[5] I. Herman, G. Melançon, M.M. de Ruiter, and M. Delest, "Latour - a Tree Visualization System", Proceedings of the Symposium on Graph Drawing GD '99, Springer-Verlag, pp. 392–399, 1999. A more detailed version in: Reports of the Centre for Mathematics and Computer Sciences, Report number INS-R9904, available

[6] I. Herman, G. Melançon, and M. S. Marshall, "Graph Visualization and Navigation in Information Visualization: a Survey"

