PDF Visualization of Multivariate Health Data using Self ... - DSPACE

The Special Problem entitled "Visualization of Multivariate Health Data Using Self-Organizing Maps" prepared and submitted by Mark Lester Y. Ghany in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science has been reviewed and is recommended for acceptance. Accepted and approved as partial fulfillment of the requirements for the bachelor's degree in computer science.

Multivariate data can contain subtle patterns. However, it is often considered an obstacle in research, because it is difficult to analyze with classical statistics. Computational statistics, a collaboration between computer science and statistics, offers a number of algorithms that can be used to overcome obstacles like this.

This can yield several hypotheses that can then be confirmed by other statistical tests; SOM Visualize has therefore made the analysis of multivariate data possible.

Background of the Study

Google has released an application programming interface (API) that lets users embed Google Maps on their websites and overlay their own data for free. Regional health data obtained through surveys conducted by the Philippine Census is an example of multidimensional data. Health is defined by the World Health Organization as "a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity" [8]. It is considered a complex system because several factors influence health, including social, economic and environmental variables that may vary by region.

The data used in this study included thirty health and demographic variables in categories such as fertility, family planning, maternal health, infant mortality, child health and nutrition, knowledge and behavior related to HIV/AIDS, and violence against women.

Statement of the Problem

Objectives

The advent of GIS opened up many opportunities to visualize data that are geographical in nature. The data is available through the NSO website (http://census.gov.ph).

a) Import an Excel file containing data that matches a specific data format;

Construct a self-organizing map by asking the user to provide:

a) Map width and height.

Provide the user with a visualization of potential data clusters using the unified distance matrix by asking the user for the size of the neighborhood. Provide the user with a visualization of clusters via the k-means clustering algorithm by asking the user for the number of desired clusters.

Significance of the Study

Scope and Limitations

The tool does not provide functionality for adding the boundaries of a specific location to the database.

Assumptions

Review of Related Literature

In addition to health data, the self-organizing map has also been used to visualize complex geospatial datasets. The study successfully demonstrated the application of the self-organizing map to cluster financial transactions into buyer-initiated and seller-initiated groups. In particular, the authors suggest using the specific findings obtained with the self-organizing map as hypotheses in further studies [14].

Finally, Penn used the self-organizing map in the visualization of geochemical and hyperspectral data. Clustering was achieved through a two-phase process, training the self-organizing map and applying a clustering algorithm. Due to its powerful algorithm, the self-organizing map can depict patterns in large datasets.

Previous studies have shown that using the self-organizing map for multidimensional data visualization is feasible and particularly useful for data mining.

Self-Organizing Map

In general, the neighborhood radius is a value that starts large, usually set to the radius of the grid, and decreases with each time step. In the figure above, the neighborhood is centered on the BMU and contains most of the other nodes, and the arrow indicates the radius of the neighborhood. A unique feature of the Kohonen learning algorithm is that the area of the neighborhood shrinks over time, which is achieved by reducing the radius over time.

Here σ0 denotes the width of the grid at time t0, and λ denotes a time constant calculated from the number of iterations set and the current size of the neighborhood; t is the current time step, which in the implementation is the current iteration of the loop. The weight-update equation states that the new adjusted weight for the node is equal to its current weight W plus a fraction, Θ × L, of the difference between the input vector and the current weight. Here dist is the spatial distance of the node from the BMU, and σ is the width of the neighborhood function as calculated using equation (2).
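For reference, the quantities described above are commonly written as follows. This is a sketch of the usual exponential-decay formulation, not a reproduction of the equations in the original text. The symbols V (the input vector) and L0 (the initial learning rate), and the use of the same time constant λ for the learning rate, are assumptions; the first line corresponds to what the text calls equation (2).

```latex
\begin{align}
\sigma(t) &= \sigma_0 \exp\!\left(-\frac{t}{\lambda}\right) \\
L(t) &= L_0 \exp\!\left(-\frac{t}{\lambda}\right) \\
\Theta(t) &= \exp\!\left(-\frac{\mathit{dist}^2}{2\,\sigma^2(t)}\right) \\
W(t+1) &= W(t) + \Theta(t)\, L(t)\, \bigl(V(t) - W(t)\bigr)
\end{align}
```

With this form, a node at the BMU itself (dist = 0) has Θ = 1 and receives the full learning-rate step, while nodes farther away are adjusted progressively less.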

Because learning is a stochastic process, the statistical accuracy of the map depends on the number of iterations, which must be large enough. Kohonen has devised a "rule of thumb" for the number of iterations in training [3].

Figure 1: The Rectangular and Hexagonal Topologies

Unified Distance Matrix

Although referred to as a neural network, self-organizing maps work differently than most known neural network algorithms, which are commonly supervised. Regarding the convergence criterion of SOM, SOM limits its training algorithm to a specific number of iterations. Since SOM is an unsupervised learning technique, there is no target vector and therefore we cannot obtain a difference between a target vector and the value at the current iteration [19].

The position of the projections of the input data points reflects the topology of the input space inherited from the basic SOM algorithm. The weight vectors of neurons with large U-heights are very far from the surrounding vectors in the data space. The weight vectors of neurons with small U-heights are very close to the neighboring vectors in the data space.

This technique enables a visualization of clusters in data and is particularly good for finding the number of clusters in datasets where it is not predefined.
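The U-height idea can be illustrated with a minimal Python sketch. It assumes a rectangular grid stored as nested lists (`weights[row][col]` is a weight vector) and defines the U-height of a node as the mean Euclidean distance to the nodes within `radius` grid steps; the function name `u_matrix`, the `radius` parameter, and the use of the mean are illustrative assumptions, not the tool's actual implementation.

```python
import math


def u_matrix(weights, radius=1):
    """Return a grid of U-heights for a rectangular SOM.

    Each U-height is the mean Euclidean distance between a node's weight
    vector and the weight vectors of the nodes within `radius` grid steps.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            # Visit the square window around (r, c), clipped to the grid.
            for nr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for nc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (nr, nc) != (r, c):
                        dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1x1 grid has no neighbors; report 0.0 in that case.
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights
```

Large values in the returned grid mark boundaries between groups of similar nodes, and small values mark the interiors of potential clusters.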

K-means clustering algorithm

Then assign each sample point to the nearest cluster by minimizing the point's distance to the centroid. In the particular case where the number of clusters requested is greater than the number of data points (nodes), we assign each data point its own centroid, so that each node is considered to be a cluster. Because the true positions of the centroids are not known in advance, we repeatedly adjust their locations by recalculating each one as the average of the values assigned to its cluster.
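The assignment and update steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: initial centroids are picked at random from the data, iteration stops when the assignments no longer change, and the function name `k_means` and its parameters are hypothetical rather than taken from the tool.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return one cluster label per point."""
    if k >= len(points):
        # At least as many clusters as points: each point is its own cluster.
        return list(range(len(points)))
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels
```

When applied to a trained map, `points` would be the weight vectors of the nodes, so that each node receives a cluster label.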

Geographic Information Systems

Design and Implementation

The figure suggests that the tool should be able to provide different data visualizations through the use of the self-organizing map by asking the user for some input, including the dataset and details about the map and the algorithm. The use case diagram shows the different functions that the user can perform with the tool. Once done, the self-organizing map will be trained and allow viewing of various visualizations generated by the.

Of all these processes, the third process, visualizing the data, covers most of the tool's functionalities and provides the various visualizations to be generated by the system. After this, the radius of the neighborhood and the current learning rate are determined using the time constant, the map size and the initial learning rate. We go through all the nodes that are considered neighbors of the Best Matching Unit and calculate their distances.

Knowing the neighborhood distance and radius, the influence of the BMU on the node is calculated, followed by the adjustment of the weights.
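The training iteration described above can be sketched in Python. This is a minimal illustration, not the tool's code: it assumes a rectangular grid stored as nested lists, a time constant of `n_iter / log(sigma0)`, exponential decay of both the radius and the learning rate, and a Gaussian influence function. All function and parameter names are hypothetical, and `sigma0` must be greater than 1 for the time constant to be positive.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return (row, col) of the node whose weight vector is closest to sample."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, sample))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_step(weights, sample, t, n_iter, sigma0, lr0):
    """Apply one training iteration in place and return the BMU position."""
    lam = n_iter / math.log(sigma0)      # time constant (assumed form)
    sigma = sigma0 * math.exp(-t / lam)  # shrinking neighborhood radius
    lr = lr0 * math.exp(-t / lam)        # decaying learning rate
    br, bc = best_matching_unit(weights, sample)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist_sq = (r - br) ** 2 + (c - bc) ** 2
            if dist_sq <= sigma ** 2:    # node lies inside the neighborhood
                theta = math.exp(-dist_sq / (2 * sigma ** 2))  # BMU influence
                for i in range(len(w)):
                    w[i] += theta * lr * (sample[i] - w[i])
    return br, bc


def train(data, width, height, n_iter, lr0=0.1, seed=0):
    """Train a width x height map on data and return the weight grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(width)]
               for _ in range(height)]
    sigma0 = max(width, height) / 2.0    # initial radius: half the grid size
    for t in range(n_iter):
        train_step(weights, rng.choice(data), t, n_iter, sigma0, lr0)
    return weights
```

For example, `train(data, 10, 10, 5000)` would train a 10 by 10 map on a list of equal-length numeric vectors; the resulting grid can then be passed to a visualization step.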

Figure 6: Use Case Diagram, SOM Visualize

Technical Architecture

There are two ways to do it: the first is to import an Excel file (.xls/.csv) into the tool, and the second is to use the input facility provided by the tool. This means that whenever text is entered in the input file text field, the tool must try to find the path and file. In this case, the user's input is 100, and the tool marked the box red to indicate an error in the input.

The tool's training algorithm is implemented using threads, which means the user can choose to finish the current training or to stop training and save the file to continue training later. Other features of the tool include a console panel, file I/O and a tutorial module. The console panel is responsible for informing the user of errors in data entries and parameters, and of events that are currently occurring in the tool.
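One way to make a threaded training loop stoppable and resumable is sketched below in Python. The tool's own implementation language and design are not stated here, so the class `StoppableTrainer`, its attributes, and the use of `threading.Event` are illustrative assumptions only.

```python
import threading


class StoppableTrainer:
    """Run a training loop on a worker thread that can be stopped and resumed."""

    def __init__(self, total_iterations, step):
        self.total = total_iterations
        self.step = step        # callable that performs one iteration
        self.iteration = 0      # progress a saved file would need to record
        self.stop_event = threading.Event()

    def run(self):
        # Check the stop flag before every iteration so a stop takes effect
        # at an iteration boundary and the saved state stays consistent.
        while self.iteration < self.total and not self.stop_event.is_set():
            self.step(self.iteration)
            self.iteration += 1

    def start(self):
        """Clear any earlier stop request and continue from the current iteration."""
        self.stop_event.clear()
        thread = threading.Thread(target=self.run)
        thread.start()
        return thread

    def stop(self):
        """Ask the worker to stop after the iteration it is currently running."""
        self.stop_event.set()
```

Because `iteration` is kept on the object, calling `start()` again after `stop()` continues from where training left off; saving that counter together with the map weights is what would allow training to resume from a file.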

The file I/O feature of the tool allows the user to save and load previous visualization files. The tool also has two browser-based modules, GIS Visualizer and Tutorial Module. When using GIS Visualizer, the user must enter an XML file generated by the tool.

Through SOM Visualize we are able to see a fuzzy clustering of the data in which data entries that have similarities are placed close to each other on the map. The third visualization module, the Cluster visualization, gives the user a visualization of the definitive clusters that exist in the data through the k-means algorithm. The tool provides a visualization of the clusters, correlations, trends and patterns that exist in the data.

With this, the tool should be able to help researchers in fields that require multidimensional analysis. However, the tool implements only the fundamental visualizations of the self-organizing map. Even so, the tool has opened opportunities for analysis, and further improvements will make it a more powerful tool for visualization.

Although SOM Visualize has successfully achieved its goals, the tool has several features that can be further improved. To enable interactive use, it is proposed that the tool allow the user to add a specific location to the database along with its boundaries and other information. Although this is not our main goal, it is also suggested that the tool let the user know which algorithm will be used to select the clusters.