3.10 Experimental evaluation
3.10.2 Experimental results
In this phase, we present the key experimental results obtained using the metrics described in the previous phase (experimental procedure), together with certain observations and the reasons behind them.
Key Results: For every dataset, two plots are placed side by side (Figures 3.8(a)-3.8(f), 3.9(a)-3.9(f) and 3.10(a)-3.10(d)): the CPU execution times of iMass and MBSCAN, and the percentage of nodes in the iForest affected by the execution of iMass. Both algorithms show an upward trend in execution time as the base dataset grows through repeated insertions. For the Iris dataset, the iMass algorithm achieved the highest speedup across all datasets, of the order of 2.28 (i.e., about 10^2.28 ≈ 191 times faster than the MBSCAN method). The average speedup for Iris after insertion of the 50th point was of the order of 2.11 (≈ 128.82 times). For the Wine dataset, the highest speedup was about 134 times, with a mean speedup of around 83.17 (≈ order of 1.92). Among the datasets with a lower mean speedup, Aggregation was about 23.44 times faster (≈ order of 1.37), while S1 achieved a mean speedup of about 27.54 (≈ order of 1.44).
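The correspondence between a speedup factor and its "order" (its base-10 logarithm), used throughout the figures above, can be verified with a few lines; the function names below are illustrative, not part of the iMass implementation.

```python
import math

def speedup_order(speedup: float) -> float:
    """Order of magnitude of a speedup factor (base-10 logarithm)."""
    return math.log10(speedup)

def speedup_from_order(order: float) -> float:
    """Speedup factor corresponding to an order of magnitude."""
    return 10 ** order

# Figures reported in the text: an order of 2.28 corresponds to
# roughly a 191x speedup, and an order of 2.11 to roughly 128.8x.
highest = speedup_from_order(2.28)   # ≈ 190.5
average = speedup_from_order(2.11)   # ≈ 128.8
```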
While computing the percentage of affected nodes in the iForest, we observed that for the Iris dataset an average of 26.54% of nodes were affected after all 50 points had been inserted. For the largest dataset, Segment, with an insertion size of 770 points, the mean percentage of affected nodes was as low as 8.94%. The smaller datasets, viz. Wine and Seeds, had a higher mean percentage of affected nodes, at 24.5% and 24.61% respectively. This is because iForests constructed from larger datasets have a greater number of nodes per iTree than those built from smaller datasets, so the nodes affected by an insertion form a smaller fraction of the total.
Out of all datasets, Segment consumed the most memory because of its greater size. The average memory consumption of iMass stood at 164.72 MB, while MBSCAN took about 113.055 MB. The mean memory consumption of the iMass algorithm was about 66.77% more than that of MBSCAN (see Tables 3.6, 3.9), i.e., |Mem_point-ins − Mem_naive| / Mem_naive ≈ 0.67. The higher memory consumption of iMass can be attributed to the maintenance of the lowest node id for every pair of points in each of the t iTrees. Moreover, certain additional memory was also required for storing the parameters carried over from the MBSCAN algorithm across all insertions.
Table 3.9: Memory consumed due to iMass.

Dataset       Memory consumed
Libras        28.67 MB
Segment       632.84 MB
Wine          9.42 MB
Seeds         12.76 MB
Aggregation   137.33 MB
Iris          7.95 MB
S1            158.67 MB
S2            330.132 MB
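The relative-overhead metric quoted above can be sketched as follows. The per-dataset figures used here are purely hypothetical placeholders, not the values of Tables 3.6 and 3.9; only the formula is taken from the text.

```python
def relative_overhead(mem_point_ins: float, mem_naive: float) -> float:
    """Relative extra memory of the incremental method over the naive one:
    |Mem_point-ins - Mem_naive| / Mem_naive."""
    return abs(mem_point_ins - mem_naive) / mem_naive

# Hypothetical (iMass MB, MBSCAN MB) pairs for three datasets,
# chosen only to illustrate a mean overhead near 0.67.
pairs = [(9.4, 5.6), (28.7, 17.3), (137.3, 82.1)]
mean_overhead = sum(relative_overhead(a, b) for a, b in pairs) / len(pairs)
```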
Next, we discuss certain observations, and the reasons behind them, based on the experimental results that were obtained.
1. Efficiency comparison of iMass and MBSCAN
Key observation(s): In the first plot for every dataset (Figure 3.8(a) through Figure 3.10(c)), we observed that for each point insertion, the CPU execution time of MBSCAN upon D′ (the updated dataset) is much higher than that of the iMass algorithm, making iMass the more efficient of the two.
Reason(s): The iMass algorithm avoids rebuilding the iForest after any new insertion. Only the nodal masses of the affected nodes are updated incrementally. Moreover, while constructing the new mass-matrix, the pairwise dissimilarity score between any two existing points in D (the base dataset) is evaluated incrementally (Section 3.7.4).
2. Lower percentage of affected nodes
Key observation(s): In the second plot for each dataset (Figure 3.8(b) through Figure 3.10(d)), we observe that only a small percentage of nodes (< 28% across datasets) were affected by the insertion of any new point.
Reason(s): A newly inserted point uses the node-splitting criterion of the MBSCAN algorithm to find the appropriate nodes while positioning itself in any iTree. Therefore, only a handful of nodes across the iTrees of a given iForest contain the newly entered point.
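The routing behaviour described above can be sketched as follows; the Node structure and function names are illustrative assumptions, since the thesis does not fix an implementation. The key point is that an insertion touches only one root-to-leaf path per iTree, so in a balanced tree of n nodes roughly O(log n) nodes are affected.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    node_id: int
    split_attr: Optional[int] = None   # None marks a leaf
    split_value: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def route_point(root: Node, point: List[float]) -> List[int]:
    """Route a new point down an iTree via the node-split criterion;
    only the nodes on this root-to-leaf path are affected."""
    affected, node = [], root
    while node is not None:
        affected.append(node.node_id)
        if node.split_attr is None:            # reached a leaf
            break
        node = node.left if point[node.split_attr] < node.split_value else node.right
    return affected

# A toy 5-node iTree: the insertion path visits 3 of the 5 nodes here,
# and the affected fraction shrinks as trees grow deeper.
tree = Node(0, split_attr=0, split_value=0.5,
            left=Node(1),
            right=Node(2, split_attr=1, split_value=0.3,
                       left=Node(3), right=Node(4)))
affected = route_point(tree, [0.7, 0.1])       # path: 0 -> 2 -> 3
```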
3. Efficiency achieved due to iMass while building the iForest incrementally
[Figure 3.8: Efficiency comparison between iMass and MBSCAN, along with the percentage of affected nodes due to iMass, for the Libras, Segment and Wine datasets. Panels (a), (c) and (e) plot the CPU execution time (seconds) of iMass and MBSCAN against the total number of points after every insertion; panels (b), (d) and (f) plot the percentage of affected nodes in the iForest after every insertion. Settings: Libras (base dataset size 240, 120 points added, 20 iTrees, core point threshold 5); Segment (base size 1540, 770 points added, 20 iTrees, threshold 10); Wine (base size 118, 60 points added, 20 iTrees, threshold 5).]
[Figure 3.9: Efficiency comparison between iMass and MBSCAN, along with the percentage of affected nodes due to iMass, for the Seeds, Aggregation and Iris datasets. Panels (a), (c) and (e) plot the CPU execution time (seconds) of iMass and MBSCAN against the total number of points after every insertion; panels (b), (d) and (f) plot the percentage of affected nodes in the iForest after every insertion. Settings: Seeds (base dataset size 140, 70 points added, 21 iTrees, core point threshold 7); Aggregation (base size 525, 253 points added, 24 iTrees, threshold 9); Iris (base size 100, 50 points added, 20 iTrees, threshold 7).]
[Figure 3.10: Efficiency comparison between iMass and MBSCAN, along with the percentage of affected nodes due to iMass, for the S1 and S2 datasets. Panels (a) and (c) plot the CPU execution time (seconds) of iMass and MBSCAN against the total number of points after every insertion; panels (b) and (d) plot the percentage of affected nodes in the iForest after every insertion. Settings: S1 (base dataset size 600, 300 points added, 20 iTrees, core point threshold 9); S2 (base size 1000, 500 points added, 20 iTrees, threshold 10).]
Key observation(s): A significant efficiency gain (in time required) is observed for the iMass algorithm as compared to MBSCAN while building the iForest (refer Table 3.10).
Reason(s): The iMass algorithm avoids reconstructing the iForest afresh upon insertion of any new data point. Instead, once a point is inserted, the node-split criterion at each previously positioned node of an iTree directs the newly entered point to the appropriate child nodes. Points within unaffected nodes retain their prior positions in an iTree without any changes. Moreover, no additional iTrees are added to the iForest.
4. Efficiency achieved while building the mass-matrix incrementally
Key observation(s): A significant efficiency gain (in time required) is observed (refer Table 3.10) while building the mass-matrix for the iMass algorithm as compared to MBSCAN.
Reason(s): For any pair of points (x, y), the lowest node ids are stored while determining the Algo-Comp values. Whenever a new point places itself into the appropriate nodes of an iTree, the iMass algorithm accesses the lowest node id of (x, y) within that iTree in order to check whether the new point has penetrated that designated lowest-level node. If the newly entered point is present within that lowest-level node, then the mass of (x, y) w.r.t. that iTree is increased by one (refer Section 3.7.4). In this way, iMass avoids recomputing the mass values for any pair of existing points (x, y) ∈ D after a new insertion. Moreover, from Figure 3.8(b) through Figure 3.10(d), we observe that the percentage of affected nodes, including the lowest-level nodes, in the iForest reaches a maximum of 27% across all datasets. Together, these reasons reduce the time required to compute the pairwise dissimilarities in the new mass-matrix.
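The incremental mass update described above can be sketched as follows. The data-structure names (lowest_node, mass, affected_per_tree) are illustrative assumptions rather than the thesis's actual implementation; the point is that each pair's mass is adjusted by a lookup, never recomputed from scratch.

```python
def update_pair_masses(lowest_node, mass, affected_per_tree):
    """After a new point is routed into every iTree, raise the mass of each
    existing pair (x, y) whose stored lowest shared node the new point
    entered; no pairwise mass is recomputed from scratch."""
    for pair, per_tree in lowest_node.items():
        for tree_idx, node_id in per_tree.items():
            if node_id in affected_per_tree.get(tree_idx, set()):
                mass[pair] += 1   # new point fell into their lowest shared node
    return mass

# Hypothetical state for a forest of t = 2 iTrees.
lowest_node = {("x1", "x2"): {0: 5, 1: 2}}   # lowest shared node id per iTree
mass = {("x1", "x2"): 7}                     # current pairwise mass over the forest
affected_per_tree = {0: {1, 5}, 1: {3}}      # nodes the new point entered, per iTree
mass = update_pair_masses(lowest_node, mass, affected_per_tree)
# Only iTree 0's lowest shared node (id 5) was entered, so the mass rises by 1.
```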