

• The modality-independent features, size and position, are informative for both text and images, analogous to the information foraging and page recognition task settings [19].


performance is considered. The same procedure was followed for image elements, and the analysis was conducted using the ‘rpart’ (Recursive Partitioning for Classification) and ‘caret’ (Classification and Regression Training) packages from the R language [54, 149].
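For reference, a minimal sketch of this decision-tree step with ‘rpart’ and ‘caret’ is shown below. The data frame `elems`, its feature columns, and the binary label `salient` are illustrative assumptions rather than names from our pipeline; the evaluation call reports the precision, recall, and Cohen’s κ discussed with Table 4.3.

```r
library(rpart)
library(caret)

# Hypothetical data frame: one row per text (or image) element with its
# visual features and a binary salient/non-salient label derived from the
# fixation-index threshold theta.
# elems <- data.frame(size = ..., pos_x = ..., pos_y = ..., salient = factor(...))

set.seed(42)
idx   <- createDataPartition(elems$salient, p = 0.8, list = FALSE)
train <- elems[idx, ]
test  <- elems[-idx, ]

# Grow a classification tree on the informative features.
tree <- rpart(salient ~ ., data = train, method = "class")

# Evaluate: confusionMatrix reports precision, recall, and Cohen's kappa.
pred <- predict(tree, test, type = "class")
confusionMatrix(pred, test$salient, mode = "prec_recall")
```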

Results: The decision-tree-based metrics for θ = 5 and 10 are shown in Table 4.3. The metrics indicate a very good performance of the informative features in identifying the salient elements at the fixation-index threshold (θ = 5) close to the median FI. Both the precision and recall are higher than the respective task-dependent performances reported in [19], where a decision tree was constructed to investigate whether an element obtains a fixation or not. The higher performance in the free-viewing case is attributed to the prominence of visual features in guiding the initial fixations, whereas in task-dependent settings users’ idiosyncratic expectation biases additionally direct attention. The relatively higher precision for images (95.83%) than for text (77.29%) indicates that image visual features are more influential than text features in deciding whether an element is salient or not. In contrast, the higher recall for text (93.72%) than for images (82.14%) indicates that the visual features of salient text elements separate them from non-salient elements better than the corresponding image features do. The overall agreement in identifying the salient elements is substantial [102] for images (κ = 0.6739) but only fair for text (κ = 0.2505). However, the model performance reduces with an increase in θ (see the performance for θ = 10 in Table 4.3), as fewer visual features remain informative and the later fixations are less likely to be driven by visual features.

The decision-tree-based analysis helped in segregating the salient elements from the non-salient ones. To further quantify the saliency of the identified elements and to predict the ordinal visual attention, we performed multiclass classification using a Linear Support Vector Machine (SVM) as described in subsection 2.6.1.

4.3.2 Ordinal Visual Attention Prediction on Web Elements

Ordinal visual attention or Fixation-Index (FI) prediction is a multiclass classification problem: each possible value of FI represents a class. Unlike binary classification (salient or non-salient determination), multiclass (number of classes > 2) classification (prediction of FI) is a harder problem to solve, since a single boundary cannot separate all the classes from each other. For θ classes, C(θ, 2) = θ(θ − 1)/2 separating boundaries need to be determined to separate each class from every other class; that is, θ(θ − 1)/2 binary classifiers are required. Accordingly, multiclass classification requires learning θ(θ − 1)/2 times the number of parameters of a single binary classifier. For this ordinal visual attention prediction, we utilized the approach described in subsection 2.6.1. The analysis was carried out using the ‘rminer’ (Data Mining Classification and Regression Methods) [31] and ‘kernlab’ (Kernel-Based Machine Learning Lab) [83] packages in R.
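As a concrete illustration, the sketch below shows how such a one-vs-one linear SVM could be set up with ‘kernlab’, which for a factor response with more than two levels internally builds the θ(θ − 1)/2 pairwise binary classifiers and combines them by majority voting. The data frame `elems`, its feature columns, and the factor column `fi` are illustrative assumptions, not names from our pipeline.

```r
library(kernlab)

# Hypothetical data frame: one row per web element, with the informative
# visual feature columns and the (thresholded) fixation index as a factor.
# elems <- data.frame(size = ..., pos_x = ..., pos_y = ..., fi = factor(...))

set.seed(42)
idx   <- sample(nrow(elems), 0.8 * nrow(elems))   # simple hold-out split
train <- elems[idx, ]
test  <- elems[-idx, ]

# C-svc with a linear kernel ("vanilladot"): ksvm fits all pairwise
# (one-vs-one) binary classifiers and predicts by majority voting over them.
model <- ksvm(fi ~ ., data = train, type = "C-svc", kernel = "vanilladot")
pred  <- predict(model, test)

table(pred, truth = test$fi)   # confusion matrix for inspection
```

This matches the one-vs-one scheme with majority voting outlined in Figure 4.6, without having to construct the pairwise classifiers by hand.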

Figure 4.6: Ordinal visual attention prediction procedure. (Model building: gaze data → extract fixations → assign fixation indices → identify elements from fixation positions → apply thresholding θ on fixation indices → extract elements’ visual features → identify the informative visual features → build binary one-vs-one classifiers. Prediction: test web element → extract the values of informative features → apply all binary classifiers → employ majority voting scheme → predicted fixation index.)

Table 4.4: Ordinal visual attention prediction performance

Type    Model      Average Accuracy (%)   micro F1-score (%)
Text    Our model  87.37                  68.44
Text    Baseline   68.48                  21.20
Image   Our model  89.10                  72.75
Image   Baseline   67.84                  19.61

Data Preparation: The frequency distributions of the FIs for text and images are shown in Figure 4.5. The FIs are reassigned using Equation 4.3: an element retains its original FI if it is less than the threshold θ; otherwise, it is replaced with the threshold FI. For elements with multiple FIs, a majority-voting scheme assigns the ground-truth FI, with conflicts resolved at random.
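In code form, the reassignment of Equation 4.3 is a clamp at θ, and the ground-truth assignment is a majority vote with random tie-breaking. A minimal sketch follows; the function and argument names are illustrative.

```r
# Equation 4.3 as described above: keep the original FI when it is below the
# threshold theta, otherwise replace it with the threshold FI.
reassign_fi <- function(fi, theta) pmin(fi, theta)

# Ground-truth FI for an element observed with multiple FIs: majority vote,
# resolving ties uniformly at random.
majority_fi <- function(fis) {
  counts  <- table(fis)
  winners <- names(counts)[counts == max(counts)]
  as.integer(winners[sample.int(length(winners), 1)])
}

# e.g. with theta = 5: FIs c(3, 7, 3, 9) clamp to c(3, 5, 3, 5),
# giving votes {3: 2, 5: 2}, a tie resolved at random.
majority_fi(reassign_fi(c(3, 7, 3, 9), theta = 5))
```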

Results: The ordinal visual attention prediction results at the median-FI thresholding are shown in Table 4.4. Clearly, the multiclass model based on the informative intrinsic visual features outperformed the baseline model. The text-based model exceeded the baseline micro F1-score by 222.83% (68.44% vs. 21.20%) and the baseline Average Accuracy by 27.58% (87.37% vs. 68.48%). Similarly, the image-based model exceeded the baseline micro F1-score by 270.98% (72.75% vs. 19.61%) and the baseline Average Accuracy by 31.34% (89.10% vs. 67.84%).
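These gains are relative improvements over the baseline values in Table 4.4; for example, for the text model:

```r
# Relative improvement of our model over the baseline (text, micro F1):
(68.44 - 21.20) / 21.20 * 100   # ≈ 222.83%
# Likewise for Average Accuracy (text):
(87.37 - 68.48) / 68.48 * 100   # ≈ 27.58%
```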


Figure 4.7: Prediction performance of informative intrinsic visual features with variation in θ. (a) Average Accuracy (%) and (b) micro F1-score (%) versus fixation-index threshold θ = 5–16, for Text and Image, Our Model and Baseline.

The relatively superior performance on images over text is attributed to the higher information-gain scores of the image intrinsic visual features. The seemingly high baseline Average Accuracies (68.48% for text and 67.84% for images) are attributed to the class imbalance (see Figure 4.5), since accuracy tends to be biased towards the prediction of the most frequent class.
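The contrast between the two metrics can be made precise: Average Accuracy scores each class one-vs-rest, so abundant true negatives under imbalance inflate it, whereas micro F1 pools TP/FP/FN across classes and is not inflated by true negatives. A sketch of both, computed from a confusion matrix (rows = predictions, columns = ground truth; this orientation is our assumption):

```r
# Average Accuracy: mean of per-class one-vs-rest accuracies.
avg_accuracy <- function(cm) {
  n <- sum(cm)
  mean(sapply(seq_len(nrow(cm)), function(i) {
    tp <- cm[i, i]
    fp <- sum(cm[i, -i])      # predicted class i, truth is another class
    fn <- sum(cm[-i, i])      # truth is class i, predicted another class
    tn <- n - tp - fp - fn    # dominates when class i is rare
    (tp + tn) / n
  }))
}

# micro F1: pool TP/FP/FN over classes. In single-label multiclass
# classification the pooled FP and FN totals coincide, so micro precision,
# micro recall, and micro F1 all reduce to sum(diag(cm)) / n.
micro_f1 <- function(cm) sum(diag(cm)) / sum(cm)
```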

Variation in θ: To further understand the influence of θ on prediction performance, the prediction procedure was repeated for each increment in θ. The Average Accuracy and micro F1-scores are shown in Figure 4.7. The Average prediction Accuracy for text and images remained above 86% (and close to 90%) throughout the variation in θ and consistently outperformed the baseline models. However, the Average Accuracy of the baseline models (for both text and images) progressively increased with each increment in the FI threshold θ, corresponding to the increase in class imbalance with an increasing FI threshold.

In contrast, the micro F1-score, which is robust to such class imbalance (or bias), remained lower for the baseline models throughout the variation in θ. The prediction performance reduced with an increase in the FI threshold. This is attributed to the shrinking set of remaining salient elements of a webpage, over which user attention allocation varied more diversely, resulting in relatively weaker performance.
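Putting the pieces together, the θ-sweep behind Figure 4.7 can be reproduced in outline as below, reusing reassign_fi(), avg_accuracy(), and micro_f1() from the earlier sketches; the data frame `elems` and its column `fi_raw` remain illustrative assumptions.

```r
library(kernlab)
set.seed(42)

sweep <- t(sapply(5:16, function(theta) {
  d        <- elems
  d$fi     <- factor(reassign_fi(d$fi_raw, theta))   # Equation 4.3 clamp
  d$fi_raw <- NULL
  idx  <- sample(nrow(d), 0.8 * nrow(d))             # hold-out split
  m    <- ksvm(fi ~ ., data = d[idx, ], type = "C-svc", kernel = "vanilladot")
  pred <- predict(m, d[-idx, ])
  cm   <- table(pred, truth = d[-idx, "fi"])
  c(theta = theta, avg_acc = avg_accuracy(cm), micro_f1 = micro_f1(cm))
}))
sweep   # one row of metrics per threshold, as plotted in Figure 4.7
```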

Key Findings of Analysis-III

• Informative visual features help in segregating the salient elements (fixation index up to the median FI) from the non-salient ones.

• Image features are better than text features in deciding whether an element is salient or not.

• Text features are better than image features in correctly identifying the salient elements as salient.

• The informative visual features help in predicting the salient fixation-index of an element with nearly 90% average accuracy.

• The best fixation-index prediction is achieved with thresholding near the median FI of the elements, with micro F1-scores of 68.44% and 72.75% for text and images, respectively.

• The average accuracy remains around 90% and the micro F1-score reduces with an increased FI threshold θ.

Motivation for Remainder of the Chapter: On prevalent bi-modal webpages consisting of text and images, the latter stand out from the rest of the text [21, 165] and draw relatively more attention [112]. Images are highly likely (60–70%) to draw a user’s first attention [121] and receive more than 50% of overall attention as compared to text [123].

Consequently, Jana et al. [78] proposed a regression-based model to computationally predict the attention on image elements. However, the constrained model suffers from the limitations discussed in subsection 2.4.2. In the remainder of this chapter, we present computational approaches towards overcoming those limitations. In Section 4.4, the possible non-linear association between image visual features and user attention is investigated.

In Section 4.5, multiple ground-truth attention assignment techniques are investigated. In Section 4.6, the possibility of assigning multiple ground-truth attention values to the same image element is explored.