• Text features are better than image features in correctly identifying the salient elements as salient.
• The informative visual features help in predicting the salient fixation-index of an element with nearly 90% average accuracy.
• The best fixation-index prediction is achieved by thresholding near the median FI of the elements, with micro F1-scores of 68.44% and 72.75% for text and images respectively.
• The average accuracy remains around 90%, while the micro F1-score reduces with increased FI threshold θ.
Motivation for Remainder of the Chapter: On prevalent bi-modal webpages consisting of text and images1, the latter stand out from the rest of the text [21, 165] and draw relatively more attention [112]. Images are highly likely (60–70%) to draw a human's first attention [121] and draw more than 50% of the overall attention as compared to text [123].
Consequently, Jana et al. [78] proposed a regression-based model to computationally predict the attention on image elements. However, the constrained model suffers from the limitations discussed in subsection 2.4.2. In the remainder of this chapter, we present computational approaches towards overcoming those limitations. In Section 4.4, the possible non-linear association between image visual features and user attention is investigated.
In Section 4.5, multiple ground-truth attention assignment techniques are investigated. In Section 4.6, the possibility of assigning multiple ground-truth attention values to the same image element is explored.
4.4. KERNEL-BASED ATTENTION PREDICTION
Polynomial kernel:
$$K(I_k, I_t) = (\mathrm{scale}\,\langle I_k, I_t\rangle + \mathrm{offset})^{\mathrm{degree}} \tag{4.5}$$
This kernel is especially popular in the natural language processing domain. The hyper-parameter scale helps to normalize the pattern without modifying the data, offset provides the bias, and degree > 0 is the degree of the polynomial.
Linear kernel:
$$K(I_k, I_t) = \langle I_k, I_t\rangle \tag{4.6}$$
It is useful for linearly separable data and is the fastest to compute. It is especially popular in text classification.
Hyperbolic tangent kernel:
$$K(I_k, I_t) = \tanh(\mathrm{scale}\,\langle I_k, I_t\rangle + \mathrm{offset}) \tag{4.7}$$
The hyperbolic tangent kernel, also called the sigmoid kernel or multilayer perceptron kernel, is well-known in the neural networks field. Using the sigmoid kernel with an SVM is equivalent to applying a two-layer perceptron. Its hyper-parameters are similar to the polynomial kernel's.
Laplacian kernel:
$$K(I_k, I_t) = \exp(-\sigma\,\lVert I_k - I_t\rVert) \tag{4.8}$$
It is similar to the Gaussian RBF kernel but less sensitive to variations in σ.
Bessel kernel:
$$K(I_k, I_t) = -\mathrm{Bessel}^{\mathrm{degree}}_{\mathrm{order}}\!\left(\sigma\,\lVert I_k - I_t\rVert_2\right) \tag{4.9}$$
The hyper-parameters order and degree are the parameters of the Bessel function. Typically, degree is set to 1, and the kernel is popular in the theory of function spaces of fractional smoothness [141].
ANOVA RBF kernel:
$$K(I_k, I_t) = \sum_{j=1}^{\mathrm{degree}} \prod_{l=1}^{n} K_0\!\left(I_{k_l}^{\,j}, I_{t_l}^{\,j}\right) \tag{4.10}$$
where $K_0(I_{k_l}^{\,j}, I_{t_l}^{\,j})$ is a Gaussian RBF kernel, with $I_{k_l}$ and $I_{t_l}$ indicating the $l$th feature of $I_k$ and $I_t$ respectively. Accordingly, the kernel is similar to the Gaussian RBF and Laplacian kernels.
In the aforementioned kernels, $I_k$ and $I_t$ are the webpage image data-points, where the former is from the training set and the latter from the test set. Here, $\langle I_k, I_t\rangle = I_k^{\top} I_t$ is the inner product and $\lVert I_k - I_t\rVert$ is the Euclidean distance between $I_k$ and $I_t$. For the analysis, the hyper-parameters are set to their default values, that is, scale = 1, offset = 1, degree = 1, and order = 1.
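As an illustration, the kernels above can be sketched directly in NumPy with the stated default hyper-parameters (scale = 1, offset = 1, degree = 1); the function names and the value of σ are placeholders for this sketch, not part of the original formulation:

```python
import numpy as np

# Sketch of the kernel functions on two image feature vectors Ik, It.
# Hyper-parameter defaults follow the text: scale = 1, offset = 1, degree = 1.

def linear(Ik, It):
    return Ik @ It                                    # <Ik, It> = Ik^T It, Eq. (4.6)

def polynomial(Ik, It, scale=1.0, offset=1.0, degree=1):
    return (scale * (Ik @ It) + offset) ** degree     # Eq. (4.5)

def sigmoid(Ik, It, scale=1.0, offset=1.0):
    return np.tanh(scale * (Ik @ It) + offset)        # Eq. (4.7)

def gaussian_rbf(Ik, It, sigma=1.0):
    return np.exp(-sigma * np.sum((Ik - It) ** 2))

def laplacian(Ik, It, sigma=1.0):
    return np.exp(-sigma * np.linalg.norm(Ik - It))   # Eq. (4.8), Euclidean distance
```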
To analyze the performance of the proposed approach we utilized the gaze data from Experiment-II.
Prediction Performance
The webpage image and associated fixation-index data were split into 80:20 train–test proportions for the analysis. In total, θ(θ−1)/2 = 10 binary classifiers were constructed from the training data, and a majority-voting scheme was applied to predict the fixation-index for the test data. 5-fold cross-validation was employed to measure the performance. The prediction procedure was repeated 10 times after randomly permuting the data; the random permutation overcomes possible special structures arising from how the data is split. The performance over all the iterations was averaged to obtain the overall model performance.
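The pairwise construction and majority voting can be sketched as follows. This is an illustrative stand-in, not the actual implementation: a nearest-centroid rule replaces the kernel SVM binary classifiers, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def train_ovo(X, y):
    """Build one binary model per class pair: theta*(theta-1)/2 models in total."""
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        # Nearest-centroid stand-in for a pairwise kernel SVM classifier.
        models[(a, b)] = (X[y == a].mean(axis=0), X[y == b].mean(axis=0))
    return classes, models

def predict_ovo(classes, models, x):
    """Majority voting over all pairwise decisions."""
    votes = {c: 0 for c in classes}
    for (a, b), (ca, cb) in models.items():
        winner = a if np.linalg.norm(x - ca) <= np.linalg.norm(x - cb) else b
        votes[winner] += 1
    return max(votes, key=votes.get)
```

With θ = 5 fixation-index classes this yields the 10 pairwise classifiers described above.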
Table 4.5: Prediction performance at median fixation-index
Method    Average Accuracy (%)    micro F1-score (%)
Gaussian RBF kernel 91.64 79.10
Laplacian kernel 90.90 77.25
Polynomial kernel 89.18 72.95
Linear kernel 89.04 72.60
ANOVA RBF kernel 88.92 72.30
Bessel kernel 87.70 69.25
Sigmoid kernel 87.18 67.95
Jana and Bhattacharya model [78] 73.92 34.80
Random prediction (RP) model 67.63 19.07
The standard multiclass classification performance metrics, Average Accuracy and micro F1-score, are computed. For class-imbalanced data (such as that shown in Figure 4.5b), the micro F1-score is the better indicator of prediction performance, as accuracy is biased towards the prediction performance on the most frequent class (FI = 1 in Figure 4.5b).
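The distinction between the two metrics can be made concrete with a small sketch. The definitions below follow the standard multiclass formulation (Average Accuracy as the mean of per-class binary accuracies); the toy labels are hypothetical. Note that for single-label multiclass prediction the micro F1-score coincides with overall accuracy, while average accuracy is inflated by true negatives on rare classes.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def average_accuracy(cm):
    # Mean over classes of the per-class binary accuracy (TP + TN) / N.
    n, accs = cm.sum(), []
    for i in range(len(cm)):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        accs.append((tp + (n - tp - fp - fn)) / n)
    return float(np.mean(accs))

def micro_f1(cm):
    # Micro-averaged F1 over all classes.
    tp = np.trace(cm)
    fp = fn = cm.sum() - tp  # each misclassification is one FP and one FN
    return float(tp / (tp + 0.5 * (fp + fn)))

# Imbalanced toy example: 8 samples of class 0, one each of classes 1 and 2,
# with both minority samples misclassified as class 0.
cm = confusion([0] * 8 + [1, 2], [0] * 10, 3)
# average_accuracy(cm) ≈ 0.867, micro_f1(cm) = 0.8
```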
The averaged performance metrics are shown in Table 4.5. Among the kernels, the Gaussian RBF achieved the best performance (average accuracy = 91.64% and micro-F1 = 79.10%), followed by the Laplacian (average accuracy = 90.90% and micro-F1 = 77.25%) and polynomial kernels. The sigmoid kernel achieved the lowest performance, with an average accuracy of 87.18% and a micro F1-score of 67.95%. As shown in Table 4.5, the kernels' performance is consistent across the metrics. That is, the kernel with the highest average accuracy
Figure 4.8: Quantitative visual attention prediction performance of each kernel with variation in θ: (a) average accuracy (%) and (b) micro F1-score (%) (micro F1-score = micro precision = micro recall), each plotted against the fixation-index threshold θ (5–12) for the Gaussian RBF, polynomial, linear, sigmoid, Laplacian, Bessel, and ANOVA RBF kernels.
has shown the highest micro F1-score as well, and so on. However, the micro F1-score better highlighted the varied prediction performance of the kernels (ranging from 67.95% to 79.10%) than the average accuracy (ranging from 87.18% to 91.64%). The linear kernel, which linearly models the association between image visual features and fixation-indices, showed median performance among the kernels (average accuracy = 89.04% and micro-F1 = 72.60%). This performance indicates a significant linear association between the considered features and the quantitative visual attention.
Comparison with State-of-the-art: To compare our model's performance with the state-of-the-art model, the constrained regression model [78] was applied to our dataset. As their model restricts the image elements' properties (a total of 6 images distributed into 3 vertical columns, with each column completely containing 2 images), only webpages satisfying the constraints were considered for learning the model parameters. Among the five features, 'Intensity-contrast', 'Chromatic-contrast', 'Size', 'X-position', and 'Y-position', the former two were discarded due to their low correlation magnitude (<0.09 [78]) with the fixation-index (-0.03 and 0.05 respectively). The significant correlations of -0.30, -0.19, and 0.25 for the 'Size', 'X-position', and 'Y-position' features resulted in regression model coefficients of 0.41, 0.26, and 0.33 respectively. As shown in Table 4.5, their model achieved an average accuracy of 73.92% and a micro F1-score of 34.80%. Clearly, all our kernel-based models outperformed the state-of-the-art performance. The rationale for the poor performance of the regression-based model is that our dataset consists of real-world webpages with no constraints on images' properties, whereas their model expects the presence of exactly 6 images on a webpage and no other elements.
Performance vs. θ: To further understand the influence of the fixation-index threshold θ on performance, the metrics were computed with θ varying from 5 to 12, as shown in Figure 4.8.
The Gaussian RBF and Laplacian kernels showed persistently better performance than the other kernels. However, the prediction performance of all kernels reduced with increasing θ.
In particular, the micro F1-score reduction was substantial, whereas the average accuracy reduced only moderately. The performance reduction (beyond the median fixation-index) is attributed to the reduced number of salient webpage images, as on average only four images drew user attention on each webpage.
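The θ sweep can be sketched as follows. The fixation-index labels here are synthetic placeholders, and merging all indices above θ into a single class is an assumed simplification of the thresholding described above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for ground-truth and predicted fixation indices (1..12).
fi_true = rng.integers(1, 13, size=300)
fi_pred = np.where(rng.random(300) < 0.8, fi_true, rng.integers(1, 13, size=300))

acc = {}
for theta in range(5, 13):
    # Merge every fixation index above the threshold into one class.
    t = np.minimum(fi_true, theta)
    p = np.minimum(fi_pred, theta)
    acc[theta] = float(np.mean(t == p))  # single-label micro-F1 = overall accuracy
```

Because clipping at a smaller θ can only merge mismatched labels into matches, the score is non-increasing in θ, mirroring the downward trend in Figure 4.8.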
In summary, the performance metrics demonstrated the efficacy of the proposed approach in predicting the user's attention on webpage images. Sample user attention predictions are shown in Figure 4.9.