6.2.2 Stage-II: Prediction Model
Consider T and I represent the text and image elements (represented using the visual features) for which the multi-level attention needs to be predicted. Let a total of h (the minimum of m and n) linear transformation directions be obtained from the aforementioned correlation model. The h directions are arranged into the transformation matrices for text and images respectively as Wt = [wt1 | wt2 | . . . | wth]_{m×h} and Wi = [wi1 | wi2 | . . . | wih]_{n×h}.
The projected elements are obtained as:

T′ = T × Wt × Wi+    (6.9)
I′ = I × Wi × Wt+    (6.10)

where T′ denotes the text projected into the image space, I′ the images projected into the text space, and Wi+ and Wt+ indicate the Moore-Penrose inverse (pseudo-inverse) of Wi and Wt respectively.
The unified elements (ensemble of the original feature-space elements and the projected elements) in the text space are represented as [T; I′]_{d×m}, and analogously, the unified elements in the image space as [T′; I]_{d×n}; where d represents the number of unified elements. Considering the i-th row of either of the matrices as an element Ei, the associated multi-level attention is indicated as fi.
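Under the shapes stated above (Wt of size m×h, Wi of size n×h), the projection and unification steps can be sketched with NumPy; all array sizes and names here are illustrative, not taken from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 9                      # text / image visual-feature dimensions
h = min(m, n)                     # number of linear transformation directions
Wt = rng.standard_normal((m, h))  # text transformation matrix, m x h
Wi = rng.standard_normal((n, h))  # image transformation matrix, n x h

T = rng.standard_normal((5, m))   # 5 text elements in the text visual space
I = rng.standard_normal((4, n))   # 4 image elements in the image visual space

# Eq. 6.9: text projected into the image space (pinv gives Wi+)
T_proj = T @ Wt @ np.linalg.pinv(Wi)   # shape (5, n)
# Eq. 6.10: images projected into the text space
I_proj = I @ Wi @ np.linalg.pinv(Wt)   # shape (4, m)

# Unified elements: stack the originals with the projections (d = 9 rows)
U_text = np.vstack([T, I_proj])        # [T; I'] of shape (d, m), text space
U_image = np.vstack([T_proj, I])       # [T'; I] of shape (d, n), image space
print(U_text.shape, U_image.shape)     # (9, 12) (9, 9)
```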
However, not all levels of user attention are equally important. In particular, the later attention may be influenced by factors apart from visual features, such as semantic features or exhaustion of the salient elements. Thus, we introduce thresholding on the multi-levels to segregate and combine the non-prominent attention levels (the later fixation-indices).
Analogous to the fixation-index, an increasing attention-level value indicates decreasing prominence. Thus, thresholding the maximum attention-level to θ (called the saliency-threshold) results in θ classes, each corresponding to an attention-level, i.e., fk ∈ {1, . . . , θ}; k = 1, . . . , d. Accordingly, the multi-level attention prediction on web elements transforms into a polychotomous classification problem with θ classes.
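A minimal sketch of the thresholding step (the fixation-index values here are made up): every attention level beyond θ collapses into the θ-th class.

```python
import numpy as np

theta = 5                                        # saliency-threshold
fixation_index = np.array([1, 3, 7, 5, 12, 2])   # illustrative attention levels
f = np.minimum(fixation_index, theta)            # merge later levels into class θ
print(f)                                         # [1 3 5 5 5 2]
```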
We utilize the multiclass support vector machine (multiclass SVM) based approach to solve this polychotomous prediction problem, as described in subsection 2.6.1.
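As a sketch of the θ-class prediction, scikit-learn's SVC (which handles multiclass classification internally via one-vs-one) can be fit on the unified elements; the features and labels below are synthetic stand-ins, and the kernel choice is an assumption:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 10))   # stand-in unified element features
y = rng.integers(1, 6, size=120)     # attention levels in {1, ..., θ=5}

clf = SVC(kernel="rbf").fit(X, y)    # multiclass SVM over θ classes
pred = clf.predict(X)
print(sorted(set(pred)))             # predicted classes are drawn from 1..5
```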
We analyze the performance of the proposed approach on two real-world webpage datasets.
6.3 Prediction Performance Using WG
[Figure 6.2: Experiment-I: Prediction performance metrics with variation in θ. Panel (a) plots the average accuracy (%) and panel (b) the micro-F1 score (%) against the saliency-threshold θ, for the Text space, Image space, and Random baselines.]

6.3.1 Prediction Performance on Same Dataset
For the analysis in this section, the fixation-data from Experiment-I (described in Section 3.2) is used in both the stages.
Ground-truth Preparation: For each fixated web element, the fixation-indices from all users are obtained. The fixation-index resulting from the application of the majority-voting scheme is assigned as the ground-truth multi-level attention. In case of a conflict (multiple majority-voted fixation-indices), the fixation-index with the lower value is assigned, as an indicator of the element's attention-drawing ability.
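The majority-voting with lower-index tie-breaking can be sketched as follows (the helper name is hypothetical):

```python
from collections import Counter

def ground_truth_attention(fixation_indices):
    """Majority-voted fixation-index; ties resolved toward the lower value."""
    counts = Counter(fixation_indices)
    top = max(counts.values())
    # Among all majority-voted indices, pick the lowest (most prominent)
    return min(k for k, v in counts.items() if v == top)

print(ground_truth_attention([2, 3, 3, 2, 5]))  # tie between 2 and 3 -> 2
```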
θ selection: The maximum of the median fixation-indices is considered as the saliency-threshold in Equation 4.3. The median fixation-indices of text and images (from Experiment-II) are 5 and 4 respectively. Accordingly, θ = 5 is considered for the prediction performance analysis.
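The threshold choice stated here reduces to taking the larger of the two medians:

```python
median_text, median_image = 5, 4          # median fixation-indices reported above
theta = max(median_text, median_image)    # saliency-threshold
print(theta)                              # 5
```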
Procedure: The image elements are projected from the Image Visual Space into the Text Visual Space (via the Common Visual Space) using Equation 6.10, where the significant canonical directions are utilized for Wt and Wi. On the unified data, the procedure described in subsection 6.2.2 is applied with 5-fold cross-validation over 10 iterations. The prediction performance metrics (average accuracy and micro-F1 score) computed in each iteration are averaged to obtain an overall prediction performance. Further, the whole procedure is repeated by projecting the text elements from the Text Visual Space into the Image Visual Space via the Common Visual Space.
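The evaluation loop can be sketched with scikit-learn on synthetic data. The data, the kernel choice, and the `average_accuracy` definition (taken here as the mean of per-class one-vs-rest accuracies) are assumptions; note that plain multiclass accuracy would coincide with the micro-F1 score, so a per-class average is one way the two metrics can differ as they do in the reported results.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def average_accuracy(y_true, y_pred, classes):
    # Mean of per-class binary (one-vs-rest) accuracies -- an assumption
    return np.mean([np.mean((y_true == c) == (y_pred == c)) for c in classes])

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))    # stand-in unified element features
y = rng.integers(1, 6, size=150)      # attention levels in {1, ..., θ=5}

accs, f1s = [], []
for it in range(10):                  # 10 iterations of 5-fold cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=it)
    for tr, te in skf.split(X, y):
        pred = SVC(kernel="rbf").fit(X[tr], y[tr]).predict(X[te])
        accs.append(average_accuracy(y[te], pred, classes=range(1, 6)))
        f1s.append(f1_score(y[te], pred, average="micro"))

print(f"average accuracy {np.mean(accs):.3f}, micro-F1 {np.mean(f1s):.3f}")
```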
Baseline Selection: Very few of the existing approaches are centered on element-granular attention prediction, let alone multi-level attention prediction. The location-based saliency-oriented approaches (though limited to binary-level predictions) cannot be utilized as the baseline since two locations corresponding to the same element may indicate one as salient and the other as non-salient. On the other hand, pattern-oriented approaches are centered on eliciting an attention-pattern rather than prediction. Though the work in [19] considered limited visual features of web elements, its attention prediction is proposed for task-dependent settings and constrained to binary-level prediction. Thus, we utilize random prediction as the baseline (analogous to [32, 118]) to comprehend the performance of the proposed approach.

[Figure 6.3: Experiment-I: Example multi-level attention predictions on elements, showing ground-truth against predicted attention for the text and image modalities.]
At median saliency-threshold, the multi-level attention was predicted at an average accuracy of 82.79% and the micro-F1 score of 56.98% in the Text Visual Space. The attention prediction in Image Visual Space also achieved similar performance with the average accuracy of 81.72%
and micro-F1 score of 54.29%. Both metrics outperformed the baseline random prediction, which achieved an average accuracy of 68.03% and a micro-F1 score of 20.07%. To further understand the influence of the saliency-threshold, the performance metrics are computed with each unitary increment in θ, as shown in Figure 6.2. The average accuracy gradually increased and the micro-F1 score gradually reduced with θ towards saturation. Nevertheless, throughout the variation in θ, the approach outperformed the baseline. The contrasting variation in the performance metrics is attributed to the class-imbalance, which is better accounted for by the micro-F1 score. The prediction performance is consistent across the visual spaces.
Thus, the research questions R1 and R2 are answered when the interface idiosyncrasies are constrained. However, the observed variation in prediction performance is attributed to the utilization of a low-dimensional (28 significant canonical directions) Common Visual Space for the elements' unification, where not all the dimensions achieved optimal correlation between text and images. Example predictions from the proposed approach are shown in Figure 6.3.
6.4 Prediction Performance Using WUG