Analyses of user attention collected on real-world web pages reveal that (i) the achieved unification outperforms the random baseline in predicting user attention to elements; (ii) the presented model effectively predicts attention to all elements, whether unified in the visual text space or in the visual image space; and (iii) the model also outperforms the baseline when predicting attention on unseen web page data. Extension: because the entire web page is processed as an image, the importance of the individual elements that make up the page is overlooked.
Scope of the Thesis
Task-based: capture the eye movements made when the user deliberately looks at the page to perform a specific task. Because the thesis focuses on bottom-up attention modeling, i.e., modeling the influence of the visual features of web pages on the user's attention, the eye-tracking experiments are limited to free viewing.
Motivation
With the steady expansion of the Internet and its user base, attention models for experienced web users are increasingly desirable. To summarize, the aim of the thesis is to model the attention of experienced adult users when freely viewing multimodal static web pages as displayed on desktop monitors.
Objectives and Contributions
R1: Are the text and image visual features correlated based on the free-viewing user's attention allocation on bi-modal web pages? R2: Which intrinsic visual features are informative in explaining the free-viewing ordinal attention to web elements?
Thesis Organization
The background includes (i) the introduction of a web page as a Document Object Model (DOM) in Section 2.1, (ii) eye tracking as a technique to capture eye movements in Section 2.2, and (iii) the association between eye movements and visual attention, along with the terminology of eye movements, in Section 2.3. The related works include (i) the uni-modal (text-only or image-only) analyses that motivated the feature selection for our analysis (Section 2.4), and (ii) the related web page-based analyses that contextualize the constraints of the present work.
Eye-tracking
Eye-tracking technologies have evolved over the years in capturing physiological human eye movements. In neuroscience, eye tracking has been studied to establish the relationship between eye movements and visual attention.
Visual Attention and Terminology
A stable alignment of the foveal visual angle (1–2°) results in a fixation, an indicator of visual perception of the underlying element. Fixation indices (FI) are ordinal numbers associated with fixations that indicate the sequential order of the fixations.
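For concreteness, fixations are commonly detected from raw gaze samples with dispersion-based algorithms. Below is a minimal Python sketch of an I-DT-style detector that also assigns 1-based fixation indices; the thresholds and data are illustrative assumptions, not the thesis's actual detection pipeline.

```python
import numpy as np

def detect_fixations(gaze, max_dispersion=30.0, min_samples=6):
    """Minimal I-DT-style fixation detection (illustrative thresholds).

    gaze: (N, 2) array of raw (x, y) gaze samples in pixels. A window is a
    fixation when its dispersion (x-range + y-range) stays below the
    threshold; each detected fixation gets a 1-based fixation index (FI).
    """
    fixations, start = [], 0
    while start <= len(gaze) - min_samples:
        end = start + min_samples
        window = gaze[start:end]
        if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) <= max_dispersion:
            # Grow the window while dispersion stays within the threshold.
            while end < len(gaze):
                grown = gaze[start:end + 1]
                if np.ptp(grown[:, 0]) + np.ptp(grown[:, 1]) > max_dispersion:
                    break
                end += 1
            cx, cy = gaze[start:end].mean(axis=0)
            fixations.append((len(fixations) + 1, cx, cy))  # (FI, centroid)
            start = end
        else:
            start += 1
    return fixations

rng = np.random.default_rng(0)
gaze = np.vstack([
    [300, 200] + rng.normal(0, 2, size=(12, 2)),  # stable cluster -> fixation 1
    [900, 500] + rng.normal(0, 2, size=(12, 2)),  # stable cluster -> fixation 2
])
print(detect_fixations(gaze))
```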
Modality-specific Visual Feature based Analyses
Text Modality
These works motivated us to consider text- and background-related color features for the attention analysis. We also used font size to account for the number of characters per line, besides its own influence.
Image Modality
Motivated by this work, we included font family and font size features in the analysis of our proposed model. For web pages, all the above-mentioned attribute values are manipulated by web designers through Cascading Style Sheets (CSS) properties.
Attention Models for Webpages
User Attention-based (UAB) approaches
Calculating transition probabilities: these techniques determine transition probabilities between visual elements. Common scan path identification: in contrast to the aforementioned approaches, these techniques obtain a representative scan path (called the common scan path) from a given set of scan paths.
Visual Feature-based (VFB) approaches
UAB approaches are descriptive rather than predictive of attention; thus, user attention data is required for each test website. Collecting this data on every website, and especially after every minor modification to a website, may not be practical due to time constraints, monetary constraints, and limited user availability.
Element Visual Features’ and User Attention based Analysis
Prediction Approaches
Multi-class Classification using SVM
Analogous to Equation 2.1, the optimization problem in Equation 2.4 is solved using the Lagrangian formulation and primal-dual reformulations [22, 30]. Precision is the proportion of correctly predicted cases among all positively predicted cases.
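As a concrete illustration, the following Python sketch trains a multi-class SVM (scikit-learn's SVC, which uses one-vs-one decomposition internally) and reports micro-averaged precision; the feature matrix, labels, and parameters are illustrative assumptions, not the thesis's data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))     # 12 visual features per element (illustrative)
y = rng.integers(1, 5, size=200)   # fixation-index classes 1..4 (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Micro-averaged precision: correctly predicted cases over all positive predictions.
print(precision_score(y_te, y_hat, average="micro"))
```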
Multi-label Classification
Thus, distinct performance metrics have been introduced in the literature to estimate the performance of a multi-label classifier. Most of these metrics are analogous to the typical performance metrics for binary and multi-class classification, and higher values are preferred for them.
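For instance, the following Python sketch computes common multi-label metrics (subset accuracy, Hamming loss, micro-F1) on illustrative label matrices; note in the comments that, unlike the others, Hamming loss is better when lower.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, accuracy_score

# Illustrative multi-label ground truth and predictions:
# rows are image elements, columns are fixation-index labels (1 = assigned).
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("subset accuracy:", accuracy_score(Y_true, Y_pred))  # exact label-set match
print("hamming loss  :", hamming_loss(Y_true, Y_pred))     # lower is better
print("micro F1      :", f1_score(Y_true, Y_pred, average="micro"))
```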
List of Visual Features
Additionally, "bolder" and "lighter" values are used to set the font weight relative to the font weight of the parent element. The possible values are length- and percentage-based metrics; the former indicates the absolute indentation, and the latter indicates the percentage of indentation relative to the width of the text block.
Summary
The differences between the various visual features of an image and those of its web page yield the contrast features. Furthermore, this chapter summarizes the visual characteristics of the text and image modalities (Section 3.5) that are used throughout the thesis.
Common Experimental Setup
- Apparatus
- Stimuli Preparation
- Experimental Setup
- Procedure
- Data Extraction
An Ethernet cable transfers the processed viewing data from the EPU to the host CPU; a magnetic strip attaches the eye tracker to the screen monitor. The eye tracker was connected to the EPU to send the sampled gaze data for processing.
Experiment-I Data
Experiment-II Data
Experiment-III Data
Considered Visual Features
Visual Text Features
The POSITION group, which describes the rectangular position of the text DOI and its size, contributes five features. Consequently, we calculated the cumulative distribution function (CDF) of font-family frequencies and merged the sparse font families that do not contribute to 85% of the total data.
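A minimal sketch of this font-family merging step, assuming pandas and an illustrative frequency distribution; the 85% cutoff follows the text, while the data and the "OTHER" merge label are hypothetical.

```python
import pandas as pd

# One font-family string per text element (illustrative data).
families = pd.Series(["Arial", "Arial", "Verdana", "Georgia", "Arial",
                      "Verdana", "Courier", "Impact"])

freq = families.value_counts(normalize=True)  # relative frequencies, descending
cdf = freq.cumsum()                           # cumulative distribution function

# Keep the families covering the first 85% of elements; merge the sparse tail.
keep = cdf[cdf <= 0.85].index
merged = families.where(families.isin(keep), other="OTHER")
print(merged.value_counts())
```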
Visual Image Features
Summary
However, the bottom and right regions were least likely to attract the first fixation during free viewing, similar to the information-retrieval and page-recognition tasks. Among the dominant regions (TopLeft, TopCenter, MiddleLeft), the first-fixation frequency during free viewing is comparable to that of information retrieval and page recognition.
Analysis-II — Identification of Informative Visual Features
Informative Text Visual Features
These five text-related intrinsic visual features were discarded in [19] because only HTML tags were considered. Surprisingly, none of the COLOR-related intrinsic visual features turned out to be informative, even though they are extensively researched in the literature.
Informative Image Visual Features
Moreover, the four position-related visual features, which emphasize the prominence of position, were among the Top-10 informative visual features. Relatively speaking, the intrinsic visual features of images are more informative (with higher IG scores) than those of text.
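A hedged sketch of how such information-gain-style scores can be computed, using scikit-learn's mutual-information estimator as a stand-in (for discrete variables, mutual information coincides with information gain); the data and dimensions are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))    # visual features of elements (illustrative)
y = rng.integers(1, 6, size=300)  # fixation indices as classes (illustrative)

# Estimated mutual information between each feature and the fixation index.
scores = mutual_info_classif(X, y, random_state=0)
top = np.argsort(scores)[::-1][:5]
print("Top-5 informative feature indices:", top, scores[top])
```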
Analysis-III — Modeling the Attention on Web Elements
Salient Element Identification through Decision Tree
The relatively higher precision for images (95.83%) than for text (77.29%) indicates that the visual features of images have more influence than text features on the decision of whether an element is salient. In contrast, the recall for text (93.72%) is higher than that for images (82.14%), indicating that the visual features of salient text elements separate them from non-salient elements better than their image counterparts do.
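A minimal sketch of salient-element identification with a decision tree, reporting the precision and recall metrics discussed above; the features, labels, and tree depth are illustrative assumptions, not the thesis's trained model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))               # element visual features (illustrative)
y = (X[:, 0] + rng.normal(size=400)) > 0.5  # salient (True) vs non-salient

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
y_hat = tree.predict(X)

print("precision:", precision_score(y, y_hat))  # of predicted-salient, how many are
print("recall   :", recall_score(y, y_hat))     # of truly salient, how many found
```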
Ordinal Visual Attention Prediction on Web Elements
Informative visual features help distinguish salient elements (fixation index up to the median FI) from non-salient ones. They also help predict a salient element's fixation index with nearly 90% average accuracy.
Kernel-based Attention Prediction
\[
K(I_k, I_t) = -\mathrm{Bessel}^{\mathrm{degree}}_{\mathrm{order}}\big(\sigma\,\lVert I_k - I_t\rVert^2\big) \tag{4.9}
\]
The hyperparameters order and degree parameterize the Bessel function. The sigmoid kernel achieved the relatively lowest performance, with an average accuracy of 87.18% and a micro-F1 score of 67.95%.
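Since the Bessel kernel is not built into common SVM libraries, one plausible (hedged) reading of Eq. 4.9 can be passed to scikit-learn's SVC as a callable kernel. The parameterization below is an assumption, and such kernels are not guaranteed positive semi-definite.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import jv  # Bessel function of the first kind
from sklearn.svm import SVC

def bessel_kernel(A, B, sigma=0.5, order=0, degree=1):
    # Hedged reconstruction of Eq. 4.9: a Bessel function of the given order,
    # raised to `degree`, applied to scaled pairwise squared distances.
    # Not guaranteed positive semi-definite.
    D2 = cdist(A, B, metric="sqeuclidean")
    return jv(order, sigma * D2) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))      # illustrative image visual features
y = rng.integers(1, 4, size=120)   # illustrative fixation-index classes

# SVC accepts a callable kernel that returns the Gram matrix between A and B.
clf = SVC(kernel=bessel_kernel).fit(X, y)
print(clf.predict(X[:5]))
```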
Weighted Voting based Attention Prediction on Images
Effective Visual Attention Assignment
In contrast, the fixation index that is least frequent in the overall attention data is assigned the highest weight. The fixation index with the highest weighted vote (wv_f) is assigned as the effective fixation index under each weighting strategy.
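A minimal sketch of this weighted-voting step: each fixation index's votes are scaled by its weight, and the argmax becomes the effective fixation index. The vote counts and weights below are illustrative.

```python
def effective_fixation_index(votes, weights):
    """votes: dict mapping fixation index -> number of users voting for it;
    weights: dict mapping fixation index -> weight (e.g., inverse frequency,
    so the rarest index carries the largest weight).
    Returns the fixation index with the highest weighted vote (wv_f)."""
    wv = {f: n * weights[f] for f, n in votes.items()}
    return max(wv, key=wv.get)

# Illustrative: index 3 is rare in the overall data, so it gets the largest weight.
votes = {1: 4, 2: 3, 3: 2}
weights = {1: 0.2, 2: 0.3, 3: 0.9}  # inversely proportional to frequency
print(effective_fixation_index(votes, weights))  # -> 3
```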
Prediction Model
The numbers of informative visual features differ for the uniform, linear, proportional, and inversely proportional weightings. To further understand the informative features, the number of informative visual features is calculated as θ varies, as shown in Figure 4.13c.
Prediction Performance
The number of informative features for the uniform, linear, and inversely proportional weightings decreased significantly with increasing θ. The inversely proportional weighting achieved the second-highest entropy until θ = 11, but outperformed the linear weighting thereafter.
Multi-users’ Multi-level Attention (MUMLA) on Images
- Introduction and Motivation
- MUMLA Prediction Approach
- MUMLA Prediction with Experiment-III
- MUMLA Prediction with Experiment-II
Label cardinality, the average number of FIs associated with each image element, highlights the suitability of multi-label approaches. Note that the true positive rate (TPR) (i.e., recall) of the baseline is comparable to the predicted performance.
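For reference, label cardinality can be computed directly from the binary element-by-FI indicator matrix, as in this illustrative sketch (the matrix is hypothetical).

```python
import numpy as np

# Rows are image elements, columns are fixation indices (1 = FI assigned).
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])

label_cardinality = Y.sum(axis=1).mean()  # average number of FIs per element
print(label_cardinality)                  # -> 2.0
```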
Summary
Chapter 4 established the importance of the visual features of text and images in explaining free-viewing attention. R2: Which cross-modal visual features are comparable to each other based on the attention distribution of a free-viewing user on bi-modal web pages?
Views of Attention and Grouping
Accordingly, perceptual features describe features related to fixation, including the fixation position, the fixation index (the sequential number of a fixation along the scan path), and the fixation-modality index (the modality-specific relative index of a fixation on a DOI). For example, in Figure 5.2, the text DOI on webpage G1 with fixation index 3 received a fixation-modality index of two from user U2, since it is the second text-modality element that U2 fixated on in G1.
Proposed DOI based Approach
Thus, the text feature with the highest canonical loading is comparable to the image feature with the highest canonical loading when the canonical variables $w_t$ and $w_i$ are highly correlated (i.e., $\rho_{\max}$ is large). That is, each column in $W_t$ and $W_i$ is a canonical direction, stacked horizontally in descending order of canonical correlation from left to right.
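A hedged sketch of this computation using scikit-learn's CCA as a stand-in: it extracts paired canonical directions, the per-direction correlations ρ_k, and the canonical loadings (all data and dimensions are illustrative).

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T = rng.normal(size=(150, 20))  # text-element visual features (m = 20, illustrative)
I = rng.normal(size=(150, 15))  # image-element visual features (n = 15, illustrative)

cca = CCA(n_components=5).fit(T, I)
T_c, I_c = cca.transform(T, I)  # projections onto the canonical directions

# Canonical correlations rho_k for each pair of canonical variables,
# in descending order across the canonical directions.
rhos = [np.corrcoef(T_c[:, k], I_c[:, k])[0, 1] for k in range(5)]

# Canonical loadings: correlation of each original text feature with the
# canonical variables (one column per canonical direction in W_t).
loadings_t = np.corrcoef(T.T, T_c.T)[:20, 20:]
print(np.round(rhos, 3), loadings_t.shape)
```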
Results and Analysis
Correlations between Text and Image DOIs
Applying the proposed approach (see Subsection 5.3.1) yielded the correlations between text and image visual features shown in Figure 5.4. Rather, limiting the webpage idiosyncrasies together with the user idiosyncrasies (WUG clustering) achieved the correlation between text and images.
Comparative Text and Image Features
Regarding the Top-2 canonical loadings (with ρ²_max > 0.8), the FAMILY, COLOR, and SPACE features of text are as influential as the HISTOGRAM, CONTRAST HISTOGRAM, and POSITION features of images in determining the correlation between the modalities. Overall, the FAMILY features of text dominated the canonical loadings across the canonical directions with ρ²_max > 0.8 and are comparable to the attention associated with the HISTOGRAM (followed by CONTRAST HISTOGRAM) features, as visualized in Figure 5.6.
Cross-Delineation of Attention
In user-oriented grouping, the preservation of the local neighborhood is very strong for text and strong for images. The variation in ρ_s for RUG is attributed to the low homogeneity between text and images, as indicated by the respective canonical correlations.
Prominent Feature Groups
Webpage-oriented Grouping
[Figure: Canonical correlation between text and images, computed with ALL feature groups and after excluding, in turn, FAMILY and CONTRAST COMPREHENSIVE; FAMILY and CONTRAST HISTOGRAM; FAMILY and CONTRAST; FAMILY and HISTOGRAM; COLOR and POSITION; and FAMILY and COMPREHENSIVE.] The prominence of the image contrast features is further established by the exclusion of FAMILY and CONTRAST.
Webpage-and-User-oriented Grouping
Analogous to the leave-one-out strategy, the exclusion of the COMPREHENSIVE, HISTOGRAM, and CONTRAST COMPREHENSIVE features along with FAMILY had little impact on the correlation. Even the exclusion of these feature groups from both modalities resulted in comparable correlations (see Figure 5.10), indicating their limited prominence in fusing text and images.
Summary
R1: Can attention to elements be predicted if all elements are unified in the text modality? R2: Can attention to elements be predicted if all elements are unified in the image modality?
Proposed Approach
Stage-I: Correlation Model
Accordingly, each text element and each image element represents a point in the visual text space ⊆ $\mathbb{R}^m$ and the visual image space ⊆ $\mathbb{R}^n$, respectively. Let the linear transformations of the text data along $w_t \in \mathbb{R}^{m\times 1}$ and of the image data along $w_i \in \mathbb{R}^{n\times 1}$ establish the correspondence between them.
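A minimal sketch of this Stage-I unification under stated assumptions: CCA serves as a stand-in for learning the paired directions $w_t$ and $w_i$, and both modalities are projected into the resulting common space (all data and dimensions are illustrative).

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
T = rng.normal(size=(120, 12))  # text elements in the visual text space (m = 12)
I = rng.normal(size=(120, 9))   # image elements in the visual image space (n = 9)

# Stage-I: learn paired directions (w_t, w_i) that maximally correlate the
# linear projections of the two modalities.
cca = CCA(n_components=4).fit(T, I)
T_common, I_common = cca.transform(T, I)

# Both modalities now live in one common space; the correlation along the
# first pair of directions should be the largest.
print(np.corrcoef(T_common[:, 0], I_common[:, 0])[0, 1])
```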
Stage-II: Prediction Model
Prediction Performance using WG
Prediction Performance on Same Dataset
At the median salience threshold, multi-level attention was predicted with an average accuracy of 82.79% and a micro-F1 score of 56.98% in the visual text space. The attention prediction in the visual image space also achieved similar performance, with an average accuracy of 81.72%.
Prediction Performance on Unseen Dataset
Prediction Performance using WUG
Prediction Performance on Same Dataset
[Figure: Example SVM-based classification in the visual image space (fixation index = 1) for a test image element and a test text element.] The attention prediction in the visual image space also achieved similar performance, with an average accuracy of 87.36%.
Prediction Performance on Unseen Dataset
However, the observed variation in prediction performance is attributed to the use of a low-dimensional common visual space (30 significant canonical directions) for unifying the elements, in which not all dimensions reach the optimal correlation between text and images.
Summary
In this chapter, we propose extending element-level attention prediction to visual scanpath prediction. In contrast to existing approaches, our work determines the scanpath by selecting the elements that attract the most attention.
Prediction Model
On the contrary, scanpath prediction should follow the zero-sum rule: no two elements can obtain the same fixation index. Now, for each fixated element ($E_k \in E$; $k = 1, \dots, d$), the center of the element is mapped onto the underlying grid cell (as shown in Figure 7.2) to accumulate the positional preference.
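A minimal sketch of this mapping, assuming a hypothetical positional-preference grid: element centers are binned into grid cells, and ranking by the accumulated preference yields unique fixation indices (no two elements share one).

```python
import numpy as np

def scanpath_from_preference(elements, grid_pref, grid_shape, page_size):
    """elements : list of (x, y) element centers in page pixels.
    grid_pref   : 2D array of positional-preference scores per grid cell.
    Returns {element index -> fixation index}, ranked by cell preference."""
    rows, cols = grid_shape
    W, H = page_size
    scores = []
    for (x, y) in elements:
        r = min(int(y / H * rows), rows - 1)  # map center onto its grid cell
        c = min(int(x / W * cols), cols - 1)
        scores.append(grid_pref[r, c])
    order = np.argsort(scores)[::-1]          # highest preference fixated first
    return {int(e): fi + 1 for fi, e in enumerate(order)}  # unique FIs

grid_pref = np.array([[9, 7, 3],              # illustrative top-left bias
                      [6, 5, 2],
                      [4, 1, 1]])
elements = [(100, 100), (800, 100), (400, 500)]
print(scanpath_from_preference(elements, grid_pref, (3, 3), (1200, 900)))
```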
Prediction Performance using WG
Prediction Performance on Same Dataset
Across the granularities, the prediction performance for the scanpath is relatively lower than that for the individual elements (see Table 7.2: Experiment-II elementalized attention and scanpath prediction performance at the median threshold, θ = 5).
Prediction Performance on Unseen Dataset
Both the element-granular attention prediction and the scanpath prediction performance were comparable, though relatively lower than in Experiment-I. The performance variation was analogous to Experiment-I, where prediction performance decreased with increasing θ.
Prediction Performance using WUG
Prediction Performance on Same Dataset
Since the FI classes are imbalanced, micro statistics (micro-F1 score and average accuracy) are calculated to analyze the overall prediction performance. Thus, we consider random prediction [107] as a baseline (analogous to [118]) to understand the prediction performance of the proposed model.
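For comparison, such a random baseline can be simulated with scikit-learn's DummyClassifier; the imbalanced class distribution below is an illustrative assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 10))  # element features (illustrative)
y = rng.choice([1, 2, 3, 4, 5], size=250,
               p=[.4, .25, .15, .12, .08])  # imbalanced FI classes

# Uniform random prediction over the observed classes as the baseline.
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
y_rand = baseline.predict(X)
print("random-baseline micro F1:", f1_score(y, y_rand, average="micro"))
```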
Prediction Performance on Unseen Dataset
Summary
To identify the prominent visual features, the information-gain scores between the visual features and the fixation indices were calculated. The analysis revealed that space and font-size features are informative for text elements, and mid-level color-histogram features are informative for image elements, while position and size are informative for both modalities.
Applications
CSS properties including column-gap, line-height, padding, font-size, and font-weight can be varied to achieve the desired user attention. Thus, color component values can be manipulated towards intermediate-level color components (e.g., from 50 to 200) to achieve the desired user attention.
Limitations
However, further studies are required on the influence of other user characteristics, such as culture and age [148], on intrinsic visual feature-based attention allocation. Moreover, their inclusion would complicate the analysis due to the incorporation of motion-related characteristics and the synchronization of user attention with motion frames.
Future Research Directions
Sandeep Vidyapu, Vijaya Saradhi Vedula & Samit Bhattacharya, "Quantitative Visual Attention Prediction on Webpage Images Using Multiclass SVM", ACM Symposium on Eye Tracking Research & Applications (ETRA), ETWEB, Denver, USA, 2019.
Sandeep Vidyapu, Vijaya Saradhi Vedula & Samit Bhattacharya, "Correlation between text and image visual features of web pages based on fixation indices", ACM Symposium on Eye Tracking Research and Applications (ETRA), Warsaw, Poland, 2018.
Organization of the contributions
A prototype PCCR eye-tracker, Tobii X2-60
Illustration of eye-gazing towards a web image element. The stable allocation
Example users’ scanpaths on a webpage: Each bubble is a fixation with the
Example Areas-Of-Interest (AOIs) on a webpage
Linear SVM illustration for binary classification of classes i and j
Eye-tracker setup used for collecting the gaze-data
Sample stimuli webpages of Experiment-I (top two rows) and Experiment-II
Experimental procedure in four steps: (i) stimuli loading (ii) eye-tracker
Experiment-I: Frequency distribution of fixation-indices
Experiment-II: Frequency distribution of fixation-indices
Sample stimuli webpages from each category of Experiment-III
Frequency distribution of fixation-indices (attention-levels)
Median time (in milliseconds) for first fixation on 3×3 webpage regions during
Spearman’s Rho (ρ_s) for three groupings
Canonical correlation between text and images for WG
Unification correlation between images and text
WG: Schematic of the proposed web element attention prediction approach
Experiment-I: Prediction performance metrics with variation in θ
Experiment-I: Example multi-level attention predictions on elements
Experiment-III: Prediction performance metrics with variation in θ
Experiment-III: Example multi-level attention predictions
WUG: Schematic of the proposed web element attention prediction approach
Experiment-I: Prediction performance metrics with variation in θ
Experiment-III: Prediction performance metrics with variation in θ
WG: Schematic of the proposed approach. UNIFICATION MODEL achieves
Computed Positional Preference
Experiment-I prediction performance with variation in θ