Analyses of user attention collected on real-world web pages reveal that (i) the achieved unification outperforms the random baseline in predicting user attention to elements; (ii) the presented model effectively predicts attention to all elements, whether unified in the visual text space or in the visual image space; and (iii) the model also outperforms the baseline when predicting attention on unseen web page data. Extension: because the entire web page is processed as an image, the importance of the individual elements that make up the page is overlooked.
Scope of the Thesis
Task-based: capture the eye movements made when the user deliberately looks at the page to perform a specific task. Because the thesis focuses on bottom-up attention modeling, i.e., modeling the influence of the visual features of web pages on the user's attention, the eye-tracking experiments are limited to free viewing.
Motivation
With the steady expansion of the Internet and its user base, attention models for experienced web users are increasingly desirable. To summarize, the aim of the thesis is to model the attention of experienced adult users when freely viewing multimodal static web pages as displayed on desktop monitors.
Objectives and Contributions
R1: Are the text and image visual features correlated based on the free-viewing user's attention allocation on bi-modal web pages? R2: Which intrinsic visual features are informative in explaining the free-viewing ordinal attention to web elements?
Thesis Organization
The background includes (i) the introduction of a web page as a Document Object Model (DOM) in Section 2.1, (ii) eye tracking as a technique to capture eye movements in Section 2.2, and (iii) the association between eye movements and visual attention, along with the terminology of eye movements, in Section 2.3. The related works include (i) the uni-modal (text-only or image-only) analyses that motivated the feature selection for our analysis (Section 2.4), and (ii) the related web page-based analyses that contextualize the constraints of the present work.
Eye-tracking
Eye-tracking technologies have evolved over the years in capturing physiological human eye movements. In neuroscience, eye tracking has been studied to establish the relationship between eye movements and visual attention.
Visual Attention and Terminology
A stable alignment of the foveal visual angle (1–2°) results in a fixation, an indicator of visual perception of the underlying element. Fixation indices (FI) are ordinal numbers associated with fixations that indicate the sequential order of the fixations.
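For concreteness, fixations are commonly detected from raw gaze samples with dispersion-based algorithms. Below is a minimal Python sketch of an I-DT-style detector that also assigns 1-based fixation indices; the thresholds and data are illustrative assumptions, not the thesis's actual detection pipeline.

```python
import numpy as np

def detect_fixations(gaze, max_dispersion=30.0, min_samples=6):
    """Minimal I-DT-style fixation detection (illustrative thresholds).

    gaze: (N, 2) array of raw (x, y) gaze samples in pixels. A window is a
    fixation when its dispersion (x-range + y-range) stays below the
    threshold; each detected fixation gets a 1-based fixation index (FI).
    """
    fixations, start = [], 0
    while start <= len(gaze) - min_samples:
        end = start + min_samples
        window = gaze[start:end]
        if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) <= max_dispersion:
            # Grow the window while dispersion stays within the threshold.
            while end < len(gaze):
                grown = gaze[start:end + 1]
                if np.ptp(grown[:, 0]) + np.ptp(grown[:, 1]) > max_dispersion:
                    break
                end += 1
            cx, cy = gaze[start:end].mean(axis=0)
            fixations.append((len(fixations) + 1, cx, cy))  # (FI, centroid)
            start = end
        else:
            start += 1
    return fixations

rng = np.random.default_rng(0)
gaze = np.vstack([
    [300, 200] + rng.normal(0, 2, size=(12, 2)),  # stable cluster -> fixation 1
    [900, 500] + rng.normal(0, 2, size=(12, 2)),  # stable cluster -> fixation 2
])
print(detect_fixations(gaze))
```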
Modality-specific Visual Feature based Analyses
Text Modality
These works motivated us to consider text- and background-related color features for the attention analysis. We also used font size to account for the number of characters per line, besides its own influence.
Image Modality
Motivated by this work, we included font family and font size features in the analysis of our proposed model. For web pages, all the above-mentioned attribute values are manipulated by web designers through Cascading Style Sheets (CSS) properties.
Attention Models for Webpages
User Attention-based (UAB) approaches
Calculating transition probabilities: these techniques determine transition probabilities between visual elements. Common scan path identification: in contrast to the aforementioned approaches, these techniques obtain a representative scan path (called the common scan path) from a given set of scan paths.
Visual Feature-based (VFB) approaches
UAB approaches are descriptive rather than predictive of attention; thus, user attention data is required for each test website. Collecting this data on every website, and especially after every minor modification to a website, may not be practical due to time constraints, monetary constraints, and limited user availability.
Element Visual Features’ and User Attention based Analysis
Prediction Approaches
Multi-class Classification using SVM
Analogous to Equation 2.1, the optimization problem in Equation 2.4 is solved using the Lagrangian formulation and primal-dual reformulations [22, 30]. Precision is the proportion of correctly predicted cases among all positively predicted cases.
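As a concrete illustration, the following Python sketch trains a multi-class SVM (scikit-learn's SVC, which uses one-vs-one decomposition internally) and reports micro-averaged precision; the feature matrix, labels, and parameters are illustrative assumptions, not the thesis's data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))     # 12 visual features per element (illustrative)
y = rng.integers(1, 5, size=200)   # fixation-index classes 1..4 (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Micro-averaged precision: correctly predicted cases over all positive predictions.
print(precision_score(y_te, y_hat, average="micro"))
```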
Multi-label Classification
Thus, distinct performance metrics have been introduced in the literature to estimate the performance of a multi-label classifier. Most of these metrics are analogous to the typical performance metrics for binary and multi-class classification, and higher values are preferred for them.
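For instance, the following Python sketch computes common multi-label metrics (subset accuracy, Hamming loss, micro-F1) on illustrative label matrices; note in the comments that, unlike the others, Hamming loss is better when lower.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, accuracy_score

# Illustrative multi-label ground truth and predictions:
# rows are image elements, columns are fixation-index labels (1 = assigned).
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("subset accuracy:", accuracy_score(Y_true, Y_pred))  # exact label-set match
print("hamming loss  :", hamming_loss(Y_true, Y_pred))     # lower is better
print("micro F1      :", f1_score(Y_true, Y_pred, average="micro"))
```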
List of Visual Features
Additionally, "bolder" and "lighter" values are used to set the font weight relative to the font weight of the parent element. The possible values are length- and percentage-based metrics; the former indicates the absolute indentation, and the latter indicates the percentage of indentation relative to the width of the text block.
Summary
The differences between the various visual features of an image and those of its web page yield the contrast features. Furthermore, this chapter summarizes the visual characteristics of the text and image modalities (Section 3.5) that are used throughout the thesis.
Common Experimental Setup
- Apparatus
- Stimuli Preparation
- Experimental Setup
- Procedure
- Data Extraction
An Ethernet cable transfers the processed viewing data from the EPU to the host CPU; a magnetic strip attaches the eye tracker to the screen monitor. The eye tracker was connected to the EPU to send the sampled gaze data for processing.
Experiment-I Data
Experiment-II Data
Experiment-III Data
Considered Visual Features
Visual Text Features
The POSITION group, which describes the rectangular position of the text DOI and its size, contributes five features. Consequently, we calculated the cumulative distribution function (CDF) of font-family frequencies and merged the sparse font families that do not contribute to 85% of the total data.
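A minimal sketch of this font-family merging step, assuming pandas and an illustrative frequency distribution; the 85% cutoff follows the text, while the data and the "OTHER" merge label are hypothetical.

```python
import pandas as pd

# One font-family string per text element (illustrative data).
families = pd.Series(["Arial", "Arial", "Verdana", "Georgia", "Arial",
                      "Verdana", "Courier", "Impact"])

freq = families.value_counts(normalize=True)  # relative frequencies, descending
cdf = freq.cumsum()                           # cumulative distribution function

# Keep the families covering the first 85% of elements; merge the sparse tail.
keep = cdf[cdf <= 0.85].index
merged = families.where(families.isin(keep), other="OTHER")
print(merged.value_counts())
```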
Visual Image Features
Summary
However, the bottom and right regions were least likely to attract the first fixation during free viewing, similar to the information-retrieval and page-recognition tasks. Among the dominant regions (TopLeft, TopCenter, MiddleLeft), the first-fixation frequency during free viewing is comparable to that of information retrieval and page recognition.
Analysis-II — Identification of Informative Visual Features
Informative Text Visual Features
These five text-related intrinsic visual features were discarded in [19] because only HTML tags were considered. Surprisingly, none of the COLOR-related intrinsic visual features turned out to be informative, even though they are extensively researched in the literature.
Informative Image Visual Features
Moreover, the four position-related visual features, which emphasize the prominence of position, were among the Top-10 informative visual features. Relatively speaking, the intrinsic visual features of images are more informative (with higher IG scores) than those of text.
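A hedged sketch of how such information-gain-style scores can be computed, using scikit-learn's mutual-information estimator as a stand-in (for discrete variables, mutual information coincides with information gain); the data and dimensions are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))    # visual features of elements (illustrative)
y = rng.integers(1, 6, size=300)  # fixation indices as classes (illustrative)

# Estimated mutual information between each feature and the fixation index.
scores = mutual_info_classif(X, y, random_state=0)
top = np.argsort(scores)[::-1][:5]
print("Top-5 informative feature indices:", top, scores[top])
```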
Analysis-III — Modeling the Attention on Web Elements
Salient Element Identification through Decision Tree
The relatively higher precision for images (95.83%) than for text (77.29%) indicates that the visual features of images have more influence than text features on the decision of whether an element is salient. In contrast, the recall for text (93.72%) is higher than that for images (82.14%), indicating that the visual features of salient text elements separate them from non-salient elements better than their image counterparts do.
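A minimal sketch of salient-element identification with a decision tree, reporting the precision and recall metrics discussed above; the features, labels, and tree depth are illustrative assumptions, not the thesis's trained model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))               # element visual features (illustrative)
y = (X[:, 0] + rng.normal(size=400)) > 0.5  # salient (True) vs non-salient

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
y_hat = tree.predict(X)

print("precision:", precision_score(y, y_hat))  # of predicted-salient, how many are
print("recall   :", recall_score(y, y_hat))     # of truly salient, how many found
```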
Ordinal Visual Attention Prediction on Web Elements
Informative visual features help distinguish salient elements (fixation index up to the median FI) from non-salient ones. They also help predict a salient element's fixation index with nearly 90% average accuracy.
Kernel-based Attention Prediction
\[
K(I_k, I_t) = -\mathrm{Bessel}^{\mathrm{degree}}_{\mathrm{order}}\big(\sigma\,\lVert I_k - I_t\rVert^2\big) \tag{4.9}
\]
The hyperparameters order and degree parameterize the Bessel function. The sigmoid kernel achieved the relatively lowest performance, with an average accuracy of 87.18% and a micro-F1 score of 67.95%.
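Since the Bessel kernel is not built into common SVM libraries, one plausible (hedged) reading of Eq. 4.9 can be passed to scikit-learn's SVC as a callable kernel. The parameterization below is an assumption, and such kernels are not guaranteed positive semi-definite.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import jv  # Bessel function of the first kind
from sklearn.svm import SVC

def bessel_kernel(A, B, sigma=0.5, order=0, degree=1):
    # Hedged reconstruction of Eq. 4.9: a Bessel function of the given order,
    # raised to `degree`, applied to scaled pairwise squared distances.
    # Not guaranteed positive semi-definite.
    D2 = cdist(A, B, metric="sqeuclidean")
    return jv(order, sigma * D2) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))      # illustrative image visual features
y = rng.integers(1, 4, size=120)   # illustrative fixation-index classes

# SVC accepts a callable kernel that returns the Gram matrix between A and B.
clf = SVC(kernel=bessel_kernel).fit(X, y)
print(clf.predict(X[:5]))
```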
Weighted Voting based Attention Prediction on Images
Effective Visual Attention Assignment
In contrast, the fixation index that is least frequent in the overall attention data is assigned the highest weight. The fixation index with the highest weighted vote (wv_f) is assigned as the effective fixation index under each weighting strategy.
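A minimal sketch of this weighted-voting step: each fixation index's votes are scaled by its weight, and the argmax becomes the effective fixation index. The vote counts and weights below are illustrative.

```python
def effective_fixation_index(votes, weights):
    """votes: dict mapping fixation index -> number of users voting for it;
    weights: dict mapping fixation index -> weight (e.g., inverse frequency,
    so the rarest index carries the largest weight).
    Returns the fixation index with the highest weighted vote (wv_f)."""
    wv = {f: n * weights[f] for f, n in votes.items()}
    return max(wv, key=wv.get)

# Illustrative: index 3 is rare in the overall data, so it gets the largest weight.
votes = {1: 4, 2: 3, 3: 2}
weights = {1: 0.2, 2: 0.3, 3: 0.9}  # inversely proportional to frequency
print(effective_fixation_index(votes, weights))  # -> 3
```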
Prediction Model
The numbers of informative visual features differ for the uniform, linear, proportional, and inversely proportional weightings. To further understand the informative features, the number of informative visual features is calculated as θ varies, as shown in Figure 4.13c.
Prediction Performance
The number of informative features for the uniform, linear, and inversely proportional weightings decreased significantly with increasing θ. The inversely proportional weighting achieved the second-highest entropy until θ = 11, but outperformed the linear weighting thereafter.
Multi-users’ Multi-level Attention (MUMLA) on Images
- Introduction and Motivation
- MUMLA Prediction Approach
- MUMLA Prediction with Experiment-III
- MUMLA Prediction with Experiment-II
Label cardinality, the average number of FIs associated with each image element, highlights the suitability of multi-label approaches. Note that the true positive rate (TPR) (i.e., recall) of the baseline is comparable to the predicted performance.
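For reference, label cardinality can be computed directly from the binary element-by-FI indicator matrix, as in this illustrative sketch (the matrix is hypothetical).

```python
import numpy as np

# Rows are image elements, columns are fixation indices (1 = FI assigned).
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])

label_cardinality = Y.sum(axis=1).mean()  # average number of FIs per element
print(label_cardinality)                  # -> 2.0
```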
Summary
Chapter 4 established the importance of the visual features of text and images in explaining free-viewing attention. R2: Which cross-modal visual features are comparable to each other based on the attention distribution of a free-viewing user on bi-modal web pages?
Views of Attention and Grouping
Accordingly, perceptual features describe features related to fixation, including the fixation position, the fixation index (the sequential number of a fixation along the scan path), and the fixation-modality index (the modality-specific relative index of a fixation on a DOI). For example, in Figure 5.2, the text DOI on webpage G1 with fixation index 3 received a fixation-modality index of two from user U2, since it is the second text-modality element that U2 fixated on in G1.
Proposed DOI based Approach
Thus, the text feature with the highest canonical loading is comparable to the image feature with the highest canonical loading when the canonical variables $w_t$ and $w_i$ are highly correlated (i.e., $\rho_{\max}$ is large). That is, each column in $W_t$ and $W_i$ is a canonical direction, stacked horizontally in descending order of canonical correlation from left to right.
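A hedged sketch of this computation using scikit-learn's CCA as a stand-in: it extracts paired canonical directions, the per-direction correlations ρ_k, and the canonical loadings (all data and dimensions are illustrative).

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T = rng.normal(size=(150, 20))  # text-element visual features (m = 20, illustrative)
I = rng.normal(size=(150, 15))  # image-element visual features (n = 15, illustrative)

cca = CCA(n_components=5).fit(T, I)
T_c, I_c = cca.transform(T, I)  # projections onto the canonical directions

# Canonical correlations rho_k for each pair of canonical variables,
# in descending order across the canonical directions.
rhos = [np.corrcoef(T_c[:, k], I_c[:, k])[0, 1] for k in range(5)]

# Canonical loadings: correlation of each original text feature with the
# canonical variables (one column per canonical direction in W_t).
loadings_t = np.corrcoef(T.T, T_c.T)[:20, 20:]
print(np.round(rhos, 3), loadings_t.shape)
```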
Results and Analysis
Correlations between Text and Image DOIs
Applying the proposed approach (see Subsection 5.3.1) yielded the correlations between text and image visual features shown in Figure 5.4. Rather, limiting the webpage idiosyncrasies together with the user idiosyncrasies (WUG clustering) achieved the correlation between text and images.
Comparative Text and Image Features
Regarding the Top-2 canonical loadings (with ρ²_max > 0.8), the FAMILY, COLOR, and SPACE features of text are as influential as the HISTOGRAM, CONTRAST HISTOGRAM, and POSITION features of images in determining the correlation between the modalities. Overall, the FAMILY features of text dominated the canonical loadings across the canonical directions with ρ²_max > 0.8 and are comparable to the attention associated with the HISTOGRAM (followed by CONTRAST HISTOGRAM) features, as visualized in Figure 5.6.
Cross-Delineation of Attention
In user-oriented grouping, the preservation of the local neighborhood is very strong for text and strong for images. The variation in ρ_s for RUG is attributed to the low homogeneity between text and images, as indicated by the respective canonical correlations.
Prominent Feature Groups
Webpage-oriented Grouping
[Figure: Canonical correlation between text and images, computed with ALL feature groups and after excluding, in turn, FAMILY and CONTRAST COMPREHENSIVE; FAMILY and CONTRAST HISTOGRAM; FAMILY and CONTRAST; FAMILY and HISTOGRAM; COLOR and POSITION; and FAMILY and COMPREHENSIVE.] The prominence of the image contrast features is further established by the exclusion of FAMILY and CONTRAST.
Webpage-and-User-oriented Grouping
Analogous to the leave-one-out strategy, the exclusion of the COMPREHENSIVE, HISTOGRAM, and CONTRAST COMPREHENSIVE features along with FAMILY had little impact on the correlation. Even the exclusion of these feature groups from both modalities resulted in comparable correlations (see Figure 5.10), indicating their limited prominence in fusing text and images.
Summary
R1: Can attention to elements be predicted if all elements are unified in the text modality? R2: Can attention to elements be predicted if all elements are unified in the image modality?
Proposed Approach
Stage-I: Correlation Model
Accordingly, each text element and each image element represents a point in the visual text space ⊆ $\mathbb{R}^m$ and the visual image space ⊆ $\mathbb{R}^n$, respectively. Let the linear transformations of the text data along $w_t \in \mathbb{R}^{m\times 1}$ and of the image data along $w_i \in \mathbb{R}^{n\times 1}$ establish the correspondence between them.
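A minimal sketch of this Stage-I unification under stated assumptions: CCA serves as a stand-in for learning the paired directions $w_t$ and $w_i$, and both modalities are projected into the resulting common space (all data and dimensions are illustrative).

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
T = rng.normal(size=(120, 12))  # text elements in the visual text space (m = 12)
I = rng.normal(size=(120, 9))   # image elements in the visual image space (n = 9)

# Stage-I: learn paired directions (w_t, w_i) that maximally correlate the
# linear projections of the two modalities.
cca = CCA(n_components=4).fit(T, I)
T_common, I_common = cca.transform(T, I)

# Both modalities now live in one common space; the correlation along the
# first pair of directions should be the largest.
print(np.corrcoef(T_common[:, 0], I_common[:, 0])[0, 1])
```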
Stage-II: Prediction Model
Prediction Performance using WG
Prediction Performance on Same Dataset
At the median salience threshold, multi-level attention was predicted with an average accuracy of 82.79% and a micro-F1 score of 56.98% in the visual text space. The attention prediction in the visual image space also achieved similar performance, with an average accuracy of 81.72%.
Prediction Performance on Unseen Dataset
Prediction Performance using WUG
Prediction Performance on Same Dataset
[Figure: Example SVM-based classification in the visual image space (fixation index = 1) for a test image element and a test text element.] The attention prediction in the visual image space also achieved similar performance, with an average accuracy of 87.36%.
Prediction Performance on Unseen Dataset
However, the observed variation in prediction performance is attributed to the use of a low-dimensional common visual space (30 significant canonical directions) for unifying the elements, in which not all dimensions reach the optimal correlation between text and images.
Summary
In this chapter, we propose extending element-level attention prediction to visual scanpath prediction. In contrast to existing approaches, our work determines the scanpath by selecting the elements that attract the most attention.
Prediction Model
On the contrary, scanpath prediction should follow the zero-sum rule: no two elements can obtain the same fixation index. Now, for each fixated element ($E_k \in E$; $k = 1, \dots, d$), the center of the element is mapped onto the underlying grid cell (as shown in Figure 7.2) to accumulate the positional preference.
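A minimal sketch of this mapping, assuming a hypothetical positional-preference grid: element centers are binned into grid cells, and ranking by the accumulated preference yields unique fixation indices (no two elements share one).

```python
import numpy as np

def scanpath_from_preference(elements, grid_pref, grid_shape, page_size):
    """elements : list of (x, y) element centers in page pixels.
    grid_pref   : 2D array of positional-preference scores per grid cell.
    Returns {element index -> fixation index}, ranked by cell preference."""
    rows, cols = grid_shape
    W, H = page_size
    scores = []
    for (x, y) in elements:
        r = min(int(y / H * rows), rows - 1)  # map center onto its grid cell
        c = min(int(x / W * cols), cols - 1)
        scores.append(grid_pref[r, c])
    order = np.argsort(scores)[::-1]          # highest preference fixated first
    return {int(e): fi + 1 for fi, e in enumerate(order)}  # unique FIs

grid_pref = np.array([[9, 7, 3],              # illustrative top-left bias
                      [6, 5, 2],
                      [4, 1, 1]])
elements = [(100, 100), (800, 100), (400, 500)]
print(scanpath_from_preference(elements, grid_pref, (3, 3), (1200, 900)))
```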
Prediction Performance using WG
Prediction Performance on Same Dataset
Across the granularities, the prediction performance for the scanpath is relatively lower than that for the individual elements (see Table 7.2: Experiment-II elementalized attention and scanpath prediction performance at the median threshold, θ = 5).
Prediction Performance on Unseen Dataset
Both the element-granular attention prediction and the scanpath prediction performance were comparable, though relatively lower than in Experiment-I. The performance variation was analogous to Experiment-I, where prediction performance decreased with increasing θ.
Prediction Performance using WUG
Prediction Performance on Same Dataset
Since the FI classes are imbalanced, micro statistics (micro-F1 score and average accuracy) are calculated to analyze the overall prediction performance. Thus, we consider random prediction [107] as a baseline (analogous to [118]) to understand the prediction performance of the proposed model.
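For comparison, such a random baseline can be simulated with scikit-learn's DummyClassifier; the imbalanced class distribution below is an illustrative assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 10))  # element features (illustrative)
y = rng.choice([1, 2, 3, 4, 5], size=250,
               p=[.4, .25, .15, .12, .08])  # imbalanced FI classes

# Uniform random prediction over the observed classes as the baseline.
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
y_rand = baseline.predict(X)
print("random-baseline micro F1:", f1_score(y, y_rand, average="micro"))
```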
Prediction Performance on Unseen Dataset
Summary
To identify the prominent visual features, the information-gain scores between the visual features and the fixation indices were calculated. The analysis revealed that space and font-size features are informative for text elements, and mid-level color-histogram features are informative for image elements, while position and size are informative for both modalities.
Applications
CSS properties including column-gap, line-height, padding, font-size, and font-weight can be varied to achieve the desired user attention. Thus, color component values can be manipulated towards intermediate-level color components (e.g., from 50 to 200) to achieve the desired user attention.
Limitations
However, further studies are required on the influence of other user characteristics, such as culture and age [148], on intrinsic visual feature-based attention allocation. Moreover, their inclusion would complicate the analysis due to the incorporation of motion-related characteristics and the synchronization of user attention with motion frames.
Future Research Directions
Sandeep Vidyapu, Vijaya Saradhi Vedula & Samit Bhattacharya, "Quantitative Visual Attention Prediction on Webpage Images Using Multiclass SVM", ACM Symposium on Eye Tracking Research & Applications (ETRA), ETWEB, Denver, USA, 2019.
Sandeep Vidyapu, Vijaya Saradhi Vedula & Samit Bhattacharya, "Correlation between text and image visual features of web pages based on fixation indices", ACM Symposium on Eye Tracking Research and Applications (ETRA), Warsaw, Poland, 2018.
Organization of the contributions
A prototype PCCR eye-tracker, Tobii X2-60
Illustration of eye-gazing towards a web image element. The stable allocation
Example users’ scanpaths on a webpage: Each bubble is a fixation with the
Example Areas-Of-Interest (AOIs) on a webpage
Linear SVM illustration for binary classification of classes i and j
Eye-tracker setup used for collecting the gaze-data
Sample stimuli webpages of Experiment-I (top two rows) and Experiment-II
Experimental procedure in four steps: (i) stimuli loading (ii) eye-tracker
Experiment-I: Frequency distribution of fixation-indices
Experiment-II: Frequency distribution of fixation-indices
Sample stimuli webpages from each category of Experiment-III
Frequency distribution of fixation-indices (attention-levels)
Median time (in milliseconds) for first fixation on 3×3 webpage regions during
Spearman’s Rho (ρ_s) for three groupings
Canonical correlation between text and images for WG
Unification correlation between images and text
WG: Schematic of the proposed web element attention prediction approach
Experiment-I: Prediction performance metrics with variation in θ
Experiment-I: Example multi-level attention predictions on elements
Experiment-III: Prediction performance metrics with variation in θ
Experiment-III: Example multi-level attention predictions
WUG: Schematic of the proposed web element attention prediction approach
Experiment-I: Prediction performance metrics with variation in θ
Experiment-III: Prediction performance metrics with variation in θ
WG: Schematic of the proposed approach. UNIFICATION MODEL achieves
Computed Positional Preference
Experiment-I prediction performance with variation in θ