
Deep Learning frameworks for Image Quality Assessment


Academic year: 2023


The second method I propose is full-reference and no-reference image quality assessment using deep convolutional neural networks. It is shown that the sum of the patch quality scores correlates highly with the subjective scores. Since quality is an important criterion for images, image quality assessment is a useful area of research.

Image quality is a characteristic of an image that measures the perceived image degradation (usually compared to an ideal or perfect image). In subjective quality assessment, a number of human subjects are instructed to rate the quality of a given image on a defined scale. An algorithm that can predict the subjective quality of a given image is called an objective quality assessment method.

When performing a subjective quality assessment, we must consider several subjects, as opinions will differ between them.

Full reference image quality assessment (FRIQA)

But in many cases we will not get the expected quality ratings. Opinions also depend on lighting conditions, the subject's experience in judging quality, viewing distance from the image, and so on. Since opinions differ between subjects, the average of the opinions is taken as the quality score.
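The averaging of subjective opinions described above can be sketched as follows (the ratings here are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical ratings of one image by five human subjects (scale 1-5).
ratings = np.array([4, 5, 3, 4, 4])

# The Mean Opinion Score (MOS) is the average of the subjects' opinions.
mos = ratings.mean()

# The standard deviation indicates how much the subjects disagreed.
spread = ratings.std()

print(mos)  # 4.0
```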

Reduced reference image quality assessment (RRIQA)

No reference image quality assessment (NRIQA)

Deep learning

Supervised learning

Unsupervised learning

A neural network is a machine learning method inspired by the human brain. Feedforward neural networks allow the signal to travel from input to output in a single direction. Convolutional neural networks are biologically inspired by the layers of the visual cortex.

Each filter is small in space (in width and height) but spans the entire depth of the input volume. A fully connected layer, as the name implies, connects each neuron of the previous layer to every neuron of the subsequent layer.
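The "small in space, full in depth" property of a convolutional filter can be sketched in plain numpy (the shapes below are illustrative, not those of the thesis network):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy input volume: 8x8 spatial size, depth 3 (e.g. an RGB patch).
x = rng.standard_normal((8, 8, 3))

# One convolutional filter: 3x3 spatially, but spanning the full depth 3.
w = rng.standard_normal((3, 3, 3))

# Valid cross-correlation: the filter slides over spatial positions only;
# the depth dimension is consumed entirely at every position.
out = np.empty((6, 6))
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(x[i:i+3, j:j+3, :] * w)

print(out.shape)  # (6, 6): one 2-D activation map per filter
```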

Figure 2.1: Basic structure of Neural network and 3D representation of the Convolutional network

Autoencoder

The activation layer controls how the signal flows from one layer to another, much as neurons are excited in the brain. The loss function quantifies the error between the predictions and the ground-truth labels. Based on the loss, the error is propagated backwards and the weights of each layer are updated.
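A minimal sketch of this loss-and-update loop, using a single linear layer with an MSE loss (the toy data and learning rate are my own, chosen only to make the mechanics visible):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data with known ground-truth labels y.
X = rng.standard_normal((40, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # weights of a single linear layer
lr = 0.05         # learning rate

losses = []
for _ in range(500):
    pred = X @ w
    err = pred - y                     # deviation from the ground truth
    losses.append(np.mean(err ** 2))   # MSE loss quantifies this error
    grad = 2 * X.T @ err / len(y)      # gradient of the loss w.r.t. w
    w -= lr * grad                     # propagate the error back, update weights

print(losses[0] > losses[-1])  # True: the error shrinks as the weights update
```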

If our data points come from a non-linear surface, then we need to move to non-linear autoencoders, which have non-linear activations.
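As a sketch of a non-linear autoencoder, one can use scikit-learn's MLPRegressor with tanh activations and a narrow bottleneck, trained to reproduce its own input (the toy data and layer sizes here are assumptions for illustration, not the architecture used in this thesis):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy data lying on a non-linear (quadratic) surface embedded in 4-D.
t = rng.uniform(-1, 1, size=(200, 2))
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0] ** 2, t[:, 0] * t[:, 1]])

# Non-linear autoencoder: tanh activations, a 2-unit bottleneck,
# trained with the input itself as the target.
ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                  max_iter=5000, random_state=0)
ae.fit(X, X)

X_hat = ae.predict(X)
print(X_hat.shape)  # reconstruction has the same shape as the input
```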

Support Vector Regression (SVR)

The idea of SVR is based on the computation of a linear regression function in a high-dimensional feature space into which the input data is mapped via a non-linear function.
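This can be sketched with scikit-learn's SVR, where an RBF kernel supplies the implicit non-linear mapping (the toy sine-fitting problem and hyperparameters are my own illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy 1-D regression problem with a non-linear target.
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel()

# The RBF kernel implicitly maps the input into a high-dimensional
# feature space where a linear regression function is computed.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05)
model.fit(X, y)

pred = model.predict([[0.5]])  # close to sin(0.5)
```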

Generative Adversarial Networks

  • Related work
  • Feature Extraction
  • Quality Measurement
  • Finding Correlation

[8] conducted a study on the evaluation of full-reference IQA algorithms, which paves the way for thinking about the factors that affect image quality. But in many real-world cases we will not have the clean version of the distorted image. Most of the predominant algorithms in the literature first attempt to learn statistics of the image using various tools and then correlate the obtained features with human subjective scores.

[15] looked at how the statistical features of a distorted image differ from those of the pristine one and used these features to train a statistical model entirely in the DCT domain. The quality is evaluated by computing the distance between the MVG fit of NSS features extracted from the test image and an MVG model of quality-aware features extracted from a corpus of natural images. In the case of videos, the researchers likewise started from the statistics of natural videos and measured how far these statistics are broken in distorted videos.

They used a generalized Gaussian distribution (GGD) to model the sparse representation of each atom of the dictionary and showed that these GGD parameters are well suited for distortion discrimination. The entry of deep learning [27] brought a drastic change to the fields of image processing and quality assessment. Exploiting temporal dependencies requires the hidden layers to be connected across time steps, which paved the way for the invention of Recurrent Neural Networks (RNNs).

They also trained fully connected layers in parallel with the regression part to obtain a weight for each part of the image. Previous literature suggests that the statistical characteristics of pristine images change in the presence of distortion. Since image dimensions are usually too large to investigate directly, we seek a lower-dimensional representation of a pristine image and look for changes in that representation in the presence of distortion.

The decoder part of the autoencoder must be efficient and able to regenerate high-dimensional data from the low-dimensional code. Let f_r and f_d be two vectors of length N that represent the encoded outputs of the reference and distorted image, respectively, where N is the feature length.
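The distance measures later compared in this work can be sketched on such a pair of encoded vectors (the values of f_r and f_d below are hypothetical placeholders):

```python
import numpy as np

# Hypothetical encoded outputs of the reference and distorted image.
f_r = np.array([0.2, 0.8, 0.5, 0.1])
f_d = np.array([0.3, 0.6, 0.5, 0.4])
diff = np.abs(f_r - f_d)

l1        = diff.sum()                  # city-block (absolute) distance
l2        = np.sqrt((diff ** 2).sum())  # Euclidean distance
chebyshev = diff.max()                  # largest per-feature deviation
lorentz   = np.log1p(diff).sum()        # Lorentzian distance: sum of log(1 + |.|)

print(l1, chebyshev)
```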

Figure 2.5: Denoised image from MNIST database

Results and Discussion

Datasets

Performance evaluation

The decoded results keep the same shape as the input image. Each is vectorized, and the distance between the two vectors is computed using different distance metrics. The scatter plot showing the relationship between the absolute distance between the decoded reference and decoded distorted image and the corresponding subjective scores for the images in the LIVE database is shown in Fig. 5.3.

The decoded result is obtained from the model which is trained using the patch size of 256×256 for all LIVE database images. Different distance measures are calculated between the decoded reference image and the decoded distorted image. Correlation results between these distance features and subjective scores for all distortions in the LIVE database are shown in Table 5.6.

The encoded distance measures, which are highly correlated with human subjective scores, are taken as the input features for Support Vector Regression (SVR) and trained against the DMOS values. Testing is done on a model trained with a patch size of 256×256. The same procedure is repeated for the other standard image quality assessment databases, which are listed in Table 5.2.
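This regress-then-correlate pipeline can be sketched end to end; the distance features and DMOS values below are synthetic stand-ins, not data from the LIVE database:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical per-image distance features and DMOS values: in this
# sketch, DMOS grows (noisily) with the distance feature.
dist = rng.uniform(0, 1, size=120)
dmos = 60 * dist + rng.normal(0, 3, size=120)

# Train SVR on 100 images, test on the held-out 20.
svr = SVR(kernel="rbf", C=100.0)
svr.fit(dist[:100, None], dmos[:100])
pred = svr.predict(dist[100:, None])

# SROCC between predicted and actual DMOS on the held-out images.
srocc, _ = spearmanr(pred, dmos[100:])
print(round(srocc, 2))  # close to 1 for this well-behaved synthetic data
```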

For all other datasets, the model considered is the one trained with a patch size of 256×256. The feature similarity metric (FSIM) and the multiscale structural similarity index metric (MS-SSIM) are two powerful full-reference metrics in the literature.

Table 4.3: Comparison of the performance of the model trained with patch size 64×64 over different distortion types in LIVE Release2

Conclusion

  • Full reference image quality assessment
  • No reference image quality assessment
  • Network Architecture
  • Quality Estimation

At low patch sizes, the Chebyshev distance features correlate better with DMOS, but as the patch size increases, the Lorentzian distance features show the highest correlation values. This led me to propose a deep convolutional neural network for full-reference and no-reference image quality assessment. The CNN extracts features from patches of the distorted and reference images and estimates the perceived quality of the distorted image by combining the features and performing regression with two fully connected layers.

Two additional fully connected layers are added in parallel to the regression layers to obtain a weighted-average aggregation of the patch-estimated local quality into a global quality, as in [47]. The no-reference image quality assessment algorithm is implemented in a similar way to the full-reference method. The network has only one input, which is the patch obtained from the input image.

The network is trained so that, for a given test image, it can provide an image quality estimate. The network is deep, consisting of eight convolutional layers with max-pooling layers after every two convolutional layers. In the full-reference case, it includes an additional layer to concatenate the different features obtained from the convolutional layers.

Passing the reference and distorted patches separately through different convolutional layers and concatenating the features is inspired by work on stereo matching [48]. At the end, two fully connected layers perform the regression, that is, they map patch quality to image quality. Let Np be the number of patches extracted from an image and yi the patch quality estimate for a given patch i.
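The weighted-average aggregation of patch quality into image quality can be sketched as follows (the patch estimates y_i and weights w_i are hypothetical values standing in for the outputs of the two parallel fully connected branches):

```python
import numpy as np

# Hypothetical patch quality estimates y_i and weights w_i
# for Np = 5 patches of one image.
y = np.array([30.0, 45.0, 25.0, 50.0, 40.0])
w = np.array([0.1, 0.3, 0.1, 0.4, 0.1])

# Global quality: weighted average of the patch-local estimates,
# so that heavily weighted patches dominate the image score.
q = np.sum(w * y) / np.sum(w)
print(q)
```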

Table 4.6: Comparison of the performance of the model trained with patch size 256×256 using the decoded result as the feature over different distortion types on the LIVE Release2 database

Results and Discussions

Patches are assigned the quality label given to the complete image from which they were cut. The fused feature vectors are fed to a fully connected neural network that performs regression to produce a patch quality estimate. Among the convolutional layers, the initial layers extract low-level features and, as the network goes deeper, the final convolutional layers extract high-level features.
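The patch extraction and label inheritance step can be sketched as follows (the image size, patch size, and DMOS value below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 128x128 grayscale image with a hypothetical image-level DMOS of 42.0.
image = rng.random((128, 128))
dmos = 42.0

# Cut the image into non-overlapping 32x32 patches; every patch inherits
# the quality label of the image it was cut from.
patch = 32
patches, labels = [], []
for i in range(0, 128, patch):
    for j in range(0, 128, patch):
        patches.append(image[i:i+patch, j:j+patch])
        labels.append(dmos)

print(len(patches))  # 16 patches, each labelled 42.0
```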

Both the CSIQ and TID2013 datasets share only four distortions in common with the LIVE database.

Conclusion

The level of distortion can vary from one region to another within an image, i.e., we may not be able to see the distortion in some parts of an image, but it may be present in other parts. For a highly distorted image, most of the patches in it will be of low quality. The case is different for an image that has an average level of distortion where we can find some low quality and some high quality patches.

This gives me the idea to look into a classification model that can classify a given input patch into low quality or high quality. But the model needs to be trained on a very small patch size to extract local features.
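Under this classification view, one natural image score is the fraction of patches classified as high quality; a heavily distorted image yields mostly zeros, a mildly distorted one a mixture. The predictions below are hypothetical, standing in for the classifier's output:

```python
import numpy as np

# Hypothetical binary predictions for the patches of one image:
# 1 = high quality, 0 = low quality.
patch_labels = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# Fraction of high-quality patches as the image quality score.
score = patch_labels.mean()
print(score)  # 0.75
```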

Table 5.2: Performance evaluation

Proposed Method

Network Architecture

Our network consists of six convolutional layers, three max-pooling layers and, at the end, a fully connected layer with softmax as the activation function. For a two-class classification problem, the last fully connected layer has two nodes; in multi-class classification, the number of nodes in the last fully connected layer equals the number of distortions plus one (one additional node for pristine). A pre-trained VGG16 model is also used, with the first few layers frozen and two fully connected layers added at the end.

Quality Measurement

Even in the multi-class classification problem, the image quality score is obtained by averaging the high-quality labels across all distortions.
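One reading of this averaging in the multi-class setting (class numbering and predictions below are my own assumptions): a patch counts as high quality when it is classified as pristine, and the image score averages these indicators over all patches, whatever distortion class the others fall into.

```python
import numpy as np

# Hypothetical per-patch class predictions:
# class 0 = pristine (high quality), classes 1..N = one class per distortion.
patch_classes = np.array([0, 0, 2, 0, 1, 0, 0, 3])

# Average the high-quality (pristine) indicators across all patches.
score = (patch_classes == 0).mean()
print(score)  # 0.625
```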

Results and Discussions

Conclusions

Figures

  • Figure 2.1: Basic structure of Neural network and 3D representation of the Convolutional network
  • Figure 2.2: Basic structure of Autoencoder
  • Figure 2.3: 2-class classification problem using SVM
  • Figure 2.5: Denoised image from MNIST database
  • Input image and decoded image
  • Scatter plot showing the relation between subjective scores and the absolute error
  • Block diagram of the FRIQA algorithm
  • Block diagram of the NRIQA algorithm
  • Block diagram for classification
