Subjective and Objective Evaluation of Tone-Mapping and De-Ghosting Algorithms

The process of converting HDR images to LDR images is called tone mapping or tone rendering. Different tone mapping operators (TMOs) alter HDR images in different ways, for example by changing the maximum luminance value, gradients, or edges. The database of tone-mapped images is constructed using 21 state-of-the-art tone mapping operators reviewed in [1].

Experimental Setup

Removal of Outliers

When we applied the above algorithm, 111 of the 8820 ratings given by the 21 subjects were flagged as incorrect. According to the above recommendation, the maximum number of incorrect ratings allowed is 21, which is 5% of the 420 images. These 111 flagged ratings were excluded when calculating the subjects' average behavior over all 420 image sets.
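The exclusion step can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes the flagged ratings are already available as a boolean mask, it reads the 5% limit as a per-subject threshold, and the function name `mos_without_outliers` is hypothetical.

```python
import numpy as np

def mos_without_outliers(ratings, flagged, max_fraction=0.05):
    """Average the ratings per image while ignoring flagged ratings.

    ratings: (subjects, images) array of opinion scores.
    flagged: boolean array of the same shape, True where a rating was
             flagged as incorrect (the flagging rule itself is not shown).
    """
    ratings = np.asarray(ratings, dtype=float)
    flagged = np.asarray(flagged, dtype=bool)
    n_images = ratings.shape[1]

    # Assumed reading of the 5% rule: a per-subject limit on flagged
    # ratings (5% of 420 images gives 21).
    max_flagged = max_fraction * n_images
    flagged_per_subject = flagged.sum(axis=1)
    exceeds_limit = flagged_per_subject > max_flagged

    # Mean opinion score per image over the ratings that were not flagged.
    valid = ~flagged
    counts = valid.sum(axis=0)
    sums = np.where(valid, ratings, 0.0).sum(axis=0)
    mos = np.divide(sums, counts,
                    out=np.full(n_images, np.nan), where=counts > 0)
    return mos, flagged_per_subject, exceeds_limit
```

An image whose ratings are all flagged gets `nan` rather than a misleading zero.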

A score of 5 corresponds to the best image with few perceptible distortions, and 0 corresponds to the worst image with many visible distortions. Since this study included only LDR images, we calculated only the mean opinion score (MOS), not the DMOS. The average subjective scores ranged from 1.6 to 3.6, which suggests that the tone mapping operators fail for some of the images and do not produce perceptually better images.

Because we included many different lighting conditions and natural scenes, the average scores were on the order of 3. This makes it possible to check which tone mapping operators work well for a given type of lighting and scene.

Objective Scores

QAC and SBIQE produce scores whose range and spread are similar to those of the subjective scores. In our experiment, the maximum average objective score produced by FSIM was 3.9676, for FerwerdaTMO. However, the average subjective opinion score for FerwerdaTMO was 1.8786, one of the lowest average opinion scores given by the subjects.

This suggests that, within an acceptable range, structural similarity was not that important to the subjects. FSIM also compared well with other TMOs. There is therefore a need to learn how humans respond to particular changes in the structure of tone-mapped images, so that a change can be assessed as good or bad in the cases where the subjective judgments are significant. PSNR is a full-reference metric that uses the mean squared error (MSE) between the reference and test images for objective evaluation.
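For reference, PSNR follows directly from the MSE as 10·log10(peak² / MSE). The sketch below uses the standard definition; the 8-bit peak value of 255 is an assumption, since the source does not state the bit depth of the images.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    if reference.shape != test.shape:
        raise ValueError("reference and test images must have the same shape")

    # Mean squared error over all pixels (and channels, if present).
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Because PSNR depends only on the pixel-wise error, two outputs with the same MSE receive the same score regardless of where the error falls in the image.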

But the mean subjective score for this TMO (2.0214) was one of the lowest subjective scores. We can conclude that MSE does not play an important role in the evaluation of the TMOs.

Table 3.1: Correlation with Objective scores.

Conclusions of Subjective and Objective Evaluation of Tone-Mapping Algorithms

Objective Evaluation

Conclusions

The objective evaluation of the TMOs motivated us to formulate a new objective quality metric to assess the performance of the tone-mapping algorithms. The moderate correlations of the full-reference and no-reference algorithms suggest that these algorithms fail in some cases. This shows that the full-reference and no-reference algorithms capture different information when evaluating these algorithms.
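The correlation between objective and subjective scores can be computed as sketched below. The source does not say which correlation coefficient was reported, so both the Pearson (linear) and Spearman (rank-order) coefficients are shown as assumptions; the function names are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

The rank-order coefficient only checks whether a metric orders the images the way the subjects did, which suits metrics whose output scale differs from the opinion scale.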

So if we can gather more information by combining the parameters of these no-reference and full-reference algorithms, we can form a good objective quality assessment metric. Such a metric faces several challenges, such as the naturalness of the color information in the algorithm output and the structural change relative to the reference image. Most current state-of-the-art algorithms do not include color information when assessing quality.

Although the subjects have no information about the structure of the reference image, they have learned a good knowledge of natural structures. But even if the structural information is preserved, the preservation should be given a lower weight if the naturalness of the structure is lost.

Parameter Selection

Structural Similarity

The subjective evaluation was conducted without a reference, but a reference image is available for each algorithm output. From observation, we can use structural information when evaluating these algorithms, because a good algorithm should preserve as much structural information as possible. This shows how important it is to find the locations where the structural information is preserved and the structures are natural.

Naturalness

CNN based visual model

Objective Metric

The De-Ghosting algorithms use multiple images with different exposure values to compute an image with enhanced visual detail. If the alignment of the dynamic scene is imperfect, ghosts can appear. The performance of a De-Ghosting algorithm depends on two parameters: the improvement in visual content and the amount of ghosting present in the algorithm output.

There is no standard database available to test the performance of different De-Ghosting algorithms. Subjective evaluation of these De-Ghosting algorithms is of utmost importance to capture the state-of-the-art performance of these algorithms and provide a benchmark for the research community to develop new De-Ghosting algorithms. The pixel spread of these visible distortions depends on the resolution of the input image.

Very few De-Ghosting algorithms are publicly available to the research community for academic purposes. We conducted an extensive search for recently published state-of-the-art De-Ghosting algorithms in order to select the algorithms and the parameters that give their best performance.

Database

Challenges to algorithms

If the scene is evenly lit, the details captured at different exposure values will not differ much. If the illumination differs across the scene, each exposure contains different information. Depending on the exposure value, the camera favors either the brightly lit areas, which leaves the low-light regions completely dark with no visual detail, or the low-light areas, which captures detail in the darker regions but saturates the bright areas.

Thus, the input scenes are captured in such a way that they cover most lighting conditions, such as evenly lit areas and highly dynamic scenes. We have also added some standard images with a resolution of 1920×1080 from a database used in the evaluation of most standard De-Ghosting algorithms.

Selection of reference

Experimental Setup and GUI

Subjective Evaluation

Objective evaluation of these algorithms is necessary because subjective evaluation is very time-consuming and the accuracy of the subjects is very low. No algorithm designed specifically to assess the quality of these images objectively was found in the literature. The best-matching existing algorithms were therefore selected to objectively evaluate the performance of the De-Ghosting algorithms.

Objective Metric Selection

Results and Discussion

It appears that an objective rating based on rarity best captures the naturalness of the De-Ghosted image. The overall performance score depends not only on naturalness but also on the improvement in visual detail, that is, the improvement in dynamic range. It is therefore necessary to find a metric for the dynamic range score and then combine it with the naturalness score to calculate the overall performance of the algorithm.

The correlations with the subjective scores show that a dedicated objective quality assessment algorithm is needed to determine the performance of the algorithms. This is evident from the fact that the correlation between the subjective naturalness scores and the existing objective algorithms is only moderate. The parameters to take into account are naturalness and the improvement in dynamic range, combined into an overall performance score.

The objective algorithm must also be perceptually motivated so that it matches the subjective scores, because the end users of the algorithms are humans. The current best-performing algorithm, SBIQE, is statistical rather than learning based, which inspires us to use parsimony as one of the features in formulating a perceptual, learning-based algorithm.

Relative Naturalness

The algorithm outputs contain a lot of ghosting, which is a very new type of distortion. The color information is also heavily modified, so the metric should consider color information when evaluating performance. Additional features should therefore be included to capture color information and ghost artifacts.

The amount of distortion present in the inputs to the algorithm can be calculated by considering only the reference or the entire LDR stack. The details present in the reference are very important, and the reference image contains most of the information about the scene, so the relative naturalness with respect to the reference was determined. The correlation of relative naturalness, found using only rarity as a feature, was 0.52, which is higher than the 0.48 obtained with the naturalness score alone.

The improvement is small because most of the input reference images are highly natural, so taking the reference into account does not change the naturalness score much.

Relative Dynamic range

Thus, the dynamic range feature can be used, and a visual model should be used to calculate the effective improvement in dynamic range in order to improve the correlation with the subjective scores. The overall dynamic range of the image is calculated by averaging local dynamic ranges computed over local windows of different sizes. For the resolution of the images used, an empirical window size of 27×27 was chosen.
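The averaging of local dynamic ranges can be sketched as below. The source does not define the local dynamic range, so this sketch assumes the log10 ratio of the maximum to the minimum luminance in each window, and it assumes non-overlapping windows; both choices, and the function name, are illustrative only.

```python
import numpy as np

def mean_local_dynamic_range(luminance, window=27, eps=1e-6):
    """Average of per-window dynamic ranges over a 2-D luminance image."""
    lum = np.asarray(luminance, dtype=np.float64)
    rows, cols = lum.shape
    if rows < window or cols < window:
        raise ValueError("image is smaller than the window")

    scores = []
    # Non-overlapping windows; partial windows at the borders are skipped.
    for r in range(0, rows - window + 1, window):
        for c in range(0, cols - window + 1, window):
            tile = lum[r:r + window, c:c + window]
            # Assumed definition: log10 of the max/min luminance ratio.
            # eps guards against division by zero in black regions.
            scores.append(np.log10((tile.max() + eps) / (tile.min() + eps)))
    return float(np.mean(scores))
```

A relative score could then be formed by comparing the value for the algorithm output with the value for the reference exposure, although the exact form of that comparison is not given in the source.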

The visible improvements in Figure 7.1 are greater than those in Figure 7.3, as the dynamic range scores listed below the figures show.

Differentiation between Input LDR Images using MSCN Coefficients

Learning Overall Performance Score of Algorithms

An accuracy of 0.96 was achieved, indicating that the overall performance score is a combination of the relative naturalness score and the relative improvement score.
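One simple way to learn such a combination is a linear fit of the subjective scores on the two relative scores. The source does not state which learner was used, so ordinary least squares is swapped in here purely for illustration, and all names are hypothetical.

```python
import numpy as np

def fit_overall_score(naturalness, dynamic_range, subjective):
    """Least-squares weights (bias, w_naturalness, w_dynamic_range)."""
    n = np.asarray(naturalness, dtype=float)
    d = np.asarray(dynamic_range, dtype=float)
    s = np.asarray(subjective, dtype=float)
    # Design matrix with a constant column for the bias term.
    X = np.column_stack([np.ones(len(n)), n, d])
    weights, *_ = np.linalg.lstsq(X, s, rcond=None)
    return weights

def predict_overall_score(weights, naturalness, dynamic_range):
    """Apply the fitted weights to new relative scores."""
    n = np.asarray(naturalness, dtype=float)
    d = np.asarray(dynamic_range, dtype=float)
    return weights[0] + weights[1] * n + weights[2] * d
```

In practice the weights would be fitted on one part of the database and checked on held-out images, so that the reported agreement is not inflated by overfitting.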

Overall Algorithm performance Score

Pre-trained models can also be used to find local features, which are then combined into larger features from which an overall performance score is calculated. Instead of averaging, a more efficient way to calculate the overall performance score can be formulated.