5.3.3 Results of prediction of images with annotations
The YOLO model (Gordon et al., 2018) was pre-trained on the ImageNet dataset. Further, approximately 100,000 images with annotations were considered from the MSCOCO dataset. After pre-processing, the images were randomly split into 80% for training and 20% for validation. The dataset comprised 80 object categories and 91 stuff categories, and each image was associated with five annotations. The model contained 24 convolution layers followed by two fully connected layers. The last layer, i.e., the output layer, computed the probability of each object belonging to a particular class together with the coordinates of the bounding boxes of the objects present in an image. The prediction of this model is illustrated in Figure 5.6.
Figure 5.6: Prediction of annotated images with coordinates of bounding boxes
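For illustration, the following is a minimal sketch of how a YOLO-style output tensor can be decoded into class probabilities and bounding-box coordinates. It assumes the original grid formulation (S x S cells, B boxes per cell, and the 80 MSCOCO object categories) and a hypothetical confidence cut-off; it is not the exact post-processing pipeline used in this study.

```python
import numpy as np

S, B, C = 7, 2, 80        # grid size, boxes per cell, MSCOCO object classes (assumed)
CONF_THRESHOLD = 0.25     # hypothetical confidence cut-off

def decode_yolo_output(output, img_w, img_h):
    """Decode a YOLO-style output tensor of shape (S, S, B*5 + C) into
    bounding boxes with class labels and confidence scores."""
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]               # assumed layout: B*(x, y, w, h, obj) then C class probabilities
            class_id = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, obj = cell[b * 5:(b + 1) * 5]
                score = obj * class_probs[class_id]  # class-specific confidence
                if score < CONF_THRESHOLD:
                    continue
                # x, y are offsets within the cell; w, h are relative to the whole image
                cx = (col + x) / S * img_w
                cy = (row + y) / S * img_h
                bw, bh = w * img_w, h * img_h
                boxes.append((class_id, float(score),
                              cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes

# Example with a random tensor standing in for a real network output
prediction = np.random.rand(S, S, B * 5 + C)
print(decode_yolo_output(prediction, img_w=640, img_h=480)[:3])
```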
Table 5.14: Relevance score between question and description in annotated image-based pattern of creative responses
Question-response ID Relevance score
Q1-1 0.845
Q1-2 0.898
Q1-3 0.852
Q1-4 0.868
Q1-5 0.928
Q1-6 0.870
Q1-7 0.943
Q1-8 0.880
Q1-9 0.904
Q1-10 0.858
The test set of images with annotations was combined from two different sources, viz. the MSCOCO dataset and data scraped from Google. Sufficient training and test data were available from the MSCOCO dataset, but scraped data with the same schema were also supplied to the model to verify whether it could perform effectively on random test samples possessing similar types of labels (Cesa-Bianchi et al., 2010). All samples were pre-processed before evaluation. Of the test results displayed in this study, the first 50% come from the MSCOCO test dataset and the remaining 50% from the Google-scraped test dataset.
5.3.4 Relevance score between question and annotated image-based pattern of creative responses
A major decision concerned the relevance score: a specific cut-off was needed below which responses would be treated as irrelevant and discarded from further evaluation of novelty. Precisely, a threshold was required that could demarcate relevant from irrelevant responses. Such a threshold usually depends on how lenient or stringent the experts choose to be (Aubin et al., 2018). Often, if a question paper is too difficult for a cohort of students, experts attempt to keep the evaluation lenient, and vice versa. More generally, the evaluation threshold depends on the type and level of the examination. There may be multiple types of examination, such as national, state, and institutional, and evaluation in each type depends on its individual criteria. The level of an examination depends on its difficulty, which may be broadly classified as very easy, easy, moderate, hard, and very hard (Park et al., 2017). Any threshold in an examination, whether it determines the success or failure of students or the evaluation of a particular factor, can be identified through longitudinal studies (de Vergara & Olmos, 2019). Over the years, one may study the pattern of student responses, the type and level of examination, the level of questions, etc., and thereby analyse and predict a threshold for evaluation. However, there are other techniques as well by which one may decide upon a threshold value.
In this context, a threshold was essential to demarcate relevant from irrelevant responses. To determine it, experts' scores of the relevance between the question and the annotated image-based pattern of creative responses were collected and treated as the gold standard. These were compared with the scores generated by the cosine similarity function, and the F-measure was calculated to evaluate model performance.
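As a minimal sketch of such a relevance score, the example below computes the cosine similarity between a hypothetical question and a hypothetical annotation description. TF-IDF vectors are used here purely as a stand-in for whatever text representation was used in the study; the texts themselves are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "Draw and describe an unusual use of a paper clip."    # hypothetical question
description = "The paper clip is bent into a small phone stand."  # hypothetical annotation

# Vectorize both texts in a shared TF-IDF space and take the cosine similarity
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([question, description])
relevance = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Relevance score: {relevance:.3f}")
```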
Thresholds from 0.3 to 0.9 were considered, and the corresponding F-measures were computed for the question-description, question-image, and image-description pairs. Sample outcomes are shown in Table 5.15. For most question-response pairs, a threshold of 0.3 achieved the highest F-measure. In a few cases a threshold of 0.4 also yielded higher F-measure values, but on manual verification it was observed that approximately 10% of relevant responses were then categorized as irrelevant, which might lead to frustration among students (Penumatsa et al., 2006). Therefore, 0.3 was adopted as the threshold for filtering out irrelevant responses before the evaluation of novelty.
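As an illustration of this selection procedure, the sketch below sweeps candidate thresholds from 0.3 to 0.9 over hypothetical expert judgements and cosine similarity scores and reports the F-measure at each threshold; the data and variable names are illustrative assumptions rather than the study's actual records.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical data: expert judgements (1 = relevant, 0 = irrelevant) and
# cosine similarity scores between question and description for each response.
expert_labels = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
cosine_scores = np.array([0.85, 0.90, 0.25, 0.87, 0.33, 0.93, 0.88, 0.41, 0.90, 0.86])

best_threshold, best_f1 = None, -1.0
for threshold in np.arange(0.3, 1.0, 0.1):
    predicted = (cosine_scores >= threshold).astype(int)  # relevant if score >= threshold
    f1 = f1_score(expert_labels, predicted, zero_division=0)
    print(f"threshold={threshold:.1f}  F-measure={f1:.3f}")
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"Selected threshold: {best_threshold:.1f} (F-measure {best_f1:.3f})")
```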
Table 5.15: Threshold and corresponding F-measure of annotated image-based pattern of creative responses
Threshold   F-measure (question-description)   F-measure (question-image)   F-measure (image-description)
0.3         0.783                              0.963                        0.833
0.4         0.667                              1.000                        0.727
0.5         0.683                              0.917                        0.353
0.6         0.607                              0.670                        0.364
0.7         0.367                              0.503                        0.375
0.8         0.450                              0.234                        0.287
0.9         0.500                              0.310                        0.200
5.3.5 Results of language processing of annotations in responses
Descriptions in the annotated image-based pattern of creative responses required spelling and grammatical error checking, termed language processing in this context. Although language processing is not directly associated with novelty, it supports the identification of novelty in students' write-ups. Therefore, it is essential in an examination to process the text and identify errors associated with spelling and grammar. Language conveys the sense of a concept and the understanding of novelty to its target audience. An online language processing tool (LanguageTool - Online Grammar, Style & Spell Checker, 2019) was used to identify multiple categories of grammatical errors, such as non-conformance of sentences, duplication, typographical errors, and misspellings. The tool returned a .json file containing the detected mistakes and scores, from which a normalized language score between 0 and 1 was generated. The .json file returned by the tool is illustrated in Figure 5.7 (Chaudhuri et al., 2021b).
Figure 5.7: Possible errors in annotation
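For illustration, the sketch below queries the public LanguageTool HTTP endpoint and derives a normalized language score from the returned matches. The normalization, one minus the ratio of detected issues to word count, is an assumption made for this example and may differ from the scoring used in the study.

```python
import requests

def language_score(text):
    """Check text with the public LanguageTool HTTP endpoint and return a
    normalized language score in [0, 1].
    Assumption: score = 1 - (number of detected issues / number of words)."""
    response = requests.post(
        "https://api.languagetool.org/v2/check",
        data={"text": text, "language": "en-US"},
        timeout=10,
    )
    response.raise_for_status()
    matches = response.json().get("matches", [])  # one entry per detected issue
    words = max(len(text.split()), 1)
    return max(0.0, 1.0 - len(matches) / words)

print(language_score("He go to school every days."))
```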
5.3.6 Results of clustering annotated image-based pattern of creative responses and novelty scores
Responses that were considered relevant and had undergone language processing were then clustered based on their semantic similarity. However, since each response was divided into two parts, image and annotation, a common representation space was required. Therefore, multi-modal joint embedding was used to represent the image predictions and annotations in a unified space; a deep CNN combined with an LSTM recurrent network was used for the joint image-annotation embedding (Kiros et al., 2014). The jointly embedded responses were then clustered using the K-means algorithm. However, in this case K-means produced inconsistent results in each run: the number of clusters as well as their densities varied on repeated execution. Therefore, the BIRCH clustering algorithm was used to obtain a stable set of clusters (Lorbeer et al., 2018); it is effective on large datasets and, unlike K-means, can handle outliers (Zhang et al., 1997). The clusters formed after execution of both algorithms are illustrated in Tables 5.16 and 5.17. Based on the density of the clusters, relative uniqueness scores were computed: the uniqueness score of a response is 1 minus the number of creative responses in its cluster divided by the total number of creative responses to the question. The algorithmically computed novelty scores are shown in Table 5.18.
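A minimal sketch of this step is given below, using scikit-learn's KMeans and Birch on hypothetical joint embeddings and then computing the relative uniqueness score defined above; the embedding values, dimensionality, and cluster count are placeholders rather than the study's configuration.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

# Hypothetical joint embeddings: one row per creative response
rng = np.random.default_rng(seed=42)
embeddings = rng.normal(size=(16, 128))

# K-means can yield different partitions across runs when initialization changes
kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(embeddings)

# BIRCH builds a CF-tree and gives a stable partition for the same data
birch_labels = Birch(n_clusters=3).fit_predict(embeddings)

# Relative uniqueness: 1 - (cluster size / total number of responses)
total = len(birch_labels)
cluster_sizes = {label: int((birch_labels == label).sum()) for label in set(birch_labels)}
uniqueness = [1 - cluster_sizes[label] / total for label in birch_labels]
print(birch_labels, [round(u, 3) for u in uniqueness])
```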
Table 5.16: Clusters at multiple runs of K-means algorithm for image-based creative responses
Run sequence Clusters
Run 1 [0 0 0 0 0 0 0 0 1 1]
Run 2 [0 0 0 0 0 0 1 1 2 2]
Run 3 [0 0 0 0 0 0 0 0 0 1]
Table 5.17: Clusters using BIRCH algorithm for image-based creative responses
Run sequence Clusters
Run 1 [0 1 1 0 1 0 2 0 0 2 3 0 1 1 1 3 2 3 3 0]
Run 2 [0 1 1 0 1 0 2 0 0 2 3 0 1 1 1 3 2 3 3 0]
Run 3 [0 1 1 0 1 0 2 0 0 2 3 0 1 1 1 3 2 3 3 0]
Table 5.18: Normalized novelty score for annotated image-based pattern of creative responses
Response ID   Cluster   Relevance score   Uniqueness score   Language score   Normalized novelty score
Q1-1          0         0.467             0.467              1.0              0.644
Q1-2          0         0.427             0.467              1.0              0.631
Q1-3          1         0.835             0.667              1.0              0.834
Q1-4          0         0.327             0.467              0.929            0.574
Q1-5          2         0.330             0.867              1.0              0.732
Q1-6          0         0.339             0.467              0.917            0.574
Q1-7          1         0.603             0.667              1.0              0.757
Q1-8          1         0.616             0.667              1.0              0.760
Q1-9          1         0.340             0.667              1.0              0.669
Q1-10         1         0.690             0.667              1.0              0.786
Q1-11         0         0.497             0.467              1.0              0.655
Q1-12         2         0.345             0.867              1.0              0.737
Q1-13         0         0.367             0.467              1.0              0.611
Q1-14         0         0.374             0.467              1.0              0.614
Q1-15         0         0.467             0.467              1.0              0.644
Q1-16         1         0.605             0.667              1.0              0.757
The displayed test results consider a single question with sixteen corresponding annotated image-based pattern of creative responses. Three clusters were formed, viz. 0, 1, and 2, containing 8, 6, and 2 elements, respectively. The novelty score was measured by the summative assessment (Chaudhuri et al., 2020, 2021b) of the relevance score, language score, and uniqueness score, and was further normalized to the range 0 to 1.
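For reference, the combination below approximately reproduces the reported values under the assumption that the three component scores are equally weighted and averaged, e.g. for Q1-1: (0.467 + 0.467 + 1.0) / 3 is about 0.64.

```python
def normalized_novelty(relevance, uniqueness, language):
    """Combine the three component scores (each already in [0, 1]) into a
    normalized novelty score; equal weighting is assumed here."""
    return (relevance + uniqueness + language) / 3

# 0.645 with the rounded inputs; Table 5.18 reports 0.644, presumably from unrounded scores
print(round(normalized_novelty(0.467, 0.467, 1.0), 3))
```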