
5. Machine Learning: Sentence Classification

5.3 Results

5.3.1 Model Evaluation

I used three metrics to evaluate the fine-tuned model’s performance on the test data (recall that the training data was used to fine-tune the model): precision, recall, and micro-F1.[110] These metrics are built on the concepts of true positives, true negatives, false positives, and false negatives. Table 12 illustrates these concepts; it shows the ground-truth and predicted frames for four sentences in the training and testing datasets. The first sentence is from the opponent party brief in Bray v. Alexandria (1993), and it received two “protect life” frames: “protect the lives of women” and “protect the lives of the unborn.” However, the model only predicted the “protect the lives of the unborn” frame. This is an example of a false negative or Type II error; this occurs when a classifier misses a frame that exists in the sentence.

[110] For an overview of metrics used to evaluate multiclass classification models, see Grandini et al. (2020).

The second sentence in Table 12 is from an amicus brief supporting the feminist party in Bray v. Alexandria (1993) that does not contain any “protect life” frames (i.e., it was classified as “No frame”), but the model predicted that the sentence contained a “protect the lives of patients” frame. This is an example of a false positive or Type I error, and it occurs when a classifier predicts a frame that does not exist in the sentence.

The final sentence comes from an amicus brief filed in support of the feminist party in Madsen v. Women’s Health Center (1994), and the model correctly predicted the existence of the “protect the lives of patients” frame. When a classifier correctly predicts the existence of a frame, it is referred to as a true positive. Similarly, a true negative occurs when a classifier correctly predicts the non-existence of a frame. This can be observed in the fourth sentence, an amicus brief supporting the opponents in McCullen v. Coakley (2014), where the classifier correctly predicts that “protect the lives of patients” does not exist in the sentence.
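To make these four outcomes concrete, the sketch below encodes a hypothetical version of the four Table 12 sentences as binary indicator vectors (one column per frame, restricted to the three frames discussed above) and tallies each outcome per frame. The encoding is an illustrative assumption, not the thesis’s actual data or code.

```python
# Hypothetical encoding of the four Table 12 sentences, restricted to three
# frames for illustration; 1 = frame present, 0 = frame absent.
import numpy as np

frames = ["patients", "unborn", "women"]

y_true = np.array([
    [0, 1, 1],   # sentence 1: "unborn" and "women" frames present
    [0, 0, 0],   # sentence 2: no frame
    [1, 0, 0],   # sentence 3: "patients" frame present
    [0, 0, 1],   # sentence 4: "women" frame present
])
y_pred = np.array([
    [0, 1, 0],   # sentence 1: model finds "unborn" but misses "women"
    [1, 0, 0],   # sentence 2: model wrongly predicts "patients"
    [1, 0, 0],   # sentence 3: model correctly predicts "patients"
    [0, 0, 1],   # sentence 4: model correctly predicts "women"
])

for j, frame in enumerate(frames):
    t, p = y_true[:, j], y_pred[:, j]
    tp = int(((t == 1) & (p == 1)).sum())  # frame present and predicted
    fp = int(((t == 0) & (p == 1)).sum())  # frame predicted but absent (Type I)
    fn = int(((t == 1) & (p == 0)).sum())  # frame present but missed (Type II)
    tn = int(((t == 0) & (p == 0)).sum())  # frame absent and not predicted
    print(f"{frame}: TP={tp} FP={fp} FN={fn} TN={tn}")
```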

Precision is the number of true positives for a certain frame divided by all of the positive predictions belonging to that frame. In other words, precision measures how often the model’s predictions of a particular frame are actually correct.

\[
\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}
\]

For example, the number of times the model predicted the “protect the lives of patients” frame in Table 12 is two, but only one was a correct prediction. Therefore, the precision for the patients frame would be 1/2 = 0.5 = 50%.

Recall is the number of true positives for a given frame divided by all of the actual instances of that frame. Practically, recall measures the model’s ability to find all of the sentences in a dataset that contain the frame.

\[
\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}
\]

For example, the classifier missed the “protect the lives of women” frame in the first sentence in Table 12, but it correctly predicted it in the fourth sentence. Therefore, with one true positive and one false negative, the recall for the women frame would be 1/2 = 0.5 = 50%.
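The same hypothetical encoding can be scored with scikit-learn (an assumed tooling choice for illustration): with average=None, precision_score and recall_score return one value per frame, reproducing the 50% precision for the patients frame and the 50% recall for the women frame worked out above.

```python
# Per-frame precision and recall on the hypothetical Table 12 encoding
# (columns: patients, unborn, women).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([[0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]])

# average=None -> one score per frame (column) instead of a single aggregate.
print(precision_score(y_true, y_pred, average=None, zero_division=0))
# -> [0.5 1.  1. ]   patients precision = 1 TP / (1 TP + 1 FP) = 0.5
print(recall_score(y_true, y_pred, average=None, zero_division=0))
# -> [1.  1.  0.5]   women recall = 1 TP / (1 TP + 1 FN) = 0.5
```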

While an in-depth discussion of the trade-off between precision and recall is outside the scope of this paper, it is important to note that typically there is a negative relationship between precision and recall; that is, when one metric increases the other decreases. Because precision only considers observations that were classified as a particular frame, it is a useful metric when we only care about the predicted target frame being correctly classified. On the other hand, recall is useful when we are interested in the model correctly classifying the sentences that actually contain a particular frame.

The F1-score is the harmonic mean of precision and recall for each frame. Given that most researchers want to balance precision and recall, the F1-score is typically the preferred metric, and it is calculated using the following formula:

\[
\text{F1-score} = \frac{2}{\text{Precision}^{-1} + \text{Recall}^{-1}} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
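As a worked check on this formula, consider the Table 12 example: the precision for the patients frame was 0.5 and, because its single true instance was found, its recall in that example is 1.0, so

\[
\text{F1}_{\text{patients}} = 2 \cdot \frac{0.5 \cdot 1.0}{0.5 + 1.0} = \frac{1.0}{1.5} \approx 0.667 = 66.7\%
\]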

When evaluating the overall performance of a multilabel classifier, the F1-score must take all of the frames into account. The micro F1-score accomplishes this by counting true positives, false positives, and false negatives globally, that is, pooled across all of the frames.
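The sketch below illustrates that pooling on the same hypothetical Table 12 encoding: true positives, false positives, and false negatives are summed across every frame before the metrics are computed, and the manual result matches scikit-learn’s average="micro" option (again an assumed tooling choice, not the thesis’s code).

```python
# Micro-averaging: pool TP, FP, and FN across all frames, then compute
# precision, recall, and F1 from the pooled counts.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # pooled over every frame
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

precision_micro = tp / (tp + fp)
recall_micro = tp / (tp + fn)
f1_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

print(f1_micro)                                   # 0.75 for this toy example
print(f1_score(y_true, y_pred, average="micro"))  # same value from scikit-learn
```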

Table 13 shows the evaluation metrics[111] for all six frames and the overall performance of the fine-tuned Legal-BERT for multilabel sentence classification. The micro-averaged precision, recall, and F1-scores for the overall model are all 87.53%. Impressively, the results indicate that the fine-tuned model outperforms the original Legal-BERT in terms of its micro F1-score.[112] The precision, recall, and F1-score for the “No frame” classification are 96.61%, 77.03%, and 85.71%, respectively. The precision, recall, and F1-score for the “Patients” frame are 88.37%, 92.68%, and 90.48%, respectively. The “Unborn” frame has the highest recall (100.00%) and F1-score (98.90%) among all of the frames, and it has a precision of 97.83%. The “Public” frame has the highest precision (100.00%) among all of the frames, and its recall and F1-score are 90.79% and 95.17%, respectively. The precision for the “Women” frame is 75.68%. Moreover, the “Women” frame received the lowest recall (71.79%) and F1-score (73.68%). Finally, the “Workers” frame received the lowest precision score (66.67%), and its recall and F1-score are 90.57% and 76.80%, respectively.

[111] See the Appendix for the confusion matrix for all six labels.

[112] Chalkidis et al. (2020) reported a maximum F1-score of 59.2 for the multilabel task.

Protect life frame             Precision   Recall     F1-score
Fine-tuned Legal-BERT [113]     87.53%      87.53%     87.53%
No frame                        96.61%      77.03%     85.71%
Patients                        88.37%      92.68%     90.48%
Unborn                          97.83%     100.00%     98.90%
Public                         100.00%      90.79%     95.17%
Women                           75.68%      71.79%     73.68%
Workers                         66.67%      90.57%     76.80%

Table 13. Performance results of fine-tuned Legal-BERT
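For reference, a per-frame report in the shape of Table 13 can be generated with scikit-learn’s classification_report. The sketch below borrows the six frame names from the table but uses entirely made-up ground-truth and prediction arrays, so its output numbers are placeholders rather than the reported results.

```python
# Sketch of a Table 13-style report; the label arrays are random placeholders.
import numpy as np
from sklearn.metrics import classification_report

frames = ["No frame", "Patients", "Unborn", "Public", "Women", "Workers"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, len(frames)))  # placeholder ground truth
y_pred = rng.integers(0, 2, size=(100, len(frames)))  # placeholder predictions

# Prints precision, recall, and F1 per frame plus micro/macro averages.
print(classification_report(y_true, y_pred, target_names=frames, zero_division=0))
```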