7 out of 10 responses generated by the FTER model were preferred by most of the annotators. The scoring and labeling accuracy of the PERMA module was also tested within the unit tests. This includes the definition and background of the wellness labels used in the study, links to the chatbot and the Google Forms, their access schedule, and their VHope login account.
All users' access and conversations, including their PERMA scores, were logged and collected at the end of the user test.
PTIQ
Another participant noted that they can sometimes share with a chatbot when they feel anxious or sad. One user, however, stated that he/she did not use chatbots, as they are not useful.
SWAP
Since 95% of the participants have experience using a chatbot, they were further asked how they feel when talking to chatbots. Most of the comments were negative: talking to chatbots is uncomfortable and very limiting because they are scripted and repetitive, and they seem unnatural and sometimes weird and not human, since users already know they are talking to a chatbot. One participant, however, said they like chatbots because they are designed to follow rules that prevent them from hurting the feelings of the human they are talking to.
The SWAP results from the pre-tests were first used to measure the accuracy of the automated PERMA module against the manual questionnaire used in the assessment form. Regarding the original objective of the SWAP, which was to monitor changes in user well-being between the beginning and end of the experiment, the results are shown in Table 6.5, with 25 rows for the 25 participants. Of the 11 users with a retained label, 7 actually had a small increase in their overall wellness score, but still within the range of the same wellness label.
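As an illustration of how such a label-agreement measure could be computed, the following sketch compares automated PERMA labels against the labels derived from the SWAP questionnaire for the same participants. The label values and function name here are hypothetical, not taken from the study's actual implementation.

```python
# Hypothetical sketch: fraction of participants whose automated PERMA
# label matches the label derived from the SWAP questionnaire.
# Label values below are invented for illustration.

def label_accuracy(auto_labels, manual_labels):
    """Return the fraction of matching label pairs."""
    assert len(auto_labels) == len(manual_labels)
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(auto_labels)

auto = ["flourishing", "moderate", "languishing", "moderate"]
manual = ["flourishing", "moderate", "moderate", "moderate"]
print(label_accuracy(auto, manual))  # 3 of 4 labels agree -> 0.75
```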
Shown in Figures 6.12 and 6.13 are the histograms for both Model-S and Model-H, to better visualize the different ratings of each user for each statement in the performance criteria. VHope's going out of context, as noted under the performance criteria, also affected the evaluation of the humanity criteria, with another user saying that it is impossible to empathize with VHope as it always changes the topic of the conversation. Affect checks whether the system was able to read and respond to the user's moods.
These results gave Model-S an overall mean score of 3.44, slightly higher than its overall scores in the other two criteria. As with the other two criteria, there is no significant difference between the overall mean score of Model-S (3.44) and that of Model-H in this evaluation.
Expert Evaluation of Logs
PERMA labeling
Another task of the experts was to evaluate the perceived PERMA labels in the logs of each conversation. This was done by asking the experts to comment on whether or not the displayed PERMA label was appropriate in each part of the conversation, based on its placement and topic. Of the first 96 logs evaluated by the experts, only 43 were returned with comments related to the PERMA labels.
Note that logs in which no PERMA label was computed, as well as those in which more than one PERMA label was displayed, were both included. There is no exact count of how many computations and PERMA labels are present in each conversation log. Even with the low accuracy, the experts commented that the PERMA labeling was able to adapt and detect correctly during the conversation.
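A per-log count of displayed labels, of the kind the paragraph above notes is missing, could be obtained with a simple scan over the logs. The tag format and function below are hypothetical, since the study's actual log format is not specified here.

```python
import re

# Illustrative sketch (not the study's actual log format): count how
# many PERMA labels appear in a conversation log, so that logs with
# zero labels or more than one label can be flagged for expert review.
PERMA_TAG = re.compile(r"\[PERMA:(\w+)\]")

def count_perma_tags(log_text):
    """Return the number of PERMA tags found in one log."""
    return len(PERMA_TAG.findall(log_text))

log = "User: I had fun today. [PERMA:P] Bot: That is great! [PERMA:E]"
print(count_perma_tags(log))  # -> 2
```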
Comparing Model-S and Model-H
The rather low percentage in the performance criteria of both models is likely due to the seemingly irrelevant responses caused by the training data fed to the generative model. This data comes from the Wellbeing Conversations, in which the chatbot Abot continually provided brief summaries and constantly shared parables and movie stories. A topic shared by the user in an early phase of the conversation (the introduction phase) is answered only in a later phase (the listening phase). This may confuse the user, as VHope appears to have switched to another topic.
To further explore the difference in performance of the two models, Figure 6.21 presents a simple comparison of the generated responses. Both models were able to ask a follow-up question about the pink input “I went to the park with my dog,” and both were able to detect the correct PERMA wellness label for the green input “I feel so happy and excited.”
Apart from the effect of the input modeling on the unrelated responses generated by the model, the model itself had the greatest effect. In generating responses, DialoGPT uses transformer-decoder blocks that output one token at a time, each appended to the input sequence. The idea of using a language model in this study is to generate a new empathic statement related to the input, but the conversation logs showed that the model instead generates a statement that already answers the input, composed from the guide response to the user input.
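The token-by-token decoding described above can be illustrated with a toy greedy-decoding loop. The bigram lookup table below is only a stand-in for the real network's next-token predictions; it is not DialoGPT itself, just a minimal sketch of the autoregressive mechanism.

```python
# Toy illustration of autoregressive (token-by-token) decoding, the
# mechanism used by DialoGPT's transformer-decoder blocks. The bigram
# table below is a stand-in for the real model's next-token scores.
NEXT_TOKEN = {
    "<bos>": "i",
    "i": "understand",
    "understand": "how",
    "how": "you",
    "you": "feel",
    "feel": "<eos>",
}

def greedy_decode(start="<bos>", max_len=10):
    tokens = [start]
    while len(tokens) < max_len:
        nxt = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if nxt == "<eos>":  # stop when the end-of-sequence token is predicted
            break
        tokens.append(nxt)  # each new token is appended to the sequence
    return " ".join(tokens[1:])

print(greedy_decode())  # -> i understand how you feel
```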
Comparison to Pure Retrieval-based model
- Performance
 - Humanity
 - Affect
 - Overall Design
 
Experts, however, praised VHope's ability to adapt its answers to the respondents. Looking at the respondents' response patterns, VHope did not seem to understand much when the responses were short or consisted of a single word. This happened multiple times without any acknowledgment of the information shared or any form of confirmation for the user.
VHope, on the other hand, surprisingly generated responses that were more human-like than relevant, with experts agreeing that almost all of the responses seemed human. There were also times when VHope was able to show a sense of humor that made users have fun with it. Finally, one expert commented that some of the answers were strangely off-topic, but the way VHope said them seemed typical or natural.
On the other hand, more of the responses generated by VHope were able to show emotions and empathize with the user. One expert even noted that there was a great flow of conversation in one of the logs, which ended well, with VHope showing compassion. Finally, there were even incidents where VHope was able to give bits of advice and comforting words to the user.
Log Analysis
- Welcome Message
 - Abot name
 - Repetitive Responses
 - Continuous Sharing
 - Single-turn conversations
 - Forgetting past offers
 - Ending the conversations
 
The general idea of this is to give and show as much support as possible so that the user feels more listened to and becomes able to share anything. The questions are pre-defined rather than produced by the generative model, because the model cannot compose a specific question when prompted, which would defeat the purpose of asking the user. Sometimes VHope introduces itself as Abot, and sometimes it refers to the user it is talking to as Abot.
In these conversations, Abot always starts by introducing itself to the user, and the user even calls it Abot in his/her replies. A possible solution to this is to keep a reserved collection of words for the generative model's use. Listing 6.14: VHope Model-H provides hints but forgets to say what they are, and the user actually asks where the hint is.
One theory for this sudden ending of chats is that, in the training data of well-being conversations, the chatbot thanks the user for chatting not only at the end of the conversation. Listing 6.16: VHope Model-H offers tips but forgets to state what they are, and the user actually asks where the tip is. Listing 6.17: VHope Model-H offers advice but forgets to state what it is, and the user stays to keep talking to it.
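A lightweight mitigation for the Abot-name confusion described above would be a post-processing pass that rewrites the residual name in generated replies before they are shown to the user. The function below is a hypothetical sketch, not part of the actual system.

```python
import re

# Hypothetical post-processing fix for the Abot naming issue: the
# generative model learned the name "Abot" from the training data,
# so generated replies are rewritten before being shown to the user.
ABOT_PATTERN = re.compile(r"\babot\b", re.IGNORECASE)

def fix_bot_name(reply, bot_name="VHope"):
    """Replace every standalone occurrence of 'Abot' with the bot's name."""
    return ABOT_PATTERN.sub(bot_name, reply)

print(fix_bot_name("Hi, I am Abot! How are you?"))  # -> Hi, I am VHope! How are you?
```

A limitation of this approach is that it cannot fix cases where the model addresses the *user* as Abot, since the correct substitution there would be the user's name, not VHope's.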
VHope as Mental Health Support Chatbot
VHope as a social companion
Persona-based Chatbots
VHope was able to embody most of these characteristics, but not all of them in every conversation. Where most of these characteristics were present, users were more entertained and engaged in the conversation. It is very important that conversational agents show a specific personality that users can trust (Garg & Sengupta, 2020).
These embedded personalities can make conversational agents more human, as they can act as someone with a specific role rather than a generic chatbot that just responds to every input. Positive personas, such as a cheerful friend or a knowledgeable colleague, can display the six characteristics that users expect from a reliable conversation partner. These personas and attributes would greatly aid VHope's goal of preserving the user's well-being by recognizing it and providing helpful feedback through empathic responses.
Incorporating such personas into VHope's design as a chatbot for mental health support can further help it develop deeper connection with its users and achieve its goal.
Encouraging self-disclosure
Design Analysis
The wellness conversation training data consists of exchanges between a human and the Abot chatbot. Training on this dataset provided many tips and questions related to well-being, but it also gave the chatbot another name and introduced its random sharing of parables and movies. As shown in Figure 6.24, the generative model was able to produce empathetic and fresh sentences that are more acceptable and human-like, because it can actually adapt to the ongoing conversation, unlike the boilerplate answers the retrieval model provides.
However, the model was unable to understand and respond to common phrases, such as 'Hello' with 'Hello' and 'Thank you' with 'Welcome'. This could be solved by collecting common phrases and the appropriate responses expected for them and using them as training data. There is also no special feature for the generative model to know when it is time to end the conversation.
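Besides retraining, the common-phrase fix described above could also take the form of a small rule layer checked before the generative model is invoked. The following is a minimal sketch with invented phrase pairs; the function and phrase table are hypothetical, not part of the actual system.

```python
# Minimal sketch of a rule layer for common phrases, checked before
# falling through to the generative model. Phrase pairs are invented.
COMMON_PHRASES = {
    "hello": "Hello!",
    "thank you": "You're welcome!",
    "goodbye": "Take care, and thank you for chatting with me!",
}

def respond(user_input, generative_model=None):
    """Answer common phrases directly; otherwise defer to the model."""
    key = user_input.strip().lower().rstrip("!.?")
    if key in COMMON_PHRASES:
        return COMMON_PHRASES[key]
    # Fall through to the generative model (stubbed here).
    return generative_model(user_input) if generative_model else "(generated reply)"

print(respond("Thank you!"))  # -> You're welcome!
```

A table like this could also hold a closing phrase triggered by farewells, addressing the model's inability to recognize when the conversation should end.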
There are commands used by the retrieval-based model, but they are not provided to users.