Chapter 7
Conclusion and Recommendations
This chapter presents the overall evaluation of the resulting project and recommendations, including possible improvements for the project and ideas for future work.
7.1 Conclusion
The main objective of this study was to develop an empathetic hybrid retrieval-generation-based conversational agent that can engage with students to help them maintain optimal mental well-being and further promote mental health awareness. As a result of the study, VHope was created using a retrieval-based model that can detect emotions, a generative model trained to show emotions and empathy, and a PERMA well-being detector to label the user’s well-being. Based on the users’ pre- and post-tests using SWAP (Student’s Well-being Assessment using PERMA), most users’ well-being labels were successfully maintained for the duration of the testing. Both users and experts also commended VHope’s ability to show emotions and empathize during the conversation. With this, VHope was able to fulfill the role of a listener, as it can understand and empathize with the user, but it failed as a facilitator: its responses were mostly in the first person, which sometimes shifted the focus of the conversation to its own concerns rather than the story and concerns of the user.
Aside from accomplishing this main objective, the following specific objectives were also achieved:
1. To perform transfer learning in teaching the pre-trained generative language model empathy. DialoGPT was fine-tuned on the EmpatheticDialogues dataset to train and teach it empathy. This produced the FTER (Fine-tuned for Empathetic Responses) model, which showed great potential in expressing emotions and empathizing with user inputs in the initial preference and performance tests, conducted using only single-turn dialogues. This model was further fine-tuned on Well-being conversations to train it on well-being-related topics.
2. To apply the PERMA model in assessing the user’s well-being based on the text input. The users’ well-being was measured from their inputs using the PERMA lexica, which contain weighted words in each of the PERMA categories. This measurement yields overall positive and negative scores, which are then mapped to the labels in crisis, struggling, surviving, thriving, and excelling, adapted from the Mental Health Continuum. The resulting assessments were also evaluated by experts during their log analysis, who concluded that VHope was able to detect and adjust the appropriate well-being label throughout a conversation.
3. To adapt a conversation flow for the retrieval-based model in maintaining a more natural flow and connect a transformer-based model in generating new and empathetic responses to the user’s input. VHope was developed by integrating the retrieval model and the generative transformer-based model into a single system. A pre-existing retrieval model named MHBot was used. The retrieval model, with its conversation flow, is expected to produce responses related to the topic being shared while following specific phases of a conversation. These retrieved responses are then used as input to the generative model, which outputs freshly composed sentences, avoiding the repetitive responses common in chatbots.
4. To measure the perplexity of the generative model in order to determine how well it can generate relevant and grammatically correct responses. A neural conversational model’s fluency can be measured using an automatic metric called perplexity. During training of the FTER (Fine-tuned for Empathetic Responses) model on the EmpatheticDialogues dataset, perplexity was measured to compare the fluency of each variation (small, medium, and large) and to see whether larger models really perform better than smaller ones. Since training the large model took considerable time and resources, only the small and medium variations were trained on the subsequent dataset collections, composed of (1) EmpatheticDialogues + Well-being conversations and (2) EmpatheticDialogues + Well-being conversations + PERMA Lexica.
5. To evaluate the conversational agent in terms of performance, humanity, and affect criteria. Functional, integration, and expert testing were conducted before user testing. For functional testing of the retrieval model, the flow of the conversation was observed, while the generative model underwent performance and preference tests with human annotators. Integration testing verified that the Flask web page could successfully accept input and display output. After user testing, users evaluated their experience of talking to VHope using the three criteria, and their conversation logs were then subjected to expert evaluation and log analysis by the researcher, both of which also used the same criteria.
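To make the metric in the fourth objective concrete: perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so a lower value means a more fluent model. The sketch below (plain Python with made-up per-token log-probabilities, not the actual DialoGPT training code) shows the computation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity is exp of the average negative log-likelihood
    the model assigns to each token in the evaluation text."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has a
# perplexity of about 4: it is as "confused" as a uniform 4-way choice.
logps = [math.log(0.25)] * 10
print(perplexity(logps))  # ≈ 4.0
```

In practice the per-token log-probabilities come from the language model's cross-entropy loss on held-out dialogues, which is how the small, medium, and large variations can be compared on equal footing.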
This study contributed to the body of knowledge through its exploration of the hybrid design of a conversational agent, the modelling of input to improve response generation, and the teaching of empathy to a language model by further training it on empathy-rich conversations. Hybrid models have been studied recently, but not using both the retrieval and generative models’ outputs: prior work treats the models as equal or separate alternatives and uses only one of their outputs. This study provided an initial experiment in using and considering both models’ outputs by feeding the retrieval model’s output as input to the generative model. The modelling of the input and its effect on response generation were also reviewed in this study. Lastly, it supported the idea of using an empathetic dataset and a machine learning model to give a conversational model empathetic ability. Given the positive results of this study regarding the empathetic ability shown by VHope in conversations, modelling empathy through complex conversation design can be avoided.
7.2 Recommendations
Even though VHope was able to accomplish all of its research objectives, some recommendations can still improve its ability to generate appropriate and empathetic responses to better listen and give advice regarding the user’s well-being:
7.2.1 Dataset
Machine learning models are generally dependent on the type of data they are fed.
The EmpatheticDialogues dataset was able to successfully teach the generative model empathy, but the Well-being conversations, which were collected from another chatbot’s logs, had negative effects on the responses, such as the model continuously sharing its own concerns or stories. The Well-being conversations were used in the hope that they would give the model well-being-related words and topics. Though they were able to do so, it is still evident that well-being-related advice is lacking. Thus, collecting actual conversations focused purely on well-being and mental health would help the model perform its objectives better.
The generative model may also be further trained on a collection of common phrases and their corresponding answers so that it learns to respond even to short greetings from its users.
7.2.2 Neural Conversational Model
In general, the larger a machine learning model is, the better its performance. VHope was only able to train and use the medium variation of DialoGPT, with 345 million parameters, 24 decoder layers, and a model dimensionality of 1024, due to limited resources and the longer training time required by the larger model. With that in mind, experimentation with the large variation of DialoGPT, or even newer neural conversational models, is recommended.
Aside from considering a larger, newer model, using a rephrasing transformer model could also be more useful. As stated in the results of this study, the language model generates a response given an input. But the hybrid design’s input is the output of the retrieval model, which is already a response to the user’s input. Thus, the generative model produces a response to an already chosen response, which results in unrelated replies. With this in mind, a rephrasing model would be more suitable for processing a candidate response into an empathetic statement.
7.2.3 Response Generation
The input format of the FTER model is believed to be a large contributing factor to VHope’s irrelevant responses. Aside from training on more related datasets, another possible solution is to feed the generative model only the most important tokens or words from the user, collected from the beginning of the conversation up to the current turn, every time it generates a response. That way, the model’s responses will be constrained to those tokens, and the topic will be maintained throughout the conversation.
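A minimal sketch of this token-selection idea, assuming a simple frequency-based notion of importance and an illustrative stopword list (a real system might use TF-IDF or a learned keyword extractor instead):

```python
from collections import Counter

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"i", "the", "a", "to", "and", "my", "is", "it", "of", "me"}

def salient_tokens(user_turns, top_k=8):
    """Collect the top-k most frequent non-stopword words across all
    user turns so far, to be fed as context on every generation."""
    counts = Counter(
        word
        for turn in user_turns
        for word in turn.lower().split()
        if word.isalpha() and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_k)]

turns = ["I failed my exam today", "The exam really stressed me out"]
print(salient_tokens(turns))  # "exam" ranks first (mentioned twice)
```

These tokens would then be prepended to the generative model's input at every turn, keeping the conversation anchored to the user's actual topic.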
7.2.4 Ranking Responses
In hybrid design models, past studies (Tammewar et al., 2017; Kulatska, 2019; Day & Hung, 2019) explored other techniques for handling the two models in order to address the repetitive responses of retrieval models and the irrelevant and unguided responses of generative models. One of these is a ranking module. Both models, the retrieval and the generative, work on the input and generate their own output. These outputs are then scored and ranked to determine which one is better. This approach can also be explored in VHope to solve the problem in input modelling. The retrieval model’s output would be treated as an equal candidate, so that the good and appropriate responses it can sometimes give are not left unused, and the generative model would generate its response from the user input instead of from the retrieval model’s candidate response.
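A minimal sketch of such a ranking module, using a naive word-overlap scorer as a stand-in for a learned ranker (the scorer and the example responses are purely illustrative, not VHope's actual modules):

```python
def overlap_score(user_input, candidate):
    """Toy relevance score: fraction of candidate words that also
    appear in the user input. A real ranker would be learned."""
    user_words = set(user_input.lower().split())
    cand_words = set(candidate.lower().split())
    return len(user_words & cand_words) / max(len(cand_words), 1)

def rank(user_input, retrieved, generated):
    """Both models answer the user input; the higher-scoring
    candidate is returned as the chatbot's reply."""
    candidates = [retrieved, generated]
    return max(candidates, key=lambda c: overlap_score(user_input, c))

best = rank(
    "i am worried about my thesis defense",
    "Tell me more about your thesis defense.",  # retrieval candidate
    "What did you eat today?",                  # generative candidate
)
print(best)  # the on-topic retrieval candidate wins here
```

The key design difference from the current pipeline is that both candidates are produced from the user input directly, so neither model's output is a response to a response.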
7.2.5 Ending conversations
It is important to end a conversation in a good manner. VHope’s current abrupt ending of conversations can appear invalidating and rude to the user. Aside from having commands such as ’the end’, there should be a ’Conversation Ending Detector’ that can understand and determine whether the user’s input indicates that they want to end the conversation. This feature can make the chatbot more humane, removing commands and fully understanding the intent of the user.
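A rule-based version of such a detector could start as simply as phrase matching, as in the sketch below (the phrase list is an assumption; a trained intent classifier would be more robust and is the longer-term goal):

```python
# Illustrative ending phrases; a production detector would be a
# trained intent classifier rather than a fixed list.
ENDING_PHRASES = (
    "bye", "goodbye", "gotta go", "talk to you later",
    "thanks for listening", "i have to go",
)

def wants_to_end(user_input: str) -> bool:
    """Return True if the input looks like a farewell, so the chatbot
    can close the conversation gracefully instead of abruptly."""
    text = user_input.lower().strip()
    return any(phrase in text for phrase in ENDING_PHRASES)

print(wants_to_end("Okay, thanks for listening, bye!"))  # True
print(wants_to_end("I feel a bit better now"))           # False
```

When the detector fires, the chatbot would route into its reflecting and closing phases rather than cutting the session off with a command like 'the end'.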
7.2.6 Emotion Detection
The emotion detection used by VHope was derived from the EREN chatbot, which centers the conversation on the detected emotion of the user. While EREN made full use of the detected emotion throughout the conversation, this is not the case for VHope, which only used the emotion as another context input to the generative model. VHope also tried to use the emotion as input to the PERMA computation for well-being labelling, but incorrectly detected emotions negatively affected the computation, so this was removed.
Emotions greatly affect the overall mood and mental well-being of a person (Houben, Van Den Noortgate, & Kuppens, 2015). Thus, emotion can be further utilized to double-check the detected well-being. But this will only be possible if the emotion detector can correctly detect the user’s emotion all the time. The detected emotion can then be used as supporting input to the PERMA module to verify whether the well-being label is appropriate to the current emotion. To support this, the initial results of the emotion detector from the user testing conversations can be used: the emotions detected in the conversations can be noted and summarized as a basis for formulating the relation of specific emotions to each PERMA label. The findings should also be approved first by mental health experts.
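The lexicon-based well-being scoring that the emotion signal would cross-check can be sketched as follows; the word weights and label thresholds below are invented for illustration and are not the actual PERMA lexica values:

```python
# Hypothetical PERMA-style lexicon: each word carries a signed weight,
# and the summed score is mapped onto the Mental Health Continuum labels.
PERMA_LEXICON = {
    "happy": 1.5, "grateful": 2.0, "accomplished": 1.8,
    "lonely": -1.6, "hopeless": -2.2, "tired": -0.8,
}

LABELS = [  # (threshold, label), checked from highest to lowest
    (3.0, "excelling"), (1.0, "thriving"), (-1.0, "surviving"),
    (-3.0, "struggling"), (float("-inf"), "in crisis"),
]

def wellbeing_label(text):
    """Sum the weights of known words, then map the score to a label."""
    score = sum(PERMA_LEXICON.get(w, 0.0) for w in text.lower().split())
    for threshold, label in LABELS:
        if score >= threshold:
            return label

print(wellbeing_label("i feel grateful and accomplished today"))  # excelling
print(wellbeing_label("i am so tired and lonely"))                # struggling
```

A validated emotion detector could then act as a second opinion on this score, flagging cases where, say, a strongly negative detected emotion contradicts a high well-being label.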
7.2.7 VHope’s Roles
VHope’s initial target role was to be a listener and facilitator, helping its users recognize their feelings and realize what can be done to improve them. Being able to empathize with the user gave the impression that it can understand the concern while listening, thus fulfilling its role as a listener. On the other hand, due to its first-person and random responses, it was not able to fully facilitate the user’s emotional sharing of stories.
To perform the role of a facilitator, VHope should be trained with fewer first-person conversations. It should ask more about the user’s concerns rather than share its own. Additionally, statements helping users realize what they can do to improve their well-being should be more visible. These statements should be the focus during the reflecting phase, which comes before the end of the conversation, to make sure that the user is already done sharing their story and feels understood.
People with resilience can cope with stress and anxiety in healthier ways. There are different strategies for learning and improving resilience, including having social support, keeping a positive view of oneself, accepting change, and keeping things in perspective. Such strategies can also be used to be a successful facilitator. This will be possible by training the model with datasets that incorporate such ideas and by always providing reflective questions rather than just giving advice, letting users have control and feel responsible for their own wellness.