Chapter 4

The VHope System

VHope (Virtual Hope) is a hybrid empathetic conversational agent that combines a retrieval-based model with a neural conversational model. It gives students a place where they can talk freely about their feelings and emotions without feeling invalidated, and it helps them maintain their well-being, which is measured using the PERMA model of well-being (Seligman, 2012).

This hybrid model combines an existing retrieval-based model with a publicly available neural conversational model that was trained for this work. The retrieval-based model used was first EREN (Santos et al., 2020) and later MHBot (Ong, Go, Lao, Pastor, & To, 2021). EREN provided the emotion classification of the input, as it aims to teach emotional intelligence to children. However, its target users are aged 9 to 11, far from VHope's 17 to 20, so the model was later changed to MHBot, which also uses EREN as its base system but targets college students. MHBot's goal is likewise to help college students with their well-being, so its responses are more appropriate for VHope's use. VHope was made available as a web page using Python Flask so users can access it on their phone, tablet, or computer, wherever they are comfortable.

4.1 Software Objectives

4.1.1 General Objective

To combine a retrieval-based model with a generative-based model that can generate empathetic responses for newer and more human-like conversation.


4.1.2 Specific Objectives

• To use an existing retrieval-based model with the trained neural conversational model.

• To train a neural conversational model with a collection of empathetic dialogues to teach the model emotions and empathy.

• To detect the user's well-being level using the PERMA model as the measurement and the Mental Health Continuum as the label.

• To act as a facilitator that helps users label and recognize their well-being, and as a listener that can empathize and encourage them to share their feelings and stories.

4.2 Scope and Limitations

Retrieval-based model. Since VHope is a hybrid model, it uses a retrieval-based and a generative model working together to produce the desired output. The retrieval-based model first used in this system is an existing system called Emotion Reflective Entity, or EREN for short (Santos et al., 2020). EREN uses the OCC model to classify an event's emotion into 14 emotions: anger, disappointment, distress, fears-confirmed, sorry for, gratitude, hate, joy, love, surprise, relief, resentment, satisfaction, and shock. When MHBot became available, EREN was replaced with MHBot, as its responses are more appropriate for the target audience and objective of VHope. The retrieval-based system was modified to accommodate the modified conversation flow.

Neural conversational model. As for the generative model, the publicly available DialoGPT (Y. Zhang et al., 2019) is used. DialoGPT's pre-trained models in the small, medium, and large sizes are fine-tuned on the EmpatheticDialogues (Rashkin et al., 2018b) dataset to train the model to generate empathetic responses. Perplexity is used to evaluate the performance of each trained variation of the model before integrating it with the retrieval-based model.
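To illustrate, below is a minimal sketch of how perplexity could be computed for one fine-tuned DialoGPT variant on held-out dialogue text; the model path and sample utterances are placeholders, not the actual VHope checkpoints or evaluation data.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "microsoft/DialoGPT-small"  # placeholder; could point to a fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

held_out = [
    "I failed my exam today and I feel terrible.",
    "That sounds really stressful. Do you want to talk about it?",
]

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for text in held_out:
        ids = tokenizer(text + tokenizer.eos_token, return_tensors="pt").input_ids
        # The causal LM loss is the mean cross-entropy over the predicted tokens.
        loss = model(input_ids=ids, labels=ids).loss
        n_predicted = ids.size(1) - 1  # labels are shifted by one position
        total_loss += loss.item() * n_predicted
        total_tokens += n_predicted

# Perplexity is the exponential of the average per-token cross-entropy.
print(f"perplexity: {math.exp(total_loss / total_tokens):.2f}")
```

A lower perplexity indicates that the model assigns higher probability to the held-out dialogue, which is how the three fine-tuned sizes can be compared before integration.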

PERMA model. To help users maintain their well-being, their well-being must first be computed and then classified into labels for easier identification. Well-being is measured using the PERMA lexica introduced by Schwartz et al. (2016), based on the PERMA model of Seligman (2012). This measurement is then classified into five categories: in crisis, struggling, surviving, thriving, and excelling, which are derived from the Mental Health Continuum described by Delphis.


VHope as facilitator and listener. As an empathetic conversational agent, VHope is expected to act as a facilitator that helps the user recognize their well-being and as a listener that shows empathy as users share their stories.

To perform the role of a facilitator, VHope should be able to encourage the user to share their stories, label their well-being, and help them recognize why they are feeling that way, which can be achieved by asking questions. As a listener, VHope should be able to recognize the user's well-being and respond to it accordingly while showing empathy.

4.3 Architectural Design

The overall architectural design of the system is shown in Figure 4.1. The hybrid model is composed of two submodels, retrieval and generative. These two models are expected to work together to cover each other's weaknesses: the retrieval-based model provides rules, a defined conversation flow, and specific context to the open-domain generative model, while the latter generates more empathetic, newly constructed responses that differ from the predefined responses of the retrieval-based model.

Figure 4.1: VHope hybrid model system architecture

To start conversing with VHope, users will log in to the web page and will be prompted with a welcome message from VHope, mainly asking about the user's day so that the user will feel more comfortable starting to share stories. Once a user gives an utterance, this input will be accepted by the retrieval model.

The utterance will be processed in the Text Understanding module, which passes the text to the Story World Extraction and Event Extraction processes. In Story World Extraction, characters, objects, settings, and events are extracted from the text utterance, whereas Event Extraction identifies creation, description, or action events. The retrieval model's Emotion Recognition module uses the OCC Model to classify emotions from action events and the Keyword Spotting Technique for description events. Once an emotion is detected and classified, the model uses the Dialogue Manager to find the most suitable move, which then chooses a template at random that is later filled in by Response Generation.

Both the detected emotion and the generated candidate response are used as input to the generative model to generate an appropriate response. For the generative model to be able to generate a response, it must first be fine-tuned on three resources: the EmpatheticDialogues (Rashkin et al., 2018b) dataset, so it learns to generate responses with empathy; the Well-being Conversations (Sia, Yu, Daliva, Montenegro, & Ong, 2021), retrieved from a past study that focused on providing conversations encouraging students to improve their lifestyle habits and well-being, so the model learns the words used in well-being related conversations; and the PERMA lexicon, to teach the model words related to each category of the PERMA model.
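As an illustration of this hand-off, the sketch below composes the detected emotion and the retrieval-based candidate into a single conditioning prompt for a DialoGPT-style model; the prompt format, model path, and generation settings are assumptions, not VHope's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "microsoft/DialoGPT-small"  # placeholder for the fine-tuned FTER checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()

def generate_reply(user_utterance: str, detected_emotion: str, candidate: str) -> str:
    # Condition the generative model on the user turn, the emotion label from the
    # retrieval model, and the template-based candidate response.
    context = f"{user_utterance} [emotion: {detected_emotion}] {candidate}"
    input_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[-1] + 40,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            top_p=0.9,
        )
    # Keep only the tokens generated after the conditioning context.
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(generate_reply("I failed my exam today.", "distress", "I'm sorry to hear that."))
```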

The PERMA module uses the PERMA lexica (Schwartz et al., 2016) to measure the current well-being of the user. It computes the score of each word in the utterance for each PERMA category: positive emotion, engagement, relationships, meaning, and accomplishment. The overall score is then labeled as in crisis, struggling, surviving, thriving, or excelling, as adapted from Delphis1. If the user's label is in crisis or struggling, VHope notifies the user and asks if they want to seek help, as suggested by the psychology expert. Otherwise, it simply continues the storytelling by asking whether the detected PERMA label matches what they are currently feeling.

4.3.1 Retrieval-based model

The retrieval-based model is responsible for processing the input, understanding it, recognizing emotion, and giving a candidate response that follows the designed conversation flow. MHBot was created using the architecture of EREN; the only difference is in their templates, wherein MHBot provides responses regarding college students' well-being and advice to help them maintain their mental well-being, whereas EREN tries to guide children to share their stories and recognize their emotions.

1https://delphis.org.uk/mental-health/continuum-mental-health/


Figure 4.2: Retrieval-based model Knowledge Base

The retrieval model uses a variety of knowledge bases to understand the text input and perform its tasks. The three knowledge bases used are the following:

Story World Representation. This maintains two types of event chain: one is the set of action and description events, and the other is the emotion event chain, which contains the set of emotions and is used to check for recurring emotions throughout the conversation.

Commonsense Ontology. This provides the commonsense knowledge needed to understand the basic words and concepts in the real world. It is utilized to recognize the familiarity of the object detected and is also used to properly label an emotion.

NRC Emotion Lexicon. This helps the model understand affective words. It is based on the NRC Emotion Lexicon, which is composed of words, their sentiment (either positive or negative), and eight basic emotions, namely anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

Text Understanding

In text understanding, the user's text input is processed and its important elements are extracted. These elements create an Event, which can be categorized into three types: Action, Description, or Creation. Every actual event performed in the input is categorized as an Action event, information that describes an object or character is a Description event, and a Creation event is where new characters or events are introduced in the story. Only Action and Description events are passed to the Emotion Recognition module, since emotions can be influenced by actions performed and by adjectives that explicitly describe how the user is feeling.

Emotion Recognition

Emotion Recognition classifies emotions from the detected events. Action events, whose elements such as the subject, verb, direct object, and adverb are passed to the Emotion Recognition module, are processed using the OCC model, while Description events use the Keyword Spotting Technique.
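A minimal sketch of keyword spotting for description events is given below; the tiny word-to-emotion table is a hypothetical stand-in for the full NRC Emotion Lexicon.

```python
from collections import Counter
from typing import Optional

# Hypothetical affective keywords mapped to NRC-style basic emotions.
EMOTION_KEYWORDS = {
    "happy": "joy",
    "glad": "joy",
    "scared": "fear",
    "terrified": "fear",
    "angry": "anger",
    "sad": "sadness",
}

def spot_emotion(description_event_text: str) -> Optional[str]:
    """Return the most frequent emotion signalled by affective keywords, if any."""
    tokens = description_event_text.lower().split()
    hits = Counter(EMOTION_KEYWORDS[t] for t in tokens if t in EMOTION_KEYWORDS)
    return hits.most_common(1)[0][0] if hits else None

print(spot_emotion("I was so scared and sad after the exam"))  # -> "fear" (first of the tied hits)
```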

Dialogue Manager

The Dialogue Manager decides which dialogue move is appropriate to use. Different from chatbots designed to assist humans with a specific goal or task, conversational agents providing social and mental health support need to mimic and maintain the common human-to-human exchange of turns. This is also essential to provide the acceptable conversation flow expected from an actual health care professional or therapist. Thus, this study adapts the conversation flow of EREN, which originated from Gottman et al. (1996)'s Emotion Coaching Model and, as seen in Figure 4.3, consists of five (5) phases, namely introduction, labeling, listening, reflecting, and evaluating; a minimal sketch of this phase progression is given after the phase descriptions below. This conversation flow guides the retrieval-based model on what candidate responses to generate, which are then used as input to the FTER model.

Introduction. Every conversation starts with an introduction phase where the agent greets the user, gives a simple self-introduction of what it can offer, and ends by asking the user about his/her day, signaling the user to share his/her story or feelings. This uses the simple retrieval-based model to output a predefined response. This is also where the system tries to detect an emotion using the retrieval model's Emotion Recognition module and assesses the user's well-being from his/her utterance using the PERMA module of the generative model.

Labeling. In this phase, the agent asks the user if the detected well-being state is correct. In case the user disagrees and does not provide an alternative well-being state, it will either ask for the specific well-being label or process the new utterance to detect and assess again. Once the user confirms, related keywords are collected and used as a guide in generating empathetic and relevant responses throughout the succeeding conversation.


Figure 4.3: Retrieval-based model’s updated conversation flow

Listening. With the well-being state confirmed, the agent now tries to gain the user's trust and build rapport by asking questions about why he/she feels that way and is in that state. The agent should also be able to generate statements that positively validate the user's situation. This phase makes full use of the FTER model in generating empathetic responses.

Reflecting. This phase continues to make use of the FTER model, guided by the candidate responses generated by the retrieval-based model. The agent also offers statements containing positive thoughts and advice to allow the user to reflect and think of better actions, triggering the user's self-reflection; note, however, that the system does not check whether the user's action was correct or not.

Evaluating. Lastly, the agent asks for the user's insights about their conversation and whether it positively affected and helped his/her concern. The conversation is then ended by the agent's statement expressing its availability and willingness to support and listen to the user's stories and/or concerns anytime and anywhere.
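The sketch below gives a minimal view of how these five phases could be stepped through as a simple state machine; the transition rules are simplified placeholders, since the actual Dialogue Manager decides moves from the detected emotion, the PERMA label, and user confirmations.

```python
from enum import Enum, auto

class Phase(Enum):
    INTRODUCTION = auto()
    LABELING = auto()
    LISTENING = auto()
    REFLECTING = auto()
    EVALUATING = auto()

# Linear ordering of the Emotion Coaching phases described above.
NEXT_PHASE = {
    Phase.INTRODUCTION: Phase.LABELING,
    Phase.LABELING: Phase.LISTENING,
    Phase.LISTENING: Phase.REFLECTING,
    Phase.REFLECTING: Phase.EVALUATING,
    Phase.EVALUATING: None,  # the conversation ends after evaluation
}

def advance(phase: Phase, user_confirmed: bool) -> Phase:
    """Move to the next phase; stay in LABELING until the user confirms the label."""
    if phase is Phase.LABELING and not user_confirmed:
        return Phase.LABELING
    return NEXT_PHASE[phase] or Phase.EVALUATING

phase = Phase.INTRODUCTION
for confirmed in (True, False, True, True, True):
    phase = advance(phase, confirmed)
    print(phase.name)
```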

4.3.2 Generative-based model

The generative-based model used in this system is composed of the publicly available neural conversational model DialoGPT, fine-tuned for empathetic responses (the FTER model), and the PERMA Module. The base model of DialoGPT was adapted from the GPT-2 architecture released by OpenAI in 2019, but it is trained on a much larger dataset consisting of twelve years' worth of 147 million conversation-like exchanges from Reddit comment chains. Three sizes of the model were trained, small, medium, and large, consisting of 117 million, 345 million, and 762 million parameters, respectively.

Figure 4.4: Generative-based model

All these variations are made available in the HuggingFace Transformers repository. For FTER to fulfill its objective of generating empathetic responses, it was trained and evaluated with three datasets, namely the EmpatheticDialogues (Rashkin et al., 2018b) dataset, the Well-being Conversations (Sia et al., 2021), and the PERMA Lexica (Schwartz et al., 2016).
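A minimal sketch of such fine-tuning with the HuggingFace Trainer is shown below; the file paths, hyperparameters, and the preprocessing into one dialogue per line are illustrative assumptions rather than the exact training setup.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Assumed preprocessing: one dialogue per line, turns joined with the EOS token.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="valid.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="fter-dialogpt",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        evaluation_strategy="epoch",
    ),
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model("fter-dialogpt")
```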

EmpatheticDialogues Dataset

The EmpatheticDialogues (Rashkin et al., 2018b) dataset contains 24,850 conversations grounded in emotional situations. Its 32 emotion labels were chosen by aggregating labels from several emotion prediction datasets. Each dialogue involves two roles, the Speaker and the Listener. The Speaker is the person who chooses the emotion, describes the situation, and then instigates a conversation about it. The participant who understands and responds to the given situation is the Listener. Each conversation ranges from four to eight utterances; the average number of utterances per conversation is 4.31 and the average utterance length is 15.2 words. The dataset was split so that all conversations with the same speaker providing the initial situation description would be in the same partition.

There are eight features in the dataset, including conv_id for the conversation id, utterance_idx for the utterance id within each conversation, context, which contains the emotion in which the conversation is grounded, prompt, the initial emotional situation of the conversation, and utterance, which contains the alternating replies of the Speaker and Listener. In its original format, the dataset spreads the same prompt and the successive utterances between one Speaker and Listener across multiple rows. These data were processed to place a whole conversation within a single row, resulting in 17,841 rows of conversation for training, 2,758 rows for validation, and 2,539 rows for testing.
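A minimal sketch of this row-collapsing step with pandas is shown below, assuming the released column names; the turn-joining token is an assumption.

```python
import pandas as pd

# Load the per-utterance CSV (columns include conv_id, utterance_idx, context,
# prompt, and utterance, as described above).
df = pd.read_csv("empatheticdialogues/train.csv")

# Collapse the alternating Speaker/Listener rows of each conversation into a
# single row, keeping the original turn order.
conversations = (
    df.sort_values(["conv_id", "utterance_idx"])
      .groupby("conv_id")["utterance"]
      .apply(lambda turns: " <EOS> ".join(turns))  # turn separator is an assumption
      .reset_index(name="dialogue")
)

print(len(conversations), "conversations")
print(conversations.loc[0, "dialogue"][:200])
```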

Well-being Conversations

Well-being Conversations (Sia et al., 2021) is a collection of well-being related conversation logs from a rule-based chatbot used by twenty-five Grade 12 students aged 17 to 18. The conversations aimed to promote a healthy lifestyle and well-being that can help prevent further illness and distress. By training the generative model on this type of conversation, it is expected to learn to give helpful advice regarding well-being.

PERMA Lexica

PERMA Lexica (Schwartz et al., 2016) is composed of lexicons which can be used to predict well-being through the PERMA scales. Each lexicon entry has a respective category and weight. The categories are POS P, POS E, POS R, POS M, POS A, NEG P, NEG E, NEG R, NEG M, and NEG A, covering the positive and negative sides of all five categories of the PERMA model. The weight is the score of the word in that category and ranges roughly from -0.37 to 0.84 across categories, as shown in Table 4.1.

Table 4.1: PERMA Lexica minimum and maximum scores per category

Category    Minimum Score    Maximum Score    Mean
POS P       -0.36639         0.76549          0.04172
POS E       -0.30074         0.34065          0.03234
POS R       -0.28884         0.78376          0.03824
POS M       -0.16748         0.77167          0.02517
POS A       -0.19784         0.55031          0.03990
NEG P       -0.32731         0.70697          0.04705
NEG E       -0.15230         0.84017          0.04354
NEG R       -0.28648         0.62033          0.04045
NEG M       -0.14987         0.31674          0.03416
NEG A       -0.15369         0.24760          0.03426


4.3.3 PERMA Module

The PERMA module is responsible for computing the well-being score of the user. To compute the well-being score, the PERMA Lexica (Schwartz et al., 2016) is used, which contains a collection of words and their scores in each PERMA category.
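A minimal sketch of this lexicon-based scoring is given below; the tiny lexicon is a hypothetical stand-in for the full PERMA Lexica, and the category totals are simple sums rather than the exact weighting scheme of Schwartz et al. (2016).

```python
from collections import defaultdict

# Hypothetical (category, weight) entries; the real lexica cover all ten
# POS/NEG PERMA categories listed in Table 4.1.
PERMA_LEXICON = {
    "happy": ("POS_P", 0.52),
    "friends": ("POS_R", 0.41),
    "failed": ("NEG_A", 0.37),
    "alone": ("NEG_R", 0.33),
}

def perma_scores(utterance: str) -> dict:
    """Sum lexicon weights per PERMA category for the words in the utterance."""
    scores = defaultdict(float)
    for token in utterance.lower().split():
        if token in PERMA_LEXICON:
            category, weight = PERMA_LEXICON[token]
            scores[category] += weight
    return dict(scores)

print(perma_scores("I failed my exam and I feel alone"))
# -> {'NEG_A': 0.37, 'NEG_R': 0.33}
```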

Butler and Kern (2016) also provided a table to interpret these scores, shown in Table 4.2. These clusters are renamed after the Mental Health Continuum phases so they are easier for users to understand.

Table 4.2: PERMA Profiler Score Interpreter (Butler and Kern, 2016)

Cluster                    Well-being Score    Negative Emotion Score
Very High Functioning      9 and above         0 to 1
High Functioning           8 to 8.9            1.1 to 3
Normal Functioning         6.5 to 7.9          3 to 5
Sub-optimal Functioning    5 to 6.4            5.1 to 6.5
Languishing Functioning    Below 5             Above 6.5
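Below is a minimal sketch of mapping an overall well-being score to its Mental Health Continuum label using the Table 4.2 clusters, together with the crisis check described earlier; the exact correspondence between each cluster and its renamed label is assumed to follow the order of the clusters.

```python
def continuum_label(wellbeing_score: float) -> str:
    """Map a well-being score to its renamed Mental Health Continuum label."""
    if wellbeing_score >= 9:
        return "excelling"      # Very High Functioning
    if wellbeing_score >= 8:
        return "thriving"       # High Functioning
    if wellbeing_score >= 6.5:
        return "surviving"      # Normal Functioning
    if wellbeing_score >= 5:
        return "struggling"     # Sub-optimal Functioning
    return "in crisis"          # Languishing Functioning

def needs_help_prompt(label: str) -> bool:
    """VHope offers to help the user seek support when the label is low."""
    return label in {"in crisis", "struggling"}

label = continuum_label(4.2)
print(label, needs_help_prompt(label))  # -> in crisis True
```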

4.4 Physical Environment and Resources

The VHope system requires the following resources for development:

• Python

• Python Flask

• MySQL

• HuggingFace Transformers

To fine-tune the models, re-training is required, which consumes a lot of resources. CCS Cloud's current Alpha Testing provided the researcher with a JupyterHub environment with 2x NVIDIA Tesla V100 GPUs at 32 GB each.

To make the web page available for access, a server machine is needed. CCS Cloud also provided an environment configured with Ubuntu 18.04, 4 GB RAM, and 32 GB storage, since neural models are large.

The minimum requirement for accessing the web page is any updated browser installed on a phone, tablet, or laptop and a stable internet connection.
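For illustration, a minimal sketch of exposing the chatbot through a Python Flask endpoint is given below; the respond() helper is a placeholder for the full retrieval and generative pipeline, and the route and payload format are assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def respond(utterance: str) -> str:
    # Placeholder: the real system runs text understanding, emotion
    # recognition, PERMA scoring, and FTER generation here.
    return "Thank you for sharing. How did that make you feel?"

@app.route("/chat", methods=["POST"])
def chat():
    user_utterance = request.get_json(force=True).get("utterance", "")
    return jsonify({"reply": respond(user_utterance)})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the page is reachable from phones and tablets on the network.
    app.run(host="0.0.0.0", port=5000)
```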


Figure 4.5: CCS Cloud JupyterHub Interface

Figure 4.6: CCS Cloud allocated resources

4.4.1 Tools

ConceptNet

The pre-existing retrieval-based models, EREN and MHBot, used ConceptNet to provide commonsense knowledge to their models. ConceptNet is a large semantic network which contains a variety of concepts extracted from different sources such as Open Mind Common Sense, WordNet, DBPedia, UMBEL, Wiktionary, and Verbosity.


Spacy

spaCy is an open-source library written in Python and Cython used for Natural Language Processing tasks such as tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. This tool helps the retrieval-based model understand the input and extract useful information.
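A minimal sketch of pulling such elements with spaCy is given below; the pipeline name and the example sentence are assumptions.

```python
import spacy

# Load the standard small English pipeline (assumed; install with
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")
doc = nlp("I failed my math exam yesterday and I feel terrible.")

# Elements the text-understanding module relies on: subjects, verbs,
# objects, adjectives, and named entities.
subjects = [t.text for t in doc if t.dep_ == "nsubj"]
verbs = [t.lemma_ for t in doc if t.pos_ == "VERB"]
objects = [t.text for t in doc if t.dep_ in ("dobj", "obj")]
adjectives = [t.text for t in doc if t.pos_ == "ADJ"]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(subjects, verbs, objects, adjectives, entities)
```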

HuggingFace Transformers

For the generative-based model, HuggingFace Transformers is used. It provides a variety of pre-trained models used in many tasks, including conversational modelling. DialoGPT, a large open-source neural conversational model, uses transformers and is readily available from the HuggingFace repository.
