Acknowledgement


The great effort and assistance of numerous people have made this project feasible.

First of all, entering graduate school and freely pursuing my thesis would not have been possible without the scholarship provided by DOST-ERDT. Huge thanks to Dr. Ethel Chua Joy Ong, my adviser, who helped me achieve the goal of my thesis by guiding me throughout my learning journey and providing the informative materials I needed. I would also like to extend my appreciation to my panel members, Dr. Charibeth K. Cheng and Dr. Judith J. Azcarraga, for sharing their time and wisdom to improve the process and results of this project.

I acknowledge EREN and MHBot, two earlier projects on whose applications and initial research I based my ideas. Conducting my user testing would not have been possible without the support and resources allocated to me by the CCS Cloud, handled by Mr. Fritz Kevin Flores, who responded swiftly to all my queries and requests. To the twenty-one students who took the time to use this project and give their feedback, your efforts are greatly appreciated. I am also deeply indebted to Jessica Beredo, who recommended all my expert evaluators: Katherine Beltran, Kristel Cacnio, and Dr. Lovely Lucky Evarretta. Many thanks to them for patiently evaluating numerous logs within the limited time I gave them.

Lastly, I extend my sincerest gratitude to my family, who encouraged and comforted me throughout all the challenging moments of completing this thesis.

To God be the glory.


Abstract

Mental health has always been a concern for everyone, globally and in the Philippines. Natural conversational interfaces such as conversational agents or chatbots have proven effective in providing more accessible and non-stigmatizing mental health interventions, but most of them focus on treating people, and only a few help prevent mental illness by providing social support. More importantly, most of these chatbots are built with only rule-based or retrieval-based models, which limits the chatbot’s input understanding, response generation, and ability to freely adapt to the context of a conversation. This study resulted in the creation of the hybrid design of VHope (Virtual Hope), which combines a retrieval model with a natural conversation flow, helping the agent act as a therapist that can facilitate the conversation, and a generative model that uses neural networks to generate new and empathetic responses that enrich the conversation. For the generative model, the neural conversational model DialoGPT was fine-tuned with the EmpatheticDialogues dataset and a cleaned Well-being conversations dataset. The fluency of the generative model was measured using the perplexity metric, and the best-performing model among the trained variations obtained a perplexity score of 9.977. The PERMA model was used to determine the user’s well-being state throughout the conversation. This model achieved an accuracy of 57% when the automated detection was compared against users’ manual questionnaire results, and 59% based on the experts’ evaluation. Experts further commented that VHope was able to adapt and correct the well-being labels during the conversations. With this, 61% of the users maintained their well-being while 22% improved over the testing period. Overall results from the experts’ validation of logs showed that the responses generated by VHope were 67% relevant, 78% human-like, and 79% empathic. Users also recognized VHope as a non-judgmental human (15%) with whom they can share emotional stories and as a friend (5%) who can comfort them and listen to their stories.

Keywords: Mental Well-being, Empathic Computing, PERMA Model, Neural Language Model, Conversational Agents


List of Figures

1.1 Spectrum representation of mental well-being to mental health . . . 2

2.1 Features and interaction flow of Woebot (De Nieva, Joaquin, Tan, Marc Te, & Ong, 2020) . . . 15
2.2 2D representation showing words with the highest or lowest ratings in valence (V), arousal (A), and dominance (D) (Zhong, Wang, & Miao, 2019) . . . 19
2.3 Alcohol education conversation map of Elmasri and Maeder (2016) . . . 23
2.4 Hybrid chatbot system architecture by Day and Hung (2019) . . . 25

3.1 Deep neural network . . . 30
3.2 Recurrent neural network passing hidden state to next time step . . . 31
3.3 Transformer architecture . . . 32
3.4 GPT-2 models architecture¹ . . . 34

4.1 VHope hybrid model system architecture . . . 40
4.2 Retrieval-based model . . . 42
4.3 Retrieval-based model’s updated conversation flow . . . 44
4.4 Generative-based model . . . 45
4.5 CCS Cloud JupyterHub Interface . . . 48
4.6 CCS Cloud allocated resources . . . 48


5.1 Initial design of dialogue flow for maintaining well-being . . . 52
5.2 Updated design of dialogue flow (blue arrows and shades) for maintaining well-being as per psychology expert’s feedback . . . 53
5.3 Sample log content of docx file from Abot conversation logs . . . 65
5.4 The DialoGPT model’s processing of an input for a single-turn conversation (following the blue flow) and allowing of multi-turn conversation by using the process in red to generate content-related responses . . . 69
5.5 Perplexity of the top 3 learning rates within 3 epochs . . . 72
5.6 The improvement of perplexity at different iterations during the first epoch of top 3 learning rates . . . 72
5.7 Sample conversations on Model-H that responded with “I’m not sure what you’re talking about.” . . . 76
5.8 Log in user interface for VHope created in Python Flask . . . 78
5.9 The user interface for VHope accessed in a mobile browser created using Python Flask framework . . . 79
5.10 Raw text log file processed to a tabularized word format . . . 80
5.11 Sample sentence with its weights in each PERMA categories . . . 81

6.1 Perplexity of small and medium model trained on the datasets (the lower, the better) . . . 85
6.2 Preference test results comparing FTER and vanilla DialoGPT considering the response’s content and emotion . . . 88
6.3 Chatbot Usage Evaluation for Experts (CUEE) answered by the psychology expert during testing . . . 90
6.4 Chatbot Assessment Form (CAF) answered by the psychology expert during testing . . . 92
6.5 User testing flow . . . 93
6.6 User counts in each step of the user testing flow . . . 95


6.7 User responses to the question ‘Do you like sharing your thoughts or feelings?’ . . . 96
6.8 User responses to the question ‘Who do you tell your feelings to?’ . . . 97
6.9 User responses to the question ‘Are you comfortable sharing your thoughts and feelings to your parents?’ . . . 98
6.10 User responses to the question ‘Do you have your own phone, tablet, or laptop?’ . . . 99
6.11 User responses to the question ‘Do you know what a chatbot is?’ . . . 99
6.12 Visible variation on users’ ratings for each question in Performance criteria on Model-S . . . 105
6.13 Visible variation on users’ ratings for each question in Performance criteria on Model-H . . . 105
6.14 Visible variation on users’ ratings for each question in Humanity criteria on Model-S . . . 107
6.15 Visible variation on users’ ratings for each question in Humanity criteria on Model-H . . . 108
6.16 Visible variation on users’ ratings for each question in Affect criteria on Model-S . . . 110
6.17 Visible variation on users’ ratings for each question in Affect criteria on Model-H . . . 110
6.18 Count of conversation logs evaluated by experts . . . 111
6.19 Accuracy of the PERMA labels according to the experts’ evaluation . . . 112
6.20 Summary of experts’ evaluation on logs within the three criteria (in percent %) . . . 114
6.21 Comparison of generated responses of the two models - Model-H and Model-S given the same prompt (same color of boxes corresponds to [almost] same prompt) . . . 116
6.22 Analysis of the architectural design regarding the modelling of input given to the generative model . . . 129


6.23 Analysis of the architectural design regarding the training dataset for the generative model . . . 130
6.24 Analysis of the architectural design regarding the performance of the trained generative model . . . 131
6.25 Analysis of the architectural design regarding the effect and architecture of the retrieval model . . . 132


List of Tables

1.1 Timetable of Activities . . . 12

2.1 Chatbots for Mental Health and Well-being: Summary of Sources . . . 17
2.2 Empathetic Computing: Summary of Sources . . . 21
2.3 Response Generation Models: Summary of Sources . . . 26

3.1 RNNs VS Transformer models accuracy . . . 33

4.1 PERMA Lexica minimum and maximum scores per Category . . . 46
4.2 PERMA Profiler Score Interpreter (Butler and Kern, 2016) . . . 47

5.1 Welcome templates . . . 54
5.2 General Pumping Templates . . . 55
5.3 Specific Pumping Templates . . . 56
5.4 Labeling Confirming Templates . . . 56
5.5 Cause, Status, and Follow-up Confirming Templates . . . 57
5.6 Low Score Templates . . . 57
5.7 Emphasis Feedback Templates . . . 58
5.8 Praise Feedback Templates . . . 58
5.9 Reflecting Templates . . . 59


5.10 Evaluating Templates . . . 59
5.11 DialoGPT model variations . . . 60
5.12 EmpatheticDialogues dataset data split counts . . . 63
5.13 Well-being conversations data split counts . . . 66
5.14 Well-being conversations data split counts . . . 69
5.15 Perplexity attained during experiment with different learning rates in three epochs . . . 70
5.16 Perplexity at different iterations within one epoch using the top 3 learning rates (best in bold) . . . 70
5.17 Perplexities of the FTER (Fine-tuned for Empathetic Responses) model within first four epochs . . . 71
5.18 Training and evaluating time for each model variation on processed training datasets (approximate H - hours, M - minutes, S - seconds) . . . 71
5.19 Perplexity measure in each model variation and dataset for both vanilla DialoGPT and FTER (Fine-tuned for Empathetic Responses) model . . . 73
5.20 Perplexity scores of the DialoGPT on other datasets from past studies . . . 74
5.21 Updated PERMA Module Score Interpreter . . . 82

6.1 Average scores on content and emotion criteria using scale of 0 to 2 for the 20 random dialogues in Performance Test with highest score in bold and lowest is italicized . . . 87
6.2 Sample emotional statements with its positive and negative score and PERMA label on first testing . . . 88
6.3 Sample emotional statements with its positive and negative score and PERMA label on second testing . . . 89
6.4 SWAP Pre-test results VS Day 1 PERMA module of VHope . . . 101
6.5 PRE and POST PERMA level results per user . . . 102
6.6 Final average scores of the two models using CAF . . . 103


6.7 Model-S scores in Performance criteria per user in each evaluation statement . . . 104
6.8 Model-H scores in Performance criteria per user in each evaluation statement . . . 104
6.9 Model-S Humanity criteria scores per user in each statement . . . 106
6.10 Model-H Humanity criteria scores per user in each statement . . . 107
6.11 Model-S scores in Affect criteria per user in each evaluation statement . . . 109
6.12 Model-H scores in Affect criteria per user in each evaluation statement . . . 109
6.13 Expert gathering . . . 111
6.14 Summary scores of each criteria from experts’ evaluation . . . 113
