Acknowledgement
The great effort and assistance of numerous people made this project possible.
First of all, entering graduate school and pursuing my thesis freely would not have been possible without the scholarship provided by DOST-ERDT. Huge thanks to Dr. Ethel Chua Joy Ong, my adviser, who helped me achieve the goal of my thesis by guiding me throughout my learning journey and providing the informative materials I needed. I would also like to extend my appreciation to my panel members, Dr. Charibeth K. Cheng and Dr. Judith J. Azcarraga, for sharing their time and wisdom in order to improve the process and results of this project.
I acknowledge EREN and MHBot, two earlier projects whose applications and initial research I based my ideas on. Conducting my user testing would not have been possible without the support and resources allocated to me by the CCS Cloud, handled by Mr. Fritz Kevin Flores, who responded swiftly to all my queries and requests. To the twenty-one students who took the time to use this project and give their feedback, your efforts are greatly appreciated. I am also deeply indebted to Jessica Beredo, who recommended all my expert evaluators: Katherine Beltran, Kristel Cacnio, and Dr. Lovely Lucky Evarretta. Many thanks to them for patiently evaluating numerous logs within the limited time I gave them.
Lastly, I extend my sincerest gratitude to my family, who encouraged and comforted me throughout all the challenging moments of completing this thesis.
To God be the glory.
Abstract
Mental health has always been a concern for everyone, both globally and in the Philippines. Natural conversational interfaces such as conversational agents or chatbots have been shown to be effective in providing more accessible and non-stigmatizing mental health interventions, but most of them focus on treating people, and only a few help prevent mental illness by providing social support. More importantly, most of these chatbots are built with only rule-based or retrieval-based models, which limits the chatbot’s input understanding, response generation, and ability to freely adapt to the context of a conversation. This study resulted in the hybrid design of VHope (Virtual Hope), which combines a retrieval model with a natural conversation flow that helps the agent act as a therapist who can facilitate the conversation, and a generative model that uses neural networks to generate new and empathetic responses that enrich the conversation. For the generative model, the neural conversational model DialoGPT was fine-tuned with the EmpatheticDialogues dataset and a cleaned set of Well-being conversations. The fluency of the generative model was measured using the perplexity metric, and the best-performing model among the trained variations obtained a perplexity score of 9.977. The PERMA model was used to determine the user’s well-being state throughout the conversation. This automated detection achieved an accuracy of 57% when compared against users’ manual questionnaire results and 59% based on the experts’ evaluation. Experts further commented that VHope was able to adapt and correct the well-being labels during the conversations. With this, 61% of the users maintained their well-being while 22% improved over the testing period. Overall results from the experts’ validation of logs showed that the responses generated by VHope were 67% relevant, 78% human-like, and 79% empathic. Users also recognized VHope as a non-judgemental human (15%) with whom they can share emotional stories and a friend (5%) who can comfort them and listen to their stories.
Keywords: Mental Well-being, Empathic Computing, PERMA Model, Neural Language Model, Conversational Agents
Contents
1 Research Description
1.1 Overview of the Current State of Technology
1.2 Research Objectives
1.2.1 General Objective
1.2.2 Specific Objectives
1.3 Scope and Limitations of the Research
1.4 Significance of the Research
1.5 Research Activities
1.5.1 Review of Related Literature
1.5.2 Review of Ethical Issues
1.5.3 Software Design and Implementation
1.5.4 FTER Model Evaluation
1.5.5 Functional Testing
1.5.6 Expert and End User testing
1.5.7 Analysis of Conversation Logs
1.5.8 Documentation
1.5.9 Calendar of Activities
2 Review of Related Literature
2.1 Chatbots for Mental Health and Well-being
2.2 Empathic Computing
2.3 Response Generation Models
2.3.1 Retrieval-based Models
2.3.2 Generative Models
2.3.3 Hybrid Models
3 Theoretical Framework
3.1 PERMA Model and Lexica
3.2 Natural Language Processing
3.3 Neural Networks
3.3.1 Recurrent Neural Networks
3.3.2 Transformers
3.3.3 RNN Vs Transformers
3.4 Pre-trained Language Models
3.4.1 GPT-2
3.4.2 DialoGPT
3.5 Evaluation Methods
3.5.1 Automated Evaluation
3.5.2 Human Evaluation
4 The VHope System
4.1 Software Objectives
4.1.1 General Objective
4.1.2 Specific Objectives
4.2 Scope and Limitations
4.3 Architectural Design
4.3.1 Retrieval-based model
4.3.2 Generative-based model
4.3.3 PERMA Module
4.4 Physical Environment and Resources
4.4.1 Tools
5 Design and Implementation
5.1 Retrieval-based model
5.1.1 Dialogue Flow
5.1.2 Dialogue Planner
5.2 Generative-based model
5.2.1 DialoGPT
5.2.2 Pre-Processing of Datasets
5.2.3 Fine-Tuned for Empathetic Responses (FTER)
5.3 Hybrid model
5.3.1 Challenges with retrieval model
5.3.2 Connecting two models
5.3.3 Creating the Flask Web page
5.3.4 Creating and handling conversation logs
5.4 PERMA module
5.5 Deployment
6 Results and Discussion
6.1 Automatic Evaluation
6.2 Functional Testing
6.2.1 Unit Testing
6.2.2 Integration Testing
6.3 Initial Expert Testing
6.4 End-User Testing
6.4.1 Methodology
6.4.2 PTIQ
6.4.3 SWAP
6.4.4 CAF
6.5 Expert Evaluation of Logs
6.5.1 PERMA labeling
6.5.2 Comparing Model-S and Model-H
6.6 Comparison to Pure Retrieval-based model
6.6.1 Performance
6.6.2 Humanity
6.6.3 Affect
6.6.4 Overall Design
6.7 Log Analysis
6.7.1 Welcome Message
6.7.2 Abot name
6.7.3 Repetitive Responses
6.7.4 Continuous Sharing
6.7.5 Single-turn conversations
6.7.6 Forgetting past offers
6.7.7 Ending the conversations
6.8 VHope as Mental Health Support Chatbot
6.8.1 VHope as a social companion
6.8.2 Persona-based Chatbots
6.8.3 Encouraging self-disclosure
6.9 Design Analysis
7 Conclusion and Recommendations
7.1 Conclusion
7.2 Recommendations
7.2.1 Dataset
7.2.2 Neural Conversational Model
7.2.3 Response Generation
7.2.4 Ranking Responses
7.2.5 Ending conversations
7.2.6 Emotion Detection
7.2.7 VHope’s Roles
A Research Ethics Documents
B Validation Instruments
C Resource Persons
D Tallied Scores
D.1 Logs and PERMA Summary
D.2 Expert’s Logs Evaluation
D.3 Expert’s PERMA Evaluation
References
List of Figures
1.1 Spectrum representation of mental well-being to mental health
2.1 Features and interaction flow of Woebot (De Nieva, Joaquin, Tan, Marc Te, & Ong, 2020)
2.2 2D representation showing words with the highest or lowest ratings in valence (V), arousal (A), and dominance (D) (Zhong, Wang, & Miao, 2019)
2.3 Alcohol education conversation map of Elmasri and Maeder (2016)
2.4 Hybrid chatbot system architecture by Day and Hung (2019)
3.1 Deep neural network
3.2 Recurrent neural network passing hidden state to next time step
3.3 Transformer architecture
3.4 GPT-2 models architecture
4.1 VHope hybrid model system architecture
4.2 Retrieval-based model
4.3 Retrieval-based model’s updated conversation flow
4.4 Generative-based model
4.5 CCS Cloud JupyterHub Interface
4.6 CCS Cloud allocated resources
5.1 Initial design of dialogue flow for maintaining well-being
5.2 Updated design of dialogue flow (blue arrows and shades) for maintaining well-being as per psychology expert’s feedback
5.3 Sample log content of docx file from Abot conversation logs
5.4 The DialoGPT model’s processing of an input for a single-turn conversation (following the blue flow) and allowing of multi-turn conversation by using the process in red to generate content-related responses.
5.5 Perplexity of the top 3 learning rates within 3 epochs.
5.6 The improvement of perplexity at different iterations during the first epoch of top 3 learning rates.
5.7 Sample conversations on Model-H that responded with “I’m not sure what you’re talking about.”
5.8 Log in user interface for VHope created in Python Flask
5.9 The user interface for VHope accessed in a mobile browser created using Python Flask framework
5.10 Raw text log file processed to a tabularized word format
5.11 Sample sentence with its weights in each PERMA category
6.1 Perplexity of small and medium model trained on the datasets (the lower, the better)
6.2 Preference test results comparing FTER and vanilla DialoGPT considering the response’s content and emotion
6.3 Chatbot Usage Evaluation for Experts (CUEE) answered by the psychology expert during testing
6.4 Chatbot Assessment Form (CAF) answered by the psychology expert during testing
6.5 User testing flow
6.6 User counts in each step of the user testing flow
6.7 User responses to the question ’Do you like sharing your thoughts or feelings?’
6.8 User responses to the question ’Who do you tell your feelings to?’
6.9 User responses to the question ’Are you comfortable sharing your thoughts and feelings to your parents?’
6.10 User responses to the question ’Do you have your own phone, tablet, or laptop?’
6.11 User responses to the question ’Do you know what a chatbot is?’
6.12 Visible variation on users’ ratings for each question in Performance criteria on Model-S
6.13 Visible variation on users’ ratings for each question in Performance criteria on Model-H
6.14 Visible variation on users’ ratings for each question in Humanity criteria on Model-S
6.15 Visible variation on users’ ratings for each question in Humanity criteria on Model-H
6.16 Visible variation on users’ ratings for each question in Affect criteria on Model-S
6.17 Visible variation on users’ ratings for each question in Affect criteria on Model-H
6.18 Count of conversation logs evaluated by experts
6.19 Accuracy of the PERMA labels according to the experts’ evaluation
6.20 Summary of experts’ evaluation on logs within the three criteria (in percent %)
6.21 Comparison of generated responses of the two models - Model-H and Model-S given the same prompt (same color of boxes corresponds to [almost] same prompt)
6.22 Analysis of the architectural design regarding the modelling of input given to the generative model
6.23 Analysis of the architectural design regarding the training dataset for the generative model
6.24 Analysis of the architectural design regarding the performance of the trained generative model
6.25 Analysis of the architectural design regarding the effect and architecture of the retrieval model
List of Tables
1.1 Timetable of Activities
2.1 Chatbots for Mental Health and Well-being: Summary of Sources
2.2 Empathetic Computing: Summary of Sources
2.3 Response Generation Models: Summary of Sources
3.1 RNNs VS Transformer models accuracy
4.1 PERMA Lexica minimum and maximum scores per Category
4.2 PERMA Profiler Score Interpreter (Butler and Kern, 2016)
5.1 Welcome templates
5.2 General Pumping Templates
5.3 Specific Pumping Templates
5.4 Labeling Confirming Templates
5.5 Cause, Status, and Follow-up Confirming Templates
5.6 Low Score Templates
5.7 Emphasis Feedback Templates
5.8 Praise Feedback Templates
5.9 Reflecting Templates
5.10 Evaluating Templates
5.11 DialoGPT model variations
5.12 EmpatheticDialogues dataset data split counts
5.13 Well-being conversations data split counts
5.14 Well-being conversations data split counts
5.15 Perplexity attained during experiment with different learning rates in three epochs.
5.16 Perplexity at different iterations within one epoch using the top 3 learning rates (best in bold).
5.17 Perplexities of the FTER (Fine-tuned for Empathetic Responses) model within first four epochs.
5.18 Training and evaluating time for each model variation on processed training datasets (approximate H - hours, M - minutes, S - seconds)
5.19 Perplexity measure in each model variation and dataset for both vanilla DialoGPT and FTER (Fine-tuned for Empathetic Responses) model
5.20 Perplexity scores of the DialoGPT on other datasets from past studies
5.21 Updated PERMA Module Score Interpreter
6.1 Average scores on content and emotion criteria using scale of 0 to 2 for the 20 random dialogues in Performance Test with highest score in bold and lowest is italicized
6.2 Sample emotional statements with its positive and negative score and PERMA label on first testing
6.3 Sample emotional statements with its positive and negative score and PERMA label on second testing
6.4 SWAP Pre-test results VS Day 1 PERMA module of VHope
6.5 PRE and POST PERMA level results per user
6.6 Final average scores of the two models using CAF
6.7 Model-S scores in Performance criteria per user in each evaluation statement
6.8 Model-H scores in Performance criteria per user in each evaluation statement
6.9 Model-S Humanity criteria scores per user in each statement
6.10 Model-H Humanity criteria scores per user in each statement
6.11 Model-S scores in Affect criteria per user in each evaluation statement
6.12 Model-H scores in Affect criteria per user in each evaluation statement
6.13 Expert gathering
6.14 Summary scores of each criteria from experts’ evaluation