Chapter 1 Research Description - Animo Repository
Academic year: 2024
Chapter 1

Research Description

This chapter gives an overview of the problems surrounding mental health, the seriousness of these problems, and how current natural language processing (NLP) techniques may help keep them from escalating into serious issues. It also presents the research objectives, the scope and limitations, the significance of the research, and the research activities.

1.1 Overview of the Current State of Technology

Mental health, as defined by the World Health Organization (WHO, 2019b), is a state of well-being in which an individual realizes his or her own abilities, can cope with the normal stresses of life, can work productively, and is able to make a contribution to his or her community. Mental well-being, on the other hand, is defined as having both an awareness of emotions and the ability to manage and express those feelings in a healthy manner. According to a charity in England (Mind, 2020), experiencing low mental well-being for a long time may result in mental health problems. Another way to look at it is as a spectrum, shown in Figure 1.1: a continuum where at one end is mental well-being, where people are thriving, fulfilled, and at ease; in the middle, people can be described as coping, surviving, or struggling; and at the far end sit long-term mental illnesses.

One out of five children and adolescents worldwide already has a mental disorder, and about half of all mental disorders start by the age of 14. To date, depression is one of the leading causes of disability, affecting 264 million people worldwide, and the second leading cause of death among individuals aged 15 to 29 (WHO, 2019a).


Figure 1.1: Spectrum representation of mental well-being to mental health

In the Philippines, with a population of more than 106 million, mental health problems are already a cause for concern. According to the World Health Organization (WHO, 2020), 1.1% of this population is estimated to have major depressive disorder and 0.5% to have bipolar disorder. Left untreated, these disorders may even result in suicide, which contributes to a suicide rate of 5.4 per 100,000 population every year.

On June 20, 2018, the first mental health legislation in the history of the Philippines was officially signed and enacted as Republic Act No. 11036, or the Mental Health Act, after being proposed in Congress more than three years earlier and in the Senate in 2017 (DOH, 2018). Before this act, the Philippines was one of the minority of countries with no mental health legislation, leaving clinicians without guidance on the legal and ethical aspects of their practice and patients without a clear definition of their rights. But even with this law signed, problems surrounding mental health remain, such as the lack of resources. Of the total health budget, only 2.65% is allotted to mental health, and most of it goes to funding mental hospitals. As for human resources, there are an estimated 133 psychologists (a rate of 0.1 per 100,000 population) and 548 psychiatrists (approximately 0.5 per 100,000), which is far from the ideal ratio of 1 psychiatrist per 50,000 (WHO, 2020). The stigma surrounding mental health also remains prevalent, damaging individuals' self-esteem and limiting their access to opportunities. This includes people's outright avoidance of discussions about mental illness, the use of offensive labels to describe mentally ill persons, and overemphasized media reports of people with mental illness who are involved in crimes, reports that also lack the information needed to at least attempt to explain the rationale for their conduct (Rivera & Antonio, 2017).

As early as 2015, the United States Food and Drug Administration (US FDA) estimated that a total of 500 million people globally were using mHealth, or mobile digital health, applications, a number predicted to grow to 1 billion by 2018 (Veloso et al., 2019). A more recent survey shows that almost 83% of physicians in the United States already use mHealth applications to provide patient care, and a study conducted by Wolters Kluwer Health showed that 72% of physicians use smartphones to access drug information, 63% use tablets to access medical research, and 44% use smartphones to communicate with nurses and other staff. Around 68% of the population in the Philippines was expected to use mobile phones by 2020.

This availability of and access to mobile phones opens the way to a solution that has been used worldwide for mental health interventions: help and guidance through conversational agents, or chatbots. Shevat (2017) defines chatbots as a new way to expose software services through a conversational interface applicable to many uses, and as they became prevalent in society, they were also adopted in the medical field. To address the lack of mental health awareness among people, Liu et al. (2013) created a chatterbot that answers non-obstructive, psychological domain-specific questions. The answers provided by this chatbot are derived from an online Q&A forum using extraction strategies.

Huang et al. (2015) created a chatbot to address the limited Q&A input format of the former study and to detect stress in adolescents. Their system accepts input in sentence form to allow users to fully express their stress and to feel listened to, understood, and comforted. The most recent research on mental health therapy is a deep learning-based chatbot for campus psychological therapy (Yin, Chen, Zhou, & Yu, 2019) that can diagnose negative emotions and prevent further depression by replying with positive responses.

A scoping review by Abd-Alrazaq et al. in 2019 provides an overview of the features of chatbots used by individuals for their mental health. Of the 1,039 initially retrieved citations, only 53 unique studies were included, naming 41 different chatbots. The review highlighted that most of these chatbots are implemented in the United States, possibly because important platforms for chatbot use, such as Facebook Messenger, Skype, and Kik, target this geographical market in particular. Chatbots are mostly used for therapy, training, and screening, and focus on addressing depression and autism in particular. Lastly, the majority of these chatbots are rule-based, following a predefined set of rules or decision trees to generate responses.


Even with the numerous studies and chatbot applications addressing mental health, most are restricted to rule-based designs that can only offer limited and scripted responses (Abd-Alrazaq et al., 2019). This may be by choice, as designers cannot risk their chatbot generating an odd response to users who may already be suffering from mental illness (Fitzpatrick, Darcy, & Vierhile, 2017; Inkster, Sarda, & Subramanian, 2018). Rule-based and retrieval-based chatbots' input understanding can also be limited to processing only the keywords programmed into them, making the chatbot less human. Users also complain about repetitive responses from rule-based chatbots, caused by their predefined rules and inability to remember past choices or input over a whole conversation (Tammewar, Pamecha, Jain, Nagvenkar, & Modi, 2017; Day & Hung, 2019).

In 2020, Santos et al. developed EREN, an emotion-detecting conversational agent whose conversation flow was adapted from the Emotion Coaching Model (Gottman, Katz, & Hooven, 1996) to teach Emotional Intelligence (EI) to children by helping them sort out their emotions using storytelling strategies. Its conversation flow consists of five (5) phases, namely introducing, labelling, listening, reflecting, and evaluating. During a conversation, it uses dialogue moves (feedback, pumping, confirming, and reflecting) to elicit story details from the child, from which it can detect the emotion and ask more about it. Empathy is not only about knowing what the other is feeling but also showing that he or she was understood, which can be conveyed through affective responses (Clark, 2016). Currently, EREN cannot fully empathize with the user: it can only detect the emotion and apply active listening by generating more questions about that emotion using template-based responses to make the user feel that someone is actually listening. It is also limited by repetitive responses when asking the user about the story and has no ability to provide healthy, expert-accepted advice and information about managing one's emotions and well-being.

1.2 Research Objectives

1.2.1 General Objective

To develop an empathetic hybrid retrieval-generation-based conversational agent that can engage with students to help them maintain optimal mental well-being and promote mental health awareness.


1.2.2 Specific Objectives

1. To perform transfer learning in teaching the pre-trained generative language model empathy

2. To apply the PERMA model in assessing the user’s well-being based on the text input

3. To adapt a conversation flow for the retrieval-based model in maintaining a more natural flow and connect a transformer-based model in generating new and empathetic responses to the user’s input

4. To measure the perplexity of the generative model in order to determine how well it can generate relevant and grammatically correct responses

5. To evaluate the conversational agent in terms of performance, humanity, and affect criteria

1.3 Scope and Limitations of the Research

With the success of transfer learning in NLP, the cost of training large models on huge amounts of data can be reduced by using so-called pre-trained language models (Wu, Zhang, Li, & Yu, 2019). A large model can be fine-tuned on a specific dataset to adapt to a specific domain or topic and still perform better than traditional deep learning models (Zhu et al., 2019). To teach the generative model empathy, the ability to feel what others are feeling, we use the publicly available EmpatheticDialogues (Rashkin, Smith, Li, & Boureau, 2018b) dataset. This dataset differs from other emotion-annotated datasets in that it was specifically grounded in emotional situations during collection, while others are manually annotated after collection.

This study will only process textual data; thus, conversations are done purely in text form. By processing the user's text input, the model will assess the user's well-being using the PERMA lexica introduced by Schwartz et al. (2016), based on the PERMA model of Seligman (2012). This assessment will serve as the basis or topic of the conversation. The PERMA model has also been studied and shown to be a cross-cultural tool for measuring well-being (Khaw & Kern, 2014).
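As an illustration of how a lexicon-based well-being assessment works, the sketch below averages the weights of matched lexicon words per dimension. The tiny lexicon and its weights are invented for this example; the actual PERMA lexica of Schwartz et al. (2016) contain many weighted words per dimension.

```python
# Minimal sketch of lexicon-based PERMA scoring. The lexicon below is
# hypothetical; real PERMA lexica assign weights to thousands of words.
# Dimensions shown: P = Positive Emotion, E = Engagement, each with a
# positive (POS_) and negative (NEG_) pole.
LEXICON = {
    "happy":    {"POS_P": 1.2},
    "hopeless": {"NEG_P": 1.5},
    "absorbed": {"POS_E": 0.9},
    "bored":    {"NEG_E": 1.1},
}

def perma_scores(text: str) -> dict:
    """Average the lexicon weights of matched words for each dimension."""
    totals, counts = {}, {}
    for word in text.lower().split():
        for dim, weight in LEXICON.get(word, {}).items():
            totals[dim] = totals.get(dim, 0.0) + weight
            counts[dim] = counts.get(dim, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}

scores = perma_scores("i feel hopeless and bored today")
# Only the negative poles fire here, suggesting low well-being on P and E.
```

In the actual system, the per-dimension scores would then pick the conversation topic; here they simply summarize which poles of each dimension the input text activates.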

The conversational agent will follow a conversation flow previously used by EREN (Santos et al., 2020), an emotion-detecting chatbot that teaches emotional intelligence to children, to maintain a more natural conversation and to provide guidance in addressing the user's concern at hand. This flow was initially adapted from Gottman et al. (1996)'s Emotion Coaching Model. Conversations with the agent will start with an introduction, followed by labelling, where the detection of emotion and assessment of well-being occur; listening, where the agent tries to build trust with the user; then the reflecting phase; and closing. The generative transformer-based model will serve as an extension of the retrieval-based model that follows the conversation flow, generating newer and more empathetic responses to address the limited and repetitive responses generated by the latter.
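The phased flow just described can be sketched as a simple state machine. The phase names follow the text; the transition rule (advance one phase per step and remain at closing) is a simplification for illustration, since the real agent decides when to move on based on the dialogue.

```python
# Sketch of the conversation flow as a linear state machine:
# introducing -> labelling -> listening -> reflecting -> closing.
PHASES = ["introducing", "labelling", "listening", "reflecting", "closing"]

class ConversationFlow:
    def __init__(self):
        self.index = 0

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def advance(self) -> str:
        """Move to the next phase; stay at 'closing' once reached."""
        if self.index < len(PHASES) - 1:
            self.index += 1
        return self.phase

flow = ConversationFlow()
# The retrieval model would pick its scripted prompts based on flow.phase,
# advancing only when the current phase's goal (e.g., an emotion label) is met.
```

A real implementation would attach phase-specific conditions to `advance()`; the linear ordering itself is what the flow guarantees.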

Fluency of generative conversational models is commonly measured using the perplexity metric (Li, Sun, Wei, Li, & Tao, 2019; Lubis, Sakti, Yoshino, & Nakamura, 2018; Rashkin, Smith, Li, & Boureau, 2018a). Perplexity is defined as the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words. Since language models work by assigning high probabilities to sentences that are real and syntactically correct, and low probabilities to fake, incorrect, or highly infrequent sentences, the best language model is the one that best predicts an unseen test set. Perplexity is therefore an acceptable way to evaluate language models, as it measures the inverse probability of the unseen test set, where lower perplexity means better performance. Zhong et al. (2019) also state that perplexity is the only well-established automatic evaluation method in conversation modelling, and that other metrics, such as BLEU, do not correlate well with human judgement.
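The definition above admits a short worked example: given the probabilities the model assigns to each token of a test sequence, perplexity is the inverse probability normalized by length, computed in log space as exp(-(1/N) Σ log p_i).

```python
import math

def perplexity(token_probs):
    """Perplexity = inverse probability of the test set, normalized by
    the number of tokens: exp(-(1/N) * sum(log p_i))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns higher probability to every token is less "perplexed":
confident = perplexity([0.5, 0.5, 0.5])   # ~2.0
uncertain = perplexity([0.1, 0.1, 0.1])   # ~10.0
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step, which is why lower values indicate better fluency.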

To qualitatively evaluate the conversational agent, human evaluation remains crucial; it is also by users that the human-likeness of the agent can be properly judged. In this study, human evaluation will use the performance, humanity, and affect criteria from the study of EREN (Santos et al., 2020), which were adapted from Radziwill and Benton (2017)'s extracted quality attributes of chatbots. Performance will evaluate the system's capability to capture and interpret input from the participant, showing how appropriate the generated responses are and how well the system maintains the conversation with the user.

Humanity will assess how convincing, satisfying, and natural the whole interaction was. Lastly, the affect criterion will check whether the interaction was entertaining for the user and whether the agent was able to read and respond to the user's moods. This evaluation will be done by senior high school students aged 17 and 18. According to Clarke et al., the prevention of emotional and behavioural problems and the promotion of social and mental well-being should be a priority for adolescents, since this is the period when emotional and behavioural problems, together with physical health issues, can become a major cause of ill health. Adolescents include those aged 10 to 19, as defined by WHO. The target age range is trimmed to ages 17 and 18 to specifically choose senior high school students, since older adolescents are more aware of their feelings and can better describe them (Rankin, Lane, Gibbons, & Gerrard, 2004). This also limits the educational background or environment of the participants.

1.4 Significance of the Research

Mental well-being concerns one's thoughts and feelings and how one copes with life's ups and downs, which is not the same thing as mental health. More importantly, good mental well-being is not just the absence of negative thoughts and feelings. Long periods of low mental well-being can lead to the development of more noticeable mental health conditions like anxiety or depression. One contribution of this study is to provide a platform where senior high school students can freely tell their stories, understand their feelings without fear of stigma or judgement, and learn more about their mental health, allowing them to accept and become familiar with their own feelings. This can also help them develop their emotional well-being skills so that they can better cope with stress, handle emotions in times of challenge, and quickly recover from disappointments, which, in the long run, can help them enjoy more, live happier, and pursue goals more effectively, preventing them from developing serious mental health problems.

Conversation modelling has always been a great challenge in the field of NLP. Most conversational agents have used traditional, task-specific approaches. Recently, as deep learning techniques gained popularity due to their high performance, NLP tasks such as machine translation, question answering, and sentiment analysis have moved to these more advanced approaches. Most conversational agents for mental health support, however, still use traditional response generation, such as template-based or rule-based methods. This study thus contributes to the investigation of using a large pre-trained conversational model to generate more human-like and freely adapting responses for maintaining conversations with senior high school students. By further fine-tuning a large language model on a certain dataset to gain empathy skills, this study offers another look at a fully data-driven approach to teaching empathy to a model.

Response generation in artificial intelligence (AI) chatbot systems can be classified into two types: retrieval-based and generative. In recent years, an additional hybrid model, combining both retrieval and generative models, has been explored. Different ways of combining the two models have been used, such as a ranking module that chooses the better response, or applying the generative model when the conversation flow deviates from the expected. This study contributes to the exploration of a hybrid design in which the generative model acts as an extension of the retrieval model, addressing the latter's problem of limited and repetitive responses that result in less human-like conversations.
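The hybrid design described above can be sketched as a dispatch policy: answer from the scripted retrieval side while the conversation matches an expected case, and hand off to the generative model otherwise. Both models are stubbed here, and the intent labels and template table are illustrative assumptions, not the system's actual components.

```python
# Sketch of a hybrid retrieval-generation dispatcher. The retrieval side is
# a hypothetical template table keyed by intent; the generative side is a
# stand-in for a transformer-based model such as the one this study fine-tunes.
RETRIEVAL_TEMPLATES = {
    "greeting": "Hi! How are you feeling today?",
    "farewell": "Thank you for sharing. Take care!",
}

def retrieve(intent: str):
    """Return a scripted response if the intent matches a template, else None."""
    return RETRIEVAL_TEMPLATES.get(intent)

def generate(user_input: str) -> str:
    """Stand-in for the generative model producing a novel empathetic reply."""
    return f"[generated empathetic reply to: {user_input}]"

def respond(intent: str, user_input: str) -> str:
    """Hybrid policy: fall back to generation when retrieval has no match."""
    scripted = retrieve(intent)
    return scripted if scripted is not None else generate(user_input)
```

The key design point is that the generative model only fires outside the scripted cases, so the predictable conversation-flow prompts stay safe and consistent while open-ended turns still receive a fresh response.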

1.5 Research Activities

This section describes the specific steps and activities needed to accomplish the objectives of the research.

1.5.1 Review of Related Literature

Reviewing existing literature and applications is essential to designing an efficient system. Theories and literature related to mental health, empathic computing, and natural language response generation are reviewed. Conversational agents used to deliver mental health interventions are analyzed to find out which approaches were used and how people make use of them. To improve the approaches used in designing such systems, empathic computing and response generation techniques are also studied to identify better ways of offering such services. With this, it is possible to design a system using a newer approach and improve the communication between humans and conversational agents in an emotional setting. Better response generation for conversational agents can provide more human-like and freely adapting responses that gain users' interest, encouraging them to share their emotions and stories.

1.5.2 Review of Ethical Issues

In working with minor students, ethical issues must be considered. The University provides ethics forms, namely the DLSU Research Ethics Checklist, which need to be accomplished. Consent forms for the parents/guardians and assent forms for the students are also required to be signed before the testing proper. In addition, the research instruments listed below will first be checked by the psychology expert and will be integrated into Google Forms, as this is more efficient given the nation's current online setting:

• Student’s Well-being Assessment using PERMA (SWAP)

• Pre-testing Interview Questionnaire (PTIQ)


• Chatbot Assessment Form (CAF)

Lastly, the psychology expert will perform system validation first to check if the system is suitable and safe before the actual testing with the students.

1.5.3 Software Design and Implementation

This activity includes the development of the following system components:

• Fine-Tuned for Empathetic Responses (FTER) model: the process of fine-tuning DialoGPT on the EmpatheticDialogues dataset to provide empathetic and more human-like responses

• PERMA assessment: to determine the user's level of well-being

• Hybrid model: to extend the retrieval model with a generative model
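As one concrete piece of the FTER step, DialoGPT-style models are trained on dialogues flattened into a single token sequence whose turns are joined by the end-of-sequence token (GPT-2's `<|endoftext|>` string). The helper below sketches only that preprocessing; tokenization and the actual fine-tuning loop are omitted, and the example turns are invented.

```python
# Sketch of flattening EmpatheticDialogues turns into a DialoGPT-style
# training string: consecutive turns joined by the EOS token so the model
# learns to emit EOS at the end of each reply.
EOS = "<|endoftext|>"

def flatten_dialogue(turns):
    """Join alternating speaker/listener turns into one training string."""
    return EOS.join(turns) + EOS

example = flatten_dialogue([
    "I just failed my driving test.",
    "Oh no, that must be frustrating. How are you holding up?",
])
```

Each flattened string would then be tokenized and fed to the fine-tuning routine, with the loss computed over the reply tokens so the model learns empathetic continuations.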

1.5.4 FTER Model Evaluation

Language models can be evaluated using automatic metrics. In this activity, perplexity is used to evaluate the performance of the generative model, which further shows the model's fluency in the language. By evaluating the generative model first, we ensure that it was trained properly and performs at its best.

1.5.5 Functional Testing

To ensure that all basic functions work according to specifications, the system will first undergo functional testing before user testing. Functional testing includes both unit testing and integration testing. Unit testing checks whether the system modules work as expected individually, while integration testing combines all individual modules and evaluates them for compliance with the functional requirements. System testing is done to make sure the system meets the objectives of this research.


1.5.6 Expert and End User testing

With functional testing done, the conversational agent will then undergo evaluation by two types of users: a Psychology Expert and End Users consisting of senior high school students aged 17 and 18. There are three criteria for evaluating the agent, namely performance, humanity, and affect. Performance evaluates whether the agent properly understands the input and generates appropriate responses. Humanity assesses how convincing, satisfying, and natural the whole interaction was. Lastly, the affect criterion checks whether the interaction was entertaining for the user and whether the agent was able to read and respond to the user's moods.

Psychology Expert Evaluation on Research Instruments and Agent

The expert is expected to review the research instruments (SWAP, PTIQ, and CAF) to be used by the students, checking whether the questions are appropriate for the target age group and whether they properly represent the criteria (performance, humanity, and affect) for evaluating the conversational agent. The expert will also evaluate the system's:

• PERMA assessment. Check whether the agent can correctly identify the user's current state.

• Response Generation. Analyze whether the responses generated are able to empathize with the user which helps maintain the conversation and assess if the words used for the responses are appropriate for the age group.

• Conversation Flow. Determine whether the conversation flow used was acceptable for the system’s goal and is being correctly followed.

Selection of Participants

At least twenty (20) participants will be selected through convenience sampling. They should be 17 or 18 years old and comfortable using English as the language for communicating with the agent.

Orientation and Procedure

All information and updates will be sent through email. Before using the agent, participants will be asked to take the SWAP and PTIQ. The study requires its participants to interact with the agent for at least 15 minutes per day for one whole week. After the user testing proper, they will be asked to answer the SWAP again, together with the CAF.

Debriefing

After the one-week interaction, a debriefing with a psychologist must be conducted to make sure that each student can better process the negative feelings and thoughts that surfaced while conversing with the agent. This step will also make sure that they properly understood what occurred during the one-week research period. Further details will be discussed with the psychologist.

1.5.7 Analysis of Conversation Logs

To gauge users' engagement with the agent, a detailed analysis of the conversation logs will be done. At least one log per day, over a span of one week, from every user will be included in the analysis. Three dimensions will be used to measure user engagement: language production, flow maintenance, and affect. Language production defines the quantity of the user's words in each utterance, flow maintenance describes the semantic appropriateness of the user's responses, and affect shows the emotional engagement of the children (Xu & Warschauer, 2020).

1.5.8 Documentation

The proponent will document all the procedures, necessary details, and findings of every activity throughout the development of this research.

1.5.9 Calendar of Activities

The Gantt chart in Table 1.1 presents the activities needed to fulfill this study.

Each bullet represents approximately one week’s worth of activity.


Table 1.1: Timetable of Activities (March to December)

Activities                           Duration (each bullet ≈ one week)
Review of Related Literature         ••••••••••••••••
Review of Ethical Issues             ••
Software Design and Implementation   ••••••••••••••••••
FTER Model Evaluation                ••
Functional Testing                   ••
Expert and End User Testing          ••••
Analysis of Conversation Logs        ••••
Documentation                        •••••••••••••••••••••••••••••••••••
