Exploring the User Experience with a Speech Recognition System for Smart TVs

Ah Young Han, Jun Yeob Choi, Jun Young Choi, and Bong Gwan Jun
Graduate School of Culture Technology (GSCT), KAIST
Daehakro 291, Daejeon, Republic of Korea
[email protected]

ABSTRACT

This paper reports the results of a qualitative user study undertaken to enhance the user experience (UX) with a speech recognition application on a smart TV. Through focus group interviews (FGIs), we found that users anticipate rather intuitive interactions when using a speech recognition application, as in their usual conversations. We then collected users’ self-recordings of precisely how they actually want to command their TVs, in an effort to enrich the capability of verbal commands and form the basis of more natural interaction. We conducted user studies to 1) determine the user contexts and behaviors when watching TV, and 2) explore speech data which can be used as significant information when developing such an application for smart TVs. This paper presents an overview of user studies covering 872 unconstrained speech recordings and classifies them into a predictable list of commands. We expect that the results can enhance the user experience by increasing recognition quality through a focus on the major demands derived from users’ inputs.

Author Keywords

User studies; Smart TV; UX Design; Speech Recognition

ACM Classification Keywords

H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

INTRODUCTION

Recently, interactive TVs have become smart TVs, and these are in the process of being launched as a new product family in the display market [1]. Smart TVs are loaded with various applications combined with the internet and are capable of carrying out a broad range of interactions compared to existing interactive TVs [2]. Based on these circumstances, efforts have continued in the field of smart TVs to find appropriate UX designs, such as touch remote controllers, motion-based remote controllers, and action and speech recognition applications [3]. Speech recognition has the advantage that users can easily operate functions simply by voice [4]. However, we found problems with speech recognition systems on smart TVs through a focus group interview (FGI) [5]. We conducted an FGI on a recently released Smart TV and discussed how to design better speech recognition systems for users. The FGI was held at a studio where the Smart TVs are installed. Five UX researchers were interviewed about usability issues after using the speech recognition system of a Smart TV for one hour. We uncovered two problems and sought to understand both with regard to user behavior.

The first drawback is a lack of accessibility to real-time content information suitable for speech commands. Numerous speech commands from the user tend to be in the form of a question due to the nature of the modality of the command. These questions are mainly about the program being watched, which constantly presents new audiovisual information. Supporting these needs would bring out the best advantages of speech control. Thus, we identified the need for user research on the information users want from a television.

The second drawback is the requirement of accurate commands due to the lack of synonym handling. Even apart from information provision, control commands were no exception with regard to deficient accessibility. Trivial differences in word choice, such as “volume up” and “sound up,” can often cause inconveniences. Functions that are expected to be widely repeated require natural and intuitive commands. Forcing the user to learn accurate commands to control their TV degrades the advantages of speech control. Apart from input times that are longer than with a remote control, a more fundamental problem arises when users fail to find the proper commands for certain functions. Therefore, we need to understand the habitual inputs of users to design a better speech recognition system for TVs [5]. Understanding the circumstances in which speech commands are used, together with users’ actual needs, is the starting point for designing a meaningful speech command system.
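Where the gap between what users say and what the system accepts is purely lexical, one plausible mitigation is a synonym-normalization layer in front of the command matcher. The following minimal Python sketch shows the idea; the phrase lists are illustrative assumptions, not any vendor’s actual command set.

# A minimal sketch (an assumption, not any vendor's implementation) of absorbing
# lexical variation such as "volume up" vs. "sound up" before command matching.
SYNONYMS = {
    "volume up": ["volume up", "sound up", "turn the sound up", "louder"],
    "volume down": ["volume down", "sound down", "quieter"],
    "channel up": ["channel up", "next channel"],
}

# Invert the table once so every surface phrase maps to a canonical command.
CANONICAL = {phrase: command
             for command, phrases in SYNONYMS.items()
             for phrase in phrases}

def normalize(utterance):
    """Return the canonical command for an utterance, or None if unsupported."""
    return CANONICAL.get(utterance.strip().lower())

print(normalize("Sound up"))      # -> "volume up"
print(normalize("make it loud"))  # -> None: the intent exists, but the wording is unsupported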


To enhance the quality of the speech recognition system, the potential tasks of a TV should be determined. Doing so enriches the capability of speech commands and forms the basis of more natural interaction.

In order to achieve this purpose, we needed to conduct additional user research and analyze how users actually watch TV in their everyday lives. We also needed to understand what they want based on their age group and family members. We conducted a user study involving 10 different families, covering various age groups ranging from children to the elderly, over one full week. The study involved having the users record their opinions on a speech recorder every time they felt that a speech recognition function was required while using a TV. It is important that the survey was conducted from the users’ point of view without any intervention on the part of the researchers. The major problem of the speech recognition on smart TVs presently available on the market is that they impose a set of fixed commands on their users.

Research on speech recognition has in the past focused on technology issues, either in the form of usability evaluations after the initial product development or by considering indexing to improve the recognition rate. However, this paper focuses on users’ actual needs with regard to speech recognition before its development.

The major purposes of the research can be defined as follows: 1) to determine which speech recognition situations arise when users are watching a TV by investigating user contexts and user behaviors, and 2) to explore speech data which can be used as significant information when developing text data for smart TVs.

Speech Recognition Systems of Smart TVs

The most typical lean-back appliance, the TV, has constantly evolved toward interactivity, involving more than just watching. In terms of the user experience, Cesar and Chorianopoulos (2009) consider the properties of interactive TV to be mash-ups of pre-edited video clips and low- to mid-level user input [3]. There have been several studies of speech recognition systems used to control TVs and their levels of efficiency. Ibrahim (2001) showed that speech input is more effective at resolving navigation difficulties when there are considerable numbers of television channels. Berglund (2004) pointed out the complexity problem of interactive TV derived from the enrichment of entertainment with information provision, suggesting the use of speech as a means of interactive TV navigation [4]. The latest TVs have been termed Smart TVs, as they include various applications with integrated accessibility through the internet. Shin et al. (2012) categorized the differences between Smart TVs and earlier interactive TVs, studying their search abilities, social networks, and their degrees of integration and synchronicity [2]. As the number of functions increased drastically, speech recognition came to the fore as a means of TV control to overcome the limitations of existing interfaces, finally reaching commercialization. Although speech recognition has been studied for decades, it became popular after Siri was released by Apple. After 2012, it faced market competition as the Korean companies Samsung and LG included it with their latest models of Smart TVs.

Although speech recognition is feasible for controlling a TV, it remains unsatisfactory in its present form. Currently, Samsung Smart TVs support speech control, and the commands are classified into three categories: TV Basic, Broadcast and Contents, and Run Application. Basic commands consist of the fundamental functions of ordinary TVs, such as “channel up.” Examples of Broadcast and Contents commands are “Change to a news channel” and “Which program would you recommend for tonight?” The Run Application commands mostly consist of the names of installed applications. Products from other makers are not very different; they usually do not exceed the categories of basic functions, content navigation, and recommendations. The manufacturer mentions a current limitation: “Speech control performance may vary depending on the language, local dialect, pronunciation, speech and ambient noise mix, and lighting level” [8]. They point out several technical imperfections regarding the accuracy of recognition, but broadening the range of inputs the system can handle could be another means of enhancing recognition quality. These systems perform well enough to replace a remote control, but not well enough to handle the varied forms of input produced by users who are not constrained to the fixed format.
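To make this limitation concrete, the sketch below models a fixed, exactly matched command set organized into the three categories described above. The specific phrases and application names are assumed examples for illustration, not the actual product’s command list.

# A sketch of an exactly matched, fixed command set organized into the three
# categories described above. All phrases and application names are assumed.
FIXED_COMMANDS = {
    "TV Basic": {"channel up", "channel down", "volume up", "power off"},
    "Broadcast and Contents": {
        "change to a news channel",
        "which program would you recommend for tonight?",
    },
    "Run Application": {"run web browser", "run video app"},  # hypothetical app names
}

def classify(utterance):
    """Return (category, command) on an exact match, otherwise None."""
    text = utterance.strip().lower()
    for category, commands in FIXED_COMMANDS.items():
        if text in commands:
            return category, text
    return None

print(classify("Channel up"))        # matched in "TV Basic"
print(classify("Turn to the news"))  # None: same intent, unsupported wording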

USER STUDIES

The purposes of the user study can be defined as follows: 1) to determine the major information users want from TV programs through a speech recognition function, and 2) to determine the actual language users employ, beyond fixed commands, when they command their TVs. When we selected the participants, we did not consider whether they possessed a smart TV, as smart TVs were not fully commercialized in the country at the time of this study, and because we intended to investigate the most natural user behaviors without imposing any stereotypes.

Field research: Cultural Probes

We chose cultural probes because we wanted to capture users’ genuine desires regarding how they access and control their TVs. Cultural probes provide a means of gathering information about people and their activities. Unlike direct observation, this technique allows users to self-report. Cultural probes are one means of accessing environments that are difficult to observe directly while also capturing more of the ‘felt life’ in them [6]. In a cultural probe, the selected participants are briefed, given a kit of materials, and asked to record or note specific events, feelings, or interactions over a specified period. Most kits contain a diary for recording comments or impressions. A kit may also contain items such as a speech recorder, camera, and post-it notes: anything that can help users gather and record information. At the end of the specified period, the materials are collected and analyzed. In previous research on interactive TVs, a team led by Anxo Cereijo Roibás used cultural probes as a method for obtaining user experiences with pervasive interactive TV [7].

Procedure and participants

In previous research in this area, TV users have been divided by age, occupation, gender, TV watching state, TV program type, and the emotion felt while watching TV. Other research teams have classified users in terms of academic ability, location, income level, and TV watching style. Drawing on these studies, we categorized our user group in terms of age, gender, income level, occupation, and location. Our ten representative groups are shown in the following table.

Age | Gender | Family State | Example of Speech Data
Children (3~10) | Female | Nuclear family (3~4) | “I want to skip the toy advertisement.” / “Turn off the TV when I fall asleep.”
Teenager | Female | Nuclear family (3~4) | “Turn back to the previous volume level.” / “In which other programs does Suzy appear?” …
20s | Female | Single | “Tell me when the advertising is over.” / “Give me a re-run schedule of what I enjoy watching.”
20s | Male | Single | “Traffic information to Gangnam station.” / “I want to share this program with my friend.”
30s | Male | Couple | “Turn to another channel when this program ends.” / “Check the location of the restaurant in the program.”
30s | Female | Single | “I want to know the brand of shoes that just appeared.”
30s | Male | Single | “Show me only the highlights.”
40s~50s | Male | Extended family (5~) | “Show me a map related to this travel program.”
40s~50s | Female | Nuclear family (with teenager) | “Give me a summary of this soap opera.” / “Give me the movie channel schedule.”
Elderly | Female | Couple | “What's his name? I guess he is a new face.” / “Surf all channels slowly.”

Table 1. 10 representative groups

We created a persona for each group to determine their states when watching TV. For each representative group, we recruited testers and undertook a cultural probe session. The cultural probe involved a kit consisting of a diary, a pen, and a speech recorder. In the diary, testers recorded the types of TV programs they watched, their state while watching TV, and their behavior while watching TV. On the speech recorder, testers recorded the commands they wanted to use to control their TV. We did not limit the types of commands, since we were attempting to determine what users actually wanted their TV to do. We allowed one week for each tester with the cultural probe in order to capture a one-week life cycle.

RESULTS AND INSIGHTS

Collected Information from Diaries

From the diaries we specifically collected ‘type’, ‘program’, ‘with whom’, and ‘behavior’ information from each user. These are considered basic information that provides an understanding of the circumstances of different households.

Our main data consist of 872 speech recording instances: the sentences users produced under the assumption that unconstrained speech recognition was available. We categorized them using the following criteria: ‘basic’ for the inherent functions of a TV, ‘recommendations’ for asking for suitable content satisfying a certain requirement, ‘search’ for asking for information about a designated target, and ‘additional’ for all others. The collected data also contain natural emotional expressions. We categorized these as ‘social talk’, considering that social networks have a significant role in the Smart TV roadmap.

We created these categories considering the features of Smart TVs compared to earlier TVs. The ‘basic’ functions are very fundamental, such as channel, volume, and power, but form only one part of the data despite their frequency and variety. ‘Recommendations’ play a significant role in the operation of a Smart TV, as users seek to access various types of content interactively in an integrated environment. Enough information-seeking data was collected to warrant a category separate from ‘recommendations’, namely the ‘search’ category. The ‘additional’ category consists of various needs for functions that may make a TV truly smart.
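As a rough illustration of how such categories could be applied to the collected utterances, the following keyword-rule sketch tags an utterance with one of the categories listed in Table 2 below. This is not the coding procedure actually used in the study; the keyword lists are assumptions.

# A rough keyword-rule sketch, not the study's actual coding procedure, of
# tagging an utterance with one of the Table 2 categories. Keyword lists are
# illustrative assumptions; anything unmatched falls through to 'Additional'.
RULES = [
    ("Basic", ["channel", "volume", "power", "record", "turn off", "turn on"]),
    ("Recommendation", ["recommend", "anything good", "something fun"]),
    ("Search", ["who", "what", "where", "which", "tell me", "show me"]),
    ("Social Talk", ["i love", "so funny", "how sad"]),
]

def tag(utterance):
    text = utterance.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Additional"  # all others, as in Table 2

for sample in [
    "Turn off the TV when I fall asleep.",
    "In which other programs does Suzy appear?",
    "Show me a map related to this travel program.",
]:
    print(tag(sample), "<-", sample)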

Function | Criteria
Basic | Inherent functions of a TV
Recommendation | Asking for suggestions of suitable content satisfying certain requirements
Search | Asking for information by designating a target
Additional | All others
Social Talk | Emotional expressions

Table 2. Categories of speech data

Analysis of Users’ Self-Recording Data

The distribution of the features of the classified data is as follows: ‘social talk’ 3%, ‘recommendation’ 10%, ‘basic’ 18%, ‘additional’ 31%, and ‘search’ 38%, the largest share of the data, as we expected. It appears that the intention to obtain information from real-time content has an influence: many instances of question-type speech data instantly come to mind because of their modality. The most problematic issue is compensating for this factor.

Next, we classified the data again by chunking similar speech elements in terms of context and assigning features. This was confined to possible commands in a speech recognition environment to allow a proper division. We further subdivided major functions such as ‘channel’ or ‘volume change’. For example, program-name-oriented channel-change commands were separated from cast-name-oriented channel-change commands. Data gathered from more than two different users were bound with the proper name and put into the category ‘other’.
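A minimal sketch of this sub-division step is shown below. The program and cast name lists are hypothetical placeholders, not data from the study.

# A minimal sketch of separating program-name-oriented channel-change requests
# from cast-name-oriented ones while chunking. The name lists are hypothetical.
PROGRAM_NAMES = {"the evening news", "the weekend drama"}   # assumed examples
CAST_NAMES = {"suzy"}                                       # example name from Table 1

def chunk_channel_change(utterance):
    text = utterance.lower()
    if any(name in text for name in PROGRAM_NAMES):
        return "Basic / Change-Channel (program name)"
    if any(name in text for name in CAST_NAMES):
        return "Basic / Change-Channel (cast name)"
    return "Basic / Change-Channel (other)"

print(chunk_channel_change("Change to the channel where Suzy appears."))
print(chunk_channel_change("Put on the evening news."))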

Function | Chunking of similar speech elements
Basic | Channel Change-Channel Name / Volume / Recording / Power …
Recommendation | Emotion (anything good, fun) / Age / Food / Game …
Search | Common Sense / Place / Travel / Movie / Contents …
Additional | Camera / Subtitle / CCTV / Call / Home Network …
Social Talk | Companion / Recipe …

Table 3. Data classified by chunking similar speech elements

The next step was to determine which of the subdivided features have high priority for users. We excluded duplicated inputs from a single user; as a result, 12 users in 10 households recorded certain features. That is, we sought to determine needs and input formats based on data that appeared frequently across users instead of merely counting the instances of such data. This can enhance the user experience by increasing recognition quality through a focus on the major demands derived from users’ inputs.
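The sketch below illustrates this prioritization with made-up records: each subdivided feature is ranked by the number of distinct users who requested it, rather than by the raw number of utterances.

# A sketch of ranking subdivided features by how many distinct users requested
# them, rather than by raw utterance counts. The records are made-up examples.
from collections import defaultdict

records = [
    ("user01", "Search/Place"), ("user01", "Search/Place"),   # duplicate input from one user
    ("user02", "Search/Place"), ("user03", "Basic/Volume"),
    ("user02", "Additional/Subtitle"),
]

users_per_feature = defaultdict(set)
for user, feature in records:
    users_per_feature[feature].add(user)   # the set de-duplicates within a user

ranking = sorted(users_per_feature.items(),
                 key=lambda item: len(item[1]), reverse=True)
for feature, users in ranking:
    print(feature, len(users))             # e.g. Search/Place is requested by 2 users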

CONCLUSION AND FUTURE WORK

This study was conducted to find the needs of users with regard to speech recognition on smart TVs. Through the FGI, we found that users are anticipating rather intuitive interactions in a more natural environment when using the speech recognition function, as in their usual conversations.

Therefore, we tried to utilize users’ self-recordings of precisely how they actually want to command their TVs so that more natural UX designs can be realized. For this purpose, we applied the cultural probe method, a user-centered design technique in which users themselves recorded information as it occurred to them. The data collected through this method are considered highly valuable in that they can be used as textual data for creating smart TV speech recognition functions.

We categorized a total of 872 instances of collected speech data into different groups based on their degrees of resemblance, such that users can easily find and use them in an actual speech recognition environment. This type of categorization is expected to be of great help during the actual process of UX design. For example, we discovered that the demand for search functions did not arise intensively in any specific age group but rather arose frequently across various users. If such a factor is considered and leads to intensive loading into a database during the actual design process, it can be expected to improve existing UX designs significantly.

However, this study is limited in that there were not enough participants, which reduced the quantity of data and affected the representativeness of the participants in terms of age, gender, and living location, as we selected the participants based on documentary surveys. Therefore, in future research, in order to offset these limitations, we plan to improve the objectivity of the data and create meta-data that can actually be applied to smart TVs by continuing the research on specific UIs. We also intend to proceed with a post-evaluation of usability after loading (realizing) the selected data and determining the responses with actual TVs. We believe that our study will assist those who develop and conduct research on speech recognition technology.

REFERENCES

1. S.-J. Lee, "A Study on Acceptance and Resistance of Smart TVs," International Journal of Contents (2012), vol. 8, pp. 12-19.

2. D.-H. Shin, Y. Hwang, and H. Choo, "Smart TV: are they really smart in interacting with people? Understanding the interactivity of Korean Smart TV," Behaviour & Information Technology (2013), vol. 32, pp. 156-172.

3. K. Chorianopoulos and P. Cesar, "The Evolution of TV Systems, Content, and Users Toward Interactivity," Foundations and Trends in Human-Computer Interaction (2007), vol. 2, pp. 373-395.

4. A. Berglund and P. Johansson, "Using speech and dialogue for interactive TV navigation," Universal Access in the Information Society (2004), vol. 3, pp. 224-238.

5. L. Eronen, "User centered research for interactive television," in Proceedings of the 2003 European Conference on Interactive Television: From Viewers to Actors (April 2-4, Brighton, UK) (2003), pp. 5-12.

6. B. Gaver, T. Dunne, and E. Pacenti, "Design: cultural probes," interactions (1999), vol. 6, pp. 21-29.

7. A. C. Roibás, D. Geerts, E. Furtado, and L. Calvi, "Investigating new user experience challenges in iTV: mobility & sociability," in CHI'06 Extended Abstracts on Human Factors in Computing Systems (2006), pp. 1659-1662.

8. http://www.samsung.com/global/microsite/tv/2013_vi/mobile/html/voice_control.htm
