
3.3 APPLICATION AND FINDINGS

3.3.4 A/B Test

The last step was the A/B test. For an A/B comparison, a B structure (alternative structure) is needed to compare against the A structure (current structure); the card sorting results were therefore used to build the alternative structure. The A/B test compares the new navigation structure generated from the survey and card sorting results with the current structure already in use at Digiturk.

Current structure:

Figure 3.8 Digiturk's current movie structure in Video on Demand.

Digiturk's current structure has no sub-categorization; all 16 categories are shown at the same time. However, Digiturk does combine some categories, such as History-War and Thriller-Horror.

Generated Structure:

Figure 3.9 New structure proposal generated from Card Sorting.

The generated structure, which is the result of the survey and card sorting tests, has 9 main categories at the top level; users see the sub-categories by clicking on the main categories. The aim is to create a leaner and easier navigation structure.

Comparison Method

For the comparison, the A/B test has users perform the same tasks on the two versions. The current structure is referred to as "A" and the generated structure as "B".

First, a web page model is created for each version using the Wix web development platform. Everything such as colors, pictures, fonts and sizes is kept the same for both versions. Twelve movies are placed under the categories (the same movies as in Survey 2, i.e. the most popular ones), and each movie is located under the category with the highest percentage in the Survey 2 results.

After preparing the test environment, users are asked to find, within the structure, 5 movies that they know. While they complete their tasks, the time to find each movie (in seconds) and the number of clicks to reach it are recorded. Finally, after completing their tasks, users are asked to fill in a satisfaction survey about the structure they used. The survey was created using the System Usability Scale (SUS) method.

The SUS survey aims to measure:

• Complexity

• Inconsistency

• Fun

• Easiness

• Simplicity

• Accuracy

• Speed

The survey was conducted with 72 people (36 for A, 36 for B). It contains 8 scale questions, each rated from 1 to 5 (1 = strongly disagree, 5 = strongly agree).

Table 3.11 SUS Survey Questions.

No  Survey Questions
1   Easy to use
2   System is very complex
3   It saves time
4   I can achieve what I want to do in fewer steps
5   Funny to use
6   There are many inconsistencies in the system
7   I found what I was looking for accurately
8   There are missing categories

Results and Evaluation of A/B Test

The flow of the test is as follows:

1. Users are asked to find 5 of the 12 movies (already placed under the A and B categories) in one of the navigation versions (A or B).

2. For each movie, the time and the number of clicks are measured until the user finds it.

3. After the tasks on the A or B interface, users are asked to fill in the SUS survey to rate their overall satisfaction with the navigation.

Important points:

• First of all, users must be informed that neither their skills, their speed nor they themselves are being tested; only the system itself is tested.

• If a user does not know the requested movie, another movie name should be given.

• It is essential that users feel completely free during the test: they can talk, try anything, or wait as long as they want.

• There must be no interruption, redirection or judgment from the researcher.

The test was applied to a total of 72 users, aged between 18 and 30.

Time and Click Results

The time and click results are collected in an Excel table (A.1) and descriptive statistics are computed for both A and B (all time values are in seconds).
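The original analysis was carried out in Excel and a statistics package; purely as an illustration, the sketch below shows how the same descriptive statistics, including the Anderson-Darling "A-Squared" normality statistic reported in Tables 3.12 and 3.13, could be reproduced. The file name and column names are assumptions for this sketch, not part of the study.

```python
# Illustrative sketch only; the thesis analysis was done in Excel/Minitab.
# "ab_test_results.csv" and its columns are hypothetical: one row per user
# with the version used (A/B), average time per movie and total clicks.
import pandas as pd
from scipy import stats

data = pd.read_csv("ab_test_results.csv")

for version in ("A", "B"):
    for metric in ("avg_time", "total_clicks"):
        x = data.loc[data["version"] == version, metric]
        print(f"\n{version} - {metric}")
        print(f"  mean={x.mean():.3f}  st.error={x.sem():.3f}  median={x.median():.1f}")
        print(f"  st.dev={x.std():.3f}  variance={x.var():.3f}")
        print(f"  skewness={x.skew():.3f}  kurtosis={x.kurt():.3f}")
        print(f"  min={x.min()}  max={x.max()}  total={x.sum()}  n={len(x)}")
        # Anderson-Darling normality statistic ("A-Squared" in the tables);
        # the statistics package additionally reports a p-value for it.
        print(f"  A-squared={stats.anderson(x, dist='norm').statistic:.3f}")
```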

Table 3.12 Descriptive statistics of time and click for A.

A            Avg. Time   Tot. Click
Mean         10,001      6,722
St Error     0,852       0,344
Median       8,1         6
P-Value      0,017       0,017
St Dev       5,110       2,065
Variance     26,111      4,263
Kurtosis     0,673       6,482
Skewness     0,985       2,148
A-Squared    0,92        0,005
Minimum      3,8         5
Maximum      25          15
Total        360,03      242
N            36          36
95,0% CI     1,729       0,699

When the average time results of version A are examined, the P-value below 0.05 (95% confidence level) indicates that the data do not fit a normal distribution.

The mean is 10,001 seconds/movie and the standard deviation is 5,110 seconds/movie; that is, the average time to find a movie in version A is 10,001 seconds.

Looking at the statistics for the total number of clicks, the average number of clicks to find a movie is 6,722 clicks/movie per user.

Also, the minimum value and the first quartile are both 5 and the median is 6, so about half of the observations lie at 5 or 6 clicks; the maximum number of clicks to find a movie in version A is 15.

Table 3.13 Descriptive statistics of time and click for B.

B            Avg. Time   Tot. Click
Mean         9,586       6,861
St Error     0,958       0,246
Median       8,1         7
P-Value      <0,005      <0,005
St Dev       5,746       1,477
Variance     33,019      2,180
Kurtosis     1,330       1,014
Skewness     1,353       1,099
A-Squared    1,89        1,74
Minimum      2,4         5
Maximum      24          11
Total        345,1       247
N            36          36
95,0% CI     1,944       0,500

For version B, the P-value is also below 0.05 (95% confidence level), which means the data do not fit a normal distribution. The mean is 9,586 seconds/movie and the standard deviation is 5,746 seconds/movie; the average time to find a movie in version B is therefore 9,586 seconds.

In terms of the click results, the minimum value is 5, the first quartile is 6, the median is 7 and the maximum number of clicks to find a movie in version B is 11. Furthermore, the average number of clicks to find a movie is 6,861 clicks/movie per user in version B.

Comparison of Time and Click Results

To better see the difference between the time and click results of versions A and B, a two-sample t-test is applied to both data sets.

T-test for time data of A and B:

Difference = mu (1) - mu (2)
Estimate for difference: 0,415000

95% CI for difference: (-2,141693; 2,971693)

T-Test of difference = 0 (vs not =): T-Value = 0,32 P-Value = 0,747 DF = 69

Looking at the t-test results above, the p-value is higher than the α-level (0,05), which means there is no statistically significant difference between the average times of A and B.

The same test is also applied to the click data.

T-test for click data of A and B:

Difference = mu (1) - mu (2)
Estimate for difference: -0,139000

95% CI for difference: (-0,984581; 0,706581)

T-Test of difference = 0 (vs not =): T-Value = -0,33  P-Value = 0,744  DF = 63

The same conclusion can be drawn here: the p-value of 0,744 is higher than the α-level of 0,05, which means there is no statistically significant difference between the two versions.
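The outputs above come from the statistics package used in the study; the reported DF = 69 and DF = 63 appear consistent with a Welch-type test that does not assume equal variances. As a hedged illustration only, the same comparison could be reproduced as follows; the file and column names are hypothetical, as in the earlier sketch.

```python
# Illustrative sketch of the two-sample (Welch) t-test; not the original workflow.
import pandas as pd
from scipy import stats

data = pd.read_csv("ab_test_results.csv")  # hypothetical export of table A.1

for metric in ("avg_time", "total_clicks"):
    a = data.loc[data["version"] == "A", metric]
    b = data.loc[data["version"] == "B", metric]
    # equal_var=False gives Welch's t-test, which does not pool the variances.
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    print(f"{metric}: T-Value = {t_stat:.2f}  P-Value = {p_value:.3f}")
    # p > 0.05 -> no statistically significant difference between A and B
```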

When the results of the A/B test are compared, the outcome can be summarized as follows: there is no dramatic difference between the two versions. However, looking a bit deeper, the slight differences can give some clues about the systems.

The time data of the A/B test show that in version A (current structure), the average time a user spends looking for a movie is slightly higher than in version B (alternative structure). In other words, the alternative structure B lets users find the movie they want a bit faster. The leading reason might be the more compact look of the navigation structure, which is leaner and shorter, whereas in structure A users need to scroll up and down to see all categories, which causes time loss.

On the contrary, the click results show that in the current structure A, users reached a movie with fewer steps than in the alternative structure B. This result is quite reasonable because in Digiturk's current structure (version A), all the categories are listed one under the other, which means a user can see every category at once.

SUS Survey Results

This survey was prepared on the "SurveyMonkey" platform and given to each of the 72 people right after they used the structures. The results were exported from SurveyMonkey and collected in an Excel table for analysis. The summary table below shows the number of users who gave each rating for each question.

Table 3.14 SUS survey results by number of people.

No  Survey Questions                                A (36 people)          B (36 people)
                                                    1   2   3   4   5     1   2   3   4   5
1   Easy to use                                     0   0  11   9  16     0   3   4  16  13
2   System is very complex                         20   8   5   1   2    20   5   4   6   1
3   It saves time                                   2   2  10  10  12     0   4   8  16   8
4   I can achieve what I want to do in fewer steps  0   3  11   9  13     0   3  10  14   9
5   Funny to use                                    1   5   7  11  12     0   4   5  16  11
6   There are many inconsistencies in the system   16   9   7   4   0    19  10   4   1   2
7   I found what I was looking for accurately       0   5  10   8  13     0   3   6  16  11
8   There are missing categories                   19   8   3   6   0    18   6   3   6   3

(Scale: 1 = Strongly Disagree, 5 = Strongly Agree)

In order to find the average scores of A and B, the data are first listed openly (A.2). Before the analysis, the scores of the negatively worded questions (questions 2, 6 and 8) are reversed to obtain the true scores; for example, if a user voted 2 for one of those questions, it is changed to 4 for the analysis. First the correlations and then the averages are calculated.
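Purely as an illustration of this reverse-scoring and correlation step (the actual analysis was done in Excel), a Python sketch could look like the following; the file name and the Q1-Q8 column names are assumptions for the sketch.

```python
# Illustrative sketch only; the thesis analysis was done in Excel.
# "sus_answers.csv" is hypothetical: one row per participant with the
# version used (A/B) and the raw 1-5 answers in columns Q1..Q8.
import pandas as pd

sus = pd.read_csv("sus_answers.csv")
questions = [f"Q{i}" for i in range(1, 9)]

# Questions 2, 6 and 8 are negatively worded, so their scores are reversed
# (1<->5, 2<->4, 3 unchanged) before any further analysis.
for q in ("Q2", "Q6", "Q8"):
    sus[q] = 6 - sus[q]

# Pearson correlation matrix of the (reverse-scored) question answers
print(sus[questions].corr().round(3))
```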

Table 3.15 Correlation table of SUS questions.

Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4

Q1 1

Q2 0,378 1

Q3 0,413 0,337 1

Q4 0,454 0,405 0,602 1

Q1 0,366 0,129 0,487 0,258 1

Q2 0,512 0,379 0,394 0,343 0,114 1

Q3 0,430 0,312 0,423 0,452 0,266 0,335 1

Q4 0,303 0,245 0,214 0,185 0,201 0,316 0,289 1

The table above shows the correlation results for the questions. Questions 1 and 2, and questions 3 and 4, are highly correlated, which is expected: while question 1 evaluates the easiness of the system, question 2 measures its complexity, so the two questions are closely related. Similarly, questions 3 and 4 ask users to evaluate the time and the number of steps needed.

After finding the correlations, a summary table for each question and each version is prepared.

Table 3.16 Summary table for SUS questions.

             Avg Q1  Avg Q2  Avg Q3  Avg Q4  Avg Q5  Avg Q6  Avg Q7  Avg Q8  Average
A            4,139   4,194   3,778   3,889   3,778   4,083   3,806   4,167   3,979
B            4,083   4,083   3,778   3,806   3,944   4,250   3,972   3,833   3,969
Grand Total  4,111   4,139   3,778   3,847   3,861   4,167   3,889   4,000

Using a pivot table, the average of each question and the overall averages of A and B are calculated. The results let us compare the two structures as a whole and question by question.
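As a small illustration of this pivot-table step, the per-question and overall averages could also be computed as sketched below. The file and column names are hypothetical (the same assumed layout as in the previous sketch), and the answers are reverse-scored first, as described above.

```python
# Illustrative sketch of the Table 3.16 summary; not the original Excel pivot table.
import pandas as pd

sus = pd.read_csv("sus_answers.csv")          # hypothetical file, see earlier sketch
questions = [f"Q{i}" for i in range(1, 9)]
for q in ("Q2", "Q6", "Q8"):                  # reverse the negatively worded items
    sus[q] = 6 - sus[q]

# Average score per question for each version, plus the overall average per version
summary = sus.pivot_table(values=questions, index="version", aggfunc="mean")
summary["Average"] = summary[questions].mean(axis=1)
print(summary.round(3))
print(sus[questions].mean().round(3))         # "Grand Total" row (both versions pooled)
```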

For some questions A looks better, while for others B looks more satisfying. For example, question 4 shows that users could find the movies in slightly fewer steps in structure A than in B; this also supports the click results, which led to the same conclusion. The last question suggests that people may think there are missing categories in B because of the hidden (pull-down) menu items. From question 5 it can be concluded that the pull-down menu makes the system more fun to use. Questions 6 and 7 show that in the new structure the category and movie matches look more accurate and consistent to users.

When the final average scores for A and B are calculated, the score of version A is very slightly higher than that of version B, yet the two are very close.

To see whether there is a statistically significant difference between the SUS results of the two versions, a two-sample t-test was applied here as well.

Two-sample T for A vs B:

   N   Mean   StDev  SE Mean
A  8   3,979  0,184  0,065
B  8   3,969  0,164  0,058

Difference = mu (A) - mu (B)
Estimate for difference: 0,010625

95% CI for difference: (-0,177346; 0,198596)

T-Test of difference = 0 (vs not =): T-Value = 0,12  P-Value = 0,905  DF = 13

As concluded before, the t-test supports that conclusion: the p-value of 0,905 is higher than the α-level of 0,05, which means there is no statistically significant difference between versions A and B in terms of the SUS survey results.

There might be several reasons for this result:

• The time for adapting a new system

• The understanding time for pull-down menu system

• Complexity caused by the appearance of pull-down menu

• Confusion caused by hidden items

• Higher click numbers

CHAPTER 4

CONCLUSION

Usability and User Experience work is becoming better understood and more valued. In the past, producing a product and testing it only through the designer's or engineer's eyes was considered enough to release it to the market. In the end, such early releases cause additional costs for firms. Once producers realized this, the term usability became more important. Remembering that they produce goods for human beings, and that what makes a product count as usable is user insight, producers started to concentrate on usability research and to establish usability teams or hire outside experts, as in the Digiturk case. Digiturk, as a large corporate company, knows that real optimization can be achieved by working and connecting directly with the end users. Knowing exactly what users experience with the product is invaluable for building usability.

User Experience helps define whether something is usable or not. Even though usability draws on engineering disciplines, statistics, human factors, ergonomics and physiology, the root of UX actually rests on psychology. What makes usability different from other engineering studies is its focus on the human. It examines the interaction between user and product, and it basically depends on emotions, which is why it is not easy to predict. Because of that, in order to determine the usability of a product, we need User Experience tests, which require the full collaboration of real users. Usability testing refers to evaluating a product or service by testing it with representative users. Participants try to complete several tasks while they are observed by researchers. Every action they take is recorded in a selected way, such as video recordings, voice recordings, screen recordings or simply note taking. The goal is to identify any usability problems, collect qualitative and quantitative data, understand user behavior and determine participants' satisfaction with the product.

The challenge here is to find or adapt the right tests that exactly fit the product and answer our questions about it.

This thesis project, carried out in collaboration with Digiturk, aims to design a new category structure for Digiturk's movie list in Video on Demand based on Turkish users' insights, so that they can easily reach the movies, quickly find what they are looking for and enjoy the experience through the interface. Beyond this, it also aims to understand and explain Turkish users' preferences about movies and categories, their movie-watching habits and frequencies, and how they define and group categories, also comparing people from different educational backgrounds and ages. This kind of information about the audience is important not only for this study but can also be used for other projects at Digiturk.

To arrive at a new structure and usability information, three usability tests were applied in order: survey, card sorting and A/B testing. The combination and sequence of these tests were chosen specifically for this project; each test supported the others, and each gave different information about the users.

Surveys provided a better understanding of who the users are, what they want and how they think, such as their favorite categories, how they categorize movies and how they feel about them. With the help of the surveys, the important and unimportant categories were identified, which helps revise which categories should be on the list. The survey results were also used for the card sorting test. Card sorting was chosen because the work concerns a navigation system, and for navigational systems card sorting is the most suitable test: it lets users arrange every item in their own way of use, it is an easy and fun method to apply, and it makes it possible to see the majority's way of navigating. The card sorting results were then used to create the A/B test. The A/B test is basically a comparison method that also helps reveal the ways and behaviors of users while they use the interfaces; beyond comparing two systems, it allows the researcher to observe users and draw conclusions. Even though there are different ways of applying an A/B test, for this project it was chosen to record time and click data and to combine the test with the SUS evaluation method. In other words, statistical and behavioral (qualitative) information and results were combined in a method generated for this TV interface project.

The main contribution of the thesis is applying UX methods, of which there are only a limited number of examples especially in Turkey, to a new field with a different combination of existing methods. We believe that this research will guide and help other research in this area and will also be useful for similar TV or online interfaces that work with categories. These online and offline movie platforms interact heavily with users and require their direct involvement: users need to enter the interface, understand the system and find the category or movie they want by themselves, without a manual or description. Therefore, preparing and providing the best possible usability is a core element for these broadcasters (or companies) to achieve a high level of customer satisfaction. Competition today is fierce and conditions are too harsh to forgive even a single defect, so even small improvements and knowing users well help companies increase their quality, reliability and revenues. Everything starts with the ability to see things through users' eyes, to provide them the best experience and to make a company the leader of its market.

REFERENCES

Bias, R. G., & Mayhew, D. J. (1994). Cost-justifying Usability. Boston: Morgan Kaufman.

Bradley, J. (1993). Methodological issues and practices in qualitative research. Library Quarterly , 431-449.

Brooke, J. (1996). SUS: A "quick and dirty" usability scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester, & A. L. McClelland (Eds.), Usability Evaluation in Industry (pp. 189-194). London: Taylor and Francis.

Connaway, P. a. (2004). Basic Research Methods for Librarians. Westport: Libraries Unlimited.

Emanuel, J. (2013). Usability testing in libraries: methods, limitations and implications. OCLC Systems & Services: International Digital Library Perspectives, 29(4), 204-217.

Fitts, P. (1954). The Information Capacity of the Human Motor System in Controlling the Amplitude of Movement. Journal of Experimental Psychology , 381-391.

Flannery, W. (2008). "Information Architecture for Information Professionals". Library Management , 29 (8/9), 808 - 809.

Fogg, B. K. (2002). Stanford-Makovsky Web Credibility Study 2002: investigating what makes websites credible today. Stanford University.

Garrett, J. J. (2011). The Elements of User Experience: User-Centered Design for the Web and Beyond. Berkeley, CA: Pearson Education.

Gross, J. (2011, May 17). Improving Usability with Fitts' Law. Retrieved October 30, 2014, from Six Revisions: http://sixrevisions.com/usabilityaccessibility/improving-usability-with-fitts-law/

Hassenzahl, M. (2013). User Experience and Experience Design. In: Soegaard, Mads and Dam, Rikke Friis (eds.), The Encyclopedia of Human-Computer Interaction, 2nd Ed. Aarhus, Denmark: The Interaction Design Foundation. Retrieved December 21, 2014, from Interaction Design Foundation: http://www.interaction-design.org/encyclopedia/user_experience_and_experience_design.html


Kashimura, K. K. (1995). Apparent Usability vs. Inherent Usability: Experimental analysis on the determinants of the apparent usability. CHI '95 Conference Companion on Human Factors in Computing Systems (pp. 292-293). New York, USA: ACM.

Kuniavsky, M. (2003). Observing the User Experience: A Practitioner's Guide to User Research. San Francisco: Morgan Kaufmann.

Ma, S. (2010, September 20). Dancing with the Cards: Quick-and-Dirty Analysis of Card-Sorting Data. Retrieved January 13, 2015, from UX Matters: http://www.uxmatters.com/mt/archives/2010/09/dancing-with-the-cards-quick-and-dirty-analysis-of-card-sorting-data.php

Miller, J. (2005). The User Experience. IEEE Computer Society , 90-92.

Nielsen, J. (2009, August 24). Card Sorting: Pushing Users Beyond Terminology Matches. Retrieved January 13, 2015, from Nielsen Norman Group: http://www.nngroup.com/articles/card-sorting-terminology-matches/

Nielsen, J. (2005, August 15). Putting A/B Testing in Its Place. Retrieved January 14, 2015, from Nielsen Norman Group: http://www.nngroup.com/articles/putting-ab-testing-in-its-place/

Nielsen, J. (2012, January 4). Usability 101: Introduction to Usability. Retrieved September 29, 2014, from Nielsen Norman Group: www.useit.com/alertbox/20040301.html


Nielsen, J., & Norman, D. (tarih yok). The Definition of User Experience. September 29, 2014 Nielsen Norman Group: http://www.nngroup.com/articles/definition-user- experience/

Nielsen, J., & Norman, D. (tarih yok). The Definition of User Experience. September 29, 2014 Nielsen Norman Group: The Definition of User Experience

Nilsson, S., Gustafsson, T., & Carleberg, P. (2009). Hands Free Interaction with Virtual Information in a Real Environment: Eye Gaze as an Interaction Tool in an Augmented Reality System. PsychNology Journal , 7 (2), 175 – 196.

Norman, D. A. (2004). Emotional Design. Ubiquity, 1-1.

Steenbergen, M. (2010, November 13). Eye Candy vs. Bare-Bones in UI Design. Retrieved December 22, 2014, from UX Magazine: http://uxmag.com/articles/eye-candy-vs-bare-bones-in-ui-design
