
should be able to interpret the analysis results to obtain value from the entire analysis process and to perform visual analysis and derive valuable business insights from the massive data.

6.3.5  Analytics Application

The analysis results can be used to enhance the business process and increase business profits by evolving a new business strategy. For example, the results of a customer analysis, when fed into an online retail store, can produce a list of recommended products that the customer may be interested in purchasing, making online shopping more customer friendly and boosting the business as well.

Interval data—In the case of interval data, not only the order of the data matters but also the difference between values. A common example of interval data is temperature in Celsius: the difference between 50°C and 60°C is the same as the difference between 70°C and 80°C. On an interval scale, the increments are consistent and measurable.

Ratio data—A ratio variable is essentially an interval variable with the additional property that it has an absolute zero; a value of zero indicates that the variable does not exist. Height, weight, and age are examples of ratio data. For example, a person aged 40 years is four times as old as a person aged 10 years. Data such as temperature in Celsius, on the other hand, are not ratio variables, since 0°C does not mean that the temperature does not exist.

6.4.2  Qualitative Analysis

Qualitative analysis in big data is the analysis of data in their natural settings.

Qualitative data are those that cannot be easily reduced to numbers. Stories, articles, survey comments, transcriptions, conversations, music, graphics, art, and pictures are all qualitative data. Qualitative analysis answers “how,” “why,” and “what” questions. There are two basic approaches in qualitative data analysis, namely, the deductive approach and the inductive approach. A deductive analysis uses the research questions to group the data under study and then looks for similarities or differences among them. An inductive analysis uses the framework that emerges from the research to group the data and then looks for relationships within them.

A qualitative analysis has the following basic types:

1) Content analysis—Content analysis is used to classify, tabulate, and summarize data. It can be descriptive (what does the data say?) or interpretive (what does the data mean?).

2) Narrative analysis—Narrative analysis is used to transcribe observation or interview data. The data must be enhanced and presented to the reader in a revised form. Thus, the core activity of a narrative analysis is reformulating the data presented by people in different contexts based on their experiences.

3) Discourse analysis—Discourse analysis is used in analyzing data such as written text or naturally occurring conversation. The analysis focuses mainly on how people use language to express themselves verbally; some people speak in a simple and straightforward way, while others speak in a vague and indirect way.

4) Framework analysis—Framework analysis is used to identify an initial framework, which is developed from the problem at hand.

5) Grounded theory—Grounded theory starts by examining one particular case from the population and formulating a general theory about the entire population.

6.4.3  Statistical Analysis

Statistical analysis uses statistical methods for analyzing data. The statistical analysis techniques described are:

A/B testing;

Correlation; and

Regression.

6.4.3.1  A/B Testing

A/B testing, also called split testing or bucket testing, is a method that compares two versions of an object of interest to determine which of the two performs better. The element subjected to analysis may be a web page or an online product deal. The two versions are version A, the current version, called the control, and version B, the modified version, called the treatment. Both versions are tested simultaneously, and the results are analyzed to determine the successful version. For example, two different versions of a web page may be shown to visitors with similar interests, and the successful version is the one with the higher conversion rate. When two versions of an e-commerce website are compared, the version that attracts more buyers is considered successful; similarly, for a subscription website, the version that wins a larger number of paid subscriptions is the successful one. Anything on the website, such as a headline, an image, links, or paragraph text, can be tested.
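As an illustration, the sketch below compares the conversion rates of two hypothetical page versions with a two-proportion z-test. The visitor and conversion counts are made-up numbers, and the 0.05 significance threshold is an assumption for the example, not a value prescribed by the text.

```python
import math

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b, alpha=0.05):
    """Two-proportion z-test comparing version A (control) with version B (treatment)."""
    p_a = conversions_a / visitors_a          # conversion rate of the control
    p_b = conversions_b / visitors_b          # conversion rate of the treatment
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se                      # standardized difference in conversion rates
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    winner = "B (treatment)" if p_b > p_a else "A (control)"
    return p_a, p_b, z, p_value, winner, p_value < alpha

# Hypothetical traffic split: 10,000 visitors per version.
p_a, p_b, z, p_value, winner, significant = ab_test(400, 10000, 460, 10000)
print(f"conversion A = {p_a:.2%}, conversion B = {p_b:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, better version: {winner}, significant: {significant}")
```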

6.4.3.2  Correlation

Correlation is a method used to determine whether there exists a relationship between two variables, that is, whether they are correlated. If they are correlated, the type of correlation is determined by monitoring how the second variable changes when the first variable increases or decreases. Correlation is categorized into three types:

Positive correlation—When one variable increases, the other variable increases.

Figure 6.6a shows positive correlation. Examples of positive correlation are:

1) The production of cold beverages and ice cream increases with the increase in temperature.

2) The more a person exercises, the more the calories burnt.

3) With the increased consumption of food, the weight gain of a person increases.

Negative correlation—When one variable increases, the other variable decreases. Figure 6.6b shows negative correlation.

Examples of negative correlation are:

1) As weather gets colder, the cost of air conditioning decreases.

2) The working capability decreases with the increase in age.

3) With the increase in the speed of the car, time taken to travel decreases.

No correlation—When one variable increases, the other variable does not change. Figure 6.6c shows no correlation. An example of no correlation between two variables is:

1) There is no correlation between eating Cheetos and speaking better English.

With the scatterplots shown in Figure 6.6, it is easy to determine whether the variables are correlated. However, to quantify the correlation between two variables, Pearson’s correlation coefficient r is used.

Figure 6.6  (a) Positive correlation. (b) Negative correlation. (c) No correlation.

The technique used to calculate the correlation coefficient is called the Pearson product moment correlation. The formula for the correlation coefficient is

$$ r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}\,\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}} $$

To compute the value of r, the mean is subtracted from each observation for the x and y variables.

The value of the correlation coefficient ranges from −1 to +1. A value of +1 or −1 indicates perfect correlation. If the value of the correlation coefficient is less than zero, there is a negative correlation between the variables, and an increase in one variable leads to a decrease in the other. If the value is greater than zero, there is a positive correlation, and an increase in one variable leads to an increase in the other.

The larger the magnitude of the correlation coefficient, the stronger the relationship, be it a positive or a negative correlation; a value closer to zero indicates a weaker relationship between the variables. If the value of the correlation coefficient is zero, there is no relationship between the variables. A value close to +1 indicates a high positive correlation, and a value close to −1 indicates a high negative correlation.

The Pearson product moment correlation is the most widely adopted technique to determine the correlation coefficient. Other techniques used to calculate the correlation coefficient are the Spearman rank order correlation, the phi correlation, and the point-biserial correlation.
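The following is a minimal sketch of the Pearson product moment calculation, implementing the formula above directly; the temperature and sales values are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Subtract the mean from each observation, as described above.
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(dxi * dyi for dxi, dyi in zip(dx, dy))
    denominator = math.sqrt(sum(dxi ** 2 for dxi in dx) * sum(dyi ** 2 for dyi in dy))
    return numerator / denominator

# Hypothetical daily temperatures (°C) and ice cream sales (units).
temperature = [18, 21, 24, 27, 30, 33]
sales = [110, 135, 160, 190, 220, 260]
r = pearson_r(temperature, sales)
print(f"r = {r:.3f}")   # a value close to +1 indicates a strong positive correlation
```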

6.4.3.3  Regression

Regression is a technique used to determine the relationship between a dependent variable and an independent variable. The dependent variable is the outcome, response, or predicted variable, denoted by “Y,” and the independent variable is the predictor, explanatory, carrier, or input variable, denoted by “X.” The regression technique is used when a relationship exists between the variables. The relationship can be determined with scatterplots and can be modeled by fitting the data points to a linear equation. The linear equation is

Y = a + bX, where
X = independent variable,
Y = dependent variable,
a = intercept, the value of Y when X = 0, and
b = slope of the line.

The major difference between regression and correlation is that correlation does not imply causation: a change in one variable does not cause a change in another variable even if there is a strong correlation between the two. Regression, on the other hand, implies a degree of causation between the dependent and the independent variable. Thus, correlation can be used to determine whether a relationship exists between two variables, and if it does, regression can be used further to estimate the value of the dependent variable from an independent variable whose value is already known.

In order to determine the extra stock of ice cream required, analysts feed in the temperature value obtained from the weather forecast. Here, the temperature is treated as the independent variable and the ice cream stock as the dependent variable. Analysts frame a percentage increase in stock for a specific increase in temperature; for example, the total stock may need to be increased by 10% for every 5°C increase in temperature. The regression may be linear or nonlinear.

Figure 6.7a shows a linear regression. When there is a constant rate of change, then it is called linear regression.

Figure 6.7b shows nonlinear regression. When there is a variable rate of change, then it is called nonlinear regression.
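To make the ice cream example concrete, the sketch below fits the least-squares line Y = a + bX to hypothetical temperature and stock data and then predicts the stock needed for a forecast temperature; the data values and the forecast temperature are illustrative assumptions, not figures from the text.

```python
def fit_linear_regression(x, y):
    """Ordinary least-squares fit of Y = a + bX; returns (intercept a, slope b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x                  # intercept: value of Y when X = 0
    return a, b

# Hypothetical historical data: temperature (°C) vs. ice cream stock sold (units).
temperature = [18, 21, 24, 27, 30, 33]
stock_sold = [110, 135, 160, 190, 220, 260]

a, b = fit_linear_regression(temperature, stock_sold)
forecast_temp = 35                           # temperature from the weather forecast
predicted_stock = a + b * forecast_temp      # dependent variable predicted from X
print(f"Y = {a:.1f} + {b:.1f}X  ->  predicted stock at {forecast_temp} °C: {predicted_stock:.0f} units")
```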