Big Data Analytics

6.3 Data Analytics Life Cycle

is analyzed with past data. Diagnostic analytics is used to analyze and understand customer behavior, predictive analytics is used to predict future customer behavior, and prescriptive analytics is used to influence that future behavior.

perceive if the issue in hand really pertains to big data. For a problem to be classified as a big data problem, it needs to be associated with one or more of the characteristics of big data, that is, volume, variety, and velocity. The data scientists need to assess the source data available to carry out the analysis in hand. The data set may be accessible internally within the organization, or it may be available externally from third-party data providers. It must be determined whether the available data are adequate to achieve the target analysis. If they are not, either additional data have to be collected or the available data have to be transformed. If the data are still not sufficient to achieve the target, the scope of the analysis is constrained to work within the limits of the data available. The underlying budget, the availability of domain experts, the tools and technology needed, and the level of analytical and technological support available within the organization are to be evaluated. It is important to weigh the estimated budget against the benefits of obtaining the desired objective. In addition, the time required to complete the project is to be evaluated.

Figure 6.3  Analytics life cycle. (Labels shown in the figure: Source Data; Data Selection, analyzing what data is needed for application; Data Mart; Data Cleaning; Preprocessed Data; Data Transformation (Alpha, Numeric); Transformed Data; Analysis; Patterns; Interpretation and Evaluation; Analytics Application.)

6.3.2  Data Preparation

The required data could be spread across disparate data sets that have to be consolidated via fields that the data sets have in common. Performing this integration might be complicated because of differences in their data structure and semantics. A semantic difference arises when the same value has different labels in different data sets, such as DOB and date of birth. Figure 6.4 illustrates a simple data integration using the EmpId field.

The data gathered from various sources may be erroneous, corrupt, and inconsistent and thus have no significant value to the analysis problem in hand. Therefore, the data have to be preprocessed before they are used, to make the analysis effective and meaningful and to gain the required insight from the business data.

Data that may be considered as unimportant for one analysis could be important for a different type of problem analysis, so a copy of the original data set, be it an internal data set or a data set external to the organization, has to be persisted before filtering the data set. In case of batch analysis, data have to be preserved before analysis and in case of real-time analysis, data have to be preserved after the analysis.

Unlike a traditional database, where the data are structured and validated, the source data for big data solutions may be unstructured, invalid, and complex in nature, which further complicates the analysis. The data have to be cleansed to validate them and to remove redundancy. In the case of a batch system, the cleansing can be handled by a traditional ETL (Extract, Transform, and Load) operation. In the case of real-time analysis, the data must be validated and cleansed through complex in-memory database systems. In-memory data storage systems load the data into main memory, which avoids writing the data to and reading them from disk, reducing I/O overhead and improving performance.
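The cleansing step described above can be sketched in a few lines of Python. This is only an illustrative sketch for the batch case: the record layout, the validation rule in `is_valid`, and the function names are assumptions made for this example and do not come from any particular ETL tool. The sketch keeps an untouched copy of the raw records before filtering, then drops invalid and redundant records.

```python
import copy


def is_valid(record):
    """Hypothetical validation rule: EmpId is all digits, Name is non-empty."""
    emp_id = record.get("EmpId")
    name = record.get("Name")
    return (
        isinstance(emp_id, str)
        and emp_id.isdigit()
        and isinstance(name, str)
        and name.strip() != ""
    )


def cleanse(records):
    """Return (raw_copy, cleaned); raw_copy preserves the input as received."""
    raw_copy = copy.deepcopy(records)  # persist the original before filtering
    seen = set()
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue  # drop invalid records
        key = (record["EmpId"], record["Name"].strip())
        if key in seen:
            continue  # drop redundant (duplicate) records
        seen.add(key)
        cleaned.append({"EmpId": key[0], "Name": key[1]})
    return raw_copy, cleaned


# Illustrative input only; these records are made up for the sketch.
raw = [
    {"EmpId": "4567", "Name": "Maria"},
    {"EmpId": "4567", "Name": "Maria "},  # duplicate with trailing space
    {"EmpId": "abc", "Name": "John"},     # invalid EmpId
    {"EmpId": "4656", "Name": ""},        # missing name
    {"EmpId": "4656", "Name": "John"},
]
raw_copy, cleaned = cleanse(raw)
```

Keeping `raw_copy` separate from `cleaned` reflects the point made earlier in this section: data discarded for one analysis may still matter for another.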

EmpId  Name
4567   Maria
4656   John

EmpId  Salary  DOB
4567   $2000   08/10/1990
4656   $3000   06/06/1975

EmpId  Name   Salary  DOB
4567   Maria  $2000   08/10/1990
4656   John   $3000   06/06/1975

Figure 6.4  Data integration with EmpId field.
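The integration in Figure 6.4 amounts to a join on the common EmpId field. The following is a minimal Python sketch of that idea, using the values from the figure. The variable names and the `integrate` helper are assumptions made for illustration, and a real big data platform would use its own join facilities.

```python
# The two source data sets from Figure 6.4, as lists of records.
names = [
    {"EmpId": 4567, "Name": "Maria"},
    {"EmpId": 4656, "Name": "John"},
]
payroll = [
    {"EmpId": 4567, "Salary": "$2000", "DOB": "08/10/1990"},
    {"EmpId": 4656, "Salary": "$3000", "DOB": "06/06/1975"},
]


def integrate(left, right, key):
    """Inner-join two lists of records on a shared key field."""
    right_by_key = {row[key]: row for row in right}  # index for fast lookup
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:  # keep only keys present in both data sets
            merged.append({**row, **match})
    return merged


integrated = integrate(names, payroll, "EmpId")
```

Each record in `integrated` carries the EmpId, Name, Salary, and DOB fields, matching the combined table in the figure.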

6.3.3  Data Extraction and Transformation

The data arriving from disparate sources may be in a format that is incompatible with big data analysis. Hence, the data must be extracted and transformed into a format that the big data solution accepts and that can be used to acquire the desired insight from the data. In some cases, neither extraction nor transformation is necessary because the big data solution can process the source data directly, while other cases may demand extraction without any transformation.
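As a small illustration of what a transformation can look like, the Python sketch below converts display strings into typed values. The record reuses the values from Figure 6.4. The `transform` function, the target types, and in particular the assumption that source dates are written as MM/DD/YYYY are choices made for this example, not something the source specifies.

```python
from datetime import datetime


def transform(record):
    """Convert display strings into typed, analysis-friendly values."""
    return {
        "EmpId": int(record["EmpId"]),
        "Salary": int(record["Salary"].lstrip("$")),  # "$2000" -> 2000
        # Assumption: source dates are MM/DD/YYYY; output is ISO 8601.
        "DOB": datetime.strptime(record["DOB"], "%m/%d/%Y").date().isoformat(),
    }


source = {"EmpId": "4567", "Salary": "$2000", "DOB": "08/10/1990"}
transformed = transform(source)
```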

Figure 6.5 illustrates the extraction of Computer Name and User Id from the XML file, which does not require any transformation.
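A minimal Python sketch of this kind of extraction follows, using the standard library's `xml.etree.ElementTree` and the values shown in Figure 6.5. The enclosing `<Record>` element is an assumption added so that the fragment is well-formed XML; it does not appear in the figure.

```python
import xml.etree.ElementTree as ET

# Source fragment modeled on Figure 6.5. The <Record> wrapper is hypothetical.
xml_text = """<?xml version="1.0"?>
<Record>
  <ComputerName>Atl-ws-001</ComputerName>
  <Date>10/31/2015</Date>
  <UserId>334332</UserId>
</Record>"""

root = ET.fromstring(xml_text)

# Extraction only: the values are copied out as-is, with no transformation.
extracted = {
    "Computer Name": root.findtext("ComputerName"),
    "User ID": root.findtext("UserId"),
}
```

The Date element is simply left behind, since only the computer name and user ID are needed.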

6.3.4  Data Analysis and Visualization

Data analysis is the phase where actual analysis on the data set is carried out. The analysis could be iterative in nature, and the task may be repeated until the desired insight is discovered from the data. The analysis could be simple or complex depending on the target to be achieved.

Data analysis falls into two categories, namely, confirmatory analysis and exploratory analysis. Confirmatory data analysis is deductive in nature: the data analysts have a proposed outcome, called a hypothesis, in hand, and the evidence must be evaluated against the facts. Exploratory data analysis is inductive in nature: the data scientists do not have any hypotheses or assumptions; rather, the data set is explored and iterated over until an appropriate pattern or result is found.
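The contrast between the two categories can be sketched in Python. The order values below are made up for illustration. The exact permutation test is one possible way to check a hypothesis and is chosen here only because it needs nothing beyond the standard library; the source does not prescribe it.

```python
from itertools import combinations
from statistics import mean, median

# Made-up order values for two customer groups (illustration only).
group_a = [20, 22, 19, 24, 21]
group_b = [25, 27, 23, 29, 26]

# Exploratory: no hypothesis yet; summarize the data and look for patterns.
summary = {
    name: {"mean": mean(vals), "median": median(vals),
           "min": min(vals), "max": max(vals)}
    for name, vals in (("A", group_a), ("B", group_b))
}

# Confirmatory: the hypothesis "group B spends more than group A" is stated
# first, then checked with an exact permutation test.
pooled = group_a + group_b
n_b = len(group_b)
observed_sum = sum(group_b)

extreme = 0
n_splits = 0
for idx in combinations(range(len(pooled)), n_b):
    n_splits += 1
    # With fixed group sizes, a larger sum for "B" means a larger
    # difference in means, so comparing integer sums is sufficient.
    if sum(pooled[i] for i in idx) >= observed_sum:
        extreme += 1

# Share of relabelings at least as extreme as the observed split.
p_value = extreme / n_splits
```

A small `p_value` means that few random relabelings of the customers produce a gap as large as the observed one, which counts as evidence for the hypothesis. The `summary` dictionary, by contrast, makes no claim and only describes the data.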

Data visualization is the process of presenting the analyzed results visually to the business users for effective interpretation. Without data visualization tools and techniques, the entire analysis life cycle carries only meager value, as the analysis results could be interpreted only by the analysts. Organizations should be able to interpret the analysis results to obtain value from the entire analysis process, to perform visual analysis, and to derive valuable business insights from the massive data.

<?xml version="1.0"?>
<ComputerName>Atl-ws-001</ComputerName>
<Date>10/31/2015</Date>
<UserId>334332</UserId>

Computer Name  User ID
Atl-ws-001     334332

Figure 6.5  Illustration of extraction without transformation.

6.3.5  Analytics Application

The analysis results can be used to enhance the business process and increase business profits by evolving a new business strategy. For example, a customer analysis result, when fed into an online retail store, may deliver a list of recommendations that the consumer may be interested in purchasing, thus making online shopping more customer friendly and improving the business as well.
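To make the retail example concrete, the Python sketch below turns purchase histories into a recommendation list. The customers, the items, and the co-purchase scoring rule are all assumptions made for illustration. Production recommenders are considerably more elaborate, and the source does not specify any particular method.

```python
from collections import Counter

# Made-up purchase histories (illustration only).
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"laptop", "keyboard", "monitor"},
    "dave": {"mouse", "mousepad"},
}


def recommend(customer, purchases, top_n=3):
    """Suggest items bought by customers who share purchases with `customer`."""
    owned = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - owned:    # only items not yet purchased
            scores[item] += overlap
    # Sort by score (descending), then by name, for a deterministic result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


suggestions = recommend("bob", purchases)
```

Here the analysis result is the `scores` table, and the application step is the ranked list handed to the store front end.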