

In the document BUKU BIG DATA ANALYTICS (pages 72-78)

Introduction to Big Data Analytics



1. Why is data quality important in discovering new knowledge and decision making?

2. List the examples of cloud services for exporting data stores.

3. How is conversion to a CSV file before storing in a data store beneficial? How is conversion from CSV files in a data store to tables beneficial?

4. List the usages of the three types of services that clouds offer. List Big Data cloud services for exporting data sources from a data store, and for performing analytics, visualization and intelligence discovery in the cloud.

5. Consider databases storing the daily sales figures of chocolates, such as KitKat, Milk, Fruit and Nuts, Nougat and Oreo, each at every machine in Example 1.6(i).

How will you name the data sources in ACVMs analytics? How will the ACVMs sales be analyzed for each type of chocolate using the data-source master tables?

1.6 DATA STORAGE AND ANALYSIS

The following subsections describe data storage and analysis, and compare Big Data management and analysis with traditional database management systems.

1.6.1 Data Storage and Management: Traditional Systems

Business process data stores business events, such as registering a customer, taking an order, generating an invoice, and managing products, in pre-defined formats. The data falls in the category of highly structured data.

The data consists of transaction records, tables, relationships and metadata that build the information about the business data.

Examples are commercial transactions, banking/stock records and e-commerce transactions data.

The following example explains the sources and data store of semi-structured data.

EXAMPLE 1.12

Give examples of sources and data stores of semi-structured data.

SOLUTION

Examples of semi-structured data are:

XML and JSON semi-structured documents

A comma-separated values (CSV) file. A CSV file stores tabular data in plain text.

Each line is a data record. A record can have several fields, each field separated by a comma. Structured data, such as a database, includes multiple relations, but a single CSV file does not represent the relations. CSV cannot represent object-oriented databases or hierarchical data records. A CSV file is as follows:

Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010,M.Tech.,Mobile Operating System,8.5

The data represents the records for the columns and rows of a table. Each row has a name, year of passing, degree name, course name and grade point out of 10.

Rows are separated by a new line and the columns by a comma.

JSON Object Data Formats: CSV does not represent object-oriented records, databases or hierarchical data records. JSON and XML represent semi-structured data and can represent object-oriented and hierarchical data records.

Example 3.5 explains CSV and JSON objects and the hierarchical data records in the JSON file format.
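As a sketch of the contrast above (the field names and the nested layout are illustrative, not from the text), the same two records can be read from flat CSV and re-expressed as a hierarchical JSON object:

```python
import csv
import io
import json

# The CSV records from the example above: name, year of passing,
# degree, course, grade point out of 10.
csv_text = """Preeti,1995,MCA,Object Oriented Programming,8.75
Kirti,2010,M.Tech.,Mobile Operating System,8.5"""

fields = ["name", "year", "degree", "course", "grade_point"]
rows = [dict(zip(fields, rec)) for rec in csv.reader(io.StringIO(csv_text))]

# JSON can nest the same records hierarchically, which a flat CSV cannot.
hierarchical = {
    "students": [
        {"name": r["name"],
         "qualification": {"year": int(r["year"]),
                           "degree": r["degree"],
                           "course": r["course"]},
         "grade_point": float(r["grade_point"])}
        for r in rows
    ]
}
print(json.dumps(hierarchical, indent=2))
```

The flat CSV rows carry no structure beyond field position; the JSON form groups the qualification fields into a nested object under each student.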

1.6.1.2 SQL

An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing (update, insert, append or delete) databases. It is a language for data access control, schema creation and data modification.

SQL was originally based on the tuple relational calculus and relational algebra. SQL can embed within other languages using SQL modules, libraries and pre-compilers. SQL does the following:

1. Create schema, a structure that contains the description of objects (base tables, views, constraints) created by a user. The user can describe and define the data in the database.

2. Create catalog, which consists of a set of schemas which describe the database.

3. Data Definition Language (DDL) for the commands which define a database, including creating, altering and dropping tables and establishing constraints.

A user can create and drop databases and tables, establish foreign keys, and create views, stored procedures and functions in the database.

4. Data Manipulation Language (DML) for commands that maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.

5. Data Control Language (DCL) for commands that control a database, and include administering of privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.

SQL is a language for managing the RDBMS. A relational DB is a collection of data in multiple tables, which relate to each other through special fields, called keys (primary key, foreign key and unique key). Relational databases provide flexibility. Relational database examples are MySQL, PostgreSQL, Oracle Database, Informix, IBM DB2 and Microsoft SQL Server.
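The DDL, DML and query commands listed above can be sketched using Python's built-in sqlite3 module (the student table and its columns are invented for illustration, not taken from the text):

```python
import sqlite3

# In-memory SQLite database for demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table with a primary key and a CHECK constraint.
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    grade   REAL CHECK (grade <= 10.0))""")

# DML: insert and update records.
cur.execute("INSERT INTO student VALUES (1, 'Preeti', 8.75)")
cur.execute("INSERT INTO student VALUES (2, 'Kirti', 8.5)")
cur.execute("UPDATE student SET grade = 8.6 WHERE roll_no = 2")

# DML: query (SELECT) the data.
rows = cur.execute(
    "SELECT name, grade FROM student ORDER BY roll_no").fetchall()
print(rows)  # [('Preeti', 8.75), ('Kirti', 8.6)]
```

SQLite omits DCL (GRANT/REVOKE); in server RDBMSs such as MySQL or PostgreSQL those commands control table, procedure and view permissions as described in item 5.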

1.6.1.3 Large Data Storage using RDBMS

RDBMS tables store data in a structured form. The tables have rows and columns. Data management of Data Store includes the provisions for privacy and security, data integration, compaction and fusion. The systems use machine-generated data, human-sourced data, and data from business processes (BP) and business intelligence (BI).

Traditional systems: A relational database is a collection of data in multiple tables, which relate to each other through special fields, called keys.

A set of keys and relational keys access the fields of tables and retrieve data using queries (insert, modify, append, join or delete). RDBMSs also use software for data administration.

Online content associated with Practice Exercise 1.12 describes the use of tables in relational databases in detail.

1.6.1.4 Distributed Database Management System

A distributed DBMS (DDBMS) is a collection of logically interrelated databases at multiple systems over a computer network. The features of a distributed database system are:

1. A collection of logically related databases.

A distributed DBMS is a collection of logically interrelated databases at multiple systems over a computer network.

2. Cooperation between databases in a transparent manner. Transparent means that each user within the system may access all of the data within all of the databases as if they were a single database.

3. Should be 'location independent' which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.

1.6.1.5 In-Memory Column-Format Data

A columnar in-memory format allows faster data retrieval when only a few columns in a table need to be selected during query processing or aggregation. The data in a column are kept together in-memory in columnar format. A single memory access therefore loads many values of the column. An address increment to the next memory address for the next value is fast, compared with first computing the address of the next value when it is not the immediately next address. The following example explains the in-memory columnar format.

EXAMPLE 1.13

Consider the analysis of monthly sales of chocolates on ACVMs (Example 1.6) in the company's annual profit reports.

(i) How does the in-memory columnar format make sales analysis easy?

(ii) How is access made to a few columns during an analysis, instead of to entire datasets?

SOLUTION

All the column 1 values for several days' records are physically together in-memory at consecutive addresses. All the column 2 values are then physically together at the next successive addresses. Then, column 3 and the other columns are stored in the columnar database in-memory.

The data store orders each record across successive columns, so that the 100th entry in column 1 and the 100th entry in column 2 belong to the same record and the same input, accessible from a single row-key. A column vector refers to a vector whose elements are the values of the column fields.

Analytics, therefore, can be executed faster when data is in the column format and more rows but fewer columns need to be selected during analysis. Successive days' sales of each flavour of chocolate are stored as successive values in one column, from row r to (r + 29) in a month: thirty row-keys for 30 days, and 365 row-keys in a year.

Aggregation functions and other analysis functions are easy to run due to the successive memory addresses of each day's sales of each flavour. Examples of aggregation functions are sum, count, maximum, minimum, average, and minimum and maximum deviation from a specified value.
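A minimal sketch of the idea, with invented sales figures: each flavour's daily sales sit in one contiguous array, and an aggregation scans only the columns it needs rather than whole records:

```python
from array import array

# Columnar store for ACVM chocolate sales: one contiguous array per
# flavour, one entry per day (5 days of made-up figures).
columns = {
    "KitKat": array("i", [120, 95, 130, 110, 105]),
    "Nougat": array("i", [60, 72, 55, 80, 66]),
    "Oreo":   array("i", [90, 88, 101, 94, 99]),
}

# Aggregations touch only the required column's contiguous values.
total_kitkat = sum(columns["KitKat"])
avg_oreo = sum(columns["Oreo"]) / len(columns["Oreo"])
print(total_kitkat, avg_oreo)  # 560 94.4
```

Because each `array` is a contiguous block of memory, summing one flavour never loads the other flavours' values, which is the access pattern the columnar format optimizes.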

Online Analytical Processing (OLAP) in real-time transaction processing is fast when using in-memory column-format tables. OLAP enables real-time analytics. The CPU accesses many values of a column in a single memory access in columnar-format in-memory data storage.

Online Analytical Processing (OLAP) enables online viewing of analyzed data and visualization up to the desired granularity (fineness or coarseness). It enables viewing by rolling up (finer granularity to coarser granularity) or drilling down (coarser granularity to finer granularity). OLAP enables obtaining online summarized information and automated reports for a large database.

Metadata describes the data. Pre-storing of calculated values provides consistently fast responses. Result formats from the queries are based on metadata.
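The roll-up operation can be sketched as follows, with made-up daily sales figures aggregated from day granularity up to month granularity:

```python
from collections import defaultdict

# Finer granularity: daily sales totals (made-up figures).
daily_sales = {
    "2024-01-05": 310, "2024-01-17": 285,
    "2024-02-02": 330, "2024-02-20": 295,
}

# Roll up: aggregate day-level data into coarser month-level data.
monthly = defaultdict(int)
for day, units in daily_sales.items():
    month = day[:7]          # "YYYY-MM"
    monthly[month] += units

print(dict(monthly))  # {'2024-01': 595, '2024-02': 625}
```

Drilling down is the reverse view: starting from the monthly totals and expanding back to the underlying daily records.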

1.6.1.6 In-Memory Row Format Databases

A row format in-memory allows much faster data processing during OLTP (online transaction processing). Refer to Example 1.13. Each row record has corresponding values in multiple columns, and the values are stored at consecutive memory addresses in row format. A specific day's sales of five different chocolate flavours are stored in consecutive columns c to (c + 4) in memory. A single memory access loads the values of all five flavours in successive columns during online processing. For example, the total number of chocolates sold is computed online. Data is in in-memory row format in stream and event analytics. The stream analytics method does continuous computation as data flows through the system. Event analytics does computation on events and uses event data for tracking and reporting events.
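A rough sketch of the row format, with invented figures: each record keeps all of a day's column values together, so one row access yields the whole day's sales and the day's total computes from a single record:

```python
# Row-format store: one tuple per record, all column values contiguous.
# Columns: day, KitKat, Milk, Fruit&Nut, Nougat, Oreo (made-up figures).
rows = [
    ("2024-01-01", 120, 45, 30, 60, 90),
    ("2024-01-02", 95, 50, 28, 72, 88),
]

# One row access fetches all five flavour counts for that day,
# so the day's total is computed online from a single record.
day, *sales = rows[0]
print(day, sum(sales))  # 2024-01-01 345
```

Contrast this with the columnar sketch in the previous subsection: the row format favors whole-record reads and writes (OLTP), while the column format favors per-column aggregation (OLAP).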

1.6.1.7 Enterprise Data-Store Server and Data Warehouse

Enterprise data, after the data cleaning process, integrates with the server data at the warehouse. The enterprise data server uses data from several distributed sources, which store data using various technologies. All the data merge using an integration tool. Integration enables collective viewing of the datasets at the data warehouse (Figure 1.3).

Enterprise data integration may also include integration with application(s), such as analytics, visualization, reporting, business intelligence and knowledge discovery.

Heterogeneous systems execute complex integration processes when integrating at an enterprise server or data warehouse. Complex application-integration means the integration of heterogeneous application architectures and processes with the databases at the enterprise. The enterprise data warehouse stores the databases and data stores after integration, using tools, from a number of sources.

Online contents associated with Practice Exercises 1.9 and 1.10 give details of commercial solutions for complex application-integration of processes.

Following are some standardised business processes, as defined in the Oracle application-integration architecture:

1. Integrating and enhancing the existing systems and processes

2. Business intelligence

3. Data security and integrity

4. New business services/products (Web services)

5. Collaboration/knowledge management

6. Enterprise architecture/SOA

7. e-commerce

8. External customer services

9. Supply chain automation/visualization

10. Data centre optimization

Figure 1.6 shows Steps 1 to 5 in enterprise data integration and management with Big Data for high performance computing using local and cloud resources for analytics, applications and services.

[Figure 1.6: steps in enterprise data and applications integration and management with Big Data for high performance computing using local and cloud resources. Recoverable labels: Oracle Fusion Middleware; Oracle Database and external data; data reports, data visualization, business intelligence, decisions and knowledge discovery; data governance and data management models based on SQL or NoSQL; data analytics, text analytics, noisy-text analytics and natural language processing; analytics with social-network messages, pages and graph databases.]

Figure 1.6 Steps 1 to 5 in enterprise data integration and management with Big Data for high performance computing using local and cloud resources for the analytics, applications and services
