
Data Sources

From the document BUKU BIG DATA ANALYTICS (pages 61-64)

Introduction to Big Data Analytics

1.5.1 Data Sources


1.5 DATA SOURCES, QUALITY, PRE-PROCESSING AND STORING

The following subsections describe data sources, data quality, data pre-processing and data store export to the cloud.

(ii) File sources are stored files. An application executing the data first connects to a driver manager of the source. A user, client or application does not register with the source, but connects to the manager when required. The process of connection is simple when using a file data source, provided the file contains a connection string that would otherwise have to be built using a call to a connect-function driver.
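As a minimal sketch of the file data source idea, assume the stored file follows the INI-style layout used by ODBC file DSNs (an `[ODBC]` section with `KEY=value` lines). The file contents, driver name, server name and database name below are hypothetical. The function only shows how a stored file can supply the connection string, so that the application does not have to build one itself.

```python
import configparser

# Hypothetical contents of a file data source; every value is illustrative.
FILE_DSN_TEXT = """\
[ODBC]
DRIVER=Example SQL Driver
SERVER=grades-server
DATABASE=UG_CS_Grades
"""


def connection_string_from_file_dsn(text: str) -> str:
    """Build a 'KEY=value;KEY=value' connection string from file-DSN text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep keyword case exactly as written in the file
    parser.read_string(text)
    pairs = parser["ODBC"]
    return ";".join(f"{key}={value}" for key, value in pairs.items())


print(connection_string_from_file_dsn(FILE_DSN_TEXT))
```

The resulting string is what an application would hand to the driver manager when it connects.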

Oracle applications consider two types of data sources: (i) database, which identifies the database information that the software needs to connect to a database, and (ii) logic-machine, which identifies the machine which runs batches of applications and master business functions [5]. Source definition identifies the machine. The source can be on a network. The definition in that case also includes network information, such as the name of the server, which hosts the machine functions.

The applications consider data sources as the ones where the database tables reside and where the software runs logic objects for an enterprise. Data sources can point to:

1. A database in a specific location or in a data library of OS

2. A specific machine in the enterprise that processes logic

3. A data source master table which stores data source definitions. The table may be at a centralized source (enterprise server) or at server-map for the source.
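The three kinds of targets listed above can be sketched as entries in a small in-memory master table. This is only an illustration under stated assumptions: the class `DataSourceDefinition`, its field names, and the entries `SalesDB` and `BatchHost` are hypothetical and do not come from any Oracle product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSourceDefinition:
    name: str
    kind: str                     # "database" or "logic-machine"
    location: str                 # database location or library, or machine name
    server: Optional[str] = None  # network information, when the source is on a network


# Hypothetical data source master table: source name -> definition.
MASTER_TABLE = {
    definition.name: definition
    for definition in (
        DataSourceDefinition("SalesDB", "database", "data library SALESLIB"),
        DataSourceDefinition("BatchHost", "logic-machine", "machine APP01",
                             server="ent-server-1"),
    )
}


def resolve(name: str) -> DataSourceDefinition:
    """Look up a definition in the master table; raises KeyError if unknown."""
    return MASTER_TABLE[name]


print(resolve("BatchHost"))
```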

A database can be in an IBM i data library [IBM i is a computer operating system in which IBM i considers everything as an object, each possessing persistence. The system IBM i offers Unix-like file directories using an integrated file system.].

IBM applications consider a data source for applications and tools as one which identifies either (i) a specific database instance or (ii) a file on a remote system that stores data [6]. Data sources can be shared. Access to a source is restricted according to the roles assigned to both the source and the application that uses it.
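A minimal sketch of role-restricted access follows. The rule used here (access is granted when the application and the source share at least one role) is an assumption made for illustration, as are the role names and the application `Mail_APP`; the text does not specify how IBM tools evaluate roles.

```python
# Hypothetical role assignments; all names are illustrative only.
SOURCE_ROLES = {"UG_CS_Sem_StudID_Grades": {"registrar", "analyst"}}
APPLICATION_ROLES = {"Analysis_APP": {"analyst"}, "Mail_APP": {"mailer"}}


def may_access(application: str, source: str) -> bool:
    """Assumed rule: grant access when application and source share a role."""
    shared = APPLICATION_ROLES.get(application, set()) & SOURCE_ROLES.get(source, set())
    return bool(shared)


print(may_access("Analysis_APP", "UG_CS_Sem_StudID_Grades"))
print(may_access("Mail_APP", "UG_CS_Sem_StudID_Grades"))
```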

EXAMPLE 1.7

(i) How would you name the data sources of the student grade-sheets?

(ii) How does an analytics application (Analysis_APP) access the student grade-sheet data source, using the Data Dictionary or data-source master-table for the grade-sheets of students?

(iii) How does the application connect and access the data source of students' grade-sheets?

Assume each student can have a grade-sheet for each of the six semesters in UG Computer Science programme.

SOLUTION

(i) Assume SemID is a distinct key for a semester. StudID is a key assigned to a student, whether in CS or another subject, and whether in UG or PG. A StudID is unique. The data source can be a file data source named 'UG_CS_Sem_StudID_Grades' for all UG CS student grades. The UG_CS_Sem_StudID_Grades database consists of a maximum of six grade sheets UG_CS_SemID_StudID_Grades, i.e., one for each semester. Assume that Analysis_APP does not connect or directly link to the data source UG_CS_Sem_StudID_Grades database. Then, Analysis_APP links to a Data Dictionary or data source master table, which is a data repository for the pointers to all six semesters of the UG Computer Science program and other subject programs.

(ii) Assume that Analysis_APP is associated with an Oracle data-source master-table. The table stores the data-source definitions for all UG and PG, and all subjects and semester grades of the students. The data-source master-table stores the pointers to all semester grades. The table thus points to the UG_CS_Sem_StudID_Grades DB for the student identified by StudID.
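Parts (i) and (ii) can be sketched together: one grade-sheet name per semester and student, and a master table that holds the pointers. The concrete name format (an integer SemID from 1 to 6 substituted into 'UG_CS_SemID_StudID_Grades') and the sample StudID `S1001` are assumptions for illustration only.

```python
SEMESTERS = range(1, 7)  # six semesters in the UG Computer Science programme


def grade_sheet_name(sem_id: int, stud_id: str) -> str:
    """Name one grade sheet, following the UG_CS_SemID_StudID_Grades pattern."""
    return f"UG_CS_{sem_id}_{stud_id}_Grades"


def build_master_table(stud_ids):
    """Map each StudID to the names (pointers) of its six semester grade sheets."""
    return {
        stud_id: [grade_sheet_name(sem_id, stud_id) for sem_id in SEMESTERS]
        for stud_id in stud_ids
    }


master_table = build_master_table(["S1001"])  # hypothetical StudID
for pointer in master_table["S1001"]:
    print(pointer)
```

An application such as Analysis_APP would then look up a StudID in the master table instead of linking to each grade sheet directly.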

(iii) Assume that the application deploys Microsoft DB. Then, Analysis_APP first links to a Driver Manager. Analysis_APP then calls the ODBC functions in the Driver Manager. The application identifies the target driver for the UG_CS_Sem_StudID_Grades data source with a connection handle. When the Driver Manager loads the driver, it builds a table of pointers to the functions in that driver. It uses the connection handle passed by the application to find the address of the function in the target driver and calls that function by address.
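The dispatch mechanism described in (iii) can be sketched in a few lines. This is a simplified model, not the ODBC API: the classes `ExampleDriver` and `DriverManager`, their method names, and the integer handles are all hypothetical, and a Python dictionary of bound methods stands in for the table of function pointers.

```python
import itertools


class ExampleDriver:
    """Hypothetical driver exposing functions for one kind of data source."""

    def connect(self, source_name):
        return f"connected to {source_name}"

    def fetch(self, query):
        return f"rows for {query!r}"


class DriverManager:
    def __init__(self):
        self._function_tables = {}  # connection handle -> {function name: callable}
        self._handles = itertools.count(1)

    def load_driver(self, driver):
        """Load a driver, build its table of function pointers, return a handle."""
        handle = next(self._handles)
        self._function_tables[handle] = {
            name: getattr(driver, name)
            for name in dir(driver)
            if not name.startswith("_") and callable(getattr(driver, name))
        }
        return handle

    def call(self, handle, function_name, *args):
        """Find the function through the connection handle and call it."""
        return self._function_tables[handle][function_name](*args)


manager = DriverManager()
handle = manager.load_driver(ExampleDriver())
print(manager.call(handle, "connect", "UG_CS_Sem_StudID_Grades"))
```

The application never calls the driver directly; it passes the handle and a function name to the manager, which resolves the call through the table.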

1.5.1.2 Unstructured Data Sources

Unstructured data sources are distributed over high-speed networks. The data need high velocity processing. Sources are from distributed file systems. The sources are of file types, such as .txt (text file) and .csv (comma separated values file). Data may be as key-value pairs, such as hash key-value pairs. Data may have internal structures, such as in e-mail, Facebook pages, Twitter messages etc. The data do not follow a data model, and do not reveal relationships, hierarchical relationships or object-oriented features, such as extensibility.
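A short sketch of two of the forms mentioned above, using only the Python standard library: a .csv source turned into key-value pairs, and an e-mail whose headers give it internal structure while the body remains free text. The sample CSV rows, column names and e-mail content are invented for illustration.

```python
import csv
import io
from email import message_from_string

# Hypothetical .csv content read into key-value pairs (StudID -> Grade).
CSV_TEXT = "StudID,Grade\nS1001,A\nS1002,B\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
grades = {row["StudID"]: row["Grade"] for row in rows}

# Hypothetical e-mail: headers are structured, the body is unstructured text.
EMAIL_TEXT = "From: a@example.com\nSubject: Grades\n\nFree-form body text."
msg = message_from_string(EMAIL_TEXT)
subject = msg["Subject"]
body = msg.get_payload()

print(grades)
print(subject, "|", body)
```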

1.5.1.3 Data Sources - Sensors, Signals and GPS

The data sources can be sensors, sensor networks, signals from machines, devices, controllers and intelligent edge nodes of different types in the industry M2M communication and the GPS systems.

Sensors are electronic devices that sense the physical environment. Sensors are devices which are used for measuring temperature, pressure, humidity, light intensity, traffic in proximity, acceleration, locations, object(s) proximity, orientations and magnetic intensity, and other physical states and parameters. Sensors play an active role in the automotive industry.

RFIDs and their sensors play an active role in RFID based supply chain management, and tracking parcels, goods and delivery.

Sensors embedded in processors that include machine-learning instructions and wireless communication capabilities are innovations. They are sources in IoT applications.
