
Conceptual Learning Data Machine Learning


Academic year: 2018


By: Lecturer Team


Telkom University

• Data Simulation (Monte Carlo)
• Data Preprocessing
• Conceptual Learning Data / Machine Learning
• Model Evaluation / Accuracy
• Case Study / Exercise


Modeling and Simulation

Modeling and simulation (M&S) refers to using models (physical, mathematical, or otherwise logical representations of a system, entity, phenomenon, or process) as a basis for simulations (methods for implementing a model, either statically or over time) to develop data as a basis for managerial or technical decision making.[1][2] M&S helps obtain information about how something will behave without actually testing it in real life (Wikipedia).

An Example of Simulation : Monte Carlo Methods

Monte Carlo

Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three distinct problem classes:[1] optimization, numerical integration, and generating draws from a probability distribution.
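As an illustration of repeated random sampling applied to a deterministic quantity, the following minimal sketch estimates π by drawing points uniformly in the unit square and counting how many fall inside the quarter circle. The function name `estimate_pi`, the sample count, and the fixed seed are illustrative assumptions, not part of the original slides.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # point lies inside the quarter circle
            inside += 1
    # Area of quarter circle / area of square = pi / 4
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

The estimate tends to improve as the number of samples grows, with an error that shrinks roughly in proportion to one over the square root of the sample count.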


Why Simulation

• Simulations are generally cheaper, safer, and sometimes more ethical than conducting real-world experiments. For example, supercomputers are sometimes used to simulate the detonation of nuclear devices and their effects in order to support better preparedness in the event of a nuclear explosion. Similar efforts are conducted to simulate hurricanes and other natural catastrophes.

• Simulations can often be even more realistic than traditional experiments, as they allow the free configuration of environment parameters found in the operational application field of the final product. Examples are supporting deep-water operations of the US Navy or simulating the surface of neighboring planets in preparation for NASA missions.


Data Preprocessing (Why?)

Measures for data quality: a multidimensional view

• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how far can the data be trusted to be correct?


Major Tasks

1. Data cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies

2. Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression

3. Data transformation and data discretization
• Normalization
• Concept hierarchy generation

4. Data integration
• Integration of multiple databases or files
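Normalization, listed under data transformation, can be illustrated with min-max scaling, one common variant. The sketch below is a minimal example under that assumption; the function name `min_max_normalize` and the default target range of 0 to 1 are illustrative choices, not taken from the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]


if __name__ == "__main__":
    salaries = [30_000, 45_000, 60_000]
    print(min_max_normalize(salaries))
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range; this is one reason outlier handling is usually done during data cleaning first.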


Data in the Real World Is Dirty

Lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., Occupation=“ ” (missing data)
• Noisy: containing noise, errors, or outliers
  e.g., Salary=“−10” (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
  Age=“42”, Birthday=“03/07/2010”
  Was rating “1, 2, 3”, now rating “A, B, C”
  Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  Jan. 1 as everyone’s birthday?


Data is not always available

• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

• equipment malfunction
• data that was inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered

Missing data may need to be inferred
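One simple way to infer missing values is to fill them with the mean of the observed values for the same attribute. The sketch below assumes missing entries are represented as `None`; the function name `impute_mean` is hypothetical, and mean imputation is only one of several possible strategies.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to infer from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    customer_income = [40, None, 60, None]
    print(impute_mean(customer_income))
```

Mean imputation keeps the attribute's average unchanged but reduces its variance, so it is best treated as a baseline.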


Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results

Why data reduction?

• A database or data warehouse may store terabytes of data
• Complex data analysis may take a very long time to run on the complete data set

Data reduction strategies

1. Dimensionality reduction
• Feature extraction
• Feature selection

2. Numerosity reduction
• Regression and log-linear models


1. Estimation: Linear Regression, Neural Network, Support Vector Machine, etc.

2. Prediction/Forecasting: Linear Regression, Neural Network, Support Vector Machine, etc.

3. Classification: Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant Analysis, Logistic Regression, etc.

4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc.

5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc.


Evaluation

1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.

3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)

4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index
• External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix

5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
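A few of the measures listed here can be written out directly. The sketch below assumes the usual textbook definitions of MSE, RMSE, MAPE, and classification accuracy; the function names are illustrative and the sample numbers are made up for demonstration only.

```python
import math


def mse(actual, predicted):
    """Mean squared error between two equal-length numeric sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


def accuracy(actual_labels, predicted_labels):
    """Fraction of predictions that match the actual class labels."""
    correct = sum(1 for a, p in zip(actual_labels, predicted_labels) if a == p)
    return correct / len(actual_labels)


if __name__ == "__main__":
    print(rmse([3.0, 5.0], [2.0, 7.0]))
    print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
```

Accuracy computed this way equals the sum of the diagonal of the confusion matrix divided by the total number of examples.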


Machine Learning

In the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data (Wikipedia).

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI (Stanford/Coursera).


Data Split

The Split Data operator takes a dataset as its input and delivers subsets of that dataset through its output ports.

The sampling type parameter decides how the examples are distributed into the resulting partitions:

1. Linear sampling: divides the dataset into partitions without changing the order of the examples, so subsets of consecutive examples are created.

2. Shuffled sampling: builds random subsets of the dataset; examples are chosen randomly for each subset.

3. Stratified sampling: builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset. In the case of a binominal classification, stratified sampling builds random subsets so that each subset contains roughly the same proportions of the two values of the label.
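The three sampling types can be sketched in a few lines. This is a rough illustration of the ideas only, not the implementation of the Split Data operator; the function names, the two-way split, and the `ratio` and `seed` parameters are assumptions made for the example.

```python
import random
from collections import defaultdict


def linear_split(examples, ratio):
    """Keep the original order: the first part goes to one subset, the rest to the other."""
    cut = int(len(examples) * ratio)
    return examples[:cut], examples[cut:]


def shuffled_split(examples, ratio, seed=0):
    """Shuffle a copy of the examples, then split it linearly."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, ratio)


def stratified_split(examples, labels, ratio, seed=0):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    first, second = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * ratio)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second


if __name__ == "__main__":
    ids = list(range(12))
    labels = ["yes"] * 8 + ["no"] * 4
    print(stratified_split(ids, labels, 0.5))
```

With 8 "yes" and 4 "no" examples and a ratio of 0.5, each stratified subset receives 4 "yes" and 2 "no" examples, matching the 2:1 proportion of the whole dataset.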


Cross Validation Methods

• Cross-validation is used to avoid overlap between test sets
• Cross-validation steps:
  • Divide the data into k subsets of equal size
  • Use each subset in turn as testing data and the rest as training data
• This method is also called k-fold cross-validation
• We often use stratified sampling before the cross-validation process, because it reduces


10 Fold Cross-Validation
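The k-fold procedure can be sketched as an index generator: the data is shuffled once, divided into k folds of nearly equal size, and each fold serves as the test set exactly once. The function name `k_fold_indices` and its parameters are illustrative assumptions; the sketch does not stratify the folds.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds; sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

The accuracy reported for a model is then typically the average of the k accuracies measured on the held-out folds.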


Exercise:

1. Use one of the following tools: RapidMiner, R, Orange, Weka

2. Create a prediction model (predicting the electability of legislative candidates) using training data from the election data set (datapemilukpu.xls) with the following algorithms:
• Decision Tree (C4.5)
• Naïve Bayes (NB)
• K-Nearest Neighbor (K-NN)

3. Evaluate the models (accuracy testing) using 10-fold cross-validation

            C4.5      NB        K-NN
Accuracy    92.45%    77.46%    88.72%
