Spatial Data Mining
using SAR-Kriging Model
Atje Setiawan Abdullah
A Lecturer at Informatics Engineering Study Program
Department of Computer Science FMIPA Universitas Padjadjaran Jl. Raya Bandung Sumedang Km 21 Jatinangor
e-mail: [email protected], [email protected]
SEAMS School
Spatio Temporal Data Mining and Optimization Modeling UTC-Bandung, August 9-19, 2016
1. Introduction
In this paper we combine the Expansion Spatial Autoregressive (Expansion SAR) model, an extension of the SAR model, with the Kriging technique to predict the quality of education in elementary schools.
The quality of education is defined as the result of students' study, as measured by the National Final Examination (UAN). In Indonesia, UAN scores still vary widely, because education services differ from one location to another.
Elementary and secondary education is a schooling process intended to bring students to a certain level of cognitive, psychomotor, and affective ability, as specified by the elementary and secondary education curriculum. The achievement reached by students under this curriculum is what the UAN score measures.
1.1 Problems
Research on the quality of education is still limited. It has focused on measuring educational outcomes through school UAN results, and the analysis methods have been limited to descriptive analysis. Given the large geographic extent of education in Indonesia and the social, economic, and cultural conditions that differ from one location to another, the quality of education in schools at various locations in Indonesia is an interesting subject to study with spatial data mining methods.
One spatial data mining model that can be used for description and prediction is the Expansion Spatial Autoregressive (Expansion SAR) model. The Expansion SAR model is used to predict observations at sampled locations, measuring heterogeneity based on the coordinates of each location. A weakness of the SAR model is that it cannot be used to predict at unsampled locations. The Kriging method is a spatial analysis technique that can be used for prediction at unsampled locations. We therefore combine the SAR and Kriging methods into SAR-Kriging, which predicts at unsampled locations by using the SAR parameters as input to the Kriging method.
1.2 The Aims of Research
• To study a model that combines the Expansion SAR model and the Kriging method (SAR-Kriging).
• To apply spatial data mining with the SAR-Kriging method to prediction at unsampled locations. As a case study, we use the SDPN 2003 database to predict the quality of education for elementary school, junior high school, and senior high school.
[Figure: Spatial data mining process using SAR-Kriging. Phases: preprocessing, data mining, postprocessing. Steps shown: SDPN 2003 database; data cleaning and transformation to ratios; external data (subdistrict coordinates); integration of spatial and non-spatial data; indicator selection using factor analysis and SEM; SAR model and Moran index; Expansion SAR model; Kriging computation; SAR-Kriging equations; comparison of actual and predicted data; evaluation and visualization; knowledge pattern.]
[Figure: Data mining process. Stages shown: SDPN 2003 database; data cleaning; data selection; data integration; data transformation; data mining; interpretation and visualization of results; knowledge.]
Scalability
Data size is 3.91 GB (4,178,499,369 bytes), structured as tables for SD (elementary), SMP (junior high), and SMA (senior high) schools.
Non-traditional Analysis
Involves location coordinates and maps of subdistricts (kecamatan), districts (kabupaten), and provinces in Indonesia. The analysis uses spatial models.
Data Ownership and Distribution
Geographically distributed across provinces, districts, subdistricts, and villages.
Heterogeneity and Complex Data
Involves non-spatial and spatial data. Non-spatial data: indicators of education quality. Spatial data: subdistrict coordinates.
High Dimensionality
The total number of records is 203,590 and the number of variables is 569.
SDPN 2003 DATABASE
School data: 257,660. Out-of-school education data: 3,047. Non-education data: 240. Higher education data: 13,202.
SCHOOL DATA
TK (kindergarten): 54,226 records. SD: 158,590 records. SMP: 28,949 records. SMA: 10,810 records. SMK (vocational): 4,753 records.
RESEARCH DATA (DATA SELECTION)
SD: 158,590 records with 122 variables. SMP: 28,949 records with 138 variables. SMA: 10,810 records with 142 variables.
SELECT Left(sd_sarana.id, 7) AS kdkec,
  Sum(jbkips_1 + jbkips_2 + jbkips_3 + jbkips_4 + jbkips_5 + jbkips_6
    + jbkPPKN_1 + jbkPPKN_2 + jbkPPKN_3 + jbkPPKN_4 + jbkPPKN_5 + jbkPPKN_6
    + jbkINDO_1 + jbkINDO_2 + jbkINDO_3 + jbkINDO_4 + jbkINDO_5 + jbkINDO_6
    + jbkMat_1 + jbkMat_2 + jbkMat_3 + jbkMat_4 + jbkMat_5 + jbkMat_6
    + jbkipa_1 + jbkipa_2 + jbkipa_3 + jbkipa_4 + jbkipa_5 + jbkipa_6)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSBKTS,
  Sum(Lbangun)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLBTS,
  Sum(Ltanah)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLTTS,
  Sum(jrng_baik) / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSRB,
  Sum(jprg_ppkn + jprg_indo + jprg_mat + jprg_ipa + jprg_ips)
  / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSPRGTK
FROM SD_Sarana INNER JOIN SD_SISWA ON SD_Sarana.ID = SD_SISWA.ID
GROUP BY Left(sd_sarana.id, 7);
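The aggregation pattern in the query above (group schools by the first seven characters of their ID, then divide one sum by another) can be sketched in Python. This is only an illustration for the RSRB indicator: the field names follow the query, but the school IDs and counts are made-up values, and the helper name is hypothetical.

```python
from collections import defaultdict

# Hypothetical school-level rows: field names follow the query, values are made up.
schools = [
    {"id": "0101010001", "jrng_baik": 4, "jrng_rr": 1, "jrng_rb": 1, "jrng_bm": 0},
    {"id": "0101010002", "jrng_baik": 2, "jrng_rr": 2, "jrng_rb": 0, "jrng_bm": 0},
    {"id": "0101020001", "jrng_baik": 6, "jrng_rr": 0, "jrng_rb": 0, "jrng_bm": 0},
]


def good_classroom_ratio_by_subdistrict(rows):
    """Return {kdkec: RSRB}, with RSRB = sum(good rooms) / sum(all rooms)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        kdkec = row["id"][:7]  # same grouping key as Left(id, 7) in the query
        good[kdkec] += row["jrng_baik"]
        total[kdkec] += (
            row["jrng_baik"] + row["jrng_rr"] + row["jrng_rb"] + row["jrng_bm"]
        )
    # Skip subdistricts with no rooms recorded to avoid dividing by zero.
    return {k: good[k] / total[k] for k in total if total[k] > 0}


rsrb = good_classroom_ratio_by_subdistrict(schools)
print(rsrb)
```

Note that the indicator is a ratio of sums over the subdistrict, not the mean of per-school ratios, so larger schools carry more weight.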
DATA TRANSFORMATION FROM VARIABLES TO INDICATORS

Stage                                              SD    SMP   SMA
Base data (variables)                              122   138   142
Base data transformed to indicators (query)         21    19    20
Indicators selected using factor analysis           14    16    14
Indicators selected using SEM                        7    10    13
RESEARCH INDICATORS OF EDUCATION QUALITY AT THE ELEMENTARY SCHOOL (SD) LEVEL
The figure groups the indicators as input, process, and quality (mutu):
RSTRB: ratio of students to classes
RSTGR: ratio of students to teachers
RSBR7: ratio of 7-year-old students to students
RSULGTJS: ratio of students repeating a grade to students
RSBKTS: ratio of books to students
RSLBTS: ratio of building area to students
RSLTTS: ratio of land area to students
TOTUAS: mean total UAS score
RSGTTG: ratio of permanent teachers to teachers
RSDFTK: ratio of applicants coming from kindergarten (TK) to applicants
RSUM712: ratio of students aged 7-12 to students
RSPTSTD: ratio of students dropping out to students
RSRB: ratio of classrooms in good condition to classrooms
RSPRGTK: ratio of teaching aids to classes
RSGLTG: ratio of teachers with a D2 diploma or higher to teachers
RSGATRB: ratio of religion teachers to study groups
TKTLLS: mean student pass rate
RSGKTG: ratio of class teachers to teachers
RSGINTROM: ratio of English teachers to study groups
RSB: ratio of new students to students
[Figure: SEM path diagram linking INPUT, PROCESS, and QUALITY. Fit statistics: Chi-Square = 32.88, df = 27, P-value = 0.20104, RMSEA = 0.023.]
DATA INTEGRATION
Link the subdistricts on the spatial map to the subdistricts surveyed in SDPN 2003.
Remove subdistricts that were not surveyed in SDPN 2003 by editing the spatial data.
Merge the non-spatial data with the selected spatial data in the spatial map table, subdistrict by subdistrict.
Run the MATLAB program using the appropriate method.
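The integration step can be pictured as an inner join on the subdistrict code. The following is a minimal Python sketch under assumed inputs, not the MATLAB workflow used in the study: the subdistrict codes, coordinates, indicator values, and the function name `integrate` are all hypothetical.

```python
# Hypothetical inputs, both keyed by subdistrict code (kdkec):
# coordinates from the map layer and indicators computed from the survey.
map_coords = {
    "0101010": (95.32, 5.55),
    "0101020": (95.40, 5.60),
    "0101030": (95.51, 5.48),  # on the map but not surveyed
}
survey_indicators = {
    "0101010": {"RSRB": 0.60},
    "0101020": {"RSRB": 1.00},
}


def integrate(coords, indicators):
    """Inner join on subdistrict code; unsurveyed subdistricts are dropped."""
    merged = {}
    for kdkec, (x, y) in coords.items():
        if kdkec in indicators:
            merged[kdkec] = {"x": x, "y": y, **indicators[kdkec]}
    return merged


dataset = integrate(map_coords, survey_indicators)
print(sorted(dataset))
```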
[Figure: Literature map for the spatial data mining process.]
SDPN 2003 database: Sihombing (2002); Nababan (2003).
Spatial models: Cliff and Ord (1975); Anselin (1988); Cressie (1993); Armstrong (1998); LeSage (1999); Lazarevic (2000); Lichstein et al. (2002); Sekhar et al. (2003); LeSage and Pace (2004); Van Beers and Kleijnen (2004); Celik et al. (2005); Bronnenberg (2005); Kanazaki et al. (2006); Kumar and Remadevi (2006); Bakkali and Amrani (2008); Lu et al. (2008); Zhao Lu et al. (2008).
Data mining: Koperski et al. (1997); Berry and Linoff (2000); Soukup and Davidson (2002); Giudici et al. (2003); Han and Kamber (2006); Tan et al. (2006); Olson and Shi (2007); Refaat (2007); Giannotti and Pedreschi (2008); Maimon and Rokach (2008).
[Figure: Spatial data mining framework. Variable selection: input, process, output; factor analysis and SEM. Description: Moran index and Moran plot. Prediction: Ordinary Kriging. Causal model: SAR model and Expansion SAR model. Combined model: SAR-Kriging.]
1.3 Variables of Research
In this research we use the SDPN 2003 database from Balitbang-Depdiknas (2003), in particular its basic variables and indicator variables. Basic variables are the variables in the raw data of individual schools. Indicator variables are variables derived from the basic variables. The basic variables cover school identity, student indicators, facility indicators, teacher indicators, and the total UAN score. From these indicators we build an input-process-output system of education quality: the input consists of the student indicators, the process consists of the facility and teacher indicators, and the output consists of the total UAN score and the mean pass rate. Indicators are selected using factor analysis and the Structural Equation Model (SEM).
Figure 1.1 Variables Reduction Process
[Figure: Result of reducing the research indicators of education quality at the elementary school (SD) level using the Structural Equation Model, grouped as input, process, and quality. Indicators shown: ratio of students to classes (RSTRB); ratio of 7-year-old students to students (RSBR7); ratio of new students to students; ratio of classrooms in good condition to classrooms (RSRB); ratio of teachers with a D2 diploma or higher to teachers (RSGLTG); mean total UAS score (TOTUAS); mean student pass rate (TKTLLS).]
Figure 1.1 shows the result of reducing the indicator variables that affect quality, using factor analysis and SEM. The input consists of 3 indicators: the ratio of students to classes, the ratio of 7-year-old students to first-grade students, and the ratio of new students to all students. The process consists of 2 indicators: the ratio of classrooms in good condition to all classrooms, and the ratio of qualified teachers to all teachers. The output consists of 2 indicators: the total UAN score and the pass rate. The UAN output indicator is analyzed with the Expansion SAR model.
2. Modeling in Spatial Data Mining
2.1 The Expansion SAR Model
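Like the SAR model, the Expansion SAR model measures spatial heterogeneity. Where the SAR model relies on neighborhood structure, the expansion model is a locally linear spatial model that measures heterogeneity through the coordinates of each location. This kind of spatial model was first introduced by Casetti (1972, 1992, in Anselin, 1988 and LeSage, 1999). Consider the following regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.1)$$

where $\beta_0$ and $\beta_1$ are regression coefficients and $x$ is the vector of observations of the independent variable. For the regression coefficient to reflect spatial heterogeneity across observation units, it is expanded in terms of a number of expansion variables, for example $z_1$ and $z_2$, such that:

$$\beta_1 = \delta_0 + \delta_1 z_1 + \delta_2 z_2 \qquad (2.2)$$

Substituting equation (2.2) into equation (2.1) gives:

$$y = \beta_0 + \delta_0 x + \delta_1 (z_1 x) + \delta_2 (z_2 x) + \varepsilon$$

In general, the Casetti model is formulated as follows:

$$y = X\beta + \varepsilon$$
$$\beta = ZJ\beta_0 \qquad (2.3)$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\quad
X = \begin{pmatrix} x_1' & 0 & \cdots & 0 \\ 0 & x_2' & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & x_n' \end{pmatrix},\quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{pmatrix},\quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},\quad
\beta_0 = \begin{pmatrix} \beta_x \\ \beta_y \end{pmatrix}$$

$$Z = \begin{pmatrix} Z_{x1} \otimes I_k & Z_{y1} \otimes I_k & 0 & \cdots \\ 0 & \ddots & \ddots & \\ \vdots & & Z_{xn} \otimes I_k & Z_{yn} \otimes I_k \end{pmatrix}$$

The parameters of the model are estimated by the least squares method. Given the parameter estimates, the estimates for individual points in space are obtained from the second equation of (2.3). The distance from the central observation is formulated as:

$$d_i = \sqrt{(z_{xi} - z_{xc})^2 + (z_{yi} - z_{yc})^2} \qquad (2.4)$$

so the Expansion SAR model can be written as:

$$y = \alpha + X\beta + XD\beta_0 + \varepsilon \qquad (2.5)$$

where $D = \mathrm{diag}(d_1, \ldots, d_n)$. In equation (2.5), the influence of the variables can be separated into a non-spatial part and a spatial part:

$$y = \alpha + \underbrace{X\beta}_{\text{non-spatial}} + \underbrace{XD\beta_0}_{\text{spatial}} + \varepsilon$$

The parameters $\beta$ and $\beta_0$ describe the marginal non-spatial and spatial influences. The effect of each independent variable on the dependent variable can also be shown graphically through the equations:

$$\gamma_{xi} = Z_{xi}\beta_x,\qquad \gamma_{yi} = Z_{yi}\beta_y,\qquad \gamma_{di} = D_i\beta_0 \qquad (2.6)$$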
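To make the distance expansion concrete, the sketch below fits a one-regressor version of the model, y = alpha + beta*x + beta0*(d*x), by least squares. It is a minimal illustration under stated assumptions rather than the study's MATLAB implementation: the coordinates, the data, and the "true" parameter values are invented, the data are noise-free so that the fit can be checked exactly, and the small linear solver is a hypothetical helper that only keeps the example dependency-free.

```python
import math


def distance_from_center(coords, center):
    """Euclidean distance of each location from the center location."""
    cx, cy = center
    return [math.hypot(zx - cx, zy - cy) for zx, zy in coords]


def solve_linear_system(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_expansion_model(x, y, d):
    """Least-squares fit of y = alpha + beta*x + beta0*(d*x) via normal equations."""
    rows = [[1.0, xi, di * xi] for xi, di in zip(x, d)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [alpha, beta, beta0]


# Hypothetical locations and regressor values.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0), (5.0, 0.0)]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
d = distance_from_center(coords, center=(0.0, 0.0))

# Noise-free responses generated from assumed parameters (2.0, 0.5, 0.1).
y = [2.0 + 0.5 * xi + 0.1 * di * xi for xi, di in zip(x, d)]

alpha, beta, beta0 = fit_expansion_model(x, y, d)

# Location-specific slope of y with respect to x: non-spatial plus spatial part.
effects = [beta + di * beta0 for di in d]
print(alpha, beta, beta0)
```

The list `effects` shows how the influence of x drifts with distance from the center, which is the heterogeneity the expansion terms are meant to capture.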
2.2 Ordinary Kriging Method
Kriging is a method of calculating estimates of a regionalized variable at a point, over an area, or within a volume, using the minimization of the estimation variance as its criterion. Kriging interpolation involves generating images of reservoir properties and is commonly used to visualize reservoir heterogeneities. However, Kriging techniques are not well suited to reproducing geological reservoir patterns where the number of data points is very limited. Using the Kriging technique, we can predict observations at unsampled locations (Armstrong, 1998).

Assume that the regionalized variable under study has values $Z_i = Z(x_i)$, each representing the value at a point $x_i$. Also assume that this regionalized variable is second-order stationary, with:

expectation: $E[Z(x)] = m$
covariance: $E[Z(x+h)\,Z(x)] - m^2 = C(h)$

A kriged estimator $Z_V^*$ is a linear combination of $n$ values of the regionalized variable:

$$Z_V^* = \sum_{i=1}^{n} \lambda_i Z_i \qquad (2.7)$$

For two locations, minimizing the Kriging variance (Armstrong, 1998) gives the weights:

$$\lambda_1 = \frac{1}{2} + \frac{\gamma_{2V} - \gamma_{1V}}{2\gamma_{12}},\qquad
\lambda_2 = \frac{1}{2} + \frac{\gamma_{1V} - \gamma_{2V}}{2\gamma_{12}} \qquad (2.8)$$

To obtain the values of $\lambda_1$ and $\lambda_2$ with the ordinary Kriging method, we need the values of $\gamma_{1V}$, $\gamma_{2V}$, and $\gamma_{12}$. The value $\gamma_{12}$ is the experimental semivariogram between the two sample points, $\gamma_{1V}$ is the semivariogram between the first sample point and the unsampled point to be predicted, and $\gamma_{2V}$ is the corresponding value for the second sample point.

For the case study we use the spherical semivariogram for two locations, with range $r$ and sill $c$:

$$\hat{\gamma}(h) = \begin{cases} c\left[\dfrac{3h}{2r} - \dfrac{h^3}{2r^3}\right], & h \le r \\ c, & h > r \end{cases} \qquad (2.9)$$
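The two-sample case can be worked through numerically. The sketch below assumes the standard spherical semivariogram (rising to a sill at the range and flat beyond it) and the closed-form two-sample ordinary Kriging weights that follow when the weights must sum to one and the semivariogram at zero distance is zero. The sill, range, distances, and sample values are hypothetical, chosen only for illustration.

```python
def spherical_semivariogram(h, sill, rng):
    """Spherical model: rises to `sill` at distance `rng` and is flat beyond it."""
    if h >= rng:
        return sill
    ratio = h / rng
    return sill * (1.5 * ratio - 0.5 * ratio ** 3)


def two_point_weights(gamma_1v, gamma_2v, gamma_12):
    """Ordinary Kriging weights for two samples; the weights sum to one."""
    lam1 = 0.5 + (gamma_2v - gamma_1v) / (2.0 * gamma_12)
    return lam1, 1.0 - lam1


# Hypothetical semivariogram parameters and distances.
sill, rng = 1.0, 10.0
g1v = spherical_semivariogram(2.0, sill, rng)  # sample 1 to target location
g2v = spherical_semivariogram(6.0, sill, rng)  # sample 2 to target location
g12 = spherical_semivariogram(8.0, sill, rng)  # sample 1 to sample 2

lam1, lam2 = two_point_weights(g1v, g2v, g12)

# Kriged estimate from two hypothetical sample values.
z_hat = lam1 * 26.0 + lam2 * 24.0
print(lam1, lam2, z_hat)
```

Because the target is closer to sample 1 than to sample 2, sample 1 receives the larger weight. When the two semivariogram values to the target are equal, both weights reduce to one half.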
2.3 SAR-Kriging Method
The SAR-Kriging method in this study combines the Expansion SAR model with the Kriging technique to predict the quality of education at unsampled locations. The stages of the SAR-Kriging model are as follows (Abdullah, 2009):
• Determine the dependent and independent variables of the Expansion SAR model, incorporating regional data through the distance between the center location and each observation location.
• Estimate the parameters of the Expansion SAR model with the Maximum Likelihood method.
• Determine the unsampled location between two sampled locations, its coordinates, and its distances to the sampled locations.
• Use the parameter estimates of the Expansion SAR model as input to the Kriging method to obtain the weights at the location where the quality of education is to be predicted.
• The Kriging weights yield the parameter estimates at the unsampled location.
• The parameter estimates obtained from the Kriging weights become the coefficients of the Expansion SAR model at the unsampled location.
• Because the Expansion SAR model is a model for cross-sectional data, the SAR-Kriging method can be used to predict the quality of education whenever the values of the independent variables are known.
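One way to read the last stages above is that each coefficient at the unsampled location is a Kriging-weighted combination of the coefficients estimated at the two sampled locations. The sketch below follows that reading, which is an interpretation rather than a confirmed detail of the method; the coefficient vectors, weights, and input values are hypothetical.

```python
def krige_coefficients(coef_1, coef_2, lam_1, lam_2):
    """Weighted combination of the coefficient vectors of two sampled locations."""
    return [lam_1 * a + lam_2 * b for a, b in zip(coef_1, coef_2)]


def predict(coefs, x):
    """coefs = [intercept, slope_1, ...]; x = [x_1, ...]."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Hypothetical Expansion SAR coefficients at the two sampled locations.
coef_location_1 = [20.0, 0.10, 4.0]
coef_location_2 = [30.0, 0.30, 2.0]

# Hypothetical Kriging weights for the unsampled location (they sum to one).
lam_1, lam_2 = 0.75, 0.25

coef_new = krige_coefficients(coef_location_1, coef_location_2, lam_1, lam_2)

# Prediction once the independent variables at the new location are known.
y_new = predict(coef_new, [30.0, 0.5])
print(coef_new, y_new)
```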
3. The Result of SAR-Kriging
In this paper, we implemented spatial data mining using the SAR-Kriging method to predict the quality of education in 13 provinces in Indonesia, including Aceh Province. Aceh was not included as a survey location in the 2003 national basic education survey (SDPN 2003), because conditions there were too dangerous. The SAR-Kriging method can therefore be used to predict the quality of education there.
For the SAR-Kriging method, we selected input, process, and quality data at the elementary, junior high, and senior high school levels from two provinces in Indonesia: Banten Province and South Sulawesi Province.
Figure 3.1 Maps of Provinces in Indonesia
Following the SAR-Kriging procedure, we have:
(1) The coordinates of the unsampled locations are selected for 13 provinces around Banten and South Sulawesi.
(2) The parameter estimates of the Expansion SAR model are obtained through the Kriging technique for the 13 new locations from their coordinates.
(3) The 13 locations are positioned between Banten and South Sulawesi Provinces.
(4) Using the Kriging weights from step 2, the Expansion SAR prediction model through Kriging is expressed for the quality of education at the 13 unsampled locations for elementary school.
From these results we obtain, through the SAR-Kriging method, a prediction model of the quality of elementary school education for the 13 locations between Banten and South Sulawesi. If the values of the education input and process variables and the coordinates of each location are known, the quality of education, measured by the total UAN score, can be predicted. The prediction models of the quality of education for the 13 locations between Banten and South Sulawesi are as follows:
Table 3.1 Prediction of Quality Education for Elementary School in Indonesia using SAR-Kriging
From Table 3.1 we can see that the quality of education in the 13 provinces is influenced by a non-spatial component with five variables and a spatial component in which the same five variables are combined with the distance d from the observation location to the center location. If we select Aceh Province among the locations between Banten and South Sulawesi, then, based on the SDPN 2003 data, we obtain the following Expansion SAR model:

Quality of Education at Aceh = 25.61 + 0.02 RSTRB + 5.88 RSB - 2.87 RSBR7 - 6.31 RSRB + 1.77 RSGLTG
  + 0.22 d×RSTRB - 7.81 d×RSB - 11.39 d×RSBR7 - 1.53 d×RSRB + 0.57 d×RSGLTG
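The Aceh model can be evaluated directly once the five indicators and the distance are known. The sketch below copies the coefficients from the equation above and assumes that each "d-" term is the product of the distance d and the corresponding indicator, as in the spatial part of the Expansion SAR model. The function name and the input values are hypothetical.

```python
# Coefficients copied from the Aceh model in the text.
INTERCEPT = 25.61
NON_SPATIAL = {"RSTRB": 0.02, "RSB": 5.88, "RSBR7": -2.87, "RSRB": -6.31, "RSGLTG": 1.77}
SPATIAL = {"RSTRB": 0.22, "RSB": -7.81, "RSBR7": -11.39, "RSRB": -1.53, "RSGLTG": 0.57}


def predict_quality_aceh(indicators, d):
    """Evaluate the model; d is the distance to the center location.

    Assumes each spatial term is coefficient * d * indicator.
    """
    non_spatial = sum(NON_SPATIAL[k] * indicators[k] for k in NON_SPATIAL)
    spatial = sum(SPATIAL[k] * d * indicators[k] for k in SPATIAL)
    return INTERCEPT + non_spatial + spatial


# Hypothetical indicator values, for illustration only.
example = {"RSTRB": 30.0, "RSB": 0.2, "RSBR7": 0.1, "RSRB": 0.6, "RSGLTG": 0.5}
print(predict_quality_aceh(example, d=0.0))  # non-spatial part only
print(predict_quality_aceh(example, d=0.1))  # with the spatial part
```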
To assess the prediction of the quality of education at the elementary, junior high, and senior high school levels in 13 provinces in Indonesia, we compare the actual values with the SAR-Kriging predictions as follows:
Table 3.2 Comparison of Actual and SAR-Kriging Predicted Quality of Education at Elementary School

NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        26.85       23.81    3.04    11.32
 2  JABAR      31.73       26.04    5.69    17.93
 3  JATENG     26.15       27.44   -1.29     4.93
 4  DIY        26.47       26.76   -0.29     1.10
 5  JATIM      26.83       28.19   -1.36     5.07
 6  ACEH       25.94       24.27    1.67     6.44
 7  SUMUT      24.22       24.54   -0.32     1.32
 8  SUMBAR     23.13       29.13   -6.00    25.94
 9  SULUT      24.95       25.96   -1.01     4.05
10  SULBAR     25.39       25.48   -0.09     0.35
11  KALBAR     24.09       24.10   -0.01     0.04
12  KALTENG    23.43       26.52   -3.09    13.19
13  KALTIM     23.57       26.68   -3.11    13.19
    MAPE                                     8.07
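The APE and MAPE columns can be recomputed from the actual and predicted values. The sketch below uses the elementary school figures from Table 3.2 and takes APE as the absolute error divided by the actual value, in percent, which is consistent with the tabulated values. The function names are illustrative.

```python
# (province, actual, prediction) from Table 3.2, elementary school.
table_3_2 = [
    ("DKI", 26.85, 23.81), ("JABAR", 31.73, 26.04), ("JATENG", 26.15, 27.44),
    ("DIY", 26.47, 26.76), ("JATIM", 26.83, 28.19), ("ACEH", 25.94, 24.27),
    ("SUMUT", 24.22, 24.54), ("SUMBAR", 23.13, 29.13), ("SULUT", 24.95, 25.96),
    ("SULBAR", 25.39, 25.48), ("KALBAR", 24.09, 24.10), ("KALTENG", 23.43, 26.52),
    ("KALTIM", 23.57, 26.68),
]


def ape(actual, predicted):
    """Absolute percentage error of one prediction."""
    return abs(actual - predicted) / actual * 100.0


def mape(rows):
    """Mean of the absolute percentage errors over all provinces."""
    return sum(ape(a, p) for _, a, p in rows) / len(rows)


for name, actual, predicted in table_3_2:
    print(f"{name:8s} {ape(actual, predicted):6.2f}")
print(f"MAPE     {mape(table_3_2):6.2f}")
```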
Table 3.3 Comparison of Actual and SAR-Kriging Predicted Quality of Education at Junior High School

NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        18.54       16.99    1.55     8.36
 2  JABAR      17.85       16.82    1.03     5.77
 3  JATENG     17.65       18.00   -0.35     1.98
 4  DIY        18.99       17.98    1.01     5.31
 5  JATIM      16.46       16.97   -0.51     3.10
 6  ACEH       14.47       15.23   -0.76     5.25
 7  SUMUT      18.53       15.11    3.42    18.46
 8  SUMBAR     19.20       16.57    2.63    13.69
 9  SULUT      14.13       17.30   -3.17    22.43
10  SULBAR     18.02       17.36    0.66     3.66
11  KALBAR     16.15       16.07    0.08     0.50
12  KALTENG    18.20       16.94    1.26     6.92
13  KALTIM     16.42       16.71   -0.29     1.77
    MAPE                                     7.48
Table 3.4 Comparison of Actual and SAR-Kriging Predicted Quality of Education at Senior High School

NO  PROVINCE  ACTUAL  PREDICTION   ERROR  APE (%)
 1  DKI        36.74       16.90   19.84    54.00
 2  JABAR      36.30       31.20    5.10    14.04
 3  JATENG     39.54       29.92    9.62    24.33
 4  DIY        40.30       29.25   11.05    27.43
 5  JATIM      45.34       29.55   15.79    34.82
 6  ACEH       17.16       28.93  -11.77    68.61
 7  SUMUT      31.90       38.66   -6.76    21.19
 8  SUMBAR     33.22       35.46   -2.24     6.73
 9  SULUT      45.48       38.54    6.94    15.26
10  SULBAR     20.78       37.17  -16.39    78.87
11  KALBAR     16.58       16.70   -0.12     0.72
12  KALTENG    39.09       37.96    1.13     2.89
13  KALTIM     25.48       33.33   -7.85    30.81
    MAPE                                    29.21
From the three tables above, we conclude that the Mean Absolute Percentage Error (MAPE) of the predicted quality of education in 13 provinces in Indonesia is below 10% for elementary school (8.07%) and junior high school (7.48%), but above 10% for senior high school (29.21%). This means that the SAR-Kriging method gives a good model for predicting the quality of education at unsampled locations for elementary and junior high schools.
4. Conclusion
1) The SAR-Kriging model is a spatial data mining tool that combines the Expansion SAR model and the Kriging method.
2) Applying the SAR-Kriging model to predict the quality of education at unsampled locations in Indonesia gave good results for elementary and junior high schools in 13 provinces located between the two selected provinces.
References
• Abdullah, A. S. 2009. Spatial Data Mining using SAR-Kriging Model (Spatial Autoregressive-Kriging) for Mapping Quality of Education in Indonesia. Unpublished dissertation. Yogyakarta: Universitas Gadjah Mada.
• Anselin, L. 1988. Spatial Econometrics: Methods and Models. London: Kluwer Academic Publishers.
• Armstrong, M. 1998. Basic Linear Geostatistics. New York: Springer-Verlag.
• Balitbang Depdiknas. 2003. Survei Dasar Pendidikan Nasional Tahun 2003. Jakarta.
• Han, J., and Kamber, M. 2006. Data Mining: Concepts and Techniques. USA: Academic Press.
• LeSage, J. P. 1999. The Theory and Practice of Spatial