Big Data and Machine Learning in Geology

(1)

Dr. Ir. Asep HP Kesumajana, MT

Teknik Geologi (Geological Engineering), FITB, ITB. GL3101: Komputasi Geologi (Geological Computing)


(2)

Definition (Wikipedia)
• “Big data” is a term for data sets so large and complex that conventional data-processing software cannot handle them.
• Challenges in handling big data include:
  • analysis
  • capture
  • data curation (selection, grouping, and maintenance)
  • search
  • sharing
  • storage
  • transfer
  • visualization
  • information privacy

(3)

Definition (Research Data Alliance)
• “Big data” is a term that describes a large volume of data:
  • structured or unstructured
  • obtained from day-to-day business activities
  • the amount of data is not what matters most
  • what matters is how the data are organized
• Big data can be analyzed for insights that lead to better decisions and strategic business moves.

(4)

3V Big Data (2001)
• Gartner interprets big data in terms of 3Vs (Laney, 2001):
  • Volume: the size of the data
    • terabytes
    • petabytes
    • zettabytes
  • Variety: the range of data types
    • structured
    • semi-structured
    • unstructured
  • Velocity: the rate at which data arrive
    • batch
    • real-time
    • stream
    • near real-time

(5)

Volume
Why has data volume become so large?
• Ever-growing data storage capacity
• Use of the internet for everything (IoT, Internet of Things):
  • a system of interrelated computing devices, both mechanical and digital
  • every device has a unique identifier (UID)
  • devices can send data over a network
  • devices can interact without human intervention
Example: traffic-congestion detection in Google Maps
  • uses spatiotemporal data from every device equipped with GPS
  • when GPS is enabled, the device anonymously sends signals to Google

https://upload.wikimedia.org/wikipedia/commons/7/7c/Hilbert_InfoGrowth.png

(6)

Variety
• The range of data types
  • Structured: data stored in a structured format
    • databases, RDBMS
  • Semi-structured: data stored in a format without a rigid schema
    • JSON (JavaScript Object Notation)
    • XML (eXtensible Markup Language)
    • RDF (Resource Description Framework)
  • Unstructured: data stored without a predefined model, usually as electronic documents
    • books, journals, documents
    • metadata, medical records
    • audio, video, images, photos, presentations
    • files, analog data, email text
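The difference between semi-structured and structured data can be sketched in a few lines. The example below is only an illustration: the sample records, field names, and the `to_rows` helper are hypothetical and do not come from the slides. It parses JSON records whose keys vary and flattens them into fixed-schema rows, the shape a relational table would expect.

```python
import json

# Hypothetical semi-structured records: the keys differ from record to record.
raw = """
[
  {"sample": "S-01", "lithology": "sandstone", "porosity": 0.21},
  {"sample": "S-02", "lithology": "shale", "notes": {"fossils": true}},
  {"sample": "S-03", "porosity": 0.08}
]
"""


def to_rows(json_text, columns):
    """Flatten JSON records into fixed-schema rows; missing fields become None."""
    records = json.loads(json_text)
    return [tuple(rec.get(col) for col in columns) for rec in records]


rows = to_rows(raw, ("sample", "lithology", "porosity"))
for row in rows:
    print(row)
```

Fields outside the chosen schema (here `notes`) are dropped, which is the price of forcing flexible data into a rigid table.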

(7)

Variety

| Property | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Technology | Relational database tables | XML/RDF/JSON | Character and binary data |
| Transaction management | Mature transactions and various concurrency techniques | Transactions adapted from DBMS, not mature | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole |
| Flexibility | Schema-dependent, less flexible | More flexible | Very flexible |
| Scalability | Very difficult | Simpler | Very easy |
| Robustness (resistance to errors) | Very robust | Not very robust | |
| Query performance | Queries allow joins | Queries over anonymous nodes are possible | Only textual queries are possible |

(8)

Velocity
• The rate at which data arrive
  • Real-time: data are processed almost at the moment they are input; any delay is on the order of milliseconds
    • Zoom presentations, phone calls, watching a live football match
    • CCTV monitoring
  • Near real-time: data are processed quickly in response to input; delays range from a few seconds to minutes (occasionally much longer)
    • SMS and WhatsApp messages, traffic in Google Maps
  • Streaming: data are processed continuously without pause; the result may or may not be real-time
    • YouTube videos
    • football matches, either live or as a delayed broadcast
  • Batch processing: large amounts of data are processed automatically without a user interface
    • generating electricity and telephone billing data
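The contrast between batch and stream processing can be illustrated with a simple average. This is a minimal sketch with hypothetical readings and function names, not a description of any particular platform: the batch version waits for the whole data set, while the streaming version updates its result as each value arrives and never stores the history.

```python
def batch_mean(values):
    """Batch processing: wait for the full data set, then compute once."""
    return sum(values) / len(values)


def streaming_mean(stream):
    """Stream processing: yield an updated mean after every new reading."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no stored history
        yield mean


readings = [2.0, 4.0, 6.0, 8.0]  # hypothetical sensor readings
running = list(streaming_mean(readings))
print(running)               # one result per incoming reading
print(batch_mean(readings))  # one result for the whole batch
```

Both approaches end at the same value; they differ in when a result becomes available.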

(9)

4V Big Data (IBM, 2012)
• Veracity
  • Uncertainty of the data:
    • quality
    • repetition (duplicates)
    • incompleteness
  • What is needed:
    • data cleansing
    • data quality improvement
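A data-cleansing step of the kind listed above might look like the following sketch. The records, field names, and the `clean` helper are hypothetical; real cleansing pipelines usually also handle unit errors, outliers, and near-duplicates.

```python
def clean(records, required):
    """Drop records with missing required fields and repeated records."""
    seen = set()
    kept = []
    for rec in records:
        if any(rec.get(field) is None for field in required):
            continue  # incomplete record
        key = tuple(rec[field] for field in required)
        if key in seen:
            continue  # repeats the same required-field values
        seen.add(key)
        kept.append(rec)
    return kept


# Hypothetical well measurements with one duplicate and one incomplete entry.
records = [
    {"well": "A-1", "depth": 1200, "porosity": 0.18},
    {"well": "A-1", "depth": 1200, "porosity": 0.18},
    {"well": "A-2", "depth": 900, "porosity": None},
    {"well": "A-3", "depth": 1500, "porosity": 0.11},
]
cleaned = clean(records, ("well", "depth", "porosity"))
print([rec["well"] for rec in cleaned])
```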

(10)

4V Big Data (Dunn and Coffee, 2013)
• Value
  • Finding value in the:
    • information
    • patterns
    • structure
  • hidden in the data, using methods such as:
    • statistics
    • hypotheses
    • correlation
    • modeling
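Correlation, one of the methods listed above, can be computed directly from its definition. The sketch below uses hypothetical depth and porosity values, chosen to be perfectly linear so that the result is easy to check by hand; the `pearson` helper is written out for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


depth = [500, 1000, 1500, 2000]      # hypothetical depths
porosity = [35.0, 30.0, 25.0, 20.0]  # hypothetical porosity, falling with depth
r = pearson(depth, porosity)
print(round(r, 6))
```

A coefficient close to -1 indicates a strong inverse relationship; it says nothing about cause and effect.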

(11)

5V Big Data (Perwej, 2017)
• Combines the two 4V formulations: Volume, Variety, Velocity, Veracity, and Value

(12)

6V Big Data
• Validity: correctness of the data
  • correct data
  • incorrect data
• Variability: changes in the data
  • consistent
  • inconsistent
• Viability: variables
  • selection
  • relevance
  • relationships
(Fouad et al., 2015; Rahman et al., 2016; Lněnička et al., 2017; Ristevski and Chen, 2018)

(13)

7V Big Data
• Visualization: how easily the data can be read
  • easy to read
  • hard to read
• Volatility: how long the data are stored
  • data storage is expensive
  • limits on how long data are kept
(Khan et al., 2014; Fernando, 2017)

(14)

42V Big Data (Shafer, 2017)

1. Vagueness: The meaning of found data is often very unclear, regardless of how much data is available.

2. Validity: Rigor in analysis (e.g., Target Shuffling) is essential for valid predictions.

3. Valor: In the face of big data, we must gamely tackle the big problems.

4. Value: Data science continues to provide ever-increasing value for users as more data becomes available and new techniques are developed.

5. Vane: Data science can aid decision making by pointing in the correct direction.

6. Vanilla: Even the simplest models, constructed with rigor, can provide value.

7. Vantage: Big data allows us a privileged view of complex systems.

8. Variability: Data science often models variable data sources. Models deployed into production can encounter especially wild data.

9.Variety: In data science, we work with many data formats (flat files, relational databases, graph networks) and varying levels of data completeness.

10.Varifocal: Big data and data science together allow us to see both the forest and the trees.

11. Varmint: As big data gets bigger, so can software bugs!

12. Varnish: How end-users interact with our work matters, and polish counts.

13. Vastness: With the advent of the Internet of Things (IoT), the "bigness" of big data is accelerating.

14. Vaticination: Predictive analytics provides the ability to forecast. (Of course, these forecasts can be more or less accurate depending on rigor and the complexity of the problem. The future is pesky and never conforms to our March Madness brackets.)

15. Vault: With many data science applications based on large and often sensitive data sets, data security is increasingly important.

16. Veer: With the rise of agile data science, we should be able to navigate the customer's needs and change directions quickly when called upon.

17. Veil: Data science provides the capability to peer behind the curtain and examine the effects of latent variables in the data.

18. Velocity: Not only is the volume of data ever increasing, but the rate of data generation (from the Internet of Things, social media, etc.) is increasing as well.

(15)

42V Big Data (Shafer, 2017)

19. Venue: Data science work takes place in different locations and under different arrangements: Locally, on customer workstations, and in the cloud.

20. Veracity: Reproducibility is essential for accurate analysis.

21. Verdict: As an increasing number of people are affected by models' decisions, Veracity and Validity become ever more important.

22. Versed: Data scientists often need to know a little about a great many things: mathematics, statistics, programming, databases, etc.

23. Version Control: You're using it, right?

24. Vet: Data science allows us to vet our assumptions, augmenting intuition with evidence.

25. Vexed: Some of the excitement around data science is based on its potential to shed light on large, complicated problems.

26. Viability: It is difficult to build robust models, and it's harder still to build systems that will be viable in production.

27. Vibrant: A thriving data science community is vital, and it provides insights, ideas, and support in all of our endeavors.

29. Viral: How does data spread among other users and applications?

30. Virtuosity: If data scientists need to know a little about many things, we should also grow to know a lot about one thing.

31. Viscosity: Related to Velocity; how difficult is the data to work with?

32. Visibility: Data science provides visibility into complex big data problems.

33. Visualization: Often the only way customers interact with models.

34. Vivify: Data science has the potential to animate all manner of decision making and business processes, from marketing to fraud detection.

35. Vocabulary: Data science provides a vocabulary for addressing a variety of problems. Different modeling approaches tackle different problem domains, and different validation techniques harden these approaches in different applications.

36. Vogue: "Machine Learning" becomes "Artificial Intelligence", which becomes...?

(16)

42V Big Data (Shafer, 2017)

37. Voice: Data science provides the ability to speak with knowledge (though not all knowledge, of course) on a diverse range of topics.

38. Volatility: Especially in production systems, one has to prepare for data volatility. Data that should "never" be missing suddenly disappears, numbers suddenly contain characters!

39. Volume: More people use data-collecting devices as more devices become internet-enabled. The volume of data is increasing at a staggering rate.

40. Voodoo: Data science and big data aren't voodoo, but how can we convince potential customers of data science's value to deliver results with real-world impact?

41. Voyage: May we always keep learning as we tackle the problems that data science provides.

42. Vulpine: Nate Silver would like you to be a fox, please.

(17)

Big Data Analysis
• Process:
  • collection
  • organization
• to obtain:
  • trends
  • patterns
  • correlations
  • information
• Four types of big data analysis:
  • descriptive
  • diagnostic
  • predictive
  • prescriptive

(18)

Big data analysis
1. Descriptive analysis
  • Explains what happened
  • Describes a situation
  • Produces reports and visualizations
2. Diagnostic analysis
  • Explains why it happened
  • Digs deeper to find the cause of an event
3. Predictive analysis (the most popular)
  • Estimates what will happen
  • Requires AI and machine learning
4. Prescriptive analysis
  • Recommends the best course of action to reach the desired goal
  • Requires very sophisticated machine learning

(19)

Big data in the earth sciences
• Chen et al. (2016), on the 21st century:
  • Big data science is becoming a new scientific paradigm
  • Mathematical geology and quantitative geoscience are entering the era of geological big data
  • Digital geology (mathematical geology and information technology)
    → forms a new platform for developing mathematical geology (the combination of geology and mathematics)

(20)

Big data in science
• Research produces
  • large accumulations of data
  • that conventional methods cannot handle
• Alternatives:
  • cloud computing
  • artificial intelligence
  • blockchain
• Big data is becoming a new strategic resource for humanity
• It drives a transformation of scientific methodology

(21)

Big data in science
• A new branch of science has emerged: “scientific big data” (“data science”)
• Properties of scientific big data:
  • non-reproducibility
  • high degree of uncertainty
  • high dimensionality
  • high complexity
• Characterized by:
  • data type
  • data volume
  • data acquisition
  • data analysis
• Big data → new challenges for data-processing techniques and methods

(22)

Big data in science
• The goal of big data research is to make use of data, with computers as the supporting tool
• Big data research advances by identifying correlations among data and is marked by decision-making based on high probability.

(23)

Big data in science
• Stages in the development of science:
  • the era of empirical science
  • the era of theoretical science
  • the era of information science
  • big data and artificial intelligence
• Traditional research methods:
  • the deductive method (from the general to the particular)
    • example: the theory of magmatic crystallization and separation
  • the inductive method (from the particular to the general)
    • example: basalt discrimination diagrams built with the inductive method (Zhang et al., 2018)

(24)

Big data in science
“Theory-driven” model:
• Data interpretation is guided by theory
• The model is based on theory
• Requires:
  • a good theory
  • accurate data
  • clear causality (cause and effect)
• The theory must be clear and able to explain the relationships in the data
• Research focus:
  • causality (cause and effect)
• Subjectivity often has an influence
“Data-driven” model:
• Uses big data methods
• The model is driven by data
• Data collection emphasizes:
  • all the data rather than a sample
  • efficiency rather than exactness
  • correlation rather than causality (cause and effect)
• Has no prerequisites
  → goes beyond the boundaries of scientific research
• Research focus:
  • correlation
• No subjectivity (Zhang et al., 2018)

(25)

Big data in science
• Characteristics of geological data:
  • diversity
  • multidimensionality
  • multi-source availability
  • correlation
  • randomness
  • uncertainty
  • temporal and spatial inhomogeneity
• Big data → opportunities and challenges in geology
• The “data-driven” model → a new perspective in geological research (Zhai et al., 2018)
  • “Machine learning” is the core of “artificial intelligence”
    • it gives computers basic intelligence
  • “Deep learning” is a part of “machine learning”
    • the most commonly used algorithm is the “convolutional neural network” (Zhou et al.,

(26)

Data-driven example in the earth sciences: petroleum geochemistry
If the rock sample is already mature:
• Measured TOC → the remainder of the initial TOC → part has already been converted to hydrocarbons
• Measured HI → the remainder of the initial HI → part has already been converted to hydrocarbons
(Chen and Jiang, 2015)
Figure panels: Nordegg shale, Yeomen shale, Aklak shale

(27)

Data-driven example in the earth sciences: petroleum geochemistry
(Chen et al., 2016)

(28)

Data-driven example in the earth sciences: compaction curve

(29)

Compaction curve parameters of Central Sumatra stratigraphic units

Linear equation: f = m - cZ
  f = porosity
  m = porosity at the depositional interface
  c = compaction factor
  Z = depth

Hyperbolic equation: f = a/(b + Z^c) + Z^d/h
  f = porosity
  Z = depth
  a = 75 * b
  b = (38 * 1600^c)/37
  h = constant

Power-law equation: f = a + bZ^c
  f = porosity
  Z = depth

| Segment no. | Formation | Top sequence boundary | Base sequence boundary | Linear c | Linear m | Hyperbolic a | Hyperbolic b | Hyperbolic c | Hyperbolic d | Hyperbolic h | Power law a | Power law b | Power law c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PETANI | 0 | 15.5 | 0.003042 | 42.9334 | 1473.3 | 19.644 | 0.4 | 1.15 | -10000 | 74.79 | -4.38 | 0.2810 |
| 2 | TELISA + SIHAPAS | 15.5 | 25.5 | 0.006489 | 48.4384 | 2357280 | 31430 | 1.4 | 1.1 | -35000 | 77.27 | -3.05 | 0.349 |
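The three compaction models can be evaluated with the tabulated Petani parameters, as in the sketch below. Two points are assumptions rather than facts from the slide: the depth unit is not stated, so Z is treated as a plain number, and the hyperbolic form f = a/(b + Z^c) + Z^d/h is an assumed reading of the slide's equation, chosen because it is consistent with the listed relations a = 75 * b and b = (38 * 1600^c)/37. The function names are hypothetical.

```python
def porosity_linear(z, m, c):
    """Linear model: f = m - c*Z."""
    return m - c * z


def porosity_power(z, a, b, c):
    """Power-law model: f = a + b*Z^c."""
    return a + b * z ** c


def porosity_hyperbolic(z, a, b, c, d, h):
    """Assumed hyperbolic model: f = a/(b + Z^c) + Z^d/h."""
    return a / (b + z ** c) + z ** d / h


# Petani parameters copied from the table; the depth unit is not given there.
PETANI = {
    "linear": {"m": 42.9334, "c": 0.003042},
    "hyperbolic": {"a": 1473.3, "b": 19.644, "c": 0.4, "d": 1.15, "h": -10000},
    "power": {"a": 74.79, "b": -4.38, "c": 0.2810},
}

for depth in (0, 500, 1000, 2000):
    lin = porosity_linear(depth, **PETANI["linear"])
    hyp = porosity_hyperbolic(depth, **PETANI["hyperbolic"])
    pwr = porosity_power(depth, **PETANI["power"])
    print(f"Z={depth:5d}  linear={lin:6.2f}  hyperbolic={hyp:6.2f}  power={pwr:6.2f}")
```

With a = 75 * b, the assumed hyperbolic form gives a porosity of 75 at Z = 0, which is one way to sanity-check the reading against the parameter definitions.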

(30)

Data-driven example in the earth sciences: oil classification

GC data
• Pristane/Phytane
• Pristane/n-C17
• Phytane/n-C18
• nC27/nC17
• nC31/nC19

Bulk properties
• Saturate (fraction)
• Aromatics (fraction)
• Polars (fraction) (NSO)
• Alkanes (fraction) (ASPL)

Steranes/Hopanes
• ααα C27/C29 Steranes
• Steranes/Hopanes

Triterpanes
• C19/C23 Tricyclic
• C26/C25 Tricyclic
• Tm/Ts
• C29/C30 Hopane
• C30 Mor./C30 Hop
• Oleananes/C30 Hopane
• Gammacerane

Objective
To examine geochemical parameters that can be used to distinguish between Lower Cibulakan and Jatibarang oils.

Method
Multivariate analysis using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering with the Euclidean similarity index.
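UPGMA clustering with Euclidean distance can be sketched in plain Python as follows. The four "oils" and their two ratio values are hypothetical and are not data from the study; the `upgma` helper is an illustrative implementation of average linkage, where the distance between two clusters is the mean of all pairwise distances between their members. In practice SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` performs the same kind of clustering.

```python
from math import dist  # Euclidean distance, Python 3.8+


def upgma(points):
    """Average-linkage clustering; returns [(members_a, members_b, distance), ...]."""
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices

    def avg_dist(a, b):
        # Unweighted mean of all pairwise distances between the two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        ids = sorted(clusters)
        candidates = [(p, q) for k, p in enumerate(ids) for q in ids[k + 1:]]
        a, b = min(candidates, key=lambda pair: avg_dist(*pair))
        merges.append((sorted(clusters[a]), sorted(clusters[b]), avg_dist(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical oils described by two ratios, e.g. (Pristane/Phytane, Pristane/n-C17).
oils = [(3.0, 1.0), (3.2, 1.0), (1.0, 0.5), (1.5, 0.5)]
merges = upgma(oils)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The merge order is the dendrogram read from the leaves upward: the two most similar oils join first, and the two families join last.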

(31)

Machine learning
• Machine learning (ML) is a part of artificial intelligence (AI) that focuses on:
  • algorithms and methods for extracting patterns from a data set
  • using those patterns for:
    • classification
    • prediction
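The idea of learning a pattern from data and then using it for classification can be shown with a nearest-centroid classifier, one of the simplest possible methods. Everything in this sketch is hypothetical: the two features (loosely, a gamma-ray reading and a bulk density), the labels, and the helper names are illustrations, not the workflow used in the course examples.

```python
from math import dist


def fit_centroids(samples, labels):
    """Learn one mean vector (centroid) per class from labeled samples."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


# Hypothetical training data: (gamma-ray reading, bulk density) per sample.
train_x = [(30.0, 2.30), (40.0, 2.35), (110.0, 2.55), (120.0, 2.60)]
train_y = ["sand", "sand", "shale", "shale"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, (45.0, 2.33)))
print(predict(centroids, (100.0, 2.50)))
```

Because the two features have very different ranges, a real application would scale them first so that one feature does not dominate the distance.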

(32)

Example in the earth sciences: Stratigraphy and Sedimentology

Example: GEA-1 well, determination of lithology and lithological boundaries

*source data: 20 wells in South Sumatra Basin

(33)

Example in the earth sciences: Petrophysics

Determining the parameters for porosity calculation (shaly sand) with big data analysis

Formations: Baturaja Fm., Talangakar Fm. (Ginger & Fielding, 2005)

[Figure: frequency plot; labels: RhoMatrix, Water, Wet Clay]

Values shown on the plot: RhoMa 2.683, Rho Clay 2.607, NeuClay 0.333

*source data: ~70 wells in South Sumatra Basin

(34)

Example in the earth sciences: Petrology
• Petrology (Petrelli and Perugini, 2016):
  • Determining the tectonic setting in which volcanic rocks formed, using rock geochemistry and isotope data
  • Geochemical signature of major elements:
    • SiO₂, TiO₂, Al₂O₃, Fe₂O₃T, CaO, MgO, Na₂O, K₂O
  • Selected trace elements:
    • Sr, Ba, Rb, Zr, Nb, La, Ce, Nd, Hf, Sm, Gd, Y, Yb, Lu, Ta, Th
  • Isotopes:
    • ²⁰⁶Pb/²⁰⁴Pb, ²⁰⁷Pb/²⁰⁴Pb, ²⁰⁸Pb/²⁰⁴Pb, ⁸⁷Sr/⁸⁶Sr and ¹⁴³Nd/¹⁴⁴Nd
  • Data (open-access, comprehensive petrological databases):
    • PetDB https://search.earthchem.org/
    • GEOROC http://georoc.mpch-mainz.gwdg.de/georoc/
  • Agreement between rock geochemical composition and tectonic setting averages 93%.
    • Lowest for volcanic rocks from back-arc basins (65%).
    • Highest for volcanic rocks from oceanic islands (99%).

(35)

Example in the earth sciences: Petrology
• Method used:
  • Support Vector Machines (SVM) (Cortes and Vapnik, 1995)
  • Samples are split into two parts:
    • samples already categorized, used as “training examples”, and
    • uncategorized samples, which are then classified based on the results of training
  • Advantages (Cortes and Vapnik, 1995; Yu et al., 2005):
    • SVMs are effective in high dimensional spaces;
    • SVMs can model complex, real-world problems;
    • SVMs perform well on datasets with many attributes
  • Implemented as discriminant analysis with the scikit-learn module (Python)
  • Linear and non-linear kernel (Radial Basis Function, RBF) methods for classifying the data
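The difference between a linear and an RBF kernel can be seen by computing both for a pair of points. This sketch only evaluates the kernel functions themselves; it is not an SVM and not the study's code, and the sample vectors and the `gamma` value are hypothetical. In scikit-learn the corresponding choices are `SVC(kernel="linear")` and `SVC(kernel="rbf")`.

```python
from math import exp


def linear_kernel(x, y):
    """Linear kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


# Hypothetical, already-scaled feature vectors for two rock samples.
sample_a = (1.0, 2.0)
sample_b = (3.0, 4.0)

print(linear_kernel(sample_a, sample_b))
print(rbf_kernel(sample_a, sample_a))             # identical points give 1.0
print(rbf_kernel(sample_a, sample_b, gamma=0.1))  # similarity decays with distance
```

The RBF kernel measures similarity by distance rather than by direction, which is what lets an SVM draw curved class boundaries.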

(36)

Example in the earth sciences: Petrology

https://arxiv.org/ftp/arxiv/papers/1706/1706.10108.pdf

(37)

Example in the earth sciences: Petrology

(38)

Example in the earth sciences: Petrology

Transforming the data to a Gaussian distribution

Transforming the data to dimensionless form
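One common way to make data dimensionless is z-score standardization: subtract the mean and divide by the standard deviation. The sketch below assumes that this is the kind of scaling meant; the slide does not name the method, and the SiO₂ values are hypothetical. Standardization alone does not make a distribution Gaussian, so the Gaussian transformation step is not shown here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: the result has zero mean, unit variance, and no units."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


sio2 = [48.0, 50.0, 52.0, 54.0]  # hypothetical SiO2 contents in wt%
z = standardize(sio2)
print([round(v, 3) for v in z])
```

After scaling, major elements in wt% and trace elements in ppm sit on comparable scales, which matters for distance-based methods such as SVMs.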

(39)

References
• Research Data Alliance, Big Data - Definition, Importance, Examples & Tools, source: https://www.rd-alliance.org/group/big-data-ig-data-development-ig/wiki/big-data-definition-importance-examples-tools, accessed 20-10-2019
• Looi Consulting, The Evolution of Data, source: https://www.looiconsulting.com/home/enterprise-big-data/, accessed 9-07-2020
• Brindle, Beth, (n.d.), How Does Google Maps Predict Traffic?, https://electronics.howstuffworks.com/how-does-google-maps-predict-traffic.htm, accessed 7-08-2020
• 3V:
  • Laney, Douglas, 2001, 3D Data Management: Controlling Data Volume, Velocity and Variety, Application Delivery Strategies, Meta Group, 6 Feb 2001, pp. 1-4, downloaded from https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf, accessed 20-10-2019
  • Spacey, John, (2017), 5 types of data velocity, https://simplicable.com/new/data-velocity, accessed 7-08-2020
  • Kaye, Jonathan, (n.d.), Real time vs. streaming: a short explanation, https://sqlstream.com/real-time-vs-streaming-a-short-explanation/, accessed 7-08-2020
  • Vishwakarma, Ashish, (2019), Difference between Structured, Semi-structured and Unstructured data, https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/, accessed 7-08-2020
• 4V:
  • IBM, 2012, The Four V's of Big Data, https://www.ibmbigdatahub.com/infographic/four-vs-big-data, accessed 5 August 2020
  • Rossi, Alessio, (2017), Predictive models in sport science: multi-dimensional analysis of football training and injury prediction, PhD thesis, Scuola Di Scienze Motorie, Universita Degli Studi di Milano
• 5V: Perwej, Yusuf, (2017), An Experiential Study of the Big Data, ITECES Vol. 4, No. 1, 14-25
• 6V:
  • Rahman, Hamidur, Begum, Shahina, and Ahmed, Mobyen, (2016), Ins and Outs of Big Data: A Review, Internet of Things Technologies for HealthCare: 3rd International Conference, HealthyIoT 2016, Västerås, Sweden, October 18-19, 2016 (10.1007/978-3-319-51234-1_7)
  • Ristevski, Blagoj & Chen, Ming, (2018), Big Data Analytics in Medicine and Healthcare, Journal of Integrative Bioinformatics, 15 (10.1515/jib-2017-0030)
  • Fouad, Mohamed & Oweis, Nour & Gaber, Tarek & Ahmed, Maamoun & Snasel, Vaclav, (2015), Data Mining and Fusion Techniques for WSNs as a Source of the Big Data, Procedia Computer Science, 65 (10.1016/j.procs.2015.09.023)
  • Lnenicka, Martin, Máchová, Renáta, Komárková, Jitka, and Cermáková, Ivana, (2017), Components of Big Data Analytics for Strategic Management of Enterprise Architecture, Conference: 12th International Conference on Strategic Management and its Support by Information Systems 2017
• 7V:
  • https://impact.com/marketing-intelligence/7-vs-big-data/
  • https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/
  • Fernando, Lahiru, 2017, 7 V's of Big Data, posted 17 January 2017 at https://bbvaopen4u.com/en/actualidad/seven-vs-big-data, accessed 5 August 2020
  • Khan, M. Ali-ud-din, Uddin, Muhammad Fahim, Gupta, Navarun, (2014), IEEE Seven V’s of Big Data Understanding Big Data to extract Value, Proceedings of 2014 Zone 1

(40)

References (continued)
• Chen, Jianping & Xiang, Jie & Hu, Qiao & Yang, Wei & Lai, Zili & Hu, Bin & Wei, Wei, (2016), Quantitative Geoscience and Geological Big Data Development: A Review, Acta Geologica Sinica - English Edition, 90, 1490-1515, 10.1111/1755-6724.12782
• Zhang, Qi & Liu, Xuelong, (2019), Big data: new methods and ideas in geological scientific research, Big Earth Data, 3:1, 1-7, DOI: 10.1080/20964471.2018.1564478
• Riahi, Youssra, (2018), Big Data and Big Data Analytics: Concepts, Types and Technologies, International Journal of Research and Engineering, Vol. 5, No. 9, pp. 524-528, 10.21276/ijre.2018.5.9.5