By
Sand Frans Cisco Nainggolan 2-2015-110
MASTER’S DEGREE in
INFORMATION TECHNOLOGY
FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY
SWISS GERMAN UNIVERSITY EduTown BSD City
Tangerang 15339 Indonesia
August 2016
STATEMENT BY THE AUTHOR
I hereby declare that this submission is my own work and to the best of my knowledge, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at any educational institution, except where due acknowledgement is made in the thesis.
Sand Frans Cisco Nainggolan
_____________________________________________
Student Date
Approved by:
Dr. Adhiguna Mahendra
_____________________________________________
Thesis Advisor Date
Charles Lim, MSc.
_____________________________________________
Thesis Co-Advisor Date
Dr. Ir. Gembong Baskoro, M.Sc.
_____________________________________________
Dean
Date
ABSTRACT
CLASSIFICATION ANOMALOUS DNS TRAFFIC AT THE INTERNET SERVICE PROVIDER
By
Sand Frans Cisco Nainggolan
Dr. Adhiguna Mahendra, Advisor
Charles Lim, MSc., Co-Advisor
SWISS GERMAN UNIVERSITY
Internet usage in Indonesia has grown rapidly, as shown by the number of Internet users, and the Internet has become something people depend on. However, users are often unaware when their environment has been compromised by something harmful. One component involved is the Domain Name System (DNS), which also involves the Internet Service Provider (ISP). DNS helps people by translating domain names into IP addresses, which are harder to remember than human-readable names for websites and online services. However, public DNS records change constantly, in some cases every few minutes. Attackers can abuse this condition to launch attacks or pose active threats on the Internet, either through online criminal activity or by exploiting name server vulnerabilities caused by software bugs or misconfiguration. Therefore, in this research we propose a mechanism that automatically extracts significant DNS features in order to analyse whether traffic is normal or anomalous.
Real data from PT. XYZ, an ISP, was used in this research and classified by its DNS features. The significant features found by this approach indicate the necessary action for each anomaly: although the analysis is passive, it prompts the responsible parties to keep the system functioning properly and performing well, while the classification is validated with machine learning algorithms. In the sample data, the system found 4.35% of Query traffic without a Response, DNS response rejections of about 7.57% as Non-Existent Domain (and 2.8% as Refused), and many unknown TLDs (Top Level Domains). The classification system achieved over 98% accuracy.
This research also offers insight into the internal workings of some malware activity and of name server vulnerabilities.
Keywords: Anomaly, Domain Name System, Static Features, Passive Analysis, Classification.
© Copyright 2016
By Sand Frans Cisco Nainggolan
All rights reserved
DEDICATION
I dedicate this research to my lovely wife and my lovely mom, to the company that runs the Domain Name System, and to my country: Indonesia.
ACKNOWLEDGEMENTS
First of all, I thank Almighty God, Jesus Christ, for all His grace and favour, which He has poured out upon me as health, ability, capacity and joy.
There are people who supported me during the creation of this thesis.
I would like to thank my lovely family (my wife, my kids, my mom, my brother, my sisters) and my DATE team.
I would like to thank my thesis advisor, Pak Adhiguna, and co-advisor, Pak Charles, for their valuable input during the writing of this thesis. I especially thank Pak Charles for his great effort in guiding me.
TABLE OF CONTENTS
STATEMENT BY THE AUTHOR
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SCRIPTS
CHAPTER 1 - INTRODUCTION
1.1. Background
1.2. Research Problems
1.3. Research Objectives
1.4. Significance of Study
1.5. Scope of Study
1.6. Research Questions
1.7. Hypothesis
1.8. Thesis Structure
CHAPTER 2 - LITERATURE REVIEW
2.1. Internet and Domain Name System
2.1.1. IP Address
2.1.2. URL (Uniform Resource Locator)
2.2. DNS
2.2.1. Domain Name Space
2.2.2. Domain Name Registration
2.2.3. Domain Name Resolution
2.2.3.1. Name Servers
2.2.3.2. Name Resolvers
2.2.4. DNS Message Packet
2.3. Anomaly Traffic
2.3.1. Anomaly Taxonomy
2.3.2. Anomaly Detection Techniques
2.3.2.1. Statistical anomaly
2.3.2.1.1. Operational Model or Threshold Metric
2.3.2.1.2. Average and Standard Deviation
2.3.2.1.3. Multivariate
2.3.2.1.4. The Markovian
2.3.2.1.5. Time Series
2.3.2.1.6. Heaps' Law
2.3.2.1.7. Histogram
2.3.2.2. Data mining based approach
2.3.2.2.1. Classification
2.3.2.2.2. Clustering
2.3.2.2.3. Association Rule
2.3.2.3. Machine learning based detection technique
2.3.2.3.1. Neural Networks
2.3.2.3.2. Fuzzy Logic Approach
2.3.2.3.3. Support Vector Machine
2.3.2.4. Knowledge based detection technique
2.3.2.4.1. State Transition Analysis
2.3.2.4.2. Expert System
2.3.2.4.3. Signature Analysis
2.3.3. Output of Anomaly Detection
2.3.3.1. Labels
2.3.3.2. Scores
2.3.3.3. Receiver Operating Characteristics (ROC)
2.4. Anomaly DNS Traffic
2.4.1. Detection Model in DNS
2.4.1.1. Detecting Hidden Anomalies in DNS Communication
2.4.1.2. Confirmation, Diagnosis and Remediation (CDR)
2.4.1.2.1. Anomaly Taxonomy Workflow
2.4.1.2.2. Workflow Requirements
2.4.1.2.3. Anomaly Confirmation Workflow
2.4.1.3. Passive Monitoring
2.4.1.3.1. Passive Monitoring DNS Anomalies
2.4.1.3.2. The DNSPacketlizer Tool
2.4.1.3.3. Fingerprinting Internet DNS Amplification in DDoS Activities
2.4.1.3.4. Passive DNS Replication
2.4.1.3.5. EXPOSURE a Passive DNS Analysis
2.5. Data Mining for Classification
2.5.1. Naïve Bayes
2.5.2. Decision Tree
2.5.3. Random Forest
2.5.4. SVM (Support Vector Machine)
2.5.5. Out-of-Core Processing - Scikit Learn
2.5.5.1. SGDClassifier
2.5.5.2. Perceptron
2.5.5.3. Passive Aggressive Classifier
2.6. CRISP-DM
2.6.1. Business understanding
2.6.2. Data Understanding
2.6.3. Data Preparation
2.6.4. Modelling
2.6.5. Evaluation
2.6.6. Deployment
2.7. Related Works
2.8. Theoretical Framework
2.9. Summary
CHAPTER 3 - METHODOLOGY
3.1. Overview
3.2. Framework of the Methodology
3.3. General System Architecture
3.4. Step by step methodology
3.4.1. Data Collection
3.4.1.1. Data Sampling
3.4.1.2. DNS Data Collecting
3.4.2. Data Preparation
3.4.3. Data Modelling
3.4.3.1. Feature Attribution
3.4.3.2. Feature Analysis
3.4.4. Evaluation
CHAPTER 4 - EXPERIMENT RESULTS
4.1. Environment Setup
4.1.1. Client Specification
4.1.2. Server Specification
4.1.3. Tapper Specification
4.1.4. Job Scheduler for Automatic Processing
4.2. Data Collection
4.2.1. Data Collection Timeframe
4.2.2. Process
4.2.3. Result
4.3. Data Preparation
4.3.1. Data Processing Steps
4.3.1.1. Processing data using SFCNPcapDNS
4.3.1.2. Processing data using EditCap (Wireshark)
4.3.1.3. Processing data using Tshark (Wireshark)
4.3.1.4. Processing data using SQLCMD (SQL Server 2012)
4.3.1.5. Repeating Process for Automatic Task Processing
4.3.1.6. Processing data for Classification
4.4. Data Modelling
4.4.1. Feature Attribution
4.4.2. Analysis Classifier
4.4.2.1. Static Data based on Volume Traffic of Common Features
4.4.2.1.1. Query Traffic (QR=0) & Response Traffic (QR=1)
4.4.2.1.2. Record Resource Type
4.4.2.1.3. Return Code
4.4.2.1.4. Domain Name
4.4.2.1.5. DNS Protocol
4.4.2.2. Static Data based on Volume Traffic of Defined Combination Features
4.4.2.2.1. Transaction ID Mismatch and Different Domain Name (Class001)
4.4.2.2.2. Query Type ANY and Recursive Flag (Class002)
4.4.2.2.3. Query Type TXT / NS and On Response Type and Domain Name Different / Long Domain Name / Label (Class003)
4.4.2.2.4. Invalid Format of Naming Convention (Class004)
4.4.2.2.5. Long Domain Name (>255 Chars) (Class005)
4.4.2.2.6. Blank Query Name / Domain Name Different on Response Record (Class006)
4.4.2.2.7. Return Code 3 (Non-Existent Domain) and Round Trip > 2s (Class007)
4.4.2.2.8. Resource Record without Response (Class008)
4.4.2.2.9. Return Code 2 (Server Fail) / 5 (Refused) and Round Trip > 2s (Class009)
4.4.2.2.10. Recursive Flag and Round Trip > 2s (Class010)
4.4.2.2.11. Return Code 2 (Server Fail), 3 (Non-Existent Domain), 5 (Refused) (Class011)
4.4.2.2.12. Round Trip 2~30s (Class012a)
4.4.2.2.13. Round Trip >30s (Class012b)
4.4.2.2.14. Query Type AAAA and Round Trip 2~30s (Class013a)
4.4.2.2.15. Query Type AAAA and Round Trip >30s (Class013b)
4.4.2.2.16. Query Type AAAA (IPv6) but Response in IPv4 (Class014)
4.4.2.2.17. Time To Live (TTL) of SOA: Zero / Negative Value (Class015)
4.4.2.2.18. UDP Protocol and Truncated and Packet Size > 512 Bytes (Class016)
4.4.2.2.19. UDP Protocol and Query Type not AAAA / DNSKEY and Packet Size > 512 Bytes (Class017)
4.4.2.2.20. DNS Flag not Query or Response (Class018)
4.4.2.2.21. Operation Code is not Query, Inverse Query and Status (Class019)
4.4.2.2.22. Undefined Resource Record Type (Class020)
4.4.2.2.23. Resource Record Experimental (MB, MG, MR, NULL) (Class021)
4.4.2.2.24. TCP Protocol but Resource Record is not ZXFR (Class022)
4.4.2.2.25. TCP Protocol and SYN Flag (Class023)
4.4.2.2.26. UDP Protocol and Client Port < 49152 (Class024)
4.4.3. Score
4.5. Evaluation
CHAPTER 5 - CONCLUSION
5.1. Contribution
5.2. Limitation
5.3. Recommendation
5.3.1. People
5.3.2. Process
5.3.3. Technology
5.4. Future Works
5.4.1. Feature reduction
5.4.2. Dynamic method Analysis
GLOSSARY
REFERENCES
APPENDICES
Appendix 1. DNS Message Header
Appendix 2. DNS Resource Record (RR) Type
Appendix 3. DNS Top Level Domain (TLD)
Appendix 4. SQL Script to create the Table for importing DNS data purposes
Appendix 5. SQL Script of Flagging raw data of DNS as “Normal” or “Anomaly”
Appendix 6. VB Script to manage automatic process
Appendix 7. Scikit Learn Script to calculate the accuracy
Appendix 8. Average and Standard Deviation of RRType Traffic
Appendix 9. Average and Standard Deviation of Return Code, Domain Name, DNS Protocol
Appendix 10. Average and Standard Deviation of Definition Combination of Features
Appendix 11. Summary of File Collection
CURRICULUM VITAE