

UDC 004.9

IRSTI 81.93.29

G.U. Mamatova1, A.A. Balgabek2, Zh.M. Bekaulova2, G.A. Tolganbayeva2, R.D. Omarova3

1Al-Farabi Kazakh National University, Almaty, Kazakhstan,

2International Information Technology University, Almaty, Kazakhstan,

3Civil Aviation Academy, Almaty, Kazakhstan, [email protected]

USING DATA MINING ALGORITHMS FOR SOLVING TASKS IN THE BANKING SECTOR

Annotation: This paper considers a bank and the operations conducted in it. Today, large domestic banks need to automate the work of their managers. As an example, the task "Is it worth issuing a loan to a client?" was solved, and the flow of clients was predicted using Data Mining algorithms. The objects of the study also include other departments of the bank, for example, finance and staff.

Keywords: Data Mining, decision trees, clustering algorithms, selection, tasks.



Introduction

Humankind has been searching for hidden patterns in mountains of information for many centuries. With the development of information technologies, databases, and local and global networks, global flows of information in various areas have descended upon people. Mountains of collected information keep appearing, and the idea that these mountains are full of gold is taking hold more and more. The term Data Mining derives its name from two concepts: the search for valuable information in large databases and the mining of ore. Both processes require sifting through a huge amount of "raw" material in a reasoned search for the desired values. Thus, Data Mining is the process of discovering in raw data previously unknown, non-trivial, practically useful and interpretable knowledge needed for decision-making in various spheres of human activity. [1]

Today, data analysis alone is not enough if it cannot also offer a framework for modeling and forecasting based on the data sets being analyzed. So-called data mining combines pattern matching, influence relationships, time-series correlations, and difference analysis to model future data sets. One of its advantages is that these algorithms are able to work with whole data sets rather than only with samples, which makes their accuracy much higher.

Thus, Data Mining is a process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown with the aim of obtaining clear and useful results for the owner of the database. [2]

Data Mining algorithms

Using Data Mining in decision support systems makes it possible to cover a much wider range of tasks. One important area of a bank's work is customer lending, which needs to be automated. Lending to customers is a very difficult and deep area of business for banks. When building a credit history, a bank can use predictive analysis to monitor individual clients and groups of clients. Predictive models are able to examine bad customer histories and look for warning signs.

Thus, a credit risk analysis was performed, that is, an accurate and rapid assessment of the borrower's creditworthiness. With Data Mining tools, this task is solved by analyzing the information accumulated or acquired from the credit bureau: the credit histories of a large number of different clients. For example, a bank that is a new player in the lending market already has a database of borrowers. It should be analyzed in order to identify the categories of bona fide clients, who are likely to repay the loan on time, and unreliable clients, who most likely will not be able to do so. The result of such an analysis can be presented in the form of a decision tree. [3]
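As an illustration, the sketch below shows how such a model could be trained in Python with scikit-learn. Note that scikit-learn grows CART trees rather than C4.5, although criterion="entropy" relies on the same information-gain idea discussed below; the column names and the tiny borrower history are hypothetical.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical borrower history (all values invented for illustration).
    history = pd.DataFrame({
        "age":          [25, 42, 61, 35, 29, 50],
        "income":       [1200, 4500, 3100, 2600, 900, 3800],
        "has_property": [0, 1, 1, 0, 0, 1],
        "repaid":       [0, 1, 1, 0, 0, 1],   # target: repaid on time?
    })

    features = ["age", "income", "has_property"]
    model = DecisionTreeClassifier(criterion="entropy", random_state=0)
    model.fit(history[features], history["repaid"])

    # Score a new loan applicant.
    new_client = pd.DataFrame([[30, 2800, 1]], columns=features)
    print(model.predict(new_client))   # predicted class for the applicant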

Decision trees

The basis of the work of decision trees is the process of recursively splitting the initial set of objects or observations into subsets associated with classes; the splits are determined by certain attributes at each cycle of the recursive partition. The splitting is carried out by decision rules that check attribute values against a condition. An example of using decision trees is issuing a loan to a bank customer. We need a database on the basis of which the forecasting will be carried out.

This database should contain the initial data on the bank's customers: [4]

- age;

- education;

- availability of real estate (housing / car / enterprise);

- monthly income;

- loan repayment on time.

Thus, on the basis of the above attributes, a prediction will be made as to whether it is worth issuing a loan to a new client. This task is solved in two stages.

First, we construct a classification model, that is, a classification tree or a set of classification rules. Second, we use the constructed model to make decisions on new customers: each path from the root of the built tree to one of its leaves is a set of rules used to answer the question "Will I give a loan?"

The rule set is a logical construction of the following form:

"if: <condition> then: <operator>".

Suppose that for the above example statistics are available, presented in the form of the following table.

Table 1 - Baseline data of bank customers


Figure 1 - Example of a decision tree for the "loan to the client" problem (1 - issue a loan, 0 - do not issue)

Thus, this example shows that the internal nodes of the tree are attributes of the described database. These attributes are called splitting attributes. The end nodes (leaves) of the tree, labeled with a class, hold the values of the dependent categorical variable: "issue" or "do not issue" the loan.

At each stage, it is necessary to specify a conditional check (split predicate) that breaks the set associated with the given node into subsets. For such a check, one of the database attributes other than the dependent variable must be selected. This choice of the splitting attribute is the key point in the automated construction of decision trees.

The general rule for selecting an attribute is as follows: the selected attribute must split the set so that each resulting subset consists of objects belonging to the same class, or is as close to that as possible, that is, so that the number of objects from other classes in each subset is as small as possible.

The C4.5 algorithm uses the concept of information gain, or entropy reduction, to select the optimal split at each node of the decision tree: a decrease in entropy corresponds to an increase in information, and vice versa. First, consider the concept of entropy. Suppose there is a variable X whose k possible values have probabilities p_1, p_2, ..., p_k. How much information is needed to transmit a stream of characters that represent the values of the observed X? The answer to this question is the entropy of X, defined as

H(X) = -Σ_j p_j · log2(p_j)  (1)
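A direct transcription of equation (1) in Python (a minimal sketch; the 4:3 class split anticipates the loan example considered below):

    import math

    def entropy(probabilities):
        # H(X) = -sum_j p_j * log2(p_j); terms with p = 0 contribute nothing.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(round(entropy([4/7, 3/7]), 4))   # 0.9852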

Algorithm C4.5 uses the concept of entropy as follows. Assume there is a splitting attribute S that divides the training data set T into several subsets T_1, T_2, ..., T_s. The average amount of information over these subsets can be calculated as the weighted sum of the entropies of the individual subsets: [5]

H_S(T, D) = Σ_{i=1..s} P_i · H(T_i, D)  (2)

where P_i is the proportion of records in the subset T_i, H_S(T, D) is the entropy of splitting the data set T by the splitting attribute S with the dependent variable D, H(T_i, D) is the entropy of the subset T_i of the split by the attribute S with the dependent variable D, and s is the number of values of the splitting attribute S (it determines the number of splitting subsets, i.e. classes). [6]

Then one can determine the information gain (entropy reduction) obtained when splitting the initial set T into subsets by the splitting attribute S as:

Gain(T, S) = H(T, D) - H_S(T, D)  (3)

where H(T, D) is the entropy of the initial data set T with respect to the dependent variable D:

H(T, D) = -Σ_{i=1..d} P_i · log2(P_i)  (4)

where d is the number of values of the dependent variable D and P_i is the proportion of records in T having the i-th value of D.
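Equations (2)-(4) translate into a short information-gain computation. A sketch reusing entropy() from the earlier snippet; records are assumed to be dictionaries mapping attribute names to values:

    from collections import Counter

    def class_entropy(records, target):
        # H(T, D): entropy of the class distribution in the record set, eq. (4).
        counts = Counter(r[target] for r in records)
        n = len(records)
        return entropy([c / n for c in counts.values()])

    def information_gain(records, split_attr, target):
        # Gain(T, S) = H(T, D) - H_S(T, D), equations (2) and (3).
        subsets = {}
        for r in records:
            subsets.setdefault(r[split_attr], []).append(r)
        n = len(records)
        weighted = sum(len(s) / n * class_entropy(s, target)
                       for s in subsets.values())
        return class_entropy(records, target) - weighted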

Thus, the information gain is obtained by dividing the initial set of training data T according to the splitting attribute S. At each node, C4.5 chooses the optimal split, namely the one with the highest information gain Gain(T, S). Let us calculate the entropy and the information gain to determine the optimal attribute for the example. First, calculate the entropy of the original data set, H(T):

H(T, Loan) = -(4/7)·log2(4/7) - (3/7)·log2(3/7) = 0.9852  (5)

Here, Loan is the dependent variable.

Calculate the information gain for the various attributes.

Information gain for the splitting attribute "Income":

Gain(T, Income) = H(T, Loan) - (2/7)·H(T_high, Loan) - (3/7)·H(T_mid, Loan) - (2/7)·H(T_low, Loan)
= 0.9852 - (2/7)·0 - (3/7)·(-(2/3)·log2(2/3) - (1/3)·log2(1/3)) - (2/7)·0 = 0.9852 - (3/7)·0.9183 ≈ 0.5917

Information gain for the splitting attribute "Age":

Gain(T, Age) = H(T, Loan) - (3/7)·H(T_age>30, Loan) - (1/7)·H(T_age>60, Loan) - (3/7)·H(T_age<=30, Loan)
= 0.9852 - (3/7)·0 - (1/7)·0 - (3/7)·(-(1/3)·log2(1/3) - (2/3)·log2(2/3)) = 0.9852 - (3/7)·0.9183 ≈ 0.5917

Information gain for the splitting attribute "Availability of real estate":

Gain(T, Real estate) = H(T, Loan) - (4/7)·H(T_yes, Loan) - (3/7)·H(T_no, Loan)
= 0.9852 - (4/7)·0 - (3/7)·0 = 0.9852

Information gain for the splitting attribute "Education":

Gain(T, Education) = H(T, Loan) - (3/7)·H(T_higher, Loan) - (1/7)·H(T_secondary, Loan) - (3/7)·H(T_special, Loan)
= 0.9852 - (3/7)·0 - (1/7)·0 - (3/7)·(-(1/3)·log2(1/3) - (2/3)·log2(2/3)) = 0.9852 - (3/7)·0.9183 ≈ 0.5917

Thus, the calculation of the information gains shows that the best check at the first level is the attribute "Availability of real estate", which has the highest information gain, Gain(T, Real estate) = 0.9852. With this in mind, the decision tree for this example will look as shown in Figure 2; the sketch below reproduces these figures in code.
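Table 1 itself did not survive extraction, but a hypothetical seven-client data set can be reconstructed to match the subset sizes used in the calculations above. Running information_gain() from the earlier sketch over it reproduces the choice of the root attribute (all attribute values below are illustrative):

    # Hypothetical reconstruction of Table 1; values are illustrative only.
    clients = [
        {"age": ">30",  "education": "higher",    "real_estate": "yes", "income": "high", "loan": 1},
        {"age": ">30",  "education": "higher",    "real_estate": "yes", "income": "high", "loan": 1},
        {"age": ">30",  "education": "higher",    "real_estate": "yes", "income": "mid",  "loan": 1},
        {"age": "<=30", "education": "special",   "real_estate": "yes", "income": "mid",  "loan": 1},
        {"age": "<=30", "education": "special",   "real_estate": "no",  "income": "mid",  "loan": 0},
        {"age": ">60",  "education": "special",   "real_estate": "no",  "income": "low",  "loan": 0},
        {"age": "<=30", "education": "secondary", "real_estate": "no",  "income": "low",  "loan": 0},
    ]

    for attr in ("income", "age", "real_estate", "education"):
        print(attr, round(information_gain(clients, attr, "loan"), 4))
    # real_estate scores 0.9852 while the other attributes score about
    # 0.5917, so "availability of real estate" becomes the root split.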

Next, select from the initial data the subset corresponding to the right branch (Table 2). This is necessary for the recursive application of the decision tree construction algorithm, to determine the splitting attribute for the next level. The set of candidate attributes now includes age, education and income; the attribute "Availability of real estate" has already been used for the split at the first level.

Table 2 - Data set for the selection of the second splitting attribute.

Calculate the entropy H(T) of this data set for the next level of splitting:

H(T, Loan) = -(4/4)·log2(4/4) - (0/4)·log2(0/4) = 0

Since the entropy of this set for the next level is zero, the process of building the decision tree for this example is complete. The resulting decision tree for this source data set of example statistics is optimal (Fig. 2). However, it should be borne in mind that the original sample may not cover all possible cases in reality, so the final decision on the structure of the decision tree is left to the user.

Figure 2 – The optimal decision tree for the example

Clustering algorithms. Hierarchical cluster method

The aim of database segmentation is to partition a database into an unknown number of segments, or clusters, of similar records, that is, records that share a number of properties and so are considered to be homogeneous. This approach uses unsupervised learning to discover homogeneous subpopulations in a database to improve the accuracy of the profiles. Database segmentation is less precise than other operations and is therefore less sensitive to redundant and irrelevant features [7].

As a clustering task, an analysis was conducted of the annual growth and outflow of the bank's customers. The bank's financial data were taken as the initial data, which gave the following table [8]:

Table 3 - Baseline customer information for the clustering method

Year    Number of clients (thousands)
2009    1.39
2010    11.63
2011    20.73
2012    39.99
2013    72.29
2014    125.05
2015    150.26
2016    169.22
2017    231.22
2018    241.88

The "nearest neighbor" method

Data set (the nine values from the table above that are used in the example):

1.39  20.73  39.99  72.29  125.05  150.26  169.22  231.22  241.88

Step 1. Since only singleton clusters exist at the start, the minimum distance between any record of cluster A and any record of cluster B is sought among the individual elements. The closest are the clusters with the elements 231.22 and 241.88; the distance between them is 10.66.

Step 2. Examining the cluster options shows that the cluster (150.26) should be merged with the cluster (169.22), at a distance of 18.96.

Step 3. At this step, the cluster (20.73) is merged with the cluster (39.99), at a distance of 19.26.

Step 4. In this step, clusters (1.39) and (20.73, 39.99) are combined with a distance of 19.34.

Step 5. Clusters (125.05) and (150.26, 169.22) are combined with a distance of 25.21.

Step 6. The clusters (1.39, 20.73, 39.99) and (72.29) are combined with a distance of 32.3.

Step 7. Clusters (1.39, 20.73, 39.99, 72.29) and (125.05, 150.26, 169.22) are combined with a distance of 52.76.

Step 8. Finally, clusters (1.39, 20.73, 39.99, 72.29, 125.05, 150.26, 169.22) and (231.22, 241.88) are combined with a distance of 62 [9].

Table 4 - solution of the problem using the “nearest neighbor” method.

Step 1.39 20.73 39.99 72.29 125.05 150.26 169.22 231.22 241.88

1 231.22, 241.88

d=10.66

2 150.26, 169.22

d=18.96

3 20.73, 39.99

d=18.96 4 1.39, 20.73, 39.99

d=19.34

5 125.05, 150.26, 169.22

d=25.21 6 1.39, 20.73, 39.99, 72,29

d=32.3.

7 1.39, 20.73, 39.99, 72,29, 125.05, 150.26, 169.22 d=52.76.

8 1.39, 20.73, 39.99, 72,29, 125.05, 150.26, 169.22, 231.22, 241.88 d=62.
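The same merge sequence can be checked with SciPy's hierarchical clustering. A sketch assuming NumPy and SciPy are installed; as in the worked example above, it clusters nine of the ten yearly values, without the 2010 figure:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Yearly client counts (in thousands) used in the example.
    points = np.array([1.39, 20.73, 39.99, 72.29, 125.05,
                       150.26, 169.22, 231.22, 241.88]).reshape(-1, 1)

    # method="single" is the nearest-neighbour (single-linkage) criterion:
    # the distance between clusters is the minimum pairwise distance.
    Z = linkage(points, method="single")
    print(Z[:, 2])   # merge distances: 10.66, 18.96, 19.26, 19.34, ...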

Conclusion


The article describes the means of multidimensional classification, in particular the method of cluster analysis, as well as their application to the banking sector.

The main classification methods are highlighted, and the use of Data Mining algorithms for analyzing banking data in the context of improving the banking system is described.

The methods of Data Mining contribute to improving the quality of banks' work, since they are based on structuring objects of different nature and discriminating between them.

However, the predicted values give only a general picture rather than exact figures; that is, the methods determine approximate values of the predicted variable.

REFERENCES

[1] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons, 2011.

[2] P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. John Wiley & Sons, 2005.

[3] T. Connolly, C. Begg. Database Systems: A Practical Approach to Design, Implementation and Management, 5th edition, pp. 1233-1234.

[4] J.S. Shiner. Entropy and Entropy Generation: Fundamentals and Applications. Springer Science & Business Media, 1996.

[5] O. Maimon, L. Rokach. Data Mining with Decision Trees: Theory and Applications, 2nd edition. World Scientific, 2014.

[6] A. Cherfi, K. Nouira, A. Ferchichi. Very Fast C4.5 Decision Tree Algorithm. Taylor & Francis, 2018.

[7] G. Gan, C. Ma, J. Wu. Data Clustering: Theory, Algorithms, and Applications. SIAM, 2007.

[8] A.N. Papadopoulos, Y. Manolopoulos. Nearest Neighbor Search: A Database Perspective. Springer Science & Business Media, 2006.

[9] S.K. Pal, P. Mitra. Pattern Recognition Algorithms for Data Mining. CRC Press, 2004.

