2.3. Instance-based learning and financial applications

The literature considers two similar concepts: instance-based learning (IBL) and case-based reasoning [Mitchell, 1997]. Instance-based learning covers methods based on objects presented by a set of numeric attributes of a fixed size, while case-based reasoning allows the use of more general objects. In this section, we focus on instance-based learning. In IBL, objects are usually embedded into the n-dimensional Euclidean space, and the distance between data objects $d_1 = (d_{11}, \dots, d_{1n})$ and $d_2 = (d_{21}, \dots, d_{2n})$ is measured by the Euclidean distance

$$D(d_1, d_2) = \sqrt{\sum_{j=1}^{n} (d_{1j} - d_{2j})^2}.$$

Then the target value for a new object d is assigned according to the target values of some training objects. Usually this is done by selecting the k nearest neighbours [Mitchell, 1997], i.e., the training objects $d_1, \dots, d_k$ with minimal distances $D(d, d_i)$:

IF $d_i$ is one of the k nearest neighbours of d THEN $D(d, d_i) \le D(d, d_j)$ for all training objects $d_j$ outside this set.

One of the simplest approaches is to assign the target value T(d) for d by averaging the target values of the k nearest neighbours if the target is a continuous variable:

$$T(d) = \frac{1}{k} \sum_{i=1}^{k} T(d_i).$$

Alternatively, each object can be weighted according to its distance from d, with weights inversely proportional to that distance:

$$T(d) = \frac{\sum_{i=1}^{k} w_i T(d_i)}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{D(d, d_i)^2}.$$
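To make the neighbour selection and the two averaging schemes concrete, here is a minimal Python sketch; the function name knn_regress and the choice of $w_i = 1/D(d, d_i)^2$ are illustrative assumptions, not taken from the book.

```python
import numpy as np

def knn_regress(X, t, d, k=5, weighted=True):
    """Predict a continuous target for query d from training objects X with targets t."""
    dist = np.linalg.norm(X - d, axis=1)        # Euclidean distances D(d, d_i)
    nearest = np.argsort(dist)[:k]              # indices of the k nearest neighbours
    if not weighted:
        return t[nearest].mean()                # simple average of the neighbours' targets
    w = 1.0 / (dist[nearest] ** 2 + 1e-12)      # weights inversely proportional to distance
    return np.sum(w * t[nearest]) / np.sum(w)   # distance-weighted average
```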

Similarly, if the target variable presents a discrete classification, then a weighted majority voting mechanism is used: d is assigned to the first class if $S_1(d) \ge S_2(d)$ and to the second class otherwise. Here

$$S_1(d) = \sum_{i=1}^{k} w_i v_{i1},$$

where $w_i$ is the weight of the i-th classifier (neighbour) and $v_{i1} = 1$ if classifier i classified d into the first class, $v_{i1} = 0$ otherwise. Similarly,

$$S_2(d) = \sum_{i=1}^{k} w_i v_{i2},$$

where $v_{i2} = 1$ if classifier i classified d into the second class.
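A hedged sketch of this voting rule for a two-class target, under the same assumptions (the labels 1 and 2 stand for class #1 and class #2; names are illustrative):

```python
import numpy as np

def knn_classify(X, labels, d, k=5):
    """Classify query d into class 1 or 2 by weighted majority voting of its k nearest neighbours."""
    dist = np.linalg.norm(X - d, axis=1)
    nearest = np.argsort(dist)[:k]
    w = 1.0 / (dist[nearest] ** 2 + 1e-12)   # weight w_i of each voting neighbour
    s1 = np.sum(w * (labels[nearest] == 1))  # S_1(d): weighted votes for class 1
    s2 = np.sum(w * (labels[nearest] == 2))  # S_2(d): weighted votes for class 2
    return 1 if s1 >= s2 else 2
```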

It is well known that the result of instance-based learning can be very sensitive to parameters such as k and the weights $w_i$, and to the normalization of attributes. For instance, stock prices and trade volumes are both measured in dollars. Let $d_1$ and $d_2$ represent data for two stocks in the format <$price, $volume>. Trade volumes are several orders of magnitude larger than prices, so $D(d_1, d_2)$ computed from both attributes is practically equal to the distance obtained if we exclude stock prices from the descriptions of $d_1$ and $d_2$ and consider the volumes alone. Therefore, the contribution of the stock price to the Euclidean distance is negligible. Thus, it is obvious that normalization is needed in order to use information about differences in the stock prices, but the result will be sensitive to the particular normalization used.

Two possible normalizations are:

1. normalization based on max values of stock price and volume in the sample:

   StockPrice(di)/maxStockPrice, StockVolume(di)/maxStockVolume;

2. normalization based on averages in the sample:

   StockPrice(di)/AverageStockPrice, StockVolume(di)/AverageStockVolume.
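Both normalizations are one-line operations; the sketch below uses illustrative toy values for two attributes (<$price, $volume>) only to show the computation.

```python
import numpy as np

# Illustrative toy sample: each row is one stock as <price in $, volume in $>
stocks = np.array([[30.0, 1_200_000.0],
                   [45.0,   900_000.0],
                   [28.0, 2_500_000.0]])

# 1. Normalization by the maximum values in the sample
by_max = stocks / stocks.max(axis=0)

# 2. Normalization by the sample averages
by_avg = stocks / stocks.mean(axis=0)
```

After either normalization, both attributes contribute comparably to the Euclidean distance, which is exactly the effect discussed above.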

Similarly, specific normalizations and related data coding are needed for many data types. More about different coding schemes associated with meaningful distances and advanced case-based learning methods is presented, for instance, in [Kovalerchuk, 1973, 1975, 1997; Hattoru, Torri, 1993; Kamgar-Parsi, Kanal, 1985].

The voting schemes can be extended beyond the k neighbours to use more complex estimates than those presented above. One such method (the AVO method) uses combinatorial estimates for voting [Zhuravlev, Nikiforov, 1971]: object d is assigned to class #1 when the estimates $\Gamma_1(d)$ and $\Gamma_2(d)$ satisfy the voting thresholds $c_1$ and $c_2$, where $\Gamma_1(d)$ is a combinatorial estimate of the closeness of object d to the objects of class #1, computed from the distances between d and all objects of class #1. Similarly, $\Gamma_2(d)$ estimates the closeness of d to class #2. The estimates are built with a parameter k: for every object a of the class and every subset $\omega$ of k attributes, d is counted as close to a on $\omega$ if it matches a within the attribute thresholds on every attribute of $\omega$, and the counts are summed over all objects of the class $K_m$:

$$\Gamma_m(d) = \sum_{a \in K_m} \; \sum_{\omega:\, |\omega| = k} \; \prod_{j \in \omega} h_j(d, a), \qquad m = 1, 2.$$

All distances are viewed as integers; they are computed using a Hamming distance with thresholds $\varepsilon_j$, where $\varepsilon_j$ is a threshold difference for attribute j:

$$\rho(d, a) = \sum_{j=1}^{n} \bigl(1 - h_j(d, a)\bigr), \qquad h_j(d, a) = \begin{cases} 1, & |d_j - a_j| \le \varepsilon_j, \\ 0, & \text{otherwise.} \end{cases}$$
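A hedged Python sketch of such an estimate, assuming the standard AVO construction described above (per-attribute thresholds and attribute subsets of size k); the function names and the unweighted summation are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def within_thresholds(d, a, eps):
    """Per-attribute match indicators h_j: 1 if |d_j - a_j| <= eps_j, else 0."""
    d, a, eps = np.asarray(d), np.asarray(a), np.asarray(eps)
    return (np.abs(d - a) <= eps).astype(int)

def combinatorial_estimate(d, class_objects, eps, k=2):
    """Closeness of d to one class: count attribute subsets of size k on which d
    matches a class object within the thresholds, summed over all class objects."""
    n = len(d)
    total = 0
    for a in class_objects:
        h = within_thresholds(d, a, eps)
        for subset in combinations(range(n), k):
            if all(h[j] == 1 for j in subset):   # d matches a on every attribute of the subset
                total += 1
    return total
```

Computing this estimate for both classes and comparing the results against the voting thresholds $c_1$ and $c_2$ then gives the class assignment.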

This case-based method was studied and used intensively. In particular, it has been shown in [Kovalerchuk, 1977] that this method is sensitive to backcasting. Backcasting means forecasting known attribute values which have not been used for training. For instance, a backcast value depends on an entered hypothetical target value T. At first glance, this appears not to be a very useful property. However, for non-stationary training data this is a way to "extend" the sample, and extending the data allows better testing of a discovered regularity. In this way, 10 attributes and 1000 cases can be used to obtain up to 10,000 values for backcasting, because for the training data any value of the attributes, not only the target, can be backcast. This approach has been used successfully with the AVO instance-based learning method for classification of court decisions with a scarce data set [Kovalerchuk et al, 1977].
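A minimal sketch of the backcasting idea under these assumptions: the data matrix holds the ordinary attributes together with the (possibly hypothetical) target value as columns, a known attribute is temporarily treated as the target, and its values are re-predicted for the training cases by the same nearest-neighbour scheme; the resulting errors serve as additional tests of a discovered regularity. The function name and the leave-one-out protocol are illustrative.

```python
import numpy as np

def backcast_attribute(X, j, k=5):
    """Backcast column j of the training matrix X (which includes the target column)
    from the remaining columns, leave-one-out, and return the absolute errors."""
    errors = []
    for i in range(len(X)):
        rest = np.delete(X, i, axis=0)                     # all cases except case i
        dist = np.linalg.norm(np.delete(rest, j, axis=1)
                              - np.delete(X[i], j), axis=1)
        nearest = np.argsort(dist)[:k]                     # k most similar cases
        predicted = rest[nearest, j].mean()                # backcast value of attribute j
        errors.append(abs(predicted - X[i, j]))
    return np.array(errors)
```

Running such a backcast for each of the 10 attributes over 1000 cases is what yields the extended set of test values mentioned above.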

Case-based reasoning covers instance-based learning, relaxing the requirement of embedding data into Euclidean space: other representations of the data are permitted and alternative measures of closeness can be used [Mitchell, 1997].

Financial Time Series Forecasting using k-Nearest Neighbors Classification. Maggini et al. [1997] transform the prediction problem into a classification task and use an approach based on the k-nearest neighbors algorithm to obtain the most probable variation with respect to the present price value. Predicting the actual price with high accuracy is often unrealistic; therefore, classes of prices are predicted instead of exact values. The prediction is based on the past variations and on other variables that might be correlated. In addition, the classification is based on a set of examples extracted from the previous values of the series. The classification is made adaptively: new values are inserted into the prototype set when they become available and the old values are discarded. This method learns quickly because it is sufficient to memorize all the values contained in the fitting set [Maggini et al., 1997].
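A hedged sketch of this reformulation (not the authors' exact procedure): the attribute vector is a window of past relative variations, the class is the direction of the next variation, and new (pattern, class) pairs can be appended to the prototype set while the oldest are discarded.

```python
import numpy as np

def make_prototypes(prices, window=5):
    """Turn a price series into (pattern of past variations, class of next variation) pairs."""
    var = np.diff(prices) / prices[:-1]          # relative price variations
    X, y = [], []
    for t in range(window, len(var)):
        X.append(var[t - window:t])              # the last `window` variations
        y.append(1 if var[t] > 0 else 0)         # class: next variation up (1) or not (0)
    return np.array(X), np.array(y)

def predict_next_class(X, y, recent, k=7):
    """Most probable class of the next variation by k-nearest-neighbour voting."""
    dist = np.linalg.norm(X - recent, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(round(y[nearest].mean()))         # unweighted majority vote over the neighbours
```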

Comparison with other methods. A summary of the differences between data mining methods was presented in Table 1.7 (Chapter 1). The instance-based learning methods were evaluated in the column marked IBL. According to [Dhar, Stein, 1997], this column indicates the two strongest features of IBL: scalability and flexibility. We added to Table 1.7 that it is easy to use numerical data with IBL. Two other features are indicated as the weakest: compactness and the use of logical expressions. We added the last feature as crucial for developing explainable models such as hybrid models combining statistical (probabilistic) inference and first-order logic, like MMDR (Chapters 4 and 5). These methods are marked as PILP (Probabilistic Inductive Logic Programming) methods in Table 1.7.

The flexibility and scalability of instance-based learning have led to these methods being widely and successfully used in data mining, including financial applications. For instance, adding new data to the k-nearest neighbor method increases the training and forecasting time only moderately. Independence from an expert when using IBL is relatively high in comparison with neuro-fuzzy and some other methods. In IBL, selection of a measure of closeness and of a data normalization can be challenging. Using the Euclidean distance has the obvious advantage of computational simplicity, but it may not be adequate to the problem; this distance is not invariant to attribute scale transformations. Tuning the normalization and selecting a distance measure is an art, and an expert is integral to, and perhaps the most important part of, this process. Embeddability of IBL is quite good: software and open source code are widely available, and runtime is not prohibitive for many real tasks.

Most of the other features of IBL are average in comparison with alternative methods (see Table 1.7). There is one feature of the k-nearest neighbors method that makes it unique.

This is the possibility for an expert to study the behavior of a real case with similar attributes and bring his/her intuition and informal knowledge to bear on the final forecasts. Below, some features of IBL based on [Dhar, Stein, 1997] are presented.

Explainability                     Medium
Response speed                     Medium-High
Tolerance for noise in data        Medium
Tolerance for sparse data          Medium
Ease of use of logical relations   Low
Ease of use of numerical data      High

These features are needed to compare IBL with probabilistic ILP. Except for the feature "ease of use of numerical data", the probabilistic ILP methods have advantages over IBL. To use numerical data in probabilistic ILP, the data should be transformed into relational form (see Chapter 4).
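As a small illustration of what such a transformation might look like (an assumption for this sketch, not the procedure of Chapter 4), a numeric attribute can be replaced by binary relations between objects:

```python
def greater_price(stocks, i, j):
    """Binary relation: True if stock i has a higher price than stock j."""
    return stocks[i]["price"] > stocks[j]["price"]

# Illustrative toy records; any numeric comparison can be encoded as such a relation
stocks = [{"price": 30.0, "volume": 1_200_000},
          {"price": 45.0, "volume": 900_000}]

print(greater_price(stocks, 1, 0))   # relational fact: price of stock 1 exceeds price of stock 0
```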
