Relational Data Mining (RDM)

Theorem 1. If rule R is a BST rule for data D, then rule R is a regularity on data D [Vityaev, 1992].

4.9. Numerical relational data mining

The rapid growth of databases is accompanied by a growing variety of data types. Relational data mining has unique capabilities for discovering a wide range of human-readable, understandable regularities in databases with various data types. However, using symbolic relational data for numerical forecasting and for discovering regularities in numerical data requires solving two problems.

Problem 1 is the development of a mechanism for transformation between numerical and relational data presentations. In particular, numerical stock market time series should be transformed into symbolic predicate form for discovering relational regularities.
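One minimal way to picture the transformation in Problem 1 is to replace a numerical series by the set of ordering facts that hold among its values. The sketch below is an illustration only, not the mechanism developed in the text: the function name `ordering_facts` and the price values are hypothetical.

```python
import math
from itertools import combinations

def ordering_facts(series):
    """Return all pairs (i, j) such that series[i] > series[j]."""
    facts = set()
    for i, j in combinations(range(len(series)), 2):
        if series[i] > series[j]:
            facts.add((i, j))
        elif series[j] > series[i]:
            facts.add((j, i))
    return facts

# Hypothetical closing prices for four consecutive days.
prices = [101.5, 103.0, 102.2, 104.1]
facts = ordering_facts(prices)

# Days on which the predicate "price(t) > price(t-1)" holds.
up_days = [t for t in range(1, len(prices)) if (t, t - 1) in facts]

# A strictly increasing transformation (here, the logarithm) leaves
# the relational form unchanged.
log_facts = ordering_facts([math.log(p) for p in prices])
```

In this sketch the relational form depends only on the order of the values, so `facts` and `log_facts` coincide even though the numbers differ.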

Problem 2 is the discovery of rules that are computationally tractable in relational form.

Representational measurement theory (RMT) is a powerful mechanism for approaching both problems. Below we review RMT from this viewpoint. RMT was originated by P. Suppes and other scientists at Stanford University [Scott, Suppes, 1958; Suppes, Zinnes, 1963]. Over more than three decades, this theory has produced many important results, summarized in [Krantz et al., 1971, 1981, 1990; Narens, 1985; Pfanzagl, 1968]. Some results related to data mining are presented below. RMT was motivated by the intention to formalize measurement in psychology to a level comparable with physics. Further study showed that measurement in physics itself should be formalized. Finally, a formal concept of a measurement scale was developed. This concept expresses what we call a data type in data mining.

Connections between learning methods and measurement theory were established in the 1970s [Samokhvalov, 1973; Vityaev, 1976; Zagoruiko, 1979; Kovalerchuk, 1973; Lbov et al., 1973; Vityaev, 1983]. First of all, RMT yields the result that

any numerical type of data can be transformed into a relational form with complete preservation of the relevant properties of a numeric data type.

This means that all regularities that can be discovered with a numeric data presentation can also be discovered using a relational data presentation. Many theorems [Krantz et al., 1971, 1981, 1990; Kovalerchuk, 1975] support this property mathematically. These theorems are called homomorphism theorems because they establish a homomorphism between numerical properties and predicates. This is the theoretical basis for logical data presentations that neither lose empirical information nor generate meaningless predicates and numerical relations. Moreover, RMT changes the paradigm of data types. Which data form is primary, numerical or relational? RMT argues that a numerical presentation is secondary: it is derived from an empirical relational data type. The theorems mentioned above support this idea.
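As an illustration of what a homomorphism theorem asserts, consider the simplest (ordinal) case. The notation is assumed here for illustration and is not taken from the text: A is a finite set of empirical objects and \succsim is a weak order on A (transitive and connected). A representation theorem of this kind states that there exists a numerical function f that mirrors the empirical relation:

```latex
\[
  f : A \to \mathbb{R}, \qquad
  a \succsim b \iff f(a) \ge f(b) \quad \text{for all } a, b \in A .
\]
```

In this ordinal case any strictly increasing transformation of f is an equally valid representation, which is one way to see why the numerical form can be regarded as derived from the relational one.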

The next critical result from measurement theory for learning algorithms is that the ordering relation is the most important relation for this transformation. Most practical regularities can be written in relational form and discovered using an ordering relation. This relation is central to the transformation between numerical and relational data presentations.

The third idea prompted by RMT is that the hierarchy of data types developed in RMT can help to speed up the search for regularities. The search begins with testing rules based on properties of weaker scales and finishes with properties of stronger scales, as defined in RMT. The number of search computations for weaker scales is smaller than for stronger scales.

This idea is implemented in the MMDR method (Section 4.8). MMDR begins by discovering regularities with monadic (unary) predicates, e.g., x is larger than the constant 5 (x > 5), and then discovers regularities with the ordering relation x > y over two variables. In other words, MMDR first discovers regularities similar to decision-tree regularities and then discovers regularities based on ordering relations, which are more complex first-order logic relations.
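The staged order of the search (unary predicates first, then ordering relations between two variables) can be sketched as follows. This is a toy illustration and not the MMDR algorithm itself: it accepts a hypothesis only if it agrees with the target on every row, whereas a real method would apply a statistical test. All names and data are hypothetical.

```python
from itertools import permutations

def unary_hypotheses(attributes, thresholds):
    """Stage 1: monadic predicates of the form attr > c."""
    return [(a, ">", c) for a in attributes for c in thresholds]

def binary_hypotheses(attributes):
    """Stage 2: ordering predicates of the form attr_x > attr_y."""
    return [(a, ">", b) for a, b in permutations(attributes, 2)]

def holds(hypothesis, row):
    """Evaluate a hypothesis on one row; the right side is a constant or an attribute name."""
    left, _, right = hypothesis
    right_value = row[right] if isinstance(right, str) else right
    return row[left] > right_value

def staged_search(rows, target, attributes, thresholds):
    """Return hypotheses, weaker stage first, that match the target on every row."""
    found = []
    stages = (unary_hypotheses(attributes, thresholds), binary_hypotheses(attributes))
    for stage in stages:
        for h in stage:
            if all(holds(h, r) == r[target] for r in rows):
                found.append(h)
    return found

# Hypothetical data: "up" is true exactly when x > y.
rows = [
    {"x": 3, "y": 1, "up": True},
    {"x": 6, "y": 8, "up": False},
    {"x": 7, "y": 2, "up": True},
    {"x": 2, "y": 4, "up": False},
]
result = staged_search(rows, "up", ["x", "y"], [5])
```

On this data no threshold predicate separates the rows, so the first stage finds nothing and the regularity appears only at the second stage, as the ordering predicate x > y.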

Another way to speed up the search for regularities is to reduce the space of hypotheses. Measurement theory prompts us to search in a smaller space of hypotheses. The reason is that there is no need to assume a particular class of numerical functions and then select a forecasting function from that class. In contrast, numerical interpolation approaches assume a class of numerical functions. Neural networks make a similar assumption by fixing the network architecture, activation functions, and other components. Relational data mining, on the other hand, tests the property of monotonicity for just one predicate instead of testing thousands of different monotone numerical functions. We have used this advantage in [Kovalerchuk, Vityaev, Ruiz, 1996, 1997]. RMT has several theorems that match classes of numeric functions with relations.
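The point about monotonicity can be made concrete with a small sketch. Instead of fitting candidate monotone functions one by one, a single relational test checks whether the ordering of x is preserved in y. The function name and the data are hypothetical, and the test shown is the plain non-strict version, chosen here for illustration.

```python
from itertools import combinations

def is_monotone(xs, ys):
    """True if x_i > x_j implies y_i >= y_j for every pair of observations."""
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] > xs[j] and ys[i] < ys[j]:
            return False
        if xs[j] > xs[i] and ys[j] < ys[i]:
            return False
    return True

xs = [1, 2, 3, 4]
quadratic = [1, 4, 9, 16]          # y = x^2 on positive x
saturating = [0.0, 0.7, 1.1, 1.4]  # roughly logarithmic growth
noisy = [1, 3, 2, 5]               # order violated between x=2 and x=3

checks = [is_monotone(xs, ys) for ys in (quadratic, saturating, noisy)]
```

The same test accepts both the quadratic and the saturating data without knowing which functional form produced them, and rejects the series that violates the ordering.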

The next benefit of RMT is that the theory prompts us to avoid incorrect data preprocessing, which may violate data type properties. This is an especially sensitive issue if preprocessing involves combinations of several features. For instance, results of case-based reasoning are very sensitive to preprocessing. Case-based reasoning methods such as k-nearest neighbors find the k objects nearest to the current object and assign it a target value according to the target values of these k neighbors. A discovered regularity usually changes significantly if uncoordinated preprocessing of attributes is applied. In this way, patterns can be corrupted. In particular, measurement theory can help to select appropriate preprocessing mechanisms for stock market data, and a better preprocessing mechanism speeds up the search for regularities.
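The sensitivity of nearest-neighbor methods to preprocessing can be seen in a two-point example. The sketch below uses hypothetical data and a hypothetical unit change (multiplying the second attribute by 0.01); it only illustrates the effect described above.

```python
import math

def nearest(query, points):
    """Index of the point closest to query in Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))

def rescale(point, factor):
    """Rescale only the second attribute, e.g. a change of units."""
    return (point[0], point[1] * factor)

points = [(1.0, 50.0), (3.0, 11.0)]
query = (1.2, 10.0)

# Nearest neighbor on the raw attributes.
raw_neighbor = nearest(query, points)

# Nearest neighbor after rescaling the second attribute of every object.
scaled_neighbor = nearest(rescale(query, 0.01), [rescale(p, 0.01) for p in points])
```

Rescaling one attribute leaves the order of values within that attribute unchanged, yet it switches the nearest neighbor from the second point to the first, so any target value copied from the neighbor changes as well.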

The discussion above shows that there should be two steps in the transformation of numerical data into relational form:

– extracting, generating, and discovering essential predicates (relations), and

– matching these essential predicates with numerical data.

Sometimes this is straightforward. Sometimes it is a much more complex task, especially because the computational complexity of the problem grows exponentially with the number of new predicates. Relational data mining can also be viewed as a tool for discovering predicates that are then used to solve a target problem by other methods. In this way, the whole data mining area can be considered a predicate discovery technology.

Several studies have shown that the actual success of discovering understandable regularities in databases depends significantly on the use of data type information. Data type information is the first source of predicates for relational data presentation (see Section 4.10).

A formal mechanism for describing data types is well developed in Object-Oriented Programming (OOP) and is implemented in all modern programming languages. An abstract data type in C++ includes private elements and member functions. These functions describe appropriate operations with elements and relations between elements. However, OOP does not help much in generating these operations and relations themselves. OOP's intent is to provide a tool for describing already generated abstract data types.

In contrast, RMT is intended for generating operations and predicates as a description of a data type. This RMT mechanism is called an empirical axiomatic theory. A variety of data types, such as matrices of pairwise comparisons, multiple comparisons, attribute-based matrices, matrices of ordered data, and matrices of closeness, can be represented in this way (see below). The relational representation of data and data types is then used for discovering understandable regularities. We argue that current learning methods utilize only part of the data type information actually present in data. They either lose part of the data type information or add some non-interpretable information.
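To make the idea of describing a data type by predicates and axioms more tangible, the sketch below treats a matrix of pairwise comparisons as a binary relation and checks whether it satisfies the axioms of a strict order. The choice of axioms (irreflexivity, asymmetry, transitivity), the function name, and the matrices are assumptions made for illustration; the text does not list them at this point.

```python
def is_strict_order(m):
    """Check strict-order axioms for a square boolean comparison matrix.

    m[i][j] is True when item i is ranked above item j.
    """
    n = len(m)
    idx = range(n)
    irreflexive = all(not m[i][i] for i in idx)
    asymmetric = all(not (m[i][j] and m[j][i]) for i in idx for j in idx)
    transitive = all(
        m[i][k] or not (m[i][j] and m[j][k])
        for i in idx for j in idx for k in idx
    )
    return irreflexive and asymmetric and transitive

# Hypothetical comparison matrices for three items a, b, c.
consistent = [
    [False, True, True],    # a above b and c
    [False, False, True],   # b above c
    [False, False, False],
]
cyclic = [
    [False, True, False],   # a above b
    [False, False, True],   # b above c
    [True, False, False],   # c above a: a cycle
]

results = (is_strict_order(consistent), is_strict_order(cyclic))
```

A matrix that passes such checks can be treated as ordered data, while one that fails (as the cyclic matrix does on transitivity) supports only weaker predicates.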

The language of empirical axiomatic theories is an adequate, lossless language for reaching this goal. There is an important difference between the language of empirical axiomatic theories and the language of first-order logic. First-order logic by itself says nothing about real-world entities. The language of empirical axiomatic theories uses first-order logic as its mathematical mechanism, but it also incorporates additional concepts to meet the strong requirement that relations and operations have an empirical interpretation.

From the document DATA MINING IN FINANCE (pages 180-183)