
Relational Data Mining (RDM)

Theorem 1. If rule R is a BST rule for data D, then rule R is regularity on data D [Vityaev, 1992]

4.10. Data types

4.10.1. Problem of data types

OOP’s intent is to provide a tool for describing already generated abstract data types.

In contrast, RMT is intended for generating operations and predicates as a description of a data type. This RMT mechanism is called an empirical axiomatic theory. A variety of data types, such as matrices of pairwise comparisons, matrices of multiple comparisons, attribute-based matrices, matrices of ordered data, and matrices of closeness, can be represented in this way (see below). The relational representation of data and data types is then used for discovering understandable regularities. We argue that current learning methods utilize only part of the data type information actually present in the data.

They either lose a part of the data type information or add some non-interpretable information.

The language of empirical axiomatic theories is an adequate lossless language to reach this goal. There is an important difference between the language of empirical axiomatic theories and the first-order logic language. The first-order logic language does not indicate anything about real-world entities. The language of empirical axiomatic theories uses the first-order logic language as a mathematical mechanism, but it also incorporates additional concepts to meet the strong requirement of empirical interpretation of relations and operations.

Another solution would be to develop universal DM tools to handle any data type. In this case, member functions of a data type should be input information along with the data (MMDR implements this approach). This problem is more general than the problem of a specific DM tool. Let a relative difference for the stock price, $\Delta(t)$, be a “float” data type. This is correct for computer memory allocation, but it does not help to decide whether all operations with float numbers are applicable to $\Delta(t)$. For instance, what does it mean to add one relative difference to another? There is no empirical procedure matching this sum operation. However, the comparison operation makes sense, e.g., $\Delta(x) < \Delta(y)$ means faster growth of stock price on date y than on date x. This relation also helps interpret a relation “$\Delta(w)$ between $\Delta(x)$ and $\Delta(y)$” as $\Delta(x) < \Delta(w) < \Delta(y)$. Both of these relations are already interpreted empirically.

Therefore, values of $\Delta$ can be compared, but one probably should avoid an addition operation (+) if the goal is to produce an interpretable learned rule.

If one decides to ignore these observations and applies operations that are formally proper for float numbers in programming languages, then a learned rule will be difficult to interpret. As was already mentioned, these difficulties arise from the uncertainty of the set of interpretable operations and predicates for these data, i.e., uncertainty of the empirical content of data. The precise definition of the empirical content of data will be given in terms of the empirical axiomatic theory below.
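The point about interpretable operations can be sketched in code. The following is a minimal illustration, not part of the original text: the class name `RelativeDifference`, the helper `between`, and the sample values are all hypothetical. The type exposes comparison, which has an empirical meaning, and refuses addition, which does not.

```python
from functools import total_ordering


@total_ordering
class RelativeDifference:
    """Hypothetical wrapper: a stock relative difference that can only be compared."""

    def __init__(self, value):
        self.value = float(value)

    def __eq__(self, other):
        if not isinstance(other, RelativeDifference):
            return NotImplemented
        return self.value == other.value

    def __lt__(self, other):
        if not isinstance(other, RelativeDifference):
            return NotImplemented
        return self.value < other.value

    def __hash__(self):
        return hash(self.value)

    def __add__(self, other):
        # Addition has no empirical procedure behind it, so it is rejected.
        raise TypeError("addition of relative differences is not interpretable")


def between(low, mid, high):
    """True if mid lies strictly between low and high."""
    return low < mid < high


# Sample values are assumptions chosen for illustration only.
d_x = RelativeDifference(0.01)
d_w = RelativeDifference(0.02)
d_y = RelativeDifference(0.03)

print(d_x < d_y)               # → True
print(between(d_x, d_w, d_y))  # → True
```

With such a type, a learning method that tries `d_x + d_y` fails immediately instead of silently producing a rule that cannot be interpreted.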

Relational data types

A data type is relational if it is described in terms of the set of relations (predicates). Some basic relational data types include:

– Equivalence data type,
– Tolerance data type,
– Partial ordering data type,
– Weak ordering data type,
– Strong ordering data type,
– Semi-ordering data type,
– Interval ordering data type,
– Tree data type.

Next, we define these data types.

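An equivalence data type is based on an equivalence relation E: for any $a, b, c \in A$,

1. $E(a,a) = 1$ (true),

2. $E(a,b) \Rightarrow E(b,a)$, and

3. $E(a,b) \,\&\, E(b,c) \Rightarrow E(a,c)$.

Further, we will often omit the predicate truth value, assuming that $P(x,y)$ is the same as $P(x,y) = 1$.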

The pair <A,E> is called an equivalence data type, where A is a set of possible values of an attribute and E is an equivalence relation. An equivalence relation partitions set A. This data type is called a nominal scale in representative measurement theory. An equivalence data type can be presented as a class in programming languages like C++ with a Boolean member function E(x,y).
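The class-based presentation mentioned above can be sketched as follows. The text names C++; this sketch uses Python for brevity. The class `EquivalenceType`, its methods, and the sample sector labels are assumptions made for illustration. It stores the pair of a value set and a Boolean predicate, checks the three properties of an equivalence relation by enumeration, and builds the partition that the relation induces.

```python
from itertools import product


class EquivalenceType:
    """Hypothetical pair of a value set A and a Boolean predicate E(x, y)."""

    def __init__(self, values, E):
        self.A = list(values)
        self.E = E

    def is_equivalence(self):
        A, E = self.A, self.E
        reflexive = all(E(a, a) for a in A)
        symmetric = all(E(b, a) for a, b in product(A, A) if E(a, b))
        transitive = all(
            E(a, c)
            for a, b, c in product(A, A, A)
            if E(a, b) and E(b, c)
        )
        return reflexive and symmetric and transitive

    def partition(self):
        """Group A into classes of mutually equivalent values."""
        classes = []
        for a in self.A:
            for cls in classes:
                if self.E(a, cls[0]):
                    cls.append(a)
                    break
            else:
                classes.append([a])
        return classes


# Hypothetical nominal attribute: the sector label of four stocks.
sector = {"s1": "tech", "s2": "tech", "s3": "energy", "s4": "energy"}
same_sector = EquivalenceType(sector, lambda x, y: sector[x] == sector[y])

print(same_sector.is_equivalence())  # → True
print(same_sector.partition())       # → [['s1', 's2'], ['s3', 's4']]
```

The enumeration is cubic in the size of A, so it is only meant for small attribute domains.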

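A tolerance data type is based on a tolerance relation L: for any $a, b \in A$,

1. $L(a,a)$, and

2. $L(a,b) \Rightarrow L(b,a)$.

A tolerance relation L is weaker than an equivalence relation E. Intervals satisfy the tolerance relation but not the equivalence relation.

Example. Let a, b, and c be intervals, and let $L(x,y)$ hold when x and y overlap, i.e., $x \cap y \neq \emptyset$. If a overlaps b and b overlaps c while a and c are disjoint, then $L(a,b)$ and $L(b,c)$ hold, but $L(a,c) = 0$ (false), because $a \cap c = \emptyset$.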
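The interval example can be checked numerically. This is a small sketch under stated assumptions: the function name `overlaps` and the three concrete intervals are hypothetical, and the tolerance relation is taken to be overlap of closed intervals.

```python
def overlaps(x, y):
    """Assumed tolerance predicate L(x, y): closed intervals x and y share a point."""
    return max(x[0], y[0]) <= min(x[1], y[1])


# Hypothetical intervals chosen for illustration.
a, b, c = (1, 3), (2, 5), (4, 6)
intervals = [a, b, c]

reflexive = all(overlaps(x, x) for x in intervals)
symmetric = all(
    overlaps(y, x)
    for x in intervals for y in intervals
    if overlaps(x, y)
)
transitive = all(
    overlaps(x, z)
    for x in intervals for y in intervals for z in intervals
    if overlaps(x, y) and overlaps(y, z)
)

print(reflexive, symmetric, transitive)  # → True True False
```

Reflexivity and symmetry hold, so overlap is a tolerance relation on these intervals. Transitivity fails on the triple a, b, c, so it is not an equivalence relation.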

A partial ordering data type is based on a partial ordering relation P:

for any 1.

This relation is also called a quasi-ordering relation. There is no numeric coding (representation) for elements of A completely consistent with relation P, i.e., there is not a numeric relation equivalent to P. The formal concepts of consistency and equivalency will be discussed in the next section.

A weak ordering data type is based on a weak ordering relation P:

for any

The difference between partial ordering and weak ordering is illustrated by noting that any a and b can be compared by a weak ordering, but in partial ordering some a and b may not be comparable.
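The comparability difference can be illustrated with a short sketch. The helper `incomparable_pairs` and both sample relations are assumptions chosen for illustration: the usual "less than or equal" on numbers stands in for a relation in which every pair is comparable, and set inclusion stands in for a partial ordering.

```python
from itertools import combinations


def incomparable_pairs(A, P):
    """Pairs (a, b) for which neither P(a, b) nor P(b, a) holds."""
    return [(a, b) for a, b in combinations(A, 2) if not P(a, b) and not P(b, a)]


def leq(a, b):
    # Works for numbers (usual order) and for frozensets (inclusion).
    return a <= b


# Every pair of numbers is comparable.
numbers = [3, 1, 2, 2]
number_gaps = incomparable_pairs(numbers, leq)

# Under set inclusion, {x} and {y} cannot be compared.
groups = [frozenset("x"), frozenset("y"), frozenset("xy")]
group_gaps = incomparable_pairs(groups, leq)

print(number_gaps)      # → []
print(len(group_gaps))  # → 1
```

The single incomparable pair in the second case is the pair of one-element sets, neither of which contains the other.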

A strong ordering data type is based on a strong ordering relation P:

for any 1.

2. and 3.

The difference between strong ordering and weak ordering can be illustrated with two relations: “≥” and “>”. For instance, $a \geq a$ is true, but $a > a$ is false. It is proved in [Pfanzagl, 1971] that for strong and weak ordering data types <A,P>, A can be coded by numbers with preservation of properties 1-3 if A is countable.

A semi-ordering data type is based on a semi-ordering relation P:

for any

such that $P(a,d) \vee P(d,c)$.

This relation P can be illustrated with intervals. Let U be a function coding each element of A as a real number, $U: A \to \mathbb{R}$. Then P(a,b) is defined as

The function U is called a numeric representation of A. A numeric representation need not exist for every data type. For instance, a partial ordering relation does not have a numeric representation. At first glance this seems strange, because we can always code elements of A with numbers. However, there is no numerical coding consistent with a partial ordering relation P. Consider a and b that are not comparable, i.e., neither $P(a,b)$ nor $P(b,a)$ holds, but a and b are coded by numbers. Any two numbers are obviously comparable, e.g., 4<5.

Therefore, a non-interpretable property is brought to A by numerical coding of its elements. However, if numerical comparison is not used for constructing a learned rule, any one-to-one numerical coding can be used. If a numerical order is used in learning a rule, then this rule can be non-interpretable.
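The argument about numeric coding can be verified by brute force on a small example. The sketch below rests on an assumption: a coding U is called consistent with P when P(a,b) holds exactly when U(a) ≤ U(b). The formal notion of consistency is deferred in the text, so this reading, the helper names, and the sample sets are all hypothetical.

```python
from itertools import product


def consistent(A, P, U):
    """Assumed notion: U represents P if P(a, b) holds exactly when U[a] <= U[b]."""
    return all(P(a, b) == (U[a] <= U[b]) for a, b in product(A, repeat=2))


def has_numeric_coding(A, P):
    """Try all codings by ranks 0..len(A)-1, which covers every order pattern."""
    for codes in product(range(len(A)), repeat=len(A)):
        if consistent(A, P, dict(zip(A, codes))):
            return True
    return False


def subset(a, b):
    return a <= b


# A chain under inclusion versus a set with an incomparable pair.
chain = [frozenset("x"), frozenset("xy")]
poset = [frozenset("x"), frozenset("y"), frozenset("xy")]

print(has_numeric_coding(chain, subset))  # → True
print(has_numeric_coding(poset, subset))  # → False
```

The second search fails because any two codes satisfy U(a) ≤ U(b) or U(b) ≤ U(a), while the two one-element sets are related in neither direction.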

An interval ordering data type is based on an interval-ordering relation P: for any

1.P(a,a), 2.

It is proved in [Fishbern, 1970] that a numeric representation for <A,P> exists that is completely consistent with properties 1 and 2. There are functions U and S, $U, S: A \to \mathbb{R}$, such that for any

A tree-type data type is based on a tree-type relation P:

for any 1.

There is no numerical representation completely consistent with these properties.

If some ordering relation is interpretable in terms of domain background knowledge, then it can be included in background knowledge and used for discovering rules in RDM directly. Partial ordering and tree-type relations often appear in hierarchical classifications and in decision trees. In financial applications, the data are usually presented as numeric attributes, but relations are often not presented explicitly. More precisely, these attributes are coded with numbers, but the applicability of number relations and operations must be confirmed. Let us illustrate the use of the relations defined in this section for identifying the data type of the stock relative difference $\Delta(t)$. Let PS(x,y) be defined as a strict comparison of $\Delta(x)$ and $\Delta(y)$, and suppose a financial expert agrees that PS(x,y) makes sense. Now we can identify its type. This is a strong ordering relation. Therefore, we can identify the attribute $\Delta$ as an attribute of the strong ordering data type. Similarly, we can define the corresponding non-strict comparison and identify $\Delta$ as an attribute of the weak ordering data type. One can continue to identify $\Delta$ as belonging to other data types listed in this section as well. One may wish to produce a predicate PM(x,y,z) that involves the sum of relative differences.

There is little financial sense in this predicate, because the operation (+) is not financially interpreted for $\Delta$. The above considerations show that there are relations without a numeric representation. Therefore, the relational data representation in first-order logic is more general than a numeric representation.

What is the traditional way of processing binary relations? Traditionally, numerical methods use distance functions (from a metric space) between matrices of relations (binary relations). These distance functions are defined axiomatically or by using statistical assumptions associated with the coefficients of Stuart, Yule, and Kendall, information gain measures, etc. Obviously, these assumptions restrict the areas of applicability of these methods.
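The identification of the stock relative difference as a strong or weak ordering attribute, discussed earlier in this section, can be sketched as a property check on sample data. Everything concrete here is an assumption: the values in `delta`, the direction of the comparison in `PS`, and the name `PW` for the non-strict counterpart. The checks mirror the illustration with “≥” and “>”: the non-strict relation is reflexive, the strict one is not, and both are transitive.

```python
from itertools import product

# Hypothetical relative differences of a stock price, indexed by date.
delta = {"d1": 0.010, "d2": -0.004, "d3": 0.017, "d4": 0.025}
dates = list(delta)


def PS(x, y):
    """Strict comparison; the direction (>) is an assumption."""
    return delta[x] > delta[y]


def PW(x, y):
    """Non-strict counterpart of PS (hypothetical name)."""
    return delta[x] >= delta[y]


def reflexive(P):
    return all(P(x, x) for x in dates)


def transitive(P):
    return all(
        P(x, z)
        for x, y, z in product(dates, repeat=3)
        if P(x, y) and P(y, z)
    )


print(reflexive(PS), transitive(PS))  # → False True
print(reflexive(PW), transitive(PW))  # → True True
```

Such a check only confirms the formal properties on the given sample. Whether the relation makes sense financially still has to be confirmed by a domain expert, as the text requires.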
