

4.3. Relational data mining paradigm

As we discussed in the previous section, attribute-value languages are quite restrictive: they cannot induce relations between different objects explicitly.

Therefore, richer languages were proposed to express relations between objects and to operate with objects more complex than a single tuple of attributes. Lists, sets, graphs, and composite types exemplify such complex objects.

These more expressive languages belong to the class of first-order logic languages (see Section 4.4.3 for definitions and [Russell, Norvig, 1995; Dzeroski, 1996; Mitchell, 1997]). These languages support variables, relations, and complex expressions. As we already mentioned, relational data mining is based on first-order logic.
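For instance, a first-order rule can relate two different objects through shared variables, something a single attribute-value tuple cannot express. A minimal sketch in Prolog notation, with hypothetical predicate names:

    % X has an older colleague Y: the variables X and Y range over
    % objects, and colleague/2 is a relation between two objects.
    older_colleague(X, Y) :-
        colleague(X, Y), age(X, AX), age(Y, AY), AY > AX.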

ILP can discover regularities using several tables (relations) in a database, such as Tables 4.3 and 4.4, but the propositional approach requires creating a single table, called a universal relation (relation in the sense of relational databases) [Dzeroski, 1995]. The following fields form the universal relation in Example 2 presented above:

Age, Sex, Num_of_Supervised, Corporate_cardholder, along with Colleague_Age, Colleague_Sex, Colleague_Num_of_Supervised, and Colleague_Corporate_cardholder.

Thus, the features of a colleague are included in the list of features of a person. Table 4.6 illustrates this combined (universal) relation. Table 4.6 is larger than Tables 4.3 and 4.4 together. One of the reasons is that the first and the last rows actually present the same information, just in reverse order.
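To make the construction concrete, the following hypothetical Prolog sketch joins a person table with a colleague relation into universal-relation rows; all names, attribute values, and predicate names are illustrative, not the book's actual data:

    % person(Name, Age, Sex, Num_of_Supervised, Corporate_cardholder).
    person(diana_right,    25, female, 2, yes).
    person(barbara_walker, 32, female, 0, no).
    colleague(diana_right, barbara_walker).
    colleague(barbara_walker, diana_right).  % the relation is symmetric

    % One universal-relation row per (person, colleague) pair.
    universal(A, S, N, C, CA, CS, CN, CC) :-
        person(P, A, S, N, C),
        colleague(P, Q),
        person(Q, CA, CS, CN, CC).

The query ?- universal(A, S, N, C, CA, CS, CN, CC). enumerates the rows, and the two symmetric colleague facts produce exactly the mirrored duplicate rows noted above.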

The universal relation can be very large and, therefore, inconvenient and impractical [Dzeroski, 1995]. First-order logic rules have an advantage in discovering relational assertions because they capture relations directly.
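By contrast, a first-order rule can use the colleague relation directly, with no flattening into a single table. A hypothetical rule, reusing the person/5 and colleague/2 facts sketched above:

    % A person is predicted to be a potential customer if some
    % colleague of theirs already holds a corporate card.
    potential_customer(P) :-
        colleague(P, Q),
        person(Q, _, _, _, yes).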

The typical Inductive Logic Programming task is a classification task [Dzeroski, 1996] with background knowledge B expressed as:

– a set of predicate definitions and properties,

– positive examples E+ for some class (pattern), and

– negative examples E- for the same class (pattern).

Using this background knowledge an ILP system will construct a predicate logic formula H such that:

All the positive examples in E+ can be logically derived from B and H, and no negative example in E- can be logically derived from B and H. Formula H expresses some regularity that exists in the background knowledge.

This formula could be discovered by ILP methods. ILP assumes a predicate logic representation of B and H.

In the example above, records like Corporate_Cardholder(Diana Right) = Yes form the set of positive examples for the pattern (class) "potential customer", and records like Corporate_Cardholder(Barbara Walker) = No form the set of negative examples for this pattern. Usually background knowledge B, formula H, and the positive and negative examples E+ and E- are all written as programs in the Prolog programming language. Prolog was designed to support rich logical expressions, relations, and inference. This language differs significantly from common programming languages like C++ and Java. A search for regularity H is organized using logical inference implemented in Prolog. The convenience of logical inference in Prolog also has a drawback: it may require a lot of memory and disk space for storing examples and background knowledge. Examples E and background knowledge B can be represented in different ways, from very compact to very space consuming. Therefore, a search for regularity H can be complex and inefficient.
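A minimal Prolog sketch of this setup for the corporate-card example, with hypothetical facts and a hypothetical candidate hypothesis:

    % Background knowledge B: person(Name, Age, Sex, Supervised, Card).
    person(diana_right,    25, female, 2, yes).
    person(barbara_walker, 32, female, 0, no).
    supervises_many(P) :- person(P, _, _, N, _), N >= 2.

    % Positive examples E+ and negative examples E- for the pattern
    % (class) "potential customer".
    positive(diana_right).
    negative(barbara_walker).

    % A candidate hypothesis H that an ILP search might construct.
    potential_customer(P) :- supervises_many(P).

    % H is acceptable if every example in E+ is derivable from B and H
    % and no example in E- is.
    acceptable :-
        forall(positive(P), potential_customer(P)),
        \+ (negative(N), potential_customer(N)).

Running ?- acceptable. checks both conditions; a real ILP system searches the space of such clauses rather than testing one hand-written candidate.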

The application of ILP involves three steps:

– development of an effective representation of the examples,

– development of relevant background knowledge, and

– use of a general-purpose ILP system.

There is a practical need to generalize this goal. Relational data mining methods should be able to solve numerical and interval forecasting tasks along with classification tasks such as the one presented above. This requires modifying the concept of training examples E+ and E- and modifying the concept of deriving (inferring) training examples from background knowledge and a predicate. Section 4.8 presents a relational algorithm, called MMDR, which is able to solve numerical and interval forecasting tasks.

This algorithm operates with a set of training examples E. Each example is amended with a target value, as is done in Table 4.2, where attribute #3 is a target attribute: the stock price for the next day. This is a numerical value.

There is no need for MMDR to make this target discrete to get a classification task. Therefore, more generally, a relational data mining (RDM) mechanism is designed for forecasting tasks, including classification, interval, and numerical forecasting. Similar to the definition given for classification tasks, for general RDM background knowledge B is expressed as:

– a set of predicate definitions,

– training examples E expanded with target values T (nominal or numeric), and

– a set of hypotheses expressed in terms of predicate definitions.

Using this background knowledge, an RDM system will construct a set of predicate logic formulas H1, ..., Hn such that:

The target forecast for all the examples in E can be logically derived from B and the appropriate formula Hi, as sketched below.
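What E expanded with a numeric target might look like in Prolog, in the spirit of Table 4.2 (the days, prices, volumes, and the rule are all hypothetical placeholders):

    % example(Day, StockPrice, StockTradeVolume, NextDayPrice): the
    % fourth argument is the numeric target T; no discretization needed.
    example(d1, 54.0, 80000, 54.6).
    example(d2, 54.6, 87000, 55.1).
    prev(d2, d1).

    % One formula Hi: derive an interval forecast for the target
    % from B (the day ordering) and the example's own attributes.
    target_above(Day, P) :-
        example(Day, P, V, _), prev(Day, D0),
        example(D0, P0, V0, _), P > P0, V > V0.

Here ?- target_above(d2, Low). succeeds with Low = 54.6: the forecast "the target exceeds 54.6" is derived from B and this Hi for example d2.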

Example 4. Let us consider Rule 2 discovered from Table 4.2 (see Example 1, Section 4.2).

IF Greater(StockPrice(t), StockPrice(t-1)) AND
Greater(StockTradeVolume(t), StockTradeVolume(t-1))
THEN StockPrice(t+1) > StockPrice(t)

This rule represents logical formula H2, and Table 4.2 represents training examples E. These two sources allow us to derive logically the following forecast for date 01.04.99:

StockPrice(01.04.99) > StockPrice(01.03.99) = $54.6   (1)

assuming that the premise of Rule 2 holds for t = 01.03.99; this is consistent with the actual StockPrice and StockTradeVolume values for date 01.03.99 in Table 4.2. Rule 1 from the same Example 1 in Section 4.1.2 represents logical formula H1, but this rule is not applicable to t = 01.03.99. In addition, other rules can be discovered from Table 4.2. For instance,

IF StockPrice(t) < $60 AND StockTradeVolume(t) < $90000
THEN Greater($60, StockPrice(t+1))

This rule allows us to infer

$60 > StockPrice(01.04.99)   (2)

Combining (1) and (2), we obtain

$60 > StockPrice(01.04.99) > $54.6   (3)

With more data we can narrow the interval (54.6, 60) for StockPrice(01.04.99). A similar logical inference mechanism can be applied for t = 01.04.99 to produce a forecast for date 01.05.99.
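A runnable Prolog rendering of this inference chain follows; only the $54.6 price for 01.03.99 and the two rules come from the text, while the remaining table values are placeholders chosen to satisfy the rules' premises:

    price('01.02.99', 54.0).      price('01.03.99', 54.6).
    volume('01.02.99', 85000).    volume('01.03.99', 87000).
    next_day('01.02.99', '01.03.99').
    next_day('01.03.99', '01.04.99').

    % Rule 2 (formula H2): price and volume both rose, so tomorrow's
    % price exceeds today's, giving the lower bound in statement (1).
    lower_bound(Tomorrow, P) :-
        next_day(T, Tomorrow), next_day(T0, T),
        price(T, P), price(T0, P0), P > P0,
        volume(T, V), volume(T0, V0), V > V0.

    % The second rule: low price and volume today cap tomorrow's
    % price at $60, giving the upper bound in statement (2).
    upper_bound(Tomorrow, 60) :-
        next_day(T, Tomorrow),
        price(T, P), P < 60,
        volume(T, V), V < 90000.

    % Combining (1) and (2) yields the interval forecast (3).
    interval(Day, Low, High) :-
        lower_bound(Day, Low), upper_bound(Day, High).

The query ?- interval('01.04.99', L, H). yields L = 54.6, H = 60, i.e., 60 > StockPrice(01.04.99) > 54.6.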

The next generalization of relational data mining methods should handle (1) classification, (2) interval and (3) numerical forecasting tasks with noise. This is especially important in financial applications with numerical data and a high level of noise.

Hybridizing the pure logical RDM with a probabilistic approach ("probabilistic laws") is a promising direction here. This is done by introducing probabilities over logical formulas [Carnap, 1962; Fenstad, 1967; Vityaev, 1983; Halpern, 1990; Vityaev, Moskvitin, 1993; Muggleton, 1994; Vityaev et al., 1995; Kovalerchuk, Vityaev, 1998].

In contrast with the deterministic approach, in Hybrid Probabilistic Relational Data Mining background knowledge B is expressed as:

– A set of predicate definitions,

– Training examples E expanded with target values (nominal or numeric), and

– A set of probabilistic hypotheses expressed in terms of predicate definitions.

Using this background knowledge, a system constructs a set of predicate logic formulas H1, ..., Hn such that:

Any example in E is derived from B and the appropriate Hi probabilistically, i.e., in a statistically significant way.

Applying this approach to (3), 60 > StockPrice(01.04.99) > 54.6, we may conclude that although this inequality is true and is derived from Table 4.2, it is not a statistically significant conclusion. It may be a property of the training sample, which is too small. Therefore, it is risky to rely on statistically insignificant forecasting rules to derive this inequality.
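A minimal sketch of the kind of frequency check a hybrid system applies before trusting such a rule (the counts and threshold below are illustrative; MMDR's actual statistical test, described in Section 4.8, is more elaborate):

    % covered(Day): days on which a rule's premise held in the training
    % data. correct(Day): days on which its conclusion also held.
    covered(d1).  covered(d2).  covered(d3).
    correct(d1).  correct(d2).  correct(d3).

    % Empirical confidence of the rule over the training sample.
    confidence(C) :-
        findall(D, covered(D), Ds), length(Ds, N), N > 0,
        findall(D, (covered(D), correct(D)), Cs), length(Cs, K),
        C is K / N.

    % Reject the rule when its support is too small to be significant,
    % however high its confidence on the sample.
    significant(MinSupport) :-
        findall(D, covered(D), Ds), length(Ds, N),
        N >= MinSupport.

Here confidence(C) gives C = 1 (every covered day was forecast correctly), yet significant(30) fails for N = 3: a perfectly confident rule derived from three rows, like the one behind (3), is still rejected as a likely artifact of a small sample.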

The MMDR method (Section 4.8) is one of the few Hybrid Probabilistic Relational Data Mining methods developed [Muggleton, 1994; Vityaev et al., 1995; Kovalerchuk, Vityaev, 1998], and probably the only one that has been applied to financial data.

4.4. Challenges and obstacles in relational data mining
