
Chapter 8 Comparative Analysis of Ontology Charts

9.3 Why Separating Information from Data?

The previous section gives theoretical reasons for separating information from data. We will now examine more specific and practical reasons for such separation within the context of database systems.

9.3.1 Instances do not necessarily carry the information that results in their “types”

Data stored in databases are schemas and instances of these. It would seem plausible to think that a schema represents the “type” level information, and instances of the schema represent the information on the “token” level.

In this sense, it might seem reasonable and sound to take the data as the information in the database. However, it is exactly from this perspective that we want to argue that data is not information.

Everything related to the schema on the type level, i.e. the structure, constraints, and legitimate operations [30] on the data, contributes to the meaning of the data stored in the database. So a schema itself can be seen as “concepts” in the sense of Dretske ([7], p.214). A schema also captures the relationships between concepts, which are in a broad sense also concepts themselves.

Calvanese et al. [5] argue that a schema only determines necessary condition(s) for data to qualify as instantiating the schema, but not sufficient condition(s). According to Dretske, instances of a concept inherit everything from the concept, and the concept has the capability of “giving meaning” to its instances. But a schema cannot guarantee that the instances put into it are right or true. Forming a concept requires the right information, whereas deeming something an instance of the concept does not require that piece of right information.

For example, in cartography, a blue wiggly-lined area on a map represents a lake. If a map maker mistakenly puts a lake symbol on a map where there is no such lake in reality, the map misrepresents the geography of the area. It can do so only insofar as its elements are understood to have a meaning independent of their success in carrying information on any given occasion ([7], p.192).

“A symbol token fails to carry the information that, in virtue of the type of which it is a token, it is its job to convey.” ([7], p.193)
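The gap between satisfying a schema and carrying information can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the `LakeSymbol` schema, its range constraints, the lake names, and the `LAKES_IN_REALITY` set, which stands in for the actual geography that no schema can consult.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeSymbol:
    """Hypothetical schema for a lake symbol on a map (type level)."""
    name: str
    latitude: float
    longitude: float


def satisfies_schema(symbol: LakeSymbol) -> bool:
    """Necessary conditions only: a non-empty name and coordinates in range."""
    return (
        bool(symbol.name.strip())
        and -90.0 <= symbol.latitude <= 90.0
        and -180.0 <= symbol.longitude <= 180.0
    )


# Stand-in for the real world (invented values); the schema cannot see this.
LAKES_IN_REALITY = {("Lake A", 10.0, 20.0)}


def carries_information(symbol: LakeSymbol) -> bool:
    """True only if the token corresponds to a lake that actually exists."""
    return (symbol.name, symbol.latitude, symbol.longitude) in LAKES_IN_REALITY


real_lake = LakeSymbol("Lake A", 10.0, 20.0)
mistaken_lake = LakeSymbol("Lake B", 30.0, 40.0)  # drawn by mistake

for symbol in (real_lake, mistaken_lake):
    print(symbol.name, satisfies_schema(symbol), carries_information(symbol))
```

Both tokens are valid instances of the schema, yet only the first carries the information that its type is meant to convey. The schema check alone cannot tell them apart.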

Because of these special characteristics of types and tokens, or concepts and instantiations, and the complex relationships between them, problems will occur if data are simply taken as information. Instances do not always inherit exactly the information content of their concept. In this sense, instances cannot be relied upon as information.

9.3.2 Information content vs. literal/conventional meaning

Ultimately it is the information content of a database that the researcher and the user of the database are interested in, and this is what really matters. The “information content” of a conceptual data schema has been recognised to be “difficult to define and measure” [1], [18]. As a result, it would seem, many database researchers have concentrated their effort on data and their meaning, which are simple and straightforward to study, rather than digging into the complex and sometimes rather deep philosophical issues that studying information content would require.

Taking one step back, however, we see that the meaning of data in a database is taken as the information (content) that the data express. The literal or conventional meaning of data is probably the most convenient to obtain.

As a result, literal or conventional meaning of data is granted the status of information (content). We argue that the meaning of data is not necessarily their information content, i.e. what the data represent.

Following the ideas of OS, an information system is a system of signs. Signs are the bearers of information. A piece of information can be represented in various ways, and so the same piece of information can have various bearers. As long as the system offers some means to infer the information content from the bearer, all these bearers are valid and practically feasible. So we say that for a data construct to be capable of representing a piece of information, the information content of the data construct, when it is considered in isolation, must include the information content of that piece of information. The simplest case is that the literal or conventional meaning of the data construct is part of its information content and represents what is required. It seems too restrictive, unnecessary, and theoretically unsound for a database to impose the constraint that, for some data to represent a piece of information, the information content of the data has to be the literal or conventional meaning of the data.
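The idea that several bearers can represent the same piece of information may be sketched as follows. This is an illustration under stated assumptions and is not drawn from the cited literature: the piece of information is taken to be “the student passed the module”, and the three encodings, the pass mark of 40, and the decoder functions are all hypothetical.

```python
from typing import Any, Callable, List, Tuple

PASS_MARK = 40  # assumed rule known to the system, not stored with the data


def from_flag(value: bool) -> bool:
    """Bearer 1: a flag whose literal meaning is already 'passed'."""
    return value


def from_grade_letter(value: str) -> bool:
    """Bearer 2: a grade letter; 'passed' is inferred from the grading scale."""
    return value in {"A", "B", "C"}


def from_mark(value: int) -> bool:
    """Bearer 3: a numeric mark; 'passed' is inferred from the pass mark."""
    return value >= PASS_MARK


# Three different data constructs, each paired with the system's means
# of inferring the same piece of information from it.
bearers: List[Tuple[Any, Callable[[Any], bool]]] = [
    (True, from_flag),
    ("B", from_grade_letter),
    (65, from_mark),
]

inferred = [decode(value) for value, decode in bearers]
print(inferred)  # [True, True, True]
```

The literal meaning of `65` is “the mark is 65”, not “the student passed”, yet under the assumed rule its information content includes the latter. Only the first bearer has the required information as its literal meaning.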

9.3.3 An analysis of the RIC theory

As mentioned previously in the introduction, the relative information capacity is a typical example in database research where data (instances) are taken as information. We will analyse why this view is problematic and only valid within a rather narrow context.

The notion of information capacity preserving originates from the relative information capacity (RIC) theory [14]. The theory is concerned with information preserving mapping, dominance, and equivalence for simple conceptual data schemas, and it is used as a correctness measure for schema transformation with no information loss.

Four progressively less restrictive dominances were proposed. These dominances and equivalences are based on the existence of abstract functions of some particular type between the instances of a pair of schemas.

But it is not clear how to reason about the existence of such abstract functions, which are crucial for RIC. Miller et al. [21], [22], [23], [24] redefine the notions of absolute and internal dominance among the four and put forward the schema intension graph (SIG) data model with a view to enabling reasoning about the existence of the abstract functions. The literature would seem to show that the RIC theory and the SIG model are widely accepted and used [12], [16], [17], [19], [32].

We have studied the SIG formalism and have, we believe, a number of significant findings. The basic idea of information capacity preserving can be stated as follows: if every valid instance of a schema can be represented as a valid instance of another schema, and can be recovered from the latter, then the latter schema is said to dominate the former and to have a greater information capacity. For example, if two data schemas S1 and S2 are the same except that S2 has fewer constraints than S1, then S2 would be deemed to have a greater information capacity than S1 simply because all valid instances of S1 can be accommodated in S2.

This sounds plausible. However, on closer inspection, the fewer constraints a schema has, the less specific its instances are, and therefore the less informative they become. For example, a relationship with a “many to one” cardinality ratio is seen, following the RIC theory, as dominating one with a “one to one” cardinality ratio, because any instance of the latter can be stored as an instance of the former. That is, the former has a greater information capacity than the latter, i.e. the information capacity of the former includes that of the latter. However, an instance of the former is less specific than an instance of the latter, and therefore less informative. That a student is tutored by a professor in a “one to one” session is more specific, and therefore contains more information, than that a student is tutored by a professor in a “many to one” session, which includes “one to one” as a special case. This is because the latter involves more uncertainty (more possibilities) than the former.
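The cardinality argument above can be made concrete with a toy enumeration. This is a sketch under simplifying assumptions, not the formal RIC definitions of [14]: the domains are tiny and fixed, every student is assumed to have exactly one tutor, all instances are treated as equally likely, and the names `students`, `professors`, and the helper functions are hypothetical.

```python
from itertools import product
from math import log2

students = ["s1", "s2"]
professors = ["p1", "p2", "p3"]


def many_to_one_instances(students, professors):
    """All assignments student -> professor; a professor may be shared."""
    return [
        dict(zip(students, choice))
        for choice in product(professors, repeat=len(students))
    ]


def is_one_to_one(instance):
    """True if no professor tutors more than one student."""
    return len(set(instance.values())) == len(instance)


many_to_one = many_to_one_instances(students, professors)
one_to_one = [inst for inst in many_to_one if is_one_to_one(inst)]

# Instance-centric view: the identity mapping stores every one-to-one
# instance as a many-to-one instance, so the latter schema "dominates".
dominates = all(inst in many_to_one for inst in one_to_one)

# Information view: the one-to-one constraint rules out possibilities,
# so knowing that it holds removes uncertainty about the instance.
bits_from_constraint = log2(len(many_to_one) / len(one_to_one))

print(len(many_to_one), len(one_to_one), dominates)  # 9 6 True
print(round(bits_from_constraint, 3))
```

In this toy setting the less constrained schema admits nine instances against six, which is what the instance-centric view counts as greater capacity. The same difference means that an instance known to satisfy the one-to-one constraint leaves fewer possibilities open, which is the sense in which it is more informative.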

Thus we observe that the basic ideas behind RIC and SIG are instance-centric and inappropriately take data instances as the entire information that a data schema can provide. We argue that this viewpoint is questionable and unnecessarily restrictive, and that it only makes sense when what counts as a valid instance is narrowly defined. One example of such a narrowly defined valid instance is “a student is tutored by a professor”, without considering the constraint of the cardinality ratio mentioned above.

In conclusion then, through the above analysis, we argue that simply taking data as information in databases is questionable, inappropriate, imprecise, narrow-minded, and dated on some occasions.

9.4 Separating Data and Information Should Help Further
