Indexing in Spatial Databases - NUS Computing

Due to the volume of spatial data, it is very inefficient to precompute and store the spatial relationships between all data objects (although some proposals do store precomputed spatial relationships [LuH92, Rot91]). In the container approach, an object is represented by its minimum rectangle/circle (box/sphere), i.e. the smallest rectangle/circle (box/sphere) that encloses the object, and the real object is examined only if the test on the container passes. Another limitation is that the density (i.e. the number of objects that contain a point) in a map space must be less than the page capacity (i.e. the maximum number of objects that can be stored on a page).
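The container approach can be illustrated with a minimal sketch. It assumes objects are simple polygons given as vertex lists and that the query is a point; the helper names (mbr, rect_contains, polygon_contains, point_query) are hypothetical and not taken from any of the cited systems.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(polygon: List[Point]) -> Rect:
    """Smallest axis-aligned rectangle enclosing all vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def rect_contains(rect: Rect, p: Point) -> bool:
    """Cheap test on the container."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def polygon_contains(polygon: List[Point], p: Point) -> bool:
    """Exact test on the real object (ray casting); boundary points
    are not treated specially in this sketch."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(objects: List[List[Point]], p: Point) -> List[List[Point]]:
    """Examine the real object only if the container test passes."""
    hits = []
    for poly in objects:
        if not rect_contains(mbr(poly), p):
            continue  # filtered out by the container
        if polygon_contains(poly, p):
            hits.append(poly)
    return hits


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(point_query([triangle], (1.0, 1.0)))  # passes both tests
print(point_query([triangle], (3.0, 3.0)))  # passes the container test only
```

The second query shows why the exact test is still needed: the point lies inside the container but outside the object itself.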

When designing a spatial index, the quantities that should be minimized are: a) the coverage area of rectangles stored in internal nodes;

Figure 4 The evolution of the spatial indexes

Access Methods for Extended Spatial Objects

Binary-Tree based Indexing Techniques

The kd-Tree
Kd-tree Extensions with Paging Capability

To improve the paging capability of the kd-tree, the K-D-B-tree [Rob81], a combination of a kd-tree and a B-tree [BaM72, Com79], has been proposed. The K-D-B-tree adopts the paging of the B-tree and is consequently height-balanced. This causes duplication of a portion of the tree near the root in the parent index node.

Matsuyama's kd-tree is one of the earlier indexing structures that uses the object duplication approach.

Figure 5 The organization of data in a kd-tree

B-tree based Indexing Techniques

The R-Tree
The R-tree Extensions

Thus, entries belonging to a deleted node at the i-th level must be reinserted into nodes at the i-th level of the tree. The heuristic optimization criterion used in the R-tree is to minimize the area of the covering rectangles of internal nodes. The buddy tree, which can be considered a compromise between the R-tree and the grid file, was proposed by Seeger et al.
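As an illustration of the area-minimizing heuristic, the sketch below picks the entry whose covering rectangle needs the least area enlargement to include a new rectangle, breaking ties by smaller area. This is the usual reading of the R-tree insertion criterion; the function names and the tuple layout (xmin, ymin, xmax, ymax) are assumptions made for the example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(cover: Rect, new: Rect) -> float:
    """Extra area the covering rectangle needs in order to include `new`."""
    return area(union(cover, new)) - area(cover)


def choose_entry(covers: List[Rect], new: Rect) -> int:
    """Index of the entry needing least enlargement (ties: smaller area)."""
    return min(
        range(len(covers)),
        key=lambda i: (enlargement(covers[i], new), area(covers[i])),
    )


covers = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 22.0, 22.0)]
print(choose_entry(covers, (21.0, 21.0, 23.0, 23.0)))  # second entry grows least
print(choose_entry(covers, (1.0, 1.0, 2.0, 2.0)))      # already covered by the first
```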

Furthermore, zero overlap is only feasible at the leaf level of the R-tree with static construction. The algorithm stops when there are fewer than M objects left; these remaining objects are assigned to the root node of the R-tree. A slightly different way of packing is to include the enlargement criterion of the R-tree when selecting the next item.

The structure of the R+-tree is exactly the same as that of the R-tree, but the constraints are slightly different. The disjoint covering rectangles avoid the multiple search paths of the R-tree for point queries. In other words, some space ("dead space") within the current node cannot be covered by any of the covering rectangles of the entries in the node.

If the new object cannot be completely covered, one or more covering rectangles are split. This is due to the fact that storing data objects in internal nodes increases the height of the tree.

Figure 15 Objects covering in R+-trees

Quad-tree Based Structures

A convex polyhedral covering of an object is composed of a set of polyhedra, which better approximates irregularly shaped spatial objects. As in the R+-tree, a spatial object can be stored in more than one leaf node. One major problem with indexes that use object duplication is that each new object may have to be split into multiple pieces so that it can be stored in a tree whose internal-node bounding polygons do not overlap.

Each split of a node leads to a decrease in the node's data space, but an increase in the number of nodes per object. These oversize shelves are data nodes attached to internal nodes of the cell tree, which in a way makes the tree unbalanced in height. In earlier experiments on the R-tree, we stored large objects in an internal node for as long as possible, and these large objects are pushed down to the next level only when the node has access of the M subtree.

Unlike the quad-CIF tree representation, a point representing a rectangle can be used to store rectangles in a PR-quad tree [Sha86]. One problem with this representation is that virtually the entire tree must be searched for intersection queries, which is due to the fact that rectangles can extend over any part of the space.
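The search problem can be seen in a small sketch. The text does not say which point is used to represent a rectangle; the example assumes the centre point, stored together with the half-extents, and the names to_point and intersects are hypothetical.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]       # (xmin, ymin, xmax, ymax)
RectPoint = Tuple[float, float, float, float]  # (cx, cy, half_width, half_height)


def to_point(rect: Rect) -> RectPoint:
    """Represent a rectangle by its centre plus half-extents (an assumption)."""
    xmin, ymin, xmax, ymax = rect
    return ((xmin + xmax) / 2, (ymin + ymax) / 2,
            (xmax - xmin) / 2, (ymax - ymin) / 2)


def intersects(p: RectPoint, q: RectPoint) -> bool:
    """Two rectangles intersect iff their centres are close enough
    relative to the sum of their half-extents in each dimension."""
    return (abs(p[0] - q[0]) <= p[2] + q[2]
            and abs(p[1] - q[1]) <= p[3] + q[3])


query = to_point((3.0, 1.0, 5.0, 5.0))
print(intersects(to_point((0.0, 0.0, 4.0, 2.0)), query))      # overlapping
print(intersects(to_point((10.0, 10.0, 11.0, 11.0)), query))  # disjoint
```

Because the test depends on each stored rectangle's own extent, the position of the representative point alone does not bound where intersecting rectangles can lie, which is why large parts of the tree have to be visited.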

Cell Methods based on Dynamic Hashing

In [NiH85], the grid file was proposed as a means for spatial indexing of non-point objects. The EXCELL method of Tamminen [Tam82a, Tam82b] is an independent work that uses the same approach as that of the grid file. In [HSW88a, HSW88b], the twin grid file access method was proposed to improve the low storage utilization of grid files by using two grid files, called the primary grid file and the secondary grid file.

If a bucket overflows, a point must be transferred from the primary grid file to the secondary grid file, provided the secondary grid file has space for the point. Deletions can cause a bucket to underflow in one of the grid files; points must then be moved if two matching buckets cannot be merged. Points can therefore be transferred from the secondary grid file to the primary grid file, or vice versa, after each insertion or deletion.

In [HSW88b], the primary grid file is given preference over the secondary grid file if storing a point does not increase the number of data buckets. Consider Figure 22; let the dotted boundaries be those of the primary grid file and the page capacity be 3. To improve the search performance of the grid file, a multi-layer grid file that avoids object allocation was proposed in [SiW88].

In the R-file, cells are split using the grid file split strategy [NHS84], and a cell is split when it overflows. In [KrS88], a grid file extension was proposed for storing non-zero size objects.

Figure 19 illustrates a three-dimensional grid object space. A three-dimensional array, dir(1..3, 1..3, 1..2), is required to store the grid entries, and the description of an entry may be obtained once the scales are known.
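A minimal sketch of how the scales locate a directory entry is given below. The boundary values and bucket labels are made up for illustration, the array is indexed from 0 rather than 1, and every cell is given its own bucket, whereas in a grid file several cells may share one.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical scales: the interior partition boundaries of each dimension.
# Two boundaries give three intervals, one boundary gives two, matching the
# 3 x 3 x 2 shape of dir(1..3, 1..3, 1..2).
scales: List[List[float]] = [[10.0, 20.0], [5.0, 15.0], [50.0]]

# Hypothetical directory: one bucket label per grid cell.
directory = [
    [[f"bucket_{i}{j}{k}" for k in range(2)] for j in range(3)]
    for i in range(3)
]


def cell_of(point: Sequence[float], scales: List[List[float]]) -> Tuple[int, ...]:
    """Map each coordinate to the index of the interval that contains it."""
    return tuple(bisect_right(s, c) for s, c in zip(scales, point))


def lookup(point: Sequence[float]) -> str:
    """Return the directory entry for the cell containing the point."""
    i, j, k = cell_of(point, scales)
    return directory[i][j][k]


print(cell_of((12.0, 3.0, 60.0), scales))  # (1, 0, 1)
print(lookup((12.0, 3.0, 60.0)))           # bucket_101
```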

Spatial Objects Ordering

One can notice that, when locating the location keys of the same level, the ordering is a form of N or Z ordering. However, even if this approach is adopted, the size of the representative quadrant can still be much larger than the size of the object's bounding rectangle. The major advantage of using the location key is that B+ tree structures are widely supported by conventional DBMSs.

The z-value of an element is a binary value obtained by interleaving the bits that describe the element in each dimension. A data object is contained in the query rectangle if and only if the data object falls in all of the query ranges. Overlapping rectangles can be detected using the so-called spatial join [Ore86], which is a simple extension of the natural join.
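Bit interleaving can be sketched as follows for two dimensions. The sketch assumes non-negative integer cell coordinates of a fixed bit width and places the x bit before the y bit at each position; the order is a convention and may differ in the cited work.

```python
def z_value(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x and y, most significant first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)  # bit i of x
        z = (z << 1) | ((y >> i) & 1)  # bit i of y
    return z


# x = 3 (011) and y = 5 (101) interleave to 01 10 11 = 011011.
print(format(z_value(3, 5, 3), "06b"))  # 011011
```

Sorting cells by their z-values visits them in the Z-shaped order, so the values can be stored in an ordinary one-dimensional index.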

In [Ore90] it was shown that the performance of native space implementation of the overlap query is better than that of the parameter space implementation in the context of z-order based indexing methods. The query rectangle is then transformed into a set of serial queries in the final one-dimensional space. That is, the level of the hierarchy of grid regions is generated from the previous higher level by partitioning along a selected dimension.

After some insertions, R1's data page overflows, requiring the grid cell to be partitioned. Suppose such a grid region is divided into four sub-grid regions and one of them, say R2, contains almost half of the data items.

Performance Analysis

Approaches to Performance Comparison

When we talk about object sizes, we should limit ourselves to the application domain and consider only relative size. In distributions with many large objects, indexes based on object boundaries and object duplication are strongly affected. Nevertheless, as stated in [Fra91], the general rule of "data volume constant" applies to an organization, based on the fact that for smaller application areas the data collected is more detailed.

With such a large amount of data and so many indexes at our disposal, it is not practical to test every index on these datasets. The parameters are interrelated, and to see the effect of one on the others, we need to consider all combinations of parameter values. But during the analysis of the results, we can only consider one parameter at a time against the cost incurred.

Figure 29 The BANG file numbering scheme

Comparison Study

A subset of the data, e.g. a small area of a circle, is sufficient, provided it is large enough to yield indexes of reasonable height. Its nodes store less information than those of the skd-tree, which accounts for a smaller directory size. In [FSR87], the behavior of the R-tree and the R+-tree in representing one-dimensional intervals of equal length was analyzed by transforming the intervals into points in two-dimensional space.
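The transformation of intervals into points can be sketched as follows. The mapping used here, (start, end) as the two coordinates, is an assumption made for illustration; the text does not state which mapping [FSR87] uses.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), with start <= end


def interval_to_point(interval: Interval) -> Tuple[float, float]:
    """Map a one-dimensional interval to the 2-d point (start, end)."""
    start, end = interval
    return (start, end)


def intersecting(points: List[Tuple[float, float]], qs: float, qe: float):
    """Intervals meeting [qs, qe] are the points with start <= qe and end >= qs,
    i.e. a quarter-plane region in the transformed space."""
    return [p for p in points if p[0] <= qe and p[1] >= qs]


points = [interval_to_point(iv) for iv in [(0.0, 2.0), (3.0, 5.0), (6.0, 8.0)]]
print(intersecting(points, 2.0, 3.0))  # the first two intervals
```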

In [OSM91] we used the same approach as in the previous subsection to compare the performance of the skd-tree and the R-tree. The results indicate that the skd-tree is a more efficient structure than the R-tree [Gut84], with almost the same storage requirements. Tests were conducted on six maps of the counties of Maryland, United States, with each map containing approximately 50,000 line segments.

These queries include finding all line segments incident at a given point and at the other endpoint of a given line segment, the closest line segments to a given point, the MBR of line segments containing a given point, and all line segments within a given rectangular window. The difference in performance is not very large, although the PMR quadtree has a slight advantage over the other two, and the R+-tree is slightly better than the R*-tree due to the disjoint decomposition of the line segments. However, the study did not claim convincing superiority for any of the three indexes tested.

For small MBRs (less than 0.02% of the map area) and medium MBRs (less than 0.1% of the map area), the R*-tree and the R+-tree perform better than the R-tree, with the R+-tree slightly more efficient than the R*-tree. However, for large MBRs (less than 0.5% of the map area), the R+-tree becomes less efficient than the other two due to the additional level caused by duplications.

Gunther: The design of the cell tree: an object-oriented index structure for geometric databases.

Sakauchi: MD tree: A balanced hierarchical data structure for multidimensional data with highly efficient dynamic characteristics.

Ooi: Efficient query processing in geographic information systems (a revised version appears in: Lecture Notes in Computer Science #471, Springer-Verlag, 1990).

Kriegel: Buddy-tree: An efficient and robust access method for geographic database systems.