Ranking in a Word-based Block Sorting Compression Revisited



Abstract— In word-based block-sorting text compression, ranking is a reversible stage that exploits the locality of reference produced by applying the Burrows-Wheeler Transform to a sequence of tokens. MTF (move-to-front) is the simplest and most popular way to implement ranking, but it is not compulsory. Many alternative ways of implementing ranking exist, and they are described and revisited in this paper. Isal has shown that using the Fibonacci sequence to partition the sizes of the splay trees in the ranking stage gives better compression effectiveness for large input text files. This paper aims to find a real number r, r > 1, such that partition sizes determined by a geometric sequence r^k give better compression effectiveness. The effectiveness of the ranking strategies is compared by calculating the zero-order self entropy of their output.

I. INTRODUCTION

A complete reference with a thorough discussion of block-sorting data compression, including word-based block-sorting text compression, can be found in the recent book by Adjeroh, Bell and Mukherjee [11]. A word-based block-sorting text compression mechanism has been proposed by Isal and Moffat [7], [9]. It consists of four stages applied to an input text file, as depicted in Figure 1: parsing the input text file into a sequence of tokens; permuting the sequence of tokens by using the Burrows-Wheeler Transform (BWT); ranking the symbols in the permuted sequence of tokens; and assigning a code to each symbol by using any entropy coder of choice. For the parsing stage, Isal and Moffat have proposed the implicit-dictionary spaceless word parsing method because of its more efficient representation of the dictionary and the sequence of tokens [6]. In the word-based model, the BWT stage is carried out by calling a standard sorting procedure, as in the character-based model, except that the set of tokens/symbols is large.

The author wishes to thank the Faculty of Computer Science and DRPM, The University of Indonesia, for the financial support that made this research possible.

Figure 1. Word-based block-sorting mechanism for text files (taken from Isal et al., 2010 [8]).

A word-based model immediately introduces the problem of a large symbol alphabet, which requires proper handling in both time and space complexity.

For the ranking stage, Isal has proposed a forest of splay trees as the data structure for implementing the ranking method. The sizes of the splay trees can be determined by various growth functions, and Isal and Moffat have also experimented with various ways of promoting and demoting the symbol being ranked. Some results have been reported, for example, on promoting the symbol being ranked halfway to the front [8]. Isal has also reported that using the Fibonacci sequence to determine the sizes of the splay trees improves the overall compression effectiveness [9].

A ranking method calculates the rank value of a symbol. Given a sequence of tokens/symbols, it transforms the sequence into a sequence of ranks.

After the BWT is applied to a sequence of tokens, the output sequence is said to have locality of reference (repetitions of some symbols over a short period of time), and applying ranking to a sequence of this type produces a sequence of ranks with a more skewed probability distribution. Applying a ranking method to a block-sorted sequence of symbols transforms it into another sequence of symbols over the same alphabet, but with a different probability distribution of symbols.

In [8], it has been shown that different ranking strategies produce similar probability distributions, and that some methods consistently produce better compression results. In this paper, an example of processing a small part of the input sequence is presented to show further why one method is better than the others.

R. Yugo Kartono Isal

Faculty of Computer Science, University of Indonesia. Email: [email protected]

Table 1. Zero-order self entropy (in bits/symbol) of test files using modified ranking methods, with different partition arrangements. The test files are parsed using the implicit-dictionary spaceless word model, and then have the BWT applied.

In [9], it has been shown that the use of the Fibonacci sequence to determine the sizes of the splay trees in the ranking stage produces better compression effectiveness for large input text files. The results of using various partition sizes have been reported and are presented again in Table 1 as a reminder. Essentially, all functions used to partition the sizes of the splay trees in Table 1 are geometric sequences with a ratio r. In this paper, an experiment is conducted to find a real number r, r > 1, such that partition sizes determined by a geometric sequence r^k give better compression effectiveness. As in [9], the input text files are taken from the Calgary and Canterbury corpora and from the Wall Street Journal.
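As a rough illustration of what partition sizes drawn from a geometric sequence r^k could look like, the following sketch generates tree sizes until a given number of symbols is covered. The function name `geometric_sizes` and the rounding rule (taking the ceiling of r^k) are assumptions made here for illustration only; the text does not state how fractional sizes are rounded.

```python
import math

def geometric_sizes(r, total):
    """Hypothetical helper: tree sizes ceil(r**k) for k = 0, 1, 2, ...

    Sizes are generated until their sum covers `total` symbols.
    The ceiling rounding is an assumption, not taken from the paper.
    """
    if r <= 1:
        raise ValueError("r must be greater than 1")
    sizes, covered, k = [], 0, 0
    while covered < total:
        size = math.ceil(r ** k)
        sizes.append(size)
        covered += size
        k += 1
    return sizes

# Example: a ratio of 2 gives doubling tree sizes.
print(geometric_sizes(2, 15))    # [1, 2, 4, 8]
print(geometric_sizes(1.5, 10))  # [1, 2, 3, 4]
```

A smaller ratio yields more, smaller trees for the same alphabet, which is the trade-off the experiment explores.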

Move-To-Front (denoted as mtf in Table 1), implemented with an array, is the mechanism commonly used in the RANKING stage, and was first introduced by Bentley et al. [2]. The mtf method processes one symbol x at a time: it reads x from the input sequence and returns a rank equal to one plus the number of distinct other symbols that have occurred since the last appearance of x. The returned value can be computed by searching for x in the array, returning the position of x in the array, moving x to the front of the list (position 1), and shifting all other symbols that were in front of x one position to the right. In the word-based block-sorting mechanism, the large number of distinct symbols stored in the array makes both searching and shifting very slow.
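The array-based procedure just described can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names and the choice of initial list order (the order in which the alphabet is supplied) are assumptions.

```python
def mtf_ranks(tokens, alphabet):
    """Return 1-based move-to-front ranks for `tokens`.

    `alphabet` gives the assumed initial order of the list.
    """
    table = list(alphabet)
    ranks = []
    for x in tokens:
        pos = table.index(x)             # linear search for x
        ranks.append(pos + 1)            # position 1 is the front
        table.insert(0, table.pop(pos))  # move x to front; others shift right
    return ranks

def mtf_inverse(ranks, alphabet):
    """Recover the tokens from their ranks (the reverse transformation)."""
    table = list(alphabet)
    tokens = []
    for r in ranks:
        x = table.pop(r - 1)
        tokens.append(x)
        table.insert(0, x)
    return tokens

ranks = mtf_ranks([3, 3, 1, 3], [1, 2, 3])
print(ranks)                          # [3, 1, 2, 2]
print(mtf_inverse(ranks, [1, 2, 3]))  # [3, 3, 1, 3]
```

Both the search and the shift are linear in the alphabet size, which is why this approach becomes slow for word-based alphabets.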

Initially designed to overcome the time problem, the forest of splay trees was proposed by Isal and Moffat as the data structure for implementing the RANKING stage [5]. A splay tree is a self-adjusting binary search tree with very good amortized efficiency over an arbitrary but sufficiently long sequence of retrievals [1]. In [5], a modification of MTF is presented in which k splay trees are used, each of which stores a disjoint subset of the set S of all alphabet symbols. Surprisingly, some variants of this modified mtf also improve compression effectiveness.

To compare the effectiveness of various ranking strategies, the zero-order self entropy of the transformed sequence is calculated. For a sequence of symbols in which symbol i appears f_i times, the zero-order self entropy of the sequence is calculated as:

$$H = -\sum_{i=1}^{n} \frac{f_i}{N} \log_2 \frac{f_i}{N}, \qquad \text{where } N = \sum_{j=1}^{n} f_j$$

Assuming an ideal coder is available, H can be interpreted as the average number of bits per symbol required to represent the sequence.
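The entropy measure can be computed directly from symbol frequencies. The sketch below is a straightforward reading of the definition above; the function name is chosen here for illustration.

```python
import math
from collections import Counter

def zero_order_entropy(seq):
    """Zero-order self entropy of `seq`, in bits per symbol."""
    counts = Counter(seq)        # f_i for each distinct symbol i
    n = sum(counts.values())     # N, the sequence length
    return -sum((f / n) * math.log2(f / n) for f in counts.values())

print(zero_order_entropy([1, 1, 2, 2]))  # 1.0
print(zero_order_entropy([1, 2, 3, 4]))  # 2.0
```

A sequence of ranks dominated by a few small values gives a lower H than the original token sequence, which is what the ranking stage aims for.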

It has been shown in [9] that the modified mtf strategies using various partitions produce a probability distribution that is different from (but quite similar to) that of exact mtf, and that the partition fib (the Fibonacci sequence) consistently produces better results as measured by the zero-order self entropy of the transformed sequence of integers, followed closely by sseg, while dseg consistently produces worse results, for all test files. Compared to mtf, the partition sseg also produces better results, especially for large test files. The partition arith also produces good zero-order self entropy; it is better than mtf and comparable to sseg. However, since arith uses more splay trees, its running time is about 10 times that of sseg [9].

II. RANKING IN ACTION: MTF AND ITS ALTERNATIVES

In this paper, several modifications of the approximate ranking are presented, and the partition fib is chosen as the reference. To avoid confusion, a ranking strategy is denoted by the pair consisting of the partition sizes used and the promotion strategy. For example, the ranking methods presented in Table 1 use the full evaluation promotion strategy described previously, and they are denoted as dseg-F, sseg-F, fib-F, and arith-F, respectively.

To show that the promotion strategies produce different sequences of integers/ranks, the outputs of the ranking strategies on the same sequence of integers can be observed in more detail. Table 2 shows the outputs of various modifications of fib-F on the same input text. The input is taken from the last 200 bytes of wsj20, after being parsed using the implicit spaceless word method. The words/non-words in the dictionary are given in the first column, and the corresponding integer tokens (based on the order of appearance as recorded in the dictionary) are given in the second column. The remaining columns show the ranks of the input tokens produced by the promotion strategy named in each column heading. The last 200 bytes of text from the input wsj20 are the following:

“patent monopoly is sharply cut, as happens under systems of compulsory patent licensure, ”.

The integer token in the first row of Table 2 is 9235, and the ranks of the token under the different ranking strategies are shown in the subsequent columns of the same row. Observe that the word “patent”, with token number 9235, appears twice in the text, and that the word “licensure” is a newly found word that is spelled out character by character and ended by a special symbol (token 256 in Table 2).

According to the implicit dictionary, the new word was assigned the token 71929, the last word in the input message. When the integer token 9235 (representing the word “patent” in the first row of Table 2) is accessed, its current rank varies between strategies, depending upon the current state of the splay trees being maintained. Each modification of fib-F has a different promotion strategy, and a different way of moving nodes within the forest.

| Input file | BWT only | mtf | Partition dseg | Partition sseg | Partition fib | Partition arith |
|---|---|---|---|---|---|---|
| grammar.lsp | 6.09 | 5.04 | 5.11 | 5.04 | 5.04 | 5.07 |
| xargs.1 | 6.28 | 5.46 | 5.52 | 5.46 | 5.45 | 5.46 |
| fields.c | 6.66 | 4.86 | 4.91 | 4.87 | 4.85 | 4.86 |
| cp.html | 6.42 | 4.61 | 4.75 | 4.62 | 4.61 | 4.60 |
| sum | 5.98 | 3.97 | 4.02 | 3.96 | 3.95 | 3.96 |
| asyoulik.txt | 7.33 | 6.45 | 6.55 | 6.38 | 6.37 | 6.38 |
| alice29.tex | 7.65 | 6.46 | 6.56 | 6.40 | 6.39 | 6.40 |
| lcet10.txt | 7.99 | 6.78 | 6.88 | 6.70 | 6.69 | 6.69 |
| plrabn12.txt | 7.73 | 6.98 | 7.06 | 6.86 | 6.86 | 6.87 |
| world192.txt | 9.09 | 5.65 | 5.82 | 5.60 | 5.59 | 5.60 |
| bible.txt | 8.45 | 6.50 | 6.66 | 6.39 | 6.39 | 6.39 |
| wsj20 | 10.19 | 7.52 | 7.73 | 7.37 | 7.36 | 7.37 |

Table 2. Rank values produced by various modifications of fib-F, where the input is taken from the last 200 bytes of wsj20. The threshold in fib-T was set to 16.

The various promotion strategies are denoted by different suffixes: F for Full promotion; N for Neighbor; H for Halfway; S for Skipping; T for Threshold; and C for Counting. For example, in the Halfway strategy, an accessed symbol x is searched for in the splay trees starting from T_0; if x is found in tree T_k, then the rank value of x is computed and output, x is promoted as a new node in tree T_{k/2}, and a cascade of deletions and insertions of symbols from tree T_{k/2} up to tree T_k restores the tree sizes. The description of all promotion strategies can be found in [8].
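The promote-and-cascade step can be pictured with a simplified, list-based analogue. The sketch below is not the splay-tree method of [8]: each partition is an ordinary queue ordered from newest to oldest, ranks are exact positions in the concatenated partitions, and the names `partitioned_ranks`, `sizes` and `target` are introduced here purely for illustration. Integer division `k // 2` is assumed for the halfway target.

```python
from collections import deque

def partitioned_ranks(tokens, alphabet, sizes, target):
    """Rank tokens over a list of partitions with cascading promotion.

    `sizes` are the partition capacities; `target(k)` returns the index
    of the partition that receives a symbol found in partition k.
    """
    symbols = list(alphabet)
    parts, start = [], 0
    for size in sizes:
        parts.append(deque(symbols[start:start + size]))
        start += size
    ranks = []
    for x in tokens:
        offset = 0
        for k, part in enumerate(parts):   # search from the first partition
            if x in part:
                break
            offset += len(part)
        else:
            raise KeyError(x)
        ranks.append(offset + part.index(x) + 1)
        part.remove(x)
        j = target(k)
        parts[j].appendleft(x)             # x becomes the newest symbol there
        for i in range(j, k):              # cascade restores partition sizes
            parts[i + 1].appendleft(parts[i].pop())
    return ranks

halfway = lambda k: k // 2
print(partitioned_ranks([6, 6, 6], range(1, 8), [1, 1, 2, 3], halfway))
# [6, 2, 1]
```

With `target = lambda k: 0` this analogue reduces to exact move-to-front, so it only illustrates how the choice of target partition changes the ranks of later accesses, not the approximate ranks produced by the splay-tree variants.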

Despite having the same partition sizes and initialization, after a while the information about which symbols are currently kept in which trees differs among the promotion strategies. For example, the rank produced by mtf for token 9235 is 33, which means that 32 other distinct words and/or non-words appeared in the input message after the last appearance of 9235. The way mtf calculates the rank of a symbol x is exact: it returns a rank that is one more than the number of other distinct symbols seen since x was last accessed. On the reappearance of 9235 (in the thirteenth row of Table 2), mtf returns 12 as its rank, since 11 other distinct symbols have appeared between the current and the last appearance of token 9235. The largest rank value for token 9235 in the first row is 504, produced by fib-N, and the next appearance of 9235 is given rank 253 (about half of its previous rank value). This is sensible, since on the previous access 9235 was promoted to the neighboring splay tree, whose size is half the size of the current splay tree.

For the same token, fib-H produces rank 15 for the occurrence of 9235 in the first row, and rank 3 for the next occurrence (even though 11 other distinct symbols appeared after the last appearance of 9235). This is perhaps one possible improvement of fib-H over the traditional mtf rank calculation. In Table 2, the strategies fib-F and fib-T produce exactly the same sequence of ranks; this is because all the tokens in the example are within the range of symbols to be processed, and both perform the full evaluation strategy.

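| Word/non-word | Token | mtf | fib-F | fib-N | fib-H | fib-S | fib-T | fib-C |
|---|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| patent | 9235 | 33 | 58 | 504 | 15 | 466 | 58 | 31 |
| monopoly | 12638 | 50 | 62 | 1967 | 120 | 477 | 62 | 503 |
| is | 320 | 34 | 35 | 5 | 21 | 274 | 35 | 23 |
| sharply | 1687 | 3103 | 2287 | 1251 | 2283 | 65597 | 2287 | 2290 |
| cut | 3998 | 612 | 785 | 461 | 783 | 17313 | 785 | 803 |
| , | 276 | 46 | 37 | 4 | 29 | 262 | 37 | 34 |
| happens | 14316 | 2941 | 3497 | 3799 | 3495 | 37621 | 3497 | 3632 |
| under | 404 | 1077 | 1062 | 281 | 1063 | 16420 | 1062 | 534 |
| systems | 1660 | 3404 | 2279 | 1244 | 2275 | 16722 | 2279 | 2281 |
| of | 357 | 12 | 11 | 5 | 13 | 18 | 11 | 6 |
| compulsory | 45435 | 104 | 126 | 4076 | 127 | 245 | 126 | 510 |
| patent | 9235 | 12 | 14 | 253 | 3 | 24 | 14 | 7 |
| l | 108 | 199 | 129 | 34 | 130 | 123 | 129 | 130 |
| i | 105 | 40 | 35 | 33 | 18 | 257 | 35 | 18 |
| c | 99 | 40 | 32 | 17 | 16 | 253 | 32 | 16 |
| e | 101 | 38 | 33 | 9 | 16 | 254 | 33 | 16 |
| n | 110 | 200 | 129 | 33 | 130 | 123 | 129 | 130 |
| s | 115 | 38 | 35 | 9 | 10 | 125 | 35 | 18 |
| u | 117 | 964 | 520 | 264 | 520 | 32790 | 520 | 521 |
| r | 114 | 40 | 34 | 21 | 16 | 124 | 34 | 17 |
| e | 101 | 5 | 5 | 4 | 5 | 13 | 5 | 5 |
|  | 256 | 40 | 35 | 21 | 17 | 6 | 35 | 17 |
| , | 276 | 17 | 17 | 5 | 12 | 16 | 17 | 11 |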

III. EXPERIMENTS

In the previous section, we have seen that using different partition sizes for the splay trees also results in different probability distributions, as measured by the order-0 self entropy of the output sequence of ranks. Using the partition sizes fib, the various promotion strategies are tested on the test files after the implicit spaceless word parsing and the BWT are applied. Different ranks returned by different promotion strategies may result in different probability distributions of the output sequences of ranks. To compare the efficacy of the ranking strategies, the running times are also presented. The results are given in the following.

Table 3 lists the effects of applying different promotion strategies to test files, by using partition sizes fib. By exploring various heuristic promotion strategies, it is possible to achieve a slight improvement over the basic promotion strategy fib-F.

Table 3. Zero-order self entropy (in bits/symbol) of test files using different ranking strategies, using fib partition sizes. The test files are parsed using the implicit-dictionary spaceless word model, and then processed using the BWT. The threshold in fib-T was set to 16.

The input files in the first column are parsed using the implicit-dictionary spaceless word model and BWT transformed. The input files are tested against various ranking strategies, and their effectiveness is compared by calculating the zero-order self entropy, measured in bits per symbol. The smallest bits-per-symbol rate in each row is highlighted, to indicate the ranking strategy that produces the most effective result for that particular test file.

The mtf strategy produces the most compressible output sequence only for the smallest test file, grammar.lsp; as the file size grows, variants of fib-F produce better results. As shown in Table 3, fib-H is perhaps the strategy that produces the best overall effectiveness, especially on large input files. Table 3 shows that the larger the input file, the wider the improvement gap made by fib-H relative to mtf.

Compared to fib-F, a slight improvement in compressibility is achieved by fib-H, and Figure 2 illustrates why. The probability distribution of fib-H is smoother on the first few splay trees (like mtf), and starts having saw-tooth patterns on the rest of the splay trees (like fib-F). Perhaps this is a good combination of the ranking effectiveness between mtf ranking (which is best for small file sizes, or symbols stored on the first few splay trees) and the fib-F ranking (which is best for larger file sizes).

Even though fib-N produces the most compressible version of plrabn12.txt, the improvement gap is insignificant. Its ranking effectiveness is worse than that of mtf on small files, and worse than that of fib-F on larger files. Overall, fib-N produces less effective ranking, perhaps because it is too slow to promote an accessed symbol.

Figure 2. Probability distribution of ranks for WSJ20 by using fib-F and fib-H.

The fib-S variant also produces worse results than fib-F, and achieves no improvement on any input file. Promotion is always guaranteed for an accessed symbol, but demotion only happens in the splay trees traversed during the cascading deletes and inserts. The worse results are perhaps caused by the unwarranted demotion of some symbols in some trees, because they are skipped over during the ranking process.

The strategy fib-C never achieves the most effective ranking on any of the input test files, but it still achieves a slight improvement over fib-F on several files. The fib-T strategy is a variant of fib-F designed for better efficiency, and in this experiment it produces the same effectiveness as fib-F. This is because the threshold chosen for fib-T in the experiment covered all the integer symbols. As will be shown later, choosing a smaller threshold for fib-T results in better efficiency at the cost of ranking effectiveness.

Comparing variants of fib-F is only fair when the running times of the compared methods are also presented. Figure 3 displays the ranking effectiveness results from Table 3 together with the time spent on the ranking processes (the sum of the forward and reverse transformations). Various partitions are denoted by solid boxes; various promotion strategies using the fib partition are denoted by gray solid boxes; and the points for fib-T with various thresholds are connected by a line.

| Input file | mtf | fib-F | fib-N | fib-H | fib-S | fib-T | fib-C |
|---|---|---|---|---|---|---|---|
| grammar.lsp | 5.04 | 5.04 | 5.56 | 5.16 | 5.17 | 5.05 | 5.28 |
| xargs.1 | 5.46 | 5.45 | 6.00 | 5.61 | 5.63 | 5.47 | 5.74 |
| fields.c | 4.86 | 4.85 | 5.72 | 5.12 | 5.05 | 4.86 | 5.32 |
| cp.html | 4.61 | 4.61 | 4.91 | 4.59 | 4.85 | 4.61 | 4.67 |
| sum | 3.97 | 3.95 | 4.43 | 4.00 | 4.11 | 3.96 | 4.15 |
| asyoulik.txt | 6.45 | 6.37 | 6.41 | 6.27 | 6.71 | 6.38 | 6.31 |
| alice29.tex | 6.46 | 6.39 | 6.48 | 6.29 | 6.72 | 6.39 | 6.33 |
| lcet10.txt | 6.78 | 6.69 | 6.73 | 6.57 | 6.98 | 6.69 | 6.61 |
| plrabn12.txt | 6.98 | 6.86 | 6.65 | 6.63 | 7.13 | 6.86 | 6.64 |
| world192.txt | 5.65 | 5.59 | 6.09 | 5.59 | 5.87 | 5.59 | 5.70 |
| bible.txt | 6.50 | 6.39 | 6.45 | 6.24 | 6.74 | 6.39 | 6.29 |
| wsj20 | 7.52 | 7.36 | 7.46 | 7.19 | 7.75 | 7.36 | 7.25 |

Figure 3. Results of different ranking strategies for file wsj20 when parsed using the implicit-dictionary spaceless word mechanism, and then transformed using the BWT. Time spent is reported as the sum of forward and reverse processes, in CPU seconds on a 2.4 GHz Intel Pentium 4 (Xeon) with 512 KB on-die L2 cache and 1 GB RAM running Debian/Linux 2.4.

The fastest ranking strategy is MTF, and it is the best choice if speed is the most important consideration. If compression effectiveness is more important than speed, then fib-H is the best choice. A compromise balance between efficiency and effectiveness is offered by fib-N and fib-C.

The effectiveness of fib-C falls between that of fib-N and fib-H. Even though fib-C does more computation to decide where the accessed symbol should be promoted, it runs faster than fib-H. This suggests that for the input file wsj20, on average over the whole ranking process, accessed symbols are promoted less than halfway. The gray boxes connected by a line are fib-T, run with threshold values of, from left to right, 12, 14, 16, 18, 20 and 22. Increasing the threshold results in better effectiveness, but a slower run. This is another compromise for balancing efficiency and effectiveness, and its effectiveness approaches that of fib-F.

Table 4 shows some statistics of the probability distributions of the sequences of ranks produced by various ranking strategies using the partition sizes fib. The input is wsj20, parsed with the implicit-dictionary spaceless word mechanism, and then transformed using the BWT. The data presented in Table 4 explain why a particular strategy is more effective than others. The first row lists the zero-order self entropy of the sequence of ranks produced by the ranking strategy in each column. The remaining rows show the cumulative probability of rank 1, and the cumulative probabilities of the first k ranks, where k = 2, 4, 7, 12, ..., 2583 and 4180. These numbers were chosen because they also represent the cumulative probability of ranks up to a certain tree, where k = prev(h), for h ≥ 0. For example, the row for k = 12 gives the cumulative probability of ranks/symbols in trees T0 up to T4.
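The values of k listed above are the running totals of the Fibonacci tree sizes 1, 1, 2, 3, 5, ..., which can be checked with a few lines of code. The helper name `fibonacci` is introduced here for illustration.

```python
from itertools import accumulate

def fibonacci(n):
    """First n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    a, b, out = 1, 1, []
    for _ in range(n):
        out.append(a)
        a, b = b, a + b
    return out

sizes = fibonacci(17)                 # sizes of trees T0 .. T16
boundaries = list(accumulate(sizes))  # largest rank covered up to each tree
print(boundaries[:10])   # [1, 2, 4, 7, 12, 20, 33, 54, 88, 143]
print(boundaries[-3:])   # [1596, 2583, 4180]
```

The fifth boundary, 12, is the total size of trees T0 up to T4, matching the example given in the text.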

Table 4 shows some statistics from applying different ranking strategies to the file wsj20, and is divided into three parts. The first part shows that the best and second-best ranking strategies in terms of effectiveness are fib-H and fib-C, respectively. The next parts of Table 4 explain why the effectiveness of the ranking strategies varies. The shaded numbers in the second part of the table show the highest cumulative probabilities achieved in each row (up to a certain tree), which are dominated by fib-H and fib-C. It turns out that effective ranking strategies are those which can “bring” symbols forward such that, when symbols are accessed, they are found as close to the front as possible. Comparing the cumulative probabilities of ranking strategies can therefore be useful for observing their effectiveness.

Table 4. Zero-order self entropy (in bits/symbol) and cumulative probabilities of ranks (up to certain trees) of different ranking strategies, using fib partition sizes. The test files are parsed using the implicit-dictionary spaceless word model, and then BWT transformed. Entropy values are in bits per symbol; cumulative probabilities are given as percentages of ranks. The threshold in fib-T was set to 16.

The last part of Table 4 shows the average rank produced when the different ranking strategies are applied. The input to the ranking step is the file wsj20 after parsing and the BWT; it has a zero-order self entropy of 10.19, comprises 71,921 distinct integers, and has a total frequency of 4,788,496.

The input sequence has an average symbol value of 2993.50, and contains 10,252 symbols with a frequency of one. The average symbol is calculated by summing all symbols that appear in the sequence and then dividing by the number of symbols in the sequence. The average rank is calculated similarly, but on the output sequence of a ranking strategy.
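The two kinds of statistics reported in Table 4, the average value of a sequence and the cumulative percentage of ranks up to a bound k, can be sketched as follows. The function names are chosen here for illustration, and the sample sequence is made up, not taken from the experiments.

```python
def average_value(seq):
    """Sum of all values in the sequence divided by its length."""
    return sum(seq) / len(seq)

def cumulative_percent(ranks, k):
    """Percentage of ranks in the sequence that are at most k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

sample = [1, 1, 2, 5]  # hypothetical rank sequence
print(average_value(sample))          # 2.25
print(cumulative_percent(sample, 2))  # 75.0
```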

A ranking strategy that decreases the average rank tends to produce more compressible output, but not always. The average rank is a somewhat weaker indicator than the skewness of the probability distribution of a sequence. However, the average rank can be a good indicator of the expected average cost of the strategies, since they are analyzed based on the rank values of input symbols.

| Stats (R = % rank) | MTF | fib-F | fib-N | fib-H | fib-S | fib-T | fib-C |
|---|---|---|---|---|---|---|---|
| 0-order self entropy | 7.52 | 7.36 | 7.46 | 7.19 | 7.75 | 7.36 | 7.25 |
| R≤1 | 27.8 | 27.8 | 29.0 | 29.9 | 27.8 | 27.8 | 29.8 |
| R≤2 | 34.5 | 34.5 | 36.4 | 37.8 | 34.5 | 34.6 | 36.0 |
| R≤4 | 42.0 | 42.0 | 44.0 | 45.7 | 41.7 | 42.6 | 44.7 |
| R≤7 | 48.3 | 48.3 | 50.1 | 51.8 | 44.0 | 48.9 | 51.1 |
| R≤12 | 54.1 | 54.1 | 55.4 | 57.2 | 49.8 | 54.6 | 56.7 |
| R≤20 | 59.1 | 59.1 | 59.7 | 61.7 | 57.8 | 59.4 | 61.3 |
| R≤33 | 63.4 | 63.4 | 63.5 | 65.4 | 58.1 | 63.5 | 65.3 |
| R≤54 | 67.1 | 67.1 | 66.8 | 68.8 | 58.8 | 67.2 | 68.9 |
| R≤88 | 70.6 | 70.6 | 70.0 | 71.9 | 61.1 | 70.6 | 72.1 |
| R≤143 | 73.8 | 73.8 | 73.0 | 74.8 | 65.8 | 73.8 | 75.1 |
| R≤1596 | 87.8 | 87.8 | 87.6 | 87.8 | 79.0 | 87.8 | 88.3 |
| R≤2583 | 90.3 | 90.3 | 90.3 | 90.3 | 80.3 | 90.3 | 90.7 |
| R≤4180 | 92.6 | 92.6 | 92.8 | 92.6 | 81.6 | 92.6 | 92.9 |
| Average rank | 1331.3 | 1282.2 | 1315.3 | 1278.4 | 4608.6 | 1283.7 | 1258 |

From the document: Aston Kuta Hotel & Spa, Bali, 20-23 November 2010 (pages 158-165)