DMC.pdf

Data Compression Using Dynamic Markov Modelling

G. V. CORMACK* AND R. N. S. HORSPOOL*

* Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G 1, Canada

* Department of Computer Science, University of Victoria, P.O. Box 1700, Victoria, B.G. V8W 2Y2, Canada (Address for correspondence)

A method of dynamically constructing Markov chain models that describe the characteristics of binary messages is developed. Such models can be used to

predict future message characters and can therefore be used as a basis for data compression. To this end, the Markov modelling technique is combined with

Guazzo's arithmetic coding scheme to produce a powerful method of data compression. The method has the advantage of being adaptive: messages may be

encoded or decoded with just a single pass through the data. Experimental results reported here indicate that.the Markov modelling approach generally

achieves much better data compression than that observed with competing methods on typical computer data.

Received April 1986

results are similar to those achievable with the method of

Cleary and Witten, 1 we argue that our method is intrinsically

simpler and in any case requires less storage and consumes less

CPU time.

Because the reader may not be familiar with Guazzo's

approach to data compression, this paper contains a fairly long

and intuitive description of the method. We believe that our

introduction will provide insights into the method that are not

readily available from reading the original paper. Following

this, we present a brief description of adaptive coding and then

we explain our method of automatically constructing Markov

chain models. Next we return to the Guazzo coding method

and discuss problems associated with the coding and decoding

algorithms. Full implementation details of these algorithms are

provided in an appendix to the paper. Finally, we present

results obtained from an implementation of our data

compression algorithm (which we will refer to as DMC, for

Dynamic Markov Compression). These results are compared

with other data compression techniques.

The main contribution of this paper is, perhaps, the Markov

modelling technique. The fact that this technique can be allied

with arithmetic coding to achieve an excellent degree of data

compression is a happy accident. As we point out in the

conclusions, the modelling algorithm may have other

applications. Unfortunately, the modelling technique does not

appear to be amenable to analysis. We have no results to prove

that the dynamically changing model converges, in some sense,

to the true model for the data. Such a result is unlikely in any

case, because the number of states in the model grows without

limit. The modelling technique can be judged only on the basis

that it works, and apparently works extremely well.

1. INTRODUCTION

All data compression methods rely on a priori assumptions

about the structure of the source data. Huffman coding, 7 for

example, assumes that the source data consists of a stream of

characters that are independently and randomly selected

according to some (possibly unknown) static probabilities. The

assumption that we make in this paper is remarkably weak. We

assume only that the source data is a stream of bits generated

by a discrete-parameter Markov chain model. 11 Such a model

is sufficiently general to describe the assumptions of Huffman

coding, of run-length encoding and of many other

compression techniques in common use. The survey paper by

Severance 10 provides a good introduction to data compression

and to various compression techniques that might be

employed.

A similar assumption, that there is an underlying Markov

model for the data, is made in the Ziv-Lempel 13.14 and the

Cleary-Witten 1 techniques. In fact, it is provable that Ziv-

Lempel coding approaches the optimal compression factor for

sufficiently long messages that are generated by a. Markov

model.

The new direction taken in our work is an algorithmic

attempt to discover a Markov chain model that describes the

data. If such a model can be constructed from the first part of

a message, it can be used to predict forthcoming binary

characters. Each state in the Markov chain supplies

probabilities for the next binary character being a zero or a

one. After using the probability estimate in a data coding

scheme, we can use the actual message character to transfer to

a new state in the Markov chain. This new state is then used to

predict the next message bit, and so on.

If the probability estimates for binary characters deviate

from 0.5 and are reasonably accurate, they can be used as the

basis of a data compression method. In our implementation

they are used directly to control a minimum-redundancy

coding method known as arithmetic coding. 9 We u s e a

particular form of arithmetic coding that was invented by

Guazzo. 5 The combination of Markov chain model generation

with Guazzo encoding has turned out to be a very powerful

method of compressing data. As our experimental results

show, it compares very favourably with the competing

compression methods. Although some of our performance

2. INTRODUCTION TO THE GUAZZO

CODING ALGORITHM

Guazzo's algorithm 5 generates minimum-redundancy codes,

suitable for discrete message sources with memory. The term

'message source with memory' is used to describe a message

that has been generated by a Markov chain model. In practical

situations, messages almost always exhibit some degree of

correlation between adjacent characters, and this corresponds

THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987

541

G. V. CORMACK AND R. N. S. HORSPOOL

the new source alphabet. These probabilities can be used to

derive Huffman codes. Table 1 shows the probabilities and

corresponding codes, assuming that bits generated by the

model of Fig. 1 are grouped into threes.

If our Huffman codes for triples are used to encode a long

message generated by the Markov chain model, a moderate

degree of compression is achieved. In fact a long message will,

on average, be compacted to approximately 90.3% of its

original size. However, this number is considerably larger than

the information-theoretic lower bound. A simple calculation

of the entropy in the information source shows that

compression to 81.9% of the original size should be

achievable.

There are two reasons why a Huffman code does not

achieve the lower bound. The first reason is that Huffman

codes can be free of redundancy only if character frequencies

in the source alphabet are integer powers of 2. The second

reason is that the Huffman code is taking advantage of

correlations only between the bits that we have grouped

together. If, for example, our source message contains the two

adjacent groups

... 001 110 ...

the Huffman scheme encodes both groups independently. But

this ignores correlation between the last bit of the first group

and the first bit of the second group. Therefore, the second

group is encoded in a sub-optimal manner. These two

problems with the Huffman scheme are ameliorated only by

choosing larger groups of bits for constructing a source

alphabet. As we choose larger and larger groups, the coding

efficiency comes closer to the lower bound but the alphabet

size grows exponentially. The storage for tables needed to

hold Huffman encodings also grows exponentially, while the

computational cost of creating these tables becomes infeasible.

An alternative to imprudent expansion of the alphabet size

is to use multiple sets of Huffman codes. The choice of which

set of codes to use for the next message character is

determined by the preceding message characters. Such an

approach can be tuned to achieve a reasonable compromise

between compression performance and the total amount of

storage needed to hold tables of Huffman codes. 2

0 (50%)

1 (50%)

0 (40%)

0 (67%)

1 (60%)

1 (33%)

Figure 1. An example Markov model.

Table 1. Huffman coding for 3-bit groups

Source

form

Huffman

code

Probability

000

0.2424

001

0.1212

010

0.0727

1101

011

0.1091

000

100

0.1212

011

101

0.0606

1100

110

0.1091

001

111

0.1636

111

to the message source having memory.

To introduce the subject, we use a simple Markov model as

an example. In our example, we assume that the source

message string consists of binary digits where

0 is followed by 0 with probability 2/3

0 is followed by 1 with probability 1/3

1 is followed by 0 with probability 2/5

1 is followed by 1 with probability 3/5

and we also assume that the first digit in the message is equally

likely to be 0 or 1. The messages generated by this model have

the overall property that a zero digit is more likely to be

followed by another zero than by a one. Similarly, a one digit is

more likely to be followed by a one digit. The state transition

diagram for a model that exactly encapsulates these

probabilities is shown in Fig. 1.

Working with this particular model, we shall first examine

why the well-known Huffman coding 7 is less than ideal, and

then we will proceed to see how Guazzo coding can achieve

near-optimal codes.

2.2. Guazzo encoding applied to the Markov model

The Guazzo method does not suffer from either of the defects

noted for Huffman coding. For sufficiently long messages, the

method can generate encodings that come arbitrarily close to

the information-theoretic lower bound. The first step in

understanding the binary version of the Guazzo method is to

consider the output encoding as being a binary fraction. For

example, if the output encoding begins with the digits

0 1 1 0 1 ...

we should consider this as being the binary number that

begins '0.01101...'. (This is a number which has a value close to

when expressed as a decimal fraction.) The job of the

encoding algorithm is, in effect, to choose a fractional number

between zero and one which encodes the entire source

message.

Given that the Guazzo algorithm has access to the Markov

2.1. Huffman coding of message sources with memory

We begin by considering how Huffman coding can be used for

messages generated by the Markov model given in Fig. 1.

Huffman coding cannot be directly applied to a binary source

alphabet. Having to provide separate codes for a zero bit and a

one bit implies that compression is impossible.

The usual way around the problem is to create a large

source alphabet. This is easily done by grouping message

characters together. For example, we can choose to work with

groups of three bits, yielding an alphabet of size 8. Through an

analysis of the Markov model in Fig. 1, involving the

computation of equilibrium state probabilities, we have

calculated the probability of occurrence for each character in

------

542

THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987

DATA COMPRESSION USING DYNAMIC MARKOV MODELLING

model shown in Fig. 1, we can trace the algorithm and see how

it would choose an encoding for an indefinitely long source

message that begins

0 1 1 1 0 0 1 ...

Initially, the Guazzo algorithm has freedom to choose binary

fractions that lie between

0.000... and 0.111...

inclusive. Guazzo's algorithm determines from the Markov

model that the first digit of the source message is equally likely

to be zero or one. It therefore divides the space of binary

fractions into two equal halves, choosing all fractions in the

closed interval [0.000..., 0.0111...] to represent a source

message that begins with '0', and all fractions in the closed

interval [0.1000..., 0.111...] to represent messages that begin

with '1'. Our source message begins with zero and therefore

we must pick the first sub-interval. Since all binary fractions in

the selected sub-range begin with the digit '0', this effectively

determines that the first output digit is zero.

The first source digit takes us to the state labelled B in our

Markov model. In this state the next digit is twice as likely to

be a zero as a one. Guazzo's algorithm therefore determines

that the range of available binary fractions should be

subdivided into two parts in the ratio two to one. After the

split, the sub-interval [0.000...,0.010101...] represents source

messages that begin '00. ..' and the sub-interval [0.010101...,

0.0111...] represents messages that begin with '01'. The first of

these two sub-intervals has exactly twice the range of the

second. Since the second digit of the source message is one,

we are restricted to the second interval. And since this source

digit leads us to state C in the model, we now have to split the

available range of message encodings in a three-to-two ratio.

Rather than proceeding through the example at this level of

detail, we will skip a few steps. Continuing along the same lines

as before, the Guazzo algorithm will eventually determine that

our source message beginning with '0111001' should be

represented by some binary fraction in the interval

[0.011100110101..., 0.01110100101111...]

Since the lower and upper bounds of the interval begin with

the same five fractional digits, the encoding algorithm could

have already generated these digits (namely '01110').

If we consider the requirements of a practical computer

implementation, it is clear that interval bounds should not be

computed as binary fractions containing unlimited numbers of

digits. An implementation that requires only a small finite

number of bits to be retained in the calculations of the interval

bounds is essential. We defer consideration of this and other

practical issues to a later section of this paper.

To add some further intuition to our intuitive description of

the Guazzo algorithm, we offer a simple observation. The way

that the available space of binary encodings is subdivided at

each step of the algorithm distributes encoded messages as

evenly as possible throughout this space. It is important to

distribute encodings evenly, otherwise some encodings will be

unnecessarily close together – and two values that are almost

the same will usually require more bits to differentiate them

than two values that are farther apart.

The encoding process is easily reversible. The decoding

algorithm has access to the same Markov model for the

information source as was used by the encoder. It starts by

constructing the same interval [0.000..., 0.111...] of possible

encodings and by subdividing it in the same way as the encoder

subdivided it. Inspection of the leading bit(s) in the encoded

message then determines which of the two partitions must have

been used by the encoder and therefore determines what the

first source digit must have been. Once this first digit is known,

the decoding algorithm can select the appropriate partition and

repeat the division into two parts. Inspection of the encoded

message now determines the second digit, and so on.

3. ADAPTIVE CODING

Data compression has traditionally been implemented as a

two-pass technique. An initial pass through the source

message is performed to discover its characteristics. Then,

using knowledge of these characteristics, a second pass to

perform the compression is made. For example, if we are

using Huffman coding, the first pass would count frequencies

of occurrence of each character. The Huffman codes can then

be constructed before the second pass performs encoding. To

ensure that the compressed data can be decoded, either a fixed

coding scheme must be used or else details of the compression

scheme must be included with the compressed data. Although

a two-pass implementation for our new data-compression

technique would be easy to develop, we prefer to proceed

directly to a one-pass adaptive data-compression

implementation. One-pass implementations are preferable

because they do not require the entire message to be saved in

on-line computer memory before encoding can take place.

There is a one-pass technique for data compression that, in

practice, achieves compression very close to that obtained

with two-pass techniques. The basic idea is that the encoding

scheme changes dynamically while the message is being

encoded. The coding scheme used for the k th character of a

message is based on the characteristics of the preceding k -1

characters in the message. This technique is known as adaptive

coding. For example, adaptive version of Huffman coding

have been proposed and implemented. 3,4,8 In practice,

adaptive Huffman coding achieves data compression that

differs insignificantly from conventional two-pass Huffman

coding, but at the expense of considerably more

computational effort.

The Ziv-Lempel 12,13,14 and the Cleary-Witten 1 mehods are

also adaptive coding techniques. The basic idea of Ziv-Lempel

coding is that a group of characters in the message may be

replaced by a pointer to an earlier occurrence of that character

group in the message. After a short learning period, these

pointers will consistently occupy fewer bits than the character

groups they replace. This algorithm can be implemented in

such a way as to be much faster than adaptive Huffman

coding, while achieving much better data compression.

Nevertheless, the Ziv-Lempel algorithm takes advantage only

of correlations among source characters that happen to be

grouped together: all context is discarded between such

groups. The scheme therefore suffers from the same defect as

that described in Section 2.1; the loss of context is ameliorated

THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987

543

G. V. CORMACK AND R. N. S. HORSPOOL

only by using longer and longer groups, with commensurate

consumption of memory resources.

The method of Cleary and Witten uses the preceding k

characters in a message to predict the probabilities for the

current message character, and uses these probabilities to drive

an arithmetic coding scheme. To this end, a table of all previous

occurrences of strings of length k is kept, along with the counts

of characters following each string. However, if the particular

combination of k characters has not appeared previously in the

message, we will be unable to estimate the probabilities using

the table. In this case, the Cleary and Witten method escapes to

use a table of the preceding k -1 characters which is more likely

to be able to make an estimate of probabilities. If the

combination of k -l preceding characters has not previously

occurred either, another escape will take place, and so on. (The

null string will always make a prediction.) Each table of size k

may be regarded as a sparse representation of a (very large) k th-

order Markov model. The main point is that the probabilities

used in the various Markov models are learned as the message is

being processed. Since the decoder can be programmed to learn

in the same way as the encoder, it is easy to construct an

adaptive compression algorithm.

The Guazzo coding algorithm is eminently suitable for use

in adaptive coding schemes. The only aspect of the algorithm

that need change dynamically is the source of probability

estimates for message characters. At each step of the encoding

process, the algorithm requires probability estimates for each

of the possibilities for the next message character. It does not

matter to the Guazzo algorithm whether these probability

estimates are derived from a static Markov model or from a

dynamically changing Markov model. In such a dynamic

model, both the set of states and the transition probabilities

may change, based on message characters seen so far. The

next section of this paper explains our method for dynamically

building a Markov model for the source message.

Decoding a message produced by an adaptive coding

implementation of the Guazzo algorithm should not prove to

be a problem either. All that the decoding algorithm needs to

do is to recreate the same sequence of changes to the

dynamically changing Markov model as were made by the

encoding algorithm. Since the decoding algorithm sees exactly

the same sequence of unencoded digits as the encoding

algorithm, there is no difficulty.

the transitions. We further assume, as a starting point, that

correlations exist between a message character and the

immediately preceding characters.

If we have been reading binary digits from a source

message, we could have been following the corresponding

transitions in the model and have been counting how many

times each transition in the model has been taken. These

counts provide reasonable estimators of the probabilities for

those transitions that have been taken many times. More

precisely, if the transition out of state A for digit 0 has been

taken n 0 times and the transition for digit 1 has been taken n 1

times, then the following probability estimates are reasonable:

Prob {digit = 0 | current state = A} = n 0 / ( n 0 + n 1 )

Prob {digit = 1 | current state = A} = n 1 / (n 0 + n 1 )

The more often we have visited state A, the more confidence

we would have in these probability estimates.

Unfortunately, the adaptive coding technique requires us to

begin to make probability estimates before these transition

counts have grown to significant values. Furthermore, the

above formulae are undefined for a first visit to a state, and

yield transition probabilities of 0% and 100% on the next few

visits. We must be especially careful not to supply the Guazzo

algorithm with a 0% probability for any bit because the

generated encoding would have an infinite length if that bit

were actually observed. There are many ways in which the

probability formulae can be adjusted to take account of these

two concerns. The method which we used in our

implementation is, perhaps, the simplest. We adjusted the

formulae to be

Prob {digit = 0 | current state = A}

= (n 0 + c) / (n 0 + n 1 + 2c)

Prob {digit = 1 | current state = A}

= ( n 1 + c) / (n 0 + n 1 + 2c)

where c is a positive constant. Using small values for c is

equivalent to having confidence in probability estimates based

on small sample sizes, whereas large values correspond to

having little confidence. On the other hand, an adaptive

algorithm will seem to 'learn' the characteristics of a source file

faster if small values for c are used, but at the expanse of

making poor predictions more often. If very large files are

being compressed, the choice of c becomes largely irrelevant.

4. DYNAMIC CONSTRUCTION OF

PREDICTIVE MARKOV MODELS

A Markov chain model can be characterised as a directed

graph with probabilities attached to the graph edges. We can

distinguish two different aspects to the problem of creating

such a Markov model automatically. One part is the

determination of suitable probabilities to place on the edges of

the graph. The other part is the determination of the structure

of the graph itself. We consider these two parts separately,

beginning with the easier problem of choosing probabilities

for the transitions in a given model.

4.2. Building the state transition graph

The method by which probabilities attached to transitions in

the Markov model change dynamically has just been explained.

What has not been explained is the method by which the set of

states in the model changes dynamically. We shall try to explain

this method through consideration of a simple scenario.

Suppose that we have a partially constructed model which

includes states named A, B, ... E, as drawn in Fig. 2( a ). The

figure shows that there are transitions from both A and B to

C, and transitions from C to both D and E. Now, whenever

the model enters state C, some contextual information is lost.

In effect, we forget whether we reached state C from A or

from B. But it is quite possible that the choice of next state, D

or E, is correlated with the previous state, A or B. An easy way

to learn whether such a correlation exists is to duplicate state

4.1. Choosing edge probabilities

To begin out explanation, we assume that we already have a

Markov model but that there are no probabilities attached to

544

THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987

DATA COMPRESSION USING DYNAMIC MARKOV MODELLING

It is assumed that the next transition to be followed in the

Markov chain model would transfer from the current state to

the candidate state, a state that is eligible for cloning.

Generalizing the scenario of Fig. 2( a ), we have the following

cloning criterion. The candidate state is cloned if and only if

the number of observed transitions from the current state to

the candidate state is greater than MIN_CNTI, and the

number of observed transitions from all states other than the

current state to the candidate state is greater than

MIN_CNT2.

The full algorithm for implementing state cloning appears

in the appendix to this paper as Fig. A 1. The algorithm also

shows how transition counts are apportioned when a state is

cloned. By apportioning these counts, Kirchhoff's Laws * are

maintained. The assumption of Kirchhoff's Laws simplifies

the logic needed to determine how often the candidate state

has been visited from states other than the current state. The

choice of suitable values for MIN_CNTl and MIN_CNT2 is

discussed later in the paper.

( a )

( b )

C’

Figure 2( a ). Part of a Markov model; ( b ) the Markov model

after ‘cloning’

4.3. Starting and stopping the model construction

Two important questions have not yet been answered. The

first question is: what Markov model should we begin with?

The simple answer is that we need only begin with a minimal

model capable of generating any message sequence. It

contains only one state. Both transitions out of this state (for

the digits 0 and 1) loop back to this single state. This model is

diagrammed in Fig. 3 ( a ). After operation of the cloning

algorithm, this single-state model rapidly grows into a complex

model with thousands of states.

In practice, there is some benefit to be gained from

beginning with a slightly less trivial initial model. Almost all

computer data is byte- or word-aligned. Correlations tend to

occur between adjacent bytes more often than between adjacent

bits. If we begin with a model that corresponds to byte

structure, the process of learning source message characteristics

occurs faster, leading to s8lightly better data compression. A

simple model for byte structure has 255 states, arranged as a

binary tree, with transitions from each leaf node returning to

the root of the tree. The general shape of this tree, but with a

smaller number of nodes, is diagrammed in Fig. 3 ( b ).

Although a tree-structured initial graph works well, it is

possible to do a little better. When the transition counts have

reached reasonable values, the counts have the potential to

show correlations between the individual bits of a byte. For

example, if it is the case that the third bit in a byte being a zero

implies that the seventh bit is always a one, the tree model

would be able to describe this correlation. In general, a state in

the kth level of the tree represents correlations with the

preceding k -1 bits. The amount of left context that is

correlated with the current state builds up from 0 bits to 7 bits

and then the correlation is discarded when a transition to the

top (root) of the tree is made. A more satisfactory model is

one that retains a constant amount of left context. Apart from

C (we call this process cloning), generating a new state C'. This

creates a revised Markov model as drawn in Fig. 2( b ). After

this change to the model, the counts for transitions from C to

D or E will be updated only when state C is reached from A,

whereas the counts for C' to D or E will be updated only when

C' is reached from B. Thus the model can now learn the

degree of correlation between the A, B states and the D, E

states.

If the above cloning process is performed when, in fact, no

correlation between the previous state and the next state

exists, little has been lost. We have simply made the model

more complicated (more states) and made our probability

estimates more susceptible to statistical fluctuations (because

each state is visited less often). If such correlations do exist,

the improvements in the probability estimates can be dramatic.

Carrying on with the model of Fig. 2( b ), it is possible that the

choice of next state (D or E) is not correlated with the

previous state being A or B but is correlated with the states

immediately before state A or state B. If this is the case,

cloning state A, cloning state B, cloning state C' and recloning

state C will enable our model to discover the correlations. In

general, the more cloning that is performed, the longer the

range of correlations that can be discovered and be used for

predictive purposes.

In light of the previous observation, our implementation

performs cloning as soon as practicable. Whether it is

practicable to clone a particular state depends on whether that

state has been visited a reasonable number of times from each

of two (or more) different predecessor states. Referring to Fig.

2( a ), again, let us assume that the current state is A and a

transition is about to be made to state C. The desirability of

cloning state C should depend on whether the AC and BC

transition counts are both reasonably large. If, say, the BC

count is zero or is small compared to the AC count, then the

probabilities associated with the transitions leaving C will

reflect a correlation with the predecessor being A. Cloning

state C would enable correlations with state B to be discerned,

but there is little benefit to be gained if the BC transition is

rarely taken.

*. By this we mean that the count of transitions into some state from all its

predecessors should be the same as the count of transitions out of that state to

all its successors. In practice, the two counts may differ by one, because there

is always one more transition into the current state than there are transitions

that have been taken out of it.

THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987

545

Plik z chomika:

Inne pliki z tego folderu:

Inne foldery tego chomika: