DMC.pdf

(232 KB) Pobierz
Data Compression Using Dynamic Markov Modelling
G. V. CORMACK* AND R. N. S. HORSPOOL*
* Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G 1, Canada
* Department of Computer Science, University of Victoria, P.O. Box 1700, Victoria, B.G. V8W 2Y2, Canada (Address for correspondence)
A method of dynamically constructing Markov chain models that describe the characteristics of binary messages is developed. Such models can be used to
predict future message characters and can therefore be used as a basis for data compression. To this end, the Markov modelling technique is combined with
Guazzo's arithmetic coding scheme to produce a powerful method of data compression. The method has the advantage of being adaptive: messages may be
encoded or decoded with just a single pass through the data. Experimental results reported here indicate that.the Markov modelling approach generally
achieves much better data compression than that observed with competing methods on typical computer data.
Received April 1986
results are similar to those achievable with the method of
Cleary and Witten, 1 we argue that our method is intrinsically
simpler and in any case requires less storage and consumes less
CPU time.
Because the reader may not be familiar with Guazzo's
approach to data compression, this paper contains a fairly long
and intuitive description of the method. We believe that our
introduction will provide insights into the method that are not
readily available from reading the original paper. Following
this, we present a brief description of adaptive coding and then
we explain our method of automatically constructing Markov
chain models. Next we return to the Guazzo coding method
and discuss problems associated with the coding and decoding
algorithms. Full implementation details of these algorithms are
provided in an appendix to the paper. Finally, we present
results obtained from an implementation of our data
compression algorithm (which we will refer to as DMC, for
Dynamic Markov Compression). These results are compared
with other data compression techniques.
The main contribution of this paper is, perhaps, the Markov
modelling technique. The fact that this technique can be allied
with arithmetic coding to achieve an excellent degree of data
compression is a happy accident. As we point out in the
conclusions, the modelling algorithm may have other
applications. Unfortunately, the modelling technique does not
appear to be amenable to analysis. We have no results to prove
that the dynamically changing model converges, in some sense,
to the true model for the data. Such a result is unlikely in any
case, because the number of states in the model grows without
limit. The modelling technique can be judged only on the basis
that it works, and apparently works extremely well.
1. INTRODUCTION
All data compression methods rely on a priori assumptions
about the structure of the source data. Huffman coding, 7 for
example, assumes that the source data consists of a stream of
characters that are independently and randomly selected
according to some (possibly unknown) static probabilities. The
assumption that we make in this paper is remarkably weak. We
assume only that the source data is a stream of bits generated
by a discrete-parameter Markov chain model. 11 Such a model
is sufficiently general to describe the assumptions of Huffman
coding, of run-length encoding and of many other
compression techniques in common use. The survey paper by
Severance 10 provides a good introduction to data compression
and to various compression techniques that might be
employed.
A similar assumption, that there is an underlying Markov
model for the data, is made in the Ziv-Lempel 13.14 and the
Cleary-Witten 1 techniques. In fact, it is provable that Ziv-
Lempel coding approaches the optimal compression factor for
sufficiently long messages that are generated by a. Markov
model.
The new direction taken in our work is an algorithmic
attempt to discover a Markov chain model that describes the
data. If such a model can be constructed from the first part of
a message, it can be used to predict forthcoming binary
characters. Each state in the Markov chain supplies
probabilities for the next binary character being a zero or a
one. After using the probability estimate in a data coding
scheme, we can use the actual message character to transfer to
a new state in the Markov chain. This new state is then used to
predict the next message bit, and so on.
If the probability estimates for binary characters deviate
from 0.5 and are reasonably accurate, they can be used as the
basis of a data compression method. In our implementation
they are used directly to control a minimum-redundancy
coding method known as arithmetic coding. 9 We u s e a
particular form of arithmetic coding that was invented by
Guazzo. 5 The combination of Markov chain model generation
with Guazzo encoding has turned out to be a very powerful
method of compressing data. As our experimental results
show, it compares very favourably with the competing
compression methods. Although some of our performance
2. INTRODUCTION TO THE GUAZZO
CODING ALGORITHM
Guazzo's algorithm 5 generates minimum-redundancy codes,
suitable for discrete message sources with memory. The term
'message source with memory' is used to describe a message
that has been generated by a Markov chain model. In practical
situations, messages almost always exhibit some degree of
correlation between adjacent characters, and this corresponds
THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987
541
830714001.050.png 830714001.054.png
 
G. V. CORMACK AND R. N. S. HORSPOOL
the new source alphabet. These probabilities can be used to
derive Huffman codes. Table 1 shows the probabilities and
corresponding codes, assuming that bits generated by the
model of Fig. 1 are grouped into threes.
If our Huffman codes for triples are used to encode a long
message generated by the Markov chain model, a moderate
degree of compression is achieved. In fact a long message will,
on average, be compacted to approximately 90.3% of its
original size. However, this number is considerably larger than
the information-theoretic lower bound. A simple calculation
of the entropy in the information source shows that
compression to 81.9% of the original size should be
achievable.
There are two reasons why a Huffman code does not
achieve the lower bound. The first reason is that Huffman
codes can be free of redundancy only if character frequencies
in the source alphabet are integer powers of 2. The second
reason is that the Huffman code is taking advantage of
correlations only between the bits that we have grouped
together. If, for example, our source message contains the two
adjacent groups
... 001 110 ...
the Huffman scheme encodes both groups independently. But
this ignores correlation between the last bit of the first group
and the first bit of the second group. Therefore, the second
group is encoded in a sub-optimal manner. These two
problems with the Huffman scheme are ameliorated only by
choosing larger groups of bits for constructing a source
alphabet. As we choose larger and larger groups, the coding
efficiency comes closer to the lower bound but the alphabet
size grows exponentially. The storage for tables needed to
hold Huffman encodings also grows exponentially, while the
computational cost of creating these tables becomes infeasible.
An alternative to imprudent expansion of the alphabet size
is to use multiple sets of Huffman codes. The choice of which
set of codes to use for the next message character is
determined by the preceding message characters. Such an
approach can be tuned to achieve a reasonable compromise
between compression performance and the total amount of
storage needed to hold tables of Huffman codes. 2
A
0 (50%)
1 (50%)
0 (40%)
B
C
0 (67%)
1 (60%)
1 (33%)
Figure 1. An example Markov model.
Table 1. Huffman coding for 3-bit groups
Source
form
Huffman
code
Probability
000
0.2424
10
001
0.1212
010
010
0.0727
1101
011
0.1091
000
100
0.1212
011
101
0.0606
1100
110
0.1091
001
111
0.1636
111
to the message source having memory.
To introduce the subject, we use a simple Markov model as
an example. In our example, we assume that the source
message string consists of binary digits where
0 is followed by 0 with probability 2/3
0 is followed by 1 with probability 1/3
1 is followed by 0 with probability 2/5
1 is followed by 1 with probability 3/5
and we also assume that the first digit in the message is equally
likely to be 0 or 1. The messages generated by this model have
the overall property that a zero digit is more likely to be
followed by another zero than by a one. Similarly, a one digit is
more likely to be followed by a one digit. The state transition
diagram for a model that exactly encapsulates these
probabilities is shown in Fig. 1.
Working with this particular model, we shall first examine
why the well-known Huffman coding 7 is less than ideal, and
then we will proceed to see how Guazzo coding can achieve
near-optimal codes.
2.2. Guazzo encoding applied to the Markov model
The Guazzo method does not suffer from either of the defects
noted for Huffman coding. For sufficiently long messages, the
method can generate encodings that come arbitrarily close to
the information-theoretic lower bound. The first step in
understanding the binary version of the Guazzo method is to
consider the output encoding as being a binary fraction. For
example, if the output encoding begins with the digits
0 1 1 0 1 ...
we should consider this as being the binary number that
begins '0.01101...'. (This is a number which has a value close to
when expressed as a decimal fraction.) The job of the
encoding algorithm is, in effect, to choose a fractional number
between zero and one which encodes the entire source
message.
Given that the Guazzo algorithm has access to the Markov
2.1. Huffman coding of message sources with memory
We begin by considering how Huffman coding can be used for
messages generated by the Markov model given in Fig. 1.
Huffman coding cannot be directly applied to a binary source
alphabet. Having to provide separate codes for a zero bit and a
one bit implies that compression is impossible.
The usual way around the problem is to create a large
source alphabet. This is easily done by grouping message
characters together. For example, we can choose to work with
groups of three bits, yielding an alphabet of size 8. Through an
analysis of the Markov model in Fig. 1, involving the
computation of equilibrium state probabilities, we have
calculated the probability of occurrence for each character in
13
16
------
542
THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987
830714001.055.png 830714001.001.png 830714001.002.png 830714001.003.png 830714001.004.png 830714001.005.png 830714001.006.png 830714001.007.png 830714001.008.png 830714001.009.png 830714001.010.png 830714001.011.png 830714001.012.png
 
DATA COMPRESSION USING DYNAMIC MARKOV MODELLING
model shown in Fig. 1, we can trace the algorithm and see how
it would choose an encoding for an indefinitely long source
message that begins
0 1 1 1 0 0 1 ...
Initially, the Guazzo algorithm has freedom to choose binary
fractions that lie between
0.000... and 0.111...
inclusive. Guazzo's algorithm determines from the Markov
model that the first digit of the source message is equally likely
to be zero or one. It therefore divides the space of binary
fractions into two equal halves, choosing all fractions in the
closed interval [0.000..., 0.0111...] to represent a source
message that begins with '0', and all fractions in the closed
interval [0.1000..., 0.111...] to represent messages that begin
with '1'. Our source message begins with zero and therefore
we must pick the first sub-interval. Since all binary fractions in
the selected sub-range begin with the digit '0', this effectively
determines that the first output digit is zero.
The first source digit takes us to the state labelled B in our
Markov model. In this state the next digit is twice as likely to
be a zero as a one. Guazzo's algorithm therefore determines
that the range of available binary fractions should be
subdivided into two parts in the ratio two to one. After the
split, the sub-interval [0.000...,0.010101...] represents source
messages that begin '00. ..' and the sub-interval [0.010101...,
0.0111...] represents messages that begin with '01'. The first of
these two sub-intervals has exactly twice the range of the
second. Since the second digit of the source message is one,
we are restricted to the second interval. And since this source
digit leads us to state C in the model, we now have to split the
available range of message encodings in a three-to-two ratio.
Rather than proceeding through the example at this level of
detail, we will skip a few steps. Continuing along the same lines
as before, the Guazzo algorithm will eventually determine that
our source message beginning with '0111001' should be
represented by some binary fraction in the interval
[0.011100110101..., 0.01110100101111...]
Since the lower and upper bounds of the interval begin with
the same five fractional digits, the encoding algorithm could
have already generated these digits (namely '01110').
If we consider the requirements of a practical computer
implementation, it is clear that interval bounds should not be
computed as binary fractions containing unlimited numbers of
digits. An implementation that requires only a small finite
number of bits to be retained in the calculations of the interval
bounds is essential. We defer consideration of this and other
practical issues to a later section of this paper.
To add some further intuition to our intuitive description of
the Guazzo algorithm, we offer a simple observation. The way
that the available space of binary encodings is subdivided at
each step of the algorithm distributes encoded messages as
evenly as possible throughout this space. It is important to
distribute encodings evenly, otherwise some encodings will be
unnecessarily close together – and two values that are almost
the same will usually require more bits to differentiate them
than two values that are farther apart.
The encoding process is easily reversible. The decoding
algorithm has access to the same Markov model for the
information source as was used by the encoder. It starts by
constructing the same interval [0.000..., 0.111...] of possible
encodings and by subdividing it in the same way as the encoder
subdivided it. Inspection of the leading bit(s) in the encoded
message then determines which of the two partitions must have
been used by the encoder and therefore determines what the
first source digit must have been. Once this first digit is known,
the decoding algorithm can select the appropriate partition and
repeat the division into two parts. Inspection of the encoded
message now determines the second digit, and so on.
3. ADAPTIVE CODING
Data compression has traditionally been implemented as a
two-pass technique. An initial pass through the source
message is performed to discover its characteristics. Then,
using knowledge of these characteristics, a second pass to
perform the compression is made. For example, if we are
using Huffman coding, the first pass would count frequencies
of occurrence of each character. The Huffman codes can then
be constructed before the second pass performs encoding. To
ensure that the compressed data can be decoded, either a fixed
coding scheme must be used or else details of the compression
scheme must be included with the compressed data. Although
a two-pass implementation for our new data-compression
technique would be easy to develop, we prefer to proceed
directly to a one-pass adaptive data-compression
implementation. One-pass implementations are preferable
because they do not require the entire message to be saved in
on-line computer memory before encoding can take place.
There is a one-pass technique for data compression that, in
practice, achieves compression very close to that obtained
with two-pass techniques. The basic idea is that the encoding
scheme changes dynamically while the message is being
encoded. The coding scheme used for the k th character of a
message is based on the characteristics of the preceding k -1
characters in the message. This technique is known as adaptive
coding. For example, adaptive version of Huffman coding
have been proposed and implemented. 3,4,8 In practice,
adaptive Huffman coding achieves data compression that
differs insignificantly from conventional two-pass Huffman
coding, but at the expense of considerably more
computational effort.
The Ziv-Lempel 12,13,14 and the Cleary-Witten 1 mehods are
also adaptive coding techniques. The basic idea of Ziv-Lempel
coding is that a group of characters in the message may be
replaced by a pointer to an earlier occurrence of that character
group in the message. After a short learning period, these
pointers will consistently occupy fewer bits than the character
groups they replace. This algorithm can be implemented in
such a way as to be much faster than adaptive Huffman
coding, while achieving much better data compression.
Nevertheless, the Ziv-Lempel algorithm takes advantage only
of correlations among source characters that happen to be
grouped together: all context is discarded between such
groups. The scheme therefore suffers from the same defect as
that described in Section 2.1; the loss of context is ameliorated
THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987
543
G. V. CORMACK AND R. N. S. HORSPOOL
only by using longer and longer groups, with commensurate
consumption of memory resources.
The method of Cleary and Witten uses the preceding k
characters in a message to predict the probabilities for the
current message character, and uses these probabilities to drive
an arithmetic coding scheme. To this end, a table of all previous
occurrences of strings of length k is kept, along with the counts
of characters following each string. However, if the particular
combination of k characters has not appeared previously in the
message, we will be unable to estimate the probabilities using
the table. In this case, the Cleary and Witten method escapes to
use a table of the preceding k -1 characters which is more likely
to be able to make an estimate of probabilities. If the
combination of k -l preceding characters has not previously
occurred either, another escape will take place, and so on. (The
null string will always make a prediction.) Each table of size k
may be regarded as a sparse representation of a (very large) k th-
order Markov model. The main point is that the probabilities
used in the various Markov models are learned as the message is
being processed. Since the decoder can be programmed to learn
in the same way as the encoder, it is easy to construct an
adaptive compression algorithm.
The Guazzo coding algorithm is eminently suitable for use
in adaptive coding schemes. The only aspect of the algorithm
that need change dynamically is the source of probability
estimates for message characters. At each step of the encoding
process, the algorithm requires probability estimates for each
of the possibilities for the next message character. It does not
matter to the Guazzo algorithm whether these probability
estimates are derived from a static Markov model or from a
dynamically changing Markov model. In such a dynamic
model, both the set of states and the transition probabilities
may change, based on message characters seen so far. The
next section of this paper explains our method for dynamically
building a Markov model for the source message.
Decoding a message produced by an adaptive coding
implementation of the Guazzo algorithm should not prove to
be a problem either. All that the decoding algorithm needs to
do is to recreate the same sequence of changes to the
dynamically changing Markov model as were made by the
encoding algorithm. Since the decoding algorithm sees exactly
the same sequence of unencoded digits as the encoding
algorithm, there is no difficulty.
the transitions. We further assume, as a starting point, that
correlations exist between a message character and the
immediately preceding characters.
If we have been reading binary digits from a source
message, we could have been following the corresponding
transitions in the model and have been counting how many
times each transition in the model has been taken. These
counts provide reasonable estimators of the probabilities for
those transitions that have been taken many times. More
precisely, if the transition out of state A for digit 0 has been
taken n 0 times and the transition for digit 1 has been taken n 1
times, then the following probability estimates are reasonable:
Prob {digit = 0 | current state = A} = n 0 / ( n 0 + n 1 )
Prob {digit = 1 | current state = A} = n 1 / (n 0 + n 1 )
The more often we have visited state A, the more confidence
we would have in these probability estimates.
Unfortunately, the adaptive coding technique requires us to
begin to make probability estimates before these transition
counts have grown to significant values. Furthermore, the
above formulae are undefined for a first visit to a state, and
yield transition probabilities of 0% and 100% on the next few
visits. We must be especially careful not to supply the Guazzo
algorithm with a 0% probability for any bit because the
generated encoding would have an infinite length if that bit
were actually observed. There are many ways in which the
probability formulae can be adjusted to take account of these
two concerns. The method which we used in our
implementation is, perhaps, the simplest. We adjusted the
formulae to be
Prob {digit = 0 | current state = A}
= (n 0 + c) / (n 0 + n 1 + 2c)
Prob {digit = 1 | current state = A}
= ( n 1 + c) / (n 0 + n 1 + 2c)
where c is a positive constant. Using small values for c is
equivalent to having confidence in probability estimates based
on small sample sizes, whereas large values correspond to
having little confidence. On the other hand, an adaptive
algorithm will seem to 'learn' the characteristics of a source file
faster if small values for c are used, but at the expanse of
making poor predictions more often. If very large files are
being compressed, the choice of c becomes largely irrelevant.
4. DYNAMIC CONSTRUCTION OF
PREDICTIVE MARKOV MODELS
A Markov chain model can be characterised as a directed
graph with probabilities attached to the graph edges. We can
distinguish two different aspects to the problem of creating
such a Markov model automatically. One part is the
determination of suitable probabilities to place on the edges of
the graph. The other part is the determination of the structure
of the graph itself. We consider these two parts separately,
beginning with the easier problem of choosing probabilities
for the transitions in a given model.
4.2. Building the state transition graph
The method by which probabilities attached to transitions in
the Markov model change dynamically has just been explained.
What has not been explained is the method by which the set of
states in the model changes dynamically. We shall try to explain
this method through consideration of a simple scenario.
Suppose that we have a partially constructed model which
includes states named A, B, ... E, as drawn in Fig. 2( a ). The
figure shows that there are transitions from both A and B to
C, and transitions from C to both D and E. Now, whenever
the model enters state C, some contextual information is lost.
In effect, we forget whether we reached state C from A or
from B. But it is quite possible that the choice of next state, D
or E, is correlated with the previous state, A or B. An easy way
to learn whether such a correlation exists is to duplicate state
4.1. Choosing edge probabilities
To begin out explanation, we assume that we already have a
Markov model but that there are no probabilities attached to
544
THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987
DATA COMPRESSION USING DYNAMIC MARKOV MODELLING
It is assumed that the next transition to be followed in the
Markov chain model would transfer from the current state to
the candidate state, a state that is eligible for cloning.
Generalizing the scenario of Fig. 2( a ), we have the following
cloning criterion. The candidate state is cloned if and only if
the number of observed transitions from the current state to
the candidate state is greater than MIN_CNTI, and the
number of observed transitions from all states other than the
current state to the candidate state is greater than
MIN_CNT2.
The full algorithm for implementing state cloning appears
in the appendix to this paper as Fig. A 1. The algorithm also
shows how transition counts are apportioned when a state is
cloned. By apportioning these counts, Kirchhoff's Laws * are
maintained. The assumption of Kirchhoff's Laws simplifies
the logic needed to determine how often the candidate state
has been visited from states other than the current state. The
choice of suitable values for MIN_CNTl and MIN_CNT2 is
discussed later in the paper.
( a )
0
0
A
C
D
1
1
B
E
( b )
0
0
A
C
D
0
1
1
1
B
C’
E
Figure 2( a ). Part of a Markov model; ( b ) the Markov model
after ‘cloning’
4.3. Starting and stopping the model construction
Two important questions have not yet been answered. The
first question is: what Markov model should we begin with?
The simple answer is that we need only begin with a minimal
model capable of generating any message sequence. It
contains only one state. Both transitions out of this state (for
the digits 0 and 1) loop back to this single state. This model is
diagrammed in Fig. 3 ( a ). After operation of the cloning
algorithm, this single-state model rapidly grows into a complex
model with thousands of states.
In practice, there is some benefit to be gained from
beginning with a slightly less trivial initial model. Almost all
computer data is byte- or word-aligned. Correlations tend to
occur between adjacent bytes more often than between adjacent
bits. If we begin with a model that corresponds to byte
structure, the process of learning source message characteristics
occurs faster, leading to s8lightly better data compression. A
simple model for byte structure has 255 states, arranged as a
binary tree, with transitions from each leaf node returning to
the root of the tree. The general shape of this tree, but with a
smaller number of nodes, is diagrammed in Fig. 3 ( b ).
Although a tree-structured initial graph works well, it is
possible to do a little better. When the transition counts have
reached reasonable values, the counts have the potential to
show correlations between the individual bits of a byte. For
example, if it is the case that the third bit in a byte being a zero
implies that the seventh bit is always a one, the tree model
would be able to describe this correlation. In general, a state in
the kth level of the tree represents correlations with the
preceding k -1 bits. The amount of left context that is
correlated with the current state builds up from 0 bits to 7 bits
and then the correlation is discarded when a transition to the
top (root) of the tree is made. A more satisfactory model is
one that retains a constant amount of left context. Apart from
C (we call this process cloning), generating a new state C'. This
creates a revised Markov model as drawn in Fig. 2( b ). After
this change to the model, the counts for transitions from C to
D or E will be updated only when state C is reached from A,
whereas the counts for C' to D or E will be updated only when
C' is reached from B. Thus the model can now learn the
degree of correlation between the A, B states and the D, E
states.
If the above cloning process is performed when, in fact, no
correlation between the previous state and the next state
exists, little has been lost. We have simply made the model
more complicated (more states) and made our probability
estimates more susceptible to statistical fluctuations (because
each state is visited less often). If such correlations do exist,
the improvements in the probability estimates can be dramatic.
Carrying on with the model of Fig. 2( b ), it is possible that the
choice of next state (D or E) is not correlated with the
previous state being A or B but is correlated with the states
immediately before state A or state B. If this is the case,
cloning state A, cloning state B, cloning state C' and recloning
state C will enable our model to discover the correlations. In
general, the more cloning that is performed, the longer the
range of correlations that can be discovered and be used for
predictive purposes.
In light of the previous observation, our implementation
performs cloning as soon as practicable. Whether it is
practicable to clone a particular state depends on whether that
state has been visited a reasonable number of times from each
of two (or more) different predecessor states. Referring to Fig.
2( a ), again, let us assume that the current state is A and a
transition is about to be made to state C. The desirability of
cloning state C should depend on whether the AC and BC
transition counts are both reasonably large. If, say, the BC
count is zero or is small compared to the AC count, then the
probabilities associated with the transitions leaving C will
reflect a correlation with the predecessor being A. Cloning
state C would enable correlations with state B to be discerned,
but there is little benefit to be gained if the BC transition is
rarely taken.
*. By this we mean that the count of transitions into some state from all its
predecessors should be the same as the count of transitions out of that state to
all its successors. In practice, the two counts may differ by one, because there
is always one more transition into the current state than there are transitions
that have been taken out of it.
THE COMPUTER JOURNAL, VOL. 30, NO.6, 1987
545
830714001.013.png 830714001.014.png 830714001.015.png 830714001.016.png 830714001.017.png 830714001.018.png 830714001.019.png 830714001.020.png 830714001.021.png 830714001.022.png 830714001.023.png 830714001.024.png 830714001.025.png 830714001.026.png 830714001.027.png 830714001.028.png 830714001.029.png 830714001.030.png 830714001.031.png 830714001.032.png 830714001.033.png 830714001.034.png 830714001.035.png 830714001.036.png 830714001.037.png 830714001.038.png 830714001.039.png 830714001.040.png 830714001.041.png 830714001.042.png 830714001.043.png 830714001.044.png 830714001.045.png 830714001.046.png 830714001.047.png 830714001.048.png 830714001.049.png 830714001.051.png 830714001.052.png 830714001.053.png
 
Zgłoś jeśli naruszono regulamin