design0.txt - Don Yang (uguu.org) 09/12/01

This is an obsolete version of the design document; the current
version is design.txt.

This document describes most design details behind this program.
It's really a collection of notes to myself for what I did and why I
did it...  Throughout the document, "Kirika" refers to the algorithm
that is being implemented.

Design

The general goal is to reduce redundancy by replacing substrings with
pointers to a previous location, e.g.

   aaaaaaabbaaaaaaa -> aaaaaaabb[-9]

However, substrings may not always be a perfect match, in which case
it may be more efficient to match the longest substring, then patch
the difference, e.g.

   aaaaaaabbaaaabaa -> aaaaaaabb[-9][-2b]

In other cases, it may not be possible to match any previous
substring (e.g. at the beginning of the stream), in which case bytes
are copied to output until a suitable match is possible.

Format

The encoded stream consists of a series of tags, each optionally
followed by a series of bytes.  Each tag is a 16-bit unsigned integer
stored in the Intel fashion (i.e. little endian, least significant
byte first).  Order of bits within a byte is left undefined.  Tags
are distinguished by the two high order bits:

   00xxxxxx xxxxxxxx <data>             "literal block"
      length = x, 0 < x < (2^14 - 2)
      Copy <length> bytes to output (the data follows the tag).
      Because of this tag, the input buffer must be at least 2^14
      bytes.

   01xxxxxx xxxxxxxx <byte>             "patch block"
      offset = -x, 1 < x < (2^14 - 2)
      Patch the byte at <offset> with <byte>.

   10xxxxxx xxxxxxxx <8bit length y>    "substring reference"
   11xxxxxx xxxxxxxx <16bit length y>
      offset = -x, 2 < x < (2^14 - 2)
      length = y
      Copy <length> bytes from <offset>.  Because of this tag, the
      history buffer must be at least 2^14 bytes.

Some bit patterns are currently unused, e.g. any pattern with x == 0.
It is possible to make more efficient use of the bits, but this
design allows for simplicity.

Encoder

The encoder buffers the input stream, and advances the input pointer
only when it has decided how to write the input data.
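
The tag layout described under Format can be sketched with a pair of
small helpers (a sketch in Python with hypothetical names; the only
assumption beyond the text is that the two type bits sit at the
high-order end of the 16-bit tag value, as "two high order bits"
suggests):

```python
import struct

# Hypothetical names for the four tag types described in Format.
TAG_LITERAL = 0  # 00 - literal block, x = length
TAG_PATCH   = 1  # 01 - patch block, x = negative offset
TAG_SUB8    = 2  # 10 - substring reference, 8-bit length follows
TAG_SUB16   = 3  # 11 - substring reference, 16-bit length follows

def pack_tag(tag_type, x):
    """Pack a 2-bit type and 14-bit field into a little endian tag."""
    assert 0 <= x < (1 << 14)
    return struct.pack('<H', (tag_type << 14) | x)

def unpack_tag(data):
    """Inverse of pack_tag: return (type, x) from two tag bytes."""
    (word,) = struct.unpack('<H', data)
    return word >> 14, word & 0x3FFF
```

For example, a literal block of length 5 would be packed as the bytes
05 00, with the data following.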
Once decided, the encoded input is written, and the written data is
pushed into the history buffer for later substring matching.

The history buffer is initially zero (i.e. bytes at negative file
offsets are assumed to be zero).  This helps in compressing files
with leading zeroes.

Substring references

Substrings of length less than or equal to 4 bytes are not matched,
since it is more efficient to continue a literal block in that case.
For the sequence below, "..." is some literal, [a..d] are matched
substring bytes:

   [...] [abcdx]

   Using substring reference:
      [lit(2) ...] [sub(3)] [lit(2) x]   -> +6 bytes
   Extending the literal:
      [lit(2) ... abcdx]                 -> +5 bytes

Substrings of length 5 are matched, even though a literal encoding
would produce the same output length.  The rationale here is to flush
output as much as possible, so that there is less chance of a block
overflowing.

Patching bytes

Given a sequence of bytes:

   [match.....] [error] [match........]

where replacing the error byte would provide a longer matched
substring, a pointer to the first match is written, and a patch block
for the error byte is written after the substring block.  Because the
blocks both before and after the error byte match some previous
substring, patching the error byte saves the header bytes of the
second block:

   [sub(3)] [lit(2) error] [sub(3)]   (without patching, 9 bytes)
   [sub(3)] [patch(2) error]          (patched, 6 bytes)

For this to work, the matching substring blocks both before and after
the error byte must be long enough for the patching to be worthwhile;
otherwise the encoder could potentially do better by consolidating
the bytes before and after into literal blocks.  Following the
criterion for substring blocks, error bytes are not matched at
indexes less than or equal to 5 (i.e. the block before the error byte
must qualify as a substring), and not matched unless there are 5 or
more matching bytes following (i.e. the block after the error byte
must qualify as a substring).
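
The byte counts in the patching example above can be double-checked
with a little arithmetic (block sizes are taken from the Format
section: every block starts with a 2-byte tag, and a short substring
reference carries one extra 8-bit length byte):

```python
TAG = 2    # 16-bit tag header
LEN8 = 1   # 8-bit length byte of a short substring reference

sub = TAG + LEN8            # [sub(3)]
lit_error = TAG + 1         # [lit(2) error] - tag plus the error byte
patch_error = TAG + 1       # [patch(2) error] - tag plus the patch byte

unpatched = sub + lit_error + sub   # sub, literal, sub: 9 bytes
patched = sub + patch_error         # one long sub, then a patch: 6 bytes
```

So patching a single error byte between two matched runs saves the
3-byte header of the second substring block.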
It's not proven whether or not a greedy algorithm always produces
optimal results.  Of course, it's much simpler to implement Kirika
using only greedy algorithms.

Decoder

The decoding process is pretty much dictated by the stream format:
read in tags and write the associated data to output.  Because of the
backward patching nature of the stream format, the decoder must
buffer 2^14 bytes of data, and only write data when advancing the
output pointer would cause some of the data to expire (the expired
data is then written).

Differences from LZ77

This section describes why you shouldn't throw away gzip :)  For my
purposes, I wanted a compressor/decompressor that is correct and
simple to implement.  A high quality compressor such as gzip does a
lot more, and is of course a lot more complicated.

Patching bytes
   This is not in LZ77.

No Huffman encoding
   gzip performs Huffman encoding on literal blocks, Kirika doesn't.
   Implementing it would improve compression even on random files,
   but would also significantly increase complexity, since blocks
   would no longer fall on byte boundaries.

No reference distance > 16K
   Kirika uses two byte headers, with two bits reserved for the block
   type, so blocks are limited to 16K sizes.  This works reasonably
   well in practice though.

Negative references
   Byte pointers can reference offsets before the start of the file.
   This compresses leading zeroes better.

One byte hash instead of three byte hash
   gzip uses a very efficient 3 byte hash to find previous
   substrings; Kirika uses only one byte.  This greatly simplifies
   the design and reduces memory usage.

One byte checksum instead of 4 byte Adler-32
   Kirika uses only one byte as the checksum, which is the XOR of all
   bytes.  It's better than nothing :)
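
As a sanity check on the format, a minimal decoder for the stream
described above might look like the sketch below (Python, hypothetical
names; the tag layout is assumed from the Format section, the real
implementation streams through a 2^14-byte window instead of buffering
the whole output, and the checksum is omitted):

```python
import struct

def decode(stream):
    # Non-streaming sketch: keep the whole output in memory so that
    # patch blocks can reach back, instead of the 2^14-byte window.
    out = bytearray()
    pos = 0

    def back(offset):
        # Read a byte at a negative offset from the current output
        # position; offsets before the start of the file read as
        # zero, per the history buffer rule.
        i = len(out) - offset
        return out[i] if i >= 0 else 0

    while pos < len(stream):
        (word,) = struct.unpack_from('<H', stream, pos)  # little endian tag
        pos += 2
        tag_type, x = word >> 14, word & 0x3FFF          # 2 type bits + 14-bit field
        if tag_type == 0:                  # literal block: copy x bytes verbatim
            out += stream[pos:pos + x]
            pos += x
        elif tag_type == 1:                # patch block: replace the byte at -x
            out[len(out) - x] = stream[pos]
            pos += 1
        else:                              # substring reference
            if tag_type == 2:              # 8-bit length follows the tag
                length = stream[pos]
                pos += 1
            else:                          # 16-bit length follows the tag
                (length,) = struct.unpack_from('<H', stream, pos)
                pos += 2
            for _ in range(length):        # byte at a time, so a copy may
                out.append(back(x))        # overlap the bytes it just wrote
    return bytes(out)
```

Note that copying byte-at-a-time lets a reference overlap its own
output, so a short offset with a long length behaves like run-length
encoding (as in the aaaaaaabb[-9] example).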