design.txt - Don Yang (uguu.org)
09/15/01

This document describes most design details behind this program.  It's
really a collection of notes to myself about what I did and why I did
it...  Throughout the document, "Kirika" refers to the algorithm that
is being implemented.

Design

   The general goal is to reduce redundancy by replacing substrings
   with pointers to a previous location, e.g.

      aaaaaaabbaaaaaaa -> aaaaaaabb[-9]

   In cases where a previous substring can not be matched, the bytes
   are copied to the output literally.

Format

   The encoded stream consists of a series of tags, each optionally
   followed by a series of bytes.  Each tag is a 16-bit unsigned
   integer stored in the Intel fashion (i.e. little endian, least
   significant byte first).  The order of bits within a byte is left
   undefined.  Tags are distinguished by the two high order bits:

   00xxxxxx xxxxxxxx [data]           "literal block"
      length = x, 0 < x < (2^14 - 2)
      Copy x bytes (which follow the tag) to output.  Because of this
      tag, the input buffer must be at least 2^14 bytes.

   01xxxxxx xxxxxxxx [data]           "huffman block"
      length = x, 0 < x < (2^14 - 2)
      Block is encoded using a predefined Huffman tree.

   10xxxxxx xxxxxxxx <8bit length y>  "substring reference"
   11xxxxxx xxxxxxxx <16bit length y>
      offset = -x, 2 < x < (2^14 - 2)
      length = y
      Copy y bytes starting at offset -x in the history buffer.
      Because of this tag, the history buffer must be at least 2^14
      bytes.

   Some bit patterns are currently unused, e.g. any pattern with
   x == 0.  It is possible to make more efficient use of the bits,
   but this design allows for simplicity.

Encoder

   The encoder buffers the input stream, and advances the input
   pointer only when it has decided how to encode the input data.  At
   that point the encoded input is written, and the written data is
   pushed into the history buffer for substring matching later.

   The history buffer is initially zero (i.e. bytes at negative file
   offsets are assumed to be zero).  This helps in compressing files
   with leading zeroes.
Literal blocks

   Literal blocks are compressed using Huffman encoding, with a
   predefined tree (generated by gentree.c under huffman/).  When
   Huffman encoding produces a larger block, the bytes are copied
   literally.

   The Huffman tree is generated from byte frequencies in a 0.5MB
   data file, which is mostly text, so it's unlikely to perform very
   well on binary data.

Substring references

   Substrings of length less than or equal to 4 bytes are not
   matched, since it is more efficient to continue a literal block in
   that case.  For the sequence below, "..." is some literal, [a..d]
   are matched substring bytes, lit(2) is a 2-byte literal tag, and
   sub(3) is a 3-byte substring reference tag:

      [...] [abcdx]

      Using a substring reference:
         [lit(2) ...] [sub(3)] [lit(2) x]  -> +6 bytes
         (3-byte reference tag, 2-byte tag for the new literal
         block, plus the byte x)
      Continuing the literal block:
         [lit(2) ... abcdx]                -> +5 bytes
         (just the 5 data bytes)

   Substrings of length 5 are matched, even though a literal encoding
   would produce the same output length.  The rationale is to flush
   output as much as possible, so that there is less chance of a
   block overflowing.

   It has not been proven whether a greedy algorithm always produces
   optimal results.  Of course, it's much simpler to implement Kirika
   using only greedy algorithms.

Decoder

   The decoding process is pretty much dictated by the stream format:
   read in tags and write the associated data to output.

Differences from LZ77

   This section describes why you shouldn't throw away gzip :)  For
   my purposes, I wanted a compressor/decompressor that is correct
   and simple to implement.  A high quality compressor such as gzip
   does a lot more, and is of course a lot more complicated.

   Huffman encoding
      LZ77 uses multiple trees (possibly embedded in the file);
      Kirika uses only one predefined tree.  Also, this tree is tuned
      for text data, thus binary files usually do not benefit from
      Huffman encoding.

   Reference distance > 16K
      Kirika uses two byte headers, with two bits reserved for block
      type.  Thus blocks are limited to 16K sizes.  This works
      reasonably well in practice though.
   Negative references
      Byte pointers can reference offsets before the start of the
      file.  This compresses leading zeroes better.

   One byte hash instead of three byte hash
      LZ77 uses a very efficient 3 byte hash algorithm to find
      previous substrings; Kirika uses only one byte.  This greatly
      simplifies the design and reduces memory usage.

   One byte checksum instead of 4 byte Adler-32
      Kirika uses only one byte as the checksum, which is the XOR of
      all bytes.  It's better than nothing :)

   1GB file size limit
      Theoretically, Kirika could compress up to 2GB too, but the
      size limit is hardcoded to 1GB.  This is not really a
      limitation, since chances are that the user will run out of
      patience long before the limit is reached anyway.

   Rarely does Kirika compress better than gzip, since gzip uses a
   larger dictionary and a different Huffman algorithm.  Kirika
   almost always compresses slower.

Obsolete version

   An older version of Kirika (specified by design0.txt) contained a
   patching algorithm which allowed imperfect substring matching.
   Early experiments showed minimal gain in compression, and only in
   rare hand crafted cases, while adding significant time and space
   requirements.  The patching algorithm was therefore dropped in
   favor of optimizations that are otherwise not possible.

   With the removal of the patching algorithm and the addition of
   Huffman blocks, Kirika is getting pretty close to LZ77...
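The one-byte XOR checksum mentioned above is simple enough to show in
full.  This is a sketch; the function name is made up, and the actual
source may fold the XOR into its I/O loop instead.

```c
#include <stddef.h>
#include <stdint.h>

/* One-byte checksum: XOR of all bytes in the stream.  Unlike
   Adler-32 it cannot detect reordered bytes or paired corruptions,
   but it costs one byte and one XOR per input byte. */
static uint8_t kirika_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= data[i];
    return sum;
}
```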