notes.txt - Don Yang (uguu.org)

This file describes the various implementation issues.        10/21/01


Huffman file format

   Header size (1 byte)
      This stores the number of internal nodes.

   First header block (up to 64 bytes)
      Bit vector describing the tree structure; odd bits are left
      branches, even bits are right branches.  A set bit denotes that
      the branch points at a leaf node, and the value stored is the
      byte value.  A cleared bit denotes that the branch points at an
      internal node; the value stored is the index of that internal
      node, and is always greater than the index of the current node.

   Second header block (up to 510 bytes)
      Array of node indices / leaf values, distinguished by the bit
      vector specified above.  Every two bytes specify an internal
      node, left branch first.

   Last header block (4 bytes)
      Decoded file size in little-endian format.

   The total header is up to 579 bytes.

   The decoder will always try to parse the header.  If the decoder
   fails, the encoder is called on the input file.

   The rest of the file describes branches in the Huffman tree.  The
   decoder starts at the tree root, follows the left branch for every
   0 bit, and follows the right branch for every 1 bit.  When a leaf
   node is reached, it writes the byte value and restarts at the root
   of the tree.

   Huffman compression is not the best algorithm in the world; for
   most data files, it can only squeeze a few kilobytes out of them.
   But here at IOCCC, everyone takes those few kilobytes very
   seriously...


C (first phase)

   A reasonably efficient program for encoding/decoding the above
   format is implemented in C; nothing too tricky here.


Bytecode (second phase)

   To translate C expressions to bytecodes efficiently, the program
   uses a stack similar to the mechanism used by PostScript.  Unlike
   pure PostScript, though, the engine has structured branches
   (begin/end/call), the arguments for which come after the opcode
   instead of before.  For the most part, the bytecode engine models
   most modern architectures: random access memory, registers, and a
   stack.
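   The opcode-followed-by-operand style can be illustrated with a toy
   stack machine (a minimal sketch only -- the opcode names and the
   instruction set here are hypothetical, not the entry's actual
   bytecode):

```c
#include <assert.h>

/* Hypothetical opcodes: operands follow the opcode in the instruction
   stream, unlike pure PostScript where operands are taken only from
   the operand stack. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

static int run(const int *code)
{
    int stack[64], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++; break;  /* operand after opcode */
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];          /* top of stack */
        }
    }
}
```

   For example, the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT
   leaves (2 + 3) * 4 = 20 on top of the stack.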
   I am pretty sure the bytecode language is Turing-complete :)


Huffman bytecode (third phase)

   Of course, I wouldn't really write it in bytecode if I didn't have
   source compression in mind :)  Using frequency data from the second
   phase, I was able to compress the original source to a little under
   1K.  This is good if and only if I can write the decoder/interpreter
   in the remaining 1K...

   Obviously, at each phase the source code decreases in size while
   the overhead increases, as does the execution time (especially the
   execution time).  This version is now really only useful for IOCCC,
   because it's too slow to be practical.  On the other hand, it's a
   good place to stress those expensive 1GHz processors :)

   Timings for 180K of data:

                         Encoding    Decoding
      C                   0:00.24     0:00.15
      Bytecode            1:43.76     0:07.42
      Huffman bytecode    2:32.02     0:08.95

   The tree used to encode the bytecode (tree.txt) is not the optimal
   Huffman tree (tree-opt.txt).  For a 16-bit loss in compressed code
   size, the decoder can be much simplified (resulting in smaller code
   overall).

   There is a bug somewhere in the header recognition code for the
   decoder (second or third phase?).  The result is that encoded
   binary files can't be decoded.  Well, nobody will really want to
   use this program for serious compression anyway...
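   The tree-walk decoding described under "Huffman file format" above
   can be sketched as follows (a minimal sketch: the struct layout and
   function names are my own illustration, assuming the leaf flags have
   already been pulled out of the header's bit vector and the branch
   values out of the second header block):

```c
#include <assert.h>

/* One internal node: each branch is either a leaf byte value or the
   index of another internal node, distinguished by the leaf flags
   decoded from the header's bit vector. */
typedef struct {
    unsigned char left, right;        /* leaf byte value or node index */
    int left_is_leaf, right_is_leaf;  /* from the header's bit vector  */
} Node;

/* Start at the root (node 0); follow the left branch on a 0 bit and
   the right branch on a 1 bit.  On reaching a leaf, emit its byte
   value and restart at the root.  Returns the number of bytes written. */
static long decode(const Node *tree, const unsigned char *bits, long nbits,
                   unsigned char *out, long outsize)
{
    long i, written = 0;
    int node = 0;
    for (i = 0; i < nbits && written < outsize; i++) {
        int bit     = (bits[i / 8] >> (7 - i % 8)) & 1;
        int is_leaf = bit ? tree[node].right_is_leaf : tree[node].left_is_leaf;
        int next    = bit ? tree[node].right         : tree[node].left;
        if (is_leaf) {
            out[written++] = (unsigned char)next;
            node = 0;                 /* restart at the root */
        } else {
            node = next;              /* child indices always increase */
        }
    }
    return written;
}
```

   With a one-node tree whose left leaf is 'a' and right leaf is 'b',
   the bit sequence 0,1 decodes to the two bytes "ab".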