notes.txt - Don Yang (uguu.org)

This file describes the various implementation issues.        10/21/01


Huffman file format

   Header size (1 byte)
      This stores the number of internal nodes.

   First header block (up to 64 bytes)
      Bit vector describing the tree structure; odd bits are left
      branches, even bits are right branches.  A set bit denotes that
      the branch points at a leaf node, and the value stored is the
      byte value.  A cleared bit denotes that the branch points at an
      internal node; the value stored is the index of that internal
      node, and is always greater than the index of the current node.

   Second header block (up to 510 bytes)
      Array of node indices / leaf values, distinguished by the bit
      vector specified above.  Every two bytes specify an internal
      node, left branch first.

   Last header block (4 bytes)
      Decoded file size in little-endian format.

   The total header is up to 579 bytes.

   The decoder will always try to parse the header.  If the decoder
   fails, the encoder is called on the input file.

   The rest of the file describes branches in the Huffman tree.  The
   decoder starts at the tree root, follows the left branch for every
   0 bit, and follows the right branch for every 1 bit.  When a leaf
   node is reached, it writes the byte value and restarts at the root
   of the tree.

   Huffman compression is not the best algorithm in the world; for
   most data files, it can only squeeze a few kilobytes out of them.
   But here at IOCCC, everyone takes those few kilobytes very
   seriously...


C (first phase)

   A reasonably efficient program for encoding/decoding the above
   format is implemented in C; nothing too tricky here.


Bytecode (second phase)

   To translate C expressions to bytecodes efficiently, the program
   uses a stack similar to the mechanism used by PostScript.  Unlike
   pure PostScript, though, the engine has structured branches
   (begin/end/call), the arguments for which come after the opcode
   instead of before.  For the most part, the bytecode engine models
   most modern architectures: random access memory, registers, and a
   stack.
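   The opcode-followed-by-operand style can be illustrated with a toy
   stack machine (a minimal sketch only -- the opcode names and the
   instruction set here are hypothetical, not the entry's actual
   bytecode):

```c
#include <assert.h>

/* Hypothetical opcodes: operands follow the opcode in the instruction
   stream, unlike pure PostScript where operands are taken only from
   the operand stack. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

static int run(const int *code)
{
    int stack[64], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++; break;  /* operand after opcode */
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];          /* top of stack */
        }
    }
}
```

   For example, the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT
   leaves (2 + 3) * 4 = 20 on top of the stack.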
   I am pretty sure the bytecode language is Turing-complete :)


Huffman bytecode (third phase)

   Of course, I wouldn't really write it in bytecode if I didn't have
   source compression in mind :)  Using frequency data from the second
   phase, I was able to compress the original source to a little under
   1K.  This is good if and only if I can write the decoder/interpreter
   in the remaining 1K...

   Obviously, at each phase the source code decreases in size while
   the overhead increases, as does the execution time (especially the
   execution time).  This version is now really only useful for IOCCC,
   because it's too slow to be practical.  On the other hand, it's a
   good place to stress those expensive 1GHz processors :)

   Timings for 180K of data:

                         Encoding    Decoding
      C                   0:00.24     0:00.15
      Bytecode            1:43.76     0:07.42
      Huffman bytecode    2:32.02     0:08.95

   The tree used to encode the bytecode (tree.txt) is not the optimal
   Huffman tree (tree-opt.txt).  For a 16-bit loss in compressed code
   size, the decoder can be much simplified (resulting in smaller code
   overall).

   There is a bug somewhere in the header recognition code for the
   decoder (second or third phase?).  The result is that encoded
   binary files can't be decoded.  Well, nobody will really want to
   use this program for serious compression anyway...
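   The tree-walk decoding described under "Huffman file format" above
   can be sketched as follows (a minimal sketch: the struct layout and
   function names are my own illustration, assuming the leaf flags have
   already been pulled out of the header's bit vector and the branch
   values out of the second header block):

```c
#include <assert.h>

/* One internal node: each branch is either a leaf byte value or the
   index of another internal node, distinguished by the leaf flags
   decoded from the header's bit vector. */
typedef struct {
    unsigned char left, right;        /* leaf byte value or node index */
    int left_is_leaf, right_is_leaf;  /* from the header's bit vector  */
} Node;

/* Start at the root (node 0); follow the left branch on a 0 bit and
   the right branch on a 1 bit.  On reaching a leaf, emit its byte
   value and restart at the root.  Returns the number of bytes written. */
static long decode(const Node *tree, const unsigned char *bits, long nbits,
                   unsigned char *out, long outsize)
{
    long i, written = 0;
    int node = 0;
    for (i = 0; i < nbits && written < outsize; i++) {
        int bit     = (bits[i / 8] >> (7 - i % 8)) & 1;
        int is_leaf = bit ? tree[node].right_is_leaf : tree[node].left_is_leaf;
        int next    = bit ? tree[node].right         : tree[node].left;
        if (is_leaf) {
            out[written++] = (unsigned char)next;
            node = 0;                 /* restart at the root */
        } else {
            node = next;              /* child indices always increase */
        }
    }
    return written;
}
```

   With a one-node tree whose left leaf is 'a' and right leaf is 'b',
   the bit sequence 0,1 decodes to the two bytes "ab".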