design.txt - Don Yang (uguu.org)
09/15/01

This document describes most design details behind this program.  It's
really a collection of notes to myself about what I did and why I did
it...  Throughout the document, "Kirika" refers to the algorithm that
is being implemented.

Design

   The general goal is to reduce redundancy by replacing substrings
   with pointers to a previous location, e.g.

      aaaaaaabbaaaaaaa -> aaaaaaabb[-9]

   In cases where a previous substring can not be matched, the bytes
   are copied to the output literally.

Format

   The encoded stream consists of a series of tags, each optionally
   followed by a series of bytes.  Each tag is a 16-bit unsigned
   integer stored in the Intel fashion (i.e. little endian, least
   significant byte first).  The order of bits within a byte is left
   undefined.  Tags are distinguished by the two high order bits:

   00xxxxxx xxxxxxxx [data]           "literal block"
      length = x, 0 < x < (2^14 - 2)
      Copy x bytes (which follow the tag) to output.  Because of this
      tag, the input buffer must be at least 2^14 bytes.

   01xxxxxx xxxxxxxx [data]           "huffman block"
      length = x, 0 < x < (2^14 - 2)
      Block is encoded using a predefined Huffman tree.

   10xxxxxx xxxxxxxx <8bit length y>  "substring reference"
   11xxxxxx xxxxxxxx <16bit length y>
      offset = -x, 2 < x < (2^14 - 2)
      length = y
      Copy y bytes starting at offset -x in the history buffer.
      Because of this tag, the history buffer must be at least 2^14
      bytes.

   Some bit patterns are currently unused, e.g. any pattern with
   x == 0.  It is possible to make more efficient use of the bits,
   but this design allows for simplicity.

Encoder

   The encoder buffers the input stream, and advances the input
   pointer only when it has decided how to encode the input data.  At
   that point the encoded input is written, and the written data is
   pushed into the history buffer for substring matching later.

   The history buffer is initially zero (i.e. bytes at negative file
   offsets are assumed to be zero).  This helps in compressing files
   with leading zeroes.
Literal blocks

   Literal blocks are compressed using Huffman encoding, with a
   predefined tree (generated by gentree.c under huffman/).  When
   Huffman encoding produces a larger block, the bytes are copied
   literally.

   The Huffman tree is generated from byte frequencies in a 0.5MB
   data file, which is mostly text, so it's unlikely to perform very
   well on binary data.

Substring references

   Substrings of length less than or equal to 4 bytes are not
   matched, since it is more efficient to continue a literal block in
   that case.  For the sequence below, "..." is some literal, [a..d]
   are matched substring bytes, lit(2) is a 2-byte literal tag, and
   sub(3) is a 3-byte substring reference tag:

      [...] [abcdx]

      Using a substring reference:
         [lit(2) ...] [sub(3)] [lit(2) x]  -> +6 bytes
         (3-byte reference tag, 2-byte tag for the new literal
         block, plus the byte x)
      Continuing the literal block:
         [lit(2) ... abcdx]                -> +5 bytes
         (just the 5 data bytes)

   Substrings of length 5 are matched, even though a literal encoding
   would produce the same output length.  The rationale is to flush
   output as much as possible, so that there is less chance of a
   block overflowing.

   It has not been proven whether a greedy algorithm always produces
   optimal results.  Of course, it's much simpler to implement Kirika
   using only greedy algorithms.

Decoder

   The decoding process is pretty much dictated by the stream format:
   read in tags and write the associated data to output.

Differences from LZ77

   This section describes why you shouldn't throw away gzip :)  For
   my purposes, I wanted a compressor/decompressor that is correct
   and simple to implement.  A high quality compressor such as gzip
   does a lot more, and is of course a lot more complicated.

   Huffman encoding
      LZ77 uses multiple trees (possibly embedded in the file);
      Kirika uses only one predefined tree.  Also, this tree is tuned
      for text data, thus binary files usually do not benefit from
      Huffman encoding.

   Reference distance > 16K
      Kirika uses two byte headers, with two bits reserved for block
      type.  Thus blocks are limited to 16K sizes.  This works
      reasonably well in practice though.
   Negative references
      Byte pointers can reference offsets before the start of the
      file.  This compresses leading zeroes better.

   One byte hash instead of three byte hash
      LZ77 uses a very efficient 3 byte hash algorithm to find
      previous substrings; Kirika uses only one byte.  This greatly
      simplifies the design and reduces memory usage.

   One byte checksum instead of 4 byte Adler-32
      Kirika uses only one byte as the checksum, which is the XOR of
      all bytes.  It's better than nothing :)

   1GB file size limit
      Theoretically, Kirika could compress up to 2GB too, but the
      size limit is hardcoded to 1GB.  This is not really a
      limitation, since chances are that the user will run out of
      patience long before the limit is reached anyway.

   Rarely does Kirika compress better than gzip, since gzip uses a
   larger dictionary and a different Huffman algorithm.  Kirika
   almost always compresses slower.

Obsolete version

   An older version of Kirika (specified by design0.txt) contained a
   patching algorithm which allowed imperfect substring matching.
   Early experiments showed minimal gain in compression, and only in
   rare hand crafted cases, while adding significant time and space
   requirements.  The patching algorithm was therefore dropped in
   favor of optimizations that are otherwise not possible.

   With the removal of the patching algorithm and the addition of
   Huffman blocks, Kirika is getting pretty close to LZ77...
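The one-byte XOR checksum mentioned above is simple enough to show in
full.  This is a sketch; the function name is made up, and the actual
source may fold the XOR into its I/O loop instead.

```c
#include <stddef.h>
#include <stdint.h>

/* One-byte checksum: XOR of all bytes in the stream.  Unlike
   Adler-32 it cannot detect reordered bytes or paired corruptions,
   but it costs one byte and one XOR per input byte. */
static uint8_t kirika_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= data[i];
    return sum;
}
```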