design0.txt - Don Yang (uguu.org) 09/12/01

This is an obsolete version of the design document; the current
version is design.txt.

This document describes most design details behind this program.
It's really a collection of notes to myself for what I did and why I
did it...  Throughout the document, "Kirika" refers to the algorithm
that is being implemented.

Design

The general goal is to reduce redundancy by replacing substrings with
pointers to a previous location, e.g.

   aaaaaaabbaaaaaaa -> aaaaaaabb[-9]

However, substrings may not always be a perfect match, in which case
it may be more efficient to match the longest substring, then patch
the difference, e.g.

   aaaaaaabbaaaabaa -> aaaaaaabb[-9][-2b]

In other cases, it may not be possible to match any previous
substring (e.g. at the beginning of the stream), in which case bytes
are copied to output until a suitable match is possible.

Format

The encoded stream consists of a series of tags, each optionally
followed by a series of bytes.  Each tag is a 16-bit unsigned integer
stored in the Intel fashion (i.e. little endian, least significant
byte first).  Order of bits within a byte is left undefined.  Tags
are distinguished by the two high order bits:

   00xxxxxx xxxxxxxx <data>             "literal block"
      length = x, 0 < x < (2^14 - 2)
      Copy <length> bytes to output (the data follows the tag).
      Because of this tag, the input buffer must be at least 2^14
      bytes.

   01xxxxxx xxxxxxxx <byte>             "patch block"
      offset = -x, 1 < x < (2^14 - 2)
      Patch the byte at <offset> with <byte>.

   10xxxxxx xxxxxxxx <8bit length y>    "substring reference"
   11xxxxxx xxxxxxxx <16bit length y>
      offset = -x, 2 < x < (2^14 - 2)
      length = y
      Copy <length> bytes from <offset>.  Because of this tag, the
      history buffer must be at least 2^14 bytes.

Some bit patterns are currently unused, e.g. any pattern with x == 0.
It is possible to make more efficient use of the bits, but this
design allows for simplicity.

Encoder

The encoder buffers the input stream, and advances the input pointer
only when it has decided how to write the input data.
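
The tag layout described under Format can be sketched with a pair of
small helpers (a sketch in Python with hypothetical names; the only
assumption beyond the text is that the two type bits sit at the
high-order end of the 16-bit tag value, as "two high order bits"
suggests):

```python
import struct

# Hypothetical names for the four tag types described in Format.
TAG_LITERAL = 0  # 00 - literal block, x = length
TAG_PATCH   = 1  # 01 - patch block, x = negative offset
TAG_SUB8    = 2  # 10 - substring reference, 8-bit length follows
TAG_SUB16   = 3  # 11 - substring reference, 16-bit length follows

def pack_tag(tag_type, x):
    """Pack a 2-bit type and 14-bit field into a little endian tag."""
    assert 0 <= x < (1 << 14)
    return struct.pack('<H', (tag_type << 14) | x)

def unpack_tag(data):
    """Inverse of pack_tag: return (type, x) from two tag bytes."""
    (word,) = struct.unpack('<H', data)
    return word >> 14, word & 0x3FFF
```

For example, a literal block of length 5 would be packed as the bytes
05 00, with the data following.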
Once decided, the encoded input is written, and the written data is
pushed into the history buffer for later substring matching.

The history buffer is initially zero (i.e. bytes at negative file
offsets are assumed to be zero).  This helps in compressing files
with leading zeroes.

Substring references

Substrings of length less than or equal to 4 bytes are not matched,
since it is more efficient to continue a literal block in that case.
For the sequence below, "..." is some literal, [a..d] are matched
substring bytes:

   [...] [abcdx]

   Using substring reference:
      [lit(2) ...] [sub(3)] [lit(2) x]   -> +6 bytes
   Extending the literal:
      [lit(2) ... abcdx]                 -> +5 bytes

Substrings of length 5 are matched, even though a literal encoding
would produce the same output length.  The rationale here is to flush
output as much as possible, so that there is less chance of a block
overflowing.

Patching bytes

Given a sequence of bytes:

   [match.....] [error] [match........]

where replacing the error byte would provide a longer matched
substring, a pointer to the first match is written, and a patch block
for the error byte is written after the substring block.  Because the
blocks both before and after the error byte match some previous
substring, patching the error byte saves the header bytes of the
second block:

   [sub(3)] [lit(2) error] [sub(3)]   (without patching, 9 bytes)
   [sub(3)] [patch(2) error]          (patched, 6 bytes)

For this to work, the matching substring blocks both before and after
the error byte must be long enough for the patching to be worthwhile;
otherwise the encoder could potentially do better by consolidating
the bytes before and after into literal blocks.  Following the
criterion for substring blocks, error bytes are not matched at
indexes less than or equal to 5 (i.e. the block before the error byte
must qualify as a substring), and not matched unless there are 5 or
more matching bytes following (i.e. the block after the error byte
must qualify as a substring).
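
The byte counts in the patching example above can be double-checked
with a little arithmetic (block sizes are taken from the Format
section: every block starts with a 2-byte tag, and a short substring
reference carries one extra 8-bit length byte):

```python
TAG = 2    # 16-bit tag header
LEN8 = 1   # 8-bit length byte of a short substring reference

sub = TAG + LEN8            # [sub(3)]
lit_error = TAG + 1         # [lit(2) error] - tag plus the error byte
patch_error = TAG + 1       # [patch(2) error] - tag plus the patch byte

unpatched = sub + lit_error + sub   # sub, literal, sub: 9 bytes
patched = sub + patch_error         # one long sub, then a patch: 6 bytes
```

So patching a single error byte between two matched runs saves the
3-byte header of the second substring block.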
It's not proven whether or not a greedy algorithm always produces
optimal results.  Of course, it's much simpler to implement Kirika
using only greedy algorithms.

Decoder

The decoding process is pretty much dictated by the stream format:
read in tags and write the associated data to output.  Because of the
backward patching nature of the stream format, the decoder must
buffer 2^14 bytes of data, and only write data when advancing the
output pointer would cause some of the data to expire (the expired
data is then written).

Differences from LZ77

This section describes why you shouldn't throw away gzip :)  For my
purposes, I wanted a compressor/decompressor that is correct and
simple to implement.  A high quality compressor such as gzip does a
lot more, and is of course a lot more complicated.

Patching bytes
   This is not in LZ77.

No Huffman encoding
   gzip performs Huffman encoding on literal blocks, Kirika doesn't.
   Implementing it would improve compression even on random files,
   but would also significantly increase complexity, since blocks
   would no longer fall on byte boundaries.

No reference distance > 16K
   Kirika uses two byte headers, with two bits reserved for the block
   type, so blocks are limited to 16K sizes.  This works reasonably
   well in practice though.

Negative references
   Byte pointers can reference offsets before the start of the file.
   This compresses leading zeroes better.

One byte hash instead of three byte hash
   gzip uses a very efficient 3 byte hash to find previous
   substrings; Kirika uses only one byte.  This greatly simplifies
   the design and reduces memory usage.

One byte checksum instead of 4 byte Adler-32
   Kirika uses only one byte as the checksum, which is the XOR of all
   bytes.  It's better than nothing :)
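
As a sanity check on the format, a minimal decoder for the stream
described above might look like the sketch below (Python, hypothetical
names; the tag layout is assumed from the Format section, the real
implementation streams through a 2^14-byte window instead of buffering
the whole output, and the checksum is omitted):

```python
import struct

def decode(stream):
    # Non-streaming sketch: keep the whole output in memory so that
    # patch blocks can reach back, instead of the 2^14-byte window.
    out = bytearray()
    pos = 0

    def back(offset):
        # Read a byte at a negative offset from the current output
        # position; offsets before the start of the file read as
        # zero, per the history buffer rule.
        i = len(out) - offset
        return out[i] if i >= 0 else 0

    while pos < len(stream):
        (word,) = struct.unpack_from('<H', stream, pos)  # little endian tag
        pos += 2
        tag_type, x = word >> 14, word & 0x3FFF          # 2 type bits + 14-bit field
        if tag_type == 0:                  # literal block: copy x bytes verbatim
            out += stream[pos:pos + x]
            pos += x
        elif tag_type == 1:                # patch block: replace the byte at -x
            out[len(out) - x] = stream[pos]
            pos += 1
        else:                              # substring reference
            if tag_type == 2:              # 8-bit length follows the tag
                length = stream[pos]
                pos += 1
            else:                          # 16-bit length follows the tag
                (length,) = struct.unpack_from('<H', stream, pos)
                pos += 2
            for _ in range(length):        # byte at a time, so a copy may
                out.append(back(x))        # overlap the bytes it just wrote
    return bytes(out)
```

Note that copying byte-at-a-time lets a reference overlap its own
output, so a short offset with a long length behaves like run-length
encoding (as in the aaaaaaabb[-9] example).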