Usage:

Without command line arguments, Shindou filters stdin to stdout. For example:

  % echo "IOCCC2006!" | ./shindou
  I O C C C 2 0 0 6 !!

With command line arguments, Shindou opens each file named on the command
line and prints its encoding to stdout. For example:

  % ./shindou remarks.txt
  remarks.txt: Shift_JIS

Summary:

Shindou is a filter that adds emphasis to text. For example, given the
following text on stdin:

  blue skies, white clouds, summer is calling me.

Shindou prints the following to stdout:

  blue skies! white clouds! s u m m e r i s c a l l i n g m e !

That alone isn't very difficult. But this is the *International*
Obfuscated C Code Contest, so Shindou also accepts Japanese (your editor
must be able to display Shift_JIS to read the following text):

  青い空、白い雲、夏が呼んでいる。

Output:

  青い空!白い雲!夏 が 呼 ん で い る !

Thus comes the obfuscated part... if you thought some programming
languages are difficult for humans to understand, some human languages
are just as hard for programs to understand. Japanese in particular can
be encoded in many different ways, and often only a human can really
tell them apart. Shindou supports the most popular encodings:

  UTF-8 / UCS-2LE / UCS-2BE (with or without byte order mark)
  EUC-JP / EUC-JIS-2004
  Shift_JIS

The output encoding is the same as the input encoding. The automatic
encoding detection is the main obfuscation; readers are encouraged to
try to figure out how it is done.

Notes/Hints:

* Fitting the space limit: My initial version didn't fit within the
  2K/4K limit :( Actually, my ~25 initial versions didn't fit within the
  size limit, and I had to keep removing features. The final version is
  less than 2K by IOCCC rules. The more artistic-looking code that
  doesn't fit the 2K limit is included for comparison; the two versions
  are equally obfuscated after the preprocessor. The source layout is
  inspired by Shindou from Mizuiro, in particular by the songs "Natsu wa
  Machine Gun" and "Fuyu mo Machine Gun".
* Compatibility/bugs: The source is encoded in ASCII. It compiles with
  gcc or msvc. Note that filtering will not always work with
  msvc-compiled binaries: stdin/stdout are not binary by default under
  win32, so you get an extra CR for every LF written, and UCS-2LE and
  UCS-2BE are guaranteed to break. Shindou assumes that the input text
  is valid in one of the recognized encodings. If it's something else
  (e.g. random binary data), some error recovery is implemented to keep
  things stable. But if it breaks, hey, garbage in, garbage out. Shindou
  usually guesses the file encoding correctly, but she is not 100%
  accurate, and filtering will likely be broken if the encoding is
  guessed wrong. If you see strange filtered output, pass the file name
  on the command line to see what Shindou thinks the file encoding is.

* Encoding encoding-specific information: Most of it is done with
  range-checking opcodes encoded in that long string literal. Wherever
  possible, the opcodes and literals for different encodings overlap,
  both to reduce code size and to make things harder to read.
  Punctuation data are stored as integers, with some delta encoding
  applied to save space. Since significant implementation details are
  stored as opaque data, code beautifiers are unlikely to be very
  helpful.

* Detecting the file encoding automatically:

  1. Check for a byte order mark, since some editors always prepend
     U+FEFF to output text. Its presence unambiguously identifies one of
     the three Unicode encodings.

  2. Differentiate UCS-2 from the byte-oriented encodings. This is done
     here because normal latin1 characters tend to look like Japanese
     when interpreted two bytes at a time. We assume that the input text
     contains at least one linefeed, and check for U+000A. If we find
     U+000A, the input is UCS-2BE; if we find U+0A00, the input is
     UCS-2LE (since U+0A00 is not a character yet).

  3. Differentiate UTF-8/Shift_JIS/EUC-JP using statistical analysis.
     It's "statistical", but nowhere near as complicated as what Mozilla
     does. We assume the input is Japanese, and most meaningful Japanese
     text requires some hiragana characters, so we count which encoding
     decodes the most hiragana characters without encountering an
     illegal character. In case of ties (e.g. the input doesn't contain
     any hiragana), prefer UTF-8, then Shift_JIS.
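The three detection steps above can be sketched in C. This is a minimal
sketch, not the actual implementation (which hides the logic in opcode
data): the function names `detect` and `count_*` are mine, the hiragana
byte ranges are the standard ones for each encoding, and the full
legality checking the real program performs is omitted for brevity.

```c
#include <stddef.h>

static int count_utf8(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 2 < n; i++)
        /* hiragana U+3041..U+3096 encode as E3 81 81 .. E3 82 96 */
        if (s[i] == 0xE3 &&
            ((s[i+1] == 0x81 && s[i+2] >= 0x81 && s[i+2] <= 0xBF) ||
             (s[i+1] == 0x82 && s[i+2] >= 0x80 && s[i+2] <= 0x96))) {
            c++;
            i += 2;
        }
    return c;
}

static int count_sjis(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 1 < n; i++)
        /* hiragana occupy 82 9F .. 82 F1 in Shift_JIS */
        if (s[i] == 0x82 && s[i+1] >= 0x9F && s[i+1] <= 0xF1) {
            c++;
            i++;
        }
    return c;
}

static int count_euc(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 1 < n; i++)
        /* hiragana occupy A4 A1 .. A4 F3 in EUC-JP */
        if (s[i] == 0xA4 && s[i+1] >= 0xA1 && s[i+1] <= 0xF3) {
            c++;
            i++;
        }
    return c;
}

static const char *detect(const unsigned char *s, size_t n)
{
    /* Step 1: a byte order mark settles the Unicode encodings outright. */
    if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE) return "UCS-2LE";
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF) return "UCS-2BE";
    if (n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return "UTF-8";

    /* Step 2: assume at least one linefeed.  The bytes 00 0A at an even
       offset are U+000A in big-endian order; 0A 00 would read as U+0A00,
       which is not a character, so the input must be little-endian. */
    for (size_t i = 0; i + 1 < n; i += 2) {
        if (s[i] == 0x00 && s[i+1] == 0x0A) return "UCS-2BE";
        if (s[i] == 0x0A && s[i+1] == 0x00) return "UCS-2LE";
    }

    /* Step 3: whichever byte-oriented encoding decodes the most
       hiragana wins; ties prefer UTF-8, then Shift_JIS. */
    int u = count_utf8(s, n), sj = count_sjis(s, n), e = count_euc(s, n);
    if (u >= sj && u >= e) return "UTF-8";
    if (sj >= e)           return "Shift_JIS";
    return "EUC-JP";
}
```

For example, the bytes of がん come out as E3 81 8C E3 82 93 in UTF-8,
82 AA 82 F1 in Shift_JIS, and A4 AC A4 F3 in EUC-JP; `detect` reports
each correctly because only the matching counter finds hiragana.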
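The emphasis effect described in the Summary can also be sketched, here
for the ASCII case only. The clause rule is my reading of the "blue
skies" example, not the actual implementation: commas become "!", and
the final clause is letter-spaced with its own spaces dropped and the
terminating period turned into "!". The real program works on multibyte
Japanese punctuation after detecting the encoding, and may treat other
sentence endings (like the "!!" in the stdin example) differently.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical ASCII-only sketch: `out` must be large enough to hold
   roughly twice the input (every letter gains a separating space). */
static void emphasize(const char *in, char *out)
{
    const char *last = strrchr(in, ',');  /* final clause follows the last comma */
    const char *p = in;
    char *o = out;

    /* earlier clauses pass through, with each ',' turned into '!' */
    for (; last && p <= last; p++)
        *o++ = (*p == ',') ? '!' : *p;

    /* letter-space the final clause: drop its spaces, put one space
       before each remaining character, and turn '.' into '!' */
    for (; *p; p++) {
        if (isspace((unsigned char)*p))
            continue;
        if (o != out)
            *o++ = ' ';
        *o++ = (*p == '.') ? '!' : *p;
    }
    *o = '\0';
}
```

Run on the Summary's example, this reproduces its output exactly:
"blue skies, white clouds, summer is calling me." becomes
"blue skies! white clouds! s u m m e r i s c a l l i n g m e !".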