Usage:

Without command line arguments, Shindou filters stdin to stdout. For example:

  % echo "IOCCC2006!" | ./shindou
  I O C C C 2 0 0 6 !!

With command line arguments, Shindou opens each file named on the command
line and prints its encoding to stdout. For example:

  % ./shindou remarks.txt
  remarks.txt: Shift_JIS

Summary:

Shindou is a filter that adds emphasis to text. For example, given the
following text on stdin:

  blue skies, white clouds, summer is calling me.

Shindou prints the following to stdout:

  blue skies! white clouds! s u m m e r i s c a l l i n g m e !

That alone isn't very difficult. But this is the *International*
Obfuscated C Code Contest, so Shindou also accepts Japanese (your editor
must be able to display Shift_JIS to read the following text):

  青い空、白い雲、夏が呼んでいる。

Output:

  青い空!白い雲!夏 が 呼 ん で い る !

Thus comes the obfuscated part... if you thought some programming
languages are difficult for humans to understand, some human languages
are just as hard for programs to understand. Japanese in particular can
be encoded in many different ways, and often only a human can really
tell them apart. Shindou supports the most popular encodings:

  UTF-8 / UCS-2LE / UCS-2BE (with or without byte order mark)
  EUC-JP / EUC-JIS-2004
  Shift_JIS

The output encoding is the same as the input encoding. The automatic
encoding detection is the main obfuscation; readers are encouraged to
try to figure out how it is done.

Notes/Hints:

* Fitting the space limit: My initial version didn't fit within the
  2K/4K limit :( Actually, my ~25 initial versions didn't fit within the
  size limit, and I had to keep removing features. The final version is
  less than 2K by IOCCC rules. The more artistic-looking code that
  doesn't fit the 2K limit is included for comparison; the two versions
  are equally obfuscated after the preprocessor. The source layout is
  inspired by Shindou from Mizuiro, in particular by the songs "Natsu wa
  Machine Gun" and "Fuyu mo Machine Gun".
* Compatibility/bugs: The source is encoded in ASCII. It compiles with
  gcc or msvc. Note that filtering will not always work with
  msvc-compiled binaries: stdin/stdout are not binary by default under
  win32, so you get an extra CR for every LF written, and UCS-2LE and
  UCS-2BE are guaranteed to break. Shindou assumes that the input text
  is valid in one of the recognized encodings. If it's something else
  (e.g. random binary data), some error recovery is implemented to keep
  things stable. But if it breaks, hey, garbage in, garbage out. Shindou
  usually guesses the file encoding correctly, but she is not 100%
  accurate, and filtering will likely be broken if the encoding is
  guessed wrong. If you see strange filtered output, pass the file name
  on the command line to see what Shindou thinks the file encoding is.

* Encoding encoding-specific information: Most of it is done with
  range-checking opcodes encoded in that long string literal. Wherever
  possible, the opcodes and literals for different encodings overlap,
  both to reduce code size and to make things harder to read.
  Punctuation data are stored as integers, with some delta encoding
  applied to save space. Since significant implementation details are
  stored as opaque data, code beautifiers are unlikely to be very
  helpful.

* Detecting the file encoding automatically:

  1. Check for a byte order mark, since some editors always prepend
     U+FEFF to output text. Its presence unambiguously identifies one of
     the three Unicode encodings.

  2. Differentiate UCS-2 from the byte-oriented encodings. This is done
     here because normal latin1 characters tend to look like Japanese
     when interpreted two bytes at a time. We assume that the input text
     contains at least one linefeed, and check for U+000A. If we find
     U+000A, the input is UCS-2BE; if we find U+0A00, the input is
     UCS-2LE (since U+0A00 is not a character yet).

  3. Differentiate UTF-8/Shift_JIS/EUC-JP using statistical analysis.
     It's "statistical", but nowhere near as complicated as what Mozilla
     does. We assume the input is Japanese, and most meaningful Japanese
     text requires some hiragana characters, so we count which encoding
     decodes the most hiragana characters without encountering an
     illegal character. In case of ties (e.g. the input doesn't contain
     any hiragana), prefer UTF-8, then Shift_JIS.
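The three detection steps above can be sketched in C. This is a minimal
sketch, not the actual implementation (which hides the logic in opcode
data): the function names `detect` and `count_*` are mine, the hiragana
byte ranges are the standard ones for each encoding, and the full
legality checking the real program performs is omitted for brevity.

```c
#include <stddef.h>

static int count_utf8(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 2 < n; i++)
        /* hiragana U+3041..U+3096 encode as E3 81 81 .. E3 82 96 */
        if (s[i] == 0xE3 &&
            ((s[i+1] == 0x81 && s[i+2] >= 0x81 && s[i+2] <= 0xBF) ||
             (s[i+1] == 0x82 && s[i+2] >= 0x80 && s[i+2] <= 0x96))) {
            c++;
            i += 2;
        }
    return c;
}

static int count_sjis(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 1 < n; i++)
        /* hiragana occupy 82 9F .. 82 F1 in Shift_JIS */
        if (s[i] == 0x82 && s[i+1] >= 0x9F && s[i+1] <= 0xF1) {
            c++;
            i++;
        }
    return c;
}

static int count_euc(const unsigned char *s, size_t n)
{
    int c = 0;
    for (size_t i = 0; i + 1 < n; i++)
        /* hiragana occupy A4 A1 .. A4 F3 in EUC-JP */
        if (s[i] == 0xA4 && s[i+1] >= 0xA1 && s[i+1] <= 0xF3) {
            c++;
            i++;
        }
    return c;
}

static const char *detect(const unsigned char *s, size_t n)
{
    /* Step 1: a byte order mark settles the Unicode encodings outright. */
    if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE) return "UCS-2LE";
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF) return "UCS-2BE";
    if (n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return "UTF-8";

    /* Step 2: assume at least one linefeed.  The bytes 00 0A at an even
       offset are U+000A in big-endian order; 0A 00 would read as U+0A00,
       which is not a character, so the input must be little-endian. */
    for (size_t i = 0; i + 1 < n; i += 2) {
        if (s[i] == 0x00 && s[i+1] == 0x0A) return "UCS-2BE";
        if (s[i] == 0x0A && s[i+1] == 0x00) return "UCS-2LE";
    }

    /* Step 3: whichever byte-oriented encoding decodes the most
       hiragana wins; ties prefer UTF-8, then Shift_JIS. */
    int u = count_utf8(s, n), sj = count_sjis(s, n), e = count_euc(s, n);
    if (u >= sj && u >= e) return "UTF-8";
    if (sj >= e)           return "Shift_JIS";
    return "EUC-JP";
}
```

For example, the bytes of がん come out as E3 81 8C E3 82 93 in UTF-8,
82 AA 82 F1 in Shift_JIS, and A4 AC A4 F3 in EUC-JP; `detect` reports
each correctly because only the matching counter finds hiragana.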
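The emphasis effect described in the Summary can also be sketched, here
for the ASCII case only. The clause rule is my reading of the "blue
skies" example, not the actual implementation: commas become "!", and
the final clause is letter-spaced with its own spaces dropped and the
terminating period turned into "!". The real program works on multibyte
Japanese punctuation after detecting the encoding, and may treat other
sentence endings (like the "!!" in the stdin example) differently.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical ASCII-only sketch: `out` must be large enough to hold
   roughly twice the input (every letter gains a separating space). */
static void emphasize(const char *in, char *out)
{
    const char *last = strrchr(in, ',');  /* final clause follows the last comma */
    const char *p = in;
    char *o = out;

    /* earlier clauses pass through, with each ',' turned into '!' */
    for (; last && p <= last; p++)
        *o++ = (*p == ',') ? '!' : *p;

    /* letter-space the final clause: drop its spaces, put one space
       before each remaining character, and turn '.' into '!' */
    for (; *p; p++) {
        if (isspace((unsigned char)*p))
            continue;
        if (o != out)
            *o++ = ' ';
        *o++ = (*p == '.') ? '!' : *p;
    }
    *o = '\0';
}
```

Run on the Summary's example, this reproduces its output exactly:
"blue skies, white clouds, summer is calling me." becomes
"blue skies! white clouds! s u m m e r i s c a l l i n g m e !".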