The Old Blog Archive, 2005-2009

Some History on the .cin Format, and on Apple’s .cin Support

Eric Rasmussen of Yale Chinese Mac has started a discussion on the .cin support in Apple’s Mac OS X Leopard, an addition to their existing input method framework that gives users another way to create their own input methods. I was invited to share what I know about the format, so I wrote a long reply to Eric’s post. The length of the follow-up seems to warrant a standalone blog entry, so here it is. I’ll put more links in the text later.

History of the .cin Format

.cin was first introduced by Xcin, an input method framework for X11 developed in the mid 1990s, as a data format for table-based input methods. By table-based I mean input methods that can be implemented, or seen, as a table look-up mechanism. Around 90% of input methods (Chinese and beyond) can be implemented that way. Apple’s .inputplugin also belongs to that category. Almost every mainstream input method framework supports at least one form of user-customizable IME creation. .cin seems to have become one of the standard data formats because it’s simple and many user-generated tables are already in wide circulation.

I have very limited knowledge of Xcin and other frameworks, but in the early days .cin was intended as a source format, not something to be consumed directly by the input method framework (or more precisely, by the table-based input method “generator”). Back then a .cin could also use any encoding recognized by the framework, so phone.cin (renamed to bpmf.cin in OV) was encoded in Big5, pinyin.cin in GB, and so on. When we were developing the “generic” module (first named OVIMXcin, later renamed to OVIMGeneric) to support .cin in OpenVanilla, we made two decisions. First, we no longer require the user to run a compiler/converter to turn a .cin into a binary format, as used to be the case; the .cin is consumed by the input method module directly. Second, all .cin files must use UTF-8 encoding. This opened the door to a bigger character set and the famous “♨” input method.

What’s in a .cin?

So what constitutes a valid .cin file? For OpenVanilla, a .cin file consists of three sections:

  1. a header consisting of directives beginning with “%”, like %ename, %selkey, %endkey. Some of them are metadata, others are control directives;
  2. a keyname block between the directives “%keyname begin” and “%keyname end”. This tells the generic input method how to map each key typed to a character displayed in the composing stage (mostly to represent radicals in radical-based input methods); and
  3. a chardef block between the directives “%chardef begin” and “%chardef end”. This is the body of the data table. “chardef” is something of an anachronistic misnomer: it used to define the mapping from key sequences to characters (hence the name), but modern implementations like OV and gcin allow phrases in this block.

Different frameworks have implemented the details somewhat differently. OV’s implementation disallows Windows-style CR LF line endings (so only the UNIX-style \n is used, which is also what OS X uses), and comment lines (beginning with #) are not allowed in the chardef block.
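To make the three sections concrete, here is a minimal Python sketch of a reader for the layout described above. It is not OpenVanilla's actual parser; the function name `parse_cin` and the tiny sample table are made up for illustration, and the handling of comments (skipped only outside the blocks) is an assumption based on the rule just mentioned.

```python
# A made-up toy table, only meant to show the three sections.
SAMPLE_CIN = """\
%ename Demo
%selkey 123456789
%endkey ,.
%keyname begin
a 日
b 月
%keyname end
%chardef begin
a 日
b 月
ab 明
ab 朋
%chardef end
"""


def parse_cin(text):
    """Split .cin text into header directives, keyname map and chardef map."""
    header, keyname, chardef = {}, {}, {}
    block = None  # None, "keyname" or "chardef"
    for line in text.split("\n"):  # UNIX-style line endings only
        if not line.strip():
            continue
        if line.startswith("%"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue
            name = parts[0]
            arg = parts[1].strip() if len(parts) > 1 else ""
            if name in ("keyname", "chardef") and arg in ("begin", "end"):
                block = name if arg == "begin" else None
            else:
                header[name] = arg
            continue
        if block is None:
            continue  # comments and stray lines outside the blocks
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip malformed rows
        key, value = fields[0], fields[1].strip()
        if block == "keyname":
            keyname[key] = value
        else:
            # One key sequence may map to several candidates.
            chardef.setdefault(key, []).append(value)
    return header, keyname, chardef
```

Calling `parse_cin(SAMPLE_CIN)` gives a header dictionary (with `ename`, `selkey` and `endkey`), a keyname map, and a chardef map in which `ab` has two candidates in file order.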

Although .cin contains enough information for key-to-character/phrase mapping, many input methods (like 倉頡 Cangjie/“Changjei” or 簡易 Simplex/Jianyi) require finer control. For OpenVanilla, the control is provided in the form of input method preferences (with some mind-boggling names like “force composition when reaching maximum length of radical” or “use space to select the 1st candidate”). Different input methods require different controls, and those are a must: failure to provide them yields barely usable input methods. gcin differs from OV’s implementation in that it allows those control directives to be expressed in the .cin header, with its own directive extensions.
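As a rough illustration of why such a control matters, the sketch below models one of them, forcing composition once the key sequence reaches its maximum length. The `Composer` class and its parameter names are hypothetical and do not reflect how OpenVanilla or gcin actually implement the preference.

```python
class Composer:
    """Toy composing buffer for a table-based input method (hypothetical)."""

    def __init__(self, table, max_len, auto_compose_at_max_len=False):
        self.table = table      # key sequence -> list of candidates
        self.max_len = max_len  # longest key sequence allowed
        self.auto_compose_at_max_len = auto_compose_at_max_len
        self.buffer = ""

    def _compose(self):
        # Look up the buffered keys and reset the buffer.
        candidates = self.table.get(self.buffer, [])
        self.buffer = ""
        return candidates

    def type_key(self, key):
        """Feed one key; return candidates if composition happened, else None."""
        if key == " ":
            return self._compose()
        if len(self.buffer) < self.max_len:
            self.buffer += key
        if self.auto_compose_at_max_len and len(self.buffer) == self.max_len:
            return self._compose()
        return None
```

With the option off, the user has to press space after every sequence; with it on, a full-length sequence composes immediately, which is the behaviour some input methods expect.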

OpenVanilla’s Own Take on .cin

OpenVanilla’s repository of .cin is available at: http://openvanilla.googlecode.com/svn/trunk/Modules/SharedData/

Zonble has written an excellent tutorial (in Chinese) on how to create your own input method by writing up a .cin, which has become something of a standard text: http://docs.google.com/View?docid=ah6d8th954vw_201fd5dkx

Technically, .cin is really just a set of key-value pairs with its own conventions. OV makes heavy use of .cin as a format: things like reverse radical/pinyin lookup and associated phrases are also done with .cin-based data tables. I see it as a good sign that Apple has adopted a popular (and mostly consistent and cross-framework compatible) data format for Leopard.
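Because the table is just key-value pairs, a reverse lookup can be pictured as the same table inverted. The following Python sketch assumes the chardef data has already been loaded into a dictionary; the function names are invented for this example and are not OV's.

```python
def build_reverse_table(chardef):
    """Invert {key sequence: [outputs]} into {output: [key sequences]}."""
    reverse = {}
    for keys, outputs in chardef.items():
        for output in outputs:
            reverse.setdefault(output, []).append(keys)
    return reverse


def spell_out(keys, keyname):
    """Render a key sequence using the display names from the keyname block."""
    return "".join(keyname.get(k, k) for k in keys)
```

For a toy table where `ab` maps to 明, `build_reverse_table` lets you ask which keys produce 明, and `spell_out` turns those keys back into the radicals shown while composing.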

Leopard’s Support of .cin

So what about Leopard? As far as I know, dropping a UTF-8-encoded .cin into ~/Library/Input Methods or /Library/Input Methods and then logging out and back in just works. A new input method, using the name defined in the .cin, shows up in the Input Menu tab of the International preferences panel. I’m not aware of any per-method controls so far (I might be very ignorant on this).
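The drop-in step can be scripted. This is only a sketch under the assumptions stated above (a per-user Input Methods folder and a UTF-8 requirement); `install_cin` is a hypothetical helper, and the UTF-8 check is a convenience added here, not something Leopard is known to perform.

```python
import shutil
from pathlib import Path


def install_cin(source, dest_dir=None):
    """Copy a .cin file into an Input Methods folder after a UTF-8 check."""
    source = Path(source)
    if dest_dir is None:
        # Per-user location mentioned in the text.
        dest_dir = Path.home() / "Library" / "Input Methods"
    dest_dir = Path(dest_dir)
    # Raises UnicodeDecodeError if the file is not valid UTF-8
    # (for example, an old Big5- or GB-encoded table).
    source.read_bytes().decode("utf-8")
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source, dest_dir / source.name))
```

After copying, you would still need to log out and back in for the new input method to appear.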

As for limitations, I’m not aware of any either. OV’s own implementation (like many others) is limited only by memory and your patience (loading a .cin with 200,000 entries on a G3 is no small thing; a database-backed design would solve the problem). Leopard’s own take should not differ much, so it should be very flexible and easily customizable.
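One way to picture a database-backed design is to load the chardef rows into an indexed table once, then query per key sequence instead of holding the whole table in memory. The sketch below uses Python's built-in sqlite3 module; the table and column names are assumptions made for this example, not an existing schema.

```python
import sqlite3


def build_database(chardef_rows, path=":memory:"):
    """Load (key sequence, output) rows into an indexed SQLite table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE chardef (seq TEXT NOT NULL, output TEXT NOT NULL)")
    conn.executemany("INSERT INTO chardef (seq, output) VALUES (?, ?)", chardef_rows)
    # The index is what keeps lookups fast on large tables.
    conn.execute("CREATE INDEX chardef_seq ON chardef (seq)")
    conn.commit()
    return conn


def lookup(conn, seq):
    """Return candidates for one key sequence, in insertion order."""
    rows = conn.execute(
        "SELECT output FROM chardef WHERE seq = ? ORDER BY rowid", (seq,)
    )
    return [output for (output,) in rows]
```

Passing a file path instead of `":memory:"` would let the conversion happen once, so later launches only open the database rather than re-reading the .cin.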

One Response to “Some History on the .cin Format, and on Apple’s .cin Support”

  1. on 12 Nov 2007 at 5:57 pm, transient proofreader

    Cangjei → Cāngjié