diff man/lispref/mule.texi @ 428:3ecd8885ac67 r21-2-22
Import from CVS: tag r21-2-22
author: cvs
date: Mon, 13 Aug 2007 11:28:15 +0200
children: 8de8e3f6228a
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/man/lispref/mule.texi	Mon Aug 13 11:28:15 2007 +0200
@@ -0,0 +1,1505 @@
+@c -*-texinfo-*-
+@c This is part of the XEmacs Lisp Reference Manual.
+@c Copyright (C) 1996 Ben Wing.
+@c See the file lispref.texi for copying conditions.
+@setfilename ../../info/internationalization.info
+@node MULE, Tips, Internationalization, top
+@chapter MULE
+
+@dfn{MULE} is the name originally given to the version of GNU Emacs
+extended for multi-lingual (and in particular Asian-language) support.
+``MULE'' is short for ``MUlti-Lingual Emacs''.  It was originally called
+Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
+``Japan''), when it only provided support for Japanese.  XEmacs
+refers to its multi-lingual support as @dfn{MULE support} since it
+is based on @dfn{MULE}.
+
+@menu
+* Internationalization Terminology::
+                        Definition of various internationalization terms.
+* Charsets::            Sets of related characters.
+* MULE Characters::     Working with characters in XEmacs/MULE.
+* Composite Characters:: Making new characters by overstriking other ones.
+* ISO 2022::            An international standard for charsets and encodings.
+* Coding Systems::      Ways of representing a string of chars using integers.
+* CCL::                 A special language for writing fast converters.
+* Category Tables::     Subdividing charsets into groups.
+@end menu
+
+@node Internationalization Terminology
+@section Internationalization Terminology
+
+   In internationalization terminology, a string of text is divided up
+into @dfn{characters}, which are the printable units that make up the
+text.  A single character is (for example) a capital @samp{A}, the
+number @samp{2}, a Katakana character, a Kanji ideograph (an
+@dfn{ideograph} is a ``picture'' character, such as those used in Japanese
+Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
+of such ideographs in each language), etc.  The basic property of a
+character is its shape.  Note that the same character may be drawn by
+two different people (or in two different fonts) in slightly different
+ways, although the basic shape will be the same.
+
+  In some cases, the differences will be significant enough that it is
+actually possible to identify two or more distinct shapes that both
+represent the same character.  For example, the lowercase letters
+@samp{a} and @samp{g} each have two distinct possible shapes -- the
+@samp{a} can optionally have a curved tail projecting off the top, and
+the @samp{g} can be formed either of two loops, or of one loop and a
+tail hanging off the bottom.  Such distinct possible shapes of a
+character are called @dfn{glyphs}.  The important characteristic of two
+glyphs making up the same character is that the choice between one or
+the other is purely stylistic and has no linguistic effect on a word
+(this is the reason why a capital @samp{A} and lowercase @samp{a}
+are different characters rather than different glyphs -- e.g.
+@samp{Aspen} is a city while @samp{aspen} is a kind of tree).
+
+  Note that @dfn{character} and @dfn{glyph} are used differently
+here than elsewhere in XEmacs.
+
+  A @dfn{character set} is simply a set of related characters.  ASCII,
+for example, is a set of 94 characters (or 128, if you count
+non-printing characters).  Other character sets are ISO8859-1 (ASCII
+plus various accented characters and other international symbols),
+JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
+(Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
+GB2312 (Mainland Chinese Hanzi), etc.
+
+  Every character set has one or more @dfn{orderings}, which can be
+viewed as a way of assigning a number (or set of numbers) to each
+character in the set.  For most character sets, there is a standard
+ordering, and in fact all of the character sets mentioned above define a
+particular ordering.  ASCII, for example, places letters in their
+``natural'' order, puts uppercase letters before lowercase letters,
+numbers before letters, etc.  Note that for many of the Asian character
+sets, there is no natural ordering of the characters.  The actual
+orderings are based on one or more salient characteristics, of which
+there are many to choose from -- e.g. number of strokes, common
+radicals, phonetic ordering, etc.
+
+  The numbers assigned to any particular character are called
+the character's @dfn{position codes}.  The number of position codes
+required to index a particular character in a character set is called
+the @dfn{dimension} of the character set.  ASCII, being a relatively
+small character set, is of dimension one, and each character in the
+set is indexed using a single position code, in the range 0 through
+127 (if non-printing characters are included) or 33 through 126
+(if only the printing characters are considered).  JISX0208, i.e.
+Japanese Kanji, has thousands of characters, and is of dimension two --
+every character is indexed by two position codes, each in the range
+33 through 126. (Note that the choice of the range here is somewhat
+arbitrary.  Although a character set such as JISX0208 defines an
+@emph{ordering} of all its characters, it does not define the actual
+mapping between numbers and characters.  You could just as easily
+index the characters in JISX0208 using numbers in the range 0 through
+93, 1 through 94, 2 through 95, etc.  The reason for the actual range
+chosen is so that the position codes match up with the actual values
+used in the common encodings.)
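+
+  As a rough illustration of position codes (sketched in Python rather
+than XEmacs Lisp, and not part of XEmacs), the following recovers the
+two JISX0208 position codes of a character from its EUC-JP bytes.  It
+assumes Python's standard @code{euc_jp} codec; the helper name
+@code{jisx0208_position_codes} is invented for this example.
+
```python
def jisx0208_position_codes(ch):
    """Return the two JISX0208 position codes (each 33..126) for CH.

    Assumption: a JISX0208 character is encoded by Python's euc_jp
    codec as exactly two bytes, each in the range 0xA1..0xFE.
    """
    raw = ch.encode("euc_jp")
    if len(raw) != 2 or any(b < 0xA1 or b > 0xFE for b in raw):
        raise ValueError("not a JISX0208 character: %r" % ch)
    # EUC stores each position code with the high bit set; clear it.
    return tuple(b & 0x7F for b in raw)


# U+4E9C is the first Kanji in JISX0208 (row 16, cell 1).
print(jisx0208_position_codes("\u4e9c"))
# ASCII is of dimension one: its single position code is the code itself.
print(ord("A"))
```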
+
+  An @dfn{encoding} is a way of numerically representing characters from
+one or more character sets as a stream of like-sized numerical values
+called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
+quantities.  If an encoding encompasses only one character set, then the
+position codes for the characters in that character set could be used
+directly. (This is the case with ASCII, and as a result, most people do
+not understand the difference between a character set and an encoding.)
+This is not possible, however, if more than one character set is to be
+used in the encoding.  For example, printed Japanese text typically
+requires characters from multiple character sets -- ASCII, JISX0208, and
+JISX0212, to be specific.  Each of these is indexed using one or more
+position codes in the range 33 through 126, so the position codes could
+not be used directly or there would be no way to tell which character
+was meant.  Different Japanese encodings handle this differently -- JIS
+uses special escape characters to denote different character sets; EUC
+sets the high bit of the position codes for JISX0208 and JISX0212, and
+puts a special extra byte before each JISX0212 character; etc. (JIS,
+EUC, and most of the other encodings you will encounter are 7-bit or
+8-bit encodings.  There is one common 16-bit encoding, which is Unicode;
+this strives to represent all the world's characters in a single large
+character set.  32-bit encodings are generally used internally in
+programs to simplify the code that manipulates them; however, they are
+not much used externally because they are not very space-efficient.)
+
+  Encodings are classified as either @dfn{modal} or @dfn{non-modal}.  In
+a @dfn{modal encoding}, there are multiple states that the encoding can be in,
+and the interpretation of the values in the stream depends on the
+current global state of the encoding.  Special values in the encoding,
+called @dfn{escape sequences}, are used to change the global state.
+JIS, for example, is a modal encoding.  The bytes @samp{ESC $ B}
+indicate that, from then on, bytes are to be interpreted as position
+codes for JISX0208, rather than as ASCII.  This effect is cancelled
+using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
+current state is to ASCII''.  To switch to JISX0212, the escape sequence
+@samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do
+in fact begin with @samp{ESC}.  This is not necessarily the case,
+however.)
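+
+  The effect of the global state can be sketched as follows (in Python,
+purely as an illustration outside XEmacs).  The hypothetical helper
+@code{split_by_state} recognizes only the three escape sequences named
+above and splits a JIS byte stream into runs, each tagged with the
+charset in effect.  The sample input assumes Python's standard
+@code{iso2022_jp} codec.
+
```python
# The three escape sequences mentioned in the text; real JIS has more.
DESIGNATIONS = {
    b"\x1b(B": "ascii",
    b"\x1b$B": "jisx0208",
    b"\x1b$(D": "jisx0212",
}


def split_by_state(data):
    """Split DATA into (charset, payload) runs; the initial state is ASCII."""
    runs = []
    state = "ascii"
    payload = bytearray()
    i = 0
    while i < len(data):
        for seq, charset in DESIGNATIONS.items():
            if data.startswith(seq, i):
                # An escape sequence ends the current run and changes state.
                if payload:
                    runs.append((state, bytes(payload)))
                    payload = bytearray()
                state = charset
                i += len(seq)
                break
        else:
            payload.append(data[i])
            i += 1
    if payload:
        runs.append((state, bytes(payload)))
    return runs


# "A", the Kanji U+4E9C, then "B".
print(split_by_state("A\u4e9cB".encode("iso2022_jp")))
```

+  Note that the payload of the JISX0208 run consists of the same byte
+values that would spell @samp{0!} in ASCII; only the preceding escape
+sequence tells the two readings apart.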
+
+A @dfn{non-modal encoding} has no global state that extends past the
+character currently being interpreted.  EUC, for example, is a
+non-modal encoding.  Characters in JISX0208 are encoded by setting
+the high bit of the position codes, and characters in JISX0212 are
+encoded by doing the same but also prefixing the character with the
+byte 0x8F.
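+
+  A minimal sketch of this (in Python, for illustration only): the
+hypothetical generator @code{euc_jp_chars} decides the charset of each
+character from its own leading byte, with no state carried from one
+character to the next.  The 0x8E prefix for half-width Katakana is not
+described above; it is included on the assumption that the input is
+Japanese EUC.
+
```python
def euc_jp_chars(data):
    """Yield (charset, position_codes) for each character in EUC-JP DATA."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # High bit clear: a single ASCII byte.
            yield ("ascii", (lead,))
            i += 1
        elif lead == 0x8F:
            # Prefix byte 0x8F: a JISX0212 character follows.
            yield ("jisx0212", (data[i + 1] & 0x7F, data[i + 2] & 0x7F))
            i += 3
        elif lead == 0x8E:
            # Prefix byte 0x8E: one half-width Katakana byte follows.
            yield ("katakana-jisx0201", (data[i + 1] & 0x7F,))
            i += 2
        else:
            # Two bytes with the high bit set: a JISX0208 character.
            yield ("jisx0208", (lead & 0x7F, data[i + 1] & 0x7F))
            i += 2


sample = b"A\xb0\xa1\x8f\xb0\xa1"
for charset, codes in euc_jp_chars(sample):
    print(charset, codes)
```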
+
+  The advantage of a modal encoding is that it is generally more
+space-efficient, and is easily extendable because there are essentially
+an arbitrary number of escape sequences that can be created.  The
+disadvantage, however, is that it is much more difficult to work with
+if it is not being processed in a sequential manner.  In the non-modal
+EUC encoding, for example, the byte 0x41 always refers to the letter
+@samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
+one of the two position codes in a JISX0208 character, or one of the
+two position codes in a JISX0212 character.  Determining exactly which
+one is meant could be difficult and time-consuming if the previous
+bytes in the string have not already been processed.
+
+  Non-modal encodings are further divided into @dfn{fixed-width} and
+@dfn{variable-width} formats.  A fixed-width encoding always uses
+the same number of words per character, whereas a variable-width
+encoding does not.  EUC is a good example of a variable-width
+encoding: one to three bytes are used per character, depending on
+the character set.  16-bit and 32-bit encodings are nearly always
+fixed-width, and this is in fact one of the main reasons for using
+an encoding with a larger word size.  The advantages of fixed-width
+encodings should be obvious.  The advantages of variable-width
+encodings are that they are generally more space-efficient and allow
+for compatibility with existing 8-bit encodings such as ASCII.
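+
+  The difference can be seen by counting bytes per character.  The
+following Python sketch (an illustration, not XEmacs code) assumes
+Python's standard @code{euc_jp} codec as the variable-width encoding and
+@code{utf_32_be} as a stand-in for a fixed-width 32-bit encoding; the
+helper name @code{bytes_per_char} is invented here.
+
```python
def bytes_per_char(text, encoding):
    """Return the number of bytes ENCODING uses for each character of TEXT."""
    return [len(ch.encode(encoding)) for ch in text]


# "A", a JISX0208 Kanji (U+4E9C), and a half-width Katakana (U+FF71).
sample = "A\u4e9c\uff71"
print(bytes_per_char(sample, "euc_jp"))     # variable width
print(bytes_per_char(sample, "utf_32_be"))  # fixed width
```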
+
+  Note that the bytes in an 8-bit encoding are often referred to
+as @dfn{octets} rather than simply as bytes.  This terminology
+dates back to the days before 8-bit bytes were universal, when
+some computers had 9-bit bytes, others had 10-bit bytes, etc.
+
+@node Charsets
+@section Charsets
+
+  A @dfn{charset} in MULE is an object that encapsulates a
+particular character set as well as an ordering of those characters.
+Charsets are permanent objects and are named using symbols, like
+faces.
+
+@defun charsetp object
+This function returns non-@code{nil} if @var{object} is a charset.
+@end defun
+
+@menu
+* Charset Properties::          Properties of a charset.
+* Basic Charset Functions::     Functions for working with charsets.
+* Charset Property Functions::  Functions for accessing charset properties.
+* Predefined Charsets::         Predefined charset objects.
+@end menu
+
+@node Charset Properties
+@subsection Charset Properties
+
+  Charsets have the following properties:
+
+@table @code
+@item name
+A symbol naming the charset.  Every charset must have a different name;
+this allows a charset to be referred to using its name rather than
+the actual charset object.
+@item doc-string
+A documentation string describing the charset.
+@item registry
+A regular expression matching the font registry field for this character
+set.  For example, both the @code{ascii} and @code{latin-iso8859-1}
+charsets use the registry @code{"ISO8859-1"}.  This field is used to
+choose an appropriate font when the user gives a general font
+specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
+14-point upright medium-weight Courier font.
+@item dimension
+Number of position codes used to index a character in the character set.
+XEmacs/MULE can only handle character sets of dimension 1 or 2.
+This property defaults to 1.
+@item chars
+Number of characters in each dimension.  In XEmacs/MULE, the only
+allowed values are 94 or 96. (There are a couple of pre-defined
+character sets, such as ASCII, that do not follow this, but you cannot
+define new ones like this.) Defaults to 94.  Note that if the dimension
+is 2, the character set thus described is 94x94 or 96x96.
+@item columns
+Number of columns used to display a character in this charset.
+Only used in TTY mode. (Under X, the actual width of a character
+can be derived from the font used to display the characters.)
+If unspecified, defaults to the dimension. (This is almost
+always the correct value, because character sets with dimension 2
+are usually ideograph character sets, which need two columns to
+display the intricate ideographs.)
+@item direction
+A symbol, either @code{l2r} (left-to-right) or @code{r2l}
+(right-to-left).  Defaults to @code{l2r}.  This specifies the
+direction that the text should be displayed in, and will be
+left-to-right for most charsets but right-to-left for Hebrew
+and Arabic. (Right-to-left display is not currently implemented.)
+@item final
+Final byte of the standard ISO 2022 escape sequence designating this
+charset.  Must be supplied.  Each combination of (@var{dimension},
+@var{chars}) defines a separate namespace for final bytes, and each
+charset within a particular namespace must have a different final byte.
+Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
+dimension == 1, and 0x30 - 0x5F if dimension == 2.  Note also that final
+bytes in the range 0x30 - 0x3F are reserved for user-defined (not
+official) character sets.  For more information on ISO 2022, see @ref{Coding
+Systems}.
+@item graphic
+0 (use left half of font on output) or 1 (use right half of font on
+output).  Defaults to 0.  This specifies how to convert the position
+codes that index a character in a character set into an index into the
+font used to display the character set.  With @code{graphic} set to 0,
+position codes 33 through 126 map to font indices 33 through 126; with
+it set to 1, position codes 33 through 126 map to font indices 161
+through 254 (i.e. the same number but with the high bit set).  For
+example, for a font whose registry is ISO8859-1, the left half of the
+font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
+half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
+@item ccl-program
+A compiled CCL program used to convert a character in this charset into
+an index into the font.  This is in addition to the @code{graphic}
+property.  If a CCL program is defined, the position codes of a
+character will first be processed according to @code{graphic} and
+then passed through the CCL program, with the resulting values used
+to index the font.
+
+This is used, for example, in the Big5 character set (used in Taiwan).
+This character set is not ISO-2022-compliant, and its size (94x157) does
+not fit within the maximum 96x96 size of ISO-2022-compliant character
+sets.  As a result, XEmacs/MULE splits it (in a rather complex fashion,
+so as to group the most commonly used characters together) into two
+charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each of size 94x94,
+and each charset object uses a CCL program to convert the modified
+position codes back into standard Big5 indices to retrieve a character
+from a Big5 font.
+@end table
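+
+  The effect of the @code{graphic} property described above can be
+sketched in a few lines of Python (an illustration only; the function
+name @code{font_index} is invented and is not an XEmacs function):
+
```python
def font_index(position_codes, graphic):
    """Map position codes to font indices according to GRAPHIC (0 or 1)."""
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    # graphic 0: use the codes unchanged (left half of the font).
    # graphic 1: set the high bit of each code (right half of the font).
    return tuple(code | 0x80 if graphic else code for code in position_codes)


print(font_index((33,), 0))
print(font_index((33, 126), 1))
```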
+
+Most of the above properties can only be set when the charset
+is created.  @xref{Charset Property Functions}.
+
+@node Basic Charset Functions
+@subsection Basic Charset Functions
+
+@defun find-charset charset-or-name
+This function retrieves the charset of the given name.  If
+@var{charset-or-name} is a charset object, it is simply returned.
+Otherwise, @var{charset-or-name} should be a symbol.  If there is no
+such charset, @code{nil} is returned.  Otherwise the associated charset
+object is returned.
+@end defun
+
+@defun get-charset name
+This function retrieves the charset of the given name.  Same as
+@code{find-charset} except an error is signalled if there is no such
+charset instead of returning @code{nil}.
+@end defun
+
+@defun charset-list
+This function returns a list of the names of all defined charsets.
+@end defun
+
+@defun make-charset name doc-string props
+This function defines a new character set.  This function is for use
+with Mule support.  @var{name} is a symbol, the name by which the
+character set is normally referred.  @var{doc-string} is a string
+describing the character set.  @var{props} is a property list,
+describing the specific nature of the character set.  The recognized
+properties are @code{registry}, @code{dimension}, @code{columns},
+@code{chars}, @code{final}, @code{graphic}, @code{direction}, and
+@code{ccl-program}, as previously described.
+@end defun
+
+@defun make-reverse-direction-charset charset new-name
+This function makes a charset equivalent to @var{charset} but which goes
+in the opposite direction.  @var{new-name} is the name of the new
+charset.  The new charset is returned.
+@end defun
+
+@defun charset-from-attributes dimension chars final &optional direction
+This function returns a charset with the given @var{dimension},
+@var{chars}, @var{final}, and @var{direction}.  If @var{direction} is
+omitted, both directions will be checked (left-to-right will be returned
+if character sets exist for both directions).
+@end defun
+
+@defun charset-reverse-direction-charset charset
+This function returns the charset (if any) with the same dimension,
+number of characters, and final byte as @var{charset}, but which is
+displayed in the opposite direction.
+@end defun
+
+@node Charset Property Functions
+@subsection Charset Property Functions
+
+All of these functions accept either a charset name or charset object.
+
+@defun charset-property charset prop
+This function returns property @var{prop} of @var{charset}.
+@xref{Charset Properties}.
+@end defun
+
+Convenience functions are also provided for retrieving individual
+properties of a charset.
+
+@defun charset-name charset
+This function returns the name of @var{charset}.  This will be a symbol.
+@end defun
+
+@defun charset-doc-string charset
+This function returns the doc string of @var{charset}.
+@end defun
+
+@defun charset-registry charset
+This function returns the registry of @var{charset}.
+@end defun
+
+@defun charset-dimension charset
+This function returns the dimension of @var{charset}.
+@end defun
+
+@defun charset-chars charset
+This function returns the number of characters per dimension of
+@var{charset}.
+@end defun
+
+@defun charset-columns charset
+This function returns the number of display columns per character (in
+TTY mode) of @var{charset}.
+@end defun
+
+@defun charset-direction charset
+This function returns the display direction of @var{charset} -- either
+@code{l2r} or @code{r2l}.
+@end defun
+
+@defun charset-final charset
+This function returns the final byte of the ISO 2022 escape sequence
+designating @var{charset}.
+@end defun
+
+@defun charset-graphic charset
+This function returns either 0 or 1, depending on whether the position
+codes of characters in @var{charset} map to the left or right half
+of their font, respectively.
+@end defun
+
+@defun charset-ccl-program charset
+This function returns the CCL program, if any, for converting
+position codes of characters in @var{charset} into font indices.
+@end defun
+
+The only property of a charset that can currently be set after
+the charset has been created is the CCL program.
+
+@defun set-charset-ccl-program charset ccl-program
+This function sets the @code{ccl-program} property of @var{charset} to
+@var{ccl-program}.
+@end defun
+
+@node Predefined Charsets
+@subsection Predefined Charsets
+
+The following charsets are predefined in the C code.
+
+@example
+Name                     Type  Fi Gr Dir Registry
+--------------------------------------------------------------
+ascii                    94    B  0  l2r ISO8859-1
+control-1                94       0  l2r ---
+latin-iso8859-1          96    A  1  l2r ISO8859-1
+latin-iso8859-2          96    B  1  l2r ISO8859-2
+latin-iso8859-3          96    C  1  l2r ISO8859-3
+latin-iso8859-4          96    D  1  l2r ISO8859-4
+cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
+arabic-iso8859-6         96    G  1  r2l ISO8859-6
+greek-iso8859-7          96    F  1  l2r ISO8859-7
+hebrew-iso8859-8         96    H  1  r2l ISO8859-8
+latin-iso8859-9          96    M  1  l2r ISO8859-9
+thai-tis620              96    T  1  l2r TIS620
+katakana-jisx0201        94    I  1  l2r JISX0201.1976
+latin-jisx0201           94    J  0  l2r JISX0201.1976
+japanese-jisx0208-1978   94x94 @@  0  l2r JISX0208.1978
+japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
+japanese-jisx0212        94x94 D  0  l2r JISX0212
+chinese-gb2312           94x94 A  0  l2r GB2312
+chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
+chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
+chinese-big5-1           94x94 0  0  l2r Big5
+chinese-big5-2           94x94 1  0  l2r Big5
+korean-ksc5601           94x94 C  0  l2r KSC5601
+composite                96x96    0  l2r ---
+@end example
+
+The following charsets are predefined in the Lisp code.
+
+@example
+Name                     Type  Fi Gr Dir Registry
+--------------------------------------------------------------
+arabic-digit             94    2  0  l2r MuleArabic-0
+arabic-1-column          94    3  0  r2l MuleArabic-1
+arabic-2-column          94    4  0  r2l MuleArabic-2
+sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
+chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
+chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
+chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
+chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
+chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
+ethiopic                 94x94 2  0  l2r Ethio
+ascii-r2l                94    B  0  r2l ISO8859-1
+ipa                      96    0  1  l2r MuleIPA
+vietnamese-lower         96    1  1  l2r VISCII1.1
+vietnamese-upper         96    2  1  l2r VISCII1.1
+@end example
+
+For all of the above charsets, the dimension and number of columns are
+the same.
+
+Note that ASCII, Control-1, and Composite are handled specially.
+This is why some of the fields are blank; and some of the filled-in
+fields (e.g. the type) are not really accurate.
+
+@node MULE Characters
+@section MULE Characters
+
+@defun make-char charset arg1 &optional arg2
+This function makes a multi-byte character from @var{charset} and octets
+@var{arg1} and @var{arg2}.
+@end defun
+
+@defun char-charset ch
+This function returns the character set of char @var{ch}.
+@end defun
+
+@defun char-octet ch &optional n
+This function returns the octet (i.e. position code) numbered @var{n}
+(should be 0 or 1) of char @var{ch}.  @var{n} defaults to 0 if omitted.
+@end defun
+
+@defun find-charset-region start end &optional buffer
+This function returns a list of the charsets in the region between
+@var{start} and @var{end}.  @var{buffer} defaults to the current buffer
+if omitted.
+@end defun
+
+@defun find-charset-string string
+This function returns a list of the charsets in @var{string}.
+@end defun
+
+@node Composite Characters
+@section Composite Characters
+
+Composite characters are not yet completely implemented.
+
+@defun make-composite-char string
+This function converts a string into a single composite character.  The
+character is the result of overstriking all the characters in the
+string.
+@end defun
+
+@defun composite-char-string ch
+This function returns a string of the characters comprising a composite
+character.
+@end defun
+
+@defun compose-region start end &optional buffer
+This function composes the characters in the region from @var{start} to
+@var{end} in @var{buffer} into one composite character.  The composite
+character replaces the composed characters.  @var{buffer} defaults to
+the current buffer if omitted.
+@end defun
+
+@defun decompose-region start end &optional buffer
+This function decomposes any composite characters in the region from
+@var{start} to @var{end} in @var{buffer}.  This converts each composite
+character into one or more characters, the individual characters out of
+which the composite character was formed.  Non-composite characters are
+left as-is.  @var{buffer} defaults to the current buffer if omitted.
+@end defun
+
+@node ISO 2022
+@section ISO 2022
+
+This section briefly describes the ISO 2022 encoding standard.  For a more
+thorough understanding, please refer to the ISO 2022 standard
+itself.
+
+Character sets (@dfn{charsets}) are classified into the following four
+categories, according to the number of characters in the charset:
+94-charset, 96-charset, 94x94-charset, and 96x96-charset.
+
+@need 1000
+@table @asis
+@item 94-charset
+ ASCII(B), left(J) and right(I) half of JISX0201, ...
+@item 96-charset
+ Latin-1(A), Latin-2(B), Latin-3(C), ...
+@item 94x94-charset
+ GB2312(A), JISX0208(B), KSC5601(C), ...
+@item 96x96-charset
+ none for the moment
+@end table
+
+The character in parentheses after the name of each charset
+is the @dfn{final character} @var{F}, which can be regarded as
+the identifier of the charset.  ECMA allocates @var{F} to each
+charset.  @var{F} is in the range 0x30..0x7E, but 0x30..0x3F
+are reserved for private use.
+
+Note: @dfn{ECMA} = European Computer Manufacturers Association
+
+There are four @dfn{registers of charsets}, called G0 through G3.
+You can designate (or assign) any charset to one of these
+registers.
+
+The code space contained within one octet (of size 256) is divided into
+4 areas: C0, GL, C1, and GR.  GL and GR are the areas into which a
+charset register can be invoked.
+
+@example
+@group
+	C0: 0x00 - 0x1F
+	GL: 0x20 - 0x7F
+	C1: 0x80 - 0x9F
+	GR: 0xA0 - 0xFF
+@end group
+@end example
+
+Usually, in the initial state, G0 is invoked into GL, and G1
+is invoked into GR.
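+
+As a small illustration of this division (in Python, outside XEmacs;
+the function name @code{code_area} is invented for the example), an
+octet can be classified by simple range checks:
+
```python
def code_area(octet):
    """Return the ISO 2022 area ("C0", "GL", "C1" or "GR") containing OCTET."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0x00 - 0xFF")
    if octet < 0x20:
        return "C0"
    if octet < 0x80:
        return "GL"
    if octet < 0xA0:
        return "C1"
    return "GR"


for value in (0x1B, 0x41, 0x8F, 0xB0):
    print(hex(value), code_area(value))
```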
+
+ISO 2022 distinguishes 7-bit environments and 8-bit environments.  In
+7-bit environments, only C0 and GL are used.
+
+Charset designation is done by escape sequences of the form:
+
+@example
+	ESC [@var{I}] @var{I} @var{F}
+@end example
+
+where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
+@var{F} is the final character identifying this charset.
+
+The meanings of the intermediate characters are:
+
+@example
+@group
+	$ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
+	( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
+	) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
+	* [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
+	+ [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
+	- [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
+	. [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
+	/ [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
+@end group
+@end example
+
+The following rule is not allowed in ISO 2022 but can be used in Mule.
+
+@example
+	, [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
+@end example
+
+Here are examples of designations:
+
+@example
+@group
+	ESC ( B :              designate to G0 ASCII
+	ESC - A :              designate to G1 Latin-1
+	ESC $ ( A or ESC $ A : designate to G0 GB2312
+	ESC $ ( B or ESC $ B : designate to G0 JISX0208
+	ESC $ ) C :            designate to G1 KSC5601
+@end group
+@end example
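+
+The rules above can be sketched as a small Python function that builds a
+designation sequence (an illustration only; @code{designation} and its
+arguments are invented here and do not correspond to any XEmacs
+function).  The @code{short_form} argument produces sequences such as
+@samp{ESC $ B}, and @code{mule_extension} enables the Mule-only
+@samp{,} intermediate.
+
```python
ESC = b"\x1b"

# Intermediate bytes from the rules above, indexed by register number.
INTERMEDIATE_94 = {0: b"(", 1: b")", 2: b"*", 3: b"+"}
INTERMEDIATE_96 = {1: b"-", 2: b".", 3: b"/"}
MULE_ONLY_96_G0 = b","  # not allowed by ISO 2022 itself


def designation(dimension, chars, register, final,
                short_form=False, mule_extension=False):
    """Build the escape sequence designating a charset to G<register>."""
    if dimension not in (1, 2) or chars not in (94, 96):
        raise ValueError("unsupported charset type")
    if register not in (0, 1, 2, 3):
        raise ValueError("register must be 0..3")
    if chars == 94:
        intermediate = INTERMEDIATE_94[register]
    elif register == 0:
        if not mule_extension:
            raise ValueError("ISO 2022 has no 96-charset designation to G0")
        intermediate = MULE_ONLY_96_G0
    else:
        intermediate = INTERMEDIATE_96[register]
    # "$" marks a charset of dimension 2.
    prefix = b"$" if dimension == 2 else b""
    if short_form:
        # Short form, e.g. ESC $ B: only for 94x94 charsets designated to G0.
        if (dimension, chars, register) != (2, 94, 0):
            raise ValueError("short form applies only to 94x94 charsets in G0")
        intermediate = b""
    return ESC + prefix + intermediate + final.encode("ascii")


print(designation(1, 94, 0, "B"))                   # ASCII to G0
print(designation(2, 94, 0, "B", short_form=True))  # JISX0208 to G0
```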
+
+To use a charset designated to G2 or G3, and to use a charset designated
+to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
+into GL.  There are two types of invocation, Locking Shift (forever) and
+Single Shift (one character only).
+
+Locking Shift is done as follows:
+
+@example
+	LS0 or SI (0x0F): invoke G0 into GL
+	LS1 or SO (0x0E): invoke G1 into GL
+	LS2:  invoke G2 into GL
+	LS3:  invoke G3 into GL
+	LS1R: invoke G1 into GR
+	LS2R: invoke G2 into GR
+	LS3R: invoke G3 into GR
+@end example
+
+Single Shift is done as follows:
+
+@example
+@group
+	SS2 or ESC N: invoke G2 into GL
+	SS3 or ESC O: invoke G3 into GL
+@end group
+@end example
+
+(#### Ben says: I think the above is slightly incorrect.  It appears that
+SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
+ESC O behave as indicated.  The above definitions will not parse 
+EUC-encoded text correctly, and it looks like the code in mule-coding.c
+has similar problems.)
+
+As you can see, there are many ISO-2022-compliant ways of encoding
+multilingual text.  Many coding systems in actual use, such as X11's
+Compound Text, the Japanese JUNET code, and so-called EUC (Extended
+UNIX Code), are variants of ISO 2022.
+
+In Mule, we characterize ISO 2022 by the following attributes:
+
+@enumerate
+@item
+Initial designation to G0 through G3.
+@item
+Allow designation of short form for Japanese and Chinese.
+@item
+Should we designate ASCII to G0 before control characters?
+@item
+Should we designate ASCII to G0 at the end of line?
+@item
+7-bit environment or 8-bit environment.
+@item
+Use Locking Shift or not.
+@item
+Use ASCII or JISX0201-1976-Roman.
+@item
+Use JISX0208-1983 or JISX0208-1978.
+@end enumerate
+
+(The last two are only for Japanese.)
+
+By specifying these attributes, you can create any variant
+of ISO 2022.
+
+Here are several examples:
+
+@example
+@group
+junet -- Coding system used in JUNET.
+	1. G0 <- ASCII, G1..3 <- never used
+	2. Yes.
+	3. Yes.
+	4. Yes.
+	5. 7-bit environment
+	6. No.
+	7. Use ASCII
+	8. Use JISX0208-1983
+@end group
+
+@group
+ctext -- Compound Text
+	1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
+	2. No.
+	3. No.
+	4. Yes.
+	5. 8-bit environment
+	6. No.
+	7. Use ASCII
+	8. Use JISX0208-1983
+@end group
+
+@group
+euc-china -- Chinese EUC.  Although many people call this
+"GB encoding", that name may cause misunderstanding.
+	1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
+	2. No.
+	3. Yes.
+	4. Yes.
+	5. 8-bit environment
+	6. No.
+	7. Use ASCII
+	8. Use JISX0208-1983
+@end group
+
+@group
+korean-mail -- Coding system used in Korean network.
+	1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
+	2. No.
+	3. Yes.
+	4. Yes.
+	5. 7-bit environment
+	6. Yes.
+	7. No.
+	8. No.
+@end group
+@end example
+
+Mule creates all these coding systems by default.
+
+@node Coding Systems
+@section Coding Systems
+
+A coding system is an object that defines how text containing multiple
+character sets is encoded into a stream of (typically 8-bit) bytes.  The
+coding system is used to decode the stream into a series of characters
+(which may be from multiple charsets) when the text is read from a file
+or process, and is used to encode the text back into the same format
+when it is written out to a file or process.
+
+For example, many ISO-2022-compliant coding systems (such as Compound
+Text, which is used for inter-client data under the X Window System) use
+escape sequences to switch between different charsets -- Japanese Kanji,
+for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
+@samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}.  See
+@code{make-coding-system} for more information.
+
+Coding systems are normally identified using a symbol, and the symbol is
+accepted in place of the actual coding system object whenever a coding
+system is called for. (This is similar to how faces and charsets work.)
+
+@defun coding-system-p object
+This function returns non-@code{nil} if @var{object} is a coding system.
+@end defun
+
+@menu
+* Coding System Types::               Classifying coding systems.
+* EOL Conversion::                    Dealing with different ways of denoting
+                                        the end of a line.
+* Coding System Properties::          Properties of a coding system.
+* Basic Coding System Functions::     Working with coding systems.
+* Coding System Property Functions::  Retrieving a coding system's properties.
+* Encoding and Decoding Text::        Encoding and decoding text.
+* Detection of Textual Encoding::     Determining how text is encoded.
+* Big5 and Shift-JIS Functions::      Special functions for these non-standard
+                                        encodings.
+@end menu
+
+@node Coding System Types
+@subsection Coding System Types
+
+@table @code
+@item nil
+@itemx autodetect
+Automatic conversion.  XEmacs attempts to detect the coding system used
+in the file.
+@item no-conversion
+No conversion.  Use this for binary files and such.  On output, graphic
+characters that are not in ASCII or Latin-1 will be replaced by a
+@samp{?}. (For a no-conversion-encoded buffer, these characters will
+only be present if you explicitly insert them.)
+@item shift-jis
+Shift-JIS (a Japanese encoding commonly used in PC operating systems).
+@item iso2022
+Any ISO-2022-compliant encoding.  Among other things, this includes JIS
+(the Japanese encoding commonly used for e-mail), national variants of
+EUC (the standard Unix encoding for Japanese and other languages), and
+Compound Text (an encoding used in X11).  You can specify more detailed
+information about the conversion with the @var{props} argument of
+@code{make-coding-system} (@pxref{Coding System Properties}).
+@item big5
+Big5 (the encoding commonly used for Traditional Chinese in Taiwan).
+@item ccl
+The conversion is performed using a user-written pseudo-code program.
+CCL (Code Conversion Language) is the name of this pseudo-code.
+@item internal
+Write out or read in the raw contents of the memory representing the
+buffer's text.  This is primarily useful for debugging purposes, and is
+only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
+(the @samp{--debug} configure option).  @strong{Warning}: Reading in a
+file using @code{internal} conversion can result in an internal
+inconsistency in the memory representing a buffer's text, which will
+produce unpredictable results and may cause XEmacs to crash.  Under
+normal circumstances you should never use @code{internal} conversion.
+@end table
+
+@node EOL Conversion
+@subsection EOL Conversion
+
+@table @code
+@item nil
+Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
+generate subsidiary coding systems named @code{@var{name}-unix},
+@code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
+this coding system but have an @code{eol-type} value of @code{lf}, @code{crlf},
+and @code{cr}, respectively.
+@item lf
+The end of a line is marked externally using ASCII LF.  Since this is
+also the way that XEmacs represents an end-of-line internally,
+specifying this option results in no end-of-line conversion.  This is
+the standard format for Unix text files.
+@item crlf
+The end of a line is marked externally using ASCII CRLF.  This is the
+standard format for MS-DOS text files.
+@item cr
+The end of a line is marked externally using ASCII CR.  This is the
+standard format for Macintosh text files.
+@item t
+Automatically detect the end-of-line type but do not generate subsidiary
+coding systems.  (This value is converted to @code{nil} when stored
+internally, and @code{coding-system-property} will return @code{nil}.)
+@end table
+
+@node Coding System Properties
+@subsection Coding System Properties
+
+@table @code
+@item mnemonic
+String to be displayed in the modeline when this coding system is
+active.
+
+@item eol-type
+End-of-line conversion to be used.  It should be one of the types
+listed in @ref{EOL Conversion}.
+
+@item post-read-conversion
+Function called after a file has been read in, to perform the decoding.
+Called with two arguments, @var{beg} and @var{end}, denoting a region of
+the current buffer to be decoded.
+
+@item pre-write-conversion
+Function called before a file is written out, to perform the encoding.
+Called with two arguments, @var{beg} and @var{end}, denoting a region of
+the current buffer to be encoded.
+@end table
+
+The following additional properties are recognized if @var{type} is
+@code{iso2022}:
+
+@table @code
+@item charset-g0
+@itemx charset-g1
+@itemx charset-g2
+@itemx charset-g3
+The character set initially designated to the G0 - G3 registers.
+The value should be one of
+
+@itemize @bullet
+@item
+A charset object (designate that character set)
+@item
+@code{nil} (do not ever use this register)
+@item
+@code{t} (no character set is initially designated to the register, but
+may be later on; this automatically sets the corresponding
+@code{force-g*-on-output} property)
+@end itemize
+
+@item force-g0-on-output
+@itemx force-g1-on-output
+@itemx force-g2-on-output
+@itemx force-g3-on-output
+If non-@code{nil}, send an explicit designation sequence on output
+before using the specified register.
+
+@item short
+If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
+and @samp{ESC $ B} on output in place of the full designation sequences
+@samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
+
+@item no-ascii-eol
+If non-@code{nil}, don't designate ASCII to G0 at each end of line on
+output.  Setting this to non-@code{nil} also suppresses other
+state-resetting that normally happens at the end of a line.
+
+@item no-ascii-cntl
+If non-@code{nil}, don't designate ASCII to G0 before control chars on
+output.
+
+@item seven
+If non-@code{nil}, use 7-bit environment on output.  Otherwise, use 8-bit
+environment.
+
+@item lock-shift
+If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
+designation by escape sequence.
+
+@item no-iso6429
+If non-@code{nil}, don't use ISO6429's direction specification.
+
+@item escape-quoted
+If non-@code{nil}, literal control characters that are the same as the
+beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
+particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
+and CSI (0x9B)) are ``quoted'' with an escape character so that they can
+be properly distinguished from an escape sequence.  (Note that doing
+this results in a non-portable encoding.) This encoding flag is used for
+byte-compiled files.  Note that ESC is a good choice for a quoting
+character because there are no escape sequences whose second byte is a
+character from the Control-0 or Control-1 character sets; this is
+explicitly disallowed by the ISO 2022 standard.
+
+@item input-charset-conversion
+A list of conversion specifications, specifying conversion of characters
+in one charset to another when decoding is performed.  Each
+specification is a list of two elements: the source charset, and the
+destination charset.
+
+@item output-charset-conversion
+A list of conversion specifications, specifying conversion of characters
+in one charset to another when encoding is performed.  The form of each
+specification is the same as for @code{input-charset-conversion}.
+@end table
+
+The following additional properties are recognized (and required) if
+@var{type} is @code{ccl}:
+
+@table @code
+@item decode
+CCL program used for decoding (converting to internal format).
+
+@item encode
+CCL program used for encoding (converting to external format).
+@end table
+
+@node Basic Coding System Functions
+@subsection Basic Coding System Functions
+
+@defun find-coding-system coding-system-or-name
+This function retrieves the coding system of the given name.
+
+If @var{coding-system-or-name} is a coding-system object, it is simply
+returned.  Otherwise, @var{coding-system-or-name} should be a symbol.
+If there is no such coding system, @code{nil} is returned.  Otherwise
+the associated coding system object is returned.
+@end defun
+
+@defun get-coding-system name
+This function retrieves the coding system of the given name.  Same as
+@code{find-coding-system} except an error is signalled if there is no
+such coding system instead of returning @code{nil}.
+@end defun
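+
+As a minimal sketch of the difference between the two lookup functions,
+assuming only the behavior described above (the name
+@code{no-such-coding-system} is hypothetical and is presumed not to be
+defined):
+
+@example
+@group
+;; Returns nil rather than signalling an error.
+(find-coding-system 'no-such-coding-system)
+
+;; Signals an error, which is caught here.
+(condition-case err
+    (get-coding-system 'no-such-coding-system)
+  (error (message "Lookup failed: %S" err)))
+@end group
+@end example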
+
+@defun coding-system-list
+This function returns a list of the names of all defined coding systems.
+@end defun
+
+@defun coding-system-name coding-system
+This function returns the name of the given coding system.
+@end defun
+
+@defun make-coding-system name type &optional doc-string props
+This function registers symbol @var{name} as a coding system.
+
+@var{type} describes the conversion method used and should be one of
+the types listed in @ref{Coding System Types}.
+
+@var{doc-string} is a string describing the coding system.
+
+@var{props} is a property list describing the specific nature of the
+coding system.  Recognized properties are as in @ref{Coding System
+Properties}.
+@end defun
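+
+The following sketch shows how the arguments fit together.  It assumes
+that the charsets @code{ascii}, @code{japanese-jisx0208},
+@code{katakana-jisx0201}, and @code{japanese-jisx0212} are defined and
+that charset names are accepted in place of charset objects; the name
+@code{my-euc-jp-sketch} and the mnemonic are made up for illustration.
+
+@example
+@group
+;; Hypothetical EUC-JP-like coding system, for illustration only.
+(make-coding-system
+ 'my-euc-jp-sketch 'iso2022
+ "Illustrative EUC-JP-like coding system."
+ '(charset-g0 ascii
+   charset-g1 japanese-jisx0208
+   charset-g2 katakana-jisx0201
+   charset-g3 japanese-jisx0212
+   short t
+   mnemonic "Sketch"))
+@end group
+@end example
+
+Because @code{eol-type} is left unspecified here (that is, @code{nil}),
+the subsidiary coding systems described in @ref{EOL Conversion} should
+also be generated.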
+
+@defun copy-coding-system old-coding-system new-name
+This function copies @var{old-coding-system} to @var{new-name}.  If
+@var{new-name} does not name an existing coding system, a new one will
+be created.
+@end defun
+
+@defun subsidiary-coding-system coding-system eol-type
+This function returns the subsidiary coding system of
+@var{coding-system} with eol type @var{eol-type}.
+@end defun
+
+@node Coding System Property Functions
+@subsection Coding System Property Functions
+
+@defun coding-system-doc-string coding-system
+This function returns the doc string for @var{coding-system}.
+@end defun
+
+@defun coding-system-type coding-system
+This function returns the type of @var{coding-system}.
+@end defun
+
+@defun coding-system-property coding-system prop
+This function returns the @var{prop} property of @var{coding-system}.
+@end defun
+
+@node Encoding and Decoding Text
+@subsection Encoding and Decoding Text
+
+@defun decode-coding-region start end coding-system &optional buffer
+This function decodes the text between @var{start} and @var{end} which
+is encoded in @var{coding-system}.  This is useful if you've read in
+encoded text from a file without decoding it (e.g. you read in a
+JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
+system, so that it shows up as @samp{^[$B!<!+^[(B}).  The length of the
+encoded text is returned.  @var{buffer} defaults to the current buffer
+if unspecified.
+@end defun
+
+@defun encode-coding-region start end coding-system &optional buffer
+This function encodes the text between @var{start} and @var{end} using
+@var{coding-system}.  This will, for example, convert Japanese
+characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
+encoding.  The length of the encoded text is returned.  @var{buffer}
+defaults to the current buffer if unspecified.
+@end defun
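+
+A round trip through these two functions might look like the following
+sketch.  It assumes that a coding system named @code{iso-2022-jp} is
+defined and that @code{with-temp-buffer} is available; neither is
+guaranteed by this section.
+
+@example
+@group
+(with-temp-buffer
+  (insert "plain ASCII text")
+  ;; Replace the buffer text with its external (encoded) form ...
+  (encode-coding-region (point-min) (point-max) 'iso-2022-jp)
+  ;; ... and convert it back to internal characters.
+  (decode-coding-region (point-min) (point-max) 'iso-2022-jp)
+  (buffer-string))
+@end group
+@end example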
+
+@node Detection of Textual Encoding
+@subsection Detection of Textual Encoding
+
+@defun coding-category-list
+This function returns a list of all recognized coding categories.
+@end defun
+
+@defun set-coding-priority-list list
+This function changes the priority order of the coding categories.
+@var{list} should be a list of coding categories, in descending order of
+priority.  Unspecified coding categories will be lower in priority than
+all specified ones, in the same relative order they were in previously.
+@end defun
+
+@defun coding-priority-list
+This function returns a list of coding categories in descending order of
+priority.
+@end defun
+
+@defun set-coding-category-system coding-category coding-system
+This function changes the coding system associated with a coding category.
+@end defun
+
+@defun coding-category-system coding-category
+This function returns the coding system associated with a coding category.
+@end defun
+
+@defun detect-coding-region start end &optional buffer
+This function detects the coding system of the text in the region
+between @var{start} and @var{end}.  The returned value is a list of
+possible coding systems ordered by priority.  If only ASCII characters
+are found, it returns @code{autodetect} or one of its subsidiary coding
+systems according to the detected end-of-line type.  The optional
+argument @var{buffer} defaults to the current buffer.
+@end defun
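+
+For instance, the following sketch moves the lowest-priority coding
+category to the front and then asks for the candidate coding systems of
+the current buffer.  It relies only on the functions described in this
+node; which categories and coding systems are reported depends on the
+installation.
+
+@example
+@group
+;; Promote whichever category currently has the lowest priority.
+(let ((lowest (car (reverse (coding-priority-list)))))
+  (set-coding-priority-list (list lowest)))
+
+;; List the candidate coding systems for the whole buffer.
+(detect-coding-region (point-min) (point-max))
+@end group
+@end example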
+
+@node Big5 and Shift-JIS Functions
+@subsection Big5 and Shift-JIS Functions
+
+These are special functions for working with the non-standard
+Shift-JIS and Big5 encodings.
+
+@defun decode-shift-jis-char code
+This function decodes a JISX0208 character encoded in the Shift-JIS
+coding system.  @var{code} is the character code in Shift-JIS, given as
+a cons of two bytes.  The corresponding character is returned.
+@end defun
+
+@defun encode-shift-jis-char ch
+This function encodes the JISX0208 character @var{ch} in the Shift-JIS
+coding system.  The corresponding character code in Shift-JIS is
+returned as a cons of two bytes.
+@end defun
+
+@defun decode-big5-char code
+This function decodes a character encoded in the Big5 coding system.
+@var{code} is the character code in Big5.  The corresponding character
+is returned.
+@end defun
+
+@defun encode-big5-char ch
+This function encodes the Big5 character @var{ch} in the Big5 coding
+system.  The corresponding character code in Big5 is returned.
+@end defun
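+
+As a sketch of how the Shift-JIS pair is meant to be used, the
+following round trip starts from the Shift-JIS code 0x82 0xA0
+(HIRAGANA LETTER A), written in decimal.  It assumes the two functions
+are exact inverses for this code.
+
+@example
+@group
+(let* ((code (cons 130 160))           ; 0x82 . 0xA0
+       (ch (decode-shift-jis-char code)))
+  ;; Expected to be non-nil if encoding inverts decoding.
+  (equal (encode-shift-jis-char ch) code))
+@end group
+@end example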
+
+@node CCL, Category Tables, Coding Systems, MULE
+@section CCL
+
+CCL (Code Conversion Language) is a simple structured programming
+language designed for character coding conversions.  A CCL program is
+compiled to CCL code (represented by a vector of integers) and executed
+by the CCL interpreter embedded in Emacs.  The CCL interpreter
+implements a virtual machine with 8 registers called @code{r0}, ...,
+@code{r7}, a number of control structures, and some I/O operators.  Take
+care when using registers @code{r0} (used in implicit @dfn{set}
+statements) and especially @code{r7} (used internally by several
+statements and operations, especially for multiple return values and I/O 
+operations).
+
+CCL is used for code conversion during process I/O and file I/O for
+non-ISO2022 coding systems.  (It is the only way for a user to specify a
+code conversion function.)  It is also used for calculating the code
+point of an X11 font from a character code.  However, since CCL is
+designed as a powerful programming language, it can be used for more
+generic calculation where efficiency is demanded.  A combination of
+three or more arithmetic operations can be calculated faster by CCL than
+by Emacs Lisp.
+
+@strong{Warning:}  The code in @file{src/mule-ccl.c} and
+@file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
+description of CCL's semantics.  The previous version of this section
+contained several typos and obsolete names left from earlier versions of
+MULE, and many may remain.  (I am not an experienced CCL programmer; the
+few who know CCL well find writing English painful.)
+
+A CCL program transforms an input data stream into an output data
+stream.  The input stream, held in a buffer of constant bytes, is left
+unchanged.  The buffer may be filled by an external input operation,
+taken from an Emacs buffer, or taken from a Lisp string.  The output
+buffer is a dynamic array of bytes, which can be written by an external
+output operation, inserted into an Emacs buffer, or returned as a Lisp
+string.
+
+A CCL program is a (Lisp) list containing two or three members.  The
+first member is the @dfn{buffer magnification}, which indicates the
+required minimum size of the output buffer as a multiple of the input
+buffer.  It is followed by the @dfn{main block} which executes while
+there is input remaining, and an optional @dfn{EOF block} which is
+executed when the input is exhausted.  Both the main block and the EOF
+block are CCL blocks.
+
+A @dfn{CCL block} is either a CCL statement or list of CCL statements.
+A @dfn{CCL statement} is either a @dfn{set statement} (either an integer 
+or an @dfn{assignment}, which is a list of a register to receive the
+assignment, an assignment operator, and an expression) or a @dfn{control 
+statement} (a list starting with a keyword, whose allowable syntax
+depends on the keyword).
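+
+As an illustration of this structure, here is a sketch of a CCL program
+(in source form, before compilation) that copies its input to its output
+unchanged.  The variable name @code{my-ccl-copy-sketch} is hypothetical.
+
+@example
+@group
+(defconst my-ccl-copy-sketch
+  '(1                             ; buffer magnification
+    ((read r0)                    ; main block: read the first byte,
+     (loop                        ; then repeatedly
+       (write-read-repeat r0))))  ; write r0, read the next byte, jump
+  "CCL source for a byte-for-byte copy (illustration only).")
+@end group
+@end example
+
+There is no EOF block, so nothing further happens when the input is
+exhausted.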
+
+@menu
+* CCL Syntax::          CCL program syntax in BNF notation.
+* CCL Statements::      Semantics of CCL statements.
+* CCL Expressions::     Operators and expressions in CCL.
+* Calling CCL::         Running CCL programs.
+* CCL Examples::        The encoding functions for Big5 and KOI-8.
+@end menu
+
+@node    CCL Syntax, CCL Statements, CCL,       CCL
+@comment Node,       Next,           Previous,  Up
+@subsection CCL Syntax
+
+The full syntax of a CCL program in BNF notation:
+
+@format
+CCL_PROGRAM :=
+        (BUFFER_MAGNIFICATION
+         CCL_MAIN_BLOCK
+         [ CCL_EOF_BLOCK ])
+
+BUFFER_MAGNIFICATION := integer
+CCL_MAIN_BLOCK := CCL_BLOCK
+CCL_EOF_BLOCK := CCL_BLOCK
+
+CCL_BLOCK :=
+        STATEMENT | (STATEMENT [STATEMENT ...])
+STATEMENT :=
+        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
+        | CALL | END
+
+SET :=
+        (REG = EXPRESSION)
+        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
+        | integer
+
+EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
+
+IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
+BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
+LOOP := (loop STATEMENT [STATEMENT ...])
+BREAK := (break)
+REPEAT :=
+        (repeat)
+        | (write-repeat [REG | integer | string])
+        | (write-read-repeat REG [integer | ARRAY])
+READ :=
+        (read REG ...)
+        | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
+        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
+WRITE :=
+        (write REG ...)
+        | (write EXPRESSION)
+        | (write integer) | (write string) | (write REG ARRAY)
+        | string
+CALL := (call ccl-program-name)
+END := (end)
+
+REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
+ARG := REG | integer
+OPERATOR :=
+        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
+        | < | > | == | <= | >= | != | de-sjis | en-sjis
+ASSIGNMENT_OPERATOR :=
+        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
+ARRAY := '[' integer ... ']'
+@end format
+
+@node    CCL Statements, CCL Expressions, CCL Syntax, CCL
+@comment Node,           Next,            Previous,   Up
+@subsection CCL Statements
+
+The Emacs Code Conversion Language provides the following statement
+types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
+@dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
+
+@heading Set statement:
+
+The @dfn{set} statement has three variants with the syntaxes
+@samp{(@var{reg} = @var{expression})},
+@samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
+@samp{@var{integer}}.  The assignment operator variation of the
+@dfn{set} statement works the same way as the corresponding C expression
+statement does.  The assignment operators are @code{+=}, @code{-=},
+@code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
+@code{<<=}, and @code{>>=}, and they have the same meanings as in C.  A
+``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
+the form @code{(r0 = @var{integer})}.
+
+@heading I/O statements:
+
+The @dfn{read} statement takes one or more registers as arguments.  It
+reads one byte (a C char) from the input into each register in turn.  
+
+The @dfn{write} statement takes several forms.  In the form
+@samp{(write @var{reg} ...)} it takes one or more registers as arguments
+and writes each in turn to the output.  The integer in a register
+(interpreted as an Emchar) is encoded to multibyte form (i.e., Bufbytes)
+and written to the current output buffer.  If it is less than 256, it is
+written as is.  The forms @samp{(write @var{expression})} and
+@samp{(write @var{integer})} are treated analogously.  The form
+@samp{(write @var{string})} writes the constant string to the output.  A
+``naked string'' @samp{@var{string}} is equivalent to the statement
+@samp{(write @var{string})}.  The form @samp{(write @var{reg}
+@var{array})} writes the @var{reg}th element of the @var{array} to the
+output.
+
+@heading Conditional statements:
+
+The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
+an optional @var{second CCL block} as arguments.  If the
+@var{expression} evaluates to non-zero, the first @var{CCL block} is
+executed.  Otherwise, if there is a @var{second CCL block}, it is
+executed.
+
+The @dfn{read-if} variant of the @dfn{if} statement takes an
+@var{expression}, a @var{CCL block}, and an optional @var{second CCL
+block} as arguments.  The @var{expression} must have the form
+@code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
+a register or an integer).  The @code{read-if} statement first reads
+from the input into the first register operand in the @var{expression},
+then conditionally executes a CCL block just as the @code{if} statement
+does.
+
+The @dfn{branch} statement takes an @var{expression} and one or more CCL
+blocks as arguments.  The CCL blocks are treated as a zero-indexed
+array, and the @code{branch} statement uses the @var{expression} as the
+index of the CCL block to execute.  Null CCL blocks may be used as
+no-ops, continuing execution with the statement following the
+@code{branch} statement in the containing CCL block.  Out-of-range
+values for the @var{expression} are also treated as no-ops.
+
+The @dfn{read-branch} variant of the @dfn{branch} statement takes a
+@var{register} and one or more CCL blocks as arguments.  The
+@code{read-branch} statement first reads from the input into the
+@var{register}, then uses the value read as the index of the CCL block to
+execute, just as the @code{branch} statement does.
+
+@heading Loop control statements:
+
+The @dfn{loop} statement creates a block with an implied jump from the
+end of the block back to its head.  The loop is exited on a @code{break} 
+statement, and continued without executing the tail by a @code{repeat}
+statement.
+
+The @dfn{break} statement, written @samp{(break)}, terminates the
+current loop and continues with the next statement in the current
+block. 
+
+The @dfn{repeat} statement has three variants, @code{repeat},
+@code{write-repeat}, and @code{write-read-repeat}.  Each continues the
+current loop from its head, possibly after performing I/O.
+@code{repeat} takes no arguments and does no I/O before jumping.
+@code{write-repeat} takes a single argument (a register, an 
+integer, or a string), writes it to the output, then jumps.
+@code{write-read-repeat} takes one or two arguments.  The first must
+be a register.  The second may be an integer or an array; if absent, it
+is implicitly set to the first (register) argument.
+@code{write-read-repeat} writes its second argument to the output, then
+reads from the input into the register, and finally jumps.  See the
+@code{write} and @code{read} statements for the semantics of the I/O
+operations for each type of argument.
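+
+The following main block is a sketch of how these statements combine: it
+converts ASCII lower-case letters to upper case and passes every other
+byte through unchanged.
+
+@example
+@group
+((read r0)
+ (loop
+   (if (r0 >= 97)                 ; ?a
+       (if (r0 <= 122)            ; ?z
+           (r0 -= 32)))
+   ;; Write r0, read the next byte into r0, and jump to the loop head.
+   (write-read-repeat r0)))
+@end group
+@end example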
+
+@heading Other control statements:
+
+The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
+executes a CCL program as a subroutine.  It does not return a value to
+the caller, but can modify the register status.
+
+The @dfn{end} statement, written @samp{(end)}, terminates the CCL
+program successfully, and returns to caller (which may be a CCL
+program).  It does not alter the status of the registers.
+
+@node    CCL Expressions, Calling CCL, CCL Statements, CCL
+@comment Node,            Next,        Previous,       Up
+@subsection CCL Expressions
+
+CCL, unlike Lisp, uses infix expressions.  The simplest CCL expressions
+consist of a single @var{operand}, either a register (one of @code{r0},
+..., @code{r7}) or an integer.  Complex expressions are lists of the
+form @code{( @var{expression} @var{operator} @var{operand} )}.  Unlike
+C, assignments are not expressions.
+
+In the following table, @var{X} is the target register for a @dfn{set}.
+In subexpressions, this is implicitly @code{r7}.  This means that
+@code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
+freely in subexpressions, since they return parts of their values in
+@code{r7}.  @var{Y} may be an expression, register, or integer, while
+@var{Z} must be a register or an integer.
+
+@multitable @columnfractions .22 .14 .09 .55
+@item Name @tab Operator @tab Code @tab C-like Description
+@item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
+@item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
+@item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
+@item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
+@item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
+@item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
+@item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
+@item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
+@item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
+@item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
+@item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
+@item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
+@item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
+@item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
+@item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
+@item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
+@item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
+@item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
+@item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
+@item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
+@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
+@item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
+@item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
+@end multitable
+
+The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
+CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS.  CCL_ENCODE_SJIS and
+CCL_DECODE_SJIS treat their first and second operands as the high and
+low bytes of a two-byte character code.  (SJIS stands for Shift JIS, an
+encoding of Japanese characters used by Microsoft.  CCL_ENCODE_SJIS is a
+complicated transformation of the Japanese standard JIS encoding to
+Shift JIS.  CCL_DECODE_SJIS is its inverse.)  It is somewhat odd to
+represent the SJIS operations in infix form.
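+
+To make the less familiar operators concrete, here is a sketch of a few
+@dfn{set} statements with their C-like readings in comments.  The choice
+of registers is arbitrary.
+
+@example
+@group
+(r2 = (r0 <8 r1))        ; r2 = (r0 << 8) | r1
+(r4 = (r2 // 10))        ; r4 = r2 / 10, r7 = r2 % 10
+(r5 = ((r0 + r1) * 2))   ; nested: the left operand is an expression
+@end group
+@end example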
+
+@node    Calling CCL, CCL Examples,  CCL Expressions, CCL
+@comment Node,        Next,          Previous,        Up
+@subsection Calling CCL
+
+CCL programs are called automatically during Emacs buffer I/O when the
+external representation has a coding system type of @code{shift-jis},
+@code{big5}, or @code{ccl}.  The program is specified by the coding
+system (@pxref{Coding Systems}).  You can also call CCL programs from
+other CCL programs, and from Lisp using these functions:
+
+@defun ccl-execute ccl-program status
+Execute @var{ccl-program} with registers initialized by
+@var{status}.  @var{ccl-program} is a vector of compiled CCL code
+created by @code{ccl-compile}.  It is an error for the program to try to 
+execute a CCL I/O command.  @var{status} must be a vector of nine
+values, specifying the initial value for the R0, R1 .. R7 registers and
+for the instruction counter IC.  A @code{nil} value for a register
+initializer causes the register to be set to 0.  A @code{nil} value for
+the IC initializer causes execution to start at the beginning of the
+program.  When the program is done, @var{status} is modified (by
+side-effect) to contain the ending values for the corresponding
+registers and IC.  
+@end defun
+
+@defun ccl-execute-on-string ccl-program status str &optional continue
+Execute @var{ccl-program} with initial @var{status} on the string
+@var{str}.  @var{ccl-program} is a vector of compiled CCL code
+created by @code{ccl-compile}.  @var{status} must be a vector of nine
+values, specifying the initial value for the R0, R1 .. R7 registers and
+for the instruction counter IC.  A @code{nil} value for a register
+initializer causes the register to be set to 0.  A @code{nil} value for
+the IC initializer causes execution to start at the beginning of the
+program.  An optional fourth argument @var{continue}, if
+non-@code{nil}, causes the IC to remain on the unsatisfied read
+operation if the program terminates due to exhaustion of the input
+buffer.  Otherwise the IC is set to the end of the program.  When the
+program is done, @var{status} is modified (by side-effect) to contain
+the ending values for the corresponding registers and IC.  Returns the
+resulting string.
+@end defun
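+
+A sketch of running a program on a string, assuming @code{ccl-compile}
+accepts the source form described in @ref{CCL Syntax}; the program
+itself is a made-up byte-for-byte copy.
+
+@example
+@group
+(let ((program (ccl-compile
+                '(1 ((read r0)
+                     (loop (write-read-repeat r0))))))
+      ;; Nine nil entries: R0 .. R7 start at 0, IC at the beginning.
+      (status (make-vector 9 nil)))
+  (ccl-execute-on-string program status "some text"))
+@end group
+@end example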
+
+To call a CCL program from another CCL program, it must first be
+registered:
+
+@defun register-ccl-program name ccl-program
+Register @var{name} for CCL program @var{ccl-program} in
+@code{ccl-program-table}.  @var{ccl-program} should be the compiled form
+of a CCL program, or @code{nil}.  Returns the index number of the
+registered CCL program.
+@end defun
+
+Information about the processor time used by the CCL interpreter can be
+obtained using these functions:
+
+@defun ccl-elapsed-time
+Returns the elapsed processor time of the CCL interpreter as a cons of
+user and system time, as floating point numbers measured in seconds.  If
+only one overall value can be determined, the return value will be a
+cons of that value and 0.
+@end defun
+
+@defun ccl-reset-elapsed-time
+Resets the CCL interpreter's internal elapsed time registers.
+@end defun
+
+@node    CCL Examples, ,     Calling CCL, CCL
+@comment Node,         Next, Previous,    Up
+@subsection CCL Examples
+
+This section is not yet written.
+
+@node Category Tables, , CCL, MULE
+@section Category Tables
+
+  A category table is a type of char table used for keeping track of
+categories.  Categories are used for classifying characters for use in
+regexps -- you can refer to a category rather than having to use a
+complicated [] expression (and category lookups are significantly
+faster).
+
+  There are 95 different categories available, one for each printable
+character (including space) in the ASCII charset.  Each category is
+designated by one such character, called a @dfn{category designator}.
+They are specified in a regexp using the syntax @samp{\cX}, where X is a
+category designator. (This is not yet implemented.)
+
+  A category table specifies, for each character, the categories that
+the character is in.  Note that a character can be in more than one
+category.  More specifically, a category table maps from a character to
+either the value @code{nil} (meaning the character is in no categories)
+or a 95-element bit vector, specifying for each of the 95 categories
+whether the character is in that category.
+
+  Special Lisp functions are provided that abstract this, so you do not
+have to directly manipulate bit vectors.
+
+@defun category-table-p obj
+This function returns @code{t} if @var{obj} is a category table.
+@end defun
+
+@defun category-table &optional buffer
+This function returns the current category table.  This is the one
+specified by the current buffer, or by @var{buffer} if it is
+non-@code{nil}.
+@end defun
+
+@defun standard-category-table
+This function returns the standard category table.  This is the one used
+for new buffers.
+@end defun
+
+@defun copy-category-table &optional table
+This function constructs a new category table and returns it.  It is a
+copy of @var{table}, which defaults to the standard category table.
+@end defun
+
+@defun set-category-table table &optional buffer
+This function selects a new category table for @var{buffer}.
+@var{table} must be a category table.  @var{buffer} defaults to the
+current buffer if omitted.
+@end defun
+
+@defun category-designator-p obj
+This function returns @code{t} if @var{obj} is a category designator (a
+char in the range @samp{' '} to @samp{'~'}).
+@end defun
+
+@defun category-table-value-p obj
+This function returns @code{t} if @var{obj} is a category table value.
+Valid values are @code{nil} or a bit vector of size 95.
+@end defun
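+
+A short sketch using only the functions described above:
+
+@example
+@group
+;; Give the current buffer its own copy of the standard category table.
+(let ((table (copy-category-table (standard-category-table))))
+  (if (category-table-p table)
+      (set-category-table table)))
+
+;; The designators are the printable ASCII characters, space through ~.
+(category-designator-p ?a)
+@end group
+@end example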
+