diff man/lispref/mule.texi @ 428:3ecd8885ac67 r21-2-22
Import from CVS: tag r21-2-22
author | cvs |
---|---|
date | Mon, 13 Aug 2007 11:28:15 +0200 |
parents | |
children | 8de8e3f6228a |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/man/lispref/mule.texi Mon Aug 13 11:28:15 2007 +0200 @@ -0,0 +1,1505 @@ +@c -*-texinfo-*- +@c This is part of the XEmacs Lisp Reference Manual. +@c Copyright (C) 1996 Ben Wing. +@c See the file lispref.texi for copying conditions. +@setfilename ../../info/internationalization.info +@node MULE, Tips, Internationalization, top +@chapter MULE + +@dfn{MULE} is the name originally given to the version of GNU Emacs +extended for multi-lingual (and in particular Asian-language) support. +``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called +Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for +``Japan''), when it only provided support for Japanese. XEmacs +refers to its multi-lingual support as @dfn{MULE support} since it +is based on @dfn{MULE}. + +@menu +* Internationalization Terminology:: + Definition of various internationalization terms. +* Charsets:: Sets of related characters. +* MULE Characters:: Working with characters in XEmacs/MULE. +* Composite Characters:: Making new characters by overstriking other ones. +* ISO 2022:: An international standard for charsets and encodings. +* Coding Systems:: Ways of representing a string of chars using integers. +* CCL:: A special language for writing fast converters. +* Category Tables:: Subdividing charsets into groups. +@end menu + +@node Internationalization Terminology +@section Internationalization Terminology + + In internationalization terminology, a string of text is divided up +into @dfn{characters}, which are the printable units that make up the +text. A single character is (for example) a capital @samp{A}, the +number @samp{2}, a Katakana character, a Kanji ideograph (an +@dfn{ideograph} is a ``picture'' character, such as is used in Japanese +Kanji, Chinese Hanzi, and Korean Hangul; typically there are thousands +of such ideographs in each language), etc. The basic property of a +character is its shape. 
Note that the same character may be drawn by +two different people (or in two different fonts) in slightly different +ways, although the basic shape will be the same. + + In some cases, the differences will be significant enough that it is +actually possible to identify two or more distinct shapes that both +represent the same character. For example, the lowercase letters +@samp{a} and @samp{g} each have two distinct possible shapes -- the +@samp{a} can optionally have a curved tail projecting off the top, and +the @samp{g} can be formed either of two loops, or of one loop and a +tail hanging off the bottom. Such distinct possible shapes of a +character are called @dfn{glyphs}. The important characteristic of two +glyphs making up the same character is that the choice between one or +the other is purely stylistic and has no linguistic effect on a word +(this is the reason why a capital @samp{A} and lowercase @samp{a} +are different characters rather than different glyphs -- e.g. +@samp{Aspen} is a city while @samp{aspen} is a kind of tree). + + Note that @dfn{character} and @dfn{glyph} are used differently +here than elsewhere in XEmacs. + + A @dfn{character set} is simply a set of related characters. ASCII, +for example, is a set of 94 characters (or 128, if you count +non-printing characters). Other character sets are ISO8859-1 (ASCII +plus various accented characters and other international symbols), +JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208 +(Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji), +GB2312 (Mainland Chinese Hanzi), etc. + + Every character set has one or more @dfn{orderings}, which can be +viewed as a way of assigning a number (or set of numbers) to each +character in the set. For most character sets, there is a standard +ordering, and in fact all of the character sets mentioned above define a +particular ordering. 
ASCII, for example, places letters in their +``natural'' order, puts uppercase letters before lowercase letters, +numbers before letters, etc. Note that for many of the Asian character +sets, there is no natural ordering of the characters. The actual +orderings are based on one or more salient characteristics, of which +there are many to choose from -- e.g. number of strokes, common +radicals, phonetic ordering, etc. + + The numbers assigned to any particular character are called +the character's @dfn{position codes}. The number of position codes +required to index a particular character in a character set is called +the @dfn{dimension} of the character set. ASCII, being a relatively +small character set, is of dimension one, and each character in the +set is indexed using a single position code, in the range 0 through +127 (if non-printing characters are included) or 33 through 126 +(if only the printing characters are considered). JISX0208, i.e. +Japanese Kanji, has thousands of characters, and is of dimension two -- +every character is indexed by two position codes, each in the range +33 through 126. (Note that the choice of the range here is somewhat +arbitrary. Although a character set such as JISX0208 defines an +@emph{ordering} of all its characters, it does not define the actual +mapping between numbers and characters. You could just as easily +index the characters in JISX0208 using numbers in the range 0 through +93, 1 through 94, 2 through 95, etc. The reason for the actual range +chosen is so that the position codes match up with the actual values +used in the common encodings.) + + An @dfn{encoding} is a way of numerically representing characters from +one or more character sets as a stream of like-sized numerical values +called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit +quantities. If an encoding encompasses only one character set, then the +position codes for the characters in that character set could be used +directly.
(This is the case with ASCII, and as a result, most people do +not understand the difference between a character set and an encoding.) +This is not possible, however, if more than one character set is to be +used in the encoding. For example, printed Japanese text typically +requires characters from multiple character sets -- ASCII, JISX0208, and +JISX0212, to be specific. Each of these is indexed using one or more +position codes in the range 33 through 126, so the position codes could +not be used directly or there would be no way to tell which character +was meant. Different Japanese encodings handle this differently -- JIS +uses special escape sequences to denote different character sets; EUC +sets the high bit of the position codes for JISX0208 and JISX0212, and +puts a special extra byte before each JISX0212 character; etc. (JIS, +EUC, and most of the other encodings you will encounter are 7-bit or +8-bit encodings. There is one common 16-bit encoding, which is Unicode; +this strives to represent all the world's characters in a single large +character set. 32-bit encodings are generally used internally in +programs to simplify the code that manipulates them; however, they are +not much used externally because they are not very space-efficient.) + + Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In +a @dfn{modal encoding}, there are multiple states that the encoding can be in, +and the interpretation of the values in the stream depends on the +current global state of the encoding. Special values in the encoding, +called @dfn{escape sequences}, are used to change the global state. +JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} +indicate that, from then on, bytes are to be interpreted as position +codes for JISX0208, rather than as ASCII. This effect is cancelled +using the bytes @samp{ESC ( B}, which mean ``switch from whatever the +current state is to ASCII''. To switch to JISX0212, the escape sequence +@samp{ESC $ ( D} is used.
(Note that here, as is common, the escape sequences do +in fact begin with @samp{ESC}. This is not necessarily the case, +however.) + +A @dfn{non-modal encoding} has no global state that extends past the +character currently being interpreted. EUC, for example, is a +non-modal encoding. Characters in JISX0208 are encoded by setting +the high bit of the position codes, and characters in JISX0212 are +encoded by doing the same but also prefixing the character with the +byte 0x8F. + + The advantage of a modal encoding is that it is generally more +space-efficient, and is easily extendable because there are essentially +an arbitrary number of escape sequences that can be created. The +disadvantage, however, is that it is much more difficult to work with +if it is not being processed in a sequential manner. In the non-modal +EUC encoding, for example, the byte 0x41 always refers to the letter +@samp{A}; whereas in JIS, it could either be the letter @samp{A}, or +one of the two position codes in a JISX0208 character, or one of the +two position codes in a JISX0212 character. Determining exactly which +one is meant could be difficult and time-consuming if the previous +bytes in the string have not already been processed. + + Non-modal encodings are further divided into @dfn{fixed-width} and +@dfn{variable-width} formats. A fixed-width encoding always uses +the same number of words per character, whereas a variable-width +encoding does not. EUC is a good example of a variable-width +encoding: one to three bytes are used per character, depending on +the character set. 16-bit and 32-bit encodings are nearly always +fixed-width, and this is in fact one of the main reasons for using +an encoding with a larger word size. The advantages of fixed-width +encodings should be obvious. The advantages of variable-width +encodings are that they are generally more space-efficient and allow +for compatibility with existing 8-bit encodings such as ASCII. 
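As a rough illustration of the difference between modal and non-modal encodings, the following Python sketch encodes the same three characters with the standard library codecs iso2022_jp (the modal JIS encoding) and euc_jp (the non-modal EUC encoding). Python and its codec names are assumptions made for this illustration only and are not part of XEmacs/MULE; the helper hexdump is hypothetical.

```python
# Illustrative sketch only: uses Python's standard codecs, not XEmacs/MULE.

# 'A', HIRAGANA LETTER A (JISX0208 position codes 0x24 0x22), 'A'
text = "A\u3042A"

jis = text.encode("iso2022_jp")  # modal: escape sequences switch the state
euc = text.encode("euc_jp")      # non-modal: high bit marks JISX0208 bytes


def hexdump(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join("%02x" % b for b in data)


print("JIS:", hexdump(jis))  # JIS: 41 1b 24 42 24 22 1b 28 42 41
print("EUC:", hexdump(euc))  # EUC: 41 a4 a2 41

# In EUC, every byte with the high bit clear is plain ASCII, so the
# ASCII characters can be picked out without any earlier context.
print(bytes(b for b in euc if b < 0x80))  # b'AA'
```

In the JIS output the bytes 24 22 are position codes only because ESC $ B (1b 24 42) precedes them; taken in isolation they would read as the ASCII characters $ and ". The EUC bytes a4 a2 are the same position codes with the high bit set, so they can be recognized without looking at earlier bytes.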
+ + Note that the bytes in an 8-bit encoding are often referred to +as @dfn{octets} rather than simply as bytes. This terminology +dates back to the days before 8-bit bytes were universal, when +some computers had 9-bit bytes, others had 10-bit bytes, etc. + +@node Charsets +@section Charsets + + A @dfn{charset} in MULE is an object that encapsulates a +particular character set as well as an ordering of those characters. +Charsets are permanent objects and are named using symbols, like +faces. + +@defun charsetp object +This function returns non-@code{nil} if @var{object} is a charset. +@end defun + +@menu +* Charset Properties:: Properties of a charset. +* Basic Charset Functions:: Functions for working with charsets. +* Charset Property Functions:: Functions for accessing charset properties. +* Predefined Charsets:: Predefined charset objects. +@end menu + +@node Charset Properties +@subsection Charset Properties + + Charsets have the following properties: + +@table @code +@item name +A symbol naming the charset. Every charset must have a different name; +this allows a charset to be referred to using its name rather than +the actual charset object. +@item doc-string +A documentation string describing the charset. +@item registry +A regular expression matching the font registry field for this character +set. For example, both the @code{ascii} and @code{latin-iso8859-1} +charsets use the registry @code{"ISO8859-1"}. This field is used to +choose an appropriate font when the user gives a general font +specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a +14-point upright medium-weight Courier font. +@item dimension +Number of position codes used to index a character in the character set. +XEmacs/MULE can only handle character sets of dimension 1 or 2. +This property defaults to 1. +@item chars +Number of characters in each dimension. In XEmacs/MULE, the only +allowed values are 94 or 96. 
(There are a couple of pre-defined +character sets, such as ASCII, that do not follow this, but you cannot +define new ones like this.) Defaults to 94. Note that if the dimension +is 2, the character set thus described is 94x94 or 96x96. +@item columns +Number of columns used to display a character in this charset. +Only used in TTY mode. (Under X, the actual width of a character +can be derived from the font used to display the characters.) +If unspecified, defaults to the dimension. (This is almost +always the correct value, because character sets with dimension 2 +are usually ideograph character sets, which need two columns to +display the intricate ideographs.) +@item direction +A symbol, either @code{l2r} (left-to-right) or @code{r2l} +(right-to-left). Defaults to @code{l2r}. This specifies the +direction that the text should be displayed in, and will be +left-to-right for most charsets but right-to-left for Hebrew +and Arabic. (Right-to-left display is not currently implemented.) +@item final +Final byte of the standard ISO 2022 escape sequence designating this +charset. Must be supplied. Each combination of (@var{dimension}, +@var{chars}) defines a separate namespace for final bytes, and each +charset within a particular namespace must have a different final byte. +Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if +dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final +bytes in the range 0x30 - 0x3F are reserved for user-defined (not +official) character sets. For more information on ISO 2022, see @ref{Coding +Systems}. +@item graphic +0 (use left half of font on output) or 1 (use right half of font on +output). Defaults to 0. This specifies how to convert the position +codes that index a character in a character set into an index into the +font used to display the character set. 
With @code{graphic} set to 0, +position codes 33 through 126 map to font indices 33 through 126; with +it set to 1, position codes 33 through 126 map to font indices 161 +through 254 (i.e. the same number but with the high bit set). For +example, for a font whose registry is ISO8859-1, the left half of the +font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right +half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset. +@item ccl-program +A compiled CCL program used to convert a character in this charset into +an index into the font. This is in addition to the @code{graphic} +property. If a CCL program is defined, the position codes of a +character will first be processed according to @code{graphic} and +then passed through the CCL program, with the resulting values used +to index the font. + +This is used, for example, in the Big5 character set (used in Taiwan). +This character set is not ISO-2022-compliant, and its size (94x157) does +not fit within the maximum 96x96 size of ISO-2022-compliant character +sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, +so as to group the most commonly used characters together) into two +charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, +and each charset object uses a CCL program to convert the modified +position codes back into standard Big5 indices to retrieve a character +from a Big5 font. +@end table + +Most of the above properties can only be changed when the charset +is created. @xref{Charset Property Functions}. + +@node Basic Charset Functions +@subsection Basic Charset Functions + +@defun find-charset charset-or-name +This function retrieves the charset of the given name. If +@var{charset-or-name} is a charset object, it is simply returned. +Otherwise, @var{charset-or-name} should be a symbol. If there is no +such charset, @code{nil} is returned. Otherwise the associated charset +object is returned. 
+@end defun + +@defun get-charset name +This function retrieves the charset of the given name. Same as +@code{find-charset} except an error is signalled if there is no such +charset instead of returning @code{nil}. +@end defun + +@defun charset-list +This function returns a list of the names of all defined charsets. +@end defun + +@defun make-charset name doc-string props +This function defines a new character set. This function is for use +with Mule support. @var{name} is a symbol, the name by which the +character set is normally referred. @var{doc-string} is a string +describing the character set. @var{props} is a property list, +describing the specific nature of the character set. The recognized +properties are @code{registry}, @code{dimension}, @code{columns}, +@code{chars}, @code{final}, @code{graphic}, @code{direction}, and +@code{ccl-program}, as previously described. +@end defun + +@defun make-reverse-direction-charset charset new-name +This function makes a charset equivalent to @var{charset} but which goes +in the opposite direction. @var{new-name} is the name of the new +charset. The new charset is returned. +@end defun + +@defun charset-from-attributes dimension chars final &optional direction +This function returns a charset with the given @var{dimension}, +@var{chars}, @var{final}, and @var{direction}. If @var{direction} is +omitted, both directions will be checked (left-to-right will be returned +if character sets exist for both directions). +@end defun + +@defun charset-reverse-direction-charset charset +This function returns the charset (if any) with the same dimension, +number of characters, and final byte as @var{charset}, but which is +displayed in the opposite direction. +@end defun + +@node Charset Property Functions +@subsection Charset Property Functions + +All of these functions accept either a charset name or charset object. + +@defun charset-property charset prop +This function returns property @var{prop} of @var{charset}. 
+@xref{Charset Properties}. +@end defun + +Convenience functions are also provided for retrieving individual +properties of a charset. + +@defun charset-name charset +This function returns the name of @var{charset}. This will be a symbol. +@end defun + +@defun charset-doc-string charset +This function returns the doc string of @var{charset}. +@end defun + +@defun charset-registry charset +This function returns the registry of @var{charset}. +@end defun + +@defun charset-dimension charset +This function returns the dimension of @var{charset}. +@end defun + +@defun charset-chars charset +This function returns the number of characters per dimension of +@var{charset}. +@end defun + +@defun charset-columns charset +This function returns the number of display columns per character (in +TTY mode) of @var{charset}. +@end defun + +@defun charset-direction charset +This function returns the display direction of @var{charset} -- either +@code{l2r} or @code{r2l}. +@end defun + +@defun charset-final charset +This function returns the final byte of the ISO 2022 escape sequence +designating @var{charset}. +@end defun + +@defun charset-graphic charset +This function returns either 0 or 1, depending on whether the position +codes of characters in @var{charset} map to the left or right half +of their font, respectively. +@end defun + +@defun charset-ccl-program charset +This function returns the CCL program, if any, for converting +position codes of characters in @var{charset} into font indices. +@end defun + +The only property of a charset that can currently be set after +the charset has been created is the CCL program. + +@defun set-charset-ccl-program charset ccl-program +This function sets the @code{ccl-program} property of @var{charset} to +@var{ccl-program}. +@end defun + +@node Predefined Charsets +@subsection Predefined Charsets + +The following charsets are predefined in the C code. 
+ +@example +Name Type Fi Gr Dir Registry +-------------------------------------------------------------- +ascii 94 B 0 l2r ISO8859-1 +control-1 94 0 l2r --- +latin-iso8859-1 94 A 1 l2r ISO8859-1 +latin-iso8859-2 96 B 1 l2r ISO8859-2 +latin-iso8859-3 96 C 1 l2r ISO8859-3 +latin-iso8859-4 96 D 1 l2r ISO8859-4 +cyrillic-iso8859-5 96 L 1 l2r ISO8859-5 +arabic-iso8859-6 96 G 1 r2l ISO8859-6 +greek-iso8859-7 96 F 1 l2r ISO8859-7 +hebrew-iso8859-8 96 H 1 r2l ISO8859-8 +latin-iso8859-9 96 M 1 l2r ISO8859-9 +thai-tis620 96 T 1 l2r TIS620 +katakana-jisx0201 94 I 1 l2r JISX0201.1976 +latin-jisx0201 94 J 0 l2r JISX0201.1976 +japanese-jisx0208-1978 94x94 @@ 0 l2r JISX0208.1978 +japanese-jisx0208 94x94 B 0 l2r JISX0208.19(83|90) +japanese-jisx0212 94x94 D 0 l2r JISX0212 +chinese-gb2312 94x94 A 0 l2r GB2312 +chinese-cns11643-1 94x94 G 0 l2r CNS11643.1 +chinese-cns11643-2 94x94 H 0 l2r CNS11643.2 +chinese-big5-1 94x94 0 0 l2r Big5 +chinese-big5-2 94x94 1 0 l2r Big5 +korean-ksc5601 94x94 C 0 l2r KSC5601 +composite 96x96 0 l2r --- +@end example + +The following charsets are predefined in the Lisp code. + +@example +Name Type Fi Gr Dir Registry +-------------------------------------------------------------- +arabic-digit 94 2 0 l2r MuleArabic-0 +arabic-1-column 94 3 0 r2l MuleArabic-1 +arabic-2-column 94 4 0 r2l MuleArabic-2 +sisheng 94 0 0 l2r sisheng_cwnn\|OMRON_UDC_ZH +chinese-cns11643-3 94x94 I 0 l2r CNS11643.1 +chinese-cns11643-4 94x94 J 0 l2r CNS11643.1 +chinese-cns11643-5 94x94 K 0 l2r CNS11643.1 +chinese-cns11643-6 94x94 L 0 l2r CNS11643.1 +chinese-cns11643-7 94x94 M 0 l2r CNS11643.1 +ethiopic 94x94 2 0 l2r Ethio +ascii-r2l 94 B 0 r2l ISO8859-1 +ipa 96 0 1 l2r MuleIPA +vietnamese-lower 96 1 1 l2r VISCII1.1 +vietnamese-upper 96 2 1 l2r VISCII1.1 +@end example + +For all of the above charsets, the dimension and number of columns are +the same. + +Note that ASCII, Control-1, and Composite are handled specially. 
+This is why some of the fields are blank; and some of the filled-in +fields (e.g. the type) are not really accurate. + +@node MULE Characters +@section MULE Characters + +@defun make-char charset arg1 &optional arg2 +This function makes a multi-byte character from @var{charset} and octets +@var{arg1} and @var{arg2}. +@end defun + +@defun char-charset ch +This function returns the character set of char @var{ch}. +@end defun + +@defun char-octet ch &optional n +This function returns the octet (i.e. position code) numbered @var{n} +(should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted. +@end defun + +@defun find-charset-region start end &optional buffer +This function returns a list of the charsets in the region between +@var{start} and @var{end}. @var{buffer} defaults to the current buffer +if omitted. +@end defun + +@defun find-charset-string string +This function returns a list of the charsets in @var{string}. +@end defun + +@node Composite Characters +@section Composite Characters + +Composite characters are not yet completely implemented. + +@defun make-composite-char string +This function converts a string into a single composite character. The +character is the result of overstriking all the characters in the +string. +@end defun + +@defun composite-char-string ch +This function returns a string of the characters comprising a composite +character. +@end defun + +@defun compose-region start end &optional buffer +This function composes the characters in the region from @var{start} to +@var{end} in @var{buffer} into one composite character. The composite +character replaces the composed characters. @var{buffer} defaults to +the current buffer if omitted. +@end defun + +@defun decompose-region start end &optional buffer +This function decomposes any composite characters in the region from +@var{start} to @var{end} in @var{buffer}. 
This converts each composite +character into one or more characters, the individual characters out of +which the composite character was formed. Non-composite characters are +left as-is. @var{buffer} defaults to the current buffer if omitted. +@end defun + +@node ISO 2022 +@section ISO 2022 + +This section briefly describes the ISO 2022 encoding standard. For a more +thorough understanding, please refer to the ISO 2022 standard +itself. + +Character sets (@dfn{charsets}) are classified into the following four +categories, according to the number of characters in the charset: +94-charset, 96-charset, 94x94-charset, and 96x96-charset. + +@need 1000 +@table @asis +@item 94-charset + ASCII(B), left(J) and right(I) half of JISX0201, ... +@item 96-charset + Latin-1(A), Latin-2(B), Latin-3(C), ... +@item 94x94-charset + GB2312(A), JISX0208(B), KSC5601(C), ... +@item 96x96-charset + none for the moment +@end table + +The character in parentheses after the name of each charset +is the @dfn{final character} @var{F}, which can be regarded as +the identifier of the charset. ECMA allocates @var{F} to each +charset. @var{F} is in the range of 0x30..0x7E, but 0x30..0x3F +are only for private use. + +Note: @dfn{ECMA} = European Computer Manufacturers Association + +There are four @dfn{registers of charsets}, called G0 through G3. +You can designate (or assign) any charset to one of these +registers. + +The code space contained within one octet (of size 256) is divided into +4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a +register of charsets can be invoked. + +@example +@group + C0: 0x00 - 0x1F + GL: 0x20 - 0x7F + C1: 0x80 - 0x9F + GR: 0xA0 - 0xFF +@end group +@end example + +Usually, in the initial state, G0 is invoked into GL, and G1 +is invoked into GR. + +ISO 2022 distinguishes 7-bit environments and 8-bit environments. In +7-bit environments, only C0 and GL are used.
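The division of the octet code space into the four areas listed above can be sketched in Python as follows. This is only an illustration of the ranges given in this section; the function name code_area is hypothetical and does not correspond to anything in XEmacs/MULE.

```python
# Illustrative sketch only: classify an octet into the ISO 2022 areas
# C0, GL, C1, and GR using the ranges given in the text.


def code_area(octet):
    """Return the name of the code-space area that contains octet (0-255)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("octet out of range")
    if octet <= 0x1F:
        return "C0"
    if octet <= 0x7F:
        return "GL"
    if octet <= 0x9F:
        return "C1"
    return "GR"


print(code_area(0x1B))  # C0 (the ESC control character)
print(code_area(0x41))  # GL (ASCII 'A' when G0 = ASCII is invoked into GL)
print(code_area(0x8F))  # C1
print(code_area(0xA4))  # GR
```

In a 7-bit environment only octets that this function reports as C0 or GL can appear in the stream.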
+ +Charset designation is done by escape sequences of the form: + +@example + ESC [@var{I}] @var{I} @var{F} +@end example + +where @var{I} is an intermediate character in the range 0x20 - 0x2F, and +@var{F} is the final character identifying this charset. + +The meanings of the intermediate characters are: + +@example +@group + $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). + ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. + ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. + * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. + + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. + - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. + . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. + / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. +@end group +@end example + +The following rule is not allowed in ISO 2022 but can be used in Mule. + +@example + , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. +@end example + +Here are examples of designations: + +@example +@group + ESC ( B : designate to G0 ASCII + ESC - A : designate to G1 Latin-1 + ESC $ ( A or ESC $ A : designate to G0 GB2312 + ESC $ ( B or ESC $ B : designate to G0 JISX0208 + ESC $ ) C : designate to G1 KSC5601 +@end group +@end example + +To use a charset designated to G2 or G3, and to use a charset designated +to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 +into GL. There are two types of invocation: Locking Shift (which stays in +effect until the next shift) and Single Shift (which affects one character only).
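The designation sequences described above can be decoded mechanically. The following Python sketch is one possible reading of the intermediate-character table, including the Mule-only ',' rule and the short forms ESC $ @, ESC $ A, and ESC $ B. The function parse_designation and the table _INTERMEDIATES are hypothetical names introduced for this illustration, and the code is not taken from XEmacs/MULE.

```python
# Illustrative sketch only: decode an ISO 2022 designation sequence
# according to the intermediate-character table in this section.

ESC = 0x1B

# (intermediate byte, register, characters per dimension)
# The 0x2C (',') entry is the Mule extension, not part of ISO 2022.
_INTERMEDIATES = (
    (0x28, "G0", 94), (0x29, "G1", 94), (0x2A, "G2", 94), (0x2B, "G3", 94),
    (0x2C, "G0", 96), (0x2D, "G1", 96), (0x2E, "G2", 96), (0x2F, "G3", 96),
)


def parse_designation(seq):
    """Parse a designation sequence given as bytes.

    Returns (register, chars, dimension, final) or raises ValueError.
    The final byte is returned as a one-character string and its range
    is not checked.
    """
    data = bytes(seq)
    if len(data) < 3 or data[0] != ESC:
        raise ValueError("not a designation sequence")
    body = data[1:]
    dimension = 1
    if body[0] == 0x24:  # '$' marks a charset of dimension 2
        dimension = 2
        body = body[1:]
    if dimension == 2 and len(body) == 1 and body[0] in (0x40, 0x41, 0x42):
        # Short form: designate a 94x94 charset to G0.
        return ("G0", 94, 2, chr(body[0]))
    if len(body) != 2:
        raise ValueError("malformed designation sequence")
    for byte, register, chars in _INTERMEDIATES:
        if body[0] == byte:
            return (register, chars, dimension, chr(body[1]))
    raise ValueError("unknown intermediate byte")


print(parse_designation(b"\x1b(B"))   # ('G0', 94, 1, 'B')  ASCII
print(parse_designation(b"\x1b-A"))   # ('G1', 96, 1, 'A')  Latin-1
print(parse_designation(b"\x1b$B"))   # ('G0', 94, 2, 'B')  JISX0208, short form
print(parse_designation(b"\x1b$)C"))  # ('G1', 94, 2, 'C')  KSC5601
```

The sketch only identifies which register receives which kind of charset; mapping the final byte to an actual charset would additionally require a table keyed on dimension, number of characters, and final byte.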
+ +Locking Shift is done as follows: + +@example + LS0 or SI (0x0F): invoke G0 into GL + LS1 or SO (0x0E): invoke G1 into GL + LS2: invoke G2 into GL + LS3: invoke G3 into GL + LS1R: invoke G1 into GR + LS2R: invoke G2 into GR + LS3R: invoke G3 into GR +@end example + +Single Shift is done as follows: + +@example +@group + SS2 or ESC N: invoke G2 into GL + SS3 or ESC O: invoke G3 into GL +@end group +@end example + +(#### Ben says: I think the above is slightly incorrect. It appears that +SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and +ESC O behave as indicated. The above definitions will not parse +EUC-encoded text correctly, and it looks like the code in mule-coding.c +has similar problems.) + +As the above suggests, there are many ISO-2022-compliant ways of +encoding multilingual text. Many such coding systems are in actual use, +for example X11's Compound Text, Japanese JUNET code, and so-called +EUC (Extended UNIX Code); all of these are variants of ISO 2022. + +In Mule, we characterize ISO 2022 by the following attributes: + +@enumerate +@item +Initial designation to G0 through G3. +@item +Allow designation of short form for Japanese and Chinese. +@item +Should we designate ASCII to G0 before control characters? +@item +Should we designate ASCII to G0 at the end of line? +@item +7-bit environment or 8-bit environment. +@item +Use Locking Shift or not. +@item +Use ASCII or JISX0201-1976-Roman. +@item +Use JISX0208-1983 or JISX0208-1978. +@end enumerate + +(The last two are only for Japanese.) + +By specifying these attributes, you can create any variant +of ISO 2022. + +Here are several examples: + +@example +@group +junet -- Coding system used in JUNET. + 1. G0 <- ASCII, G1..3 <- never used + 2. Yes. + 3. Yes. + 4. Yes. + 5. 7-bit environment + 6. No. + 7. Use ASCII + 8. Use JISX0208-1983 +@end group + +@group +ctext -- Compound Text + 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used + 2. No. + 3. No. + 4. Yes. + 5.
8-bit environment + 6. No. + 7. Use ASCII + 8. Use JISX0208-1983 +@end group + +@group +euc-china -- Chinese EUC. Although many people call this +as "GB encoding", the name may cause misunderstanding. + 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used + 2. No. + 3. Yes. + 4. Yes. + 5. 8-bit environment + 6. No. + 7. Use ASCII + 8. Use JISX0208-1983 +@end group + +@group +korean-mail -- Coding system used in Korean network. + 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used + 2. No. + 3. Yes. + 4. Yes. + 5. 7-bit environment + 6. Yes. + 7. No. + 8. No. +@end group +@end example + +Mule creates all these coding systems by default. + +@node Coding Systems +@section Coding Systems + +A coding system is an object that defines how text containing multiple +character sets is encoded into a stream of (typically 8-bit) bytes. The +coding system is used to decode the stream into a series of characters +(which may be from multiple charsets) when the text is read from a file +or process, and is used to encode the text back into the same format +when it is written out to a file or process. + +For example, many ISO-2022-compliant coding systems (such as Compound +Text, which is used for inter-client data under the X Window System) use +escape sequences to switch between different charsets -- Japanese Kanji, +for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with +@samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See +@code{make-coding-system} for more information. + +Coding systems are normally identified using a symbol, and the symbol is +accepted in place of the actual coding system object whenever a coding +system is called for. (This is similar to how faces and charsets work.) + +@defun coding-system-p object +This function returns non-@code{nil} if @var{object} is a coding system. +@end defun + +@menu +* Coding System Types:: Classifying coding systems. +* EOL Conversion:: Dealing with different ways of denoting + the end of a line. 
+* Coding System Properties:: Properties of a coding system. +* Basic Coding System Functions:: Working with coding systems. +* Coding System Property Functions:: Retrieving a coding system's properties. +* Encoding and Decoding Text:: Encoding and decoding text. +* Detection of Textual Encoding:: Determining how text is encoded. +* Big5 and Shift-JIS Functions:: Special functions for these non-standard + encodings. +@end menu + +@node Coding System Types +@subsection Coding System Types + +@table @code +@item nil +@itemx autodetect +Automatic conversion. XEmacs attempts to detect the coding system used +in the file. +@item no-conversion +No conversion. Use this for binary files and such. On output, graphic +characters that are not in ASCII or Latin-1 will be replaced by a +@samp{?}. (For a no-conversion-encoded buffer, these characters will +only be present if you explicitly insert them.) +@item shift-jis +Shift-JIS (a Japanese encoding commonly used in PC operating systems). +@item iso2022 +Any ISO-2022-compliant encoding. Among other things, this includes JIS +(the Japanese encoding commonly used for e-mail), national variants of +EUC (the standard Unix encoding for Japanese and other languages), and +Compound Text (an encoding used in X11). You can specify more specific +information about the conversion with the @var{flags} argument. +@item big5 +Big5 (the encoding commonly used for Taiwanese). +@item ccl +The conversion is performed using a user-written pseudo-code program. +CCL (Code Conversion Language) is the name of this pseudo-code. +@item internal +Write out or read in the raw contents of the memory representing the +buffer's text. This is primarily useful for debugging purposes, and is +only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set +(the @samp{--debug} configure option). 
@strong{Warning}: Reading in a +file using @code{internal} conversion can result in an internal +inconsistency in the memory representing a buffer's text, which will +produce unpredictable results and may cause XEmacs to crash. Under +normal circumstances you should never use @code{internal} conversion. +@end table + +@node EOL Conversion +@subsection EOL Conversion + +@table @code +@item nil +Automatically detect the end-of-line type (LF, CRLF, or CR). Also +generate subsidiary coding systems named @code{@var{name}-unix}, +@code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to +this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf}, +and @code{cr}, respectively. +@item lf +The end of a line is marked externally using ASCII LF. Since this is +also the way that XEmacs represents an end-of-line internally, +specifying this option results in no end-of-line conversion. This is +the standard format for Unix text files. +@item crlf +The end of a line is marked externally using ASCII CRLF. This is the +standard format for MS-DOS text files. +@item cr +The end of a line is marked externally using ASCII CR. This is the +standard format for Macintosh text files. +@item t +Automatically detect the end-of-line type but do not generate subsidiary +coding systems. (This value is converted to @code{nil} when stored +internally, and @code{coding-system-property} will return @code{nil}.) +@end table + +@node Coding System Properties +@subsection Coding System Properties + +@table @code +@item mnemonic +String to be displayed in the modeline when this coding system is +active. + +@item eol-type +End-of-line conversion to be used. It should be one of the types +listed in @ref{EOL Conversion}. + +@item post-read-conversion +Function called after a file has been read in, to perform the decoding. +Called with two arguments, @var{beg} and @var{end}, denoting a region of +the current buffer to be decoded. 
+ +@item pre-write-conversion +Function called before a file is written out, to perform the encoding. +Called with two arguments, @var{beg} and @var{end}, denoting a region of +the current buffer to be encoded. +@end table + +The following additional properties are recognized if @var{type} is +@code{iso2022}: + +@table @code +@item charset-g0 +@itemx charset-g1 +@itemx charset-g2 +@itemx charset-g3 +The character set initially designated to the G0 - G3 registers. +The value should be one of + +@itemize @bullet +@item +A charset object (designate that character set) +@item +@code{nil} (do not ever use this register) +@item +@code{t} (no character set is initially designated to the register, but +may be later on; this automatically sets the corresponding +@code{force-g*-on-output} property) +@end itemize + +@item force-g0-on-output +@itemx force-g1-on-output +@itemx force-g2-on-output +@itemx force-g3-on-output +If non-@code{nil}, send an explicit designation sequence on output +before using the specified register. + +@item short +If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A}, +and @samp{ESC $ B} on output in place of the full designation sequences +@samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}. + +@item no-ascii-eol +If non-@code{nil}, don't designate ASCII to G0 at each end of line on +output. Setting this to non-@code{nil} also suppresses other +state-resetting that normally happens at the end of a line. + +@item no-ascii-cntl +If non-@code{nil}, don't designate ASCII to G0 before control chars on +output. + +@item seven +If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit +environment. + +@item lock-shift +If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or +designation by escape sequence. + +@item no-iso6429 +If non-@code{nil}, don't use ISO6429's direction specification. 
+ +@item escape-quoted +If non-nil, literal control characters that are the same as the +beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in +particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F), +and CSI (0x9B)) are ``quoted'' with an escape character so that they can +be properly distinguished from an escape sequence. (Note that doing +this results in a non-portable encoding.) This encoding flag is used for +byte-compiled files. Note that ESC is a good choice for a quoting +character because there are no escape sequences whose second byte is a +character from the Control-0 or Control-1 character sets; this is +explicitly disallowed by the ISO 2022 standard. + +@item input-charset-conversion +A list of conversion specifications, specifying conversion of characters +in one charset to another when decoding is performed. Each +specification is a list of two elements: the source charset, and the +destination charset. + +@item output-charset-conversion +A list of conversion specifications, specifying conversion of characters +in one charset to another when encoding is performed. The form of each +specification is the same as for @code{input-charset-conversion}. +@end table + +The following additional properties are recognized (and required) if +@var{type} is @code{ccl}: + +@table @code +@item decode +CCL program used for decoding (converting to internal format). + +@item encode +CCL program used for encoding (converting to external format). +@end table + +@node Basic Coding System Functions +@subsection Basic Coding System Functions + +@defun find-coding-system coding-system-or-name +This function retrieves the coding system of the given name. + +If @var{coding-system-or-name} is a coding-system object, it is simply +returned. Otherwise, @var{coding-system-or-name} should be a symbol. +If there is no such coding system, @code{nil} is returned. Otherwise +the associated coding system object is returned. 
+@end defun + +@defun get-coding-system name +This function retrieves the coding system of the given name. It is the +same as @code{find-coding-system}, except that an error is signalled if +there is no such coding system instead of returning @code{nil}. +@end defun + +@defun coding-system-list +This function returns a list of the names of all defined coding systems. +@end defun + +@defun coding-system-name coding-system +This function returns the name of the given coding system. +@end defun + +@defun make-coding-system name type &optional doc-string props +This function registers symbol @var{name} as a coding system. + +@var{type} describes the conversion method used and should be one of +the types listed in @ref{Coding System Types}. + +@var{doc-string} is a string describing the coding system. + +@var{props} is a property list, describing the specific nature of the +coding system. Recognized properties are as in @ref{Coding System +Properties}. +@end defun + +@defun copy-coding-system old-coding-system new-name +This function copies @var{old-coding-system} to @var{new-name}. If +@var{new-name} does not name an existing coding system, a new one will +be created. +@end defun + +@defun subsidiary-coding-system coding-system eol-type +This function returns the subsidiary coding system of +@var{coding-system} with eol type @var{eol-type}. +@end defun + +@node Coding System Property Functions +@subsection Coding System Property Functions + +@defun coding-system-doc-string coding-system +This function returns the doc string for @var{coding-system}. +@end defun + +@defun coding-system-type coding-system +This function returns the type of @var{coding-system}. +@end defun + +@defun coding-system-property coding-system prop +This function returns the @var{prop} property of @var{coding-system}.
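
As an illustration, the property functions can be tried on a freshly
made coding system. This is only a sketch: the name @code{my-dos-text},
its doc string, and its mnemonic are hypothetical, and the property list
uses only properties described in @ref{Coding System Properties}.

@example
;; `my-dos-text' is a made-up name used only for this illustration.
(make-coding-system
 'my-dos-text 'no-conversion
 "Unconverted text with CRLF line endings (illustration only)."
 '(mnemonic "DosTxt" eol-type crlf))

;; The symbol is accepted in place of the coding system object.
(coding-system-type 'my-dos-text)
(coding-system-property 'my-dos-text 'eol-type)
(coding-system-property 'my-dos-text 'mnemonic)
(coding-system-doc-string 'my-dos-text)
@end example

Assuming the definition succeeds, each query should report the
corresponding value that was passed to @code{make-coding-system}.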
+@end defun + +@node Encoding and Decoding Text +@subsection Encoding and Decoding Text + +@defun decode-coding-region start end coding-system &optional buffer +This function decodes the text between @var{start} and @var{end} which +is encoded in @var{coding-system}. This is useful if you've read in +encoded text from a file without decoding it (e.g. you read in a +JIS-formatted file but used the @code{binary} or @code{no-conversion} coding +system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the +decoded text is returned. @var{buffer} defaults to the current buffer +if unspecified. +@end defun + +@defun encode-coding-region start end coding-system &optional buffer +This function encodes the text between @var{start} and @var{end} using +@var{coding-system}. This will, for example, convert Japanese +characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use +the JIS encoding. The length of the encoded text is returned. +@var{buffer} defaults to the current buffer if unspecified. +@end defun + +@node Detection of Textual Encoding +@subsection Detection of Textual Encoding + +@defun coding-category-list +This function returns a list of all recognized coding categories. +@end defun + +@defun set-coding-priority-list list +This function changes the priority order of the coding categories. +@var{list} should be a list of coding categories, in descending order of +priority. Unspecified coding categories will be lower in priority than +all specified ones, in the same relative order they were in previously. +@end defun + +@defun coding-priority-list +This function returns a list of coding categories in descending order of +priority. +@end defun + +@defun set-coding-category-system coding-category coding-system +This function changes the coding system associated with a coding category. +@end defun + +@defun coding-category-system coding-category +This function returns the coding system associated with a coding category.
+@end defun + +@defun detect-coding-region start end &optional buffer +This function detects the coding system of the text in the region +between @var{start} and @var{end}. The return value is a list of +possible coding systems ordered by priority. If only ASCII characters +are found, it returns @code{autodetect} or one of its subsidiary coding +systems according to the detected end-of-line type. The optional +argument @var{buffer} defaults to the current buffer. +@end defun + +@node Big5 and Shift-JIS Functions +@subsection Big5 and Shift-JIS Functions + +These are special functions for working with the non-standard +Shift-JIS and Big5 encodings. + +@defun decode-shift-jis-char code +This function decodes a JISX0208 character encoded in the Shift-JIS +coding system. @var{code} is the character code in Shift-JIS as a cons +of two bytes. The corresponding character is returned. +@end defun + +@defun encode-shift-jis-char ch +This function encodes the JISX0208 character @var{ch} in the Shift-JIS +coding system. The corresponding character code in Shift-JIS is +returned as a cons of two bytes. +@end defun + +@defun decode-big5-char code +This function decodes a Big5 character encoded in the Big5 coding +system. @var{code} is the character code in Big5. The corresponding +character is returned. +@end defun + +@defun encode-big5-char ch +This function encodes the Big5 character @var{ch} in the Big5 coding +system. The corresponding character code in Big5 is returned. +@end defun + +@node CCL, Category Tables, Coding Systems, MULE +@section CCL + +CCL (Code Conversion Language) is a simple structured programming +language designed for character coding conversions. A CCL program is +compiled to CCL code (represented by a vector of integers) and executed +by the CCL interpreter embedded in Emacs. The CCL interpreter +implements a virtual machine with 8 registers called @code{r0}, ..., +@code{r7}, a number of control structures, and some I/O operators.
Take +care when using registers @code{r0} (used in implicit @dfn{set} +statements) and especially @code{r7} (used internally by several +statements and operations, especially for multiple return values and I/O +operations). + +CCL is used for code conversion during process I/O and file I/O for +non-ISO2022 coding systems. (It is the only way for a user to specify a +code conversion function.) It is also used for calculating the code +point of an X11 font from a character code. However, since CCL is +designed as a powerful programming language, it can be used for more +generic calculation where efficiency is demanded. A combination of +three or more arithmetic operations can be calculated faster by CCL than +by Emacs Lisp. + +@strong{Warning:} The code in @file{src/mule-ccl.c} and +@file{$packages/lisp/mule-base/mule-ccl.el} is the definitive +description of CCL's semantics. The previous version of this section +contained several typos and obsolete names left from earlier versions of +MULE, and many may remain. (I am not an experienced CCL programmer; the +few who know CCL well find writing English painful.) + +A CCL program transforms an input data stream into an output data +stream. The input stream, held in a buffer of constant bytes, is left +unchanged. The buffer may be filled by an external input operation, +taken from an Emacs buffer, or taken from a Lisp string. The output +buffer is a dynamic array of bytes, which can be written by an external +output operation, inserted into an Emacs buffer, or returned as a Lisp +string. + +A CCL program is a (Lisp) list containing two or three members. The +first member is the @dfn{buffer magnification}, which indicates the +required minimum size of the output buffer as a multiple of the input +buffer. It is followed by the @dfn{main block} which executes while +there is input remaining, and an optional @dfn{EOF block} which is +executed when the input is exhausted. Both the main block and the EOF +block are CCL blocks. 
+ +A @dfn{CCL block} is either a CCL statement or list of CCL statements. +A @dfn{CCL statement} is either a @dfn{set statement} (either an integer +or an @dfn{assignment}, which is a list of a register to receive the +assignment, an assignment operator, and an expression) or a @dfn{control +statement} (a list starting with a keyword, whose allowable syntax +depends on the keyword). + +@menu +* CCL Syntax:: CCL program syntax in BNF notation. +* CCL Statements:: Semantics of CCL statements. +* CCL Expressions:: Operators and expressions in CCL. +* Calling CCL:: Running CCL programs. +* CCL Examples:: The encoding functions for Big5 and KOI-8. +@end menu + +@node CCL Syntax, CCL Statements, CCL, CCL +@comment Node, Next, Previous, Up +@subsection CCL Syntax + +The full syntax of a CCL program in BNF notation: + +@format +CCL_PROGRAM := + (BUFFER_MAGNIFICATION + CCL_MAIN_BLOCK + [ CCL_EOF_BLOCK ]) + +BUFFER_MAGNIFICATION := integer +CCL_MAIN_BLOCK := CCL_BLOCK +CCL_EOF_BLOCK := CCL_BLOCK + +CCL_BLOCK := + STATEMENT | (STATEMENT [STATEMENT ...]) +STATEMENT := + SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE + | CALL | END + +SET := + (REG = EXPRESSION) + | (REG ASSIGNMENT_OPERATOR EXPRESSION) + | integer + +EXPRESSION := ARG | (EXPRESSION OPERATOR ARG) + +IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK]) +BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...]) +LOOP := (loop STATEMENT [STATEMENT ...]) +BREAK := (break) +REPEAT := + (repeat) + | (write-repeat [REG | integer | string]) + | (write-read-repeat REG [integer | ARRAY]) +READ := + (read REG ...) + | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK) + | (read-branch REG CCL_BLOCK [CCL_BLOCK ...]) +WRITE := + (write REG ...) 
+ | (write EXPRESSION) + | (write integer) | (write string) | (write REG ARRAY) + | string +CALL := (call ccl-program-name) +END := (end) + +REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7 +ARG := REG | integer +OPERATOR := + + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | // + | < | > | == | <= | >= | != | de-sjis | en-sjis +ASSIGNMENT_OPERATOR := + += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>= +ARRAY := '[' integer ... ']' +@end format + +@node CCL Statements, CCL Expressions, CCL Syntax, CCL +@comment Node, Next, Previous, Up +@subsection CCL Statements + +The Emacs Code Conversion Language provides the following statement +types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, +@dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}. + +@heading Set statement: + +The @dfn{set} statement has three variants with the syntaxes +@samp{(@var{reg} = @var{expression})}, +@samp{(@var{reg} @var{assignment_operator} @var{expression})}, and +@samp{@var{integer}}. The assignment operator variation of the +@dfn{set} statement works the same way as the corresponding C expression +statement does. The assignment operators are @code{+=}, @code{-=}, +@code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=}, +@code{<<=}, and @code{>>=}, and they have the same meanings as in C. A +``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of +the form @code{(r0 = @var{integer})}. + +@heading I/O statements: + +The @dfn{read} statement takes one or more registers as arguments. It +reads one byte (a C char) from the input into each register in turn. + +The @dfn{write} statement takes several forms. In the form +@samp{(write @var{reg} ...)} it takes one or more registers as arguments +and writes each in turn to the output. The integer in a register +(interpreted as an Emchar) is encoded to multibyte form (i.e., Bufbytes) +and written to the current output buffer. If it is less than 256, it is +written as is.
+The forms @samp{(write @var{expression})} and @samp{(write +@var{integer})} are treated analogously. The form @samp{(write +@var{string})} writes the constant string to the output. A +``naked string'' @samp{@var{string}} is equivalent to the statement @samp{(write +@var{string})}. The form @samp{(write @var{reg} @var{array})} writes +the @var{reg}th element of the @var{array} to the output. + +@heading Conditional statements: + +The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and +an optional @var{second CCL block} as arguments. If the +@var{expression} evaluates to non-zero, the first @var{CCL block} is +executed. Otherwise, if there is a @var{second CCL block}, it is +executed. + +The @dfn{read-if} variant of the @dfn{if} statement takes an +@var{expression}, a @var{CCL block}, and an optional @var{second CCL +block} as arguments. The @var{expression} must have the form +@code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is +a register or an integer). The @code{read-if} statement first reads +from the input into the first register operand in the @var{expression}, +then conditionally executes a CCL block just as the @code{if} statement +does. + +The @dfn{branch} statement takes an @var{expression} and one or more CCL +blocks as arguments. The CCL blocks are treated as a zero-indexed +array, and the @code{branch} statement uses the @var{expression} as the +index of the CCL block to execute. Null CCL blocks may be used as +no-ops, continuing execution with the statement following the +@code{branch} statement in the containing CCL block. Out-of-range +values for the @var{expression} are also treated as no-ops. + +The @dfn{read-branch} variant of the @dfn{branch} statement takes a +@var{register} and one or more CCL blocks as arguments.
The @code{read-branch} statement first reads from +the input into the @var{register}, then conditionally executes a CCL +block just as the @code{branch} statement does. + +@heading Loop control statements: + +The @dfn{loop} statement creates a block with an implied jump from the +end of the block back to its head. The loop is exited on a @code{break} +statement, and continued without executing the tail by a @code{repeat} +statement. + +The @dfn{break} statement, written @samp{(break)}, terminates the +current loop and continues with the next statement in the current +block. + +The @dfn{repeat} statement has three variants, @code{repeat}, +@code{write-repeat}, and @code{write-read-repeat}. Each continues the +current loop from its head, possibly after performing I/O. +@code{repeat} takes no arguments and does no I/O before jumping. +@code{write-repeat} takes a single argument (a register, an +integer, or a string), writes it to the output, then jumps. +@code{write-read-repeat} takes one or two arguments. The first must +be a register. The second may be an integer or an array; if absent, it +is implicitly set to the first (register) argument. +@code{write-read-repeat} writes its second argument to the output, then +reads from the input into the register, and finally jumps. See the +@code{write} and @code{read} statements for the semantics of the I/O +operations for each type of argument. + +@heading Other control statements: + +The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, +executes a CCL program as a subroutine. It does not return a value to +the caller, but can modify the register status. + +The @dfn{end} statement, written @samp{(end)}, terminates the CCL +program successfully, and returns to caller (which may be a CCL +program). It does not alter the status of the registers. 
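
The statements above can be combined as in the following sketch, which
is a hypothetical illustration rather than a program from the XEmacs
sources. It copies its input to its output, converting the ASCII
lowercase letters (codes 97 through 122) to uppercase by subtracting 32.
The name @code{my-ccl-upcase} is made up for this example.

@example
;; Hypothetical example: upcase ASCII letters, copy everything else.
(defconst my-ccl-upcase
  '(1                          ; output is never longer than the input
    ((loop
      (read r0)                ; next input byte
      (if (r0 >= 97)           ; at or above `a'?
          (if (r0 <= 122)      ; at or below `z'?
              (r0 -= 32)))     ; then shift to the uppercase range
      (write-repeat r0)))))    ; write r0 and restart the loop
@end example

Each @code{if} here has a single statement as its first CCL block and no
second block, so bytes outside the lowercase range pass through
unchanged.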
+ +@node CCL Expressions, Calling CCL, CCL Statements, CCL +@comment Node, Next, Previous, Up +@subsection CCL Expressions + +CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions +consist of a single @var{operand}, either a register (one of @code{r0}, +..., @code{r7}) or an integer. Complex expressions are lists of the +form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike +C, assignments are not expressions. + +In the following table, @var{X} is the target register for a @dfn{set}. +In subexpressions, this is implicitly @code{r7}. This means that +@code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used +freely in subexpressions, since they return parts of their values in +@code{r7}. @var{Y} may be an expression, register, or integer, while +@var{Z} must be a register or an integer. + +@multitable @columnfractions .22 .14 .09 .55 +@item Name @tab Operator @tab Code @tab C-like Description +@item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z +@item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z +@item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z +@item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z +@item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z +@item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z +@item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z +@item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z +@item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z +@item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z +@item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z +@item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF +@item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z +@item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z) +@item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z) +@item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z) +@item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z) +@item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z) +@item
CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z) +@item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z)) +@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) +@item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) +@item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) +@end multitable + +The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, +CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. CCL_ENCODE_SJIS and +CCL_DECODE_SJIS treat their first and second operands as the high and +low bytes of a two-byte character code. (SJIS stands for Shift JIS, an +encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a +complicated transformation of the Japanese standard JIS encoding to +Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to +represent the SJIS operations in infix form. + +@node Calling CCL, CCL Examples, CCL Expressions, CCL +@comment Node, Next, Previous, Up +@subsection Calling CCL + +CCL programs are called automatically during Emacs buffer I/O when the +external representation has a coding system type of @code{shift-jis}, +@code{big5}, or @code{ccl}. The program is specified by the coding +system (@pxref{Coding Systems}). You can also call CCL programs from +other CCL programs, and from Lisp using these functions: + +@defun ccl-execute ccl-program status +Execute @var{ccl-program} with registers initialized by +@var{status}. @var{ccl-program} is a vector of compiled CCL code +created by @code{ccl-compile}. It is an error for the program to try to +execute a CCL I/O command. @var{status} must be a vector of nine +values, specifying the initial value for the R0 through R7 registers and +for the instruction counter IC. A @code{nil} value for a register +initializer causes the register to be set to 0. A @code{nil} value for +the IC initializer causes execution to start at the beginning of the +program.
When the program is done, @var{status} is modified (by +side-effect) to contain the ending values for the corresponding +registers and IC. +@end defun + +@defun ccl-execute-on-string ccl-program status str &optional continue +Execute @var{ccl-program} with initial @var{status} on +@var{str}. @var{ccl-program} is a vector of compiled CCL code +created by @code{ccl-compile}. @var{status} must be a vector of nine +values, specifying the initial value for the R0 through R7 registers and +for the instruction counter IC. A @code{nil} value for a register +initializer causes the register to be set to 0. A @code{nil} value for +the IC initializer causes execution to start at the beginning of the +program. An optional fourth argument @var{continue}, if non-@code{nil}, +causes the IC to remain on the unsatisfied read operation if the program +terminates due to exhaustion of the input buffer. Otherwise the IC is +set to the end of the program. When the program is done, @var{status} +is modified (by side-effect) to contain the ending values for the +corresponding registers and IC. Returns the resulting string. +@end defun + +To call a CCL program from another CCL program, it must first be +registered: + +@defun register-ccl-program name ccl-program +Register @var{name} for CCL program @var{ccl-program} in +@code{ccl-program-table}. @var{ccl-program} should be the compiled form +of a CCL program, or @code{nil}. Returns the index number of the +registered CCL program. +@end defun + +Information about the processor time used by the CCL interpreter can be +obtained using these functions: + +@defun ccl-elapsed-time +Returns the elapsed processor time of the CCL interpreter as a cons of +user and system time, as floating point numbers measured in seconds. If +only one overall value can be determined, the return value will be a +cons of that value and 0. +@end defun + +@defun ccl-reset-elapsed-time +Resets the CCL interpreter's internal elapsed time registers.
+@end defun + +@node CCL Examples, , Calling CCL, CCL +@comment Node, Next, Previous, Up +@subsection CCL Examples + +This section is not yet written. + +@node Category Tables, , CCL, MULE +@section Category Tables + + A category table is a type of char table used for keeping track of +categories. Categories are used for classifying characters for use in +regexps -- you can refer to a category rather than having to use a +complicated [] expression (and category lookups are significantly +faster). + + There are 95 different categories available, one for each printable +character (including space) in the ASCII charset. Each category is +designated by one such character, called a @dfn{category designator}. +They are specified in a regexp using the syntax @samp{\cX}, where X is a +category designator. (This is not yet implemented.) + + A category table specifies, for each character, the categories that +the character is in. Note that a character can be in more than one +category. More specifically, a category table maps from a character to +either the value @code{nil} (meaning the character is in no categories) +or a 95-element bit vector, specifying for each of the 95 categories +whether the character is in that category. + + Special Lisp functions are provided that abstract this, so you do not +have to directly manipulate bit vectors. + +@defun category-table-p obj +This function returns @code{t} if @var{obj} is a category table. +@end defun + +@defun category-table &optional buffer +This function returns the current category table. This is the one +specified by the current buffer, or by @var{buffer} if it is +non-@code{nil}. +@end defun + +@defun standard-category-table +This function returns the standard category table. This is the one used +for new buffers. +@end defun + +@defun copy-category-table &optional table +This function constructs a new category table and returns it. It is a +copy of @var{table}, which defaults to the standard category table.
+@end defun + +@defun set-category-table table &optional buffer +This function selects @var{table}, which must be a category table, as +the new category table for @var{buffer}. @var{buffer} defaults to the +current buffer if omitted. +@end defun + +@defun category-designator-p obj +This function returns @code{t} if @var{obj} is a category designator (a +char in the range @samp{' '} to @samp{'~'}). +@end defun + +@defun category-table-value-p obj +This function returns @code{t} if @var{obj} is a category table value. +Valid values are @code{nil} or a bit vector of size 95. +@end defun +
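
As a brief sketch of how these functions fit together, the following
form gives the current buffer its own copy of the standard category
table and then applies the predicates described above. It uses only
functions documented in this section, and it changes the category table
of whatever buffer is current when it is evaluated.

@example
(let ((table (copy-category-table)))   ; copy of the standard table
  (set-category-table table)           ; install it in the current buffer
  (list (category-table-p table)       ; is it a category table?
        (category-designator-p ?a)     ; `a' lies between ` ' and `~'
        (category-table-value-p nil))) ; nil means "in no categories"
@end example

Working on a copy leaves the standard category table, and therefore
newly created buffers, unaffected.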