comparison man/lispref/mule.texi @ 428:3ecd8885ac67 r21-2-22

Import from CVS: tag r21-2-22
author cvs
date Mon, 13 Aug 2007 11:28:15 +0200
parents
children 8de8e3f6228a
427:0a0253eac470 428:3ecd8885ac67
1 @c -*-texinfo-*-
2 @c This is part of the XEmacs Lisp Reference Manual.
3 @c Copyright (C) 1996 Ben Wing.
4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top
7 @chapter MULE
8
9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
13 ``Japan''), when it only provided support for Japanese. XEmacs
14 refers to its multi-lingual support as @dfn{MULE support} since it
15 is based on @dfn{MULE}.
16
17 @menu
18 * Internationalization Terminology::
19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
24 * Coding Systems:: Ways of representing a string of chars using integers.
25 * CCL:: A special language for writing fast converters.
26 * Category Tables:: Subdividing charsets into groups.
27 @end menu
28
29 @node Internationalization Terminology
30 @section Internationalization Terminology
31
32 In internationalization terminology, a string of text is divided up
33 into @dfn{characters}, which are the printable units that make up the
34 text. A single character is (for example) a capital @samp{A}, the
35 number @samp{2}, a Katakana character, a Kanji ideograph (an
36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese
37 Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
38 of such ideographs in each language), etc. The basic property of a
39 character is its shape. Note that the same character may be drawn by
40 two different people (or in two different fonts) in slightly different
41 ways, although the basic shape will be the same.
42
43 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
47 @samp{a} can optionally have a curved tail projecting off the top, and
48 the @samp{g} can be formed either of two loops, or of one loop and a
49 tail hanging off the bottom. Such distinct possible shapes of a
50 character are called @dfn{glyphs}. The important characteristic of two
51 glyphs making up the same character is that the choice between one or
52 the other is purely stylistic and has no linguistic effect on a word
53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
54 are different characters rather than different glyphs -- e.g.
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
56
57 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs.
59
60 A @dfn{character set} is simply a set of related characters. ASCII,
61 for example, is a set of 94 characters (or 128, if you count
62 non-printing characters). Other character sets are ISO8859-1 (ASCII
63 plus various accented characters and other international symbols),
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
66 GB2312 (Mainland Chinese Hanzi), etc.
67
68 Every character set has one or more @dfn{orderings}, which can be
69 viewed as a way of assigning a number (or set of numbers) to each
70 character in the set. For most character sets, there is a standard
71 ordering, and in fact all of the character sets mentioned above define a
72 particular ordering. ASCII, for example, places letters in their
73 ``natural'' order, puts uppercase letters before lowercase letters,
74 numbers before letters, etc. Note that for many of the Asian character
75 sets, there is no natural ordering of the characters. The actual
76 orderings are based on one or more salient characteristics, of which
77 there are many to choose from -- e.g. number of strokes, common
78 radicals, phonetic ordering, etc.
79
80 The numbers assigned to any particular character are called
81 the character's @dfn{position codes}. The number of position codes
82 required to index a particular character in a character set is called
83 the @dfn{dimension} of the character set. ASCII, being a relatively
84 small character set, is of dimension one, and each character in the
85 set is indexed using a single position code, in the range 0 through
86 127 (if non-printing characters are included) or 33 through 126
87 (if only the printing characters are considered). JISX0208, i.e.
88 Japanese Kanji, has thousands of characters, and is of dimension two --
89 every character is indexed by two position codes, each in the range
90 33 through 126. (Note that the choice of the range here is somewhat
91 arbitrary. Although a character set such as JISX0208 defines an
92 @emph{ordering} of all its characters, it does not define the actual
93 mapping between numbers and characters. You could just as easily
94 index the characters in JISX0208 using numbers in the range 0 through
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range
96 chosen is so that the position codes match up with the actual values
97 used in the common encodings.)
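
The relationship between a dimension-two character and its position
codes can be sketched outside XEmacs. The following Python fragment is
only an illustration: it assumes Python's standard @code{euc_jp} codec
and relies on the fact, described later in this section, that EUC stores
each JISX0208 position code with the high bit set. The function name
@code{position_codes} is hypothetical.

```python
def position_codes(char):
    """Return the two JISX0208 position codes of CHAR as a tuple."""
    euc = char.encode("euc_jp")
    # Two bytes not starting with 0x8E means a JISX0208 character.
    if len(euc) != 2 or euc[0] == 0x8E:
        raise ValueError("not a JISX0208 character: %r" % char)
    # Clear the high bit of each byte to recover the position code.
    return tuple(b & 0x7F for b in euc)


codes = position_codes("\u3042")   # HIRAGANA LETTER A
print(codes)
```

Both values returned fall in the range 33 through 126, as the text
describes.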
98
99 An @dfn{encoding} is a way of numerically representing characters from
100 one or more character sets as a stream of like-sized numerical values
101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
102 quantities. If an encoding encompasses only one character set, then the
103 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do
105 not understand the difference between a character set and an encoding.)
106 This is not possible, however, if more than one character set is to be
107 used in the encoding. For example, printed Japanese text typically
108 requires characters from multiple character sets -- ASCII, JISX0208, and
109 JISX0212, to be specific. Each of these is indexed using one or more
110 position codes in the range 33 through 126, so the position codes could
111 not be used directly or there would be no way to tell which character
112 was meant. Different Japanese encodings handle this differently -- JIS
113 uses special escape characters to denote different character sets; EUC
114 sets the high bit of the position codes for JISX0208 and JISX0212, and
115 puts a special extra byte before each JISX0212 character; etc. (JIS,
116 EUC, and most of the other encodings you will encounter are 7-bit or
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
118 this strives to represent all the world's characters in a single large
119 character set. 32-bit encodings are generally used internally in
120 programs to simplify the code that manipulates them; however, they are
121 not much used externally because they are not very space-efficient.)
122
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
125 and the interpretation of the values in the stream depends on the
126 current global state of the encoding. Special values in the encoding,
127 called @dfn{escape sequences}, are used to change the global state.
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
129 indicate that, from then on, bytes are to be interpreted as position
130 codes for JISX0208, rather than as ASCII. This effect is cancelled
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
132 current state is to ASCII''. To switch to JISX0212, the escape sequence
133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape
134 sequences do in fact begin with @samp{ESC}. This is not necessarily the
135 case, however.)
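
As a rough sketch of what ``global state'' means in practice, the
following Python fragment walks a JIS byte stream and tracks the current
charset. It is not how XEmacs decodes JIS. It handles only the three
escape sequences named above and assumes the stream starts in the ASCII
state. The names @code{SWITCHES} and @code{split_jis} are hypothetical.

```python
ESC = 0x1B

# The escape sequences named in the text and the state each selects.
SWITCHES = (
    (b"\x1b$(D", "JISX0212"),
    (b"\x1b$B", "JISX0208"),
    (b"\x1b(B", "ASCII"),
)


def split_jis(data):
    """Return a list of (charset, position_codes) pairs for DATA."""
    state = "ASCII"              # assumed initial state
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for seq, charset in SWITCHES:
                if data.startswith(seq, i):
                    state = charset      # change the global state
                    i += len(seq)
                    break
            else:
                raise ValueError("unknown escape sequence at offset %d" % i)
        elif state == "ASCII":
            out.append((state, (data[i],)))
            i += 1
        else:
            # Both Kanji charsets have dimension two.
            if i + 1 >= len(data):
                raise ValueError("truncated character at offset %d" % i)
            out.append((state, (data[i], data[i + 1])))
            i += 2
    return out


print(split_jis(b'A\x1b$B$"\x1b(BB'))
```

The same two bytes are read as two ASCII characters or as one JISX0208
character, depending only on which escape sequence came last.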
136
137 A @dfn{non-modal encoding} has no global state that extends past the
138 character currently being interpreted. EUC, for example, is a
139 non-modal encoding. Characters in JISX0208 are encoded by setting
140 the high bit of the position codes, and characters in JISX0212 are
141 encoded by doing the same but also prefixing the character with the
142 byte 0x8F.
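
The EUC rules just described can be written down directly. The Python
fragment below is a minimal sketch covering only the three charsets
mentioned in the text. The function name @code{euc_encode} and the
charset name strings are hypothetical.

```python
def euc_encode(charset, codes):
    """Encode one character, given its charset name and position codes."""
    if charset == "ASCII":
        return bytes(codes)
    high = bytes(c | 0x80 for c in codes)    # set the high bit
    if charset == "JISX0208":
        return high
    if charset == "JISX0212":
        return b"\x8f" + high                # extra prefix byte
    raise ValueError("unsupported charset: %s" % charset)


print(euc_encode("JISX0208", (0x24, 0x22)))
print(euc_encode("JISX0212", (0x30, 0x21)))
```

Because the charset of every character can be read off its own bytes, no
state needs to be carried from one character to the next.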
143
144 The advantage of a modal encoding is that it is generally more
145 space-efficient, and is easily extendable because there are essentially
146 an arbitrary number of escape sequences that can be created. The
147 disadvantage, however, is that it is much more difficult to work with
148 if it is not being processed in a sequential manner. In the non-modal
149 EUC encoding, for example, the byte 0x41 always refers to the letter
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
151 one of the two position codes in a JISX0208 character, or one of the
152 two position codes in a JISX0212 character. Determining exactly which
153 one is meant could be difficult and time-consuming if the previous
154 bytes in the string have not already been processed.
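
The ambiguity of the byte 0x41 can be observed with a short experiment.
This sketch assumes Python's standard @code{iso2022_jp} and
@code{euc_jp} codecs as stand-ins for JIS and EUC. It encodes the letter
@samp{A} followed by a Hiragana character whose second JISX0208 position
code happens to be 0x41.

```python
text = "A\u3061"    # LATIN CAPITAL LETTER A, HIRAGANA LETTER TI

jis = text.encode("iso2022_jp")
euc = text.encode("euc_jp")

# Count how often the byte 0x41 occurs in each encoded form.
jis_count = jis.count(b"\x41")
euc_count = euc.count(b"\x41")
print(jis_count, euc_count)
```

In the JIS form the byte occurs twice, once as the letter and once as a
position code. In the EUC form it occurs only as the letter.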
155
156 Non-modal encodings are further divided into @dfn{fixed-width} and
157 @dfn{variable-width} formats. A fixed-width encoding always uses
158 the same number of words per character, whereas a variable-width
159 encoding does not. EUC is a good example of a variable-width
160 encoding: one to three bytes are used per character, depending on
161 the character set. 16-bit and 32-bit encodings are nearly always
162 fixed-width, and this is in fact one of the main reasons for using
163 an encoding with a larger word size. The advantages of fixed-width
164 encodings should be obvious. The advantages of variable-width
165 encodings are that they are generally more space-efficient and allow
166 for compatibility with existing 8-bit encodings such as ASCII.
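
The variable-width nature of EUC can be sketched as follows. The
fragment determines the length of each character from its first byte
alone, using only the three cases described in this section (ASCII,
JISX0208, and JISX0212 with its 0x8F prefix). Other lead bytes are not
treated specially. The function names are hypothetical.

```python
def euc_char_length(lead):
    """Return the number of bytes in the EUC character starting with LEAD."""
    if lead < 0x80:
        return 1    # ASCII
    if lead == 0x8F:
        return 3    # prefix byte plus two JISX0212 position codes
    return 2        # two JISX0208 position codes


def split_euc(data):
    """Split DATA into a list of per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        n = euc_char_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print(split_euc(b"A\xa4\xa2\x8f\xb0\xa1"))
```

With a fixed-width encoding the splitting loop would not be needed,
since the Nth character would always start at a known offset.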
167
168 Note that the bytes in an 8-bit encoding are often referred to
169 as @dfn{octets} rather than simply as bytes. This terminology
170 dates back to the days before 8-bit bytes were universal, when
171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
172
173 @node Charsets
174 @section Charsets
175
176 A @dfn{charset} in MULE is an object that encapsulates a
177 particular character set as well as an ordering of those characters.
178 Charsets are permanent objects and are named using symbols, like
179 faces.
180
181 @defun charsetp object
182 This function returns non-@code{nil} if @var{object} is a charset.
183 @end defun
184
185 @menu
186 * Charset Properties:: Properties of a charset.
187 * Basic Charset Functions:: Functions for working with charsets.
188 * Charset Property Functions:: Functions for accessing charset properties.
189 * Predefined Charsets:: Predefined charset objects.
190 @end menu
191
192 @node Charset Properties
193 @subsection Charset Properties
194
195 Charsets have the following properties:
196
197 @table @code
198 @item name
199 A symbol naming the charset. Every charset must have a different name;
200 this allows a charset to be referred to using its name rather than
201 the actual charset object.
202 @item doc-string
203 A documentation string describing the charset.
204 @item registry
205 A regular expression matching the font registry field for this character
206 set. For example, both the @code{ascii} and @code{latin-iso8859-1}
207 charsets use the registry @code{"ISO8859-1"}. This field is used to
208 choose an appropriate font when the user gives a general font
209 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
210 14-point upright medium-weight Courier font.
211 @item dimension
212 Number of position codes used to index a character in the character set.
213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
214 This property defaults to 1.
215 @item chars
216 Number of characters in each dimension. In XEmacs/MULE, the only
217 allowed values are 94 or 96. (There are a couple of pre-defined
218 character sets, such as ASCII, that do not follow this, but you cannot
219 define new ones like this.) Defaults to 94. Note that if the dimension
220 is 2, the character set thus described is 94x94 or 96x96.
221 @item columns
222 Number of columns used to display a character in this charset.
223 Only used in TTY mode. (Under X, the actual width of a character
224 can be derived from the font used to display the characters.)
225 If unspecified, defaults to the dimension. (This is almost
226 always the correct value, because character sets with dimension 2
227 are usually ideograph character sets, which need two columns to
228 display the intricate ideographs.)
229 @item direction
230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
231 (right-to-left). Defaults to @code{l2r}. This specifies the
232 direction that the text should be displayed in, and will be
233 left-to-right for most charsets but right-to-left for Hebrew
234 and Arabic. (Right-to-left display is not currently implemented.)
235 @item final
236 Final byte of the standard ISO 2022 escape sequence designating this
237 charset. Must be supplied. Each combination of (@var{dimension},
238 @var{chars}) defines a separate namespace for final bytes, and each
239 charset within a particular namespace must have a different final byte.
240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
241 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final
242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
243 official) character sets. For more information on ISO 2022, see
244 @ref{ISO 2022}.
245 @item graphic
246 0 (use left half of font on output) or 1 (use right half of font on
247 output). Defaults to 0. This specifies how to convert the position
248 codes that index a character in a character set into an index into the
249 font used to display the character set. With @code{graphic} set to 0,
250 position codes 33 through 126 map to font indices 33 through 126; with
251 it set to 1, position codes 33 through 126 map to font indices 161
252 through 254 (i.e. the same number but with the high bit set). For
253 example, for a font whose registry is ISO8859-1, the left half of the
254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
255 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
256 @item ccl-program
257 A compiled CCL program used to convert a character in this charset into
258 an index into the font. This is in addition to the @code{graphic}
259 property. If a CCL program is defined, the position codes of a
260 character will first be processed according to @code{graphic} and
261 then passed through the CCL program, with the resulting values used
262 to index the font.
263
264 This is used, for example, in the Big5 character set (used in Taiwan).
265 This character set is not ISO-2022-compliant, and its size (94x157) does
266 not fit within the maximum 96x96 size of ISO-2022-compliant character
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
268 so as to group the most commonly used characters together) into two
269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
270 and each charset object uses a CCL program to convert the modified
271 position codes back into standard Big5 indices to retrieve a character
272 from a Big5 font.
273 @end table
274
275 Most of the above properties can only be set when the charset
276 is created. @xref{Charset Property Functions}.
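
The effect of the @code{graphic} property on font lookup amounts to
setting or not setting the high bit of each position code. The Python
fragment below is only a sketch of that arithmetic, not of the XEmacs
implementation, and the function name @code{font_index} is hypothetical.

```python
def font_index(position_code, graphic=0):
    """Map a position code to a font index according to GRAPHIC."""
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    if graphic == 1:
        return position_code | 0x80    # right half of the font
    return position_code               # left half of the font


print(font_index(33, 0), font_index(126, 0))
print(font_index(33, 1), font_index(126, 1))
```

If a charset also has a @code{ccl-program}, the values produced by this
step would then be passed through that program.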
277
278 @node Basic Charset Functions
279 @subsection Basic Charset Functions
280
281 @defun find-charset charset-or-name
282 This function retrieves the charset of the given name. If
283 @var{charset-or-name} is a charset object, it is simply returned.
284 Otherwise, @var{charset-or-name} should be a symbol. If there is no
285 such charset, @code{nil} is returned. Otherwise the associated charset
286 object is returned.
287 @end defun
288
289 @defun get-charset name
290 This function retrieves the charset of the given name. Same as
291 @code{find-charset} except an error is signalled if there is no such
292 charset instead of returning @code{nil}.
293 @end defun
294
295 @defun charset-list
296 This function returns a list of the names of all defined charsets.
297 @end defun
298
299 @defun make-charset name doc-string props
300 This function defines a new character set. This function is for use
301 with Mule support. @var{name} is a symbol, the name by which the
302 character set is normally referred to. @var{doc-string} is a string
303 describing the character set. @var{props} is a property list,
304 describing the specific nature of the character set. The recognized
305 properties are @code{registry}, @code{dimension}, @code{columns},
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
307 @code{ccl-program}, as previously described.
308 @end defun
309
310 @defun make-reverse-direction-charset charset new-name
311 This function makes a charset equivalent to @var{charset} but which goes
312 in the opposite direction. @var{new-name} is the name of the new
313 charset. The new charset is returned.
314 @end defun
315
316 @defun charset-from-attributes dimension chars final &optional direction
317 This function returns a charset with the given @var{dimension},
318 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is
319 omitted, both directions will be checked (left-to-right will be returned
320 if character sets exist for both directions).
321 @end defun
322
323 @defun charset-reverse-direction-charset charset
324 This function returns the charset (if any) with the same dimension,
325 number of characters, and final byte as @var{charset}, but which is
326 displayed in the opposite direction.
327 @end defun
328
329 @node Charset Property Functions
330 @subsection Charset Property Functions
331
332 All of these functions accept either a charset name or charset object.
333
334 @defun charset-property charset prop
335 This function returns property @var{prop} of @var{charset}.
336 @xref{Charset Properties}.
337 @end defun
338
339 Convenience functions are also provided for retrieving individual
340 properties of a charset.
341
342 @defun charset-name charset
343 This function returns the name of @var{charset}. This will be a symbol.
344 @end defun
345
346 @defun charset-doc-string charset
347 This function returns the doc string of @var{charset}.
348 @end defun
349
350 @defun charset-registry charset
351 This function returns the registry of @var{charset}.
352 @end defun
353
354 @defun charset-dimension charset
355 This function returns the dimension of @var{charset}.
356 @end defun
357
358 @defun charset-chars charset
359 This function returns the number of characters per dimension of
360 @var{charset}.
361 @end defun
362
363 @defun charset-columns charset
364 This function returns the number of display columns per character (in
365 TTY mode) of @var{charset}.
366 @end defun
367
368 @defun charset-direction charset
369 This function returns the display direction of @var{charset} -- either
370 @code{l2r} or @code{r2l}.
371 @end defun
372
373 @defun charset-final charset
374 This function returns the final byte of the ISO 2022 escape sequence
375 designating @var{charset}.
376 @end defun
377
378 @defun charset-graphic charset
379 This function returns either 0 or 1, depending on whether the position
380 codes of characters in @var{charset} map to the left or right half
381 of their font, respectively.
382 @end defun
383
384 @defun charset-ccl-program charset
385 This function returns the CCL program, if any, for converting
386 position codes of characters in @var{charset} into font indices.
387 @end defun
388
389 The only property of a charset that can currently be set after
390 the charset has been created is the CCL program.
391
392 @defun set-charset-ccl-program charset ccl-program
393 This function sets the @code{ccl-program} property of @var{charset} to
394 @var{ccl-program}.
395 @end defun
396
397 @node Predefined Charsets
398 @subsection Predefined Charsets
399
400 The following charsets are predefined in the C code.
401
402 @example
403 Name                    Type   Fi  Gr  Dir  Registry
404 --------------------------------------------------------------
405 ascii                   94     B   0   l2r  ISO8859-1
406 control-1               94         0   l2r  ---
407 latin-iso8859-1         96     A   1   l2r  ISO8859-1
408 latin-iso8859-2         96     B   1   l2r  ISO8859-2
409 latin-iso8859-3         96     C   1   l2r  ISO8859-3
410 latin-iso8859-4         96     D   1   l2r  ISO8859-4
411 cyrillic-iso8859-5      96     L   1   l2r  ISO8859-5
412 arabic-iso8859-6        96     G   1   r2l  ISO8859-6
413 greek-iso8859-7         96     F   1   l2r  ISO8859-7
414 hebrew-iso8859-8        96     H   1   r2l  ISO8859-8
415 latin-iso8859-9         96     M   1   l2r  ISO8859-9
416 thai-tis620             96     T   1   l2r  TIS620
417 katakana-jisx0201       94     I   1   l2r  JISX0201.1976
418 latin-jisx0201          94     J   0   l2r  JISX0201.1976
419 japanese-jisx0208-1978  94x94  @@   0   l2r  JISX0208.1978
420 japanese-jisx0208       94x94  B   0   l2r  JISX0208.19(83|90)
421 japanese-jisx0212       94x94  D   0   l2r  JISX0212
422 chinese-gb2312          94x94  A   0   l2r  GB2312
423 chinese-cns11643-1      94x94  G   0   l2r  CNS11643.1
424 chinese-cns11643-2      94x94  H   0   l2r  CNS11643.2
425 chinese-big5-1          94x94  0   0   l2r  Big5
426 chinese-big5-2          94x94  1   0   l2r  Big5
427 korean-ksc5601          94x94  C   0   l2r  KSC5601
428 composite               96x96      0   l2r  ---
429 @end example
430
431 The following charsets are predefined in the Lisp code.
432
433 @example
434 Name                    Type   Fi  Gr  Dir  Registry
435 --------------------------------------------------------------
436 arabic-digit            94     2   0   l2r  MuleArabic-0
437 arabic-1-column         94     3   0   r2l  MuleArabic-1
438 arabic-2-column         94     4   0   r2l  MuleArabic-2
439 sisheng                 94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH
440 chinese-cns11643-3      94x94  I   0   l2r  CNS11643.1
441 chinese-cns11643-4      94x94  J   0   l2r  CNS11643.1
442 chinese-cns11643-5      94x94  K   0   l2r  CNS11643.1
443 chinese-cns11643-6      94x94  L   0   l2r  CNS11643.1
444 chinese-cns11643-7      94x94  M   0   l2r  CNS11643.1
445 ethiopic                94x94  2   0   l2r  Ethio
446 ascii-r2l               94     B   0   r2l  ISO8859-1
447 ipa                     96     0   1   l2r  MuleIPA
448 vietnamese-lower        96     1   1   l2r  VISCII1.1
449 vietnamese-upper        96     2   1   l2r  VISCII1.1
450 @end example
451
452 For all of the above charsets, the dimension and number of columns are
453 the same.
454
455 Note that ASCII, Control-1, and Composite are handled specially.
456 This is why some of the fields are blank, and why some of the filled-in
457 fields (e.g. the type) are not really accurate.
458
459 @node MULE Characters
460 @section MULE Characters
461
462 @defun make-char charset arg1 &optional arg2
463 This function makes a multi-byte character from @var{charset} and octets
464 @var{arg1} and @var{arg2}.
465 @end defun
466
467 @defun char-charset ch
468 This function returns the character set of char @var{ch}.
469 @end defun
470
471 @defun char-octet ch &optional n
472 This function returns the octet (i.e. position code) numbered @var{n}
473 (should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted.
474 @end defun
475
476 @defun find-charset-region start end &optional buffer
477 This function returns a list of the charsets in the region between
478 @var{start} and @var{end}. @var{buffer} defaults to the current buffer
479 if omitted.
480 @end defun
481
482 @defun find-charset-string string
483 This function returns a list of the charsets in @var{string}.
484 @end defun
485
486 @node Composite Characters
487 @section Composite Characters
488
489 Composite characters are not yet completely implemented.
490
491 @defun make-composite-char string
492 This function converts a string into a single composite character. The
493 character is the result of overstriking all the characters in the
494 string.
495 @end defun
496
497 @defun composite-char-string ch
498 This function returns a string of the characters comprising a composite
499 character.
500 @end defun
501
502 @defun compose-region start end &optional buffer
503 This function composes the characters in the region from @var{start} to
504 @var{end} in @var{buffer} into one composite character. The composite
505 character replaces the composed characters. @var{buffer} defaults to
506 the current buffer if omitted.
507 @end defun
508
509 @defun decompose-region start end &optional buffer
510 This function decomposes any composite characters in the region from
511 @var{start} to @var{end} in @var{buffer}. This converts each composite
512 character into one or more characters, the individual characters out of
513 which the composite character was formed. Non-composite characters are
514 left as-is. @var{buffer} defaults to the current buffer if omitted.
515 @end defun
516
517 @node ISO 2022
518 @section ISO 2022
519
520 This section briefly describes the ISO 2022 encoding standard. For a
521 more thorough understanding, please refer to the ISO 2022 standard
522 itself.
523
524 Character sets (@dfn{charsets}) are classified into the following four
525 categories, according to the number of characters in the charset:
526 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
527
528 @need 1000
529 @table @asis
530 @item 94-charset
531 ASCII(B), left(J) and right(I) half of JISX0201, ...
532 @item 96-charset
533 Latin-1(A), Latin-2(B), Latin-3(C), ...
534 @item 94x94-charset
535 GB2312(A), JISX0208(B), KSC5601(C), ...
536 @item 96x96-charset
537 none for the moment
538 @end table
539
540 The character in parentheses after the name of each charset
541 is the @dfn{final character} @var{F}, which can be regarded as
542 the identifier of the charset. ECMA allocates @var{F} to each
543 charset. @var{F} is in the range 0x30..0x7E, but 0x30..0x3F
544 are only for private use.
545
546 Note: @dfn{ECMA} = European Computer Manufacturers Association
547
548 There are four @dfn{registers of charsets}, called G0 through G3.
549 You can designate (or assign) any charset to one of these
550 registers.
551
552 The code space contained within one octet (of size 256) is divided into
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
554 register of charsets can be invoked.
555
556 @example
557 @group
558 C0: 0x00 - 0x1F
559 GL: 0x20 - 0x7F
560 C1: 0x80 - 0x9F
561 GR: 0xA0 - 0xFF
562 @end group
563 @end example
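
The division of the octet code space can be expressed as a simple
range test. The following Python fragment is an illustrative sketch
only. The function name @code{area} is hypothetical.

```python
def area(octet):
    """Return the name of the ISO 2022 area that OCTET falls in."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet: %r" % octet)
    if octet < 0x20:
        return "C0"
    if octet < 0x80:
        return "GL"
    if octet < 0xA0:
        return "C1"
    return "GR"


print(area(0x1B), area(0x41), area(0x8F), area(0xA4))
```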
564
565 Usually, in the initial state, G0 is invoked into GL, and G1
566 is invoked into GR.
567
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
569 7-bit environments, only C0 and GL are used.
570
571 Charset designation is done by escape sequences of the form:
572
573 @example
574 ESC [@var{I}] @var{I} @var{F}
575 @end example
576
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset.
579
580 The meanings of the intermediate characters are:
581
582 @example
583 @group
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
592 @end group
593 @end example
594
595 The following rule is not allowed in ISO 2022 but can be used in Mule.
596
597 @example
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
599 @end example
600
601 Here are examples of designations:
602
603 @example
604 @group
605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601
610 @end group
611 @end example
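
The designation rules above can be combined into a small function that
builds the escape sequence for a given register and charset. This
Python sketch covers only the standard long forms. It does not produce
the short forms such as @samp{ESC $ B}, nor the Mule-specific
designation of a 96-charset to G0. All names in it are hypothetical.

```python
# Intermediate characters indexed by register number (G0 through G3).
INTERMEDIATE_94 = ("(", ")", "*", "+")
INTERMEDIATE_96 = (None, "-", ".", "/")    # no standard G0 form


def designation(register, dimension, chars, final):
    """Return the escape sequence designating a charset to REGISTER."""
    if register not in (0, 1, 2, 3):
        raise ValueError("register must be 0 through 3")
    if dimension not in (1, 2):
        raise ValueError("dimension must be 1 or 2")
    if chars == 94:
        inter = INTERMEDIATE_94[register]
    elif chars == 96:
        inter = INTERMEDIATE_96[register]
    else:
        raise ValueError("chars must be 94 or 96")
    if inter is None:
        raise ValueError("ISO 2022 cannot designate a 96-charset to G0")
    prefix = "$" if dimension == 2 else ""
    return "\x1b" + prefix + inter + final


print(repr(designation(0, 1, 94, "B")))    # ASCII to G0
print(repr(designation(1, 2, 94, "C")))    # KSC5601 to G1
```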
612
613 To use a charset designated to G2 or G3, and to use a charset designated
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
615 into GL. There are two types of invocation: Locking Shift (in effect
616 until the next shift) and Single Shift (one character only).
617
618 Locking Shift is done as follows:
619
620 @example
621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR
628 @end example
629
630 Single Shift is done as follows:
631
632 @example
633 @group
634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL
636 @end group
637 @end example
638
639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
641 ESC O behave as indicated. The above definitions will not parse
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
643 has similar problems.)
644
645 As you may realize, there are many ISO-2022-compliant ways of encoding
646 multilingual text. Many coding systems in actual use, such as X11's
647 Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
648 Code), are variants of ISO 2022.
649
650 In Mule, we characterize ISO 2022 by the following attributes:
651
652 @enumerate
653 @item
654 Initial designation to G0 through G3.
655 @item
656 Allow designation of short form for Japanese and Chinese.
657 @item
658 Should we designate ASCII to G0 before control characters?
659 @item
660 Should we designate ASCII to G0 at the end of line?
661 @item
662 7-bit environment or 8-bit environment.
663 @item
664 Use Locking Shift or not.
665 @item
666 Use ASCII or JISX0201-1976-Roman.
667 @item
668 Use JISX0208-1983 or JISX0208-1978.
669 @end enumerate
670
671 (The last two are only for Japanese.)
672
673 By specifying these attributes, you can create any variant
674 of ISO 2022.
675
676 Here are several examples:
677
678 @example
679 @group
680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used
682 2. Yes.
683 3. Yes.
684 4. Yes.
685 5. 7-bit environment
686 6. No.
687 7. Use ASCII
688 8. Use JISX0208-1983
689 @end group
690
691 @group
692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
694 2. No.
695 3. No.
696 4. Yes.
697 5. 8-bit environment
698 6. No.
699 7. Use ASCII
700 8. Use JISX0208-1983
701 @end group
702
703 @group
704 euc-china -- Chinese EUC. Although many people call this
705 "GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
707 2. No.
708 3. Yes.
709 4. Yes.
710 5. 8-bit environment
711 6. No.
712 7. Use ASCII
713 8. Use JISX0208-1983
714 @end group
715
716 @group
717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
719 2. No.
720 3. Yes.
721 4. Yes.
722 5. 7-bit environment
723 6. Yes.
724 7. No.
725 8. No.
726 @end group
727 @end example
728
729 Mule creates all these coding systems by default.
730
731 @node Coding Systems
732 @section Coding Systems
733
734 A coding system is an object that defines how text containing multiple
735 character sets is encoded into a stream of (typically 8-bit) bytes. The
736 coding system is used to decode the stream into a series of characters
737 (which may be from multiple charsets) when the text is read from a file
738 or process, and is used to encode the text back into the same format
739 when it is written out to a file or process.
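
The decode-on-read and encode-on-write round trip can be illustrated
outside XEmacs. The following sketch uses Python's standard
@code{iso2022_jp} codec as a stand-in for a coding system object. It
says nothing about how XEmacs coding systems are implemented.

```python
raw = b'A\x1b$B$"\x1b(B'              # bytes as they would appear in a file

text = raw.decode("iso2022_jp")       # decoding: bytes to characters
again = text.encode("iso2022_jp")     # encoding: characters back to bytes

print(len(raw), len(text))
print(again == raw)
```

The decoded text contains two characters, one from ASCII and one from
JISX0208, although the external form occupies nine bytes.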
740
741 For example, many ISO-2022-compliant coding systems (such as Compound
742 Text, which is used for inter-client data under the X Window System) use
743 escape sequences to switch between different charsets -- Japanese Kanji,
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
746 @code{make-coding-system} for more information.
747
748 Coding systems are normally identified using a symbol, and the symbol is
749 accepted in place of the actual coding system object whenever a coding
750 system is called for. (This is similar to how faces and charsets work.)
751
752 @defun coding-system-p object
753 This function returns non-@code{nil} if @var{object} is a coding system.
754 @end defun
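
As an illustrative sketch (not taken from the XEmacs sources), the
following resolves a symbol to a coding system object and then tests the
object.  The name @code{ctext} is an assumption; substitute any name
reported by @code{coding-system-list}.

@example
@group
;; Sketch only: ctext is assumed to name an existing coding system.
(let ((cs (find-coding-system 'ctext)))
  (if cs
      (coding-system-p cs)     ; test the object that was found
    (message "No coding system named ctext")))
@end group
@end example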
755
756 @menu
757 * Coding System Types:: Classifying coding systems.
758 * EOL Conversion:: Dealing with different ways of denoting
759 the end of a line.
760 * Coding System Properties:: Properties of a coding system.
761 * Basic Coding System Functions:: Working with coding systems.
762 * Coding System Property Functions:: Retrieving a coding system's properties.
763 * Encoding and Decoding Text:: Encoding and decoding text.
764 * Detection of Textual Encoding:: Determining how text is encoded.
765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
766 encodings.
767 @end menu
768
769 @node Coding System Types
770 @subsection Coding System Types
771
772 @table @code
773 @item nil
774 @itemx autodetect
775 Automatic conversion. XEmacs attempts to detect the coding system used
776 in the file.
777 @item no-conversion
778 No conversion. Use this for binary files and such. On output, graphic
779 characters that are not in ASCII or Latin-1 will be replaced by a
780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
781 only be present if you explicitly insert them.)
782 @item shift-jis
783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
784 @item iso2022
785 Any ISO-2022-compliant encoding. Among other things, this includes JIS
786 (the Japanese encoding commonly used for e-mail), national variants of
787 EUC (the standard Unix encoding for Japanese and other languages), and
788 Compound Text (an encoding used in X11). You can describe the
789 conversion in more detail with the @var{props} argument of @code{make-coding-system}.
790 @item big5
791 Big5 (an encoding commonly used for Traditional Chinese, notably in Taiwan).
792 @item ccl
793 The conversion is performed using a user-written pseudo-code program.
794 CCL (Code Conversion Language) is the name of this pseudo-code.
795 @item internal
796 Write out or read in the raw contents of the memory representing the
797 buffer's text. This is primarily useful for debugging purposes, and is
798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
800 file using @code{internal} conversion can result in an internal
801 inconsistency in the memory representing a buffer's text, which will
802 produce unpredictable results and may cause XEmacs to crash. Under
803 normal circumstances you should never use @code{internal} conversion.
804 @end table
805
806 @node EOL Conversion
807 @subsection EOL Conversion
808
809 @table @code
810 @item nil
811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
812 generate subsidiary coding systems named @code{@var{name}-unix},
813 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
814 this coding system but have an @code{eol-type} value of @code{lf}, @code{crlf},
815 and @code{cr}, respectively.
816 @item lf
817 The end of a line is marked externally using ASCII LF. Since this is
818 also the way that XEmacs represents an end-of-line internally,
819 specifying this option results in no end-of-line conversion. This is
820 the standard format for Unix text files.
821 @item crlf
822 The end of a line is marked externally using ASCII CRLF. This is the
823 standard format for MS-DOS text files.
824 @item cr
825 The end of a line is marked externally using ASCII CR. This is the
826 standard format for Macintosh text files.
827 @item t
828 Automatically detect the end-of-line type but do not generate subsidiary
829 coding systems. (This value is converted to @code{nil} when stored
830 internally, and @code{coding-system-property} will return @code{nil}.)
831 @end table
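
The relationship between a coding system and its subsidiaries can be
explored with a short sketch such as the one below.  It assumes that a
coding system named @code{ctext} exists and was created with an
@code{eol-type} of @code{nil}, so that @code{ctext-unix},
@code{ctext-dos}, and @code{ctext-mac} were generated.

@example
@group
;; Sketch only: the name ctext and its subsidiaries are assumptions.
;; Per the table above, the crlf subsidiary should be the one
;; whose name ends in -dos.
(coding-system-name (subsidiary-coding-system 'ctext 'crlf))

;; Query the end-of-line type recorded for a subsidiary.
(coding-system-property 'ctext-unix 'eol-type)
@end group
@end example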
832
833 @node Coding System Properties
834 @subsection Coding System Properties
835
836 @table @code
837 @item mnemonic
838 String to be displayed in the modeline when this coding system is
839 active.
840
841 @item eol-type
842 End-of-line conversion to be used. It should be one of the types
843 listed in @ref{EOL Conversion}.
844
845 @item post-read-conversion
846 Function called after a file has been read in, to perform the decoding.
847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
848 the current buffer to be decoded.
849
850 @item pre-write-conversion
851 Function called before a file is written out, to perform the encoding.
852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
853 the current buffer to be encoded.
854 @end table
855
856 The following additional properties are recognized if @var{type} is
857 @code{iso2022}:
858
859 @table @code
860 @item charset-g0
861 @itemx charset-g1
862 @itemx charset-g2
863 @itemx charset-g3
864 The character set initially designated to the G0 - G3 registers.
865 The value should be one of
866
867 @itemize @bullet
868 @item
869 A charset object (designate that character set)
870 @item
871 @code{nil} (do not ever use this register)
872 @item
873 @code{t} (no character set is initially designated to the register, but
874 may be later on; this automatically sets the corresponding
875 @code{force-g*-on-output} property)
876 @end itemize
877
878 @item force-g0-on-output
879 @itemx force-g1-on-output
880 @itemx force-g2-on-output
881 @itemx force-g3-on-output
882 If non-@code{nil}, send an explicit designation sequence on output
883 before using the specified register.
884
885 @item short
886 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
887 and @samp{ESC $ B} on output in place of the full designation sequences
888 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
889
890 @item no-ascii-eol
891 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
892 output. Setting this to non-@code{nil} also suppresses other
893 state-resetting that normally happens at the end of a line.
894
895 @item no-ascii-cntl
896 If non-@code{nil}, don't designate ASCII to G0 before control chars on
897 output.
898
899 @item seven
900 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
901 environment.
902
903 @item lock-shift
904 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
905 designation by escape sequence.
906
907 @item no-iso6429
908 If non-@code{nil}, don't use ISO6429's direction specification.
909
910 @item escape-quoted
911 If non-@code{nil}, literal control characters that are the same as the
912 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
913 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
914 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
915 be properly distinguished from an escape sequence. (Note that doing
916 this results in a non-portable encoding.) This encoding flag is used for
917 byte-compiled files. Note that ESC is a good choice for a quoting
918 character because there are no escape sequences whose second byte is a
919 character from the Control-0 or Control-1 character sets; this is
920 explicitly disallowed by the ISO 2022 standard.
921
922 @item input-charset-conversion
923 A list of conversion specifications, specifying conversion of characters
924 in one charset to another when decoding is performed. Each
925 specification is a list of two elements: the source charset, and the
926 destination charset.
927
928 @item output-charset-conversion
929 A list of conversion specifications, specifying conversion of characters
930 in one charset to another when encoding is performed. The form of each
931 specification is the same as for @code{input-charset-conversion}.
932 @end table
933
934 The following additional properties are recognized (and required) if
935 @var{type} is @code{ccl}:
936
937 @table @code
938 @item decode
939 CCL program used for decoding (converting to internal format).
940
941 @item encode
942 CCL program used for encoding (converting to external format).
943 @end table
944
945 @node Basic Coding System Functions
946 @subsection Basic Coding System Functions
947
948 @defun find-coding-system coding-system-or-name
949 This function retrieves the coding system of the given name.
950
951 If @var{coding-system-or-name} is a coding-system object, it is simply
952 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
953 If there is no such coding system, @code{nil} is returned. Otherwise
954 the associated coding system object is returned.
955 @end defun
956
957 @defun get-coding-system name
958 This function retrieves the coding system of the given name. Same as
959 @code{find-coding-system} except an error is signalled if there is no
960 such coding system instead of returning @code{nil}.
961 @end defun
962
963 @defun coding-system-list
964 This function returns a list of the names of all defined coding systems.
965 @end defun
966
967 @defun coding-system-name coding-system
968 This function returns the name of the given coding system.
969 @end defun
970
971 @defun make-coding-system name type &optional doc-string props
972 This function registers symbol @var{name} as a coding system.
973
974 @var{type} describes the conversion method used and should be one of
975 the types listed in @ref{Coding System Types}.
976
977 @var{doc-string} is a string describing the coding system.
978
979 @var{props} is a property list, describing the specific nature of the
980 coding system. Recognized properties are as in @ref{Coding System
981 Properties}.
982 @end defun
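
A minimal sketch of a call follows.  The coding system name
@code{my-euc-like}, its doc string, and its mnemonic are hypothetical,
and the charset names @code{ascii} and @code{japanese-jisx0208} are
assumed to be defined and to be accepted in place of charset objects.

@example
@group
;; Sketch only: all names below are illustrative assumptions.
(make-coding-system
 'my-euc-like 'iso2022
 "Hypothetical EUC-style coding system (illustration only)."
 '(charset-g0 ascii                 ; initially designated to G0
   charset-g1 japanese-jisx0208     ; initially designated to G1
   mnemonic "MyEUC"                 ; shown in the modeline
   eol-type lf))                    ; no end-of-line conversion
@end group
@end example

Because @code{eol-type} is given explicitly here, no subsidiary coding
systems would be expected for this name.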
983
984 @defun copy-coding-system old-coding-system new-name
985 This function copies @var{old-coding-system} to @var{new-name}. If
986 @var{new-name} does not name an existing coding system, a new one will
987 be created.
988 @end defun
989
990 @defun subsidiary-coding-system coding-system eol-type
991 This function returns the subsidiary coding system of
992 @var{coding-system} with eol type @var{eol-type}.
993 @end defun
994
995 @node Coding System Property Functions
996 @subsection Coding System Property Functions
997
998 @defun coding-system-doc-string coding-system
999 This function returns the doc string for @var{coding-system}.
1000 @end defun
1001
1002 @defun coding-system-type coding-system
1003 This function returns the type of @var{coding-system}.
1004 @end defun
1005
1006 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}.
1008 @end defun
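
For instance, the following sketch collects several attributes of one
coding system into a list.  The name @code{ctext} is an assumption, and
the property names are those listed in @ref{Coding System Properties}.

@example
@group
;; Sketch only: ctext is assumed to name an existing coding system.
(let ((cs 'ctext))
  (list (coding-system-type cs)
        (coding-system-doc-string cs)
        (coding-system-property cs 'mnemonic)
        (coding-system-property cs 'charset-g1)))
@end group
@end example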
1009
1010 @node Encoding and Decoding Text
1011 @subsection Encoding and Decoding Text
1012
1013 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}. This is useful if you've read in
1016 encoded text from a file without decoding it (e.g. you read in a
1017 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
1018 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
1019 decoded text is returned. @var{buffer} defaults to the current buffer
1020 if unspecified.
1021 @end defun
1022
1023 @defun encode-coding-region start end coding-system &optional buffer
1024 This function encodes the text between @var{start} and @var{end} using
1025 @var{coding-system}. This will, for example, convert Japanese
1026 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding. The length of the encoded text is returned. @var{buffer}
1028 defaults to the current buffer if unspecified.
1029 @end defun
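
The two functions are intended to be used as a pair.  The sketch below
encodes and then decodes the contents of a scratch buffer; it assumes
that @code{ctext} names an existing coding system and that the
@code{with-temp-buffer} macro is available.

@example
@group
;; Sketch only: ctext is an assumed coding system name.
(with-temp-buffer
  (insert "plain ASCII text")
  ;; Replace the text with its external (encoded) representation.
  (encode-coding-region (point-min) (point-max) 'ctext)
  ;; Convert the encoded bytes back into characters.
  (decode-coding-region (point-min) (point-max) 'ctext)
  (buffer-string))
@end group
@end example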
1030
1031 @node Detection of Textual Encoding
1032 @subsection Detection of Textual Encoding
1033
1034 @defun coding-category-list
1035 This function returns a list of all recognized coding categories.
1036 @end defun
1037
1038 @defun set-coding-priority-list list
1039 This function changes the priority order of the coding categories.
1040 @var{list} should be a list of coding categories, in descending order of
1041 priority. Unspecified coding categories will be lower in priority than
1042 all specified ones, in the same relative order they were in previously.
1043 @end defun
1044
1045 @defun coding-priority-list
1046 This function returns a list of coding categories in descending order of
1047 priority.
1048 @end defun
1049
1050 @defun set-coding-category-system coding-category coding-system
1051 This function changes the coding system associated with a coding category.
1052 @end defun
1053
1054 @defun coding-category-system coding-category
1055 This function returns the coding system associated with a coding category.
1056 @end defun
1057
1058 @defun detect-coding-region start end &optional buffer
1059 This function detects the coding system of the text in the region between
1060 @var{start} and @var{end}. The returned value is a list of possible coding
1061 systems ordered by priority. If only ASCII characters are found, it
1062 returns @code{autodetect} or one of its subsidiary coding systems
1063 according to a detected end-of-line type. Optional arg @var{buffer}
1064 defaults to the current buffer.
1065 @end defun
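
As a hedged illustration, the detection functions above might be
combined as follows to inspect the current configuration and then
examine the whole of the current buffer.

@example
@group
;; Pair each coding category with its associated coding system,
;; in descending order of priority.
(mapcar (lambda (category)
          (cons category (coding-category-system category)))
        (coding-priority-list))

;; List the candidate coding systems for the current buffer.
(detect-coding-region (point-min) (point-max))
@end group
@end example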
1066
1067 @node Big5 and Shift-JIS Functions
1068 @subsection Big5 and Shift-JIS Functions
1069
1070 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings.
1072
1073 @defun decode-shift-jis-char code
1074 This function decodes a JISX0208 character encoded in Shift-JIS.
1075 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1076 The corresponding character is returned.
1077 @end defun
1078
1079 @defun encode-shift-jis-char ch
1080 This function encodes the JISX0208 character @var{ch} in Shift-JIS.
1081 The corresponding character code in Shift-JIS is
1082 returned as a cons of two bytes.
1083 @end defun
1084
1085 @defun decode-big5-char code
1086 This function decodes a character encoded in Big5.
1087 @var{code} is the character code in Big5. The corresponding character
1088 is returned.
1089 @end defun
1090
1091 @defun encode-big5-char ch
1092 This function encodes the character @var{ch} in Big5.
1093 The corresponding character code in Big5 is returned.
1094 @end defun
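
A small round-trip sketch for the Shift-JIS pair is shown below.  The
byte pair 0x88 0x9F is the Shift-JIS code of the first level-1 Kanji in
JIS X 0208; if the two functions are inverses, the final form should
yield the original cons.

@example
@group
;; Sketch only: decode a two-byte Shift-JIS code, then encode the
;; resulting character again.
(let ((ch (decode-shift-jis-char '(#x88 . #x9f))))
  (encode-shift-jis-char ch))
@end group
@end example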
1095
1096 @node CCL, Category Tables, Coding Systems, MULE
1097 @section CCL
1098
1099 CCL (Code Conversion Language) is a simple structured programming
1100 language designed for character coding conversions. A CCL program is
1101 compiled to CCL code (represented by a vector of integers) and executed
1102 by the CCL interpreter embedded in Emacs. The CCL interpreter
1103 implements a virtual machine with 8 registers called @code{r0}, ...,
1104 @code{r7}, a number of control structures, and some I/O operators. Take
1105 care when using registers @code{r0} (used in implicit @dfn{set}
1106 statements) and especially @code{r7} (used internally by several
1107 statements and operations, especially for multiple return values and I/O
1108 operations).
1109
1110 CCL is used for code conversion during process I/O and file I/O for
1111 non-ISO2022 coding systems. (It is the only way for a user to specify a
1112 code conversion function.) It is also used for calculating the code
1113 point of an X11 font from a character code. However, since CCL is
1114 designed as a powerful programming language, it can be used for more
1115 generic calculation where efficiency is demanded. A combination of
1116 three or more arithmetic operations can be calculated faster by CCL than
1117 by Emacs Lisp.
1118
1119 @strong{Warning:} The code in @file{src/mule-ccl.c} and
1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1121 description of CCL's semantics. The previous version of this section
1122 contained several typos and obsolete names left from earlier versions of
1123 MULE, and many may remain. (I am not an experienced CCL programmer; the
1124 few who know CCL well find writing English painful.)
1125
1126 A CCL program transforms an input data stream into an output data
1127 stream. The input stream, held in a buffer of constant bytes, is left
1128 unchanged. The buffer may be filled by an external input operation,
1129 taken from an Emacs buffer, or taken from a Lisp string. The output
1130 buffer is a dynamic array of bytes, which can be written by an external
1131 output operation, inserted into an Emacs buffer, or returned as a Lisp
1132 string.
1133
1134 A CCL program is a (Lisp) list containing two or three members. The
1135 first member is the @dfn{buffer magnification}, which indicates the
1136 required minimum size of the output buffer as a multiple of the input
1137 buffer. It is followed by the @dfn{main block} which executes while
1138 there is input remaining, and an optional @dfn{EOF block} which is
1139 executed when the input is exhausted. Both the main block and the EOF
1140 block are CCL blocks.
1141
1142 A @dfn{CCL block} is either a CCL statement or list of CCL statements.
1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1144 or an @dfn{assignment}, which is a list of a register to receive the
1145 assignment, an assignment operator, and an expression) or a @dfn{control
1146 statement} (a list starting with a keyword, whose allowable syntax
1147 depends on the keyword).
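
The structure just described can be sketched as a tiny program that
copies its input to its output unchanged.  This is an illustration
written against the syntax in this section, not a program from the
XEmacs sources.

@example
@group
(1                     ; buffer magnification
 (loop                 ; main block (a single statement)
  (read r0)            ; next input byte into r0
  (write r0)           ; copy it to the output
  (repeat))            ; jump back to the head of the loop
 (end))                ; optional EOF block
@end group
@end example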
1148
1149 @menu
1150 * CCL Syntax:: CCL program syntax in BNF notation.
1151 * CCL Statements:: Semantics of CCL statements.
1152 * CCL Expressions:: Operators and expressions in CCL.
1153 * Calling CCL:: Running CCL programs.
1154 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1155 @end menu
1156
1157 @node CCL Syntax, CCL Statements, CCL, CCL
1158 @comment Node, Next, Previous, Up
1159 @subsection CCL Syntax
1160
1161 The full syntax of a CCL program in BNF notation:
1162
1163 @format
1164 CCL_PROGRAM :=
1165 (BUFFER_MAGNIFICATION
1166 CCL_MAIN_BLOCK
1167 [ CCL_EOF_BLOCK ])
1168
1169 BUFFER_MAGNIFICATION := integer
1170 CCL_MAIN_BLOCK := CCL_BLOCK
1171 CCL_EOF_BLOCK := CCL_BLOCK
1172
1173 CCL_BLOCK :=
1174 STATEMENT | (STATEMENT [STATEMENT ...])
1175 STATEMENT :=
1176 SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
1177 | CALL | END
1178
1179 SET :=
1180 (REG = EXPRESSION)
1181 | (REG ASSIGNMENT_OPERATOR EXPRESSION)
1182 | integer
1183
1184 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
1185
1186 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
1187 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
1188 LOOP := (loop STATEMENT [STATEMENT ...])
1189 BREAK := (break)
1190 REPEAT :=
1191 (repeat)
1192 | (write-repeat [REG | integer | string])
1193 | (write-read-repeat REG [integer | ARRAY])
1194 READ :=
1195 (read REG ...)
1196 | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
1197 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
1198 WRITE :=
1199 (write REG ...)
1200 | (write EXPRESSION)
1201 | (write integer) | (write string) | (write REG ARRAY)
1202 | string
1203 CALL := (call ccl-program-name)
1204 END := (end)
1205
1206 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
1207 ARG := REG | integer
1208 OPERATOR :=
1209 + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
1210 | < | > | == | <= | >= | != | de-sjis | en-sjis
1211 ASSIGNMENT_OPERATOR :=
1212 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
1213 ARRAY := '[' integer ... ']'
1214 @end format
1215
1216 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
1217 @comment Node, Next, Previous, Up
1218 @subsection CCL Statements
1219
1220 The Emacs Code Conversion Language provides the following statement
1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1223
1224 @heading Set statement:
1225
1226 The @dfn{set} statement has three variants with the syntaxes
1227 @samp{(@var{reg} = @var{expression})},
1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1229 @samp{@var{integer}}. The assignment operator variation of the
1230 @dfn{set} statement works the same way as the corresponding C expression
1231 statement does. The assignment operators are @code{+=}, @code{-=},
1232 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
1233 @code{<<=}, and @code{>>=}, and they have the same meanings as in C. A
1234 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
1235 the form @code{(r0 = @var{integer})}.
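
The three variants might appear together in a CCL block like this
(an illustrative fragment only):

@example
@group
((r1 = 65)             ; plain assignment
 (r2 = (r1 + 1))       ; assignment from an infix expression
 (r2 <<= 8)            ; assignment operator: r2 = r2 << 8, as in C
 10)                   ; naked integer, same as (r0 = 10)
@end group
@end example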
1236
1237 @heading I/O statements:
1238
1239 The @dfn{read} statement takes one or more registers as arguments. It
1240 reads one byte (a C char) from the input into each register in turn.
1241
1242 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg}
1243 ...)} it takes one or more registers as arguments and writes each in
1244 turn to the output. The integer in a register (interpreted as an
1245 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the
1246 current output buffer. If it is less than 256, it is written as is.
1247 The forms @samp{(write @var{expression})} and @samp{(write
1248 @var{integer})} are treated analogously. The form @samp{(write
1249 @var{string})} writes the constant string to the output. A
1250 ``naked string'' @samp{@var{string}} is equivalent to the statement @samp{(write
1251 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
1252 the @var{reg}th element of the @var{array} to the output.
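
The following fragment is a sketch showing each I/O form once; the
register choices and constants are arbitrary.

@example
@group
((read r0 r1)            ; one byte into r0, the next into r1
 (write r0 r1)           ; write both registers in turn
 (write (r0 + 1))        ; write the value of an expression
 (write 32)              ; write an integer constant
 (write "done")          ; write a constant string
 "done"                  ; naked string, same as (write "done")
 (r2 = 1)
 (write r2 [65 66 67]))  ; write element r2 of the array
@end group
@end example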
1253
1254 @heading Conditional statements:
1255
1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1257 an optional @var{second CCL block} as arguments. If the
1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1259 executed. Otherwise, if there is a @var{second CCL block}, it is
1260 executed.
1261
1262 The @dfn{read-if} variant of the @dfn{if} statement takes an
1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1264 block} as arguments. The @var{expression} must have the form
1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1266 a register or an integer). The @code{read-if} statement first reads
1267 from the input into the first register operand in the @var{expression},
1268 then conditionally executes a CCL block just as the @code{if} statement
1269 does.
1270
1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1272 blocks as arguments. The CCL blocks are treated as a zero-indexed
1273 array, and the @code{branch} statement uses the @var{expression} as the
1274 index of the CCL block to execute. Null CCL blocks may be used as
1275 no-ops, continuing execution with the statement following the
1276 @code{branch} statement in the containing CCL block. Out-of-range
1277 values for the @var{expression} are also treated as no-ops.
1278
1279 The @dfn{read-branch} variant of the @dfn{branch} statement takes a
1280 @var{register} and one or more CCL blocks as arguments.
1281 The @code{read-branch} statement first reads from
1282 the input into the @var{register}, then uses its value to select a CCL
1283 block just as the @code{branch} statement does.
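
A sketch of the conditional forms follows.  The constants are arbitrary
and the fragment is illustrative only.

@example
@group
((read r0)
 (if (r0 == 13)
     (write 10)          ; executed when r0 holds 13 (CR)
   (write r0))           ; optional second block
 (r1 = (r0 & 3))         ; r1 is now in the range 0..3
 (branch r1
         (write "zero")  ; selected when r1 is 0
         (write "one")   ; selected when r1 is 1
         (write "two")   ; selected when r1 is 2
         (write "three")))
@end group
@end example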
1284
1285 @heading Loop control statements:
1286
1287 The @dfn{loop} statement creates a block with an implied jump from the
1288 end of the block back to its head. The loop is exited on a @code{break}
1289 statement, and continued without executing the tail by a @code{repeat}
1290 statement.
1291
1292 The @dfn{break} statement, written @samp{(break)}, terminates the
1293 current loop and continues with the next statement in the current
1294 block.
1295
1296 The @dfn{repeat} statement has three variants, @code{repeat},
1297 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
1298 current loop from its head, possibly after performing I/O.
1299 @code{repeat} takes no arguments and does no I/O before jumping.
1300 @code{write-repeat} takes a single argument (a register, an
1301 integer, or a string), writes it to the output, then jumps.
1302 @code{write-read-repeat} takes one or two arguments. The first must
1303 be a register. The second may be an integer or an array; if absent, it
1304 is implicitly set to the first (register) argument.
1305 @code{write-read-repeat} writes its second argument to the output, then
1306 reads from the input into the register, and finally jumps. See the
1307 @code{write} and @code{read} statements for the semantics of the I/O
1308 operations for each type of argument.
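
For example, a loop that copies bytes until it meets a zero byte could
be sketched as follows (illustrative only):

@example
@group
(loop
 (read r0)
 (if (r0 == 0)
     (break))            ; leave the loop at a zero byte
 (write r0)
 (repeat))               ; otherwise jump back to the head
@end group
@end example

When no test is needed, the body can be shortened with
@code{write-read-repeat}, provided the register has been read once
before the loop is entered:

@example
@group
((read r0)
 (loop
  (write-read-repeat r0)))  ; write r0, read the next byte, jump
@end group
@end example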
1309
1310 @heading Other control statements:
1311
1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1313 executes a CCL program as a subroutine. It does not return a value to
1314 the caller, but can modify the register status.
1315
1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1317 program successfully, and returns to caller (which may be a CCL
1318 program). It does not alter the status of the registers.
1319
1320 @node CCL Expressions, Calling CCL, CCL Statements, CCL
1321 @comment Node, Next, Previous, Up
1322 @subsection CCL Expressions
1323
1324 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
1325 consist of a single @var{operand}, either a register (one of @code{r0},
1326 ..., @code{r7}) or an integer. Complex expressions are lists of the
1327 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
1328 C, assignments are not expressions.
1329
1330 In the following table, @var{X} is the target register for a @dfn{set}.
1331 In subexpressions, this is implicitly @code{r7}. This means that
1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1333 freely in subexpressions, since they return parts of their values in
1334 @code{r7}. @var{Y} may be an expression, register, or integer, while
1335 @var{Z} must be a register or an integer.
1336
1337 @multitable @columnfractions .22 .14 .09 .55
1338 @item Name @tab Operator @tab Code @tab C-like Description
1339 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
1340 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
1341 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
1342 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
1343 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
1344 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
1345 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
1346 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
1347 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
1348 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
1349 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
1350 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
1351 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
1352 @item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
1353 @item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
1354 @item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
1355 @item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
1356 @item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
1357 @item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
1358 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
1359 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1362 @end multitable
1363
1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
1365 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
1366 and CCL_DECODE_SJIS operators treat their first and second operands as the high and
1367 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an
1368 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1369 complicated transformation of the Japanese standard JIS encoding to
1370 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1371 represent the SJIS operations in infix form.
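
To make the less familiar operators concrete, here is a sketch that
models three of them in ordinary Lisp.  The function names are
hypothetical, and the cdr of each returned cons stands in for the value
that CCL leaves in @code{r7}; this is a model of the table above, not
the interpreter's implementation.

@example
@group
;; Lisp models of three CCL operators (illustration only).
(defun my-ccl-lsh8 (y z)
  "Model of CCL <8: combine Y and Z as (Y << 8) | Z."
  (logior (ash y 8) z))

(defun my-ccl-rsh8 (y)
  "Model of CCL >8: return the cons (Y >> 8 . Y & 0xFF)."
  (cons (ash y -8) (logand y 255)))

(defun my-ccl-divmod (y z)
  "Model of CCL //: return the cons (Y / Z . Y % Z)."
  (cons (/ y z) (% y z)))
@end group

@group
(my-ccl-lsh8 #x88 #x9f)     ; => 34975, that is, #x889F
(my-ccl-rsh8 #x889f)        ; => (136 . 159)
(my-ccl-divmod 17 5)        ; => (3 . 2)
@end group
@end example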
1372
1373 @node Calling CCL, CCL Examples, CCL Expressions, CCL
1374 @comment Node, Next, Previous, Up
1375 @subsection Calling CCL
1376
1377 CCL programs are called automatically during Emacs buffer I/O when the
1378 external representation has a coding system type of @code{shift-jis},
1379 @code{big5}, or @code{ccl}. The program is specified by the coding
1380 system (@pxref{Coding Systems}). You can also call CCL programs from
1381 other CCL programs, and from Lisp using these functions:
1382
1383 @defun ccl-execute ccl-program status
1384 Execute @var{ccl-program} with registers initialized by
1385 @var{status}. @var{ccl-program} is a vector of compiled CCL code
1386 created by @code{ccl-compile}. It is an error for the program to try to
1387 execute a CCL I/O command. @var{status} must be a vector of nine
1388 values, specifying the initial value for the R0, R1 .. R7 registers and
1389 for the instruction counter IC. A @code{nil} value for a register
1390 initializer causes the register to be set to 0. A @code{nil} value for
1391 the IC initializer causes execution to start at the beginning of the
1392 program. When the program is done, @var{status} is modified (by
1393 side-effect) to contain the ending values for the corresponding
1394 registers and IC.
1395 @end defun
1396
1397 @defun ccl-execute-on-string ccl-program status str &optional continue
1398 Execute @var{ccl-program} with initial @var{status} on
1399 @var{str}. @var{ccl-program} is a vector of compiled CCL code
1400 created by @code{ccl-compile}. @var{status} must be a vector of nine
1401 values, specifying the initial value for the R0, R1 .. R7 registers and
1402 for the instruction counter IC. A @code{nil} value for a register
1403 initializer causes the register to be set to 0. A @code{nil} value for
1404 the IC initializer causes execution to start at the beginning of the
1405 program. An optional fourth argument @var{continue}, if
1406 non-@code{nil}, causes the IC to remain on the unsatisfied read
1407 operation if the program terminates due to exhaustion of the input
1408 buffer. Otherwise the IC is set to the end
1409 of the program. When the program is done, @var{status} is modified (by
1410 side-effect) to contain the ending values for the corresponding
1411 registers and IC. Returns the resulting string.
1412 @end defun
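
A hedged sketch of a complete call follows.  It assumes that
@code{ccl-compile} accepts the program list as its single argument and
returns the compiled vector; the program itself is a toy that is
intended to convert lower-case ASCII letters to upper case.

@example
@group
;; Sketch only: the program and variable names are illustrative.
(let ((upcase-ascii
       (ccl-compile
        '(1
          (loop
           (read r0)
           (if (r0 >= 97)            ; at or above lower-case a
               (if (r0 <= 122)       ; at or below lower-case z
                   (r0 -= 32)))      ; shift to upper case
           (write r0)
           (repeat)))))
      ;; Nine nil entries: registers start at 0 and execution
      ;; starts at the beginning of the program.
      (status (make-vector 9 nil)))
  (ccl-execute-on-string upcase-ascii status "hello"))
@end group
@end example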
1413
1414 To call a CCL program from another CCL program, it must first be
1415 registered:
1416
1417 @defun register-ccl-program name ccl-program
1418 Register @var{name} for CCL program @var{ccl-program} in
1419 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of
1420 a CCL program, or @code{nil}. Returns the index number of the registered CCL
1421 program.
1422 @end defun
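
Registration and a subsequent @code{call} might look like the sketch
below.  The name @code{my-copy-byte} is hypothetical, and
@code{ccl-compile} is assumed to take the program list as its single
argument.

@example
@group
;; Sketch only: register a one-byte copier under a made-up name.
(register-ccl-program
 'my-copy-byte
 (ccl-compile '(1 ((read r0) (write r0)))))

;; A second program can then invoke it by name.
(ccl-compile '(1 (loop (call my-copy-byte) (repeat))))
@end group
@end example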
1423
1424 Information about the processor time used by the CCL interpreter can be
1425 obtained using these functions:
1426
1427 @defun ccl-elapsed-time
1428 Returns the elapsed processor time of the CCL interpreter as a cons of
1429 user and system time, both floating point numbers measured in seconds.
1430 If only one overall value can be determined, the return value will be
1431 a cons of that
1432 value and 0.
1433 @end defun
1434
1435 @defun ccl-reset-elapsed-time
1436 Resets the CCL interpreter's internal elapsed time registers.
1437 @end defun
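
One way these two functions could be combined is sketched here; the
conversion being timed is left as a placeholder comment.

@example
@group
(ccl-reset-elapsed-time)
;; ... run some CCL-based conversion here ...
(let ((times (ccl-elapsed-time)))
  (message "CCL user time: %s, system time: %s"
           (car times) (cdr times)))
@end group
@end example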
1438
1439 @node CCL Examples, , Calling CCL, CCL
1440 @comment Node, Next, Previous, Up
1441 @subsection CCL Examples
1442
1443 This section is not yet written.
1444
1445 @node Category Tables, , CCL, MULE
1446 @section Category Tables
1447
1448 A category table is a type of char table used for keeping track of
1449 categories. Categories are used for classifying characters for use in
1450 regexps -- you can refer to a category rather than having to use a
1451 complicated @samp{[]} expression (and category lookups are significantly
1452 faster).
1453
1454 There are 95 different categories available, one for each printable
1455 character (including space) in the ASCII charset. Each category is
1456 designated by one such character, called a @dfn{category designator}.
1457 They are specified in a regexp using the syntax @samp{\cX}, where @var{X} is a
1458 category designator. (This is not yet implemented.)
1459
1460 A category table specifies, for each character, the categories that
1461 the character is in. Note that a character can be in more than one
1462 category. More specifically, a category table maps from a character to
1463 either the value @code{nil} (meaning the character is in no categories)
1464 or a 95-element bit vector, specifying for each of the 95 categories
1465 whether the character is in that category.
1466
1467 Special Lisp functions are provided that abstract this, so you do not
1468 have to directly manipulate bit vectors.
1469
1470 @defun category-table-p obj
1471 This function returns @code{t} if @var{obj} is a category table.
1472 @end defun
1473
1474 @defun category-table &optional buffer
1475 This function returns the current category table. This is the one
1476 specified by the current buffer, or by @var{buffer} if it is
1477 non-@code{nil}.
1478 @end defun
1479
1480 @defun standard-category-table
1481 This function returns the standard category table. This is the one used
1482 for new buffers.
1483 @end defun
1484
1485 @defun copy-category-table &optional table
1486 This function constructs a new category table and return it. It is a
1487 copy of the @var{table}, which defaults to the standard category table.
1488 @end defun
1489
1490 @defun set-category-table table &optional buffer
1491 This function selects @var{table} as the new category table for
1492 @var{buffer}.  @var{table} must be a category table.  @var{buffer}
1493 defaults to the current buffer if omitted.
1494 @end defun
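
The functions above can be combined to give a buffer its own category
table.  The following is a sketch only, using nothing beyond the
functions documented in this section; it assumes that a private copy of
the standard table is what is wanted.

@example
;; Copy the standard category table (the default when no
;; argument is given) and install the copy in the current buffer.
(let ((table (copy-category-table)))
  (set-category-table table)
  ;; The buffer's table should now be the copy just installed.
  (eq table (category-table)))
@end example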
1495
1496 @defun category-designator-p obj
1497 This function returns @code{t} if @var{obj} is a category designator (a
1498 char in the range @samp{' '} to @samp{'~'}).
1499 @end defun
1500
1501 @defun category-table-value-p obj
1502 This function returns @code{t} if @var{obj} is a category table value.
1503 Valid values are @code{nil} or a bit vector of size 95.
1504 @end defun
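
To illustrate the two predicates, here is a brief sketch.  It assumes
that @code{make-bit-vector} is available for building a 95-element bit
vector; according to the descriptions above, each of these forms should
return non-@code{nil}.

@example
;; Any printable ASCII character, including space, is a designator.
(category-designator-p ?a)
(category-designator-p ?\ )

;; nil means "in no categories"; otherwise a 95-element bit vector.
(category-table-value-p nil)
(category-table-value-p (make-bit-vector 95 0))
@end example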
1505