comparison man/lispref/mule.texi @ 0:376386a54a3c r19-14

Import from CVS: tag r19-14
author cvs
date Mon, 13 Aug 2007 08:45:50 +0200
parents
children 05472e90ae02
-1:000000000000 0:376386a54a3c
1 @c -*-texinfo-*-
2 @c This is part of the XEmacs Lisp Reference Manual.
3 @c Copyright (C) 1996 Ben Wing.
4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top
7 @chapter MULE
8
9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
13 ``Japan''), when it only provided support for Japanese. XEmacs
14 refers to its multi-lingual support as @dfn{MULE support} since it
15 is based on @dfn{MULE}.
16
17 @menu
18 * Internationalization Terminology::
19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
24 * Coding Systems:: Ways of representing a string of chars using integers.
25 * CCL:: A special language for writing fast converters.
26 * Category Tables:: Subdividing charsets into groups.
27 @end menu
28
29 @node Internationalization Terminology
30 @section Internationalization Terminology
31
32 In internationalization terminology, a string of text is divided up
33 into @dfn{characters}, which are the printable units that make up the
34 text. A single character is (for example) a capital @samp{A}, the
35 number @samp{2}, a Katakana character, a Kanji ideograph (an
36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese
37 Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
38 of such ideographs in each language), etc. The basic property of a
39 character is its shape. Note that the same character may be drawn by
40 two different people (or in two different fonts) in slightly different
41 ways, although the basic shape will be the same.
42
43 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
47 @samp{a} can optionally have a curved tail projecting off the top, and
48 the @samp{g} can be formed either of two loops, or of one loop and a
49 tail hanging off the bottom. Such distinct possible shapes of a
50 character are called @dfn{glyphs}. The important characteristic of two
51 glyphs making up the same character is that the choice between one or
52 the other is purely stylistic and has no linguistic effect on a word
53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
54 are different characters rather than different glyphs -- e.g.
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
56
57 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs.
59
60 A @dfn{character set} is simply a set of related characters. ASCII,
61 for example, is a set of 94 characters (or 128, if you count
62 non-printing characters). Other character sets are ISO8859-1 (ASCII
63 plus various accented characters and other international symbols),
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
66 GB2312 (Mainland Chinese Hanzi), etc.
67
68 Every character set has one or more @dfn{orderings}, which can be
69 viewed as a way of assigning a number (or set of numbers) to each
70 character in the set. For most character sets, there is a standard
71 ordering, and in fact all of the character sets mentioned above define a
72 particular ordering. ASCII, for example, places letters in their
73 ``natural'' order, puts uppercase letters before lowercase letters,
74 numbers before letters, etc. Note that for many of the Asian character
75 sets, there is no natural ordering of the characters. The actual
76 orderings are based on one or more salient characteristics, of which
77 there are many to choose from -- e.g. number of strokes, common
78 radicals, phonetic ordering, etc.
79
80 The numbers assigned to any particular character are called
81 the character's @dfn{position codes}. The number of position codes
82 required to index a particular character in a character set is called
83 the @dfn{dimension} of the character set. ASCII, being a relatively
84 small character set, is of dimension one, and each character in the
85 set is indexed using a single position code, in the range 0 through
86 127 (if non-printing characters are included) or 33 through 126
87 (if only the printing characters are considered). JISX0208, i.e.
88 Japanese Kanji, has thousands of characters, and is of dimension two --
89 every character is indexed by two position codes, each in the range
90 33 through 126. (Note that the choice of the range here is somewhat
91 arbitrary. Although a character set such as JISX0208 defines an
92 @emph{ordering} of all its characters, it does not define the actual
93 mapping between numbers and characters. You could just as easily
94 index the characters in JISX0208 using numbers in the range 0 through
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range
96 chosen is so that the position codes match up with the actual values
97 used in the common encodings.)
98
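The relationship between a 1-through-94 numbering and the position
codes actually used can be sketched in a few lines of Lisp. This is
only an illustration; the function name
@code{my-row-cell-to-position-codes} is made up for this example and is
not part of XEmacs.

@example
@group
(defun my-row-cell-to-position-codes (row cell)
  "Convert ROW and CELL, each in the range 1 through 94, to the
two position codes in the range 33 through 126."
  (list (+ row 32) (+ cell 32)))

(my-row-cell-to-position-codes 16 1)
     ; @result{} (48 33)
@end group
@end example

Adding 32 maps the range 1 through 94 onto 33 through 126, which is the
range used by the common encodings.
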
99 An @dfn{encoding} is a way of numerically representing characters from
100 one or more character sets into a stream of like-sized numerical values
101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
102 quantities. If an encoding encompasses only one character set, then the
103 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do
105 not understand the difference between a character set and an encoding.)
106 This is not possible, however, if more than one character set is to be
107 used in the encoding. For example, printed Japanese text typically
108 requires characters from multiple character sets -- ASCII, JISX0208, and
109 JISX0212, to be specific. Each of these is indexed using one or more
110 position codes in the range 33 through 126, so the position codes could
111 not be used directly or there would be no way to tell which character
112 was meant. Different Japanese encodings handle this differently -- JIS
113 uses special escape characters to denote different character sets; EUC
114 sets the high bit of the position codes for JISX0208 and JISX0212, and
115 puts a special extra byte before each JISX0212 character; etc. (JIS,
116 EUC, and most of the other encodings you will encounter are 7-bit or
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
118 this strives to represent all the world's characters in a single large
119 character set. 32-bit encodings are generally used internally in
120 programs to simplify the code that manipulates them; however, they are
121 not much used externally because they are not very space-efficient.)
122
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
125 and the interpretation of the values in the stream depends on the
126 current global state of the encoding. Special values in the encoding,
127 called @dfn{escape sequences}, are used to change the global state.
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
129 indicate that, from then on, bytes are to be interpreted as position
130 codes for JISX0208, rather than as ASCII. This effect is cancelled
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
132 current state is to ASCII''. To switch to JISX0212, the escape sequence
133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do
134 in fact begin with @samp{ESC}. This is not necessarily the case,
135 however.)
136
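As a rough sketch of what a modal encoding looks like at the byte
level, the following plain Lisp function (a made-up helper, not an
XEmacs primitive) wraps a list of JISX0208 position codes in the two
escape sequences described above. @samp{ESC} is byte 27, @samp{$} is
36, @samp{(} is 40, and @samp{B} is 66.

@example
@group
(defun my-jis-wrap-jisx0208 (position-codes)
  "Return POSITION-CODES preceded by ESC $ B and followed by ESC ( B."
  (append '(27 36 66)        ; ESC $ B: switch to JISX0208
          position-codes
          '(27 40 66)))      ; ESC ( B: switch back to ASCII

(my-jis-wrap-jisx0208 '(48 33))
     ; @result{} (27 36 66 48 33 27 40 66)
@end group
@end example
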
137 A @dfn{non-modal encoding} has no global state that extends past the
138 character currently being interpreted. EUC, for example, is a
139 non-modal encoding. Characters in JISX0208 are encoded by setting
140 the high bit of the position codes, and characters in JISX0212 are
141 encoded by doing the same but also prefixing the character with the
142 byte 0x8F.
143
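The EUC rule just described can be written down directly. The
following is a minimal sketch in plain Lisp; the function name and the
charset symbols @code{jisx0208} and @code{jisx0212} are invented for
this illustration and are not the names of XEmacs charset objects.

@example
@group
(defun my-euc-bytes (charset code1 code2)
  "Return the EUC bytes for position codes CODE1 and CODE2.
CHARSET is the symbol `jisx0208' or `jisx0212'."
  ;; Setting the high bit is the same as OR-ing with 128 (0x80).
  (let ((bytes (list (logior code1 128) (logior code2 128))))
    (if (eq charset 'jisx0212)
        (cons 143 bytes)     ; 143 is 0x8F, the JISX0212 prefix
      bytes)))

(my-euc-bytes 'jisx0208 48 33)
     ; @result{} (176 161)
(my-euc-bytes 'jisx0212 48 33)
     ; @result{} (143 176 161)
@end group
@end example
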
144 The advantage of a modal encoding is that it is generally more
145 space-efficient, and is easily extendable because there are essentially
146 an arbitrary number of escape sequences that can be created. The
147 disadvantage, however, is that it is much more difficult to work with
148 if it is not being processed in a sequential manner. In the non-modal
149 EUC encoding, for example, the byte 0x41 always refers to the letter
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
151 one of the two position codes in a JISX0208 character, or one of the
152 two position codes in a JISX0212 character. Determining exactly which
153 one is meant could be difficult and time-consuming if the previous
154 bytes in the string have not already been processed.
155
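The cost of a modal encoding can be made concrete with a small sketch.
To find out how the byte at a given index is to be interpreted, the
stream has to be scanned from the start while tracking the state. The
helpers below are hypothetical (they are not part of XEmacs) and
recognize only the three escape sequences mentioned above.

@example
(defun my-prefix-p (prefix list)
  "Return non-nil if LIST begins with the numbers in PREFIX."
  (while (and prefix list (= (car prefix) (car list)))
    (setq prefix (cdr prefix)
          list (cdr list)))
  (null prefix))

(defun my-jis-charset-at (bytes index)
  "Return the charset in effect at INDEX in the byte list BYTES."
  (let ((state 'ascii)
        (i 0)
        (rest bytes))
    (while (and rest (< i index))
      (cond ((my-prefix-p '(27 36 40 68) rest)   ; ESC $ ( D
             (setq state 'jisx0212 rest (nthcdr 4 rest) i (+ i 4)))
            ((my-prefix-p '(27 36 66) rest)      ; ESC $ B
             (setq state 'jisx0208 rest (nthcdr 3 rest) i (+ i 3)))
            ((my-prefix-p '(27 40 66) rest)      ; ESC ( B
             (setq state 'ascii rest (nthcdr 3 rest) i (+ i 3)))
            (t
             (setq rest (cdr rest) i (1+ i)))))
    state))

;; The byte 65 (0x41) appears at indices 0, 4, 5 and 9.
(my-jis-charset-at '(65 27 36 66 65 65 27 40 66 65) 0)
     ; @result{} ascii
(my-jis-charset-at '(65 27 36 66 65 65 27 40 66 65) 4)
     ; @result{} jisx0208
(my-jis-charset-at '(65 27 36 66 65 65 27 40 66 65) 9)
     ; @result{} ascii
@end example

The same byte value is thus the letter @samp{A} at indices 0 and 9 but
a JISX0208 position code at indices 4 and 5, and only the scan reveals
which.
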
156 Non-modal encodings are further divided into @dfn{fixed-width} and
157 @dfn{variable-width} formats. A fixed-width encoding always uses
158 the same number of words per character, whereas a variable-width
159 encoding does not. EUC is a good example of a variable-width
160 encoding: one to three bytes are used per character, depending on
161 the character set. 16-bit and 32-bit encodings are nearly always
162 fixed-width, and this is in fact one of the main reasons for using
163 an encoding with a larger word size. With a fixed-width encoding, the
164 position of any character can be computed directly from its index,
without scanning the preceding text. The advantages of variable-width
165 encodings are that they are generally more space-efficient and allow
166 for compatibility with existing 8-bit encodings such as ASCII.
167
168 Note that the bytes in an 8-bit encoding are often referred to
169 as @dfn{octets} rather than simply as bytes. This terminology
170 dates back to the days before 8-bit bytes were universal, when
171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
172
173 @node Charsets
174 @section Charsets
175
176 A @dfn{charset} in MULE is an object that encapsulates a
177 particular character set as well as an ordering of those characters.
178 Charsets are permanent objects and are named using symbols, like
179 faces.
180
181 @defun charsetp object
182 This function returns non-@code{nil} if @var{object} is a charset.
183 @end defun
184
185 @menu
186 * Charset Properties:: Properties of a charset.
187 * Basic Charset Functions:: Functions for working with charsets.
188 * Charset Property Functions:: Functions for accessing charset properties.
189 * Predefined Charsets:: Predefined charset objects.
190 @end menu
191
192 @node Charset Properties
193 @subsection Charset Properties
194
195 Charsets have the following properties:
196
197 @table @code
198 @item name
199 A symbol naming the charset. Every charset must have a different name;
200 this allows a charset to be referred to using its name rather than
201 the actual charset object.
202 @item doc-string
203 A documentation string describing the charset.
204 @item registry
205 A regular expression matching the font registry field for this character
206 set. For example, both the @code{ascii} and @code{latin-1} charsets
207 use the registry @code{"ISO8859-1"}. This field is used to choose
208 an appropriate font when the user gives a general font specification
209 such as @samp{-*-courier-medium-r-*-140-*}, i.e. a 14-point upright
210 medium-weight Courier font.
211 @item dimension
212 Number of position codes used to index a character in the character set.
213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
214 This property defaults to 1.
215 @item chars
216 Number of characters in each dimension. In XEmacs/MULE, the only
217 allowed values are 94 or 96. (There are a couple of pre-defined
218 character sets, such as ASCII, that do not follow this, but you cannot
219 define new ones like this.) Defaults to 94. Note that if the dimension
220 is 2, the character set thus described is 94x94 or 96x96.
221 @item columns
222 Number of columns used to display a character in this charset.
223 Only used in TTY mode. (Under X, the actual width of a character
224 can be derived from the font used to display the characters.)
225 If unspecified, defaults to the dimension. (This is almost
226 always the correct value, because character sets with dimension 2
227 are usually ideograph character sets, which need two columns to
228 display the intricate ideographs.)
229 @item direction
230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
231 (right-to-left). Defaults to @code{l2r}. This specifies the
232 direction that the text should be displayed in, and will be
233 left-to-right for most charsets but right-to-left for Hebrew
234 and Arabic. (Right-to-left display is not currently implemented.)
235 @item final
236 Final byte of the standard ISO 2022 escape sequence designating this
237 charset. Must be supplied. Each combination of (@var{dimension},
238 @var{chars}) defines a separate namespace for final bytes, and each
239 charset within a particular namespace must have a different final byte.
240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
241 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final
242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
243 official) character sets. For more information on ISO 2022, see @ref{Coding
244 Systems}.
245 @item graphic
246 0 (use left half of font on output) or 1 (use right half of font on
247 output). Defaults to 0. This specifies how to convert the position
248 codes that index a character in a character set into an index into the
249 font used to display the character set. With @code{graphic} set to 0,
250 position codes 33 through 126 map to font indices 33 through 126; with
251 it set to 1, position codes 33 through 126 map to font indices 161
252 through 254 (i.e. the same number but with the high bit set). For
253 example, for a font whose registry is ISO8859-1, the left half of the
254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the
255 right half (octets 0xA0 - 0xFF) is the @code{latin-1} charset.
256 @item ccl-program
257 A compiled CCL program used to convert a character in this charset into
258 an index into the font. This is in addition to the @code{graphic}
259 property. If a CCL program is defined, the position codes of a
260 character will first be processed according to @code{graphic} and
261 then passed through the CCL program, with the resulting values used
262 to index the font.
263
264 This is used, for example, in the Big5 character set (used in Taiwan).
265 This character set is not ISO-2022-compliant, and its size (94x157) does
266 not fit within the maximum 96x96 size of ISO-2022-compliant character
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
268 so as to group the most commonly used characters together) into two
269 charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each of size 94x94,
270 and each charset object uses a CCL program to convert the modified
271 position codes back into standard Big5 indices to retrieve a character
272 from a Big5 font.
273 @end table
274
275 Most of the above properties can only be changed when the charset
276 is created. @xref{Charset Property Functions}.
277
278 @node Basic Charset Functions
279 @subsection Basic Charset Functions
280
281 @defun find-charset charset-or-name
282 This function retrieves the charset of the given name. If
283 @var{charset-or-name} is a charset object, it is simply returned.
284 Otherwise, @var{charset-or-name} should be a symbol. If there is no
285 such charset, @code{nil} is returned. Otherwise the associated charset
286 object is returned.
287 @end defun
288
289 @defun get-charset name
290 This function retrieves the charset of the given name. Same as
291 @code{find-charset} except an error is signalled if there is no such
292 charset instead of returning @code{nil}.
293 @end defun
294
295 @defun charset-list
296 This function returns a list of the names of all defined charsets.
297 @end defun
298
299 @defun make-charset name doc-string props
300 This function defines a new character set. This function is for use
301 with Mule support. @var{name} is a symbol, the name by which the
302 character set is normally referred. @var{doc-string} is a string
303 describing the character set. @var{props} is a property list,
304 describing the specific nature of the character set. The recognized
305 properties are @code{registry}, @code{dimension}, @code{columns},
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
307 @code{ccl-program}, as previously described.
308 @end defun
309
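As an illustration of the argument layout only, a call to
@code{make-charset} might look like the following. The charset name,
doc string, and registry are invented for this sketch. The final byte
@code{?9} is an assumption chosen because it lies in the private range
0x30 - 0x3F and is not used by any 94-character charset listed in this
chapter.

@example
@group
;; Hypothetical private charset; all values are for illustration.
(make-charset 'my-private-charset
              "A private 94-character charset (example only)"
              '(registry "MyPrivate"
                dimension 1
                chars 94
                final ?9
                graphic 0
                direction l2r))

;; The new charset can then be looked up by name.
(find-charset 'my-private-charset)
@end group
@end example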
310 @defun make-reverse-direction-charset charset new-name
311 This function makes a charset equivalent to @var{charset} but which goes
312 in the opposite direction. @var{new-name} is the name of the new
313 charset. The new charset is returned.
314 @end defun
315
316 @defun charset-from-attributes dimension chars final &optional direction
317 This function returns a charset with the given @var{dimension},
318 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is
319 omitted, both directions will be checked (left-to-right will be returned
320 if character sets exist for both directions).
321 @end defun
322
323 @defun charset-reverse-direction-charset charset
324 This function returns the charset (if any) with the same dimension,
325 number of characters, and final byte as @var{charset}, but which is
326 displayed in the opposite direction.
327 @end defun
328
329 @node Charset Property Functions
330 @subsection Charset Property Functions
331
332 All of these functions accept either a charset name or charset object.
333
334 @defun charset-property charset prop
335 This function returns property @var{prop} of @var{charset}.
336 @xref{Charset Properties}.
337 @end defun
338
339 Convenience functions are also provided for retrieving individual
340 properties of a charset.
341
342 @defun charset-name charset
343 This function returns the name of @var{charset}. This will be a symbol.
344 @end defun
345
346 @defun charset-doc-string charset
347 This function returns the doc string of @var{charset}.
348 @end defun
349
350 @defun charset-registry charset
351 This function returns the registry of @var{charset}.
352 @end defun
353
354 @defun charset-dimension charset
355 This function returns the dimension of @var{charset}.
356 @end defun
357
358 @defun charset-chars charset
359 This function returns the number of characters per dimension of
360 @var{charset}.
361 @end defun
362
363 @defun charset-columns charset
364 This function returns the number of display columns per character (in
365 TTY mode) of @var{charset}.
366 @end defun
367
368 @defun charset-direction charset
369 This function returns the display direction of @var{charset} -- either
370 @code{l2r} or @code{r2l}.
371 @end defun
372
373 @defun charset-final charset
374 This function returns the final byte of the ISO 2022 escape sequence
375 designating @var{charset}.
376 @end defun
377
378 @defun charset-graphic charset
379 This function returns either 0 or 1, depending on whether the position
380 codes of characters in @var{charset} map to the left or right half
381 of their font, respectively.
382 @end defun
383
384 @defun charset-ccl-program charset
385 This function returns the CCL program, if any, for converting
386 position codes of characters in @var{charset} into font indices.
387 @end defun
388
389 The only property of a charset that can currently be set after
390 the charset has been created is the CCL program.
391
392 @defun set-charset-ccl-program charset ccl-program
393 This function sets the @code{ccl-program} property of @var{charset} to
394 @var{ccl-program}.
395 @end defun
396
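For example, the property functions might be applied to the predefined
@code{japanese} charset as follows. The values in the comments are
read off the table in @ref{Predefined Charsets}; they have not been
checked against a running XEmacs.

@example
@group
(charset-dimension 'japanese)    ; expected 2 (a 94x94 charset)
(charset-chars 'japanese)        ; expected 94
(charset-final 'japanese)        ; expected the final byte B
(charset-graphic 'japanese)      ; expected 0
(charset-direction 'japanese)    ; expected l2r

;; The generic accessor takes the property name as a symbol.
(charset-property 'japanese 'dimension)
@end group
@end example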
397 @node Predefined Charsets
398 @subsection Predefined Charsets
399
400 The following charsets are predefined in the C code.
401
402 @example
403 Name            Doc String            Type   Fi  Gr  Dir  Registry
404 --------------------------------------------------------------
405 ascii           ASCII                 94     B   0   l2r  ISO8859-1
406 control-1       Control characters    94         0   l2r  ---
407 latin-1         Latin-1               96     A   1   l2r  ISO8859-1
408 latin-2         Latin-2               96     B   1   l2r  ISO8859-2
409 latin-3         Latin-3               96     C   1   l2r  ISO8859-3
410 latin-4         Latin-4               96     D   1   l2r  ISO8859-4
411 cyrillic        Cyrillic              96     L   1   l2r  ISO8859-5
412 arabic          Arabic                96     G   1   r2l  ISO8859-6
413 greek           Greek                 96     F   1   l2r  ISO8859-7
414 hebrew          Hebrew                96     H   1   r2l  ISO8859-8
415 latin-5         Latin-5               96     M   1   l2r  ISO8859-9
416 thai            Thai                  96     T   1   l2r  TIS620
417 japanese-kana   Japanese Katakana     94     I   1   l2r  JISX0201.1976
418 japanese-roman  Japanese Roman        94     J   0   l2r  JISX0201.1976
419 japanese-old    Japanese Old          94x94  @@   0   l2r  JISX0208.1978
420 chinese-gb      Chinese GB            94x94  A   0   l2r  GB2312
421 japanese        Japanese              94x94  B   0   l2r  JISX0208.19(83|90)
422 korean          Korean                94x94  C   0   l2r  KSC5601
423 japanese-2      Japanese Supplement   94x94  D   0   l2r  JISX0212
424 chinese-cns-1   Chinese CNS Plane 1   94x94  G   0   l2r  CNS11643.1
425 chinese-cns-2   Chinese CNS Plane 2   94x94  H   0   l2r  CNS11643.2
426 chinese-big5-1  Chinese Big5 Level 1  94x94  0   0   l2r  Big5
427 chinese-big5-2  Chinese Big5 Level 2  94x94  1   0   l2r  Big5
428 composite       Composite             96x96      0   l2r  ---
429 @end example
430
431 The following charsets are predefined in the Lisp code.
432
433 @example
434 Name            Doc String            Type   Fi  Gr  Dir  Registry
435 --------------------------------------------------------------
436 arabic-0        Arabic digits         94     2   0   l2r  MuleArabic-0
437 arabic-1        one-column Arabic     94     3   0   r2l  MuleArabic-1
438 arabic-2        one-column Arabic     94     4   0   r2l  MuleArabic-2
439 sisheng         PinYin-ZhuYin         94     0   0   l2r  sisheng_cwnn\|
440                                                           OMRON_UDC_ZH
441 chinese-cns-3   Chinese CNS Plane 3   94x94  I   0   l2r  CNS11643.1
442 chinese-cns-4   Chinese CNS Plane 4   94x94  J   0   l2r  CNS11643.1
443 chinese-cns-5   Chinese CNS Plane 5   94x94  K   0   l2r  CNS11643.1
444 chinese-cns-6   Chinese CNS Plane 6   94x94  L   0   l2r  CNS11643.1
445 chinese-cns-7   Chinese CNS Plane 7   94x94  M   0   l2r  CNS11643.1
446 ethiopic        Ethiopic              94x94  2   0   l2r  Ethio
447 ascii-r2l       Right-to-Left ASCII   94     B   0   r2l  ISO8859-1
448 ipa             IPA for Mule          96     0   1   l2r  MuleIPA
449 vietnamese-1    VISCII lower          96     1   1   l2r  VISCII1.1
450 vietnamese-2    VISCII upper          96     2   1   l2r  VISCII1.1
451 @end example
452
453 For all of the above charsets, the dimension and number of columns are
454 the same.
455
456 Note that ASCII, Control-1, and Composite are handled specially.
457 This is why some of the fields are blank; and some of the filled-in
458 fields (e.g. the type) are not really accurate.
459
460 @node MULE Characters
461 @section MULE Characters
462
463 @defun make-char charset arg1 &optional arg2
464 This function makes a multi-byte character from @var{charset} and octets
465 @var{arg1} and @var{arg2}.
466 @end defun
467
468 @defun char-charset ch
469 This function returns the character set of char @var{ch}.
470 @end defun
471
472 @defun char-octet ch &optional n
473 This function returns the octet (i.e. position code) numbered @var{n}
474 (should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted.
475 @end defun
476
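Taken together, these functions suggest a simple round trip. The
following is a sketch under the assumption that @code{make-char}
accepts position codes in the range 33 through 126 for the
@code{japanese} charset; if the descriptions above are taken literally,
the form should evaluate to @code{(japanese 48 33)}, but this has not
been checked against a running XEmacs.

@example
@group
(let ((ch (make-char 'japanese 48 33)))
  (list (charset-name (char-charset ch))   ; name of the charset of CH
        (char-octet ch 0)                  ; first position code
        (char-octet ch 1)))                ; second position code
@end group
@end example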
477 @defun charsets-in-region start end &optional buffer
478 This function returns a list of the charsets in the region between
479 @var{start} and @var{end}. @var{buffer} defaults to the current buffer
480 if omitted.
481 @end defun
482
483 @defun charsets-in-string string
484 This function returns a list of the charsets in @var{string}.
485 @end defun
486
487 @node Composite Characters
488 @section Composite Characters
489
490 Composite characters are not yet completely implemented.
491
492 @defun make-composite-char string
493 This function converts a string into a single composite character. The
494 character is the result of overstriking all the characters in the
495 string.
496 @end defun
497
498 @defun composite-char-string ch
499 This function returns a string of the characters comprising a composite
500 character.
501 @end defun
502
503 @defun compose-region start end &optional buffer
504 This function composes the characters in the region from @var{start} to
505 @var{end} in @var{buffer} into one composite character. The composite
506 character replaces the composed characters. @var{buffer} defaults to
507 the current buffer if omitted.
508 @end defun
509
510 @defun decompose-region start end &optional buffer
511 This function decomposes any composite characters in the region from
512 @var{start} to @var{end} in @var{buffer}. This converts each composite
513 character into one or more characters, the individual characters out of
514 which the composite character was formed. Non-composite characters are
515 left as-is. @var{buffer} defaults to the current buffer if omitted.
516 @end defun
517
518 @node ISO 2022
519 @section ISO 2022
520
521 This section briefly describes the ISO2022 encoding standard. For a
522 more thorough understanding, refer to the text of the ISO2022 standard
523 itself.
524
525 Character sets (@dfn{charsets}) are classified into the following four
526 categories, according to the number of characters in the charset:
527 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
528
529 @need 1000
530 @table @asis
531 @item 94-charset
532 ASCII(B), left(J) and right(I) half of JISX0201, ...
533 @item 96-charset
534 Latin-1(A), Latin-2(B), Latin-3(C), ...
535 @item 94x94-charset
536 GB2312(A), JISX0208(B), KSC5601(C), ...
537 @item 96x96-charset
538 none for the moment
539 @end table
540
541 The character in parentheses after the name of each charset
542 is the @dfn{final character} @var{F}, which can be regarded as
543 the identifier of the charset. ECMA allocates @var{F} to each
544 charset. @var{F} is in the range of 0x30..0x7F, but 0x30..0x3F
545 are only for private use.
546
547 Note: @dfn{ECMA} = European Computer Manufacturers Association
548
549 There are four @dfn{registers of charsets}, called G0 thru G3.
550 You can designate (or assign) any charset to one of these
551 registers.
552
553 The code space contained within one octet (of size 256) is divided into
554 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
555 charset register can be invoked.
556
557 @example
558 @group
559 C0: 0x00 - 0x1F
560 GL: 0x20 - 0x7F
561 C1: 0x80 - 0x9F
562 GR: 0xA0 - 0xFF
563 @end group
564 @end example
565
566 Usually, in the initial state, G0 is invoked into GL, and G1
567 is invoked into GR.
568
569 ISO2022 distinguishes 7-bit environments and 8-bit
570 environments. In 7-bit environments, only C0 and GL are used.
571
572 Charset designation is done by escape sequences of the form:
573
574 @example
575 ESC [@var{I}] @var{I} @var{F}
576 @end example
577
578 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
579 @var{F} is the final character identifying this charset.
580
581 The meanings of the intermediate characters are:
582
583 @example
584 @group
585 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
586 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
587 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
588 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
589 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
590 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
591 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
592 / [0x2F]: designate to G3 a 96-charset whose final byte is
593 @var{F}.
594 @end group
595 @end example
596
597 The following rule is not allowed in ISO2022 but can be used
598 in Mule.
599
600 @example
601 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
602 @end example
603
604 Here are examples of designations:
605
606 @example
607 @group
608 ESC ( B : designate to G0 ASCII
609 ESC - A : designate to G1 Latin-1
610 ESC $ ( A or ESC $ A : designate to G0 GB2312
611 ESC $ ( B or ESC $ B : designate to G0 JISX0208
612 ESC $ ) C : designate to G1 KSC5601
613 @end group
614 @end example
615
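The designation rules above are regular enough to be captured in a
short function. This is a sketch in plain Lisp; the name
@code{my-iso2022-designation} and its argument order are invented for
this example. It always emits the long form with an explicit register
character, and for a 96-charset in register G0 it uses the Mule
extension @samp{,}, which ISO2022 does not allow.

@example
@group
(defun my-iso2022-designation (dimension chars register final)
  "Return the bytes that designate a charset to REGISTER (0 to 3).
DIMENSION is 1 or 2, CHARS is 94 or 96, FINAL is the final byte."
  (let ((intermediate
         (nth register
              (if (= chars 94)
                  '(40 41 42 43)       ; ( ) * +
                '(44 45 46 47)))))     ; , - . /
    (append '(27)                           ; ESC
            (if (= dimension 2) '(36))      ; $ for dimension 2
            (list intermediate final))))

(my-iso2022-designation 1 94 0 66)     ; ESC ( B
     ; @result{} (27 40 66)
(my-iso2022-designation 1 96 1 65)     ; ESC - A
     ; @result{} (27 45 65)
(my-iso2022-designation 2 94 1 67)     ; ESC $ ) C
     ; @result{} (27 36 41 67)
@end group
@end example
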
616 To use a charset designated to G2 or G3, and to use a
617 charset designated to G1 in a 7-bit environment, you must
618 explicitly invoke G1, G2, or G3 into GL. There are two
619 types of invocation, Locking Shift (forever) and Single
620 Shift (one character only).
621
622 Locking Shift is done as follows:
623
624 @example
625 SI or LS0: invoke G0 into GL
626 SO or LS1: invoke G1 into GL
627 LS2: invoke G2 into GL
628 LS3: invoke G3 into GL
629 LS1R: invoke G1 into GR
630 LS2R: invoke G2 into GR
631 LS3R: invoke G3 into GR
632 @end example
633
634 Single Shift is done as follows:
635
636 @example
637 @group
638 SS2 or ESC N: invoke G2 into GL
639 SS3 or ESC O: invoke G3 into GL
640 @end group
641 @end example
642
643 (#### Ben says: I think the above is slightly incorrect. It appears that
644 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
645 ESC O behave as indicated. The above definitions will not parse
646 EUC-encoded text correctly, and it looks like the code in mule-coding.c
647 has similar problems.)
648
649 As the above shows, there are many ISO2022-compliant ways of
650 encoding multilingual text. Many coding systems in actual use, such
651 as X11's Compound Text, Japanese JUNET code, and so-called EUC
652 (Extended UNIX Code), are variants of ISO2022.
653
654 In Mule, we characterize ISO2022 by the following attributes:
655
656 @enumerate
657 @item
658 Initial designation to G0 thru G3.
659 @item
660 Allow designation of short form for Japanese and Chinese.
661 @item
662 Should we designate ASCII to G0 before control characters?
663 @item
664 Should we designate ASCII to G0 at the end of line?
665 @item
666 7-bit environment or 8-bit environment.
667 @item
668 Use Locking Shift or not.
669 @item
670 Use ASCII or JISX0201-1976-Roman.
671 @item
672 Use JISX0208-1983 or JISX0208-1978.
673 @end enumerate
674
675 (The last two are only for Japanese.)
676
677 By specifying these attributes, you can create any variant
678 of ISO2022.
679
680 Here are several examples:
681
682 @example
683 @group
684 junet -- Coding system used in JUNET.
685 1. G0 <- ASCII, G1..3 <- never used
686 2. Yes.
687 3. Yes.
688 4. Yes.
689 5. 7-bit environment
690 6. No.
691 7. Use ASCII
692 8. Use JISX0208-1983
693 @end group
694
695 @group
696 ctext -- Compound Text
697 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
698 2. No.
699 3. No.
700 4. Yes.
701 5. 8-bit environment
702 6. No.
703 7. Use ASCII
704 8. Use JISX0208-1983
705 @end group
706
707 @group
708 euc-china -- Chinese EUC. Although many people call this
709 "GB encoding", that name may cause misunderstanding.
710 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
711 2. No.
712 3. Yes.
713 4. Yes.
714 5. 8-bit environment
715 6. No.
716 7. Use ASCII
717 8. Use JISX0208-1983
718 @end group
719
720 @group
721 korean-mail -- Coding system used on Korean networks.
722 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
723 2. No.
724 3. Yes.
725 4. Yes.
726 5. 7-bit environment
727 6. Yes.
728 7. No.
729 8. No.
730 @end group
731 @end example
732
733 Mule creates all these coding systems by default.
734
735 @node Coding Systems
736 @section Coding Systems
737
738 A coding system is an object that defines how text containing multiple
739 character sets is encoded into a stream of (typically 8-bit) bytes. The
740 coding system is used to decode the stream into a series of characters
741 (which may be from multiple charsets) when the text is read from a file
742 or process, and is used to encode the text back into the same format
743 when it is written out to a file or process.
744
745 For example, many ISO2022-compliant coding systems (such as Compound
746 Text, which is used for inter-client data under the X Window System) use
747 escape sequences to switch between different charsets -- Japanese Kanji,
748 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
749 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
750 @code{make-coding-system} for more information.
751
752 Coding systems are normally identified using a symbol, and the symbol is
753 accepted in place of the actual coding system object whenever a coding
754 system is called for. (This is similar to how faces and charsets work.)
755
756 @defun coding-system-p object
757 This function returns non-@code{nil} if @var{object} is a coding system.
758 @end defun
759
760 @menu
761 * Coding System Types:: Classifying coding systems.
762 * EOL Conversion:: Dealing with different ways of denoting
763 the end of a line.
764 * Coding System Properties:: Properties of a coding system.
765 * Basic Coding System Functions:: Working with coding systems.
766 * Coding System Property Functions:: Retrieving a coding system's properties.
767 * Encoding and Decoding Text:: Encoding and decoding text.
768 * Detection of Textual Encoding:: Determining how text is encoded.
769 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
770 encodings.
771 @end menu
772
773 @node Coding System Types
774 @subsection Coding System Types
775
776 @table @code
777 @item nil
778 @itemx autodetect
779 Automatic conversion. XEmacs attempts to detect the coding system used
780 in the file.
781 @item noconv
782 No conversion. Use this for binary files and such. On output, graphic
783 characters that are not in ASCII or Latin-1 will be replaced by a
784 @samp{?}. (For a noconv-encoded buffer, these characters will only be
785 present if you explicitly insert them.)
786 @item shift-jis
787 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
788 @item iso2022
789 Any ISO2022-compliant encoding. Among other things, this includes JIS
790 (the Japanese encoding commonly used for e-mail), EUC (the standard Unix
791 encoding for Japanese and other languages), and Compound Text (the
792 encoding used in X11). You can specify more specific information about
793 the conversion with the @var{props} argument of @code{make-coding-system}.
794 @item big5
795 Big5 (the encoding commonly used in Taiwan for Traditional Chinese).
796 @item ccl
797 The conversion is performed using a user-written pseudo-code program.
798 CCL (Code Conversion Language) is the name of this pseudo-code.
799 @item internal
800 Write out or read in the raw contents of the memory representing the
801 buffer's text. This is primarily useful for debugging purposes, and is
802 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
803 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
804 file using @code{internal} conversion can result in an internal
805 inconsistency in the memory representing a buffer's text, which will
806 produce unpredictable results and may cause XEmacs to crash. Under
807 normal circumstances you should never use @code{internal} conversion.
808 @end table
809
810 @node EOL Conversion
811 @subsection EOL Conversion
812
813 @table @code
814 @item nil
815 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
816 generate subsidiary coding systems named @code{@var{name}-unix},
817 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
818 this coding system but have an @code{eol-type} value of @code{lf},
819 @code{crlf}, and @code{cr}, respectively.
820 @item lf
821 The end of a line is marked externally using ASCII LF. Since this is
822 also the way that XEmacs represents an end-of-line internally,
823 specifying this option results in no end-of-line conversion. This is
824 the standard format for Unix text files.
825 @item crlf
826 The end of a line is marked externally using ASCII CRLF. This is the
827 standard format for MS-DOS text files.
828 @item cr
829 The end of a line is marked externally using ASCII CR. This is the
830 standard format for Macintosh text files.
831 @item t
832 Automatically detect the end-of-line type but do not generate subsidiary
833 coding systems. (This value is converted to @code{nil} when stored
834 internally, and @code{coding-system-property} will return @code{nil}.)
835 @end table
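
The naming rule for subsidiary coding systems can be sketched as
follows. This is an illustration only: it assumes that a coding system
named @code{junet} exists and was created with an @code{eol-type} of
@code{nil}, which may not hold in every build.

@example
@group
;; Each call asks for the variant of junet with a fixed end-of-line
;; convention.  By the naming rule above, the results are expected to
;; be junet-unix, junet-dos, and junet-mac respectively.
(coding-system-name (subsidiary-coding-system 'junet 'lf))
(coding-system-name (subsidiary-coding-system 'junet 'crlf))
(coding-system-name (subsidiary-coding-system 'junet 'cr))
@end group
@end example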
836
837 @node Coding System Properties
838 @subsection Coding System Properties
839
840 @table @code
841 @item mnemonic
842 String to be displayed in the modeline when this coding system is
843 active.
844
845 @item eol-type
846 End-of-line conversion to be used. It should be one of the types
847 listed in @ref{EOL Conversion}.
848
849 @item post-read-conversion
850 Function called after a file has been read in, to perform the decoding.
851 Called with two arguments, @var{beg} and @var{end}, denoting a region of
852 the current buffer to be decoded.
853
854 @item pre-write-conversion
855 Function called before a file is written out, to perform the encoding.
856 Called with two arguments, @var{beg} and @var{end}, denoting a region of
857 the current buffer to be encoded.
858 @end table
859
860 The following additional properties are recognized if @var{type} is
861 @code{iso2022}:
862
863 @table @code
864 @item charset-g0
865 @itemx charset-g1
866 @itemx charset-g2
867 @itemx charset-g3
868 The character set initially designated to the G0 - G3 registers.
869 The value should be one of
870
871 @itemize @bullet
872 @item
873 A charset object (designate that character set)
874 @item
875 @code{nil} (do not ever use this register)
876 @item
877 @code{t} (no character set is initially designated to the register, but
878 may be later on; this automatically sets the corresponding
879 @code{force-g*-on-output} property)
880 @end itemize
881
882 @item force-g0-on-output
883 @itemx force-g1-on-output
884 @itemx force-g2-on-output
885 @itemx force-g3-on-output
886 If non-@code{nil}, send an explicit designation sequence on output
887 before using the specified register.
888
889 @item short
890 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
891 and @samp{ESC $ B} on output in place of the full designation sequences
892 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
893
894 @item no-ascii-eol
895 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
896 output. Setting this to non-@code{nil} also suppresses other
897 state-resetting that normally happens at the end of a line.
898
899 @item no-ascii-cntl
900 If non-@code{nil}, don't designate ASCII to G0 before control chars on
901 output.
902
903 @item seven
904 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
905 environment.
906
907 @item lock-shift
908 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
909 designation by escape sequence.
910
911 @item no-iso6429
912 If non-@code{nil}, don't use ISO6429's direction specification.
913
914 @item escape-quoted
915 If non-@code{nil}, literal control characters that are the same as the
916 beginning of a recognized ISO2022 or ISO6429 escape sequence (in
917 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
918 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
919 be properly distinguished from an escape sequence. (Note that doing
920 this results in a non-portable encoding.) This encoding flag is used for
921 byte-compiled files. Note that ESC is a good choice for a quoting
922 character because there are no escape sequences whose second byte is a
923 character from the Control-0 or Control-1 character sets; this is
924 explicitly disallowed by the ISO2022 standard.
925
926 @item input-charset-conversion
927 A list of conversion specifications, specifying conversion of characters
928 in one charset to another when decoding is performed. Each
929 specification is a list of two elements: the source charset, and the
930 destination charset.
931
932 @item output-charset-conversion
933 A list of conversion specifications, specifying conversion of characters
934 in one charset to another when encoding is performed. The form of each
935 specification is the same as for @code{input-charset-conversion}.
936 @end table
937
938 The following additional properties are recognized (and required) if
939 @var{type} is @code{ccl}:
940
941 @table @code
942 @item decode
943 CCL program used for decoding (converting to internal format).
944
945 @item encode
946 CCL program used for encoding (converting to external format).
947 @end table
948
949 @node Basic Coding System Functions
950 @subsection Basic Coding System Functions
951
952 @defun find-coding-system coding-system-or-name
953 This function retrieves the coding system of the given name.
954
955 If @var{coding-system-or-name} is a coding-system object, it is simply
956 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
957 If there is no such coding system, @code{nil} is returned. Otherwise
958 the associated coding system object is returned.
959 @end defun
960
961 @defun get-coding-system name
962 This function retrieves the coding system of the given name. Same as
963 @code{find-coding-system} except an error is signalled if there is no
964 such coding system instead of returning @code{nil}.
965 @end defun
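
The difference between the two lookup functions can be sketched as
follows. This is an illustration only; @code{no-such-coding-system} is
a hypothetical name that is assumed not to be defined.

@example
@group
;; Returns nil, because the (hypothetical) name is not defined.
(find-coding-system 'no-such-coding-system)

;; Signals an error for the same name.  Here the error is caught and
;; the symbol `not-defined' is returned instead.
(condition-case nil
    (get-coding-system 'no-such-coding-system)
  (error 'not-defined))
@end group
@end example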
966
967 @defun coding-system-list
968 This function returns a list of the names of all defined coding systems.
969 @end defun
970
971 @defun coding-system-name coding-system
972 This function returns the name of the given coding system.
973 @end defun
974
975 @defun make-coding-system name type &optional doc-string props
976 This function registers symbol @var{name} as a coding system.
977
978 @var{type} describes the conversion method used and should be one of
979 the types listed in @ref{Coding System Types}.
980
981 @var{doc-string} is a string describing the coding system.
982
983 @var{props} is a property list, describing the specific nature of the
984 coding system. Recognized properties are as in @ref{Coding System
985 Properties}.
986 @end defun
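
As a rough sketch, a coding system resembling the @code{junet} variant
of ISO2022 might be registered as shown here. The name
@code{junet-sketch}, the doc string, and the mnemonic are hypothetical,
and passing the charset as the symbol @code{ascii} rather than as a
charset object is an assumption.

@example
@group
(make-coding-system
 'junet-sketch 'iso2022
 "Hypothetical JUNET-like coding system (illustration only)."
 '(mnemonic "J"
   charset-g0 ascii   ; G0 initially holds ASCII (symbol form assumed)
   charset-g1 nil     ; G1 through G3 are never used
   charset-g2 nil
   charset-g3 nil
   short t            ; use the short designation forms
   seven t))          ; 7-bit environment on output
@end group
@end example

Properties that are left out, such as @code{lock-shift} and
@code{no-ascii-eol}, are assumed here to default to @code{nil}.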
987
988 @defun copy-coding-system old-coding-system new-name
989 This function copies @var{old-coding-system} to @var{new-name}. If
990 @var{new-name} does not name an existing coding system, a new one will
991 be created.
992 @end defun
993
994 @defun subsidiary-coding-system coding-system eol-type
995 This function returns the subsidiary coding system of
996 @var{coding-system} with eol type @var{eol-type}.
997 @end defun
998
999 @node Coding System Property Functions
1000 @subsection Coding System Property Functions
1001
1002 @defun coding-system-doc-string coding-system
1003 This function returns the doc string for @var{coding-system}.
1004 @end defun
1005
1006 @defun coding-system-type coding-system
1007 This function returns the type of @var{coding-system}.
1008 @end defun
1009
1010 @defun coding-system-property coding-system prop
1011 This function returns the @var{prop} property of @var{coding-system}.
1012 @end defun
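
A minimal sketch of querying an existing coding system follows. It
assumes that a coding system named @code{junet} is defined. The
property names are the ones listed in @ref{Coding System Properties}.

@example
@group
(coding-system-type 'junet)                ; conversion type
(coding-system-doc-string 'junet)          ; descriptive string
(coding-system-property 'junet 'eol-type)  ; end-of-line conversion
(coding-system-property 'junet 'mnemonic)  ; modeline string
@end group
@end example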
1013
1014 @node Encoding and Decoding Text
1015 @subsection Encoding and Decoding Text
1016
1017 @defun decode-coding-region start end coding-system &optional buffer
1018 This function decodes the text between @var{start} and @var{end} which
1019 is encoded in @var{coding-system}. This is useful if you've read in
1020 encoded text from a file without decoding it (e.g. you read in a
1021 JIS-formatted file but used the @code{binary} or @code{noconv} coding
1022 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
1023 decoded text is returned. @var{buffer} defaults to the current buffer
1024 if unspecified.
1025 @end defun
1026
1027 @defun encode-coding-region start end coding-system &optional buffer
1028 This function encodes the text between @var{start} and @var{end} using
1029 @var{coding-system}. This will, for example, convert Japanese
1030 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS
1031 encoding. The length of the encoded text is returned. @var{buffer}
1032 defaults to the current buffer if unspecified.
1033 @end defun
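
The two functions are inverses of each other, which can be sketched as
follows. This assumes that the current buffer holds undecoded text in
the @code{junet} encoding (for example, after reading a file with
@code{noconv}) and that @code{junet} is defined.

@example
@group
;; Turn the raw escape sequences into characters ...
(decode-coding-region (point-min) (point-max) 'junet)

;; ... and convert the characters back to the external format.
(encode-coding-region (point-min) (point-max) 'junet)
@end group
@end example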
1034
1035 @node Detection of Textual Encoding
1036 @subsection Detection of Textual Encoding
1037
1038 @defun coding-category-list
1039 This function returns a list of all recognized coding categories.
1040 @end defun
1041
1042 @defun set-coding-priority-list list
1043 This function changes the priority order of the coding categories.
1044 @var{list} should be a list of coding categories, in descending order of
1045 priority. Unspecified coding categories will be lower in priority than
1046 all specified ones, in the same relative order they were in previously.
1047 @end defun
1048
1049 @defun coding-priority-list
1050 This function returns a list of coding categories in descending order of
1051 priority.
1052 @end defun
1053
1054 @defun set-coding-category-system coding-category coding-system
1055 This function changes the coding system associated with a coding category.
1056 @end defun
1057
1058 @defun coding-category-system coding-category
1059 This function returns the coding system associated with a coding category.
1060 @end defun
1061
1062 @defun detect-coding-region start end &optional buffer
1063 This function detects the coding system of the text in the region between
1064 @var{start} and @var{end}. The return value is a list of possible coding
1065 systems ordered by priority. If only ASCII characters are found, it
1066 returns @code{autodetect} or one of its subsidiary coding systems
1067 according to a detected end-of-line type. Optional arg @var{buffer}
1068 defaults to the current buffer.
1069 @end defun
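
Because the result may be either a list or a single coding system, a
caller might handle both cases along these lines. This is a sketch, not
a definitive idiom.

@example
@group
(let ((result (detect-coding-region (point-min) (point-max))))
  (if (consp result)
      (car result)    ; highest-priority candidate
    result))          ; single value, as for ASCII-only text
@end group
@end example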
1070
1071 @node Big5 and Shift-JIS Functions
1072 @subsection Big5 and Shift-JIS Functions
1073
1074 These are special functions for working with the non-standard
1075 Shift-JIS and Big5 encodings.
1076
1077 @defun decode-shift-jis-char code
1078 This function decodes a JISX0208 character encoded in Shift-JIS.
1079 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1080 The corresponding character is returned.
1081 @end defun
1082
1083 @defun encode-shift-jis-char ch
1084 This function encodes the JISX0208 character @var{ch} in the Shift-JIS
1085 coding system. The corresponding character code in Shift-JIS is
1086 returned as a cons of two bytes.
1087 @end defun
1088
1089 @defun decode-big5-char code
1090 This function decodes a character encoded in the Big5 coding system.
1091 @var{code} is the character code in Big5. The corresponding character
1092 is returned.
1093 @end defun
1094
1095 @defun encode-big5-char ch
1096 This function encodes the Big5 character @var{ch} in the Big5
1097 coding system. The corresponding character code in Big5 is returned.
1098 @end defun
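
As an illustration of the cons-of-two-bytes convention, the following
sketch round-trips one Shift-JIS code. The bytes 130 and 160 (0x82 and
0xA0) are the Shift-JIS code for Hiragana letter A. That the round trip
yields an @code{equal} cons is an expectation, not something this
manual guarantees.

@example
@group
(let* ((code '(130 . 160))
       (ch (decode-shift-jis-char code)))
  ;; Encoding the decoded character should give back the same bytes.
  (equal (encode-shift-jis-char ch) code))
@end group
@end example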
1099
1100 @node CCL
1101 @section CCL
1102
1103 @defun execute-ccl-program ccl-program status
1104 This function executes @var{ccl-program} with registers initialized by
1105 @var{status}. @var{ccl-program} is a vector of compiled CCL code
1106 created by @code{ccl-compile}. @var{status} must be a vector of nine
1107 values, specifying the initial value for the R0, R1 .. R7 registers and
1108 for the instruction counter IC. A @code{nil} value for a register
1109 initializer causes the register to be set to 0. A @code{nil} value for
1110 the IC initializer causes execution to start at the beginning of the
1111 program. When the program is done, @var{status} is modified (by
1112 side-effect) to contain the ending values for the corresponding
1113 registers and IC.
1114 @end defun
1115
1116 @defun execute-ccl-program-string ccl-program status str
1117 This function executes @var{ccl-program} with initial @var{status} on
1118 @var{str}. @var{ccl-program} is a vector of compiled CCL code
1119 created by @code{ccl-compile}. @var{status} must be a vector of nine
1120 values, specifying the initial value for the R0, R1 .. R7 registers and
1121 for the instruction counter IC. A @code{nil} value for a register
1122 initializer causes the register to be set to 0. A @code{nil} value for
1123 the IC initializer causes execution to start at the beginning of the
1124 program. When the program is done, @var{status} is modified (by
1125 side-effect) to contain the ending values for the corresponding
1126 registers and IC. Returns the resulting string.
1127 @end defun
1128
1129 @defun ccl-reset-elapsed-time
1130 This function resets the internal counter that holds the time spent in
1131 the CCL interpreter.
1132 @end defun
1133
1134 @defun ccl-elapsed-time
1135 This function returns the time spent in the CCL interpreter as a cons of
1136 user and system time. This measures processor time, not real time.
1137 Both values are floating point numbers measured in seconds. If only one
1138 overall value can be determined, the return value will be a cons of that
1139 value and 0.
1140 @end defun
1141
1142 @node Category Tables
1143 @section Category Tables
1144
1145 A category table is a type of char table used for keeping track of
1146 categories. Categories are used for classifying characters for use in
1147 regexps -- you can refer to a category rather than having to use a
1148 complicated [] expression (and category lookups are significantly
1149 faster).
1150
1151 There are 95 different categories available, one for each printable
1152 character (including space) in the ASCII charset. Each category is
1153 designated by one such character, called a @dfn{category designator}.
1154 They are specified in a regexp using the syntax @samp{\cX}, where X is a
1155 category designator. (This is not yet implemented.)
1156
1157 A category table specifies, for each character, the categories that
1158 the character is in. Note that a character can be in more than one
1159 category. More specifically, a category table maps from a character to
1160 either the value @code{nil} (meaning the character is in no categories)
1161 or a 95-element bit vector, specifying for each of the 95 categories
1162 whether the character is in that category.
1163
1164 Special Lisp functions are provided that abstract this, so you do not
1165 have to directly manipulate bit vectors.
1166
1167 @defun category-table-p obj
1168 This function returns @code{t} if @var{obj} is a category table.
1169 @end defun
1170
1171 @defun category-table &optional buffer
1172 This function returns the current category table. This is the one
1173 specified by the current buffer, or by @var{buffer} if it is
1174 non-@code{nil}.
1175 @end defun
1176
1177 @defun standard-category-table
1178 This function returns the standard category table. This is the one used
1179 for new buffers.
1180 @end defun
1181
1182 @defun copy-category-table &optional table
1183 This function constructs a new category table and returns it. It is a
1184 copy of @var{table}, which defaults to the standard category table.
1185 @end defun
1186
1187 @defun set-category-table table &optional buffer
1188 This function selects a new category table for @var{buffer}.
1189 @var{table} must be a category table. @var{buffer} defaults to the
1190 current buffer if omitted.
1191 @end defun
1192
1193 @defun category-designator-p obj
1194 This function returns @code{t} if @var{obj} is a category designator (a
1195 char in the range @samp{' '} to @samp{'~'}).
1196 @end defun
1197
1198 @defun category-table-value-p obj
1199 This function returns @code{t} if @var{obj} is a category table value.
1200 Valid values are @code{nil} or a bit vector of size 95.
1201 @end defun
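
The predicates and table functions above might be exercised as in this
sketch. The use of @code{make-bit-vector} to build a 95-element bit
vector is an assumption about the available bit-vector primitives.

@example
@group
;; The standard table is itself a category table.
(category-table-p (standard-category-table))

;; ?a lies in the printable ASCII range; a newline does not.
(category-designator-p ?a)
(category-designator-p ?\n)

;; Both documented kinds of category table value.
(category-table-value-p nil)
(category-table-value-p (make-bit-vector 95 0))

;; Give the current buffer its own copy of the standard table.
(set-category-table (copy-category-table))
@end group
@end example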
1202