annotate man/lispref/mule.texi @ 5840:93a18dbcfd8c
Don't leave fields uninitialized.
author | Marcus Crestani <marcus@crestani.de> |
---|---|
date | Sat, 13 Dec 2014 14:20:17 +0100 |
parents | 9fae6227ede5 |
children |
rev | line source |
---|---|
428 | 1 @c -*-texinfo-*- |
2 @c This is part of the XEmacs Lisp Reference Manual. | |
775 | 3 @c Copyright (C) 1996 Ben Wing, 2001-2002 Free Software Foundation. |
428 | 4 @c See the file lispref.texi for copying conditions. |
5 @setfilename ../../info/internationalization.info | |
5791 | 6 @node MULE, Tips, Internationalization, Top |
428 | 7 @chapter MULE |
8 | |
442 | 9 @dfn{MULE} is the name originally given to the version of GNU Emacs |
428 | 10 extended for multi-lingual (and in particular Asian-language) support. |
442 | 11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and |
12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the | |
13 Japanese word for ``Japan''), which only provided support for Japanese. | |
14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since | |
15 it is based on @dfn{MULE}. | |
428 | 16 |
17 @menu | |
18 * Internationalization Terminology:: | |
19 Definition of various internationalization terms. | |
20 * Charsets:: Sets of related characters. | |
21 * MULE Characters:: Working with characters in XEmacs/MULE. | |
22 * Composite Characters:: Making new characters by overstriking other ones. | |
23 * Coding Systems:: Ways of representing a string of chars using integers. | |
24 * CCL:: A special language for writing fast converters. | |
25 * Category Tables:: Subdividing charsets into groups. | |
775 | 26 * Unicode Support:: The universal coded character set. |
1183 | 27 * Charset Unification:: Handling overlapping character sets. |
28 * Charsets and Coding Systems:: Tables and reference information. | |
428 | 29 @end menu |
30 | |
442 | 31 @node Internationalization Terminology, Charsets, , MULE |
428 | 32 @section Internationalization Terminology |
33 | |
442 | 34 In internationalization terminology, a string of text is divided up |
428 | 35 into @dfn{characters}, which are the printable units that make up the |
36 text. A single character is (for example) a capital @samp{A}, the | |
442 | 37 number @samp{2}, a Katakana character, a Hangul character, a Kanji |
38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is | |
39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there | |
40 are thousands of such ideographs in each language), etc. The basic | |
41 property of a character is that it is the smallest unit of text with | |
1261 | 42 semantic significance in text processing---i.e., characters are abstract |
43 units defined by their meaning, not by their exact appearance. | |
442 | 44 |
45 Human beings normally process text visually, so to a first approximation | |
46 a character may be identified with its shape. Note that the same | |
47 character may be drawn by two different people (or in two different | |
48 fonts) in slightly different ways, although the ``basic shape'' will be the | |
49 same. But consider the works of Scott Kim; human beings can recognize | |
50 hugely variant shapes as the ``same'' character. Sometimes, especially | |
51 where characters are extremely complicated to write, completely | |
52 different shapes may be defined as the ``same'' character in national | |
53 standards. The Taiwanese variant of Hanzi is generally the most | |
444 | 54 complicated; over the centuries, the Japanese, Koreans, and the People's |
442 | 55 Republic of China have adopted simplifications of the shape, but the |
56 line of descent from the original shape is recorded, and the meanings | |
57 and pronunciation of different forms of the same character are | |
58 considered to be identical within each language. (Of course, it may | |
59 take a specialist to recognize the related form; the point is that the | |
60 relations are standardized, despite the differing shapes.) | |
428 | 61 |
62 In some cases, the differences will be significant enough that it is | |
63 actually possible to identify two or more distinct shapes that both | |
64 represent the same character. For example, the lowercase letters | |
440 | 65 @samp{a} and @samp{g} each have two distinct possible shapes---the |
428 | 66 @samp{a} can optionally have a curved tail projecting off the top, and |
67 the @samp{g} can be formed either of two loops, or of one loop and a | |
68 tail hanging off the bottom. Such distinct possible shapes of a | |
69 character are called @dfn{glyphs}. The important characteristic of two | |
70 glyphs making up the same character is that the choice between one or | |
71 the other is purely stylistic and has no linguistic effect on a word | |
72 (this is the reason why a capital @samp{A} and lowercase @samp{a} | |
440 | 73 are different characters rather than different glyphs---e.g. |
428 | 74 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). |
75 | |
76 Note that @dfn{character} and @dfn{glyph} are used differently | |
77 here than elsewhere in XEmacs. | |
78 | |
442 | 79 A @dfn{character set} is essentially a set of related characters. ASCII, |
428 | 80 for example, is a set of 94 characters (or 128, if you count |
81 non-printing characters). Other character sets are ISO8859-1 (ASCII | |
82 plus various accented characters and other international symbols), | |
442 | 83 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 |
84 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), | |
428 | 85 GB2312 (Mainland Chinese Hanzi), etc. |
86 | |
442 | 87 The definition of a character set will implicitly or explicitly give |
88 it an @dfn{ordering}, a way of assigning a number to each character in | |
89 the set. For many character sets, there is a natural ordering, for | |
90 example the ``ABC'' ordering of the Roman letters. But it is not clear | |
91 whether digits should come before or after the letters, and in fact | |
92 different European languages treat the ordering of accented characters | |
93 differently. It is useful to use the natural order where available, of | |
94 course. The number assigned to any particular character is called the | |
95 character's @dfn{code point}. (Within a given character set, each | |
96 character has a unique code point. Thus the word "set" is ill-chosen; | |
97 different orderings of the same characters are different character sets. | |
98 Identifying characters is simple enough for alphabetic character sets, | |
99 but the difference in ordering can cause great headaches when the same | |
100 thousands of characters are used by different cultures as in the Hanzi.) | |
428 | 101 |
1261 | 102 It's important to understand that a character is defined not by any |
103 number attached to it, but by its meaning. For example, ASCII and | |
104 EBCDIC are two charsets containing exactly the same characters | |
105 (lowercase and uppercase letters, numbers 0 through 9, particular | |
106 punctuation marks) but with different numberings. The @samp{comma} | |
107 character in ASCII and EBCDIC, for instance, is the same character | |
108 despite having a different numbering. Conversely, when comparing ASCII | |
109 and JIS-Roman, which look the same except that the latter has a yen sign | |
110 substituted for the backslash, we would say that the backslash and yen | |
111 sign are @emph{not} the same characters, despite having the same number | |
112 (92) and despite the fact that all other characters are present in both | |
113 charsets, with the same numbering. ASCII and JIS-Roman, then, do | |
114 @emph{not} have exactly the same characters in them (ASCII has a | |
115 backslash character but no yen-sign character, and vice-versa for | |
116 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII | |
117 and JIS-Roman are closer. | |
118 | |
119 Sometimes, a code point is not a single number, but instead a group of | |
120 numbers, called @dfn{position codes}. In such cases, the number of | |
121 position codes required to index a particular character in a character | |
122 set is called the @dfn{dimension} of the character set. Character sets | |
123 indexed by more than one position code typically use byte-sized position | |
124 codes. Small character sets, e.g. ASCII, invariably use a single | |
125 position code, but for larger character sets, the choice of whether to | |
126 use multiple position codes or a single large (16-bit or 32-bit) number | |
127 is arbitrary. Unicode typically uses a single large number, but | |
128 language-specific or "national" character sets often use multiple | |
129 (usually two) position codes. For example, JIS X 0208, i.e. Japanese | |
130 Kanji, has thousands of characters, and is of dimension two -- every | |
131 character is indexed by two position codes, each in the range 1 through | |
132 94. (This number ``94'' is not a coincidence; it is the same as the | |
133 number of printable characters in ASCII, and was chosen so that JIS | |
134 characters could be directly encoded using two printable ASCII | |
135 characters.) Note that the choice of the range here is somewhat | |
136 arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. | |
137 In fact, the range for JIS position codes (and for other character sets | |
138 modeled after it) is often given as range 33 through 126, so as to | |
139 directly match ASCII printing characters. | |
428 | 140 |
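As a rough illustration of position codes and dimension, the following
Python sketch (not part of XEmacs; the function name
@code{kuten_to_bytes} is hypothetical) converts the two position codes of
a JIS X 0208 character from the 1 through 94 convention to the 33 through
126 convention. Python's standard @code{iso2022_jp} codec is used only to
check which character the resulting pair denotes.

```python
def kuten_to_bytes(row, cell):
    """Map two position codes in 1..94 to the 33..126 form (add 32)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("position codes must be in the range 1 through 94")
    return bytes([row + 32, cell + 32])

# Row 16, cell 1 of JIS X 0208 (an arbitrary example input).
pair = kuten_to_bytes(16, 1)

# Wrap the pair in the escape sequences that select JIS X 0208 and
# then return to ASCII, so that the standard codec can decode it.
char = (b"\x1b$B" + pair + b"\x1b(B").decode("iso2022_jp")
print(pair.hex(), char)
```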
141 An @dfn{encoding} is a way of numerically representing characters from | |
142 one or more character sets into a stream of like-sized numerical values | |
1261 | 143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or |
2818 | 144 32-bit quantities. In a context where dealing with Japanese motivates |
145 much of XEmacs' design in this area, it's important to clearly | |
146 distinguish between charsets and encodings. For a simple charset like | |
147 ASCII, there is only one encoding normally used -- each character is | |
148 represented by a single byte, with the same value as its code point. | |
149 For more complicated charsets, however, or when a single encoding needs | |
150 to represent more than one charset, things are not so obvious. Unicode | |
151 version 2, for example, is a large charset with thousands of characters, | |
152 each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 | |
153 for the Hebrew letter "aleph". One obvious encoding (actually two | |
154 encodings, depending on which of the two possible byte orderings is | |
155 chosen) simply uses two bytes per character. This encoding is | |
156 convenient for internal processing of Unicode text; however, it's | |
157 incompatible with ASCII, and thus external text (files, e-mail, etc.) | |
158 that is encoded this way is completely uninterpretable by programs | |
159 lacking Unicode support. For this reason, a different, ASCII-compatible | |
160 encoding, e.g. UTF-8, is usually used for external text. UTF-8 | |
161 represents Unicode characters with one to three bytes (often extended to | |
162 six bytes to handle characters with up to 31-bit indices). Unicode | |
163 characters 00 to 7F (identical with ASCII) are directly represented with | |
164 one byte, and other characters with two or more bytes, each in the range | |
165 80 to FF. Applications that don't understand Unicode will still be able | |
166 to process ASCII characters represented in UTF-8-encoded text, and will | |
167 typically ignore (and hopefully preserve) the high-bit characters. | |
168 | |
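The difference between the two-byte encodings and UTF-8 can be sketched
in Python, using the Hebrew letter from the text. This is only an
illustration with Python's standard codecs (@code{utf-16-be} and
@code{utf-16-le} stand in for the two byte orderings); it says nothing
about how XEmacs represents text internally.

```python
aleph = "\u05d0"  # the 0x05D0 example from the text

# The two fixed-width two-byte encodings differ only in byte order.
big_endian = aleph.encode("utf-16-be")
little_endian = aleph.encode("utf-16-le")

# UTF-8 leaves ASCII untouched and uses only bytes 0x80..0xFF otherwise.
utf8 = ("A" + aleph).encode("utf-8")

print(big_endian.hex(), little_endian.hex(), utf8.hex())
```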
169 Similarly, Shift-JIS and EUC-JP are different encodings normally used to | |
170 encode the same character set(s), these character sets being subsets of | |
171 Unicode. However, the obvious approach of unifying XEmacs' internal | |
172 encoding across character sets, as was part of the motivation behind | |
173 Unicode, wasn't taken. This means that characters in these character | |
174 sets that are identical to characters in other character sets (for | |
175 example, the Greek alphabet is in the large Japanese character sets and | |
176 at least one European character set) are unfortunately disjoint. | |
1261 | 177 |
178 Naive use of code points is also not possible if more than one | |
179 character set is to be used in the encoding. For example, printed | |
442 | 180 Japanese text typically requires characters from multiple character sets |
181 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | |
1261 | 182 indexed using one or more position codes in the range 1 through 94 (or |
183 33 through 126), so the position codes could not be used directly or | |
184 there would be no way to tell which character was meant. Different | |
185 Japanese encodings handle this differently -- JIS uses special escape | |
186 characters to denote different character sets; EUC sets the high bit of | |
187 the position codes for JIS X 0208 and JIS X 0212, and puts a special | |
188 extra byte before each JIS X 0212 character; etc. | |
189 | |
190 The encodings described above are all 7-bit or 8-bit encodings. The | |
191 fixed-width Unicode encoding previously described, however, is sometimes | |
192 considered to be a 16-bit encoding, in which case the issue of byte | |
193 ordering does not come up. (Imagine, for example, that the text is | |
194 represented as an array of shorts.) Similarly, Unicode version 3 (which | |
195 has characters with indices above 0xFFFF), and other very large | |
196 character sets, may be represented internally as 32-bit encodings, | |
197 i.e. arrays of ints. However, it does not make too much sense to talk | |
198 about 16-bit or 32-bit encodings for external data, since nowadays 8-bit | |
199 data is a universal standard -- the closest you can get is fixed-width | |
200 encodings using two or four bytes to encode 16-bit or 32-bit values. (A | |
201 "7-bit" encoding is used when it cannot be guaranteed that the high bit | |
202 of 8-bit data will be correctly preserved. Some e-mail gateways, for | |
203 example, strip the high bit of text passing through them. These same | |
204 gateways often handle non-printable characters incorrectly, and so 7-bit | |
205 encodings usually avoid using bytes with such values.) | |
442 | 206 |
207 A general method of handling text using multiple character sets | |
208 (whether for multilingual text, or simply text in an extremely | |
209 complicated single language like Japanese) is defined in the | |
210 international standard ISO 2022. ISO 2022 will be discussed in more | |
211 detail later (@pxref{ISO 2022}), but for now suffice it to say that text | |
212 needs control functions (at least spacing), and if escape sequences are | |
213 to be used, an escape sequence introducer. It was decided to make all | |
214 text streams compatible with ASCII in the sense that the codes 0--31 | |
215 (and 128--159) would always be control codes, never graphic characters, | |
216 and where defined by the character set the @samp{SPC} character would be | |
217 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are | |
218 94 code points remaining if 7 bits are used. This is the reason that | |
219 most character sets are defined using position codes in the range 1 | |
220 through 94. Then ISO 2022 compatible encodings are produced by shifting | |
221 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit | |
222 codes are available) into character codes 161 to 254. | |
428 | 223 |
224 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In | |
442 | 225 a @dfn{modal encoding}, there are multiple states that the encoding can |
226 be in, and the interpretation of the values in the stream depends on the | |
428 | 227 current global state of the encoding. Special values in the encoding, |
228 called @dfn{escape sequences}, are used to change the global state. | |
229 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} | |
230 indicate that, from then on, bytes are to be interpreted as position | |
442 | 231 codes for JIS X 0208, rather than as ASCII. This effect is cancelled |
428 | 232 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the |
442 | 233 current state is to ASCII''. To switch to JIS X 0212, the escape |
234 sequence @samp{ESC $ ( D} is used. (Note that here, as is common, the escape | |
235 sequences do in fact begin with @samp{ESC}. This is not necessarily the | |
236 case, however. Some encodings use control characters called ``locking | |
237 shifts'' (effect persists until cancelled) to switch character sets.) | |
428 | 238 |
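The escape sequences of a modal encoding can be observed with a short
Python sketch. It assumes that Python's standard @code{iso2022_jp} codec
is an acceptable stand-in for the ``JIS'' encoding described here; the
sample text is arbitrary.

```python
text = "A\u4e9cB"  # ASCII, one JIS X 0208 character, ASCII again

encoded = text.encode("iso2022_jp")

# The stream is pure 7-bit: the kanji is bracketed by ESC $ B
# (switch to JIS X 0208) and ESC ( B (switch back to ASCII).
print(encoded)
```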
442 | 239 A @dfn{non-modal encoding} has no global state that extends past the |
428 | 240 character currently being interpreted. EUC, for example, is a |
442 | 241 non-modal encoding. Characters in JIS X 0208 are encoded by setting |
242 the high bit of the position codes, and characters in JIS X 0212 are | |
428 | 243 encoded by doing the same but also prefixing the character with the |
244 byte 0x8F. | |
245 | |
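The EUC rule just described can be written down directly. The following
Python sketch is a simplified illustration, not the XEmacs
implementation: the function name @code{euc_jp_bytes} and the charset
labels are hypothetical, and only JIS X 0208 and JIS X 0212 are handled.

```python
def euc_jp_bytes(charset, code1, code2):
    """Encode two position codes (33..126 form) as EUC-JP bytes."""
    pair = bytes([code1 | 0x80, code2 | 0x80])  # set the high bit
    if charset == "jisx0208":
        return pair
    if charset == "jisx0212":
        return b"\x8f" + pair  # extra prefix byte for JIS X 0212
    raise ValueError("unsupported charset: " + charset)

kanji_0208 = euc_jp_bytes("jisx0208", 0x30, 0x21)
kanji_0212 = euc_jp_bytes("jisx0212", 0x30, 0x21)
print(kanji_0208.hex(), kanji_0212.hex())
```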
246 The advantage of a modal encoding is that it is generally more | |
442 | 247 space-efficient, and is easily extensible because an essentially |
428 | 248 arbitrary number of escape sequences can be created. The |
249 disadvantage, however, is that it is much more difficult to work with | |
250 if it is not being processed in a sequential manner. In the non-modal | |
251 EUC encoding, for example, the byte 0x41 always refers to the letter | |
252 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or | |
442 | 253 one of the two position codes in a JIS X 0208 character, or one of the |
254 two position codes in a JIS X 0212 character. Determining exactly which | |
428 | 255 one is meant could be difficult and time-consuming if the previous |
442 | 256 bytes in the string have not already been processed, or impossible if |
257 they are drawn from an external stream that cannot be rewound. | |
428 | 258 |
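The ambiguity of the byte 0x41 can be demonstrated with Python's standard
codecs, again assuming that @code{iso2022_jp} and @code{euc_jp} are
reasonable stand-ins for the JIS and EUC encodings discussed here.

```python
# In ISO-2022-JP (modal), the byte 0x41 means "A" in the ASCII state
# but is half of a kanji once ESC $ B has been seen.
as_ascii = b"0A".decode("iso2022_jp")
as_kanji = b"\x1b$B0A\x1b(B".decode("iso2022_jp")

# In EUC-JP (non-modal), the same kanji uses only high-bit bytes,
# so a byte 0x41 can only ever be the letter "A".
kanji_in_euc = as_kanji.encode("euc_jp")

print(repr(as_ascii), len(as_kanji), kanji_in_euc.hex())
```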
259 Non-modal encodings are further divided into @dfn{fixed-width} and | |
260 @dfn{variable-width} formats. A fixed-width encoding always uses | |
261 the same number of words per character, whereas a variable-width | |
262 encoding does not. EUC is a good example of a variable-width | |
263 encoding: one to three bytes are used per character, depending on | |
264 the character set. 16-bit and 32-bit encodings are nearly always | |
265 fixed-width, and this is in fact one of the main reasons for using | |
266 an encoding with a larger word size. The advantages of fixed-width | |
267 encodings should be obvious. The advantages of variable-width | |
268 encodings are that they are generally more space-efficient and allow | |
442 | 269 for compatibility with existing 8-bit encodings such as ASCII. (For |
270 example, in Unicode ASCII characters are simply promoted to a 16-bit | |
271 representation. That means that every ASCII character contains a | |
272 @samp{NUL} byte; evidently all of the standard string manipulation | |
273 functions will lose badly in a fixed-width Unicode environment.) | |
428 | 274 |
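The @samp{NUL} byte problem can be seen in a few lines of Python. In this
sketch @code{utf-16-be} is assumed as the fixed-width 16-bit
representation, and splitting at the first zero byte imitates a C-style
string routine.

```python
wide = "Hi".encode("utf-16-be")   # two bytes per character
narrow = "Hi".encode("utf-8")     # identical to the ASCII bytes

# A C-style routine that stops at the first NUL byte sees an empty
# string in the fixed-width data but the whole text in UTF-8.
c_view_wide = wide.split(b"\x00")[0]
c_view_narrow = narrow.split(b"\x00")[0]
print(wide, c_view_wide, c_view_narrow)
```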
442 | 275 The bytes in an 8-bit encoding are often referred to as @dfn{octets} |
276 rather than simply as bytes. This terminology dates back to the days | |
277 before 8-bit bytes were universal, when some computers had 9-bit bytes, | |
278 others had 10-bit bytes, etc. | |
428 | 279 |
442 | 280 @node Charsets, MULE Characters, Internationalization Terminology, MULE |
428 | 281 @section Charsets |
282 | |
283 A @dfn{charset} in MULE is an object that encapsulates a | |
284 particular character set as well as an ordering of those characters. | |
285 Charsets are permanent objects and are named using symbols, like | |
286 faces. | |
287 | |
288 @defun charsetp object | |
289 This function returns non-@code{nil} if @var{object} is a charset. | |
290 @end defun | |
291 | |
292 @menu | |
293 * Charset Properties:: Properties of a charset. | |
294 * Basic Charset Functions:: Functions for working with charsets. | |
295 * Charset Property Functions:: Functions for accessing charset properties. | |
296 * Predefined Charsets:: Predefined charset objects. | |
297 @end menu | |
298 | |
442 | 299 @node Charset Properties, Basic Charset Functions, , Charsets |
428 | 300 @subsection Charset Properties |
301 | |
302 Charsets have the following properties: | |
303 | |
304 @table @code | |
305 @item name | |
306 A symbol naming the charset. Every charset must have a different name; | |
307 this allows a charset to be referred to using its name rather than | |
308 the actual charset object. | |
309 @item doc-string | |
310 A documentation string describing the charset. | |
311 @item registry | |
312 A regular expression matching the font registry field for this character | |
313 set. For example, both the @code{ascii} and @code{latin-iso8859-1} | |
314 charsets use the registry @code{"ISO8859-1"}. This field is used to | |
315 choose an appropriate font when the user gives a general font | |
316 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a | |
317 14-point upright medium-weight Courier font. | |
318 @item dimension | |
319 Number of position codes used to index a character in the character set. | |
320 XEmacs/MULE can only handle character sets of dimension 1 or 2. | |
321 This property defaults to 1. | |
322 @item chars | |
323 Number of characters in each dimension. In XEmacs/MULE, the only | |
324 allowed values are 94 or 96. (There are a couple of pre-defined | |
325 character sets, such as ASCII, that do not follow this, but you cannot | |
326 define new ones like this.) Defaults to 94. Note that if the dimension | |
327 is 2, the character set thus described is 94x94 or 96x96. | |
328 @item columns | |
329 Number of columns used to display a character in this charset. | |
330 Only used in TTY mode. (Under X, the actual width of a character | |
331 can be derived from the font used to display the characters.) | |
332 If unspecified, defaults to the dimension. (This is almost | |
333 always the correct value, because character sets with dimension 2 | |
334 are usually ideograph character sets, which need two columns to | |
335 display the intricate ideographs.) | |
336 @item direction | |
337 A symbol, either @code{l2r} (left-to-right) or @code{r2l} | |
338 (right-to-left). Defaults to @code{l2r}. This specifies the | |
339 direction that the text should be displayed in, and will be | |
340 left-to-right for most charsets but right-to-left for Hebrew | |
341 and Arabic. (Right-to-left display is not currently implemented.) | |
342 @item final | |
343 Final byte of the standard ISO 2022 escape sequence designating this | |
344 charset. Must be supplied. Each combination of (@var{dimension}, | |
345 @var{chars}) defines a separate namespace for final bytes, and each | |
346 charset within a particular namespace must have a different final byte. | |
347 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if | |
348 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final | |
349 bytes in the range 0x30 - 0x3F are reserved for user-defined (not | |
350 official) character sets. For more information on ISO 2022, see @ref{Coding | |
351 Systems}. | |
352 @item graphic | |
353 0 (use left half of font on output) or 1 (use right half of font on | |
354 output). Defaults to 0. This specifies how to convert the position | |
355 codes that index a character in a character set into an index into the | |
356 font used to display the character set. With @code{graphic} set to 0, | |
357 position codes 33 through 126 map to font indices 33 through 126; with | |
358 it set to 1, position codes 33 through 126 map to font indices 161 | |
359 through 254 (i.e. the same number but with the high bit set). For | |
360 example, for a font whose registry is ISO8859-1, the left half of the | |
361 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right | |
362 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset. | |
363 @item ccl-program | |
364 A compiled CCL program used to convert a character in this charset into | |
365 an index into the font. This is in addition to the @code{graphic} | |
366 property. If a CCL program is defined, the position codes of a | |
367 character will first be processed according to @code{graphic} and | |
368 then passed through the CCL program, with the resulting values used | |
369 to index the font. | |
370 | |
442 | 371 This is used, for example, in the Big5 character set (used in Taiwan). |
428 | 372 This character set is not ISO-2022-compliant, and its size (94x157) does |
373 not fit within the maximum 96x96 size of ISO-2022-compliant character | |
374 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, | |
375 so as to group the most commonly used characters together) into two | |
376 charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each of size 94x94, | |
377 and each charset object uses a CCL program to convert the modified | |
378 position codes back into standard Big5 indices to retrieve a character | |
379 from a Big5 font. | |
380 @end table | |
381 | |
442 | 382 Most of the above properties can only be set when the charset is |
383 initialized, and cannot be changed later. | |
384 @xref{Charset Property Functions}. | |
428 | 385 |
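The effect of the @code{graphic} property on font indexing amounts to
setting or not setting the high bit. A minimal Python sketch of that
arithmetic follows; the function name @code{font_index} is hypothetical
and any CCL program is ignored.

```python
def font_index(position_code, graphic):
    """Map a position code (33..126) to a font index per `graphic`."""
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    # graphic 0 uses the left half of the font unchanged;
    # graphic 1 sets the high bit to reach the right half.
    return position_code | 0x80 if graphic else position_code

print(font_index(33, 0), font_index(33, 1), font_index(126, 1))
```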
442 | 386 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets |
428 | 387 @subsection Basic Charset Functions |
388 | |
389 @defun find-charset charset-or-name | |
390 This function retrieves the charset of the given name. If | |
391 @var{charset-or-name} is a charset object, it is simply returned. | |
392 Otherwise, @var{charset-or-name} should be a symbol. If there is no | |
393 such charset, @code{nil} is returned. Otherwise the associated charset | |
394 object is returned. | |
395 @end defun | |
396 | |
397 @defun get-charset name | |
398 This function retrieves the charset of the given name. Same as | |
399 @code{find-charset} except an error is signalled if there is no such | |
400 charset instead of returning @code{nil}. | |
401 @end defun | |
402 | |
403 @defun charset-list | |
404 This function returns a list of the names of all defined charsets. | |
405 @end defun | |
406 | |
407 @defun make-charset name doc-string props | |
408 This function defines a new character set. This function is for use | |
442 | 409 with MULE support. @var{name} is a symbol, the name by which the |
428 | 410 character set is normally referred to. @var{doc-string} is a string |
411 describing the character set. @var{props} is a property list, | |
412 describing the specific nature of the character set. The recognized | |
413 properties are @code{registry}, @code{dimension}, @code{columns}, | |
414 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and | |
415 @code{ccl-program}, as previously described. | |
416 @end defun | |
417 | |
418 @defun make-reverse-direction-charset charset new-name | |
419 This function makes a charset equivalent to @var{charset} but which goes | |
420 in the opposite direction. @var{new-name} is the name of the new | |
421 charset. The new charset is returned. | |
422 @end defun | |
423 | |
424 @defun charset-from-attributes dimension chars final &optional direction | |
425 This function returns a charset with the given @var{dimension}, | |
426 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is | |
427 omitted, both directions will be checked (left-to-right will be returned | |
428 if character sets exist for both directions). | |
429 @end defun | |
430 | |
431 @defun charset-reverse-direction-charset charset | |
432 This function returns the charset (if any) with the same dimension, | |
433 number of characters, and final byte as @var{charset}, but which is | |
434 displayed in the opposite direction. | |
435 @end defun | |
436 | |
442 | 437 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets |
428 | 438 @subsection Charset Property Functions |
439 | |
442 | 440 All of these functions accept either a charset name or charset object. |
428 | 441 |
442 @defun charset-property charset prop | |
443 This function returns property @var{prop} of @var{charset}. | |
444 @xref{Charset Properties}. | |
445 @end defun | |
446 | |
442 | 447 Convenience functions are also provided for retrieving individual |
428 | 448 properties of a charset. |
449 | |
450 @defun charset-name charset | |
451 This function returns the name of @var{charset}. This will be a symbol. | |
452 @end defun | |
453 | |
444 | 454 @defun charset-description charset |
455 This function returns the documentation string of @var{charset}. | |
428 | 456 @end defun |
457 | |
458 @defun charset-registry charset | |
459 This function returns the registry of @var{charset}. | |
460 @end defun | |
461 | |
462 @defun charset-dimension charset | |
463 This function returns the dimension of @var{charset}. | |
464 @end defun | |
465 | |
466 @defun charset-chars charset | |
467 This function returns the number of characters per dimension of | |
468 @var{charset}. | |
469 @end defun | |
470 | |
444 | 471 @defun charset-width charset |
428 | 472 This function returns the number of display columns per character (in |
473 TTY mode) of @var{charset}. | |
474 @end defun | |
475 | |
476 @defun charset-direction charset | |
440 | 477 This function returns the display direction of @var{charset}---either |
428 | 478 @code{l2r} or @code{r2l}. |
479 @end defun | |
480 | |
444 | 481 @defun charset-iso-final-char charset |
428 | 482 This function returns the final byte of the ISO 2022 escape sequence |
483 designating @var{charset}. | |
484 @end defun | |
485 | |
444 | 486 @defun charset-iso-graphic-plane charset |
428 | 487 This function returns either 0 or 1, depending on whether the position |
488 codes of characters in @var{charset} map to the left or right half | |
489 of their font, respectively. | |
490 @end defun | |
491 | |
492 @defun charset-ccl-program charset | |
493 This function returns the CCL program, if any, for converting | |
494 position codes of characters in @var{charset} into font indices. | |
495 @end defun | |
496 | |
1734 | 497 The two properties of a charset that can currently be set after the |
498 charset has been created are the CCL program and the font registry. | |
428 | 499 |
500 @defun set-charset-ccl-program charset ccl-program | |
501 This function sets the @code{ccl-program} property of @var{charset} to | |
502 @var{ccl-program}. | |
503 @end defun | |
504 | |
1734 | 505 @defun set-charset-registry charset registry |
506 This function sets the @code{registry} property of @var{charset} to | |
507 @var{registry}. | |
508 @end defun | |
509 | |
442 | 510 @node Predefined Charsets, , Charset Property Functions, Charsets |
428 | 511 @subsection Predefined Charsets |
512 | |
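442 | 513 The following charsets are predefined in the C code. In the tables below, @samp{Type} gives the number of characters in each dimension, @samp{Fi} the final byte, @samp{Gr} the @code{graphic} property, and @samp{Dir} the direction. |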
428 | 514 |
515 @example | |
516 Name Type Fi Gr Dir Registry | |
517 -------------------------------------------------------------- | |
518 ascii 94 B 0 l2r ISO8859-1 | |
519 control-1 94 0 l2r --- | |
520 latin-iso8859-1 96 A 1 l2r ISO8859-1 | |
521 latin-iso8859-2 96 B 1 l2r ISO8859-2 | |
522 latin-iso8859-3 96 C 1 l2r ISO8859-3 | |
523 latin-iso8859-4 96 D 1 l2r ISO8859-4 | |
524 cyrillic-iso8859-5 96 L 1 l2r ISO8859-5 | |
525 arabic-iso8859-6 96 G 1 r2l ISO8859-6 | |
526 greek-iso8859-7 96 F 1 l2r ISO8859-7 | |
527 hebrew-iso8859-8 96 H 1 r2l ISO8859-8 | |
528 latin-iso8859-9 96 M 1 l2r ISO8859-9 | |
529 thai-tis620 96 T 1 l2r TIS620 | |
530 katakana-jisx0201 94 I 1 l2r JISX0201.1976 | |
531 latin-jisx0201 94 J 0 l2r JISX0201.1976 | |
532 japanese-jisx0208-1978 94x94 @@ 0 l2r JISX0208.1978 | |
533 japanese-jisx0208 94x94 B 0 l2r JISX0208.19(83|90) | |
534 japanese-jisx0212 94x94 D 0 l2r JISX0212 | |
535 chinese-gb2312 94x94 A 0 l2r GB2312 | |
536 chinese-cns11643-1 94x94 G 0 l2r CNS11643.1 | |
537 chinese-cns11643-2 94x94 H 0 l2r CNS11643.2 | |
538 chinese-big5-1 94x94 0 0 l2r Big5 | |
539 chinese-big5-2 94x94 1 0 l2r Big5 | |
540 korean-ksc5601 94x94 C 0 l2r KSC5601 | |
541 composite 96x96 0 l2r --- | |
542 @end example | |
543 | |
442 | 544 The following charsets are predefined in the Lisp code. |
428 | 545 |
546 @example | |
547 Name                    Type   Fi  Gr  Dir  Registry |
548 -------------------------------------------------------------- |
549 arabic-digit            94     2   0   l2r  MuleArabic-0 |
550 arabic-1-column         94     3   0   r2l  MuleArabic-1 |
551 arabic-2-column         94     4   0   r2l  MuleArabic-2 |
552 sisheng                 94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH |
553 chinese-cns11643-3      94x94  I   0   l2r  CNS11643.1 |
554 chinese-cns11643-4      94x94  J   0   l2r  CNS11643.1 |
555 chinese-cns11643-5      94x94  K   0   l2r  CNS11643.1 |
556 chinese-cns11643-6      94x94  L   0   l2r  CNS11643.1 |
557 chinese-cns11643-7      94x94  M   0   l2r  CNS11643.1 |
558 ethiopic                94x94  2   0   l2r  Ethio |
559 ascii-r2l               94     B   0   r2l  ISO8859-1 |
560 ipa                     96     0   1   l2r  MuleIPA |
1734 | 561 vietnamese-viscii-lower 96     1   1   l2r  VISCII1.1 |
562 vietnamese-viscii-upper 96     2   1   l2r  VISCII1.1 |
428 | 563 @end example |
564 | |
565 For all of the above charsets, the dimension and number of columns are | |
566 the same. | |
567 | |
442 | 568 Note that ASCII, Control-1, and Composite are handled specially. |
428 | 569 This is why some of the fields are blank; and some of the filled-in |
570 fields (e.g. the type) are not really accurate. | |
571 | |
442 | 572 @node MULE Characters, Composite Characters, Charsets, MULE |
428 | 573 @section MULE Characters |
574 | |
575 @defun make-char charset arg1 &optional arg2 | |
576 This function makes a multi-byte character from @var{charset} and octets | |
577 @var{arg1} and @var{arg2}. | |
578 @end defun | |
579 | |
444 | 580 @defun char-charset character |
581 This function returns the character set of char @var{character}. | |
428 | 582 @end defun |
583 | |
444 | 584 @defun char-octet character &optional n |
428 | 585 This function returns the octet (i.e. position code) numbered @var{n} |
444 | 586 (should be 0 or 1) of char @var{character}. @var{n} defaults to 0 if omitted. |
428 | 587 @end defun |
588 | |
589 @defun find-charset-region start end &optional buffer | |
590 This function returns a list of the charsets in the region between | |
591 @var{start} and @var{end}. @var{buffer} defaults to the current buffer | |
592 if omitted. | |
593 @end defun | |
594 | |
595 @defun find-charset-string string | |
596 This function returns a list of the charsets in @var{string}. | |
597 @end defun | |
598 | |
442 | 599 @node Composite Characters, Coding Systems, MULE Characters, MULE |
428 | 600 @section Composite Characters |
601 | |
442 | 602 Composite characters are not yet completely implemented. |
428 | 603 |
604 @defun make-composite-char string | |
605 This function converts a string into a single composite character. The | |
606 character is the result of overstriking all the characters in the | |
607 string. | |
608 @end defun | |
609 | |
444 | 610 @defun composite-char-string character |
428 | 611 This function returns a string of the characters comprising a composite |
612 character. | |
613 @end defun | |
614 | |
615 @defun compose-region start end &optional buffer | |
616 This function composes the characters in the region from @var{start} to | |
617 @var{end} in @var{buffer} into one composite character. The composite | |
618 character replaces the composed characters. @var{buffer} defaults to | |
619 the current buffer if omitted. | |
620 @end defun | |
621 | |
622 @defun decompose-region start end &optional buffer | |
623 This function decomposes any composite characters in the region from | |
624 @var{start} to @var{end} in @var{buffer}. This converts each composite | |
625 character into one or more characters, the individual characters out of | |
626 which the composite character was formed. Non-composite characters are | |
627 left as-is. @var{buffer} defaults to the current buffer if omitted. | |
628 @end defun | |
629 | |
442 | 630 @node Coding Systems, CCL, Composite Characters, MULE |
631 @section Coding Systems | |
632 | |
633 A coding system is an object that defines how text containing multiple | |
634 character sets is encoded into a stream of (typically 8-bit) bytes. The | |
635 coding system is used to decode the stream into a series of characters | |
636 (which may be from multiple charsets) when the text is read from a file | |
637 or process, and is used to encode the text back into the same format | |
638 when it is written out to a file or process. | |
639 | |
640 For example, many ISO-2022-compliant coding systems (such as Compound | |
641 Text, which is used for inter-client data under the X Window System) use | |
642 escape sequences to switch between different charsets -- Japanese Kanji, | |
643 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with | |
644 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See | |
645 @code{make-coding-system} for more information. | |
646 | |
647 Coding systems are normally identified using a symbol, and the symbol is | |
648 accepted in place of the actual coding system object whenever a coding | |
649 system is called for. (This is similar to how faces and charsets work.) | |
650 | |
651 @defun coding-system-p object | |
652 This function returns non-@code{nil} if @var{object} is a coding system. | |
653 @end defun | |
428 | 654 |
442 | 655 @menu |
656 * Coding System Types:: Classifying coding systems. | |
657 * ISO 2022:: An international standard for | |
658 charsets and encodings. | |
659 * EOL Conversion:: Dealing with different ways of denoting | |
660 the end of a line. | |
661 * Coding System Properties:: Properties of a coding system. | |
662 * Basic Coding System Functions:: Working with coding systems. | |
663 * Coding System Property Functions:: Retrieving a coding system's properties. | |
664 * Encoding and Decoding Text:: Encoding and decoding text. | |
665 * Detection of Textual Encoding:: Determining how text is encoded. | |
666 * Big5 and Shift-JIS Functions:: Special functions for these non-standard | |
667 encodings. | |
668 * Predefined Coding Systems:: Coding systems implemented by MULE. | |
669 @end menu | |
428 | 670 |
442 | 671 @node Coding System Types, ISO 2022, , Coding Systems |
672 @subsection Coding System Types | |
673 | |
674 The coding system type determines the basic algorithm XEmacs will use to | |
675 decode or encode a data stream. Character encodings will be converted | |
676 to the MULE encoding, escape sequences processed, and newline sequences | |
677 converted to XEmacs's internal representation. There are three basic | |
678 classes of coding system type: no-conversion, ISO-2022, and special. | |
679 | |
680 No conversion allows you to look at the file's internal representation. | |
681 Since XEmacs is basically a text editor, ``no conversion'' does convert |
682 newline conventions by default. (Use the @code{binary} coding system if |
683 this is not desired.) |
428 | 684 |
442 | 685 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating |
686 use of "coded character sets for the exchange of data", i.e., text |
687 streams. ISO 2022 contains functions that make it possible to encode | |
688 text streams to comply with restrictions of the Internet mail system and | |
689 de facto restrictions of most file systems (e.g., use of the separator |
690 character in file names). Coding systems which are not ISO 2022 | |
691 conformant can be difficult to handle. Perhaps more important, they are | |
692 not adaptable to multilingual information interchange, with the obvious | |
693 exception of ISO 10646 (Unicode). (Unicode is partially supported by | |
694 XEmacs with the addition of the Lisp package ucs-conv.) | |
695 | |
696 The special class of coding systems includes automatic detection, CCL (a | |
697 "little language" embedded as an interpreter, useful for translating | |
698 between variants of a single character set), non-ISO-2022-conformant | |
699 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding. | |
700 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly | |
701 for other versions of XEmacs and for GNU Emacs 20.) | |
702 | |
703 @table @code | |
704 @item no-conversion | |
705 No conversion, for binary files, and a few special cases of non-ISO-2022 | |
706 coding systems where conversion is done by hook functions (usually | |
707 implemented in CCL). On output, graphic characters that are not in | |
708 ASCII or Latin-1 will be replaced by a @samp{?}. (For a | |
709 no-conversion-encoded buffer, these characters will only be present if | |
710 you explicitly insert them.) | |
711 @item iso2022 | |
712 Any ISO-2022-compliant encoding. Among others, this includes JIS (the | |
713 Japanese encoding commonly used for e-mail), national variants of EUC | |
714 (the standard Unix encoding for Japanese and other languages), and | |
715 Compound Text (an encoding used in X11). You can specify more specific | |
716 information about the conversion with the @var{flags} argument. | |
717 @item ucs-4 | |
718 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode. | |
719 @item utf-8 | |
720 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format | |
721 that can be used with both UCS-4 and Unicode. | |
722 @item undecided | |
723 Automatic conversion. XEmacs attempts to detect the coding system used | |
724 in the file. | |
725 @item shift-jis | |
726 Shift-JIS (a Japanese encoding commonly used in PC operating systems). | |
727 @item big5 | |
728 Big5 (an encoding for Traditional Chinese, commonly used in Taiwan). |
729 @item ccl | |
730 The conversion is performed using a user-written pseudo-code program. | |
731 CCL (Code Conversion Language) is the name of this pseudo-code. For | |
732 example, CCL is used to map KOI8-R characters (an encoding for Russian | |
733 Cyrillic) to ISO8859-5 (the form used internally by MULE). | |
734 @item internal | |
735 Write out or read in the raw contents of the memory representing the | |
736 buffer's text. This is primarily useful for debugging purposes, and is | |
737 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set | |
738 (the @samp{--debug} configure option). @strong{Warning}: Reading in a | |
739 file using @code{internal} conversion can result in an internal | |
740 inconsistency in the memory representing a buffer's text, which will | |
741 produce unpredictable results and may cause XEmacs to crash. Under | |
742 normal circumstances you should never use @code{internal} conversion. | |
428 | 743 @end table |
744 | |
442 | 745 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems |
746 @subsection ISO 2022 |
747 | |
748 This section briefly describes the ISO 2022 encoding standard. A more | |
749 thorough treatment is available in the original document of ISO | |
750 2022 as well as various national standards (such as JIS X 0202). | |
428 | 751 |
442 | 752 Character sets (@dfn{charsets}) are classified into the following four |
753 categories, according to the number of characters in the charset: | |
754 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means | |
755 that although an ISO 2022 coding system may have variable width | |
756 characters, each charset used is fixed-width (in contrast to the MULE | |
757 character set and UTF-8, for example). | |
758 | |
759 ISO 2022 provides for switching between character sets via escape | |
760 sequences. This switching is somewhat complicated, because ISO 2022 | |
761 provides for both legacy applications like Internet mail that accept | |
444 | 762 only 7 significant bits in some contexts (RFC 822 headers, for example), |
442 | 763 and more modern "8-bit clean" applications. It also provides for |
764 compact and transparent representation of languages like Japanese which | |
765 mix ASCII and a national script (even outside of computer programs). | |
428 | 766 |
442 | 767 First, ISO 2022 codified prevailing practice by dividing the code space |
768 into "control" and "graphic" regions. The code points 0x00-0x1F and | |
769 0x80-0x9F are reserved for "control characters", while "graphic | |
770 characters" must be assigned to code points in the regions 0x20-0x7F and | |
771 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some | |
772 circumstances must be assigned the graphic character "ASCII SPACE" and | |
773 the control character "ASCII DEL" respectively. | |
428 | 774 |
442 | 775 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F), |
776 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" | |
777 and "graphic right", respectively, because of the standard method of | |
778 displaying graphic character sets in tables with the high nibble indexing |
444 | 779 columns and the low nibble indexing rows. I don't find it very intuitive, |
442 | 780 but these are called "registers". |
781 | |
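The division of the code space described above can be sketched in a few lines of Python. This is only an illustration of the four ranges, not code taken from XEmacs; the function name @code{iso2022_region} is hypothetical.

```python
def iso2022_region(byte: int) -> str:
    """Return the ISO 2022 region name (C0, GL, C1 or GR) for one byte value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in the range 0..255")
    if byte <= 0x1F:
        return "C0"   # control characters
    if byte <= 0x7F:
        return "GL"   # graphic left; 0x20 and 0x7F are the special positions
    if byte <= 0x9F:
        return "C1"   # control characters with the high bit set
    return "GR"       # graphic right


# Classify ESC, 'A', CSI and the Latin-1 code point of e-acute.
regions = [iso2022_region(b) for b in (0x1B, 0x41, 0x9B, 0xE9)]
print(regions)
```

Note that the sketch reports 0x20 and 0x7F as GL; whether they act as SPACE and DEL or as ordinary graphic positions depends on the charset invoked there.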
782 An ISO 2022-conformant encoding for a graphic character set must use a | |
783 fixed number of bytes per character, and the values must fit into a | |
784 single register; that is, each byte must range over either 0x20-0x7F, or | |
785 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a | |
786 character set by using both ranges at the same time. This is why a standard |
787 character set such as ISO 8859-1 is actually considered by ISO 2022 to | |
788 be an aggregation of two character sets, ASCII and LATIN-1, and why it | |
789 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a | |
790 single character's bytes must all be drawn from the same register; this | |
791 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO | |
792 2022-compatible encodings. | |
428 | 793 |
442 | 794 The reason for this restriction becomes clear when you attempt to define |
795 an efficient, robust encoding for a language like Japanese. Like ISO | |
796 8859, Japanese encodings are aggregations of several character sets. In | |
797 practice, the vast majority of characters are drawn from the "JIS Roman" | |
798 character set (a derivative of ASCII; it won't hurt to think of it as | |
799 ASCII) and the JIS X 0208 standard "basic Japanese" character set | |
800 including not only ideographic characters ("kanji") but syllabic | |
801 Japanese characters ("kana"), a wide variety of symbols, and many | |
802 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although | |
803 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not | |
804 suited to programming; thus the inclusion of ASCII in the standard | |
805 Japanese encodings. | |
428 | 806 |
442 | 807 For normal Japanese text such as in newspapers, a broad repertoire of |
808 approximately 3000 characters is used. Evidently this won't fit into | |
809 one byte; two must be used. But much of the text processed by Japanese | |
810 computers is computer source code, nearly all of which is ASCII. A not | |
811 insignificant portion of ordinary text is English (as such or as | |
812 borrowed Japanese vocabulary) or other languages which can be represented |
813 at least approximately in ASCII, as well. It seems reasonable then to | |
814 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly | |
815 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is | |
816 invoked to the GL register, and JIS X 0208 is invoked to the GR | |
817 register. Thus, each byte can be tested for its character set by | |
818 looking at the high bit; if set, it is Japanese, if clear, it is ASCII. | |
819 Furthermore, since control characters like newline can never be part of | |
820 a graphic character, even in the case of corruption in transmission the | |
821 stream will be resynchronized at every line break, on the order of 60-80 | |
822 bytes. This coding system requires no escape sequences or special | |
823 control codes to represent 99.9% of all Japanese text. | |
428 | 824 |
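The high-bit test described above can be tried with the EUC-JP codec from the Python standard library. The following is a minimal sketch under stated assumptions: the function name @code{split_euc_jp} is hypothetical, only ASCII and two-byte JIS X 0208 are handled, and the single-shift codes are rejected rather than interpreted.

```python
def split_euc_jp(data: bytes) -> list:
    """Split EUC-JP bytes into (charset, bytes) units by testing the high bit.

    Simplification: only ASCII and two-byte JIS X 0208 are handled; the
    single-shift codes SS2 (0x8E) and SS3 (0x8F) are rejected.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # high bit clear: one ASCII byte (GL)
            units.append(("ascii", data[i:i + 1]))
            i += 1
        elif 0xA1 <= b <= 0xFE:           # high bit set: two GR bytes
            units.append(("jisx0208", data[i:i + 2]))
            i += 2
        else:
            raise ValueError("byte 0x%02X is not handled by this sketch" % b)
    return units


# "a" followed by HIRAGANA LETTER A (U+3042).
encoded = "a\u3042".encode("euc_jp")
for charset, raw in split_euc_jp(encoded):
    print(charset, raw.hex())
```

Because every byte of a two-byte character has the high bit set, a scanner that loses its place can resynchronize at the next ASCII byte or control character.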
442 | 825 Note carefully the distinction between the character sets (ASCII and JIS |
826 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The | |
827 JIS X 0208 character set is used in three different encodings for | |
828 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is | |
829 always clear), in EUC-JP it is invoked into GR (setting the high bit in | |
830 the process), and in Shift JIS the high bit may be set or reset, and the | |
831 significant bits are shifted within the 16-bit character so that the two | |
832 main character sets can coexist with a third (the "halfwidth katakana" | |
833 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a | |
834 version of the ISO-2022 coding system. | |
428 | 835 |
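To make the distinction concrete, the sketch below encodes one JIS X 0208 character in all three encodings using the codecs shipped with Python. It illustrates the byte layouts only and does not show how MULE itself performs the conversion.

```python
# HIRAGANA LETTER A, which JIS X 0208 places at row 4, cell 2 (bytes 0x24 0x22).
text = "\u3042"

iso2022_jp = text.encode("iso2022_jp")   # designated to G0, invoked into GL
euc_jp = text.encode("euc_jp")           # invoked into GR
shift_jis = text.encode("shift_jis")     # bits rearranged; not ISO 2022

# ISO-2022-JP: short-form designation ESC $ B, two GL bytes, then back to ASCII.
designation, payload, reset = iso2022_jp[:3], iso2022_jp[3:5], iso2022_jp[5:]

# EUC-JP carries the same two position codes with the high bit set.
payload_in_gr = bytes(b | 0x80 for b in payload)

print(designation, payload.hex(), reset)
print(euc_jp.hex(), payload_in_gr.hex())
print(shift_jis.hex())
```

The Shift JIS bytes cannot be derived from the position codes by setting or clearing a single bit, which is one way to see why it falls outside ISO 2022.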
442 | 836 In order to systematically treat subsidiary character sets (like the |
837 "halfwidth katakana" already mentioned, and the "supplementary kanji" of | |
838 JIS X 0212), four further registers are defined: G0, G1, G2, and G3. | |
839 Unlike GL and GR, they are not logically distinguished by internal | |
840 format. Instead, the process of "invocation" mentioned earlier is | |
841 broken into two steps: first, a character set is @dfn{designated} to one | |
842 of the registers G0-G3 by use of an @dfn{escape sequence} of the form: | |
428 | 843 |
844 @example | |
440 | 845 ESC [@var{I}] @var{I} @var{F} |
428 | 846 @end example |
847 | |
442 | 848 where @var{I} is an intermediate character or characters in the range |
849 0x20-0x2F, and @var{F}, from the range 0x30-0x7E, is the final |
850 character identifying this charset. (Final characters in the range | |
851 0x30-0x3F are reserved for private use and will never have a publicly | |
852 registered meaning.) | |
853 | |
854 Then that register is @dfn{invoked} to either GL or GR, either | |
855 automatically (designations to G0 normally involve invocation to GL as | |
856 well), or by use of shifting (affecting only the following character in | |
857 the data stream) or locking (effective until the next designation or | |
858 locking) control sequences. An encoding conformant to ISO 2022 is | |
859 typically defined by designating the initial contents of the G0-G3 | |
901 | 860 registers, specifying a 7 or 8 bit environment, and specifying whether |
442 | 861 further designations will be recognized. |
862 | |
863 Some examples of character sets and the registered final characters | |
864 @var{F} used to designate them: | |
428 | 865 |
442 | 866 @need 1000 |
867 @table @asis | |
868 @item 94-charset | |
869 ASCII (B), left (J) and right (I) half of JIS X 0201, ... | |
870 @item 96-charset | |
871 Latin-1 (A), Latin-2 (B), Latin-3 (C), ... | |
872 @item 94x94-charset | |
873 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ... | |
874 @item 96x96-charset | |
875 none for the moment | |
876 @end table | |
877 | |
878 The meanings of the various characters in these sequences, where not | |
879 specified by the ISO 2022 standard (such as the ESC character), are | |
880 assigned by @dfn{ECMA}, the European Computer Manufacturers Association. | |
881 | |
882 The meanings of the intermediate characters are: |
428 | 883 |
884 @example | |
885 @group | |
440 | 886 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). |
887 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | |
888 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | |
889 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | |
890 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | |
442 | 891 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. |
440 | 892 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. |
893 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | |
894 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | |
428 | 895 @end group |
896 @end example | |
897 | |
442 | 898 The comma may be used in files read and written only by MULE, as a MULE |
899 extension, but this is illegal in ISO 2022. (The reason is that in ISO | |
900 2022 G0 must be a 94-member character set, with 0x20 assigned the value | |
901 SPACE, and 0x7F assigned the value DEL.) | |
428 | 902 |
442 | 903 Here are examples of designations: |
428 | 904 |
905 @example | |
906 @group | |
440 | 907 ESC ( B : designate to G0 ASCII |
908 ESC - A : designate to G1 Latin-1 | |
909 ESC $ ( A or ESC $ A : designate to G0 GB2312 | |
910 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | |
911 ESC $ ) C : designate to G1 KSC5601 | |
428 | 912 @end group |
913 @end example | |
914 | |
442 | 915 (The short forms used to designate GB2312 and JIS X 0208 are for |
916 backwards compatibility; the long forms are preferred.) | |
917 | |
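The designation rules above are regular enough to parse mechanically. The following Python sketch decodes a single designation sequence into its register, charset size, dimension, and final character. The names @code{INTERMEDIATES} and @code{parse_designation} are hypothetical, and the sketch accepts the MULE comma extension and the short forms without checking whether a particular coding system allows them.

```python
# Intermediate byte -> (register, charset size), following the table above.
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Parse one designation sequence into (register, size, dimension, final)."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == 0x24:                   # '$' marks a dimension-2 charset
        dimension = 2
        body = body[1:]
    if len(body) == 1 and dimension == 2:
        # Short form such as ESC $ B: G0 and a 94x94 charset are implied.
        register, size = "G0", 94
    elif len(body) == 2 and body[0] in INTERMEDIATES:
        register, size = INTERMEDIATES[body[0]]
        body = body[1:]
    else:
        raise ValueError("unrecognized designation sequence")
    if body[0] < 0x30:
        raise ValueError("final character expected")
    return register, size, dimension, chr(body[0])


for seq in (b"\x1b(B", b"\x1b-A", b"\x1b$(A", b"\x1b$B", b"\x1b$)C"):
    print(seq, parse_designation(seq))
```

The final character alone does not identify a charset: B means ASCII as a 94-charset, Latin-2 as a 96-charset, and JIS X 0208 as a 94x94-charset, so the size and dimension must be kept alongside it.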
918 To use a charset designated to G2 or G3, and to use a charset designated | |
428 | 919 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 |
920 into GL. There are two types of invocation, Locking Shift (forever) and | |
921 Single Shift (one character only). | |
922 | |
442 | 923 Locking Shift is done as follows: |
428 | 924 |
925 @example | |
440 | 926 LS0 or SI (0x0F): invoke G0 into GL |
927 LS1 or SO (0x0E): invoke G1 into GL | |
928 LS2: invoke G2 into GL | |
929 LS3: invoke G3 into GL | |
930 LS1R: invoke G1 into GR | |
931 LS2R: invoke G2 into GR | |
932 LS3R: invoke G3 into GR | |
428 | 933 @end example |
934 | |
442 | 935 Single Shift is done as follows: |
428 | 936 |
937 @example | |
938 @group | |
440 | 939 SS2 or ESC N: invoke G2 into GL |
940 SS3 or ESC O: invoke G3 into GL | |
428 | 941 @end group |
942 @end example | |
943 | |
442 | 944 The shift functions (such as LS1R and SS3) are represented by control |
945 characters (from C1) in 8 bit environments and by escape sequences in 7 | |
946 bit environments. | |
947 | |
428 | 948 (#### Ben says: I think the above is slightly incorrect. It appears that |
949 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | |
444 | 950 ESC O behave as indicated. The above definitions will not parse |
428 | 951 EUC-encoded text correctly, and it looks like the code in mule-coding.c |
952 has similar problems.) | |
953 | |
442 | 954 Evidently there are many ISO-2022-compliant ways of encoding |
955 multilingual text. Many coding systems in actual use, such as X11's |
956 Compound Text, the Japanese JUNET code, and the so-called EUC (Extended |
957 UNIX Code), are variants of ISO 2022. |
428 | 958 |
442 | 959 In MULE, we characterize a version of ISO 2022 by the following |
960 attributes: | |
428 | 961 |
962 @enumerate | |
963 @item | |
442 | 964 The character sets initially designated to G0 through G3. |
428 | 965 @item |
442 | 966 Whether short form designations are allowed for Japanese and Chinese. |
428 | 967 @item |
442 | 968 Whether ASCII should be designated to G0 before control characters. |
428 | 969 @item |
442 | 970 Whether ASCII should be designated to G0 at the end of line. |
428 | 971 @item |
972 7-bit environment or 8-bit environment. | |
973 @item | |
442 | 974 Whether Locking Shifts are used or not. |
428 | 975 @item |
442 | 976 Whether to use ASCII or the variant JIS X 0201-1976-Roman. |
428 | 977 @item |
442 | 978 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1978. |
428 | 979 @end enumerate |
980 | |
981 (The last two are only for Japanese.) | |
982 | |
442 | 983 By specifying these attributes, you can create any variant |
428 | 984 of ISO 2022. |
985 | |
442 | 986 Here are several examples: |
428 | 987 |
988 @example | |
989 @group | |
442 | 990 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468). |
440 | 991 1. G0 <- ASCII, G1..3 <- never used |
992 2. Yes. | |
993 3. Yes. | |
994 4. Yes. | |
995 5. 7-bit environment | |
996 6. No. | |
997 7. Use ASCII | |
442 | 998 8. Use JIS X 0208-1983 |
428 | 999 @end group |
1000 | |
1001 @group | |
442 | 1002 ctext -- X11 Compound Text |
1003 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used. | |
440 | 1004 2. No. |
1005 3. No. | |
1006 4. Yes. | |
442 | 1007 5. 8-bit environment. |
440 | 1008 6. No. |
442 | 1009 7. Use ASCII. |
1010 8. Use JIS X 0208-1983. | |
428 | 1011 @end group |
1012 | |
1013 @group | |
442 | 1014 euc-china -- Chinese EUC. Often called the "GB encoding", but that is |
1015 technically incorrect. | |
1016 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used. | |
440 | 1017 2. No. |
1018 3. Yes. | |
1019 4. Yes. | |
442 | 1020 5. 8-bit environment. |
440 | 1021 6. No. |
442 | 1022 7. Use ASCII. |
1023 8. Use JIS X 0208-1983. | |
428 | 1024 @end group |
1025 | |
1026 @group | |
442 | 1027 ISO-2022-KR -- Coding system used in Korean email. |
1028 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used. | |
440 | 1029 2. No. |
1030 3. Yes. | |
1031 4. Yes. | |
442 | 1032 5. 7-bit environment. |
440 | 1033 6. Yes. |
442 | 1034 7. Use ASCII. |
1035 8. Use JIS X 0208-1983. | |
428 | 1036 @end group |
1037 @end example | |
1038 | |
442 | 1039 MULE creates all of these coding systems by default. |
428 | 1040 |
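Attributes 5 and 6 of the ISO-2022-KR example (7-bit environment, locking shifts) can be observed in the output of the @code{iso2022_kr} codec from the Python standard library. This is a sketch using Python's implementation, not MULE's; the two need not agree on details such as where in the stream the designation sequence is written.

```python
SO, SI = 0x0E, 0x0F                   # LS1 and LS0 in a 7-bit environment

# "a", HANGUL SYLLABLE GA (a KS C 5601 character), "b".
encoded = "a\uac00b".encode("iso2022_kr")

print(encoded)
print("7-bit only:", all(b < 0x80 for b in encoded))
print("designates KS C 5601 to G1:", b"\x1b$)C" in encoded)
print("uses locking shifts:", SO in encoded and SI in encoded)
```

The Korean character is bracketed by SO and SI, so every byte stays below 0x80 even though KS C 5601 is designated to G1.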
442 | 1041 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems |
428 | 1042 @subsection EOL Conversion |
1043 | |
1044 @table @code | |
1045 @item nil | |
1046 Automatically detect the end-of-line type (LF, CRLF, or CR). Also | |
1047 generate subsidiary coding systems named @code{@var{name}-unix}, | |
1048 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to | |
1049 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf}, | |
1050 and @code{cr}, respectively. | |
1051 @item lf | |
1052 The end of a line is marked externally using ASCII LF. Since this is | |
1053 also the way that XEmacs represents an end-of-line internally, | |
1054 specifying this option results in no end-of-line conversion. This is | |
1055 the standard format for Unix text files. | |
1056 @item crlf | |
1057 The end of a line is marked externally using ASCII CRLF. This is the | |
1058 standard format for MS-DOS text files. | |
1059 @item cr | |
1060 The end of a line is marked externally using ASCII CR. This is the | |
1061 standard format for Macintosh text files. | |
1062 @item t | |
1063 Automatically detect the end-of-line type but do not generate subsidiary | |
1064 coding systems. (This value is converted to @code{nil} when stored | |
1065 internally, and @code{coding-system-property} will return @code{nil}.) | |
1066 @end table | |
1067 | |
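The three explicit EOL types differ only in the byte sequence that marks a line break. The Python sketch below shows one simple way to detect the convention and to normalize it to LF. It assumes that the first line break seen is representative of the whole text; this manual does not describe the detection heuristic XEmacs actually uses, and the function names are hypothetical.

```python
def detect_eol_type(data: bytes):
    """Guess the end-of-line convention from the first line break in data."""
    for i, byte in enumerate(data):
        if byte == 0x0A:                          # bare LF
            return "lf"
        if byte == 0x0D:                          # CR, possibly followed by LF
            return "crlf" if data[i + 1:i + 2] == b"\n" else "cr"
    return None                                   # no line break seen


def convert_eol_to_lf(data: bytes, eol_type: str) -> bytes:
    """Convert external line breaks to a bare LF, as is done when decoding."""
    markers = {"lf": b"\n", "crlf": b"\r\n", "cr": b"\r"}
    return data.replace(markers[eol_type], b"\n")


sample = b"first\r\nsecond\r\n"
eol = detect_eol_type(sample)
print(eol, convert_eol_to_lf(sample, eol))
```

With @code{lf} the conversion is the identity, matching the statement above that this setting results in no end-of-line conversion.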
442 | 1068 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems |
428 | 1069 @subsection Coding System Properties |
1070 | |
1071 @table @code | |
1072 @item mnemonic | |
1073 String to be displayed in the modeline when this coding system is | |
1074 active. | |
1075 | |
1076 @item eol-type | |
1077 End-of-line conversion to be used. It should be one of the types | |
1078 listed in @ref{EOL Conversion}. | |
1079 | |
442 | 1080 @item eol-lf |
444 | 1081 The coding system which is the same as this one, except that it uses the |
442 | 1082 Unix line-breaking convention. |
1083 | |
1084 @item eol-crlf | |
444 | 1085 The coding system which is the same as this one, except that it uses the |
442 | 1086 DOS line-breaking convention. |
1087 | |
1088 @item eol-cr | |
444 | 1089 The coding system which is the same as this one, except that it uses the |
442 | 1090 Macintosh line-breaking convention. |
1091 | |
428 | 1092 @item post-read-conversion |
1093 Function called after a file has been read in, to perform the decoding. | |
444 | 1094 Called with two arguments, @var{start} and @var{end}, denoting a region of |
428 | 1095 the current buffer to be decoded. |
1096 | |
1097 @item pre-write-conversion | |
1098 Function called before a file is written out, to perform the encoding. | |
444 | 1099 Called with two arguments, @var{start} and @var{end}, denoting a region of |
428 | 1100 the current buffer to be encoded. |
1101 @end table | |
1102 | |
442 | 1103 The following additional properties are recognized if @var{type} is |
428 | 1104 @code{iso2022}: |
1105 | |
1106 @table @code | |
1107 @item charset-g0 | |
1108 @itemx charset-g1 | |
1109 @itemx charset-g2 | |
1110 @itemx charset-g3 | |
1111 The character set initially designated to the G0 - G3 registers. | |
1112 The value should be one of | |
1113 | |
1114 @itemize @bullet | |
1115 @item | |
1116 A charset object (designate that character set) | |
1117 @item | |
1118 @code{nil} (do not ever use this register) | |
1119 @item | |
1120 @code{t} (no character set is initially designated to the register, but | |
1121 may be later on; this automatically sets the corresponding | |
1122 @code{force-g*-on-output} property) | |
1123 @end itemize | |
1124 | |
1125 @item force-g0-on-output | |
1126 @itemx force-g1-on-output | |
1127 @itemx force-g2-on-output | |
1128 @itemx force-g3-on-output | |
1129 If non-@code{nil}, send an explicit designation sequence on output | |
1130 before using the specified register. | |
1131 | |
1132 @item short | |
1133 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A}, | |
1134 and @samp{ESC $ B} on output in place of the full designation sequences | |
1135 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}. | |
1136 | |
1137 @item no-ascii-eol | |
1138 If non-@code{nil}, don't designate ASCII to G0 at each end of line on | |
1139 output. Setting this to non-@code{nil} also suppresses other | |
1140 state-resetting that normally happens at the end of a line. | |
1141 | |
1142 @item no-ascii-cntl | |
1143 If non-@code{nil}, don't designate ASCII to G0 before control chars on | |
1144 output. | |
1145 | |
1146 @item seven | |
1147 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit | |
1148 environment. | |
1149 | |
1150 @item lock-shift | |
1151 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or | |
1152 designation by escape sequence. | |
1153 | |
1154 @item no-iso6429 | |
1155 If non-@code{nil}, don't use ISO6429's direction specification. | |
1156 | |
1157 @item escape-quoted | |
444 | 1158 If non-@code{nil}, literal control characters that are the same as the |
428 | 1159 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in |
1160 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F), | |
1161 and CSI (0x9B)) are ``quoted'' with an escape character so that they can | |
1162 be properly distinguished from an escape sequence. (Note that doing | |
1163 this results in a non-portable encoding.) This encoding flag is used for | |
1164 byte-compiled files. Note that ESC is a good choice for a quoting | |
1165 character because there are no escape sequences whose second byte is a | |
1166 character from the Control-0 or Control-1 character sets; this is | |
1167 explicitly disallowed by the ISO 2022 standard. | |
1168 | |
1169 @item input-charset-conversion | |
1170 A list of conversion specifications, specifying conversion of characters | |
1171 in one charset to another when decoding is performed. Each | |
1172 specification is a list of two elements: the source charset, and the | |
1173 destination charset. | |
1174 | |
1175 @item output-charset-conversion | |
1176 A list of conversion specifications, specifying conversion of characters | |
1177 in one charset to another when encoding is performed. The form of each | |
1178 specification is the same as for @code{input-charset-conversion}. | |
1179 @end table | |
1180 | |
442 | 1181 The following additional properties are recognized (and required) if |
428 | 1182 @var{type} is @code{ccl}: |
1183 | |
1184 @table @code | |
1185 @item decode | |
1186 CCL program used for decoding (converting to internal format). | |
1187 | |
1188 @item encode | |
1189 CCL program used for encoding (converting to external format). | |
1190 @end table | |
1191 | |
442 | 1192 The following properties are used internally: @var{eol-cr}, |
1193 @var{eol-crlf}, @var{eol-lf}, and @var{base}. | |
1194 | |
1195 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems | |
428 | 1196 @subsection Basic Coding System Functions |
1197 | |
1198 @defun find-coding-system coding-system-or-name | |
1199 This function retrieves the coding system of the given name. | |
1200 | |
442 | 1201 If @var{coding-system-or-name} is a coding-system object, it is simply |
428 | 1202 returned. Otherwise, @var{coding-system-or-name} should be a symbol. |
1203 If there is no such coding system, @code{nil} is returned. Otherwise | |
1204 the associated coding system object is returned. | |
1205 @end defun | |
1206 | |
@defun get-coding-system name
This function retrieves the coding system of the given name.  It is the
same as @code{find-coding-system} except that it signals an error,
instead of returning @code{nil}, if there is no such coding system.
@end defun
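
As a rough sketch of the difference between the two lookup functions
(assuming a MULE-enabled XEmacs in which @code{euc-jp} is defined and
the made-up name @code{no-such-coding-system} is not):

@example
;; Returns a coding-system object, or nil if the name is unknown.
(find-coding-system 'euc-jp)
(find-coding-system 'no-such-coding-system)

;; Signals an error for an unknown name, so guard the call when the
;; name comes from user input.
(condition-case err
    (get-coding-system 'no-such-coding-system)
  (error (message "Lookup failed: %S" err)))
@end example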
1212 | |
1213 @defun coding-system-list | |
1214 This function returns a list of the names of all defined coding systems. | |
1215 @end defun | |
1216 | |
1217 @defun coding-system-name coding-system | |
1218 This function returns the name of the given coding system. | |
1219 @end defun | |
1220 | |
@defun coding-system-base coding-system
This function returns the base coding system of @var{coding-system},
that is, the variant whose EOL convention is undecided.
@end defun
1225 | |
428 | 1226 @defun make-coding-system name type &optional doc-string props |
1227 This function registers symbol @var{name} as a coding system. | |
1228 | |
1229 @var{type} describes the conversion method used and should be one of | |
1230 the types listed in @ref{Coding System Types}. | |
1231 | |
1232 @var{doc-string} is a string describing the coding system. | |
1233 | |
@var{props} is a property list describing the specific nature of the
coding system.  Recognized properties are as in @ref{Coding System
Properties}.
1237 @end defun | |
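
The following is only an illustrative sketch of a call.  The name
@code{my-iso-7bit} is hypothetical, and the property list uses just the
@code{seven} flag from @ref{Coding System Properties}; a real definition
would normally specify more properties.

@example
(make-coding-system
 'my-iso-7bit                 ; hypothetical coding system name
 'iso2022                     ; one of the coding system types
 "Example ISO 2022 variant restricted to a 7-bit environment."
 '(seven t))                  ; property list
@end example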
1238 | |
1239 @defun copy-coding-system old-coding-system new-name | |
1240 This function copies @var{old-coding-system} to @var{new-name}. If | |
1241 @var{new-name} does not name an existing coding system, a new one will | |
1242 be created. | |
1243 @end defun | |
1244 | |
1245 @defun subsidiary-coding-system coding-system eol-type | |
1246 This function returns the subsidiary coding system of | |
1247 @var{coding-system} with eol type @var{eol-type}. | |
1248 @end defun | |
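
For instance, assuming that @code{euc-jp} is defined and that the EOL
type is given as one of the symbols @code{lf}, @code{crlf} or @code{cr}
(an assumption that should be checked against the @code{eol-type}
property), the last two functions might be used as follows.  The name
@code{my-euc-jp} is hypothetical.

@example
;; Make a copy of euc-jp under a new, hypothetical name.
(copy-coding-system 'euc-jp 'my-euc-jp)

;; Ask for the variant of euc-jp that uses CRLF line breaks.
(subsidiary-coding-system (get-coding-system 'euc-jp) 'crlf)
@end example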
1249 | |
442 | 1250 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems |
428 | 1251 @subsection Coding System Property Functions |
1252 | |
1253 @defun coding-system-doc-string coding-system | |
1254 This function returns the doc string for @var{coding-system}. | |
1255 @end defun | |
1256 | |
1257 @defun coding-system-type coding-system | |
1258 This function returns the type of @var{coding-system}. | |
1259 @end defun | |
1260 | |
1261 @defun coding-system-property coding-system prop | |
1262 This function returns the @var{prop} property of @var{coding-system}. | |
1263 @end defun | |
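
A short sketch of querying a coding system, assuming that @code{euc-jp}
is defined.  The @code{mnemonic} property is assumed here to be among
the properties listed in @ref{Coding System Properties}.

@example
(let ((cs (get-coding-system 'euc-jp)))
  (list (coding-system-name cs)         ; name symbol
        (coding-system-type cs)         ; conversion type
        (coding-system-doc-string cs)   ; description string
        (coding-system-property cs 'mnemonic)))
@end example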
1264 | |
442 | 1265 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems |
428 | 1266 @subsection Encoding and Decoding Text |
1267 | |
@defun decode-coding-region start end coding-system &optional buffer
This function decodes the text between @var{start} and @var{end}, which
is encoded in @var{coding-system}.  This is useful if you've read in
encoded text from a file without decoding it (e.g.@: you read in a
JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
system, so that it shows up as @samp{^[$B!<!+^[(B}).  The length of the
decoded text is returned.  @var{buffer} defaults to the current buffer
if unspecified.
@end defun
1277 | |
@defun encode-coding-region start end coding-system &optional buffer
This function encodes the text between @var{start} and @var{end} using
@var{coding-system}.  This will, for example, convert Japanese
characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use
the JIS encoding.  The length of the encoded text is returned.
@var{buffer} defaults to the current buffer if unspecified.
@end defun
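
The two functions are intended to be inverses of each other, which the
following sketch tries to show.  It assumes that @code{with-temp-buffer}
is available, that @code{iso-2022-jp} is defined, and that
@code{(make-char 'japanese-jisx0208 36 34)} yields a JIS X 0208
character (@pxref{MULE Characters}); none of this has been verified
against a particular XEmacs version.

@example
(with-temp-buffer
  ;; Insert one Japanese character followed by some ASCII text.
  (insert (make-char 'japanese-jisx0208 36 34) " plain text")
  ;; Replace the characters with their external representation ...
  (encode-coding-region (point-min) (point-max) 'iso-2022-jp)
  ;; ... and convert that representation back into characters.
  (decode-coding-region (point-min) (point-max) 'iso-2022-jp)
  (buffer-string))
@end example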
1285 | |
442 | 1286 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems |
428 | 1287 @subsection Detection of Textual Encoding |
1288 | |
1289 @defun coding-category-list | |
1290 This function returns a list of all recognized coding categories. | |
1291 @end defun | |
1292 | |
1293 @defun set-coding-priority-list list | |
1294 This function changes the priority order of the coding categories. | |
1295 @var{list} should be a list of coding categories, in descending order of | |
1296 priority. Unspecified coding categories will be lower in priority than | |
1297 all specified ones, in the same relative order they were in previously. | |
1298 @end defun | |
1299 | |
1300 @defun coding-priority-list | |
1301 This function returns a list of coding categories in descending order of | |
1302 priority. | |
1303 @end defun | |
1304 | |
1305 @defun set-coding-category-system coding-category coding-system | |
1306 This function changes the coding system associated with a coding category. | |
1307 @end defun | |
1308 | |
1309 @defun coding-category-system coding-category | |
1310 This function returns the coding system associated with a coding category. | |
1311 @end defun | |
1312 | |
@defun detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region between
@var{start} and @var{end}.  The returned value is a list of possible coding
systems ordered by priority.  If only ASCII characters are found, it
returns @code{autodetect} or one of its subsidiary coding systems
according to the detected end-of-line type.  The optional argument
@var{buffer} defaults to the current buffer.
@end defun
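
Because the priority list is global state, a cautious way to experiment
with it is to restore the previous order afterwards.  The sketch below
uses only the functions described in this subsection; reversing the
order is an arbitrary choice made for illustration.

@example
(let ((saved (coding-priority-list)))
  (unwind-protect
      (progn
        ;; Temporarily reverse the priority order of the categories.
        (set-coding-priority-list (reverse saved))
        ;; List the candidate coding systems for the current buffer.
        (detect-coding-region (point-min) (point-max)))
    ;; Put the original order back, whatever happened above.
    (set-coding-priority-list saved)))
@end example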
1321 | |
442 | 1322 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems |
428 | 1323 @subsection Big5 and Shift-JIS Functions |
1324 | |
442 | 1325 These are special functions for working with the non-standard |
428 | 1326 Shift-JIS and Big5 encodings. |
1327 | |
@defun decode-shift-jis-char code
This function decodes a JIS X 0208 character encoded in Shift-JIS.
@var{code} is the character code in Shift-JIS as a cons of two bytes.
The corresponding character is returned.
@end defun

@defun encode-shift-jis-char character
This function encodes the JIS X 0208 character @var{character} in
Shift-JIS.  The corresponding character code in Shift-JIS is returned
as a cons of two bytes.
@end defun

@defun decode-big5-char code
This function decodes a character encoded in Big5.  @var{code} is the
character code in Big5.  The corresponding character is returned.
@end defun

@defun encode-big5-char character
This function encodes the Big5 character @var{character} in Big5.
The corresponding character code in Big5 is returned.
@end defun
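
A round trip through the Shift-JIS functions can be sketched as follows.
The byte pair 0x82 0xA0 is the Shift-JIS code for HIRAGANA LETTER A; that
the result of encoding compares @code{equal} to the original cons is an
expectation, not something verified here.

@example
(let* ((code '(130 . 160))               ; 0x82 0xA0 as a cons of two bytes
       (ch (decode-shift-jis-char code)))
  ;; Encoding the decoded character should give back an equal cons.
  (equal (encode-shift-jis-char ch) code))
@end example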
1350 | |
442 | 1351 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems |
1352 @subsection Coding Systems Implemented | |
1353 | |
1354 MULE initializes most of the commonly used coding systems at XEmacs's | |
1355 startup. A few others are initialized only when the relevant language | |
1356 environment is selected and support libraries are loaded. (NB: The | |
444 | 1357 following list is based on XEmacs 21.2.19, the development branch at the |
442 | 1358 time of writing. The list may be somewhat different for other |
1359 versions. Recent versions of GNU Emacs 20 implement a few more rare | |
1360 coding systems; work is being done to port these to XEmacs.) | |
1361 | |
Unfortunately, there is no consistent naming convention for character
sets, and for practical purposes coding systems often take their names
from their principal character sets (ASCII, KOI8-R, Shift JIS).  Others
take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
from their non-text usages (internal, binary).  To provide for this, and
for the fact that many coding systems have several common names, an
aliasing system is provided.  Finally, some effort has been made to use
names that are registered as MIME charsets (this is why the name
@code{shift_jis} contains that un-Lisp-y underscore).
1371 | |
There is a systematic naming convention regarding end-of-line (EOL)
conventions for different systems.  A coding system whose name ends in
@samp{-unix} forces the assumption that lines are broken by newlines
(0x0A).  A coding system whose name ends in @samp{-mac} forces the
assumption that lines are broken by ASCII CRs (0x0D).  A coding system
whose name ends in @samp{-dos} forces the assumption that lines are
broken by CRLF sequences (0x0D 0x0A).  These subsidiary coding systems
are automatically derived from a base coding system.  Use of the base
coding system implies autodetection of the text file convention.  (The
fact that the @samp{-unix}, @samp{-mac}, and @samp{-dos} coding systems
are derived from a base system results in them showing up as
``aliases'' in @code{list-coding-systems}.)  These subsidiaries have a
consistent modeline indicator as well.  @samp{-dos} coding systems have
@samp{:T} appended to their modeline indicator, while @samp{-mac} coding
systems have @samp{:t} appended (e.g., @samp{ISO8:t} for
@code{iso-2022-8-mac}).
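
The naming convention can be explored from Lisp.  The sketch below
assumes that @code{euc-jp} and its three subsidiaries are defined; it
builds the subsidiary names from the suffixes and then goes the other
way with @code{coding-system-base}.

@example
;; Look up the three EOL-specific variants of a base coding system.
(mapcar (lambda (suffix)
          (find-coding-system (intern (concat "euc-jp" suffix))))
        '("-unix" "-dos" "-mac"))

;; Conversely, ask a subsidiary for the name of its base coding system.
(coding-system-name
 (coding-system-base (get-coding-system 'euc-jp-dos)))
@end example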
1386 | |
In the following table, each coding system is given with its mode line
indicator in parentheses.  Non-textual coding systems are listed first,
followed by textual coding systems and their aliases.  (The coding system
subsidiary modeline indicators @samp{:T} and @samp{:t} will be omitted
from the table of coding systems.)
1392 | |
1393 ### SJT 1999-08-23 Maybe should order these by language? Definitely | |
1394 need language usage for the ISO-8859 family. | |
1395 | |
1396 Note that although true coding system aliases have been implemented for | |
444 | 1397 XEmacs 21.2, the coding system initialization has not yet been converted |
442 | 1398 as of 21.2.19. So coding systems described as aliases have the same |
1399 properties as the aliased coding system, but will not be equal as Lisp | |
1400 objects. | |
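
Whether two names denote the very same Lisp object in a given XEmacs can
be checked with @code{eq}.  The sketch below uses @code{binary}, which
the table describes as an alias for @code{raw-text-unix}; the result
depends on the XEmacs version, so none is shown.

@example
(eq (find-coding-system 'binary)
    (find-coding-system 'raw-text-unix))
@end example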
1401 | |
1402 @table @code | |
1403 | |
1404 @item automatic-conversion | |
1405 @itemx undecided | |
1406 @itemx undecided-dos | |
1407 @itemx undecided-mac | |
1408 @itemx undecided-unix | |
1409 | |
1410 Modeline indicator: @code{Auto}. A type @code{undecided} coding system. | |
1411 Attempts to determine an appropriate coding system from file contents or | |
1412 the environment. | |
1413 | |
1414 @item raw-text | |
1415 @itemx no-conversion | |
1416 @itemx raw-text-dos | |
1417 @itemx raw-text-mac | |
1418 @itemx raw-text-unix | |
1419 @itemx no-conversion-dos | |
1420 @itemx no-conversion-mac | |
1421 @itemx no-conversion-unix | |
1422 | |
1423 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system, | |
1424 which converts only line-break-codes. An implementation quirk means | |
1425 that this coding system is also used for ISO8859-1. | |
1426 | |
1427 @item binary | |
1428 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding | |
1429 system which does no character coding or EOL conversions. An alias for | |
1430 @code{raw-text-unix}. | |
1431 | |
1432 @item alternativnyj | |
1433 @itemx alternativnyj-dos | |
1434 @itemx alternativnyj-mac | |
1435 @itemx alternativnyj-unix | |
1436 | |
1437 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for | |
1438 Alternativnyj, an encoding of the Cyrillic alphabet. | |
1439 | |
1440 @item big5 | |
1441 @itemx big5-dos | |
1442 @itemx big5-mac | |
1443 @itemx big5-unix | |
1444 | |
1445 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for | |
1446 BIG5, the most common encoding of traditional Chinese as used in Taiwan. | |
1447 | |
1448 @item cn-gb-2312 | |
1449 @itemx cn-gb-2312-dos | |
1450 @itemx cn-gb-2312-mac | |
1451 @itemx cn-gb-2312-unix | |
1452 | |
1453 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used | |
1454 for simplified Chinese (as used in the People's Republic of China), with | |
1455 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng} | |
1456 (G2) character sets initially designated. Chinese EUC (Extended Unix | |
1457 Code). | |
1458 | |
1459 @item ctext-hebrew | |
1460 @itemx ctext-hebrew-dos | |
1461 @itemx ctext-hebrew-mac | |
1462 @itemx ctext-hebrew-unix | |
1463 | |
1464 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system | |
1465 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character | |
1466 sets initially designated for Hebrew. | |
1467 | |
1468 @item ctext | |
1469 @itemx ctext-dos | |
1470 @itemx ctext-mac | |
1471 @itemx ctext-unix | |
1472 | |
Modeline indicator: @code{CText}.  A type @code{iso2022} 8-bit coding system
with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
sets initially designated.  X11 Compound Text Encoding.  It is often
mistakenly recognized instead of EUC encodings; the usual cause is an
inappropriate setting of @code{coding-priority-list}.
1478 | |
1479 @item escape-quoted | |
1480 | |
Modeline indicator: @code{ESC/Quot}.  A type @code{iso2022} 8-bit coding
system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
character sets initially designated and escape quoting.  Unix EOL
conversion (i.e., no conversion).  It is used for @file{.elc} files.
1485 | |
1486 @item euc-jp | |
1487 @itemx euc-jp-dos | |
1488 @itemx euc-jp-mac | |
1489 @itemx euc-jp-unix | |
1490 | |
1491 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system | |
1492 with @code{ascii} (G0), @code{japanese-jisx0208} (G1), | |
1493 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3) | |
1494 initially designated. Japanese EUC (Extended Unix Code). | |
1495 | |
1496 @item euc-kr | |
1497 @itemx euc-kr-dos | |
1498 @itemx euc-kr-mac | |
1499 @itemx euc-kr-unix | |
1500 | |
1501 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system | |
1502 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
1503 designated. Korean EUC (Extended Unix Code). | |
1504 | |
1505 @item hz-gb-2312 | |
Modeline indicator: @code{Zh-GB/Hz}.  A type @code{no-conversion} coding
system with Unix EOL convention (i.e., no conversion) using
post-read-decode and pre-write-encode functions to translate the Hz/ZW
coding system used for Chinese.
1510 | |
1511 @item iso-2022-7bit | |
1512 @itemx iso-2022-7bit-unix | |
1513 @itemx iso-2022-7bit-dos | |
1514 @itemx iso-2022-7bit-mac | |
1515 @itemx iso-2022-7 | |
1516 | |
1517 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system | |
1518 with @code{ascii} (G0) initially designated. Other character sets must | |
1519 be explicitly designated to be used. | |
1520 | |
1521 @item iso-2022-7bit-ss2 | |
1522 @itemx iso-2022-7bit-ss2-dos | |
1523 @itemx iso-2022-7bit-ss2-mac | |
1524 @itemx iso-2022-7bit-ss2-unix | |
1525 | |
1526 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system | |
1527 with @code{ascii} (G0) initially designated. Other character sets must | |
1528 be explicitly designated to be used. SS2 is used to invoke a | |
1529 96-charset, one character at a time. | |
1530 | |
1531 @item iso-2022-8 | |
1532 @itemx iso-2022-8-dos | |
1533 @itemx iso-2022-8-mac | |
1534 @itemx iso-2022-8-unix | |
1535 | |
1536 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system | |
1537 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
1538 designated. Other character sets must be explicitly designated to be | |
1539 used. No single-shift or locking-shift. | |
1540 | |
1541 @item iso-2022-8bit-ss2 | |
1542 @itemx iso-2022-8bit-ss2-dos | |
1543 @itemx iso-2022-8bit-ss2-mac | |
1544 @itemx iso-2022-8bit-ss2-unix | |
1545 | |
1546 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system | |
1547 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
1548 designated. Other character sets must be explicitly designated to be | |
1549 used. SS2 is used to invoke a 96-charset, one character at a time. | |
1550 | |
1551 @item iso-2022-int-1 | |
1552 @itemx iso-2022-int-1-dos | |
1553 @itemx iso-2022-int-1-mac | |
1554 @itemx iso-2022-int-1-unix | |
1555 | |
1556 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system | |
1557 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
1558 designated. ISO-2022-INT-1. | |
1559 | |
1560 @item iso-2022-jp-1978-irv | |
1561 @itemx iso-2022-jp-1978-irv-dos | |
1562 @itemx iso-2022-jp-1978-irv-mac | |
1563 @itemx iso-2022-jp-1978-irv-unix | |
1564 | |
1565 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding | |
1566 system. For compatibility with old Japanese terminals; if you need to | |
1567 know, look at the source. | |
1568 | |
1569 @item iso-2022-jp | |
1570 @itemx iso-2022-jp-2 (ISO7/SS) | |
1571 @itemx iso-2022-jp-dos | |
1572 @itemx iso-2022-jp-mac | |
1573 @itemx iso-2022-jp-unix | |
1574 @itemx iso-2022-jp-2-dos | |
1575 @itemx iso-2022-jp-2-mac | |
1576 @itemx iso-2022-jp-2-unix | |
1577 | |
Modeline indicator: @code{MULE/7bit}.  A type @code{iso2022} 7-bit coding
system with @code{ascii} (G0) initially designated, and complex
specifications to ensure backward compatibility with old Japanese
systems.  Used for communication with mail and news in Japan.  The
@samp{-2} versions also use SS2 to invoke a 96-charset one character at
a time.
1583 | |
1584 @item iso-2022-kr | |
Modeline indicator: @code{Ko/7bit}.  A type @code{iso2022} 7-bit coding
system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
designated.  Used for e-mail in Korea.
1588 | |
1589 @item iso-2022-lock | |
1590 @itemx iso-2022-lock-dos | |
1591 @itemx iso-2022-lock-mac | |
1592 @itemx iso-2022-lock-unix | |
1593 | |
1594 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding | |
1595 system with @code{ascii} (G0) initially designated, using Locking-Shift | |
1596 to invoke a 96-charset. | |
1597 | |
1598 @item iso-8859-1 | |
1599 @itemx iso-8859-1-dos | |
1600 @itemx iso-8859-1-mac | |
1601 @itemx iso-8859-1-unix | |
1602 | |
1603 Due to implementation, this is not a type @code{iso2022} coding system, | |
1604 but rather an alias for the @code{raw-text} coding system. | |
1605 | |
1606 @item iso-8859-2 | |
1607 @itemx iso-8859-2-dos | |
1608 @itemx iso-8859-2-mac | |
1609 @itemx iso-8859-2-unix | |
1610 | |
1611 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding | |
1612 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially | |
1613 invoked. | |
1614 | |
1615 @item iso-8859-3 | |
1616 @itemx iso-8859-3-dos | |
1617 @itemx iso-8859-3-mac | |
1618 @itemx iso-8859-3-unix | |
1619 | |
1620 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system | |
1621 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially | |
1622 invoked. | |
1623 | |
1624 @item iso-8859-4 | |
1625 @itemx iso-8859-4-dos | |
1626 @itemx iso-8859-4-mac | |
1627 @itemx iso-8859-4-unix | |
1628 | |
1629 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system | |
1630 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially | |
1631 invoked. | |
1632 | |
1633 @item iso-8859-5 | |
1634 @itemx iso-8859-5-dos | |
1635 @itemx iso-8859-5-mac | |
1636 @itemx iso-8859-5-unix | |
1637 | |
1638 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with | |
1639 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked. | |
1640 | |
1641 @item iso-8859-7 | |
1642 @itemx iso-8859-7-dos | |
1643 @itemx iso-8859-7-mac | |
1644 @itemx iso-8859-7-unix | |
1645 | |
1646 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with | |
1647 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked. | |
1648 | |
1649 @item iso-8859-8 | |
1650 @itemx iso-8859-8-dos | |
1651 @itemx iso-8859-8-mac | |
1652 @itemx iso-8859-8-unix | |
1653 | |
1654 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with | |
1655 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked. | |
1656 | |
1657 @item iso-8859-9 | |
1658 @itemx iso-8859-9-dos | |
1659 @itemx iso-8859-9-mac | |
1660 @itemx iso-8859-9-unix | |
1661 | |
1662 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system | |
1663 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially | |
1664 invoked. | |
1665 | |
1666 @item koi8-r | |
1667 @itemx koi8-r-dos | |
1668 @itemx koi8-r-mac | |
1669 @itemx koi8-r-unix | |
1670 | |
1671 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for | |
1672 KOI8-R, an encoding of the Cyrillic alphabet. | |
1673 | |
1674 @item shift_jis | |
1675 @itemx shift_jis-dos | |
1676 @itemx shift_jis-mac | |
1677 @itemx shift_jis-unix | |
1678 | |
1679 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system | |
1680 implementing the Shift-JIS encoding for Japanese. The underscore is to | |
1681 conform to the MIME charset implementing this encoding. | |
1682 | |
1683 @item tis-620 | |
1684 @itemx tis-620-dos | |
1685 @itemx tis-620-mac | |
1686 @itemx tis-620-unix | |
1687 | |
1688 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The | |
1689 external encoding is defined by TIS620, the internal encoding is | |
1690 peculiar to MULE, and called @code{thai-xtis}. | |
1691 | |
1692 @item viqr | |
1693 | |
Modeline indicator: @code{VIQR}.  A type @code{no-conversion} coding
system with Unix EOL convention (i.e., no conversion) using
post-read-decode and pre-write-encode functions to translate the VIQR
coding system for Vietnamese.
1698 | |
1699 @item viscii | |
1700 @itemx viscii-dos | |
1701 @itemx viscii-mac | |
1702 @itemx viscii-unix | |
1703 | |
1704 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used | |
1705 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is | |
1706 given priority by XEmacs. | |
1707 | |
1708 @item vscii | |
1709 @itemx vscii-dos | |
1710 @itemx vscii-mac | |
1711 @itemx vscii-unix | |
1712 | |
1713 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used | |
1714 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is | |
1715 given priority by XEmacs. Use | |
1716 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII. | |
1717 | |
1718 @end table | |
1719 | |
428 | 1720 @node CCL, Category Tables, Coding Systems, MULE |
1721 @section CCL | |
1722 | |
442 | 1723 CCL (Code Conversion Language) is a simple structured programming |
428 | 1724 language designed for character coding conversions. A CCL program is |
1725 compiled to CCL code (represented by a vector of integers) and executed | |
1726 by the CCL interpreter embedded in Emacs. The CCL interpreter | |
1727 implements a virtual machine with 8 registers called @code{r0}, ..., | |
1728 @code{r7}, a number of control structures, and some I/O operators. Take | |
1729 care when using registers @code{r0} (used in implicit @dfn{set} | |
1730 statements) and especially @code{r7} (used internally by several | |
444 | 1731 statements and operations, especially for multiple return values and I/O |
428 | 1732 operations). |
1733 | |
442 | 1734 CCL is used for code conversion during process I/O and file I/O for |
428 | 1735 non-ISO2022 coding systems. (It is the only way for a user to specify a |
1736 code conversion function.) It is also used for calculating the code | |
1737 point of an X11 font from a character code. However, since CCL is | |
1738 designed as a powerful programming language, it can be used for more | |
1739 generic calculation where efficiency is demanded. A combination of | |
1740 three or more arithmetic operations can be calculated faster by CCL than | |
1741 by Emacs Lisp. | |
1742 | |
442 | 1743 @strong{Warning:} The code in @file{src/mule-ccl.c} and |
428 | 1744 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive |
1745 description of CCL's semantics. The previous version of this section | |
1746 contained several typos and obsolete names left from earlier versions of | |
1747 MULE, and many may remain. (I am not an experienced CCL programmer; the | |
1748 few who know CCL well find writing English painful.) | |
1749 | |
442 | 1750 A CCL program transforms an input data stream into an output data |
428 | 1751 stream. The input stream, held in a buffer of constant bytes, is left |
1752 unchanged. The buffer may be filled by an external input operation, | |
1753 taken from an Emacs buffer, or taken from a Lisp string. The output | |
1754 buffer is a dynamic array of bytes, which can be written by an external | |
1755 output operation, inserted into an Emacs buffer, or returned as a Lisp | |
1756 string. | |
1757 | |
442 | 1758 A CCL program is a (Lisp) list containing two or three members. The |
428 | 1759 first member is the @dfn{buffer magnification}, which indicates the |
1760 required minimum size of the output buffer as a multiple of the input | |
1761 buffer. It is followed by the @dfn{main block} which executes while | |
1762 there is input remaining, and an optional @dfn{EOF block} which is | |
1763 executed when the input is exhausted. Both the main block and the EOF | |
1764 block are CCL blocks. | |
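
As a minimal sketch of this structure, the following list is a CCL
program with a buffer magnification and a main block but no EOF block.
It is meant to copy its input to its output unchanged; the variable name
@code{my-ccl-copy} is hypothetical, and a magnification of 1 is assumed
to be sufficient for ASCII input.  The statements used are described
later in this section.

@example
(defconst my-ccl-copy        ; hypothetical variable name
  '(1                        ; buffer magnification
    ((loop                   ; main block: a list holding one statement
      (read r0)              ; read one byte of input into r0
      (write r0)             ; write it to the output
      (repeat)))))           ; jump back to the start of the loop
@end example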
1765 | |
442 | 1766 A @dfn{CCL block} is either a CCL statement or list of CCL statements. |
444 | 1767 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer |
428 | 1768 or an @dfn{assignment}, which is a list of a register to receive the |
444 | 1769 assignment, an assignment operator, and an expression) or a @dfn{control |
428 | 1770 statement} (a list starting with a keyword, whose allowable syntax |
1771 depends on the keyword). | |
1772 | |
1773 @menu | |
1774 * CCL Syntax:: CCL program syntax in BNF notation. | |
1775 * CCL Statements:: Semantics of CCL statements. | |
1776 * CCL Expressions:: Operators and expressions in CCL. | |
1777 * Calling CCL:: Running CCL programs. | |
2640 | 1778 * CCL Example:: A trivial program to transform the Web's URL encoding. |
428 | 1779 @end menu |
1780 | |
442 | 1781 @node CCL Syntax, CCL Statements, , CCL |
428 | 1782 @comment Node, Next, Previous, Up |
1783 @subsection CCL Syntax | |
1784 | |
442 | 1785 The full syntax of a CCL program in BNF notation: |
428 | 1786 |
@format
CCL_PROGRAM :=
        (BUFFER_MAGNIFICATION
         CCL_MAIN_BLOCK
         [ CCL_EOF_BLOCK ])

BUFFER_MAGNIFICATION := integer
CCL_MAIN_BLOCK := CCL_BLOCK
CCL_EOF_BLOCK := CCL_BLOCK

CCL_BLOCK :=
        STATEMENT | (STATEMENT [STATEMENT ...])
STATEMENT :=
        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE | CALL
        | TRANSLATE | MAP | END

SET :=
        (REG = EXPRESSION)
        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
        | INT-OR-CHAR

EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)

IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
LOOP := (loop STATEMENT [STATEMENT ...])
BREAK := (break)
REPEAT :=
        (repeat)
        | (write-repeat [REG | INT-OR-CHAR | string])
        | (write-read-repeat REG [INT-OR-CHAR | ARRAY])
READ :=
        (read REG ...)
        | (read-if (REG OPERATOR ARG) CCL_BLOCK [CCL_BLOCK])
        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
WRITE :=
        (write REG ...)
        | (write EXPRESSION)
        | (write INT-OR-CHAR) | (write string) | (write REG ARRAY)
        | string
CALL := (call ccl-program-name)


TRANSLATE := ;; Not implemented under XEmacs, except mule-to-unicode and
             ;; unicode-to-mule.
        (translate-character REG(table) REG(charset) REG(codepoint))
        | (translate-character SYMBOL REG(charset) REG(codepoint))
        | (mule-to-unicode REG(charset) REG(codepoint))
        | (unicode-to-mule REG(unicode,code) REG(CHARSET))

END := (end)

REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | INT-OR-CHAR
OPERATOR :=
        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
        | < | > | == | <= | >= | != | de-sjis | en-sjis
ASSIGNMENT_OPERATOR :=
        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
ARRAY := '[' INT-OR-CHAR ... ']'
INT-OR-CHAR := integer | character

@end format
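
To see how the grammar fits together, here is a sketch of a complete
program that is intended to turn each LF into CRLF.  It is written only
from the BNF above and has not been run; the magnification of 2 reflects
the assumption that the output can be at most twice as long as the
input.

@example
'(2                          ; BUFFER_MAGNIFICATION
  ((loop                     ; CCL_MAIN_BLOCK containing one LOOP
    (read r0)                ; READ one byte into r0
    (if (r0 == ?\n)          ; IF with an (EXPRESSION OPERATOR ARG) test
        ((write ?\r)))       ; CCL_BLOCK given as a list of statements
    (write r0)               ; WRITE the byte that was read
    (repeat))))              ; REPEAT: restart the enclosing loop
@end example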
1850 | |
1851 @node CCL Statements, CCL Expressions, CCL Syntax, CCL | |
1852 @comment Node, Next, Previous, Up | |
1853 @subsection CCL Statements | |
1854 | |
442 | 1855 The Emacs Code Conversion Language provides the following statement |
428 | 1856 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, |
3439 | 1857 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, @dfn{translate} and |
1858 @dfn{end}. | |
428 | 1859 |
1860 @heading Set statement: | |
1861 | |
The @dfn{set} statement has three variants with the syntaxes
@samp{(@var{reg} = @var{expression})},
@samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
@samp{@var{integer}}.  The assignment operator variation of the
@dfn{set} statement works the same way as the corresponding C expression
statement does.  The assignment operators are @code{+=}, @code{-=},
@code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
@code{<<=}, and @code{>>=}, and they have the same meanings as in C.  A
``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
the form @code{(r0 = @var{integer})}.
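
The three variants might look as follows inside a program.  This is a
sketch that follows the syntax above; the values are arbitrary.

@example
'(1
  ((r1 = 65)                  ; plain assignment
   (r2 = ((r0 & 15) << 4))    ; assignment of a nested expression
   (r2 += r1)                 ; same meaning as the C statement r2 += r1;
   0))                        ; naked integer, equivalent to (r0 = 0)
@end example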
1872 | |
1873 @heading I/O statements: | |
1874 | |
442 | 1875 The @dfn{read} statement takes one or more registers as arguments. It |
444 | 1876 reads one byte (a C char) from the input into each register in turn. |
428 | 1877 |
The @dfn{write} statement takes several forms. In the form
@samp{(write @var{reg} ...)} it takes one or more registers as arguments
and writes each in turn to the output. The integer in a register
(interpreted as an Ichar) is encoded to multibyte form (i.e., Ibytes) and
written to the current output buffer. If it is less than 256, it is
written as is. The forms @samp{(write @var{expression})} and
@samp{(write @var{integer})} are treated analogously. The form
@samp{(write @var{string})} writes the constant string to the output. A
``naked string'' @samp{@var{string}} is equivalent to the statement
@samp{(write @var{string})}. The form @samp{(write @var{reg}
@var{array})} writes the @var{reg}th element of the @var{array} to the
output.
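
To make the I/O forms concrete, here is a small sketch (ours, not taken
from the XEmacs sources) that copies its input to the output and follows
each byte with a hyphen. The magnification of 2 reflects that each
input byte produces two output bytes. The program stops when the
@code{read} finds the input exhausted.

@example
(ccl-execute-on-string
 (ccl-compile
  '(2                       ; output may be twice as long as the input
    ((loop
      (read r0)             ; one byte of input into r0
      (write r0)            ; (write REG): copy it to the output
      (write "-")           ; (write STRING): a constant string
      (repeat)))))
 (make-vector 9 nil)        ; all registers and the IC defaulted
 "abc")
@end example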
1889 | |
1890 @heading Conditional statements: | |
1891 | |
442 | 1892 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and |
428 | 1893 an optional @var{second CCL block} as arguments. If the |
1894 @var{expression} evaluates to non-zero, the first @var{CCL block} is | |
1895 executed. Otherwise, if there is a @var{second CCL block}, it is | |
1896 executed. | |
1897 | |
442 | 1898 The @dfn{read-if} variant of the @dfn{if} statement takes an |
428 | 1899 @var{expression}, a @var{CCL block}, and an optional @var{second CCL |
1900 block} as arguments. The @var{expression} must have the form | |
1901 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is | |
1902 a register or an integer). The @code{read-if} statement first reads | |
1903 from the input into the first register operand in the @var{expression}, | |
1904 then conditionally executes a CCL block just as the @code{if} statement | |
1905 does. | |
1906 | |
442 | 1907 The @dfn{branch} statement takes an @var{expression} and one or more CCL |
428 | 1908 blocks as arguments. The CCL blocks are treated as a zero-indexed |
1909 array, and the @code{branch} statement uses the @var{expression} as the | |
1910 index of the CCL block to execute. Null CCL blocks may be used as | |
1911 no-ops, continuing execution with the statement following the | |
1912 @code{branch} statement in the containing CCL block. Out-of-range | |
444 | 1913 values for the @var{expression} are also treated as no-ops. |
428 | 1914 |
The @dfn{read-branch} variant of the @dfn{branch} statement takes a
@var{register} and one or more CCL blocks as arguments. The
@code{read-branch} statement first reads from the input into the
@var{register}, then uses the value read as the index of the CCL block
to execute, just as the @code{branch} statement does.
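
The following sketch combines @code{if} and @code{branch}; it is an
illustration of ours rather than code from XEmacs. A space is replaced
by an underscore, the ASCII digits 0, 1 and 2 are spelled out by
indexing into the blocks of a @code{branch}, and every other byte
selects no block and so produces no output.

@example
(ccl-execute-on-string
 (ccl-compile
  '(4                            ; "zero" is four bytes for one input byte
    ((loop
      (read r0)
      (if (r0 == #x20)           ; is it a space?
          (write #x5F)           ; then write an underscore
        ;; else: index 0 for the digit 0, 1 for 1, 2 for 2;
        ;; any other index is out of range and acts as a no-op.
        (branch (r0 - #x30)
                (write "zero")
                (write "one")
                (write "two")))
      (repeat)))))
 (make-vector 9 nil)
 "1 2")
@end example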
1920 | |
1921 @heading Loop control statements: | |
1922 | |
442 | 1923 The @dfn{loop} statement creates a block with an implied jump from the |
444 | 1924 end of the block back to its head. The loop is exited on a @code{break} |
428 | 1925 statement, and continued without executing the tail by a @code{repeat} |
1926 statement. | |
1927 | |
442 | 1928 The @dfn{break} statement, written @samp{(break)}, terminates the |
428 | 1929 current loop and continues with the next statement in the current |
444 | 1930 block. |
428 | 1931 |
442 | 1932 The @dfn{repeat} statement has three variants, @code{repeat}, |
428 | 1933 @code{write-repeat}, and @code{write-read-repeat}. Each continues the |
1934 current loop from its head, possibly after performing I/O. | |
1935 @code{repeat} takes no arguments and does no I/O before jumping. | |
444 | 1936 @code{write-repeat} takes a single argument (a register, an |
428 | 1937 integer, or a string), writes it to the output, then jumps. |
1938 @code{write-read-repeat} takes one or two arguments. The first must | |
1939 be a register. The second may be an integer or an array; if absent, it | |
1940 is implicitly set to the first (register) argument. | |
1941 @code{write-read-repeat} writes its second argument to the output, then | |
1942 reads from the input into the register, and finally jumps. See the | |
1943 @code{write} and @code{read} statements for the semantics of the I/O | |
1944 operations for each type of argument. | |
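
A short sketch of these statements working together, written for this
section as an illustration: the program copies bytes from the input to
the output until it meets a newline, at which point @code{break} leaves
the loop and the program ends.

@example
(ccl-execute-on-string
 (ccl-compile
  '(1
    ((read r0)                   ; prime the loop with the first byte
     (loop
      (if (r0 == #x0A)           ; newline?
          (break))               ; leave the loop; nothing follows it
      ;; Write r0, read the next byte into r0, jump to the loop head.
      (write-read-repeat r0)))))
 (make-vector 9 nil)
 "first line\nsecond line")
@end example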
1945 | |
3439 | 1946 @heading Other statements: |
428 | 1947 |
442 | 1948 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, |
428 | 1949 executes a CCL program as a subroutine. It does not return a value to |
1950 the caller, but can modify the register status. | |
1951 | |
The @dfn{mule-to-unicode} statement translates an XEmacs character into a
UCS code point, using U+FFFD REPLACEMENT CHARACTER if the given XEmacs
character has no known corresponding code point. It takes two
arguments. The first is a register that holds the character set ID of
the character to be translated on entry, and receives the UCS code
point on exit. The second is a register which stores the XEmacs code of
the character in question; if it is from a multidimensional character
set, like most of the East Asian national sets, it's stored as
@samp{((c1 << 8) | c2)}, where @samp{c1} is the first code, and
@samp{c2} the second. (That is, as a single integer, the high-order
eight bits of which encode the first position code, and the low-order
bits of which encode the second.)
1964 | |
The @dfn{unicode-to-mule} statement translates a Unicode code point
(an integer) into an XEmacs character. Its first argument is a register
containing the UCS code point; the code for the corresponding character
will be written into this register, in the same format as for
@samp{mule-to-unicode}. The second argument is a register into which
will be written the character set ID of the converted character.
1971 | |
442 | 1972 The @dfn{end} statement, written @samp{(end)}, terminates the CCL |
428 | 1973 program successfully, and returns to caller (which may be a CCL |
1974 program). It does not alter the status of the registers. | |
1975 | |
1976 @node CCL Expressions, Calling CCL, CCL Statements, CCL | |
1977 @comment Node, Next, Previous, Up | |
1978 @subsection CCL Expressions | |
1979 | |
CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
consist of a single @var{operand}, either a register (one of @code{r0},
..., @code{r7}) or an integer. Complex expressions are lists of the
form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
C, assignments are not expressions.
1985 | |
In the following table, @var{X} is the target register for a @dfn{set}.
In subexpressions, this is implicitly @code{r7}. This means that
@code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
freely in subexpressions, since they return parts of their values in
@code{r7}. @var{Y} may be an expression, register, or integer, while
@var{Z} must be a register or an integer.
1992 | |
@multitable @columnfractions .22 .14 .09 .55
@item Name @tab Operator @tab Code @tab C-like Description
@item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
@item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
@item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
@item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
@item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
@item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
@item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
@item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
@item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
@item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
@item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
@item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
@item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
@item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
@item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
@item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
@item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
@item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
@item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
@item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
@item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
@item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
@end multitable
2019 | |
The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
and CCL_DECODE_SJIS operators treat their first and second operands as
the high and low bytes of a two-byte character code. (SJIS stands for
Shift JIS, an encoding of Japanese characters used by Microsoft.
CCL_ENCODE_SJIS is a complicated transformation of the Japanese
standard JIS encoding to Shift JIS. CCL_DECODE_SJIS is its inverse.)
It is somewhat odd to represent the SJIS operations in infix form.
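
The arithmetic behind the three non-C operators can be checked in
ordinary Lisp. The functions below are hypothetical helpers written for
this illustration (they are not part of XEmacs); each mirrors the
``C-like Description'' column of the table, returning a cons of the
value for @var{X} and the value left in @code{r7} where the operator
produces two results.

@example
(defun my-ccl-lsh8 (y z)
  "Mirror CCL_LSH8: shift Y left eight bits and OR in Z."
  (logior (lsh y 8) z))

(defun my-ccl-rsh8 (y)
  "Mirror CCL_RSH8: return (X . R7), Y without and of its low byte."
  (cons (lsh y -8) (logand y #xFF)))

(defun my-ccl-divmod (y z)
  "Mirror CCL_DIVMOD: return (X . R7), quotient and remainder."
  (cons (/ y z) (% y z)))

(my-ccl-lsh8 #x41 #x7A)        ; 16762, that is, #x417A
(my-ccl-rsh8 #x417A)           ; (65 . 122), that is, (#x41 . #x7A)
(my-ccl-divmod 122 16)         ; (7 . 10)
@end example

The last call shows why @code{//} is convenient for hexadecimal output:
one operation splits an octet into its high and low four bits.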
2028 | |
2640 | 2029 @node Calling CCL, CCL Example, CCL Expressions, CCL |
428 | 2030 @comment Node, Next, Previous, Up |
2031 @subsection Calling CCL | |
2032 | |
442 | 2033 CCL programs are called automatically during Emacs buffer I/O when the |
428 | 2034 external representation has a coding system type of @code{shift-jis}, |
2035 @code{big5}, or @code{ccl}. The program is specified by the coding | |
2036 system (@pxref{Coding Systems}). You can also call CCL programs from | |
2037 other CCL programs, and from Lisp using these functions: | |
2038 | |
2039 @defun ccl-execute ccl-program status | |
2040 Execute @var{ccl-program} with registers initialized by | |
2041 @var{status}. @var{ccl-program} is a vector of compiled CCL code | |
444 | 2042 created by @code{ccl-compile}. It is an error for the program to try to |
428 | 2043 execute a CCL I/O command. @var{status} must be a vector of nine |
2044 values, specifying the initial value for the R0, R1 .. R7 registers and | |
2045 for the instruction counter IC. A @code{nil} value for a register | |
2046 initializer causes the register to be set to 0. A @code{nil} value for | |
2047 the IC initializer causes execution to start at the beginning of the | |
2048 program. When the program is done, @var{status} is modified (by | |
2049 side-effect) to contain the ending values for the corresponding | |
444 | 2050 registers and IC. |
428 | 2051 @end defun |
2052 | |
444 | 2053 @defun ccl-execute-on-string ccl-program status string &optional continue |
428 | 2054 Execute @var{ccl-program} with initial @var{status} on |
2055 @var{string}. @var{ccl-program} is a vector of compiled CCL code | |
2056 created by @code{ccl-compile}. @var{status} must be a vector of nine | |
2057 values, specifying the initial value for the R0, R1 .. R7 registers and | |
2058 for the instruction counter IC. A @code{nil} value for a register | |
2059 initializer causes the register to be set to 0. A @code{nil} value for | |
2060 the IC initializer causes execution to start at the beginning of the | |
444 | 2061 program. An optional fourth argument @var{continue}, if non-@code{nil}, causes |
428 | 2062 the IC to |
2063 remain on the unsatisfied read operation if the program terminates due | |
2064 to exhaustion of the input buffer. Otherwise the IC is set to the end | |
444 | 2065 of the program. When the program is done, @var{status} is modified (by |
428 | 2066 side-effect) to contain the ending values for the corresponding |
2067 registers and IC. Returns the resulting string. | |
2068 @end defun | |
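
As a sketch of how the @var{status} vector can carry information out of
a program, the following example (ours, for illustration) counts the
bytes it copies in @code{r1} and reads the count back from the vector
after the call.

@example
(let ((status (make-vector 9 nil)))      ; nil means 0 for each register
  (list
   (ccl-execute-on-string
    (ccl-compile
     '(1
       ((loop
         (read r0)
         (r1 += 1)                       ; count this byte
         (write r0)
         (repeat)))))
    status
    "hello")
   ;; STATUS was modified by side effect; element 1 is r1.
   (aref status 1)))
@end example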
2069 | |
442 | 2070 To call a CCL program from another CCL program, it must first be |
428 | 2071 registered: |
2072 | |
2073 @defun register-ccl-program name ccl-program | |
444 | 2074 Register @var{name} for CCL program @var{ccl-program} in |
2075 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of | |
2076 a CCL program, or @code{nil}. Return index number of the registered CCL | |
428 | 2077 program. |
2078 @end defun | |
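
A minimal sketch of registration followed by a @code{call}, assuming
that the name given to @code{register-ccl-program} is the name to use in
the @code{call} statement. Both program names here are hypothetical.
The subroutine writes a percentage sign; the caller invokes it before
each byte it copies.

@example
;; Register a one-statement subroutine under a name of our choosing.
(register-ccl-program
 'my-ccl-write-percent
 (ccl-compile '(1 ((write #x25)))))

(ccl-execute-on-string
 (ccl-compile
  '(2                                    ; two output bytes per input byte
    ((loop
      (read r0)
      (call my-ccl-write-percent)        ; run the registered program
      (write r0)
      (repeat)))))
 (make-vector 9 nil)
 "ab")
@end example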
2079 | |
442 | 2080 Information about the processor time used by the CCL interpreter can be |
428 | 2081 obtained using these functions: |
2082 | |
2083 @defun ccl-elapsed-time | |
Returns the elapsed processor time of the CCL interpreter as a cons of
user and system time, as floating point numbers measured in seconds.
If only one overall value can be determined, the return value will be a
cons of that value and 0.
2089 @end defun | |
2090 | |
2091 @defun ccl-reset-elapsed-time | |
2092 Resets the CCL interpreter's internal elapsed time registers. | |
2093 @end defun | |
2094 | |
2640 | 2095 @node CCL Example, , Calling CCL, CCL |
428 | 2096 @comment Node, Next, Previous, Up |
2640 | 2097 @subsection CCL Example |
2098 | |
2099 In this section, we describe the implementation of a trivial coding | |
2100 system to transform from the Web's URL encoding to XEmacs' internal | |
2101 coding. Many people will have been first exposed to URL encoding when | |
2102 they saw ``%20'' where they expected a space in a file's name on their | |
2103 local hard disk; this can happen when a browser saves a file from the | |
2104 web and doesn't encode the name, as passed from the server, properly. | |
2105 | |
2106 URL encoding itself is underspecified with regard to encodings beyond | |
2107 ASCII. The relevant document, RFC 1738, explicitly doesn't give any | |
2108 information on how to encode non-ASCII characters, and the ``obvious'' | |
2109 way---use the %xx values for the octets of the eight bit MIME character | |
2110 set in which the page was served---breaks when a user types a character | |
2111 outside that character set. Best practice for web development is to | |
2112 serve all pages as UTF-8 and treat incoming form data as using that | |
2113 coding system. (Oh, and gamble that your clients won't ever want to | |
2114 type anything outside Unicode. But that's not so much of a gamble with | |
2115 today's client operating systems.) We don't treat non-ASCII in this | |
2116 example, as dealing with @samp{(read-multibyte-character ...)} and | |
2117 errors therewith would make it much harder to understand. | |
2118 | |
2119 Since CCL isn't a very rich language, we move much of the logic that | |
2120 would ordinarily be computed from operations like @code{(member ..)}, | |
2121 @code{(and ...)} and @code{(or ...)} into tables, from which register | |
2122 values are read and written, and on which @code{if} statements are | |
2123 predicated. Much more of the implementation of this coding system is | |
2124 occupied with constructing these tables---in normal Emacs Lisp---than it | |
2125 is with actual CCL code. | |
2126 | |
2127 All the @code{defvar} statements we deal with in the next few sections | |
2128 are surrounded by a @code{(eval-and-compile ...)}, which means that the | |
2129 logic which initializes these variables executes at compile time, and if | |
2130 XEmacs loads the compiled version of the file, these variables are | |
2131 initialized as constants. | |
2132 | |
2133 @menu | |
2134 * Four bits to ASCII:: Two tables used for getting hex digits from ASCII. | |
2135 * URI Encoding constants:: Useful predefined characters. | |
2136 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL. | |
2137 * Characters to be preserved:: No transformation needed for these characters. | |
2138 * The program to decode to internal format:: . | |
2139 * The program to encode from internal format:: . | |
2690 | 2140 * The actual coding system:: . |
2640 | 2141 @end menu |
2142 | |
2143 @node Four bits to ASCII, URI Encoding constants, , CCL Example | |
2144 @subsubsection Four bits to ASCII | |
2145 | |
2146 The first @code{defvar} is for | |
2147 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that | |
2148 maps from an octet's value to the ASCII encoding for the hex value of | |
2149 its most significant four bits. That might sound complex, but it isn't; | |
2150 for decimal 65, hex value @samp{#x41}, the entry in the table is the | |
2151 ASCII encoding of `4'. For decimal 122, ASCII `z', hex value | |
2152 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)} | |
2153 after this file is loaded gives the ASCII encoding of 7. | |
2154 | |
2155 @example | |
(defvar url-coding-high-order-nybble-as-ascii
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 0)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's most significant 4 bits.")
2164 @end example | |
2165 | |
The next table, @code{url-coding-low-order-nybble-as-ascii}, is almost
the same thing, but this time it has a map for the hex encoding of the
low-order four bits. So the entry at offset @samp{#x41} (decimal 65) is
the ASCII encoding of `1', and the entry at offset @samp{#x7a} (decimal
122) is the ASCII encoding of `A'.
2171 | |
2172 @example | |
(defvar url-coding-low-order-nybble-as-ascii
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 1)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's least significant 4 bits.")
2181 @end example | |
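
The two lookups can be reproduced for a single octet without the
tables, which may help in checking an entry by hand. This sketch uses
only @code{format}, @code{aref} and @code{char-to-int}, in the same way
the table initializers do.

@example
(let* ((octet #x7A)
       (hex (format "%02X" octet)))        ; the string "7A"
  (list (char-to-int (aref hex 0))         ; high nybble: 55, ASCII `7'
        (char-to-int (aref hex 1))))       ; low nybble: 65, ASCII `A'
@end example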
2182 | |
2183 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example | |
2184 @subsubsection URI Encoding constants | |
2185 | |
2186 Next, we have a couple of variables that make the CCL code more | |
2187 readable. The first is the ASCII encoding of the percentage sign; this | |
2188 character is used as an escape code, to start the encoding of a | |
2189 non-printable character. For historical reasons, URL encoding allows | |
2190 the space character to be encoded as a plus sign--it does make typing | |
2191 URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and | |
2192 as such, we have to check when decoding for this value, and map it to | |
2193 the space character. When doing this in CCL, we use the | |
2194 @code{url-coding-escaped-space-code} variable. | |
2195 | |
2196 @example | |
(defvar url-coding-escape-character-code (char-to-int ?%)
  "The code point for the percentage sign, in ASCII.")

(defvar url-coding-escaped-space-code (char-to-int ?+)
  "The URL-encoded value of the space character, that is, +.")
2202 @end example | |
2203 | |
2690 | 2204 @node Numeric to ASCII-hexadecimal conversion, Characters to be preserved, URI Encoding constants, CCL Example |
2640 | 2205 @subsubsection Numeric to ASCII-hexadecimal conversion |
2206 | |
Now, we have a couple of utility tables that wouldn't be necessary in
a programming language more expressive than CCL. The first is sixteen
in length, and maps a hexadecimal number to the ASCII encoding of that
number; so zero maps to ASCII `0', and ten maps to ASCII `A'. The second
does the reverse; that is, it maps an ASCII character to its value when
interpreted as a hexadecimal digit. (`A' => 10, `c' => 12, `2' => 2, as
a few examples.)
2214 | |
2215 @example | |
(defvar url-coding-hex-digit-table
  (let ((i 0)
        (val (make-vector 16 0)))
    (while (< i 16)
      (aset val i (char-to-int (aref (format "%X" i) 0)))
      (setq i (1+ i)))
    val)
  "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")

(defvar url-coding-latin-1-as-hex-table
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      ;; Get a hex val for this ASCII character.
      (aset val i (string-to-int (format "%c" i) 16))
      (setq i (1+ i)))
    val)
  "A map from Latin 1 code points to their values as hexadecimal digits.")
2234 @end example | |
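
The entries of the second table follow from how @code{string-to-int}
treats a one-character string in base 16. The sketch below evaluates a
few cases directly. Note the last one: on our reading, a character that
is not a hexadecimal digit yields 0, so the table cannot distinguish it
from the digit `0'.

@example
(list (string-to-int "A" 16)     ; 10
      (string-to-int "c" 16)     ; 12
      (string-to-int "2" 16)     ; 2
      (string-to-int "g" 16))    ; 0, because `g' is not a hex digit
@end example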
2235 | |
2690 | 2236 @node Characters to be preserved, The program to decode to internal format, Numeric to ASCII-hexadecimal conversion, CCL Example |
2640 | 2237 @subsubsection Characters to be preserved |
2238 | |
And finally, the last of these tables. URL encoding says that
alphanumeric characters, the underscore, hyphen and the full stop
@footnote{That's what the standards call it, though my North American
readers will be more familiar with it as the period character.} retain
their ASCII encoding, and don't undergo transformation.
@code{url-coding-should-preserve-table} is an array in which the entries
are one if the corresponding ASCII character should be left as-is, and
zero if they should be transformed. So the entries for all the control
and most of the punctuation characters are zero. Lisp programmers will
observe that this initialization is particularly inefficient, but
they'll also be aware that this is a long way from an inner loop where
every nanosecond counts.
2251 | |
2252 @example | |
(defvar url-coding-should-preserve-table
  (let ((preserve
         (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o
               ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
               ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
               ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
        (i 0)
        (res (make-vector 256 0)))
    (while (< i 256)
      (when (member (int-char i) preserve)
        (aset res i 1))
      (setq i (1+ i)))
    res)
  "A 256-entry array of flags, indicating whether or not to preserve an
octet as its ASCII encoding.")
2268 @end example | |
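
For comparison, here is one way the same flags could be computed more
directly, by walking the characters to preserve rather than testing all
256 octets for membership. This is a sketch of an alternative, not the
code used by the coding system; the variable names are ours.

@example
(let ((res (make-vector 256 0))
      (keep (concat "-_."
                    "abcdefghijklmnopqrstuvwxyz"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "0123456789"))
      (i 0))
  ;; Set the flag at the code point of each preserved character.
  (while (< i (length keep))
    (aset res (char-to-int (aref keep i)) 1)
    (setq i (1+ i)))
  ;; Spot-check one preserved and one transformed character.
  (list (aref res (char-to-int ?a))      ; 1
        (aref res (char-to-int ?%))))    ; 0
@end example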
2269 | |
2690 | 2270 @node The program to decode to internal format, The program to encode from internal format, Characters to be preserved, CCL Example |
2640 | 2271 @subsubsection The program to decode to internal format |
2272 | |
2273 After the almost interminable tables, we get to the CCL. The first | |
2274 CCL program, @code{ccl-decode-urlcoding} decodes from the URL coding to | |
2275 our internal format; since this version of CCL doesn't have support for | |
2276 error checking on the input, we don't do any verification on it. | |
2277 | |
2278 The buffer magnification--approximate ratio of the size of the output | |
2279 buffer to the size of the input buffer--is declared as one, because | |
2280 fractional values aren't allowed. (Since all those %20's will map to | |
2281 ` ', the length of the output text will be less than that of the input | |
2282 text.) | |
2283 | |
So, first we read an octet from the input buffer into register
@samp{r0}, to set up the loop. Next, we start the loop, with a
@code{(loop ...)} statement, and we check if the value in @samp{r0} is a
percentage sign. (Note the comma before
@code{url-coding-escape-character-code}; since the CCL program is
written as a backquoted Lisp list, a comma causes the form that follows
it to be evaluated, and as such,
``@code{,url-coding-escape-character-code}'' will be replaced by a
literal `37'.)
2292 | |
If it is a percentage sign, we read the next two octets into @samp{r2}
and @samp{r3}, and convert them into their hexadecimal numeric values,
using the @code{url-coding-latin-1-as-hex-table} array declared above.
(But again, it'll be interpreted as a literal array.) We then left
shift the first by four bits, OR the two together, and write the
result to the output buffer.
2299 | |
2300 If it isn't a percentage sign, and it is a `+' sign, we write a | |
2301 space--hexadecimal 20--to the output buffer. | |
2302 | |
2303 If none of those things are true, we pass the octet to the output buffer | |
2304 untransformed. (This could be a place to put error checking, in a more | |
2305 expressive language.) We then read one more octet from the input | |
2306 buffer, and move to the next iteration of the loop. | |
2307 | |
2308 @example | |
(define-ccl-program ccl-decode-urlcoding
  `(1
    ((read r0)
     (loop
       (if (r0 == ,url-coding-escape-character-code)
           ((read r2 r3)
            ;; Replace r2 and r3 with the values found at those offsets
            ;; in url-coding-latin-1-as-hex-table.
            (r2 = r2 ,url-coding-latin-1-as-hex-table)
            (r3 = r3 ,url-coding-latin-1-as-hex-table)
            (r2 <<= 4)
            (r3 |= r2)
            (write r3))
         (if (r0 == ,url-coding-escaped-space-code)
             (write #x20)
           (write r0)))
       (read r0)
       (repeat))))
  "CCL program to take URI-encoded ASCII text and transform it to our
internal encoding.")
2329 @end example | |
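
Once the tables and the program above have been evaluated, the decoder
can be tried on a string without creating a coding system. This sketch
assumes that @code{define-ccl-program} leaves the compiled program in
the variable @code{ccl-decode-urlcoding}; following the steps described
above, both the `+' and the @samp{%20} should come out as spaces.

@example
(ccl-execute-on-string ccl-decode-urlcoding
                       (make-vector 9 nil)   ; default registers and IC
                       "XEmacs+home%20page")
@end example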
2330 | |
2690 | 2331 @node The program to encode from internal format, The actual coding system, The program to decode to internal format, CCL Example |
2640 | 2332 @subsubsection The program to encode from internal format |
2333 | |
2334 Next, we see the CCL program to encode ASCII text as URL coded text. | |
2335 Here, the buffer magnification is specified as three, to account for ` ' | |
2336 mapping to %20, etc. As before, we read an octet from the input into | |
2337 @samp{r0}, and move into the body of the loop. Next, we check if we | |
2338 should preserve the value of this octet, by reading from offset | |
2339 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}. | |
2340 Then we have an @samp{if} statement predicated on the value in | |
2341 @samp{r1}; for the true branch, we write the input octet directly. For | |
2342 the false branch, we write a percentage sign, the ASCII encoding of the | |
2343 high four bits in hex, and then the ASCII encoding of the low four bits | |
2344 in hex. | |
2345 | |
2346 We then read an octet from the input into @samp{r0}, and repeat the loop. | |
2347 | |
2348 @example | |
(define-ccl-program ccl-encode-urlcoding
  `(3
    ((read r0)
     (loop
       (r1 = r0 ,url-coding-should-preserve-table)
       ;; If we should preserve the value, just write the octet directly.
       (if r1
           (write r0)
         ;; else, write a percentage sign, and the hex value of the octet, in
         ;; an ASCII-friendly format.
         ((write ,url-coding-escape-character-code)
          (write r0 ,url-coding-high-order-nybble-as-ascii)
          (write r0 ,url-coding-low-order-nybble-as-ascii)))
       (read r0)
       (repeat))))
  "CCL program to encode octets (almost) according to RFC 1738")
2365 @end example | |
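
The encoder can be exercised in the same way, again assuming that
@code{define-ccl-program} leaves the compiled program in the variable
of the same name. Since the space is not in
@code{url-coding-should-preserve-table}, it should be written as a
percentage sign followed by the two hex digits of @samp{#x20}; this
encoder never produces the `+' form.

@example
(ccl-execute-on-string ccl-encode-urlcoding
                       (make-vector 9 nil)   ; default registers and IC
                       "a b")
@end example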
428 | 2366 |
2690 | 2367 @node The actual coding system, , The program to encode from internal format, CCL Example |
2368 @subsubsection The actual coding system | |
2369 | |
To actually create the coding system, we call
@samp{make-coding-system}. The first argument is the symbol that is to
be the name of the coding system, in our case @samp{url-coding}. The
second specifies that the coding system is to be of type @samp{ccl}.
There are several other coding system types available; see the
documentation for @samp{make-coding-system} for the full list. Then
there's a documentation string describing the wherefore and caveats of
the coding system, and the final argument is a property list giving
information about the CCL programs and the coding system's mnemonic.
2380 | |
2381 @example | |
(make-coding-system
 'url-coding 'ccl
 "The coding used by application/x-www-form-urlencoded HTTP applications.
This coding form doesn't specify anything about non-ASCII characters, so
make sure you've transformed to a seven-bit coding system first."
 '(decode ccl-decode-urlcoding
   encode ccl-encode-urlcoding
   mnemonic "URLenc"))
2390 @end example | |
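
With the coding system defined, it can be used like any other. A brief
sketch, assuming the definitions above have been evaluated in an XEmacs
with MULE support:

@example
;; Decode URL-encoded text into ordinary text.
(decode-coding-string "XEmacs+home%20page" 'url-coding)

;; Encode ordinary text; characters outside the preserved set are
;; written as a percentage sign and two hex digits.
(encode-coding-string "XEmacs home page" 'url-coding)
@end example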
2391 | |
If you're lucky, the @samp{url-coding} coding system described here
should be available in the XEmacs package system. Otherwise, downloading
it from @samp{http://www.parhasard.net/url-coding.el} should work for
the foreseeable future.
2396 | |
775 | 2397 @node Category Tables, Unicode Support, CCL, MULE |
428 | 2398 @section Category Tables |
2399 | |
2400 A category table is a type of char table used for keeping track of | |
2401 categories. Categories are used for classifying characters for use in | |
440 | 2402 regexps---you can refer to a category rather than having to use a |
428 | 2403 complicated [] expression (and category lookups are significantly |
2404 faster). | |
2405 | |
2406 There are 95 different categories available, one for each printable | |
2407 character (including space) in the ASCII charset. Each category is | |
2408 designated by one such character, called a @dfn{category designator}. | |
2409 They are specified in a regexp using the syntax @samp{\cX}, where X is a | |
2410 category designator. (This is not yet implemented.) | |
2411 | |
2412 A category table specifies, for each character, the categories that | |
2413 the character is in. Note that a character can be in more than one | |
2414 category. More specifically, a category table maps from a character to | |
2415 either the value @code{nil} (meaning the character is in no categories) | |
2416 or a 95-element bit vector, specifying for each of the 95 categories | |
2417 whether the character is in that category. | |
2418 | |
2419 Special Lisp functions are provided that abstract this, so you do not | |
2420 have to directly manipulate bit vectors. | |
2421 | |
444 | 2422 @defun category-table-p object |
2423 This function returns @code{t} if @var{object} is a category table. | |
428 | 2424 @end defun |
2425 | |
2426 @defun category-table &optional buffer | |
2427 This function returns the current category table. This is the one | |
2428 specified by the current buffer, or by @var{buffer} if it is | |
2429 non-@code{nil}. | |
2430 @end defun | |
2431 | |
2432 @defun standard-category-table | |
2433 This function returns the standard category table. This is the one used | |
2434 for new buffers. | |
2435 @end defun | |
2436 | |
444 | 2437 @defun copy-category-table &optional category-table |
2438 This function returns a new category table which is a copy of | |
2439 @var{category-table}, which defaults to the standard category table. | |
428 | 2440 @end defun |
2441 | |
444 | 2442 @defun set-category-table category-table &optional buffer |
2443 This function selects @var{category-table} as the new category table for | |
2444 @var{buffer}. @var{buffer} defaults to the current buffer if omitted. | |
428 | 2445 @end defun |
2446 | |
444 | 2447 @defun category-designator-p object |
2448 This function returns @code{t} if @var{object} is a category designator (a | |
428 | 2449 char in the range @samp{' '} to @samp{'~'}). |
2450 @end defun | |
2451 | |
444 | 2452 @defun category-table-value-p object |
2453 This function returns @code{t} if @var{object} is a category table value. | |
428 | 2454 Valid values are @code{nil} or a bit vector of size 95. |
2455 @end defun | |
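
The following sketch strings several of these functions together. It
installs a private copy of the standard category table in a temporary
buffer, so that nothing outside that buffer is affected, and applies
the predicates described above. The variable name @code{table} is
ours.

@example
(with-temp-buffer
  (let ((table (copy-category-table)))   ; copy of the standard table
    (set-category-table table)           ; for this buffer only
    (list (category-table-p table)
          (eq (category-table) table)    ; is it now the buffer's table?
          (category-designator-p ?a)     ; within ' ' to '~'
          (category-designator-p ?\n)    ; a control character is not
          (category-table-value-p nil))))
@end example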
2456 | |
775 | 2457 |
2458 @c Added 2002-03-13 sjt | |
1183 | 2459 @node Unicode Support, Charset Unification, Category Tables, MULE |
775 | 2460 @section Unicode Support |
@cindex unicode
@cindex utf-8
@cindex utf-16
@cindex ucs-2
@cindex ucs-4
@cindex bmp
@cindex basic multilingual plane
2468 | |
2469 Unicode support was added by Ben Wing to XEmacs 21.5.6. | |
2470 | |
2471 @defun set-language-unicode-precedence-list list | |
2472 Set the language-specific precedence list used for Unicode decoding. | |
2473 This is a list of charsets, which are consulted in order for a translation | |
2474 matching a given Unicode character. If no matches are found, the charsets | |
2475 in the default precedence list (see | |
2476 @code{set-default-unicode-precedence-list}) are consulted, and then all | |
2477 remaining charsets, in some arbitrary order. | |
2478 | |
2479 The language-specific precedence list is meant to be set as part of the | |
2480 language environment initialization; the default precedence list is meant | |
2481 to be set by the user. | |
2482 @end defun | |
2483 | |
2484 @defun language-unicode-precedence-list | |
2485 Return the language-specific precedence list used for Unicode decoding. | |
2486 See @code{set-language-unicode-precedence-list} for more information. | |
2487 @end defun | |
2488 | |
2489 @defun set-default-unicode-precedence-list list | |
2490 Set the default precedence list used for Unicode decoding. | |
This is meant to be set by the user. See
@code{set-language-unicode-precedence-list} for more information.
2493 @end defun | |
2494 | |
2495 @defun default-unicode-precedence-list | |
2496 Return the default precedence list used for Unicode decoding. | |
2497 See @code{set-language-unicode-precedence-list} for more information. | |
2498 @end defun | |
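
As a minimal sketch of how the two precedence lists divide the work,
the following assumes a Mule-enabled XEmacs in which the charsets
@code{latin-iso8859-1} and @code{latin-iso8859-2} exist, and assumes
that charsets may be given by name (as symbols):

@example
;; Normally done by language-environment initialization:
;; prefer Latin-2 when a Unicode character is decoded.
(set-language-unicode-precedence-list '(latin-iso8859-2))

;; The user's own fallback order, consulted when the
;; language-specific list yields no translation.
(set-default-unicode-precedence-list
 '(latin-iso8859-1 latin-iso8859-2))

;; Inspect the current settings.
(list (language-unicode-precedence-list)
      (default-unicode-precedence-list))
@end example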
2499 | |
2500 @defun set-unicode-conversion character code | |
2501 Add conversion information between Unicode codepoints and characters. | |
2502 @var{character} is one of the following: | |
2503 | |
@itemize @bullet
@item
A character (in which case @var{code} must be a non-negative integer)
@item
A vector of characters (in which case @var{code} must be a vector of
non-negative integers of the same length)
@end itemize
2508 | |
2509 Values of @var{code} above 2^20 - 1 are allowed for the purpose of specifying | |
2510 private characters, but will cause errors when converted to UTF-16 or UTF-32. | |
2511 UCS-4 and UTF-8 can handle values to 2^31 - 1, but XEmacs Lisp integers top | |
2512 out at 2^30 - 1. | |
2513 @end defun | |
2514 | |
2515 @defun character-to-unicode character | |
2516 Convert @var{character} to Unicode codepoint. | |
2517 When there is no international support (i.e. MULE is not defined), | |
2518 this function simply does @code{char-to-int}. | |
2519 @end defun | |
2520 | |
@defun unicode-to-character code &optional charsets
2522 Convert Unicode codepoint @var{code} to character. | |
2523 @var{code} should be a non-negative integer. | |
2524 If @var{charsets} is given, it should be a list of charsets, and only those | |
2525 charsets will be consulted, in the given order, for a translation. | |
2526 Otherwise, the default ordering of all charsets will be given (see | |
2527 @code{set-unicode-charset-precedence}). | |
2528 | |
2529 When there is no international support (i.e. MULE is not defined), | |
2530 this function simply does @code{int-to-char} and ignores the | |
2531 @var{charsets} argument. | |
2532 @end defun | |
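
The two conversion functions are meant to be inverses wherever a
translation exists. The following is a small sketch; the choice of
characters is only illustrative, the result for non-ASCII codepoints
depends on the precedence lists in effect, and passing charsets by
name is an assumption.

@example
;; ASCII characters keep their ASCII code as Unicode codepoint.
(character-to-unicode ?A)

;; Round trip: codepoint -> character -> codepoint.
(character-to-unicode (unicode-to-character #xF3))

;; Restrict the lookup to an explicit, ordered list of charsets.
(unicode-to-character #xF3 '(latin-iso8859-2 latin-iso8859-1))
@end example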
2533 | |
@defun parse-unicode-translation-table filename charset &optional start end offset flags
2535 Parse Unicode translation data in @var{filename} for MULE @var{charset}. | |
2536 Data is text, in the form of one translation per line -- charset | |
codepoint followed by Unicode codepoint. Numbers are decimal or hex
(preceded by @samp{0x}). Comments are marked with a @samp{#}. Charset codepoints
2539 for two-dimensional charsets should have the first octet stored in the | |
2540 high 8 bits of the hex number and the second in the low 8 bits. | |
2541 | |
2542 If @var{start} and @var{end} are given, only charset codepoints within | |
2543 the given range will be processed. If @var{offset} is given, that value | |
2544 will be added to all charset codepoints in the file to obtain the | |
2545 internal charset codepoint. @var{start} and @var{end} apply to the | |
2546 codepoints in the file, before @var{offset} is applied. | |
2547 | |
2548 (Note that, as usual, we assume that octets are in the range 32 to | |
2549 127 or 33 to 126. If you have a table in kuten form, with octets in | |
the range 1 to 94, you will have to use an offset of 8224,
2551 i.e. 0x2020.) | |
2552 | |
2553 @var{flags}, if specified, control further how the tables are interpreted | |
2554 and are used to special-case certain known table weirdnesses in the | |
2555 Unicode tables: | |
2556 | |
2557 @table @code | |
@item ignore-first-column
2559 Exactly as it sounds. The JIS X 0208 tables have 3 columns of data instead | |
2560 of 2; the first is the Shift-JIS codepoint. | |
2561 | |
2562 @item big5 | |
The charset codepoint is a Big Five codepoint; convert it to the
proper hacked-up codepoint in @code{chinese-big5-1} or @code{chinese-big5-2}.
2565 @end table | |
2566 @end defun | |
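
The note on kuten tables in the description of
@code{parse-unicode-translation-table} can be checked with a little
arithmetic. In this sketch, row 16, cell 1 of a 94 by 94 charset is
stored in kuten form as @code{#x1001}; adding the offset @code{#x2020}
shifts each octet by 32 into the range 33 to 126. The file name in the
second form is hypothetical, and the charset @code{japanese-jisx0208}
is assumed to exist.

@example
;; Kuten 16-01, with the first octet in the high 8 bits.
(let ((kuten (+ (* 16 256) 1)))        ; #x1001
  ;; Apply the offset to obtain the internal charset codepoint.
  (format "#x%x" (+ kuten #x2020)))
     @result{} "#x3021"

;; Loading such a table, passing the offset explicitly.
(parse-unicode-translation-table "/path/to/kuten-table.txt"
                                 'japanese-jisx0208
                                 nil nil #x2020)
@end example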
2567 | |
1183 | 2568 |
2569 @node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE | |
2570 @section Character Set Unification | |
2571 | |
2572 Mule suffers from a design defect that causes it to consider the ISO | |
2573 Latin character sets to be disjoint. This results in oddities such as | |
2574 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO | |
2575 2022 control sequences to switch between them, as well as more plausible | |
2576 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2. | |
2577 This can be very annoying when sending messages or even in simple | |
2578 editing on a single host. Unification works around the problem by | |
2579 converting as many characters as possible to use a single Latin coded | |
2580 character set before saving the buffer. | |
2581 | |
2582 This node and its children were ripp'd untimely from | |
2583 @file{latin-unity.texi}, and have been quickly converted for use here. | |
2584 However as APIs are likely to diverge, beware of inaccuracies. Please | |
2585 report any you discover with @kbd{M-x report-xemacs-bug RET}, as well | |
2586 as any ambiguities or downright unintelligible passages. | |
2587 | |
2588 A lot of the stuff here doesn't belong here; it belongs in the | |
2589 @ref{Top, , , xemacs, XEmacs User's Manual}. Report those as bugs, | |
2590 too, preferably with patches. | |
2591 | |
2592 @menu | |
2593 * Overview:: Unification history and general information. | |
2594 * Usage:: An overview of the operation of Unification. | |
2595 * Configuration:: Configuring Unification for use. | |
2596 * Theory of Operation:: How Unification works. | |
2597 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. | |
2598 * Charsets and Coding Systems:: Reference lists with annotations. | |
1188 | 2599 * Unification Internals:: Utilities and implementation details. |
1183 | 2600 @end menu |
2601 | |
2602 @node Overview, Usage, Charset Unification, Charset Unification | |
2603 @subsection An Overview of Unification | |
2604 | |
2605 Mule suffers from a design defect that causes it to consider the ISO | |
2606 Latin character sets to be disjoint. This manifests itself when a user | |
2607 enters characters using input methods associated with different coded | |
2608 character sets into a single buffer. | |
2609 | |
2610 A very important example involves email. Many sites, especially in the | |
2611 U.S., default to use of the ISO 8859/1 coded character set (also called | |
2612 ``Latin 1,'' though these are somewhat different concepts). However, | |
2613 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the | |
2614 Euro has become the official currency of most countries in Europe, this | |
2615 is unsatisfactory (and in practice, useless). So Europeans generally | |
2616 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most | |
2617 languages, except that it substitutes EURO SIGN for CURRENCY SIGN. | |
2618 | |
2619 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
2620 into a message composition buffer, and enters some text including the | |
2621 Euro sign. Then Mule will consider the buffer to contain both ISO | |
2622 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
2623 programmed) send the message as a multipart mixed MIME body! | |
2624 | |
2625 This is clearly stupid. What is not as obvious is that, just as any | |
2626 European can include American English in their text because ASCII is a | |
2627 subset of ISO 8859/15, most European languages which use Latin | |
2628 characters (eg, German and Polish) can typically be mixed while using | |
2629 only one Latin coded character set (in this case, ISO 8859/2). However, | |
2630 this often depends on exactly what text is to be encoded. | |
2631 | |
2632 Unification works around the problem by converting as many characters as | |
2633 possible to use a single Latin coded character set before saving the | |
2634 buffer. | |
2635 | |
2636 @node Usage, Configuration, Overview, Charset Unification | |
2637 @subsection Operation of Unification | |
2638 | |
2639 Normally, Unification works in the background by installing | |
2640 @code{unity-sanity-check} on @code{write-region-pre-hook}. This is | |
2641 done by default for the ISO 8859 Latin family of character sets. The | |
2642 user activates this functionality for other character set families by | |
2643 invoking @code{enable-unification}, either interactively or in her | |
2644 init file. @xref{Init File, , , xemacs}. Unification can be | |
2645 deactivated by invoking @code{disable-unification}. | |
2646 | |
2647 Unification also provides a few functions for remapping or recoding the | |
2648 buffer by hand. To @dfn{remap} a character means to change the buffer | |
2649 representation of the character by using another coded character set. | |
2650 Remapping never changes the identity of the character, but may involve | |
2651 altering the code point of the character. To @dfn{recode} a character | |
2652 means to simply change the coded character set. Recoding never alters | |
2653 the code point of the character, but may change the identity of the | |
2654 character. @xref{Theory of Operation}. | |
2655 | |
2656 There are a few variables which determine which coding systems are | |
2657 always acceptable to Unification: @code{unity-ucs-list}, | |
2658 @code{unity-preferred-coding-system-list}, and | |
2659 @code{unity-preapproved-coding-system-list}. The latter two default | |
2660 to @code{()}, and should probably be avoided because they short-circuit | |
2661 the sanity check. If you find you need to use them, consider reporting | |
2662 it as a bug or request for enhancement. Because they seem unsafe, the | |
2663 recommended interface is likely to change. | |
2664 | |
2665 @menu | |
2666 * Basic Functionality:: User interface and customization. | |
2667 * Interactive Usage:: Treating text by hand. | |
2668 Also documents the hook function(s). | |
2669 @end menu | |
2670 | |
2671 | |
2672 @node Basic Functionality, Interactive Usage, , Usage | |
@subsubsection Basic Functionality
2674 | |
2675 These functions and user options initialize and configure Unification. | |
2676 In normal use, none of these should be needed. | |
2677 | |
2678 @strong{These APIs are certain to change.} | |
2679 | |
2680 @defun enable-unification | |
2681 Set up hooks and initialize variables for latin-unity. | |
2682 | |
2683 There are no arguments. | |
2684 | |
2685 This function is idempotent. It will reinitialize any hooks or variables | |
2686 that are not in initial state. | |
2687 @end defun | |
2688 | |
2689 @defun disable-unification | |
2690 There are no arguments. | |
2691 | |
2692 Clean up hooks and void variables used by latin-unity. | |
2693 @end defun | |
2694 | |
2695 @defopt unity-ucs-list | |
2696 List of coding systems considered to be universal. | |
2697 | |
2698 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}. | |
2699 | |
2700 Order matters; coding systems earlier in the list will be preferred when | |
2701 recommending a coding system. These coding systems will not be used | |
2702 without querying the user (unless they are also present in | |
2703 @code{unity-preapproved-coding-system-list}), and follow the | |
2704 @code{unity-preferred-coding-system-list} in the list of suggested | |
2705 coding systems. | |
2706 | |
2707 If none of the preferred coding systems are feasible, the first in | |
2708 this list will be the default. | |
2709 | |
2710 Notes on certain coding systems: @code{escape-quoted} is a special | |
2711 coding system used for autosaves and compiled Lisp in Mule. You should | |
2712 @c #### fix in latin-unity.texi | |
2713 never delete this, although it is rare that a user would want to use it | |
directly. Unification does not try to be ``smart'' about other general
2715 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized | |
2716 as equivalent to @code{iso-2022-7}.) If your preferred coding system is | |
2717 one of these, you may consider adding it to @code{unity-ucs-list}. | |
2718 However, this will typically have the side effect that (eg) ISO 8859/1 | |
2719 files will be saved in 7-bit form with ISO 2022 escape sequences. | |
2720 @end defopt | |
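
For instance, a user whose preferred coding system is ISO-2022-JP might
extend @code{unity-ucs-list} as follows. This is only a sketch; it
assumes the coding system is named @code{iso-2022-jp} on your system,
and it has the side effect on ISO 8859/1 files described above.

@example
;; Append, so the existing universal coding systems keep priority.
(setq unity-ucs-list
      (append unity-ucs-list '(iso-2022-jp)))
@end example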
2721 | |
2722 Coding systems which are not Latin and not in | |
2723 @code{unity-ucs-list} are handled by short circuiting checks of | |
2724 coding system against the next two variables. | |
2725 | |
2726 @defopt unity-preapproved-coding-system-list | |
2727 List of coding systems used without querying the user if feasible. | |
2728 | |
2729 The default value is @samp{(buffer-default preferred)}. | |
2730 | |
2731 The first feasible coding system in this list is used. The special values | |
2732 @samp{preferred} and @samp{buffer-default} may be present: | |
2733 | |
2734 @table @code | |
2735 @item buffer-default | |
2736 Use the coding system used by @samp{write-region}, if feasible. | |
2737 | |
2738 @item preferred | |
2739 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
2740 @end table | |
2741 | |
``Feasible'' means that all characters in the buffer can be represented by
2743 the coding system. Coding systems in @samp{unity-ucs-list} are | |
2744 always considered feasible. Other feasible coding systems are computed | |
2745 by @samp{unity-representations-feasible-region}. | |
2746 | |
2747 Note that the first universal coding system in this list shadows all | |
2748 other coding systems. In particular, if your preferred coding system is | |
2749 a universal coding system, and @code{preferred} is a member of this | |
2750 list, unification will blithely convert all your files to that coding | |
2751 system. This is considered a feature, but it may surprise most users. | |
2752 Users who don't like this behavior should put @code{preferred} in | |
2753 @code{unity-preferred-coding-system-list}. | |
2754 @end defopt | |
2755 | |
2756 @defopt unity-preferred-coding-system-list | |
2757 @c #### fix in latin-unity.texi | |
2758 List of coding systems suggested to the user if feasible. | |
2759 | |
2760 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 | |
2761 iso-8859-4 iso-8859-9)}. | |
2762 | |
2763 If none of the coding systems in | |
2764 @c #### fix in latin-unity.texi | |
2765 @code{unity-preapproved-coding-system-list} are feasible, this list | |
2766 will be recommended to the user, followed by the | |
2767 @code{unity-ucs-list}. The first coding system in this list is default. The | |
2768 special values @samp{preferred} and @samp{buffer-default} may be | |
2769 present: | |
2770 | |
2771 @table @code | |
2772 @item buffer-default | |
2773 Use the coding system used by @samp{write-region}, if feasible. | |
2774 | |
2775 @item preferred | |
2776 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
2777 @end table | |
2778 | |
``Feasible'' means that all characters in the buffer can be represented by
2780 the coding system. Coding systems in @samp{unity-ucs-list} are | |
2781 always considered feasible. Other feasible coding systems are computed | |
2782 by @samp{unity-representations-feasible-region}. | |
2783 @end defopt | |
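
A sketch of the arrangement suggested above for users who do not want
their preferred coding system applied silently: @code{preferred} is
kept out of the preapproved list and placed at the head of the
suggested list. The particular Latin coding systems listed are only
illustrative.

@example
;; Only the buffer's own coding system is used without asking.
(setq unity-preapproved-coding-system-list '(buffer-default))

;; Everything else is merely suggested, in this order.
(setq unity-preferred-coding-system-list
      '(preferred iso-8859-1 iso-8859-15 iso-8859-2))
@end example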
2784 | |
2785 | |
2786 @defvar unity-iso-8859-1-aliases | |
2787 List of coding systems to be treated as aliases of ISO 8859/1. | |
2788 | |
The default value is @code{'(iso-8859-1)}.
2790 | |
This is not a user variable; to customize input of coding systems or
charsets, use @samp{unity-coding-system-alias-alist} or
@samp{unity-charset-alias-alist}.
2794 @end defvar | |
2795 | |
2796 | |
2797 @node Interactive Usage, , Basic Functionality, Usage | |
@subsubsection Interactive Usage
2799 | |
2800 First, the hook function @code{unity-sanity-check} is documented. | |
2801 (It is placed here because it is not an interactive function, and there | |
2802 is not yet a programmer's section of the manual.) | |
2803 | |
2804 These functions provide access to internal functionality (such as the | |
2805 remapping function) and to extra functionality (the recoding functions | |
2806 and the test function). | |
2807 | |
2808 | |
2809 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system | |
2810 | |
2811 Check if @var{coding-system} can represent all characters between | |
2812 @var{begin} and @var{end}. | |
2813 | |
2814 For compatibility with old broken versions of @code{write-region}, | |
2815 @var{coding-system} defaults to @code{buffer-file-coding-system}. | |
2816 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are | |
2817 ignored. | |
2818 | |
Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
2820 Latin. If @code{buffer-file-coding-system} is safe for the charsets | |
2821 actually present in the buffer, return it. Otherwise, ask the user to | |
2822 choose a coding system, and return that. | |
2823 | |
2824 This function does @emph{not} do the safe thing when | |
2825 @code{buffer-file-coding-system} is nil (aka no-conversion). It | |
2826 considers that ``non-Latin,'' and passes it on to the Mule detection | |
2827 mechanism. | |
2828 | |
2829 This function is intended for use as a @code{write-region-pre-hook}. It | |
2830 does nothing except return @var{coding-system} if @code{write-region} | |
2831 handlers are inhibited. | |
2832 @end defun | |
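
Normally @code{enable-unification} installs the hook for you. Done by
hand, and assuming the ordinary @code{add-hook} and @code{remove-hook}
interface applies to @code{write-region-pre-hook}, the installation
would look roughly like this:

@example
;; Run the sanity check before each region is written.
(add-hook 'write-region-pre-hook 'unity-sanity-check)

;; And to undo it again.
(remove-hook 'write-region-pre-hook 'unity-sanity-check)
@end example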
2833 | |
2834 @defun unity-buffer-representations-feasible | |
2835 | |
2836 There are no arguments. | |
2837 | |
Apply @code{unity-region-representations-feasible} to the current buffer.
2839 @end defun | |
2840 | |
2841 @defun unity-region-representations-feasible begin end &optional buf | |
2842 | |
2843 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}. | |
2844 | |
2845 @var{buf} defaults to the current buffer. Called interactively, will be | |
2846 applied to the region. Function assumes @var{begin} <= @var{end}. | |
2847 | |
2848 The return value is a cons. The car is the list of character sets | |
2849 that can individually represent all of the non-ASCII portion of the | |
2850 buffer, and the cdr is the list of character sets that can | |
2851 individually represent all of the ASCII portion. | |
2852 | |
2853 The following is taken from a comment in the source. Please refer to | |
2854 the source to be sure of an accurate description. | |
2855 | |
2856 The basic algorithm is to map over the region, compute the set of | |
2857 charsets that can represent each character (the ``feasible charset''), | |
2858 and take the intersection of those sets. | |
2859 | |
2860 The current implementation takes advantage of the fact that ASCII | |
characters are common and cannot change asciisets. Using
@code{skip-chars-forward} then makes motion over ASCII subregions very fast.
2863 | |
2864 This same strategy could be applied generally by precomputing classes | |
2865 of characters equivalent according to their effect on latinsets, and | |
2866 adding a whole class to the skip-chars-forward string once a member is | |
2867 found. | |
2868 | |
2869 Probably efficiency is a function of the number of characters matched, | |
2870 or maybe the length of the match string? With @code{skip-category-forward} | |
2871 over a precomputed category table it should be really fast. In practice | |
2872 for Latin character sets there are only 29 classes. | |
2873 @end defun | |
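
The basic algorithm can be sketched in a few lines of Lisp. This is
not the actual implementation: the function names and the table format
(an alist from characters to lists of charset names) are hypothetical,
and the sketch works on a string rather than on a buffer region.

@example
(defun my-intersection (a b)
  "Return the elements of list A that are also members of list B."
  (let ((result nil))
    (while a
      (if (memq (car a) b)
          (setq result (cons (car a) result)))
      (setq a (cdr a)))
    (nreverse result)))

(defun my-feasible-charsets (string table candidates)
  "Return the members of CANDIDATES able to represent all of STRING.
TABLE maps characters to the charsets containing them.  Characters
absent from TABLE (such as ASCII) do not restrict the result."
  (let ((feasible candidates)
        (i 0))
    (while (< i (length string))
      (let ((entry (assq (aref string i) table)))
        (if entry
            (setq feasible
                  (my-intersection feasible (cdr entry)))))
      (setq i (1+ i)))
    feasible))

;; Toy data: ?x and ?y stand in for two non-ASCII characters.
(my-feasible-charsets
 "axby"
 '((?x . (latin-1 latin-2 latin-9))
   (?y . (latin-2 latin-9)))
 '(latin-1 latin-2 latin-9))
     @result{} (latin-2 latin-9)
@end example

Skipping characters that are not in the table corresponds to the
shortcut over ASCII subregions mentioned above.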
2874 | |
2875 @defun unity-remap-region begin end character-set &optional coding-system | |
2876 | |
2877 Remap characters between @var{begin} and @var{end} to equivalents in | |
2878 @var{character-set}. Optional argument @var{coding-system} may be a | |
2879 coding system name (a symbol) or nil. Characters with no equivalent are | |
2880 left as-is. | |
2881 | |
2882 When called interactively, @var{begin} and @var{end} are set to the | |
2883 beginning and end, respectively, of the active region, and the function | |
2884 prompts for @var{character-set}. The function does completion, knows | |
2885 how to guess a character set name from a coding system name, and also | |
2886 provides some common aliases. See @code{unity-guess-charset}. | |
2887 There is no way to specify @var{coding-system}, as it has no useful | |
2888 function interactively. | |
2889 | |
2890 Return @var{coding-system} if @var{coding-system} can encode all | |
characters in the region, @code{t} if @var{coding-system} is @code{nil} and the coding
system with G0 = @code{ascii} and G1 = @var{character-set} can encode all
characters, and otherwise @code{nil}. Note that a non-null return does
2894 @emph{not} mean it is safe to write the file, only the specified region. | |
2895 (This behavior is useful for multipart MIME encoding and the like.) | |
2896 | |
2897 Note: by default this function is quite fascist about universal coding | |
2898 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and | |
2899 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change | |
2900 this. | |
2901 | |
2902 This function remaps characters that are artificially distinguished by Mule | |
2903 internal code. It may change the code point as well as the character set. | |
2904 To recode characters that were decoded in the wrong coding system, use | |
2905 @code{unity-recode-region}. | |
2906 @end defun | |
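
A non-interactive call might look like the following sketch, which
assumes the charset @code{latin-iso8859-15} exists on your system:

@example
;; Remap the whole buffer into ISO 8859/15 where possible.
;; Characters with no equivalent there are left as they were.
(unity-remap-region (point-min) (point-max) 'latin-iso8859-15)
@end example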
2907 | |
2908 @defun unity-recode-region begin end wrong-cs right-cs | |
2909 | |
2910 Recode characters between @var{begin} and @var{end} from @var{wrong-cs} | |
2911 to @var{right-cs}. | |
2912 | |
2913 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain | |
2914 the same code point but the character set is changed. Only characters | |
2915 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the | |
2916 character may change. Note that this could be dangerous, if characters | |
2917 whose identities you do not want changed are included in the region. | |
2918 This function cannot guess which characters you want changed, and which | |
2919 should be left alone. | |
2920 | |
2921 When called interactively, @var{begin} and @var{end} are set to the | |
2922 beginning and end, respectively, of the active region, and the function | |
2923 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
2924 completion, knows how to guess a character set name from a coding system | |
2925 name, and also provides some common aliases. See | |
2926 @code{unity-guess-charset}. | |
2927 | |
2928 Another way to accomplish this, but using coding systems rather than | |
2929 character sets to specify the desired recoding, is | |
2930 @samp{unity-recode-coding-region}. That function may be faster | |
2931 but is somewhat more dangerous, because it may recode more than one | |
2932 character set. | |
2933 | |
2934 To change from one Mule representation to another without changing identity | |
2935 of any characters, use @samp{unity-remap-region}. | |
2936 @end defun | |
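
For example, text that was decoded as ISO 8859/1 but was really ISO
8859/15 could be repaired along these lines (a sketch; the charset
names are assumed to exist on your system):

@example
;; Code points are kept and only the charset changes, so
;; CURRENCY SIGN becomes EURO SIGN, and likewise for the other
;; positions where the two charsets differ.
(unity-recode-region (point-min) (point-max)
                     'latin-iso8859-1 'latin-iso8859-15)
@end example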
2937 | |
2938 @defun unity-recode-coding-region begin end wrong-cs right-cs | |
2939 | |
2940 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to | |
2941 @var{right-cs}. | |
2942 | |
2943 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain | |
2944 the same code point but the character set is changed. The identity of | |
2945 characters may change. This is an inherently dangerous function; | |
2946 multilingual text may be recoded in unexpected ways. #### It's also | |
2947 dangerous because the coding systems are not sanity-checked in the | |
2948 current implementation. | |
2949 | |
2950 When called interactively, @var{begin} and @var{end} are set to the | |
2951 beginning and end, respectively, of the active region, and the function | |
2952 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
2953 completion, knows how to guess a coding system name from a character set | |
2954 name, and also provides some common aliases. See | |
2955 @code{unity-guess-coding-system}. | |
2956 | |
2957 Another, safer, way to accomplish this, using character sets rather | |
2958 than coding systems to specify the desired recoding, is to use | |
2959 @c #### fixme in latin-unity.texi | |
2960 @code{unity-recode-region}. | |
2961 | |
2962 To change from one Mule representation to another without changing identity | |
2963 of any characters, use @code{unity-remap-region}. | |
2964 @end defun | |
2965 | |
2966 Helper functions for input of coding system and character set names. | |
2967 | |
2968 @defun unity-guess-charset candidate | |
2969 Guess a charset based on the symbol @var{candidate}. | |
2970 | |
2971 @var{candidate} itself is not tried as the value. | |
2972 | |
2973 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
the values in @samp{unity-charset-alias-alist}.
2975 @end defun | |
2976 | |
2977 @defun unity-guess-coding-system candidate | |
2978 Guess a coding system based on the symbol @var{candidate}. | |
2979 | |
2980 @var{candidate} itself is not tried as the value. | |
2981 | |
2982 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
the values in @samp{unity-coding-system-alias-alist}.
2984 @end defun | |
2985 | |
2986 @defun unity-example | |
2987 | |
2988 A cheesy example for Unification. | |
2989 | |
At present it just makes a multilingual buffer. To test, @code{setq}
@code{buffer-file-coding-system} to some value, make the buffer dirty (e.g.,
with @kbd{RET BackSpace}), and save.
2993 @end defun | |
2994 | |
2995 | |
2996 @node Configuration, Theory of Operation, Usage, Charset Unification | |
2997 @subsection Configuring Unification for Use | |
2998 | |
2999 If you want Unification to be automatically initialized, invoke | |
3000 @samp{enable-unification} with no arguments in your init file. | |
3001 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs | |
3002 earlier than 21.1, you should also load @file{auto-autoloads} using the | |
3003 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries). | |
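
As a minimal sketch, in the simplest case the init file needs only one
form:

@example
;; Initialize Unification at startup.
(enable-unification)
@end example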
3004 | |
3005 You may wish to define aliases for commonly used character sets and | |
3006 coding systems for convenience in input. | |
3007 | |
3008 @defopt unity-charset-alias-alist | |
Alist mapping aliases to Mule charset names (symbols).
3010 | |
3011 The default value is | |
3012 @example | |
3013 ((latin-1 . latin-iso8859-1) | |
3014 (latin-2 . latin-iso8859-2) | |
3015 (latin-3 . latin-iso8859-3) | |
3016 (latin-4 . latin-iso8859-4) | |
3017 (latin-5 . latin-iso8859-9) | |
3018 (latin-9 . latin-iso8859-15) | |
3019 (latin-10 . latin-iso8859-16)) | |
3020 @end example | |
3021 | |
3022 If a charset does not exist on your system, it will not complete and you | |
3023 will not be able to enter it in response to prompts. A real charset | |
3024 with the same name as an alias in this list will shadow the alias. | |
3025 @end defopt | |
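
A sketch of adding a personal alias; the alias name @code{l1} is made
up for illustration:

@example
;; Let `l1' be accepted wherever a charset name is prompted for.
(setq unity-charset-alias-alist
      (cons '(l1 . latin-iso8859-1) unity-charset-alias-alist))
@end example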
3026 | |
@defopt unity-coding-system-alias-alist
3028 Alist mapping aliases to Mule coding system names (symbols). | |
3029 | |
3030 The default value is @samp{nil}. | |
3031 @end defopt | |
3032 | |
3033 | |
3034 @node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification | |
3035 @subsection Theory of Operation | |
3036 | |
3037 Standard encodings suffer from the design defect that they do not | |
provide a reliable way to recognize which coded character sets are in use.
3039 @xref{What Unification Cannot Do for You}. There are scores of | |
3040 character sets which can be represented by a single octet (8-bit byte), | |
3041 whose union contains many hundreds of characters. Obviously this | |
3042 results in great confusion, since you can't tell the players without a | |
3043 scorecard, and there is no scorecard. | |
3044 | |
3045 There are two ways to solve this problem. The first is to create a | |
3046 universal coded character set. This is the concept behind Unicode. | |
3047 However, there have been satisfactory (nearly) universal character sets | |
3048 for several decades, but even today many Westerners resist using Unicode | |
3049 because they consider its space requirements excessive. On the other | |
3050 hand, Asians dislike Unicode because they consider it to be incomplete. | |
3051 (This is partly, but not entirely, political.) | |
3052 | |
3053 In any case, Unicode only solves the internal representation problem. | |
3054 Many data sets will contain files in ``legacy'' encodings, and Unicode | |
3055 does not help distinguish among them. | |
3056 | |
3057 The second approach is to embed information about the encodings used in | |
3058 a document in its text. This approach is taken by the ISO 2022 | |
standard. This would solve the problem completely from the users' point of
3060 view, except that ISO 2022 is basically not implemented at all, in the | |
3061 sense that few applications or systems implement more than a small | |
3062 subset of ISO 2022 functionality. This is due to the fact that | |
3063 mono-literate users object to the presence of escape sequences in their | |
3064 texts (which they, with some justification, consider data corruption). | |
3065 Programmers are more than willing to cater to these users, since | |
3066 implementing ISO 2022 is a painstaking task. | |
3067 | |
3068 In fact, Emacs/Mule adopts both of these approaches. Internally it uses | |
3069 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022 | |
3070 techniques both to save files in forms robust to encoding issues, and as | |
3071 hints when attempting to ``guess'' an unknown encoding. However, Mule | |
3072 suffers from a design defect, namely it embeds the character set | |
3073 information that ISO 2022 attaches to runs of characters by introducing | |
3074 them with a control sequence in each character. That causes Mule to | |
3075 consider the ISO Latin character sets to be disjoint. This manifests | |
3076 itself when a user enters characters using input methods associated with | |
3077 different coded character sets into a single buffer. | |
3078 | |
3079 There are two problems stemming from this design. First, Mule | |
1188 | 3080 represents the same character in different ways. Abstractly, 'ó' |
1183 | 3081 (LATIN SMALL LETTER O WITH ACUTE) can get represented as |
3082 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like | |
1188 | 3083 'óó' in the display might actually be represented [latin-iso8859-1 |
1183 | 3084 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B |
3085 #xF3 ESC - A] in the file. In some cases this treatment would be | |
3086 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 | |
3087 (the CJK ideographic character meaning ``one'')), and although arguably | |
3088 incorrect it is convenient when mixing the CJK scripts. But in the case | |
3089 of the Latin scripts this is wrong. | |
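
The duplication can be observed directly. The following sketch assumes
a Mule build that keeps the charset in each character, and uses the
Mule functions @code{make-char} and @code{char-charset}; in such a
build the two characters are expected to be distinct even though both
display as the same letter.

@example
(let ((c1 (make-char 'latin-iso8859-1 #x73))
      (c2 (make-char 'latin-iso8859-2 #x73)))
  ;; Compare the characters, then show the charset of each.
  (list (eq c1 c2)
        (char-charset c1)
        (char-charset c2)))
@end example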
3090 | |
3091 Worse yet, it is very likely to occur when mixing ``different'' encodings | |
3092 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code | |
3093 points that are almost never used. A very important example involves | |
3094 email. Many sites, especially in the U.S., default to use of the ISO | |
3095 8859/1 coded character set (also called ``Latin 1,'' though these are | |
3096 somewhat different concepts). However, ISO 8859/1 provides a generic | |
3097 CURRENCY SIGN character. Now that the Euro has become the official | |
3098 currency of most countries in Europe, this is unsatisfactory (and in | |
3099 practice, useless). So Europeans generally use ISO 8859/15, which is | |
3100 nearly identical to ISO 8859/1 for most languages, except that it | |
3101 substitutes EURO SIGN for CURRENCY SIGN. | |
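
To see how little the two encodings differ, one can compare them byte by
byte. This is a Python sketch using Python's own @samp{iso8859_1} and
@samp{iso8859_15} codecs (an assumption of this example, not part of Mule);
the helper name @samp{differing_code_points} is hypothetical.

```python
import unicodedata

def differing_code_points(codec_a, codec_b):
    """Return (byte, char_a, char_b) for each byte the codecs decode differently."""
    result = []
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        char_a = raw.decode(codec_a)
        char_b = raw.decode(codec_b)
        if char_a != char_b:
            result.append((value, char_a, char_b))
    return result

diffs = differing_code_points("iso8859_1", "iso8859_15")
for value, char_a, char_b in diffs:
    print(hex(value), unicodedata.name(char_a), "->", unicodedata.name(char_b))
```

The first entry printed is the replacement of CURRENCY SIGN by EURO SIGN at
#xA4; every other byte outside the short list is read identically by both
encodings.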
3102 | |
3103 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
3104 into a message composition buffer, and enters some text including the | |
3105 Euro sign. Then Mule will consider the buffer to contain both ISO | |
3106 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
3107 programmed) send the message as a multipart mixed MIME body! | |
3108 | |
3109 This is clearly stupid. What is not as obvious is that, just as any | |
3110 European can include American English in their text because ASCII is a | |
3111 subset of ISO 8859/15, most European languages which use Latin | |
3112 characters (e.g., German and Polish) can typically be mixed while using | |
3113 only one Latin coded character set (in the case of German and Polish, | |
3114 ISO 8859/2). However, this often depends on exactly what text is to be | |
3115 encoded (even for the same pair of languages). | |
3116 | |
3117 Unification works around the problem by converting as many characters as | |
3118 possible to use a single Latin coded character set before saving the | |
3119 buffer. | |
3120 | |
3121 Because the problem is rarely noticeable in editing a buffer, but tends |
1183 | 3122 to manifest when that buffer is exported to a file or process, the |
3123 Unification package uses the strategy of examining the buffer prior to | |
3124 export. If use of multiple Latin coded character sets is detected, | |
3125 Unification attempts to unify them by finding a single coded character | |
3126 set which contains all of the Latin characters in the buffer. | |
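
The search for a single coded character set can be sketched as follows.
This is only an illustration in Python, not the Unification package's
algorithm: the candidate list uses Python codec names chosen to mirror the
Latin charsets this manual lists, and @samp{unifying_codecs} is a
hypothetical helper.

```python
CANDIDATES = [
    "iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4",
    "iso8859_9", "iso8859_13", "iso8859_15", "iso8859_16",
]

def unifying_codecs(text, candidates=CANDIDATES):
    """Return the candidate codecs able to encode every character of text."""
    found = []
    for codec in candidates:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this set
        found.append(codec)
    return found

both_o = unifying_codecs("\u00f3\u00f3")        # two o-acute characters
with_euro = unifying_codecs("caf\u00e9 \u20ac")  # Latin text plus EURO SIGN
with_cyrillic = unifying_codecs("\u00f3\u0436")  # o-acute plus a Cyrillic letter

print(both_o)
print(with_euro)
print(with_cyrillic)
```

An empty result corresponds to the case where no single Latin set suffices
and a universal encoding is the only safe choice.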
3127 | |
3128 The primary purpose of Unification is to fix the problem by giving the | |
3129 user the choice to change the representation of all characters to one | |
3130 character set, and to give sensible recommendations based on context. In | |
1188 | 3131 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and |
1183 | 3132 both will be suggested. In the EURO SIGN example, only ISO 8859/15 |
3133 makes sense, and that is what will be recommended. In both cases, the | |
3134 user will be reminded that there are universal encodings available. | |
3135 | |
3136 I call this @dfn{remapping} (from the universal character set to a | |
3137 particular ISO 8859 coded character set). It is mere accident that 'ó' | |
3138 has the same code point in both character sets. (Not entirely, | |
3139 but there are many examples of Latin characters that have different code | |
3140 points in different Latin-X sets.) | |
3141 | |
1188 | 3142 Note that, in the 'ó' example, treating the buffer in this way will
1183 | 3143 result in a representation such as [latin-iso8859-2 |
3144 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. | |
3145 This is guaranteed to occasionally result in the second problem, to | |
3146 which we now turn. | |
3147 | |
3148 This problem is that, although the file is intended to be an | |
3149 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX | |
3150 compliant program---this is required by the standard, obvious if you | |
3151 think a bit, @pxref{What Unification Cannot Do for You}) will read that | |
3152 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this | |
3153 is no problem if all of the characters in the file are contained in ISO | |
3154 8859/1, but suppose there are some which are not, but are contained in | |
3155 the (intended) ISO 8859/2. | |
3156 | |
3157 You now want to fix this, but not by finding the same character in | |
3158 another set. Instead, you want to simply change the character set that | |
3159 Mule associates with that buffer position without changing the code. | |
3160 (This is conceptually somewhat distinct from the first problem, and | |
3161 logically ought to be handled in the code that defines coding systems. | |
3162 However, unification is not an unreasonable place for it.) Unification | |
3163 provides two functions (one fast and dangerous, the other slow and | |
3164 careful) to handle this. I call this @dfn{recoding}, because the | |
3165 transformation actually involves @emph{encoding} the buffer to file | |
3166 representation, then @emph{decoding} it to buffer representation (in a | |
3167 different character set). This cannot be done automatically because | |
3168 Mule can have no idea what the correct encoding is---after all, it | |
3169 already gave you its best guess. @xref{What Unification Cannot Do for | |
3170 You}. So these functions must be invoked by the user. @xref{Interactive | |
3171 Usage}. | |
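
The encode-then-decode step can be illustrated outside of Mule. The sketch
below is Python, not the Unification functions themselves; @samp{recode} is
a hypothetical helper and the codec names are Python's. It takes a Polish
word saved as ISO 8859/2, reads it as an ISO 8859/1 locale would, and then
recodes it.

```python
def recode(text, read_as, intended):
    """Re-encode text to the bytes it was read from, then decode as intended."""
    return text.encode(read_as).decode(intended)

polish = "\u0142\u00f3d\u017a"            # the intended text
file_bytes = polish.encode("iso8859_2")    # what the ISO 8859/2 file contains
misread = file_bytes.decode("iso8859_1")   # what an ISO 8859/1 locale sees
fixed = recode(misread, "iso8859_1", "iso8859_2")

print(misread, "->", fixed)
```

Note that the bytes never change; only the character set used to interpret
them does. The caller has to supply the intended encoding, which is exactly
the information the program cannot guess.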
3172 | |
3173 | |
3174 @node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification | |
3175 @subsection What Unification Cannot Do for You | |
3176 | |
3177 Unification @strong{cannot} save you if you insist on exporting data in | |
3178 8-bit encodings in a multilingual environment. @emph{You will | |
3179 eventually corrupt data if you do this.} It is not Mule's, or any | |
3180 application's, fault. You will have only yourself to blame; consider | |
3181 yourself warned. (It is true that Mule has bugs, which make Mule | |
3182 somewhat more dangerous and inconvenient than some naive applications. | |
3183 We're working to address those, but no application can remedy the | |
3184 inherent defect of 8-bit encodings.) | |
3185 | |
3186 Use standard universal encodings, preferably Unicode (UTF-8) unless | |
3187 applicable standards indicate otherwise. The most important such case | |
3188 is Internet messages, where MIME should be used, whether or not the | |
3189 subordinate encoding is a universal encoding. (Note that since one of | |
3190 the important provisions of MIME is the @samp{Content-Type} header, | |
3191 which has the charset parameter, MIME is to be considered a universal | |
3192 encoding for the purposes of this manual. Of course, technically | |
3193 speaking it's neither a coded character set nor a coding extension | |
3194 technique compliant with ISO 2022.) | |
3195 | |
3196 As mentioned earlier, the problem is that standard encodings suffer from | |
3197 the design defect that they do not provide a reliable way to recognize | |
3198 which coded character sets are in use. There are scores of character | |
3199 sets which can be represented by a single octet (8-bit byte), whose | |
3200 union contains many hundreds of characters. Thus any 8-bit coded | |
3201 character set must use some code points that other coded character | |
3202 sets use for different characters. | |
3203 | |
3204 This means that a given file's intended encoding cannot be identified | |
3205 with 100% reliability unless it contains encoding markers such as those | |
3206 provided by MIME or ISO 2022. | |
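
A small Python sketch (again using Python's codec names as stand-ins, which
is an assumption of this example) shows why: the same four octets are valid
text in several 8-bit encodings, and nothing in the octets says which
reading was intended.

```python
# Four octets with no encoding marker attached.
sample = bytes([0xB3, 0xF3, 0x64, 0xBC])

readings = {}
for codec in ("iso8859_1", "iso8859_2", "iso8859_15"):
    # Each of these codecs assigns a character to every one of these octets,
    # so decoding succeeds regardless of what the author meant.
    readings[codec] = sample.decode(codec)

for codec, text in readings.items():
    print(codec, text)
```

All three decodings succeed and all three differ, so a recognizer can at
best rank them by plausibility.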
3207 | |
3208 Unification actually makes it more likely that you will have problems of | |
3209 this kind. Traditionally Mule has been ``helpful'' by simply using an | |
3210 ISO 2022 universal coding system when the current buffer coding system | |
3211 cannot handle all the characters in the buffer. This has the effect | |
3212 that, because the file contains control sequences, it is not recognized | |
3213 as being in the locale's normal 8-bit encoding. It may be annoying if | |
3214 you are not a Mule expert, but your data is automatically recoverable | |
3215 with a tool you already have: Mule. | |
3216 | |
3217 However, with unification, Mule converts to a single 8-bit character set | |
3218 when possible. But typically this will @emph{not} be in your usual | |
3219 locale. I.e., the times that an ISO 8859/1 user will need Unification are | |
3220 when there are ISO 8859/2 characters in the buffer. But then most | |
3221 likely the file will be saved in a pure 8-bit encoding that is not ISO | |
3222 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the | |
3223 most sophisticated yet available) cannot tell the difference between ISO | |
3224 8859/1 and ISO 8859/2, and in a Western European locale will choose the | |
3225 former even though the latter was intended. Even the extension | |
3226 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at | |
3227 all accurate in the case of mixed codes. | |
3228 | |
3229 So now consider adding some additional ISO 8859/1 text to the buffer. | |
3230 If it includes any ISO 8859/1 codes that are used by different | |
3231 characters in ISO 8859/2, you now have a file that cannot be | |
3232 mechanically disentangled. You need a human being who can recognize | |
3233 that @emph{this is German and Swedish} and stays in Latin-1, while | |
3234 @emph{that is Polish} and needs to be recoded to Latin-2. | |
3235 | |
3236 Moral: switch to a universal coded character set, preferably Unicode | |
3237 using the UTF-8 transformation format. If you really need the space, | |
3238 compress your files. | |
3239 | |
3240 | |
3241 @node Unification Internals, , What Unification Cannot Do for You, Charset Unification | |
3242 @subsection Internals | |
3243 | |
3244 No internals documentation yet. | |
3245 | |
3246 @file{unity-utils.el} provides one utility function. | |
3247 | |
3248 @defun unity-dump-tables | |
3249 | |
3250 Dump the temporary table created by loading @file{unity-utils.el} | |
3251 to @file{unity-tables.el}. Loading the latter file initializes | |
3252 @samp{unity-equivalences}. | |
3253 @end defun | |
3254 | |
3255 | |
3256 @node Charsets and Coding Systems, , Charset Unification, MULE | |
3257 @section Charsets and Coding Systems | |
3258 | |
3259 This section provides reference lists of Mule charsets and coding | |
3260 systems. Mule charsets are typically named by character set and | |
3261 standard. | |
3262 | |
3263 @table @strong | |
3264 @item ASCII variants | |
3265 | |
3266 Identification of equivalent characters in these sets is not properly | |
3267 implemented. Unification does not distinguish the two charsets. | |
3268 | |
3269 @samp{ascii} @samp{latin-jisx0201} | |
3270 | |
3271 @item Extended Latin | |
3272 | |
3273 Characters from the following ISO 2022 conformant charsets are | |
3274 identified with equivalents in other charsets in the group by | |
3275 Unification. | |
3276 | |
3277 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2} | |
3278 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9} | |
3279 @samp{latin-iso8859-13} @samp{latin-iso8859-16} | |
3280 | |
3281 The following charsets are Latin variants which are not understood by | |
3282 Unification. In addition, many of the Asian language standards provide | |
3283 ASCII, at least, and sometimes other Latin characters. None of these | |
3284 are identified with their ISO 8859 equivalents. | |
3285 | |
3286 @samp{vietnamese-viscii-lower} | |
3287 @samp{vietnamese-viscii-upper} | |
3288 | |
3289 @item Other character sets | |
3290 | |
3291 @samp{arabic-1-column} | |
3292 @samp{arabic-2-column} | |
3293 @samp{arabic-digit} | |
3294 @samp{arabic-iso8859-6} | |
3295 @samp{chinese-big5-1} | |
3296 @samp{chinese-big5-2} | |
3297 @samp{chinese-cns11643-1} | |
3298 @samp{chinese-cns11643-2} | |
3299 @samp{chinese-cns11643-3} | |
3300 @samp{chinese-cns11643-4} | |
3301 @samp{chinese-cns11643-5} | |
3302 @samp{chinese-cns11643-6} | |
3303 @samp{chinese-cns11643-7} | |
3304 @samp{chinese-gb2312} | |
3305 @samp{chinese-isoir165} | |
3306 @samp{cyrillic-iso8859-5} | |
3307 @samp{ethiopic} | |
3308 @samp{greek-iso8859-7} | |
3309 @samp{hebrew-iso8859-8} | |
3310 @samp{ipa} | |
3311 @samp{japanese-jisx0208} | |
3312 @samp{japanese-jisx0208-1978} | |
3313 @samp{japanese-jisx0212} | |
3314 @samp{katakana-jisx0201} | |
3315 @samp{korean-ksc5601} | |
3316 @samp{sisheng} | |
3317 @samp{thai-tis620} | |
3318 @samp{thai-xtis} | |
3319 | |
3320 @item Non-graphic charsets | |
3321 | |
3322 @samp{control-1} | |
3323 @end table | |
3324 | |
3325 @table @strong | |
3326 @item No conversion | |
3327 | |
3328 Some of these coding systems may specify EOL conventions. Note that | |
3329 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 | |
3330 coding system. Although unification attempts to compensate for this, it | |
3331 is possible that the @samp{iso-8859-1} coding system will behave | |
3332 differently from other ISO 8859 coding systems. | |
3333 | |
3334 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} | |
3335 | |
3336 @item Latin coding systems | |
3337 | |
3338 These coding systems are all single-byte, 8-bit ISO 2022 coding systems, | |
3339 combining ASCII in the GL register (bytes with high-bit clear) and an | |
3340 extended Latin character set in the GR register (bytes with high-bit set). | |
3341 | |
3342 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} | |
3343 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} | |
3344 | |
3345 These coding systems are single-byte, 8-bit coding systems that do not | |
3346 conform to international standards. They should be avoided in all | |
3347 potentially multilingual contexts, including any text distributed over | |
3348 the Internet and World Wide Web. | |
3349 | |
3350 @samp{windows-1251} | |
3351 | |
3352 @item Multilingual coding systems | |
3353 | |
3354 The following ISO-2022-based coding systems are useful for multilingual | |
3355 text. | |
3356 | |
3357 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} | |
3358 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} | |
3359 | |
3360 XEmacs also supports Unicode with the Mule-UCS package. The Unicode | |
3361 coding systems are preferred for multilingual use. (There is a possible | |
3362 exception for texts that mix several Asian ideographic character sets.) | |
3363 | |
3364 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} | |
3365 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} | |
3366 @samp{utf-8} @samp{utf-8-ws} | |
3367 | |
3368 Development versions of XEmacs (the 21.5 series) support Unicode | |
3369 internally, with (at least) the following coding systems implemented: | |
3370 | |
3371 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} | |
3372 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} | |
3373 | |
3374 @item Asian ideographic languages | |
3375 | |
3376 The following coding systems are based on ISO 2022, and are more or less | |
3377 suitable for encoding multilingual texts. They all can represent ASCII | |
3378 at least, and sometimes several other foreign character sets, without | |
3379 resort to arbitrary ISO 2022 designations. However, these subsets are | |
3380 not identified with the corresponding national standards in XEmacs Mule. | |
3381 | |
3382 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} | |
3383 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} | |
3384 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} | |
3385 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} | |
3386 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} | |
3387 | |
3388 The following coding systems cannot be used for general multilingual | |
3389 text and do not cooperate well with other coding systems. | |
3390 | |
3391 @samp{big5} @samp{shift_jis} | |
3392 | |
3393 @item Other languages | |
3394 | |
3395 The following coding systems are based on ISO 2022. Though none of them | |
3396 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up | |
3397 to 21.4 defaults to) use of ISO 2022 control sequences to designate | |
3398 other character sets for inclusion in the text. | |
3399 | |
3400 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8} | |
3401 @samp{ctext-hebrew} | |
3402 | |
3403 The following coding systems do not conform to ISO 2022 and | |
3404 thus cannot be safely used in a multilingual context. | |
3405 | |
3406 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr} | |
3407 @samp{viscii} @samp{vscii} | |
3408 | |
3409 @item Special coding systems | |
3410 | |
3411 Mule uses the following coding systems for special purposes. | |
3412 | |
3413 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted} | |
3414 | |
3415 @samp{escape-quoted} is especially important, as it is used internally | |
3416 as the coding system for autosaved data. | |
3417 | |
3418 The following coding systems are aliases for others, and are used for | |
3419 communication with the host operating system. | |
3420 | |
3421 @samp{file-name} @samp{keyboard} @samp{terminal} | |
3422 | |
3423 @end table | |
3424 | |
3425 Mule detection of coding systems is actually limited to detection of | |
3426 classes of coding systems called @dfn{coding categories}. These coding | |
3427 categories are identified by the ISO 2022 control sequences they use, if | |
3428 any, by their conformance to ISO 2022 restrictions on code points that | |
3429 may be used, and by characteristic patterns of use of 8-bit code points. | |
3430 | |
3431 @samp{no-conversion} | |
3432 @samp{utf-8} | |
3433 @samp{ucs-4} | |
3434 @samp{iso-7} | |
3435 @samp{iso-lock-shift} | |
3436 @samp{iso-8-1} | |
3437 @samp{iso-8-2} | |
3438 @samp{iso-8-designate} | |
3439 @samp{shift-jis} | |
3440 @samp{big5} | |
3441 | |
3442 | |
3443 @c end of mule.texi | |
3444 |