annotate man/lispref/mule.texi @ 5888:a85efdabe237
Call #'read-passwd when requesting a password from the user, tls.c
src/ChangeLog addition:
2015-04-09 Aidan Kehoe <kehoea@parhasard.net>
* tls.c (nss_pk11_password):
* tls.c (gnutls_pk11_password):
* tls.c (openssl_password):
* tls.c (syms_of_tls):
Our read-a-password function is #'read-passwd, not
#'read-password, correct that in this file.
| author | Aidan Kehoe <kehoea@parhasard.net> |
|---|---|
| date | Thu, 09 Apr 2015 14:54:37 +0100 |
| parents | 9fae6227ede5 |
| children |
| rev | line source |
|---|---|
| 428 | 1 @c -*-texinfo-*- |
| 2 @c This is part of the XEmacs Lisp Reference Manual. | |
| 775 | 3 @c Copyright (C) 1996 Ben Wing, 2001-2002 Free Software Foundation. |
| 428 | 4 @c See the file lispref.texi for copying conditions. |
| 5 @setfilename ../../info/internationalization.info | |
| 5791 | 6 @node MULE, Tips, Internationalization, Top |
| 428 | 7 @chapter MULE |
| 8 | |
| 442 | 9 @dfn{MULE} is the name originally given to the version of GNU Emacs |
| 428 | 10 extended for multi-lingual (and in particular Asian-language) support. |
| 442 | 11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and |
| 12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the | |
| 13 Japanese word for ``Japan''), which only provided support for Japanese. | |
| 14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since | |
| 15 it is based on @dfn{MULE}. | |
| 428 | 16 |
| 17 @menu | |
| 18 * Internationalization Terminology:: | |
| 19 Definition of various internationalization terms. | |
| 20 * Charsets:: Sets of related characters. | |
| 21 * MULE Characters:: Working with characters in XEmacs/MULE. | |
| 22 * Composite Characters:: Making new characters by overstriking other ones. | |
| 23 * Coding Systems:: Ways of representing a string of chars using integers. | |
| 24 * CCL:: A special language for writing fast converters. | |
| 25 * Category Tables:: Subdividing charsets into groups. | |
| 775 | 26 * Unicode Support:: The universal coded character set. |
| 1183 | 27 * Charset Unification:: Handling overlapping character sets. |
| 28 * Charsets and Coding Systems:: Tables and reference information. | |
| 428 | 29 @end menu |
| 30 | |
| 442 | 31 @node Internationalization Terminology, Charsets, , MULE |
| 428 | 32 @section Internationalization Terminology |
| 33 | |
| 442 | 34 In internationalization terminology, a string of text is divided up |
| 428 | 35 into @dfn{characters}, which are the printable units that make up the |
| 36 text. A single character is (for example) a capital @samp{A}, the | |
| 442 | 37 number @samp{2}, a Katakana character, a Hangul character, a Kanji |
| 38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is | |
| 39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there | |
| 40 are thousands of such ideographs in each language), etc. The basic | |
| 41 property of a character is that it is the smallest unit of text with | |
| 1261 | 42 semantic significance in text processing---i.e., characters are abstract |
| 43 units defined by their meaning, not by their exact appearance. | |
| 442 | 44 |
| 45 Human beings normally process text visually, so to a first approximation | |
| 46 a character may be identified with its shape. Note that the same | |
| 47 character may be drawn by two different people (or in two different | |
| 48 fonts) in slightly different ways, although the "basic shape" will be the | |
| 49 same. But consider the works of Scott Kim; human beings can recognize | |
| 50 hugely variant shapes as the "same" character. Sometimes, especially | |
| 51 where characters are extremely complicated to write, completely | |
| 52 different shapes may be defined as the "same" character in national | |
| 53 standards. The Taiwanese variant of Hanzi is generally the most | |
| 444 | 54 complicated; over the centuries, the Japanese, Koreans, and the People's |
| 442 | 55 Republic of China have adopted simplifications of the shape, but the |
| 56 line of descent from the original shape is recorded, and the meanings | |
| 57 and pronunciation of different forms of the same character are | |
| 58 considered to be identical within each language. (Of course, it may | |
| 59 take a specialist to recognize the related form; the point is that the | |
| 60 relations are standardized, despite the differing shapes.) | |
| 428 | 61 |
| 62 In some cases, the differences will be significant enough that it is | |
| 63 actually possible to identify two or more distinct shapes that both | |
| 64 represent the same character. For example, the lowercase letters | |
| 440 | 65 @samp{a} and @samp{g} each have two distinct possible shapes---the |
| 428 | 66 @samp{a} can optionally have a curved tail projecting off the top, and |
| 67 the @samp{g} can be formed either of two loops, or of one loop and a | |
| 68 tail hanging off the bottom. Such distinct possible shapes of a | |
| 69 character are called @dfn{glyphs}. The important characteristic of two | |
| 70 glyphs making up the same character is that the choice between one or | |
| 71 the other is purely stylistic and has no linguistic effect on a word | |
| 72 (this is the reason why a capital @samp{A} and lowercase @samp{a} | |
| 440 | 73 are different characters rather than different glyphs---e.g. |
| 428 | 74 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). |
| 75 | |
| 76 Note that @dfn{character} and @dfn{glyph} are used differently | |
| 77 here than elsewhere in XEmacs. | |
| 78 | |
| 442 | 79 A @dfn{character set} is essentially a set of related characters. ASCII, |
| 428 | 80 for example, is a set of 94 characters (or 128, if you count |
| 81 non-printing characters). Other character sets are ISO8859-1 (ASCII | |
| 82 plus various accented characters and other international symbols), | |
| 442 | 83 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 |
| 84 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), | |
| 428 | 85 GB2312 (Mainland Chinese Hanzi), etc. |
| 86 | |
| 442 | 87 The definition of a character set will implicitly or explicitly give |
| 88 it an @dfn{ordering}, a way of assigning a number to each character in | |
| 89 the set. For many character sets, there is a natural ordering, for | |
| 90 example the ``ABC'' ordering of the Roman letters. But it is not clear | |
| 91 whether digits should come before or after the letters, and in fact | |
| 92 different European languages treat the ordering of accented characters | |
| 93 differently. It is useful to use the natural order where available, of | |
| 94 course. The number assigned to any particular character is called the | |
| 95 character's @dfn{code point}. (Within a given character set, each | |
| 96 character has a unique code point. Thus the word "set" is ill-chosen; | |
| 97 different orderings of the same characters are different character sets. | |
| 98 Identifying characters is simple enough for alphabetic character sets, | |
| 99 but the difference in ordering can cause great headaches when the same | |
| 100 thousands of characters are used by different cultures as in the Hanzi.) | |
| 428 | 101 |
| 1261 | 102 It's important to understand that a character is defined not by any |
| 103 number attached to it, but by its meaning. For example, ASCII and | |
| 104 EBCDIC are two charsets containing exactly the same characters | |
| 105 (lowercase and uppercase letters, numbers 0 through 9, particular | |
| 106 punctuation marks) but with different numberings. The @samp{comma} | |
| 107 character in ASCII and EBCDIC, for instance, is the same character | |
| 108 despite having a different numbering. Conversely, when comparing ASCII | |
| 109 and JIS-Roman, which look the same except that the latter has a yen sign | |
| 110 substituted for the backslash, we would say that the backslash and yen | |
| 111 sign are @emph{not} the same characters, despite having the same number | |
| 112 (92) and despite the fact that all other characters are present in both | |
| 113 charsets, with the same numbering. ASCII and JIS-Roman, then, do | |
| 114 @emph{not} have exactly the same characters in them (ASCII has a | |
| 115 backslash character but no yen-sign character, and vice-versa for | |
| 116 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII | |
| 117 and JIS-Roman are closer. | |
| 118 | |
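
The ASCII and EBCDIC comparison above can be sketched in Python. This is only an illustration: it assumes Python's standard `cp037` codec as a representative EBCDIC code page (EBCDIC has many variants), and the helper name `code_points` is made up for the example.

```python
# Same characters, different numberings: ASCII vs. one EBCDIC code page.
# Assumption: cp037 (EBCDIC US/Canada) stands in for "EBCDIC" here.
def code_points(text, codec):
    """Return the numeric code of each character of `text` in `codec`."""
    return list(text.encode(codec))

sample = "A,z"
ascii_codes = code_points(sample, "ascii")
ebcdic_codes = code_points(sample, "cp037")

for ch, a, e in zip(sample, ascii_codes, ebcdic_codes):
    print(f"{ch!r}: ASCII {a:#04x}, EBCDIC (cp037) {e:#04x}")

# Decoding each byte string with its own codec recovers the same characters,
# which is the sense in which the two charsets contain "the same" comma.
assert bytes(ascii_codes).decode("ascii") == bytes(ebcdic_codes).decode("cp037")
```

The numbers differ for every character in the sample, yet the decoded text is identical, so the identity of a character cannot rest on its number.
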
| 119 Sometimes, a code point is not a single number, but instead a group of | |
| 120 numbers, called @dfn{position codes}. In such cases, the number of | |
| 121 position codes required to index a particular character in a character | |
| 122 set is called the @dfn{dimension} of the character set. Character sets | |
| 123 indexed by more than one position code typically use byte-sized position | |
| 124 codes. Small character sets, e.g. ASCII, invariably use a single | |
| 125 position code, but for larger character sets, the choice of whether to | |
| 126 use multiple position codes or a single large (16-bit or 32-bit) number | |
| 127 is arbitrary. Unicode typically uses a single large number, but | |
| 128 language-specific or "national" character sets often use multiple | |
| 129 (usually two) position codes. For example, JIS X 0208, i.e. Japanese | |
| 130 Kanji, has thousands of characters, and is of dimension two -- every | |
| 131 character is indexed by two position codes, each in the range 1 through | |
| 132 94. (This number ``94'' is not a coincidence; it is the same as the | |
| 133 number of printable characters in ASCII, and was chosen so that JIS | |
| 134 characters could be directly encoded using two printable ASCII | |
| 135 characters.) Note that the choice of the range here is somewhat | |
| 136 arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. | |
| 137 In fact, the range for JIS position codes (and for other character sets | |
| 138 modeled after it) is often given as range 33 through 126, so as to | |
| 139 directly match ASCII printing characters. | |
| 428 | 140 |
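
A minimal Python sketch of position codes for a dimension-2 character set with 94 characters per dimension. The helper names `linear_index` and `rebase` are hypothetical, chosen for illustration only; the sample character (Hiragana "a" at row 4, cell 2 of JIS X 0208) is used purely as a convenient data point.

```python
# Position codes for a dimension-2 charset with 94 characters per dimension.
CHARS = 94

def linear_index(row, cell):
    """Collapse two position codes (each 1..94) into one number in 0..8835."""
    for code in (row, cell):
        if not 1 <= code <= CHARS:
            raise ValueError(f"position code {code} is outside 1..{CHARS}")
    return (row - 1) * CHARS + (cell - 1)

def rebase(codes, new_first, old_first=1):
    """Express the same position codes in a range that starts at new_first."""
    return tuple(code - old_first + new_first for code in codes)

hiragana_a = (4, 2)                      # row 4, cell 2 in JIS X 0208
print(linear_index(*hiragana_a))         # position within the 94x94 grid
print(rebase(hiragana_a, new_first=0))   # the 0..93 convention
print(rebase(hiragana_a, new_first=33))  # the 33..126 convention
```

The choice of range changes only the arithmetic offset, not which character is meant, which is why the text calls it arbitrary.
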
| 141 An @dfn{encoding} is a way of numerically representing characters from | |
| 142 one or more character sets into a stream of like-sized numerical values | |
| 1261 | 143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or |
| 2818 | 144 32-bit quantities. In a context where dealing with Japanese motivates |
| 145 much of XEmacs' design in this area, it's important to clearly | |
| 146 distinguish between charsets and encodings. For a simple charset like | |
| 147 ASCII, there is only one encoding normally used -- each character is | |
| 148 represented by a single byte, with the same value as its code point. | |
| 149 For more complicated charsets, however, or when a single encoding needs | |
| 150 to represent more than one charset, things are not so obvious. Unicode | |
| 151 version 2, for example, is a large charset with thousands of characters, | |
| 152 each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 | |
| 153 for the Hebrew letter "aleph". One obvious encoding (actually two | |
| 154 encodings, depending on which of the two possible byte orderings is | |
| 155 chosen) simply uses two bytes per character. This encoding is | |
| 156 convenient for internal processing of Unicode text; however, it's | |
| 157 incompatible with ASCII, and thus external text (files, e-mail, etc.) | |
| 158 that is encoded this way is completely uninterpretable by programs | |
| 159 lacking Unicode support. For this reason, a different, ASCII-compatible | |
| 160 encoding, e.g. UTF-8, is usually used for external text. UTF-8 | |
| 161 represents Unicode characters with one to three bytes (often extended to | |
| 162 six bytes to handle characters with up to 31-bit indices). Unicode | |
| 163 characters 00 to 7F (identical with ASCII) are directly represented with | |
| 164 one byte, and other characters with two or more bytes, each in the range | |
| 165 80 to FF. Applications that don't understand Unicode will still be able | |
| 166 to process ASCII characters represented in UTF-8-encoded text, and will | |
| 167 typically ignore (and hopefully preserve) the high-bit characters. | |
| 168 | |
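
The difference between the two-byte encodings and UTF-8 can be checked with Python's built-in codecs. This is a sketch using the aleph example from the text; the codec names `utf-16-be` and `utf-16-le` are Python's labels for the two byte orderings, not terms used by this manual.

```python
aleph = "\u05d0"   # Hebrew letter aleph, code point 0x05D0

big_endian = aleph.encode("utf-16-be")     # high-order byte first
little_endian = aleph.encode("utf-16-le")  # low-order byte first
utf8 = aleph.encode("utf-8")

print(big_endian.hex(), little_endian.hex(), utf8.hex())

mixed = ("A" + aleph).encode("utf-8")
# ASCII survives unchanged as a single byte ...
assert mixed[:1] == b"A"
# ... and every byte of the non-ASCII character has the high bit set.
assert all(byte >= 0x80 for byte in mixed[1:])
```

A program that knows only ASCII can still find the `A` in the UTF-8 bytes, whereas in either two-byte form the `A` would be accompanied by a zero byte.
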
| 169 Similarly, Shift-JIS and EUC-JP are different encodings normally used to | |
| 170 encode the same character set(s), these character sets being subsets of | |
| 171 Unicode. However, the obvious approach of unifying XEmacs' internal | |
| 172 encoding across character sets, as was part of the motivation behind | |
| 173 Unicode, wasn't taken. This means that characters in these character | |
| 174 sets that are identical to characters in other character sets---for | |
| 175 example, the Greek alphabet is in the large Japanese character sets and | |
| 176 at least one European character set---are unfortunately disjoint. | |
| 1261 | 177 |
| 178 Naive use of code points is also not possible if more than one | |
| 179 character set is to be used in the encoding. For example, printed | |
| 442 | 180 Japanese text typically requires characters from multiple character sets |
| 181 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | |
| 1261 | 182 indexed using one or more position codes in the range 1 through 94 (or |
| 183 33 through 126), so the position codes could not be used directly or | |
| 184 there would be no way to tell which character was meant. Different | |
| 185 Japanese encodings handle this differently -- JIS uses special escape | |
| 186 characters to denote different character sets; EUC sets the high bit of | |
| 187 the position codes for JIS X 0208 and JIS X 0212, and puts a special | |
| 188 extra byte before each JIS X 0212 character; etc. | |
| 189 | |
| 190 The encodings described above are all 7-bit or 8-bit encodings. The | |
| 191 fixed-width Unicode encoding previously described, however, is sometimes | |
| 192 considered to be a 16-bit encoding, in which case the issue of byte | |
| 193 ordering does not come up. (Imagine, for example, that the text is | |
| 194 represented as an array of shorts.) Similarly, Unicode version 3 (which | |
| 195 has characters with indices above 0xFFFF), and other very large | |
| 196 character sets, may be represented internally as 32-bit encodings, | |
| 197 i.e. arrays of ints. However, it does not make too much sense to talk | |
| 198 about 16-bit or 32-bit encodings for external data, since nowadays 8-bit | |
| 199 data is a universal standard -- the closest you can get is fixed-width | |
| 200 encodings using two or four bytes to encode 16-bit or 32-bit values. (A | |
| 201 "7-bit" encoding is used when it cannot be guaranteed that the high bit | |
| 202 of 8-bit data will be correctly preserved. Some e-mail gateways, for | |
| 203 example, strip the high bit of text passing through them. These same | |
| 204 gateways often handle non-printable characters incorrectly, and so 7-bit | |
| 205 encodings usually avoid using bytes with such values.) | |
| 442 | 206 |
| 207 A general method of handling text using multiple character sets | |
| 208 (whether for multilingual text, or simply text in an extremely | |
| 209 complicated single language like Japanese) is defined in the | |
| 210 international standard ISO 2022. ISO 2022 will be discussed in more | |
| 211 detail later (@pxref{ISO 2022}), but for now suffice it to say that text | |
| 212 needs control functions (at least spacing), and if escape sequences are | |
| 213 to be used, an escape sequence introducer. It was decided to make all | |
| 214 text streams compatible with ASCII in the sense that the codes 0--31 | |
| 215 (and 128--159) would always be control codes, never graphic characters, | |
| 216 and where defined by the character set the @samp{SPC} character would be | |
| 217 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are | |
| 218 94 code points remaining if 7 bits are used. This is the reason that | |
| 219 most character sets are defined using position codes in the range 1 | |
| 220 through 94. Then ISO 2022 compatible encodings are produced by shifting | |
| 221 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit | |
| 222 codes are available) into character codes 161 to 254. | |
| 428 | 223 |
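
The shifting step described above can be sketched in Python. The function name `shift_position_codes` is hypothetical, and the cross-check against Python's `euc_jp` codec is only a convenient way to confirm the 8-bit form for one JIS X 0208 character.

```python
def shift_position_codes(codes, eight_bit=False):
    """Shift position codes 1..94 into 33..126 (7-bit) or 161..254 (8-bit)."""
    if any(not 1 <= code <= 94 for code in codes):
        raise ValueError("position codes must be in 1..94")
    offset = 160 if eight_bit else 32
    return bytes(code + offset for code in codes)

seven_bit = shift_position_codes((4, 2))
eight_bit = shift_position_codes((4, 2), eight_bit=True)
print(seven_bit.hex(), eight_bit.hex())

# The 8-bit form is the 7-bit form with the high bit set on each byte.
assert eight_bit == bytes(byte | 0x80 for byte in seven_bit)

# Neither form collides with control codes 0..31 and 128..159,
# nor with space (32) or DEL (127).
reserved = set(range(0, 33)) | {127} | set(range(128, 160))
assert not (set(seven_bit) | set(eight_bit)) & reserved
```
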
| 224 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In | |
| 442 | 225 a @dfn{modal encoding}, there are multiple states that the encoding can |
| 226 be in, and the interpretation of the values in the stream depends on the | |
| 428 | 227 current global state of the encoding. Special values in the encoding, |
| 228 called @dfn{escape sequences}, are used to change the global state. | |
| 229 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} | |
| 230 indicate that, from then on, bytes are to be interpreted as position | |
| 442 | 231 codes for JIS X 0208, rather than as ASCII. This effect is cancelled |
| 428 | 232 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the |
| 442 | 233 current state is to ASCII''. To switch to JIS X 0212, the escape |
| 234 sequence @samp{ESC $ ( D} is used. (Note that here, as is common, the escape | |
| 235 sequences do in fact begin with @samp{ESC}. This is not necessarily the | |
| 236 case, however. Some encodings use control characters called "locking | |
| 237 shifts" (effect persists until cancelled) to switch character sets.) | |
| 428 | 238 |
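
A small Python sketch of the modal behaviour just described. It assumes Python's `iso2022_jp` codec as a stand-in for what the text calls JIS, and the `segments` helper is a hypothetical, deliberately incomplete walker that recognizes only the two three-byte escape sequences `ESC $ B` and `ESC ( B`.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"   # from here on, bytes are JIS X 0208 position codes
TO_ASCII = ESC + b"(B"      # switch back to ASCII

def segments(data):
    """Split a byte stream into (mode, payload) runs by tracking the state."""
    modes = {TO_JISX0208: "jisx0208", TO_ASCII: "ascii"}
    mode, payload, result, i = "ascii", bytearray(), [], 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk in modes:
            if payload:
                result.append((mode, bytes(payload)))
            mode, payload = modes[chunk], bytearray()
            i += 3
        else:
            payload.append(data[i])
            i += 1
    if payload:
        result.append((mode, bytes(payload)))
    return result

encoded = "A\u3042B".encode("iso2022_jp")   # 'A', Hiragana "a", 'B'
print(encoded)
print(segments(encoded))
```

The payload bytes in the middle run are ordinary printable ASCII values; only the state set by the preceding escape sequence says they are position codes.
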
| 442 | 239 A @dfn{non-modal encoding} has no global state that extends past the |
| 428 | 240 character currently being interpreted. EUC, for example, is a |
| 442 | 241 non-modal encoding. Characters in JIS X 0208 are encoded by setting |
| 242 the high bit of the position codes, and characters in JIS X 0212 are | |
| 428 | 243 encoded by doing the same but also prefixing the character with the |
| 244 byte 0x8F. | |
| 245 | |
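
The EUC rule above is simple enough to write out by hand. The following Python sketch uses a hypothetical helper, `euc_jp_bytes`, that takes position codes in the range 1 through 94; it covers only the two charsets named in the paragraph and is not a complete EUC-JP encoder.

```python
def euc_jp_bytes(position_codes, charset):
    """Encode one character's two position codes (1..94) as described above."""
    if any(not 1 <= code <= 94 for code in position_codes):
        raise ValueError("position codes must be in 1..94")
    # Shift into 33..126, then set the high bit of each byte.
    body = bytes((code + 32) | 0x80 for code in position_codes)
    if charset == "jisx0208":
        return body
    if charset == "jisx0212":
        return b"\x8f" + body          # extra prefix byte for JIS X 0212
    raise ValueError(f"unsupported charset: {charset}")

hiragana_a = euc_jp_bytes((4, 2), "jisx0208")
print(hiragana_a.hex())
print(euc_jp_bytes((16, 1), "jisx0212").hex())

# Cross-check the JIS X 0208 case against Python's own codec.
assert hiragana_a == "\u3042".encode("euc_jp")
```

Because each character carries its own charset marker (high bits, optional prefix), a decoder needs no state beyond the character it is currently reading.
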
| 246 The advantage of a modal encoding is that it is generally more | |
| 442 | 247 space-efficient, and is easily extendible because there are essentially |
| 428 | 248 an arbitrary number of escape sequences that can be created. The |
| 249 disadvantage, however, is that it is much more difficult to work with | |
| 250 if it is not being processed in a sequential manner. In the non-modal | |
| 251 EUC encoding, for example, the byte 0x41 always refers to the letter | |
| 252 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or | |
| 442 | 253 one of the two position codes in a JIS X 0208 character, or one of the |
| 254 two position codes in a JIS X 0212 character. Determining exactly which | |
| 428 | 255 one is meant could be difficult and time-consuming if the previous |
| 442 | 256 bytes in the string have not already been processed, or impossible if |
| 257 they are drawn from an external stream that cannot be rewound. | |
| 428 | 258 |
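
The ambiguity of the byte 0x41 can be demonstrated with Python's codecs. As before, `iso2022_jp` is assumed as a stand-in for JIS; the byte strings are hand-built for illustration.

```python
plain = b"$A"                    # two ASCII characters
shifted = b"\x1b$B$A\x1b(B"      # the same two bytes, but after ESC $ B

as_ascii = plain.decode("iso2022_jp")
as_jisx0208 = shifted.decode("iso2022_jp")
print(repr(as_ascii), repr(as_jisx0208))

# In EUC-JP, by contrast, a 0x41 byte can only ever be the letter A,
# because every byte of a multi-byte character has the high bit set.
euc = ("A" + as_jisx0208 + "A").encode("euc_jp")
positions_of_0x41 = [i for i, byte in enumerate(euc) if byte == 0x41]
print(euc.hex(), positions_of_0x41)
```

A reader dropped into the middle of the JIS stream at the `$A` bytes has no way to tell the two cases apart without scanning back to the last escape sequence.
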
| 259 Non-modal encodings are further divided into @dfn{fixed-width} and | |
| 260 @dfn{variable-width} formats. A fixed-width encoding always uses | |
| 261 the same number of words per character, whereas a variable-width | |
| 262 encoding does not. EUC is a good example of a variable-width | |
| 263 encoding: one to three bytes are used per character, depending on | |
| 264 the character set. 16-bit and 32-bit encodings are nearly always | |
| 265 fixed-width, and this is in fact one of the main reasons for using | |
| 266 an encoding with a larger word size. The advantages of fixed-width | |
| 267 encodings should be obvious. The advantages of variable-width | |
| 268 encodings are that they are generally more space-efficient and allow | |
| 442 | 269 for compatibility with existing 8-bit encodings such as ASCII. (For |
| 270 example, in Unicode ASCII characters are simply promoted to a 16-bit | |
| 271 representation. That means that every ASCII character contains a | |
| 272 @samp{NUL} byte; evidently all of the standard string manipulation | |
| 273 functions will lose badly in a fixed-width Unicode environment.) | |
| 428 | 274 |
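
A short Python comparison of a variable-width and a fixed-width encoding. It assumes Python's `euc_jp` and `utf-16-be` codecs as examples of each kind; the three sample characters are chosen only because they fall in ASCII, JIS X 0208 and JIS X 0212 respectively.

```python
samples = {
    "ASCII letter": "A",
    "JIS X 0208 hiragana": "\u3042",
    "JIS X 0212 kanji": "\u4e02",
}

# Bytes per character in each encoding.
euc_widths = {label: len(ch.encode("euc_jp")) for label, ch in samples.items()}
utf16_widths = {label: len(ch.encode("utf-16-be")) for label, ch in samples.items()}
print(euc_widths)
print(utf16_widths)

# Every ASCII character promoted to 16 bits carries a zero byte.
wide = "abc".encode("utf-16-be")
assert wide == b"\x00a\x00b\x00c"
# A C-style routine that stops at the first NUL byte would see an empty string.
assert wide.split(b"\x00", 1)[0] == b""
```
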
| 442 | 275 The bytes in an 8-bit encoding are often referred to as @dfn{octets} |
| 276 rather than simply as bytes. This terminology dates back to the days | |
| 277 before 8-bit bytes were universal, when some computers had 9-bit bytes, | |
| 278 others had 10-bit bytes, etc. | |
| 428 | 279 |
| 442 | 280 @node Charsets, MULE Characters, Internationalization Terminology, MULE |
| 428 | 281 @section Charsets |
| 282 | |
| 283 A @dfn{charset} in MULE is an object that encapsulates a | |
| 284 particular character set as well as an ordering of those characters. | |
| 285 Charsets are permanent objects and are named using symbols, like | |
| 286 faces. | |
| 287 | |
| 288 @defun charsetp object | |
| 289 This function returns non-@code{nil} if @var{object} is a charset. | |
| 290 @end defun | |
| 291 | |
| 292 @menu | |
| 293 * Charset Properties:: Properties of a charset. | |
| 294 * Basic Charset Functions:: Functions for working with charsets. | |
| 295 * Charset Property Functions:: Functions for accessing charset properties. | |
| 296 * Predefined Charsets:: Predefined charset objects. | |
| 297 @end menu | |
| 298 | |
| 442 | 299 @node Charset Properties, Basic Charset Functions, , Charsets |
| 428 | 300 @subsection Charset Properties |
| 301 | |
| 302 Charsets have the following properties: | |
| 303 | |
| 304 @table @code | |
| 305 @item name | |
| 306 A symbol naming the charset. Every charset must have a different name; | |
| 307 this allows a charset to be referred to using its name rather than | |
| 308 the actual charset object. | |
| 309 @item doc-string | |
| 310 A documentation string describing the charset. | |
| 311 @item registry | |
| 312 A regular expression matching the font registry field for this character | |
| 313 set. For example, both the @code{ascii} and @code{latin-iso8859-1} | |
| 314 charsets use the registry @code{"ISO8859-1"}. This field is used to | |
| 315 choose an appropriate font when the user gives a general font | |
| 316 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a | |
| 317 14-point upright medium-weight Courier font. | |
| 318 @item dimension | |
| 319 Number of position codes used to index a character in the character set. | |
| 320 XEmacs/MULE can only handle character sets of dimension 1 or 2. | |
| 321 This property defaults to 1. | |
| 322 @item chars | |
| 323 Number of characters in each dimension. In XEmacs/MULE, the only | |
| 324 allowed values are 94 or 96. (There are a couple of pre-defined | |
| 325 character sets, such as ASCII, that do not follow this, but you cannot | |
| 326 define new ones like this.) Defaults to 94. Note that if the dimension | |
| 327 is 2, the character set thus described is 94x94 or 96x96. | |
| 328 @item columns | |
| 329 Number of columns used to display a character in this charset. | |
| 330 Only used in TTY mode. (Under X, the actual width of a character | |
| 331 can be derived from the font used to display the characters.) | |
| 332 If unspecified, defaults to the dimension. (This is almost | |
| 333 always the correct value, because character sets with dimension 2 | |
| 334 are usually ideograph character sets, which need two columns to | |
| 335 display the intricate ideographs.) | |
| 336 @item direction | |
| 337 A symbol, either @code{l2r} (left-to-right) or @code{r2l} | |
| 338 (right-to-left). Defaults to @code{l2r}. This specifies the | |
| 339 direction that the text should be displayed in, and will be | |
| 340 left-to-right for most charsets but right-to-left for Hebrew | |
| 341 and Arabic. (Right-to-left display is not currently implemented.) | |
| 342 @item final | |
| 343 Final byte of the standard ISO 2022 escape sequence designating this | |
| 344 charset. Must be supplied. Each combination of (@var{dimension}, | |
| 345 @var{chars}) defines a separate namespace for final bytes, and each | |
| 346 charset within a particular namespace must have a different final byte. | |
| 347 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if | |
| 348 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final | |
| 349 bytes in the range 0x30 - 0x3F are reserved for user-defined (not | |
| 350 official) character sets. For more information on ISO 2022, see @ref{Coding | |
| 351 Systems}. | |
| 352 @item graphic | |
| 353 0 (use left half of font on output) or 1 (use right half of font on | |
| 354 output). Defaults to 0. This specifies how to convert the position | |
| 355 codes that index a character in a character set into an index into the | |
| 356 font used to display the character set. With @code{graphic} set to 0, | |
| 357 position codes 33 through 126 map to font indices 33 through 126; with | |
| 358 it set to 1, position codes 33 through 126 map to font indices 161 | |
| 359 through 254 (i.e. the same number but with the high bit set). For | |
| 360 example, for a font whose registry is ISO8859-1, the left half of the | |
| 361 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right | |
| 362 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset. | |
| 363 @item ccl-program | |
| 364 A compiled CCL program used to convert a character in this charset into | |
| 365 an index into the font. This is in addition to the @code{graphic} | |
| 366 property. If a CCL program is defined, the position codes of a | |
| 367 character will first be processed according to @code{graphic} and | |
| 368 then passed through the CCL program, with the resulting values used | |
| 369 to index the font. | |
| 370 | |
| 442 | 371 This is used, for example, in the Big5 character set (used in Taiwan). |
| 428 | 372 This character set is not ISO-2022-compliant, and its size (94x157) does |
| 373 not fit within the maximum 96x96 size of ISO-2022-compliant character | |
| 374 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, | |
| 375 so as to group the most commonly used characters together) into two | |
| 376 charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each of size 94x94, | |
| 377 and each charset object uses a CCL program to convert the modified | |
| 378 position codes back into standard Big5 indices to retrieve a character | |
| 379 from a Big5 font. | |
| 380 @end table | |
| 381 | |
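
The effect of the @code{graphic} property on font indexing, as described in the table above, amounts to setting or leaving the high bit. The Python sketch below uses a hypothetical function `font_index`; it is not XEmacs code and does not model the optional CCL program step.

```python
def font_index(position_code, graphic=0):
    """Map a position code to a font index according to `graphic`.

    graphic == 0 selects the left half of the font (index unchanged);
    graphic == 1 selects the right half (same number with the high bit set).
    """
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    if not 32 <= position_code <= 127:
        raise ValueError("position code outside 32..127")
    return (position_code | 0x80) if graphic else position_code

for code in (33, 126):
    print(code, font_index(code, graphic=0), font_index(code, graphic=1))
```
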
| 442 | 382 Most of the above properties can only be set when the charset is |
| 383 initialized, and cannot be changed later. | |
| 384 @xref{Charset Property Functions}. | |
| 428 | 385 |
| 442 | 386 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets |
| 428 | 387 @subsection Basic Charset Functions |
| 388 | |
| 389 @defun find-charset charset-or-name | |
| 390 This function retrieves the charset of the given name. If | |
| 391 @var{charset-or-name} is a charset object, it is simply returned. | |
| 392 Otherwise, @var{charset-or-name} should be a symbol. If there is no | |
| 393 such charset, @code{nil} is returned. Otherwise the associated charset | |
| 394 object is returned. | |
| 395 @end defun | |
| 396 | |
| 397 @defun get-charset name | |
| 398 This function retrieves the charset of the given name. Same as | |
| 399 @code{find-charset} except an error is signalled if there is no such | |
| 400 charset instead of returning @code{nil}. | |
| 401 @end defun | |
| 402 | |
| 403 @defun charset-list | |
| 404 This function returns a list of the names of all defined charsets. | |
| 405 @end defun | |
| 406 | |
| 407 @defun make-charset name doc-string props | |
| 408 This function defines a new character set. This function is for use | |
| 442 | 409 with MULE support. @var{name} is a symbol, the name by which the |
| 428 | 410 character set is normally referred. @var{doc-string} is a string |
| 411 describing the character set. @var{props} is a property list, | |
| 412 describing the specific nature of the character set. The recognized | |
| 413 properties are @code{registry}, @code{dimension}, @code{columns}, | |
| 414 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and | |
| 415 @code{ccl-program}, as previously described. | |
| 416 @end defun | |
| 417 | |
| 418 @defun make-reverse-direction-charset charset new-name | |
| 419 This function makes a charset equivalent to @var{charset} but which goes | |
| 420 in the opposite direction. @var{new-name} is the name of the new | |
| 421 charset. The new charset is returned. | |
| 422 @end defun | |
| 423 | |
| 424 @defun charset-from-attributes dimension chars final &optional direction | |
| 425 This function returns a charset with the given @var{dimension}, | |
| 426 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is | |
| 427 omitted, both directions will be checked (left-to-right will be returned | |
| 428 if character sets exist for both directions). | |
| 429 @end defun | |
| 430 | |
| 431 @defun charset-reverse-direction-charset charset | |
| 432 This function returns the charset (if any) with the same dimension, | |
| 433 number of characters, and final byte as @var{charset}, but which is | |
| 434 displayed in the opposite direction. | |
| 435 @end defun | |
| 436 | |
| 442 | 437 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets |
| 428 | 438 @subsection Charset Property Functions |
| 439 | |
| 442 | 440 All of these functions accept either a charset name or charset object. |
| 428 | 441 |
| 442 @defun charset-property charset prop | |
| 443 This function returns property @var{prop} of @var{charset}. | |
| 444 @xref{Charset Properties}. | |
| 445 @end defun | |
| 446 | |
| 442 | 447 Convenience functions are also provided for retrieving individual |
| 428 | 448 properties of a charset. |
| 449 | |
| 450 @defun charset-name charset | |
| 451 This function returns the name of @var{charset}. This will be a symbol. | |
| 452 @end defun | |
| 453 | |
| 444 | 454 @defun charset-description charset |
| 455 This function returns the documentation string of @var{charset}. | |
| 428 | 456 @end defun |
| 457 | |
| 458 @defun charset-registry charset | |
| 459 This function returns the registry of @var{charset}. | |
| 460 @end defun | |
| 461 | |
| 462 @defun charset-dimension charset | |
| 463 This function returns the dimension of @var{charset}. | |
| 464 @end defun | |
| 465 | |
| 466 @defun charset-chars charset | |
| 467 This function returns the number of characters per dimension of | |
| 468 @var{charset}. | |
| 469 @end defun | |
| 470 | |
| 444 | 471 @defun charset-width charset |
| 428 | 472 This function returns the number of display columns per character (in |
| 473 TTY mode) of @var{charset}. | |
| 474 @end defun | |
| 475 | |
| 476 @defun charset-direction charset | |
| 440 | 477 This function returns the display direction of @var{charset}---either |
| 428 | 478 @code{l2r} or @code{r2l}. |
| 479 @end defun | |
| 480 | |
| 444 | 481 @defun charset-iso-final-char charset |
| 428 | 482 This function returns the final byte of the ISO 2022 escape sequence |
| 483 designating @var{charset}. | |
| 484 @end defun | |
| 485 | |
| 444 | 486 @defun charset-iso-graphic-plane charset |
| 428 | 487 This function returns either 0 or 1, depending on whether the position |
| 488 codes of characters in @var{charset} map to the left or right half | |
| 489 of their font, respectively. | |
| 490 @end defun | |
| 491 | |
| 492 @defun charset-ccl-program charset | |
| 493 This function returns the CCL program, if any, for converting | |
| 494 position codes of characters in @var{charset} into font indices. | |
| 495 @end defun | |
| 496 | |
| 1734 | 497 The two properties of a charset that can currently be set after the |
| 498 charset has been created are the CCL program and the font registry. | |
| 428 | 499 |
| 500 @defun set-charset-ccl-program charset ccl-program | |
| 501 This function sets the @code{ccl-program} property of @var{charset} to | |
| 502 @var{ccl-program}. | |
| 503 @end defun | |
| 504 | |
| 1734 | 505 @defun set-charset-registry charset registry |
| 506 This function sets the @code{registry} property of @var{charset} to | |
| 507 @var{registry}. | |
| 508 @end defun | |
| 509 | |
| 442 | 510 @node Predefined Charsets, , Charset Property Functions, Charsets |
| 428 | 511 @subsection Predefined Charsets |
| 512 | |
| 442 | 513 The following charsets are predefined in the C code. |
| 428 | 514 |
| 515 @example | |
| 516 Name                    Type   Fi  Gr  Dir  Registry | |
| 517 -------------------------------------------------------------- | |
| 518 ascii                   94     B   0   l2r  ISO8859-1 | |
| 519 control-1               94         0   l2r  --- | |
| 520 latin-iso8859-1         96     A   1   l2r  ISO8859-1 | |
| 521 latin-iso8859-2         96     B   1   l2r  ISO8859-2 | |
| 522 latin-iso8859-3         96     C   1   l2r  ISO8859-3 | |
| 523 latin-iso8859-4         96     D   1   l2r  ISO8859-4 | |
| 524 cyrillic-iso8859-5      96     L   1   l2r  ISO8859-5 | |
| 525 arabic-iso8859-6        96     G   1   r2l  ISO8859-6 | |
| 526 greek-iso8859-7         96     F   1   l2r  ISO8859-7 | |
| 527 hebrew-iso8859-8        96     H   1   r2l  ISO8859-8 | |
| 528 latin-iso8859-9         96     M   1   l2r  ISO8859-9 | |
| 529 thai-tis620             96     T   1   l2r  TIS620 | |
| 530 katakana-jisx0201       94     I   1   l2r  JISX0201.1976 | |
| 531 latin-jisx0201          94     J   0   l2r  JISX0201.1976 | |
| 532 japanese-jisx0208-1978  94x94  @@   0   l2r  JISX0208.1978 | |
| 533 japanese-jisx0208       94x94  B   0   l2r  JISX0208.19(83|90) | |
| 534 japanese-jisx0212       94x94  D   0   l2r  JISX0212 | |
| 535 chinese-gb2312          94x94  A   0   l2r  GB2312 | |
| 536 chinese-cns11643-1      94x94  G   0   l2r  CNS11643.1 | |
| 537 chinese-cns11643-2      94x94  H   0   l2r  CNS11643.2 | |
| 538 chinese-big5-1          94x94  0   0   l2r  Big5 | |
| 539 chinese-big5-2          94x94  1   0   l2r  Big5 | |
| 540 korean-ksc5601          94x94  C   0   l2r  KSC5601 | |
| 541 composite               96x96      0   l2r  --- | |
| 542 @end example | |
| 543 | |
| 442 | 544 The following charsets are predefined in the Lisp code. |
| 428 | 545 |
| 546 @example | |
| 547 Name                    Type   Fi  Gr  Dir  Registry | |
| 548 -------------------------------------------------------------- | |
| 549 arabic-digit            94     2   0   l2r  MuleArabic-0 | |
| 550 arabic-1-column         94     3   0   r2l  MuleArabic-1 | |
| 551 arabic-2-column         94     4   0   r2l  MuleArabic-2 | |
| 552 sisheng                 94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH | |
| 553 chinese-cns11643-3      94x94  I   0   l2r  CNS11643.1 | |
| 554 chinese-cns11643-4      94x94  J   0   l2r  CNS11643.1 | |
| 555 chinese-cns11643-5      94x94  K   0   l2r  CNS11643.1 | |
| 556 chinese-cns11643-6      94x94  L   0   l2r  CNS11643.1 | |
| 557 chinese-cns11643-7      94x94  M   0   l2r  CNS11643.1 | |
| 558 ethiopic                94x94  2   0   l2r  Ethio | |
| 559 ascii-r2l               94     B   0   r2l  ISO8859-1 | |
| 560 ipa                     96     0   1   l2r  MuleIPA | |
| 1734 | 561 vietnamese-viscii-lower 96     1   1   l2r  VISCII1.1 |
| 562 vietnamese-viscii-upper 96     2   1   l2r  VISCII1.1 | |
| 428 | 563 @end example |
| 564 | |
| 565 For all of the above charsets, the dimension and number of columns are | |
| 566 the same. | |
| 567 | |
| 442 | 568 Note that ASCII, Control-1, and Composite are handled specially. |
| 428 | 569 This is why some of the fields are blank; and some of the filled-in |
| 570 fields (e.g. the type) are not really accurate. | |
| 571 | |
| 442 | 572 @node MULE Characters, Composite Characters, Charsets, MULE |
| 428 | 573 @section MULE Characters |
| 574 | |
| 575 @defun make-char charset arg1 &optional arg2 | |
| 576 This function makes a multi-byte character from @var{charset} and octets | |
| 577 @var{arg1} and @var{arg2}. | |
| 578 @end defun | |
| 579 | |
| 444 | 580 @defun char-charset character |
| 581 This function returns the character set of char @var{character}. | |
| 428 | 582 @end defun |
| 583 | |
| 444 | 584 @defun char-octet character &optional n |
| 428 | 585 This function returns the octet (i.e. position code) numbered @var{n} |
| 444 | 586 (should be 0 or 1) of char @var{character}. @var{n} defaults to 0 if omitted. |
| 428 | 587 @end defun |
| 588 | |
| 589 @defun find-charset-region start end &optional buffer | |
| 590 This function returns a list of the charsets in the region between | |
| 591 @var{start} and @var{end}. @var{buffer} defaults to the current buffer | |
| 592 if omitted. | |
| 593 @end defun | |
| 594 | |
| 595 @defun find-charset-string string | |
| 596 This function returns a list of the charsets in @var{string}. | |
| 597 @end defun | |
| 598 | |
| 442 | 599 @node Composite Characters, Coding Systems, MULE Characters, MULE |
| 428 | 600 @section Composite Characters |
| 601 | |
| 442 | 602 Composite characters are not yet completely implemented. |
| 428 | 603 |
| 604 @defun make-composite-char string | |
| 605 This function converts a string into a single composite character. The | |
| 606 character is the result of overstriking all the characters in the | |
| 607 string. | |
| 608 @end defun | |
| 609 | |
| 444 | 610 @defun composite-char-string character |
| 428 | 611 This function returns a string of the characters comprising a composite |
| 612 character. | |
| 613 @end defun | |
| 614 | |
| 615 @defun compose-region start end &optional buffer | |
| 616 This function composes the characters in the region from @var{start} to | |
| 617 @var{end} in @var{buffer} into one composite character. The composite | |
| 618 character replaces the composed characters. @var{buffer} defaults to | |
| 619 the current buffer if omitted. | |
| 620 @end defun | |
| 621 | |
| 622 @defun decompose-region start end &optional buffer | |
| 623 This function decomposes any composite characters in the region from | |
| 624 @var{start} to @var{end} in @var{buffer}. This converts each composite | |
| 625 character into one or more characters, the individual characters out of | |
| 626 which the composite character was formed. Non-composite characters are | |
| 627 left as-is. @var{buffer} defaults to the current buffer if omitted. | |
| 628 @end defun | |
| 629 | |
| 442 | 630 @node Coding Systems, CCL, Composite Characters, MULE |
| 631 @section Coding Systems | |
| 632 | |
| 633 A coding system is an object that defines how text containing multiple | |
| 634 character sets is encoded into a stream of (typically 8-bit) bytes. The | |
| 635 coding system is used to decode the stream into a series of characters | |
| 636 (which may be from multiple charsets) when the text is read from a file | |
| 637 or process, and is used to encode the text back into the same format | |
| 638 when it is written out to a file or process. | |
| 639 | |
| 640 For example, many ISO-2022-compliant coding systems (such as Compound | |
| 641 Text, which is used for inter-client data under the X Window System) use | |
| 642 escape sequences to switch between different charsets -- Japanese Kanji, | |
| 643 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with | |
| 644 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See | |
| 645 @code{make-coding-system} for more information. | |
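
The escape sequences described above can be observed in a real byte stream.
The following is a minimal sketch that uses Python's standard
@code{iso2022_jp} codec as a stand-in; it is an assumption of this example
and not part of MULE. Note that this codec emits the short form
@samp{ESC $ B} for JIS X 0208 rather than @samp{ESC $ ( B}.

```python
# Encode mixed ASCII/Japanese text with Python's built-in ISO-2022-JP codec
# (used here only for illustration; XEmacs/MULE is not involved).
ESC = b"\x1b"
KANJI_ON = ESC + b"$B"   # designate JIS X 0208 (short form)
ASCII_ON = ESC + b"(B"   # designate ASCII again

encoded = "abc 日本語 xyz".encode("iso2022_jp")

# The stream starts in ASCII, switches charsets around the kanji,
# and never uses a byte with the high bit set.
print(encoded)
print("switch to kanji at offset", encoded.index(KANJI_ON))
print("switch back to ASCII at offset", encoded.index(ASCII_ON))
```

Because every byte stays below 0x80, a stream of this kind can pass through
7-bit channels unchanged.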
| 646 | |
| 647 Coding systems are normally identified using a symbol, and the symbol is | |
| 648 accepted in place of the actual coding system object whenever a coding | |
| 649 system is called for. (This is similar to how faces and charsets work.) | |
| 650 | |
| 651 @defun coding-system-p object | |
| 652 This function returns non-@code{nil} if @var{object} is a coding system. | |
| 653 @end defun | |
| 428 | 654 |
| 442 | 655 @menu |
| 656 * Coding System Types:: Classifying coding systems. | |
| 657 * ISO 2022:: An international standard for | |
| 658 charsets and encodings. | |
| 659 * EOL Conversion:: Dealing with different ways of denoting | |
| 660 the end of a line. | |
| 661 * Coding System Properties:: Properties of a coding system. | |
| 662 * Basic Coding System Functions:: Working with coding systems. | |
| 663 * Coding System Property Functions:: Retrieving a coding system's properties. | |
| 664 * Encoding and Decoding Text:: Encoding and decoding text. | |
| 665 * Detection of Textual Encoding:: Determining how text is encoded. | |
| 666 * Big5 and Shift-JIS Functions:: Special functions for these non-standard | |
| 667 encodings. | |
| 668 * Predefined Coding Systems:: Coding systems implemented by MULE. | |
| 669 @end menu | |
| 428 | 670 |
| 442 | 671 @node Coding System Types, ISO 2022, , Coding Systems |
| 672 @subsection Coding System Types | |
| 673 | |
| 674 The coding system type determines the basic algorithm XEmacs will use to | |
| 675 decode or encode a data stream. Character encodings will be converted | |
| 676 to the MULE encoding, escape sequences processed, and newline sequences | |
| 677 converted to XEmacs's internal representation. There are three basic | |
| 678 classes of coding system type: no-conversion, ISO-2022, and special. | |
| 679 | |
| 680 No conversion allows you to look at the file's internal representation. | |
| 681 Since XEmacs is basically a text editor, ``no conversion'' does convert | |
| 682 newline conventions by default. (Use the @code{binary} coding system if | |
| 683 this is not desired.) | |
| 428 | 684 |
| 442 | 685 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating |
| 686 use of "coded character sets for the exchange of data", i.e., text | |
| 687 streams. ISO 2022 contains functions that make it possible to encode | |
| 688 text streams to comply with restrictions of the Internet mail system and | |
| 689 de facto restrictions of most file systems (e.g., use of the separator | |
| 690 character in file names). Coding systems which are not ISO 2022 | |
| 691 conformant can be difficult to handle. Perhaps more important, they are | |
| 692 not adaptable to multilingual information interchange, with the obvious | |
| 693 exception of ISO 10646 (Unicode). (Unicode is partially supported by | |
| 694 XEmacs with the addition of the Lisp package ucs-conv.) | |
| 695 | |
| 696 The special class of coding systems includes automatic detection, CCL (a | |
| 697 "little language" embedded as an interpreter, useful for translating | |
| 698 between variants of a single character set), non-ISO-2022-conformant | |
| 699 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding. | |
| 700 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly | |
| 701 for other versions of XEmacs and for GNU Emacs 20.) | |
| 702 | |
| 703 @table @code | |
| 704 @item no-conversion | |
| 705 No conversion, for binary files, and a few special cases of non-ISO-2022 | |
| 706 coding systems where conversion is done by hook functions (usually | |
| 707 implemented in CCL). On output, graphic characters that are not in | |
| 708 ASCII or Latin-1 will be replaced by a @samp{?}. (For a | |
| 709 no-conversion-encoded buffer, these characters will only be present if | |
| 710 you explicitly insert them.) | |
| 711 @item iso2022 | |
| 712 Any ISO-2022-compliant encoding. Among others, this includes JIS (the | |
| 713 Japanese encoding commonly used for e-mail), national variants of EUC | |
| 714 (the standard Unix encoding for Japanese and other languages), and | |
| 715 Compound Text (an encoding used in X11). You can specify more specific | |
| 716 information about the conversion with the @var{flags} argument. | |
| 717 @item ucs-4 | |
| 718 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode. | |
| 719 @item utf-8 | |
| 720 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format | |
| 721 that can be used with both UCS-4 and Unicode. | |
| 722 @item undecided | |
| 723 Automatic conversion. XEmacs attempts to detect the coding system used | |
| 724 in the file. | |
| 725 @item shift-jis | |
| 726 Shift-JIS (a Japanese encoding commonly used in PC operating systems). | |
| 727 @item big5 | |
| 728 Big5 (an encoding commonly used for Traditional Chinese, particularly in Taiwan). | |
| 729 @item ccl | |
| 730 The conversion is performed using a user-written pseudo-code program. | |
| 731 CCL (Code Conversion Language) is the name of this pseudo-code. For | |
| 732 example, CCL is used to map KOI8-R characters (an encoding for Russian | |
| 733 Cyrillic) to ISO8859-5 (the form used internally by MULE). | |
| 734 @item internal | |
| 735 Write out or read in the raw contents of the memory representing the | |
| 736 buffer's text. This is primarily useful for debugging purposes, and is | |
| 737 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set | |
| 738 (the @samp{--debug} configure option). @strong{Warning}: Reading in a | |
| 739 file using @code{internal} conversion can result in an internal | |
| 740 inconsistency in the memory representing a buffer's text, which will | |
| 741 produce unpredictable results and may cause XEmacs to crash. Under | |
| 742 normal circumstances you should never use @code{internal} conversion. | |
| 428 | 743 @end table |
| 744 | |
| 442 | 745 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems |
| 746 @subsection ISO 2022 | |
| 747 | |
| 748 This section briefly describes the ISO 2022 encoding standard. A more | |
| 749 thorough treatment is available in the original document of ISO | |
| 750 2022 as well as various national standards (such as JIS X 0202). | |
| 428 | 751 |
| 442 | 752 Character sets (@dfn{charsets}) are classified into the following four |
| 753 categories, according to the number of characters in the charset: | |
| 754 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means | |
| 755 that although an ISO 2022 coding system may have variable width | |
| 756 characters, each charset used is fixed-width (in contrast to the MULE | |
| 757 character set and UTF-8, for example). | |
| 758 | |
| 759 ISO 2022 provides for switching between character sets via escape | |
| 760 sequences. This switching is somewhat complicated, because ISO 2022 | |
| 761 provides for both legacy applications like Internet mail that accept | |
| 444 | 762 only 7 significant bits in some contexts (RFC 822 headers, for example), |
| 442 | 763 and more modern "8-bit clean" applications. It also provides for |
| 764 compact and transparent representation of languages like Japanese which | |
| 765 mix ASCII and a national script (even outside of computer programs). | |
| 428 | 766 |
| 442 | 767 First, ISO 2022 codified prevailing practice by dividing the code space |
| 768 into "control" and "graphic" regions. The code points 0x00-0x1F and | |
| 769 0x80-0x9F are reserved for "control characters", while "graphic | |
| 770 characters" must be assigned to code points in the regions 0x20-0x7F and | |
| 771 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some | |
| 772 circumstances must be assigned the graphic character "ASCII SPACE" and | |
| 773 the control character "ASCII DEL" respectively. | |
| 428 | 774 |
| 442 | 775 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F), |
| 776 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" | |
| 777 and "graphic right", respectively, because of the standard method of | |
| 778 displaying graphic character sets in tables with the high four bits indexing | |
| 444 | 779 columns and the low four bits indexing rows. I don't find it very intuitive, |
| 442 | 780 but these are called "registers". |
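
The division of the code space described above can be sketched as a small
classifier. This is an illustrative Python sketch only; the function name
@code{iso2022_region} is hypothetical and does not correspond to any XEmacs
function.

```python
def iso2022_region(byte: int) -> str:
    """Return the ISO 2022 region (C0, GL, C1 or GR) that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value: %r" % (byte,))
    if byte <= 0x1F:
        return "C0"   # control characters, 0x00-0x1F
    if byte <= 0x7F:
        return "GL"   # graphic left, 0x20-0x7F
    if byte <= 0x9F:
        return "C1"   # control characters, 0x80-0x9F
    return "GR"       # graphic right, 0xA0-0xFF


for b in (0x0A, 0x41, 0x85, 0xE9):
    print(hex(b), iso2022_region(b))
```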
| 781 | |
| 782 An ISO 2022-conformant encoding for a graphic character set must use a | |
| 783 fixed number of bytes per character, and the values must fit into a | |
| 784 single register; that is, each byte must range over either 0x20-0x7F, or | |
| 785 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a | |
| 786 character set by using both ranges at the same time. This is why a standard | |
| 787 character set such as ISO 8859-1 is actually considered by ISO 2022 to | |
| 788 be an aggregation of two character sets, ASCII and LATIN-1, and why it | |
| 789 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a | |
| 790 single character's bytes must all be drawn from the same register; this | |
| 791 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO | |
| 792 2022-compatible encodings. | |
| 428 | 793 |
| 442 | 794 The reason for this restriction becomes clear when you attempt to define |
| 795 an efficient, robust encoding for a language like Japanese. Like ISO | |
| 796 8859, Japanese encodings are aggregations of several character sets. In | |
| 797 practice, the vast majority of characters are drawn from the "JIS Roman" | |
| 798 character set (a derivative of ASCII; it won't hurt to think of it as | |
| 799 ASCII) and the JIS X 0208 standard "basic Japanese" character set | |
| 800 including not only ideographic characters ("kanji") but syllabic | |
| 801 Japanese characters ("kana"), a wide variety of symbols, and many | |
| 802 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although | |
| 803 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not | |
| 804 suited to programming; thus the inclusion of ASCII in the standard | |
| 805 Japanese encodings. | |
| 428 | 806 |
| 442 | 807 For normal Japanese text such as in newspapers, a broad repertoire of |
| 808 approximately 3000 characters is used. Evidently this won't fit into | |
| 809 one byte; two must be used. But much of the text processed by Japanese | |
| 810 computers is computer source code, nearly all of which is ASCII. A not | |
| 811 insignificant portion of ordinary text is English (as such or as | |
| 812 borrowed Japanese vocabulary) or other languages which can be represented | |
| 813 at least approximately in ASCII, as well. It seems reasonable then to | |
| 814 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly | |
| 815 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is | |
| 816 invoked to the GL register, and JIS X 0208 is invoked to the GR | |
| 817 register. Thus, each byte can be tested for its character set by | |
| 818 looking at the high bit; if set, it is Japanese, if clear, it is ASCII. | |
| 819 Furthermore, since control characters like newline can never be part of | |
| 820 a graphic character, even in the case of corruption in transmission the | |
| 821 stream will be resynchronized at every line break, on the order of 60-80 | |
| 822 bytes. This coding system requires no escape sequences or special | |
| 823 control codes to represent 99.9% of all Japanese text. | |
| 428 | 824 |
| 442 | 825 Note carefully the distinction between the character sets (ASCII and JIS |
| 826 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The | |
| 827 JIS X 0208 character set is used in three different encodings for | |
| 828 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is | |
| 829 always clear), in EUC-JP it is invoked into GR (setting the high bit in | |
| 830 the process), and in Shift JIS the high bit may be set or reset, and the | |
| 831 significant bits are shifted within the 16-bit character so that the two | |
| 832 main character sets can coexist with a third (the "halfwidth katakana" | |
| 833 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a | |
| 834 version of the ISO-2022 coding system. | |
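
To make the distinction concrete, the following sketch encodes one
JIS X 0208 character in all three encodings. It relies on Python's standard
codecs (an assumption of this example, not MULE functionality) and shows
that the EUC-JP bytes are the ISO-2022-JP bytes with the high bit set, while
the Shift JIS bytes are a different transformation of the same code point.

```python
ch = "あ"  # HIRAGANA LETTER A, a JIS X 0208 character

jis = ch.encode("iso2022_jp")    # ESC $ B, two GL bytes, ESC ( B
euc = ch.encode("euc_jp")        # the same two bytes, invoked into GR
sjis = ch.encode("shift_jis")    # shifted representation

gl_bytes = jis[3:5]              # strip the leading and trailing escape sequences
gr_bytes = bytes(b | 0x80 for b in gl_bytes)

print("ISO-2022-JP payload:", gl_bytes.hex())
print("EUC-JP:             ", euc.hex())
print("Shift JIS:          ", sjis.hex())
```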
| 428 | 835 |
| 442 | 836 In order to systematically treat subsidiary character sets (like the |
| 837 "halfwidth katakana" already mentioned, and the "supplementary kanji" of | |
| 838 JIS X 0212), four further registers are defined: G0, G1, G2, and G3. | |
| 839 Unlike GL and GR, they are not logically distinguished by internal | |
| 840 format. Instead, the process of "invocation" mentioned earlier is | |
| 841 broken into two steps: first, a character set is @dfn{designated} to one | |
| 842 of the registers G0-G3 by use of an @dfn{escape sequence} of the form: | |
| 428 | 843 |
| 844 @example | |
| 440 | 845 ESC [@var{I}] @var{I} @var{F} |
| 428 | 846 @end example |
| 847 | |
| 442 | 848 where @var{I} is an intermediate character or characters in the range |
| 849 0x20-0x2F, and @var{F}, from the range 0x30-0x7E, is the final | |
| 850 character identifying this charset. (Final characters in the range | |
| 851 0x30-0x3F are reserved for private use and will never have a publicly | |
| 852 registered meaning.) | |
| 853 | |
| 854 Then that register is @dfn{invoked} to either GL or GR, either | |
| 855 automatically (designations to G0 normally involve invocation to GL as | |
| 856 well), or by use of shifting (affecting only the following character in | |
| 857 the data stream) or locking (effective until the next designation or | |
| 858 locking) control sequences. An encoding conformant to ISO 2022 is | |
| 859 typically defined by designating the initial contents of the G0-G3 | |
| 901 | 860 registers, specifying a 7 or 8 bit environment, and specifying whether |
| 442 | 861 further designations will be recognized. |
| 862 | |
| 863 Some examples of character sets and the registered final characters | |
| 864 @var{F} used to designate them: | |
| 428 | 865 |
| 442 | 866 @need 1000 |
| 867 @table @asis | |
| 868 @item 94-charset | |
| 869 ASCII (B), left (J) and right (I) half of JIS X 0201, ... | |
| 870 @item 96-charset | |
| 871 Latin-1 (A), Latin-2 (B), Latin-3 (C), ... | |
| 872 @item 94x94-charset | |
| 873 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ... | |
| 874 @item 96x96-charset | |
| 875 none for the moment | |
| 876 @end table | |
| 877 | |
| 878 The meanings of the various characters in these sequences, where not | |
| 879 specified by the ISO 2022 standard (such as the ESC character), are | |
| 880 assigned by @dfn{ECMA}, the European Computer Manufacturers Association. | |
| 881 | |
| 882 The meanings of the intermediate characters are: | |
| 428 | 883 |
| 884 @example | |
| 885 @group | |
| 440 | 886 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). |
| 887 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | |
| 888 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | |
| 889 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | |
| 890 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | |
| 442 | 891 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. |
| 440 | 892 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. |
| 893 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | |
| 894 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | |
| 428 | 895 @end group |
| 896 @end example | |
| 897 | |
| 442 | 898 The comma may be used in files read and written only by MULE, as a MULE |
| 899 extension, but this is illegal in ISO 2022. (The reason is that in ISO | |
| 900 2022 G0 must be a 94-member character set, with 0x20 assigned the value | |
| 901 SPACE, and 0x7F assigned the value DEL.) | |
| 428 | 902 |
| 442 | 903 Here are examples of designations: |
| 428 | 904 |
| 905 @example | |
| 906 @group | |
| 440 | 907 ESC ( B : designate to G0 ASCII |
| 908 ESC - A : designate to G1 Latin-1 | |
| 909 ESC $ ( A or ESC $ A : designate to G0 GB2312 | |
| 910 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | |
| 911 ESC $ ) C : designate to G1 KSC5601 | |
| 428 | 912 @end group |
| 913 @end example | |
| 914 | |
| 442 | 915 (The short forms used to designate GB2312 and JIS X 0208 are for |
| 916 backwards compatibility; the long forms are preferred.) | |
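
The designation sequences listed above can be decoded mechanically. The
following Python sketch is illustrative only: the names
@code{parse_designation} and @code{DESIGNATORS} are hypothetical, and the
parser covers just the forms shown in this section, including the short
forms and the MULE comma extension.

```python
ESC = 0x1B

# Intermediate byte -> (register, characters per dimension).
DESIGNATORS = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96),  # MULE extension; not valid in ISO 2022
    0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Return (register, chars, dimension, final) for a designation sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    rest = seq[1:]
    dimension = 1
    if rest[0] == 0x24:            # '$' marks a charset of dimension 2
        dimension = 2
        rest = rest[1:]
    if dimension == 2 and len(rest) == 1:
        # Short form ESC $ F: implies G0 and a 94x94 charset.
        register, chars = "G0", 94
    elif len(rest) == 2 and rest[0] in DESIGNATORS:
        register, chars = DESIGNATORS[rest[0]]
        rest = rest[1:]
    else:
        raise ValueError("unrecognized designation sequence")
    if rest[0] < 0x30:
        raise ValueError("missing final character")
    return register, chars, dimension, chr(rest[0])


for s in (b"\x1b(B", b"\x1b-A", b"\x1b$(B", b"\x1b$B", b"\x1b$)C"):
    print(s, parse_designation(s))
```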
| 917 | |
| 918 To use a charset designated to G2 or G3, and to use a charset designated | |
| 428 | 919 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 |
| 920 into GL. There are two types of invocation, Locking Shift (forever) and | |
| 921 Single Shift (one character only). | |
| 922 | |
| 442 | 923 Locking Shift is done as follows: |
| 428 | 924 |
| 925 @example | |
| 440 | 926 LS0 or SI (0x0F): invoke G0 into GL |
| 927 LS1 or SO (0x0E): invoke G1 into GL | |
| 928 LS2: invoke G2 into GL | |
| 929 LS3: invoke G3 into GL | |
| 930 LS1R: invoke G1 into GR | |
| 931 LS2R: invoke G2 into GR | |
| 932 LS3R: invoke G3 into GR | |
| 428 | 933 @end example |
| 934 | |
| 442 | 935 Single Shift is done as follows: |
| 428 | 936 |
| 937 @example | |
| 938 @group | |
| 440 | 939 SS2 or ESC N: invoke G2 into GL |
| 940 SS3 or ESC O: invoke G3 into GL | |
| 428 | 941 @end group |
| 942 @end example | |
| 943 | |
| 442 | 944 The shift functions (such as LS1R and SS3) are represented by control |
| 945 characters (from C1) in 8 bit environments and by escape sequences in 7 | |
| 946 bit environments. | |
| 947 | |
| 428 | 948 (#### Ben says: I think the above is slightly incorrect. It appears that |
| 949 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | |
| 444 | 950 ESC O behave as indicated. The above definitions will not parse |
| 428 | 951 EUC-encoded text correctly, and it looks like the code in mule-coding.c |
| 952 has similar problems.) | |
| 953 | |
| 442 | 954 Evidently there are a lot of ISO-2022-compliant ways of encoding |
| 955 multilingual text. Many coding systems in actual use, such as X11's | |
| 956 Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX | |
| 957 Code), are variants of ISO 2022. | |
| 428 | 958 |
| 442 | 959 In MULE, we characterize a version of ISO 2022 by the following |
| 960 attributes: | |
| 428 | 961 |
| 962 @enumerate | |
| 963 @item | |
| 442 | 964 The character sets initially designated to G0 through G3. |
| 428 | 965 @item |
| 442 | 966 Whether short form designations are allowed for Japanese and Chinese. |
| 428 | 967 @item |
| 442 | 968 Whether ASCII should be designated to G0 before control characters. |
| 428 | 969 @item |
| 442 | 970 Whether ASCII should be designated to G0 at the end of line. |
| 428 | 971 @item |
| 972 7-bit environment or 8-bit environment. | |
| 973 @item | |
| 442 | 974 Whether Locking Shifts are used or not. |
| 428 | 975 @item |
| 442 | 976 Whether to use ASCII or the variant JIS X 0201-1976-Roman. |
| 428 | 977 @item |
| 442 | 978 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1978. |
| 428 | 979 @end enumerate |
| 980 | |
| 981 (The last two are only for Japanese.) | |
| 982 | |
| 442 | 983 By specifying these attributes, you can create any variant |
| 428 | 984 of ISO 2022. |
| 985 | |
| 442 | 986 Here are several examples: |
| 428 | 987 |
| 988 @example | |
| 989 @group | |
| 442 | 990 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468). |
| 440 | 991 1. G0 <- ASCII, G1..3 <- never used |
| 992 2. Yes. | |
| 993 3. Yes. | |
| 994 4. Yes. | |
| 995 5. 7-bit environment | |
| 996 6. No. | |
| 997 7. Use ASCII | |
| 442 | 998 8. Use JIS X 0208-1983 |
| 428 | 999 @end group |
| 1000 | |
| 1001 @group | |
| 442 | 1002 ctext -- X11 Compound Text |
| 1003 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used. | |
| 440 | 1004 2. No. |
| 1005 3. No. | |
| 1006 4. Yes. | |
| 442 | 1007 5. 8-bit environment. |
| 440 | 1008 6. No. |
| 442 | 1009 7. Use ASCII. |
| 1010 8. Use JIS X 0208-1983. | |
| 428 | 1011 @end group |
| 1012 | |
| 1013 @group | |
| 442 | 1014 euc-china -- Chinese EUC. Often called the "GB encoding", but that is |
| 1015 technically incorrect. | |
| 1016 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used. | |
| 440 | 1017 2. No. |
| 1018 3. Yes. | |
| 1019 4. Yes. | |
| 442 | 1020 5. 8-bit environment. |
| 440 | 1021 6. No. |
| 442 | 1022 7. Use ASCII. |
| 1023 8. Use JIS X 0208-1983. | |
| 428 | 1024 @end group |
| 1025 | |
| 1026 @group | |
| 442 | 1027 ISO-2022-KR -- Coding system used in Korean email. |
| 1028 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used. | |
| 440 | 1029 2. No. |
| 1030 3. Yes. | |
| 1031 4. Yes. | |
| 442 | 1032 5. 7-bit environment. |
| 440 | 1033 6. Yes. |
| 442 | 1034 7. Use ASCII. |
| 1035 8. Use JIS X 0208-1983. | |
| 428 | 1036 @end group |
| 1037 @end example | |
| 1038 | |
| 442 | 1039 MULE creates all of these coding systems by default. |
| 428 | 1040 |
| 442 | 1041 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems |
| 428 | 1042 @subsection EOL Conversion |
| 1043 | |
| 1044 @table @code | |
| 1045 @item nil | |
| 1046 Automatically detect the end-of-line type (LF, CRLF, or CR). Also | |
| 1047 generate subsidiary coding systems named @code{@var{name}-unix}, | |
| 1048 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to | |
| 1049 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf}, | |
| 1050 and @code{cr}, respectively. | |
| 1051 @item lf | |
| 1052 The end of a line is marked externally using ASCII LF. Since this is | |
| 1053 also the way that XEmacs represents an end-of-line internally, | |
| 1054 specifying this option results in no end-of-line conversion. This is | |
| 1055 the standard format for Unix text files. | |
| 1056 @item crlf | |
| 1057 The end of a line is marked externally using ASCII CRLF. This is the | |
| 1058 standard format for MS-DOS text files. | |
| 1059 @item cr | |
| 1060 The end of a line is marked externally using ASCII CR. This is the | |
| 1061 standard format for Macintosh text files. | |
| 1062 @item t | |
| 1063 Automatically detect the end-of-line type but do not generate subsidiary | |
| 1064 coding systems. (This value is converted to @code{nil} when stored | |
| 1065 internally, and @code{coding-system-property} will return @code{nil}.) | |
| 1066 @end table | |
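
The conversions in the table above amount to simple byte substitutions
around an internal LF representation. The following Python sketch
illustrates this; the helper names are hypothetical, and the detection
heuristic is a naive assumption that does not necessarily match what XEmacs
does.

```python
EOL_BYTES = {"lf": b"\n", "crlf": b"\r\n", "cr": b"\r"}


def detect_eol(data: bytes):
    """Guess the end-of-line type of external data (naive heuristic)."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    if b"\n" in data:
        return "lf"
    return None  # no line break seen


def decode_eol(data: bytes, eol_type=None) -> bytes:
    """Convert external line endings to the internal LF convention."""
    if eol_type is None:           # corresponds to automatic detection
        eol_type = detect_eol(data) or "lf"
    return data.replace(EOL_BYTES[eol_type], b"\n")


def encode_eol(data: bytes, eol_type: str) -> bytes:
    """Convert internal LF line endings to the external convention."""
    return data.replace(b"\n", EOL_BYTES[eol_type])


print(decode_eol(b"one\r\ntwo\r\n"))
print(encode_eol(b"one\ntwo\n", "cr"))
```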
| 1067 | |
| 442 | 1068 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems |
| 428 | 1069 @subsection Coding System Properties |
| 1070 | |
| 1071 @table @code | |
| 1072 @item mnemonic | |
| 1073 String to be displayed in the modeline when this coding system is | |
| 1074 active. | |
| 1075 | |
| 1076 @item eol-type | |
| 1077 End-of-line conversion to be used. It should be one of the types | |
| 1078 listed in @ref{EOL Conversion}. | |
| 1079 | |
| 442 | 1080 @item eol-lf |
| 444 | 1081 The coding system which is the same as this one, except that it uses the |
| 442 | 1082 Unix line-breaking convention. |
| 1083 | |
| 1084 @item eol-crlf | |
| 444 | 1085 The coding system which is the same as this one, except that it uses the |
| 442 | 1086 DOS line-breaking convention. |
| 1087 | |
| 1088 @item eol-cr | |
| 444 | 1089 The coding system which is the same as this one, except that it uses the |
| 442 | 1090 Macintosh line-breaking convention. |
| 1091 | |
| 428 | 1092 @item post-read-conversion |
| 1093 Function called after a file has been read in, to perform the decoding. | |
| 444 | 1094 Called with two arguments, @var{start} and @var{end}, denoting a region of |
| 428 | 1095 the current buffer to be decoded. |
| 1096 | |
| 1097 @item pre-write-conversion | |
| 1098 Function called before a file is written out, to perform the encoding. | |
| 444 | 1099 Called with two arguments, @var{start} and @var{end}, denoting a region of |
| 428 | 1100 the current buffer to be encoded. |
| 1101 @end table | |
| 1102 | |
| 442 | 1103 The following additional properties are recognized if @var{type} is |
| 428 | 1104 @code{iso2022}: |
| 1105 | |
| 1106 @table @code | |
| 1107 @item charset-g0 | |
| 1108 @itemx charset-g1 | |
| 1109 @itemx charset-g2 | |
| 1110 @itemx charset-g3 | |
| 1111 The character set initially designated to the G0 - G3 registers. | |
| 1112 The value should be one of | |
| 1113 | |
| 1114 @itemize @bullet | |
| 1115 @item | |
| 1116 A charset object (designate that character set) | |
| 1117 @item | |
| 1118 @code{nil} (do not ever use this register) | |
| 1119 @item | |
| 1120 @code{t} (no character set is initially designated to the register, but | |
| 1121 may be later on; this automatically sets the corresponding | |
| 1122 @code{force-g*-on-output} property) | |
| 1123 @end itemize | |
| 1124 | |
| 1125 @item force-g0-on-output | |
| 1126 @itemx force-g1-on-output | |
| 1127 @itemx force-g2-on-output | |
| 1128 @itemx force-g3-on-output | |
| 1129 If non-@code{nil}, send an explicit designation sequence on output | |
| 1130 before using the specified register. | |
| 1131 | |
| 1132 @item short | |
| 1133 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A}, | |
| 1134 and @samp{ESC $ B} on output in place of the full designation sequences | |
| 1135 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}. | |
| 1136 | |
| 1137 @item no-ascii-eol | |
| 1138 If non-@code{nil}, don't designate ASCII to G0 at each end of line on | |
| 1139 output. Setting this to non-@code{nil} also suppresses other | |
| 1140 state-resetting that normally happens at the end of a line. | |
| 1141 | |
| 1142 @item no-ascii-cntl | |
| 1143 If non-@code{nil}, don't designate ASCII to G0 before control chars on | |
| 1144 output. | |
| 1145 | |
| 1146 @item seven | |
| 1147 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit | |
| 1148 environment. | |
| 1149 | |
| 1150 @item lock-shift | |
| 1151 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or | |
| 1152 designation by escape sequence. | |
| 1153 | |
| 1154 @item no-iso6429 | |
| 1155 If non-@code{nil}, don't use ISO6429's direction specification. | |
| 1156 | |
| 1157 @item escape-quoted | |
| 444 | 1158 If non-@code{nil}, literal control characters that are the same as the |
| 428 | 1159 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in |
| 1160 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F), | |
| 1161 and CSI (0x9B)) are ``quoted'' with an escape character so that they can | |
| 1162 be properly distinguished from an escape sequence. (Note that doing | |
| 1163 this results in a non-portable encoding.) This encoding flag is used for | |
| 1164 byte-compiled files. Note that ESC is a good choice for a quoting | |
| 1165 character because there are no escape sequences whose second byte is a | |
| 1166 character from the Control-0 or Control-1 character sets; this is | |
| 1167 explicitly disallowed by the ISO 2022 standard. | |
| 1168 | |
| 1169 @item input-charset-conversion | |
| 1170 A list of conversion specifications, specifying conversion of characters | |
| 1171 in one charset to another when decoding is performed. Each | |
| 1172 specification is a list of two elements: the source charset, and the | |
| 1173 destination charset. | |
| 1174 | |
| 1175 @item output-charset-conversion | |
| 1176 A list of conversion specifications, specifying conversion of characters | |
| 1177 in one charset to another when encoding is performed. The form of each | |
| 1178 specification is the same as for @code{input-charset-conversion}. | |
| 1179 @end table | |
| 1180 | |
| 442 | 1181 The following additional properties are recognized (and required) if |
| 428 | 1182 @var{type} is @code{ccl}: |
| 1183 | |
| 1184 @table @code | |
| 1185 @item decode | |
| 1186 CCL program used for decoding (converting to internal format). | |
| 1187 | |
| 1188 @item encode | |
| 1189 CCL program used for encoding (converting to external format). | |
| 1190 @end table | |
| 1191 | |
| 442 | 1192 The following properties are used internally: @var{eol-cr}, |
| 1193 @var{eol-crlf}, @var{eol-lf}, and @var{base}. | |
| 1194 | |
| 1195 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems | |
| 428 | 1196 @subsection Basic Coding System Functions |
| 1197 | |
| 1198 @defun find-coding-system coding-system-or-name | |
| 1199 This function retrieves the coding system of the given name. | |
| 1200 | |
| 442 | 1201 If @var{coding-system-or-name} is a coding-system object, it is simply |
| 428 | 1202 returned. Otherwise, @var{coding-system-or-name} should be a symbol. |
| 1203 If there is no such coding system, @code{nil} is returned. Otherwise | |
| 1204 the associated coding system object is returned. | |
| 1205 @end defun | |
| 1206 | |
| 1207 @defun get-coding-system name | |
| 1208 This function retrieves the coding system of the given name. It is the | |
| 1209 same as @code{find-coding-system}, except that an error is signalled if | |
| 1210 there is no such coding system, instead of @code{nil} being returned. | |
| 1211 @end defun | |
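The difference between the two lookup functions can be sketched as follows. This is only an illustration, assuming a MULE-enabled XEmacs; the helper name @code{my-coding-system-or-fallback} and the symbol @code{no-such-coding-system} are hypothetical.

```elisp
;; Hypothetical helper: return the coding-system object named NAME, or the
;; one named FALLBACK when NAME is not defined.
(defun my-coding-system-or-fallback (name fallback)
  ;; find-coding-system returns nil for an unknown name ...
  (or (find-coding-system name)
      ;; ... whereas get-coding-system signals an error, so a bad FALLBACK
      ;; is reported immediately instead of being silently ignored.
      (get-coding-system fallback)))

;; `no-such-coding-system' is a made-up name; `binary' is one of the
;; predefined coding systems listed later in this chapter.
(my-coding-system-or-fallback 'no-such-coding-system 'binary)
```

Using @code{find-coding-system} for the optional lookup and @code{get-coding-system} for the mandatory one keeps the error handling in one place.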
| 1212 | |
| 1213 @defun coding-system-list | |
| 1214 This function returns a list of the names of all defined coding systems. | |
| 1215 @end defun | |
| 1216 | |
| 1217 @defun coding-system-name coding-system | |
| 1218 This function returns the name of the given coding system. | |
| 1219 @end defun | |
| 1220 | |
| 442 | 1221 @defun coding-system-base coding-system |
| 1222 This function returns the base coding system of @var{coding-system}, | |
| 1223 that is, the one with an undecided EOL convention. | |
| 1224 @end defun | |
| 1225 | |
| 428 | 1226 @defun make-coding-system name type &optional doc-string props |
| 1227 This function registers symbol @var{name} as a coding system. | |
| 1228 | |
| 1229 @var{type} describes the conversion method used and should be one of | |
| 1230 the types listed in @ref{Coding System Types}. | |
| 1231 | |
| 1232 @var{doc-string} is a string describing the coding system. | |
| 1233 | |
| 1234 @var{props} is a property list describing the specific nature of the | |
| 1235 coding system. Recognized properties are as in @ref{Coding System | |
| 1236 Properties}. | |
| 1237 @end defun | |
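As a hedged sketch of a call to @code{make-coding-system}, the following defines an @code{iso2022} coding system with two initially designated charsets. The name @code{my-korean-euc} and the doc string are invented for illustration, and the charsets are looked up with @code{get-charset} so that the property values are charset objects, as the property description requires.

```elisp
;; `my-korean-euc' is a hypothetical name used only for illustration.
(make-coding-system
 'my-korean-euc 'iso2022
 "Illustrative 8-bit ISO 2022 coding system for Korean."
 (list 'charset-g0 (get-charset 'ascii)            ; designated to G0
       'charset-g1 (get-charset 'korean-ksc5601))) ; designated to G1

;; The new name can then be looked up like any other coding system.
(find-coding-system 'my-korean-euc)
```

The two charsets are the ones this chapter lists for @code{euc-kr}; any other property from the tables above could be added to the property list in the same way.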
| 1238 | |
| 1239 @defun copy-coding-system old-coding-system new-name | |
| 1240 This function copies @var{old-coding-system} to @var{new-name}. If | |
| 1241 @var{new-name} does not name an existing coding system, a new one will | |
| 1242 be created. | |
| 1243 @end defun | |
| 1244 | |
| 1245 @defun subsidiary-coding-system coding-system eol-type | |
| 1246 This function returns the subsidiary coding system of | |
| 1247 @var{coding-system} with eol type @var{eol-type}. | |
| 1248 @end defun | |
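A minimal sketch of @code{subsidiary-coding-system}, assuming that the accepted @var{eol-type} values are the symbols @code{lf}, @code{crlf} and @code{cr} (the values of the @code{eol-type} property). The helper name @code{my-subsidiary-names} is hypothetical.

```elisp
;; Hypothetical helper: return the names of the three EOL-specific
;; variants of the coding system named BASE.
;; Assumption: lf, crlf and cr are the valid eol-type symbols.
(defun my-subsidiary-names (base)
  (mapcar (lambda (eol-type)
            (coding-system-name
             (subsidiary-coding-system (get-coding-system base) eol-type)))
          '(lf crlf cr)))

(my-subsidiary-names 'euc-jp)
```

If the assumption about the @var{eol-type} symbols holds, the names returned should follow the @samp{-unix}, @samp{-dos} and @samp{-mac} naming convention used for the predefined coding systems.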
| 1249 | |
| 442 | 1250 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems |
| 428 | 1251 @subsection Coding System Property Functions |
| 1252 | |
| 1253 @defun coding-system-doc-string coding-system | |
| 1254 This function returns the doc string for @var{coding-system}. | |
| 1255 @end defun | |
| 1256 | |
| 1257 @defun coding-system-type coding-system | |
| 1258 This function returns the type of @var{coding-system}. | |
| 1259 @end defun | |
| 1260 | |
| 1261 @defun coding-system-property coding-system prop | |
| 1262 This function returns the @var{prop} property of @var{coding-system}. | |
| 1263 @end defun | |
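The three accessors can be combined to inspect an existing coding system. This is a sketch only; it assumes that @code{euc-jp} is defined in the running XEmacs and that @code{charset-g1} is readable through @code{coding-system-property} in the same form in which it was supplied.

```elisp
;; Collect the type, doc string, and G1 charset of the euc-jp coding system.
(let ((cs (get-coding-system 'euc-jp)))
  (list (coding-system-type cs)                   ; conversion method
        (coding-system-doc-string cs)             ; descriptive string
        (coding-system-property cs 'charset-g1))) ; charset designated to G1
```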
| 1264 | |
| 442 | 1265 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems |
| 428 | 1266 @subsection Encoding and Decoding Text |
| 1267 | |
| 1268 @defun decode-coding-region start end coding-system &optional buffer | |
| 1269 This function decodes the text between @var{start} and @var{end} which | |
| 1270 is encoded in @var{coding-system}. This is useful if you've read in | |
| 1271 encoded text from a file without decoding it (e.g. you read in a | |
| 1272 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding | |
| 1273 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the | |
| 1274 decoded text is returned. @var{buffer} defaults to the current buffer | |
| 1275 if unspecified. | |
| 1276 @end defun | |
| 1277 | |
| 1278 @defun encode-coding-region start end coding-system &optional buffer | |
| 1279 This function encodes the text between @var{start} and @var{end} using | |
| 1280 @var{coding-system}. This will, for example, convert Japanese | |
| 1281 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS | |
| 1282 encoding. The length of the encoded text is returned. @var{buffer} | |
| 1283 defaults to the current buffer if unspecified. | |
| 1284 @end defun | |
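The two functions are inverses of each other for text that the coding system can represent. A minimal round-trip sketch, assuming @code{with-temp-buffer} is available and using the hypothetical helper name @code{my-coding-round-trip}:

```elisp
;; Hypothetical helper: encode STRING with CODING-SYSTEM in a scratch
;; buffer, decode it again, and return the resulting text.
(defun my-coding-round-trip (string coding-system)
  (with-temp-buffer
    (insert string)
    ;; After this call the buffer holds the external (encoded) bytes.
    (encode-coding-region (point-min) (point-max) coding-system)
    ;; Decoding the same region restores the internal representation.
    (decode-coding-region (point-min) (point-max) coding-system)
    (buffer-string)))

(my-coding-round-trip "hello" 'iso-2022-jp)
```

Both calls use @code{(point-min)} and @code{(point-max)} afresh, because encoding generally changes the length of the text.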
| 1285 | |
| 442 | 1286 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems |
| 428 | 1287 @subsection Detection of Textual Encoding |
| 1288 | |
| 1289 @defun coding-category-list | |
| 1290 This function returns a list of all recognized coding categories. | |
| 1291 @end defun | |
| 1292 | |
| 1293 @defun set-coding-priority-list list | |
| 1294 This function changes the priority order of the coding categories. | |
| 1295 @var{list} should be a list of coding categories, in descending order of | |
| 1296 priority. Unspecified coding categories will be lower in priority than | |
| 1297 all specified ones, in the same relative order they were in previously. | |
| 1298 @end defun | |
| 1299 | |
| 1300 @defun coding-priority-list | |
| 1301 This function returns a list of coding categories in descending order of | |
| 1302 priority. | |
| 1303 @end defun | |
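Because unspecified categories keep their relative order below the specified ones, promoting a single category only requires a one-element list. A sketch, with the hypothetical helper name @code{my-prefer-coding-category}:

```elisp
;; Hypothetical helper: give CATEGORY the highest detection priority.
(defun my-prefer-coding-category (category)
  (if (memq category (coding-category-list))
      (progn
        ;; Only CATEGORY is specified; all other categories stay in
        ;; their previous relative order below it.
        (set-coding-priority-list (list category))
        ;; Return the new head of the list as a check.
        (car (coding-priority-list)))
    (error "Unknown coding category: %s" category)))
```

Checking against @code{coding-category-list} first avoids passing a name that the running XEmacs does not recognize.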
| 1304 | |
| 1305 @defun set-coding-category-system coding-category coding-system | |
| 1306 This function changes the coding system associated with a coding category. | |
| 1307 @end defun | |
| 1308 | |
| 1309 @defun coding-category-system coding-category | |
| 1310 This function returns the coding system associated with a coding category. | |
| 1311 @end defun | |
| 1312 | |
| 1313 @defun detect-coding-region start end &optional buffer | |
| 1314 This function detects the coding system of the text in the region between | |
| 1315 @var{start} and @var{end}. The return value is a list of possible coding | |
| 1316 systems ordered by priority. If only ASCII characters are found, it | |
| 1317 returns @code{autodetect} or one of its subsidiary coding systems | |
| 1318 according to the detected end-of-line type. Optional arg @var{buffer} | |
| 1319 defaults to the current buffer. | |
| 1320 @end defun | |
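Since the return value may be either a list or a single coding system, callers have to handle both shapes. A hedged sketch, with the hypothetical helper name @code{my-detect-buffer-coding}:

```elisp
;; Hypothetical helper: return the highest-priority candidate that
;; detect-coding-region reports for the whole of BUFFER.
(defun my-detect-buffer-coding (&optional buffer)
  (save-excursion
    (set-buffer (or buffer (current-buffer)))
    (let ((result (detect-coding-region (point-min) (point-max))))
      ;; A list means several candidates, ordered by priority; anything
      ;; else is the single answer returned for pure ASCII text.
      (if (consp result)
          (car result)
        result))))
```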
| 1321 | |
| 442 | 1322 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems |
| 428 | 1323 @subsection Big5 and Shift-JIS Functions |
| 1324 | |
| 442 | 1325 These are special functions for working with the non-standard |
| 428 | 1326 Shift-JIS and Big5 encodings. |
| 1327 | |
| 1328 @defun decode-shift-jis-char code | |
| 442 | 1329 This function decodes a JIS X 0208 character encoded in Shift-JIS. |
| 428 | 1330 @var{code} is the character code in Shift-JIS, as a cons of two bytes. |
| 1331 The corresponding character is returned. | |
| 1332 @end defun | |
| 1333 | |
| 444 | 1334 @defun encode-shift-jis-char character |
| 1335 This function encodes the JIS X 0208 character @var{character} in | |
| 1336 Shift-JIS. The corresponding character code in Shift-JIS | |
| 1337 is returned as a cons of two bytes. | |
| 428 | 1338 @end defun |
| 1339 | |
| 1340 @defun decode-big5-char code | |
| 1341 This function decodes a character encoded in Big5. | |
| 1342 @var{code} is the character code in Big5. The corresponding character | |
| 1343 is returned. | |
| 1344 @end defun | |
| 1345 | |
| 444 | 1346 @defun encode-big5-char character |
| 1347 This function encodes the Big5 character @var{character} in Big5. | |
| 428 | 1348 The corresponding character code in Big5 is returned. |
| 1349 @end defun | |
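The encode and decode functions in this section are intended to be inverses. A minimal round-trip sketch for Shift-JIS, assuming @code{make-char} accepts the charset name and the two 7-bit JIS X 0208 octets (here 0x24 0x22, hiragana A):

```elisp
;; Build a JIS X 0208 character, encode it to Shift-JIS, and decode it
;; again.  Assumption: make-char takes the two 7-bit position octets.
(let* ((ch (make-char 'japanese-jisx0208 #x24 #x22))
       (sjis (encode-shift-jis-char ch)))    ; cons of two bytes
  (list sjis
        ;; Non-nil if decoding yields the character we started with.
        (char= ch (decode-shift-jis-char sjis))))
```

The same pattern should apply to @code{encode-big5-char} and @code{decode-big5-char} with a character from a Big5 charset.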
| 1350 | |
| 442 | 1351 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems |
| 1352 @subsection Coding Systems Implemented | |
| 1353 | |
| 1354 MULE initializes most of the commonly used coding systems at XEmacs's | |
| 1355 startup. A few others are initialized only when the relevant language | |
| 1356 environment is selected and support libraries are loaded. (NB: The | |
| 444 | 1357 following list is based on XEmacs 21.2.19, the development branch at the |
| 442 | 1358 time of writing. The list may be somewhat different for other |
| 1359 versions. Recent versions of GNU Emacs 20 implement a few more rare | |
| 1360 coding systems; work is being done to port these to XEmacs.) | |
| 1361 | |
| 444 | 1362 Unfortunately, there is not a consistent naming convention for character |
| 1363 sets, and for practical purposes coding systems often take their name | |
| 442 | 1364 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others |
| 444 | 1365 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few |
| 1366 from their non-text usages (internal, binary). To provide for this, and | |
| 442 | 1367 for the fact that many coding systems have several common names, an |
| 1368 aliasing system is provided. Finally, some effort has been made to use | |
| 1369 names that are registered as MIME charsets (this is why the name | |
| 1370 'shift_jis contains that un-Lisp-y underscore). | |
| 1371 | |
| 1372 There is a systematic naming convention regarding end-of-line (EOL) | |
| 1373 conventions for different systems. A coding system whose name ends in | |
| 1374 "-unix" forces the assumption that lines are broken by newlines (0x0A). | |
| 1375 A coding system whose name ends in "-mac" forces the assumption that | |
| 1376 lines are broken by ASCII CRs (0x0D). A coding system whose name ends | |
| 1377 in "-dos" forces the assumption that lines are broken by CRLF sequences | |
| 1378 (0x0D 0x0A). These subsidiary coding systems are automatically derived | |
| 1379 from a base coding system. Use of the base coding system implies | |
| 1380 autodetection of the text file convention. (The fact that the -unix, | |
| 1381 -mac, and -dos are derived from a base system results in them showing up | |
| 1382 as "aliases" in `list-coding-systems'.) These subsidiaries have a | |
| 1383 consistent modeline indicator as well. "-dos" coding systems have ":T" | |
| 1384 appended to their modeline indicator, while "-mac" coding systems have | |
| 1385 ":t" appended (eg, "ISO8:t" for iso-2022-8-mac). | |
| 1386 | |
| 1387 In the following table, each coding system is given with its mode line | |
| 1388 indicator in parentheses. Non-textual coding systems are listed first, | |
| 1389 followed by textual coding systems and their aliases. (The coding system | |
| 1390 subsidiary modeline indicators ":T" and ":t" will be omitted from the | |
| 1391 table of coding systems.) | |
| 1392 | |
| 1393 ### SJT 1999-08-23 Maybe should order these by language? Definitely | |
| 1394 need language usage for the ISO-8859 family. | |
| 1395 | |
| 1396 Note that although true coding system aliases have been implemented for | |
| 444 | 1397 XEmacs 21.2, the coding system initialization has not yet been converted |
| 442 | 1398 as of 21.2.19. So coding systems described as aliases have the same |
| 1399 properties as the aliased coding system, but will not be equal as Lisp | |
| 1400 objects. | |
| 1401 | |
| 1402 @table @code | |
| 1403 | |
| 1404 @item automatic-conversion | |
| 1405 @itemx undecided | |
| 1406 @itemx undecided-dos | |
| 1407 @itemx undecided-mac | |
| 1408 @itemx undecided-unix | |
| 1409 | |
| 1410 Modeline indicator: @code{Auto}. A type @code{undecided} coding system. | |
| 1411 Attempts to determine an appropriate coding system from file contents or | |
| 1412 the environment. | |
| 1413 | |
| 1414 @item raw-text | |
| 1415 @itemx no-conversion | |
| 1416 @itemx raw-text-dos | |
| 1417 @itemx raw-text-mac | |
| 1418 @itemx raw-text-unix | |
| 1419 @itemx no-conversion-dos | |
| 1420 @itemx no-conversion-mac | |
| 1421 @itemx no-conversion-unix | |
| 1422 | |
| 1423 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system, | |
| 1424 which converts only line-break-codes. An implementation quirk means | |
| 1425 that this coding system is also used for ISO8859-1. | |
| 1426 | |
| 1427 @item binary | |
| 1428 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding | |
| 1429 system which does no character coding or EOL conversions. An alias for | |
| 1430 @code{raw-text-unix}. | |
| 1431 | |
| 1432 @item alternativnyj | |
| 1433 @itemx alternativnyj-dos | |
| 1434 @itemx alternativnyj-mac | |
| 1435 @itemx alternativnyj-unix | |
| 1436 | |
| 1437 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for | |
| 1438 Alternativnyj, an encoding of the Cyrillic alphabet. | |
| 1439 | |
| 1440 @item big5 | |
| 1441 @itemx big5-dos | |
| 1442 @itemx big5-mac | |
| 1443 @itemx big5-unix | |
| 1444 | |
| 1445 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for | |
| 1446 BIG5, the most common encoding of traditional Chinese as used in Taiwan. | |
| 1447 | |
| 1448 @item cn-gb-2312 | |
| 1449 @itemx cn-gb-2312-dos | |
| 1450 @itemx cn-gb-2312-mac | |
| 1451 @itemx cn-gb-2312-unix | |
| 1452 | |
| 1453 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used | |
| 1454 for simplified Chinese (as used in the People's Republic of China), with | |
| 1455 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng} | |
| 1456 (G2) character sets initially designated. Chinese EUC (Extended Unix | |
| 1457 Code). | |
| 1458 | |
| 1459 @item ctext-hebrew | |
| 1460 @itemx ctext-hebrew-dos | |
| 1461 @itemx ctext-hebrew-mac | |
| 1462 @itemx ctext-hebrew-unix | |
| 1463 | |
| 1464 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system | |
| 1465 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character | |
| 1466 sets initially designated for Hebrew. | |
| 1467 | |
| 1468 @item ctext | |
| 1469 @itemx ctext-dos | |
| 1470 @itemx ctext-mac | |
| 1471 @itemx ctext-unix | |
| 1472 | |
| 1473 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system | |
| 1474 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character | |
| 1475 sets initially designated. X11 Compound Text Encoding. Often | |
| 1476 mistakenly recognized instead of EUC encodings; usual cause is | |
| 1477 inappropriate setting of @code{coding-priority-list}. | |
| 1478 | |
| 1479 @item escape-quoted | |
| 1480 | |
| 1481 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding | |
| 1482 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) | |
| 1483 character sets initially designated and escape quoting. Unix EOL | |
| 1484 conversion (ie, no conversion). It is used for .ELC files. | |
| 1485 | |
| 1486 @item euc-jp | |
| 1487 @itemx euc-jp-dos | |
| 1488 @itemx euc-jp-mac | |
| 1489 @itemx euc-jp-unix | |
| 1490 | |
| 1491 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system | |
| 1492 with @code{ascii} (G0), @code{japanese-jisx0208} (G1), | |
| 1493 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3) | |
| 1494 initially designated. Japanese EUC (Extended Unix Code). | |
| 1495 | |
| 1496 @item euc-kr | |
| 1497 @itemx euc-kr-dos | |
| 1498 @itemx euc-kr-mac | |
| 1499 @itemx euc-kr-unix | |
| 1500 | |
| 1501 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system | |
| 1502 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
| 1503 designated. Korean EUC (Extended Unix Code). | |
| 1504 | |
| 1505 @item hz-gb-2312 | |
| 1506 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding | |
| 1507 system with Unix EOL convention (ie, no conversion) using | |
| 1508 post-read-decode and pre-write-encode functions to translate the Hz/ZW | |
| 1509 coding system used for Chinese. | |
| 1510 | |
| 1511 @item iso-2022-7bit | |
| 1512 @itemx iso-2022-7bit-unix | |
| 1513 @itemx iso-2022-7bit-dos | |
| 1514 @itemx iso-2022-7bit-mac | |
| 1515 @itemx iso-2022-7 | |
| 1516 | |
| 1517 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system | |
| 1518 with @code{ascii} (G0) initially designated. Other character sets must | |
| 1519 be explicitly designated to be used. | |
| 1520 | |
| 1521 @item iso-2022-7bit-ss2 | |
| 1522 @itemx iso-2022-7bit-ss2-dos | |
| 1523 @itemx iso-2022-7bit-ss2-mac | |
| 1524 @itemx iso-2022-7bit-ss2-unix | |
| 1525 | |
| 1526 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system | |
| 1527 with @code{ascii} (G0) initially designated. Other character sets must | |
| 1528 be explicitly designated to be used. SS2 is used to invoke a | |
| 1529 96-charset, one character at a time. | |
| 1530 | |
| 1531 @item iso-2022-8 | |
| 1532 @itemx iso-2022-8-dos | |
| 1533 @itemx iso-2022-8-mac | |
| 1534 @itemx iso-2022-8-unix | |
| 1535 | |
| 1536 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system | |
| 1537 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
| 1538 designated. Other character sets must be explicitly designated to be | |
| 1539 used. No single-shift or locking-shift. | |
| 1540 | |
| 1541 @item iso-2022-8bit-ss2 | |
| 1542 @itemx iso-2022-8bit-ss2-dos | |
| 1543 @itemx iso-2022-8bit-ss2-mac | |
| 1544 @itemx iso-2022-8bit-ss2-unix | |
| 1545 | |
| 1546 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system | |
| 1547 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
| 1548 designated. Other character sets must be explicitly designated to be | |
| 1549 used. SS2 is used to invoke a 96-charset, one character at a time. | |
| 1550 | |
| 1551 @item iso-2022-int-1 | |
| 1552 @itemx iso-2022-int-1-dos | |
| 1553 @itemx iso-2022-int-1-mac | |
| 1554 @itemx iso-2022-int-1-unix | |
| 1555 | |
| 1556 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system | |
| 1557 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
| 1558 designated. ISO-2022-INT-1. | |
| 1559 | |
| 1560 @item iso-2022-jp-1978-irv | |
| 1561 @itemx iso-2022-jp-1978-irv-dos | |
| 1562 @itemx iso-2022-jp-1978-irv-mac | |
| 1563 @itemx iso-2022-jp-1978-irv-unix | |
| 1564 | |
| 1565 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding | |
| 1566 system. For compatibility with old Japanese terminals; if you need to | |
| 1567 know, look at the source. | |
| 1568 | |
| 1569 @item iso-2022-jp | |
| 1570 @itemx iso-2022-jp-2 (ISO7/SS) | |
| 1571 @itemx iso-2022-jp-dos | |
| 1572 @itemx iso-2022-jp-mac | |
| 1573 @itemx iso-2022-jp-unix | |
| 1574 @itemx iso-2022-jp-2-dos | |
| 1575 @itemx iso-2022-jp-2-mac | |
| 1576 @itemx iso-2022-jp-2-unix | |
| 1577 | |
| 1578 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding | |
| 1579 system with @code{ascii} (G0) initially designated, and complex | |
| 1580 specifications to ensure backward compatibility with old Japanese | |
| 1581 systems. Used for communication with mail and news in Japan. The "-2" | |
| 1582 versions also use SS2 to invoke a 96-charset one character at a time. | |
| 1583 | |
| 1584 @item iso-2022-kr | |
| 1585 Modeline indicator: @code{Ko/7bit}. A type @code{iso2022} 7-bit coding | |
| 1586 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
| 1587 designated. Used for e-mail in Korea. | |
| 1588 | |
| 1589 @item iso-2022-lock | |
| 1590 @itemx iso-2022-lock-dos | |
| 1591 @itemx iso-2022-lock-mac | |
| 1592 @itemx iso-2022-lock-unix | |
| 1593 | |
| 1594 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding | |
| 1595 system with @code{ascii} (G0) initially designated, using Locking-Shift | |
| 1596 to invoke a 96-charset. | |
| 1597 | |
| 1598 @item iso-8859-1 | |
| 1599 @itemx iso-8859-1-dos | |
| 1600 @itemx iso-8859-1-mac | |
| 1601 @itemx iso-8859-1-unix | |
| 1602 | |
| 1603 Due to implementation, this is not a type @code{iso2022} coding system, | |
| 1604 but rather an alias for the @code{raw-text} coding system. | |
| 1605 | |
| 1606 @item iso-8859-2 | |
| 1607 @itemx iso-8859-2-dos | |
| 1608 @itemx iso-8859-2-mac | |
| 1609 @itemx iso-8859-2-unix | |
| 1610 | |
| 1611 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding | |
| 1612 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially | |
| 1613 invoked. | |
| 1614 | |
| 1615 @item iso-8859-3 | |
| 1616 @itemx iso-8859-3-dos | |
| 1617 @itemx iso-8859-3-mac | |
| 1618 @itemx iso-8859-3-unix | |
| 1619 | |
| 1620 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system | |
| 1621 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially | |
| 1622 invoked. | |
| 1623 | |
| 1624 @item iso-8859-4 | |
| 1625 @itemx iso-8859-4-dos | |
| 1626 @itemx iso-8859-4-mac | |
| 1627 @itemx iso-8859-4-unix | |
| 1628 | |
| 1629 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system | |
| 1630 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially | |
| 1631 invoked. | |
| 1632 | |
| 1633 @item iso-8859-5 | |
| 1634 @itemx iso-8859-5-dos | |
| 1635 @itemx iso-8859-5-mac | |
| 1636 @itemx iso-8859-5-unix | |
| 1637 | |
| 1638 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with | |
| 1639 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked. | |
| 1640 | |
| 1641 @item iso-8859-7 | |
| 1642 @itemx iso-8859-7-dos | |
| 1643 @itemx iso-8859-7-mac | |
| 1644 @itemx iso-8859-7-unix | |
| 1645 | |
| 1646 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with | |
| 1647 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked. | |
| 1648 | |
| 1649 @item iso-8859-8 | |
| 1650 @itemx iso-8859-8-dos | |
| 1651 @itemx iso-8859-8-mac | |
| 1652 @itemx iso-8859-8-unix | |
| 1653 | |
| 1654 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with | |
| 1655 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked. | |
| 1656 | |
| 1657 @item iso-8859-9 | |
| 1658 @itemx iso-8859-9-dos | |
| 1659 @itemx iso-8859-9-mac | |
| 1660 @itemx iso-8859-9-unix | |
| 1661 | |
| 1662 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system | |
| 1663 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially | |
| 1664 invoked. | |
| 1665 | |
| 1666 @item koi8-r | |
| 1667 @itemx koi8-r-dos | |
| 1668 @itemx koi8-r-mac | |
| 1669 @itemx koi8-r-unix | |
| 1670 | |
| 1671 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for | |
| 1672 KOI8-R, an encoding of the Cyrillic alphabet. | |
| 1673 | |
| 1674 @item shift_jis | |
| 1675 @itemx shift_jis-dos | |
| 1676 @itemx shift_jis-mac | |
| 1677 @itemx shift_jis-unix | |
| 1678 | |
| 1679 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system | |
| 1680 implementing the Shift-JIS encoding for Japanese. The underscore is to | |
| 1681 conform to the MIME charset implementing this encoding. | |
| 1682 | |
| 1683 @item tis-620 | |
| 1684 @itemx tis-620-dos | |
| 1685 @itemx tis-620-mac | |
| 1686 @itemx tis-620-unix | |
| 1687 | |
| 1688 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The | |
| 1689 external encoding is defined by TIS620; the internal encoding is | |
| 1690 peculiar to MULE and is called @code{thai-xtis}. | |
| 1691 | |
| 1692 @item viqr | |
| 1693 | |
| 1694 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding | |
| 1695 system with Unix EOL convention (ie, no conversion) using | |
| 1696 post-read-decode and pre-write-encode functions to translate the VIQR | |
| 1697 coding system for Vietnamese. | |
| 1698 | |
| 1699 @item viscii | |
| 1700 @itemx viscii-dos | |
| 1701 @itemx viscii-mac | |
| 1702 @itemx viscii-unix | |
| 1703 | |
| 1704 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used | |
| 1705 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is | |
| 1706 given priority by XEmacs. | |
| 1707 | |
| 1708 @item vscii | |
| 1709 @itemx vscii-dos | |
| 1710 @itemx vscii-mac | |
| 1711 @itemx vscii-unix | |
| 1712 | |
| 1713 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used | |
| 1714 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is | |
| 1715 given priority by XEmacs. Use | |
| 1716 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII. | |
| 1717 | |
| 1718 @end table | |
| 1719 | |
| 428 | 1720 @node CCL, Category Tables, Coding Systems, MULE |
| 1721 @section CCL | |
| 1722 | |
| 442 | 1723 CCL (Code Conversion Language) is a simple structured programming |
| 428 | 1724 language designed for character coding conversions. A CCL program is |
| 1725 compiled to CCL code (represented by a vector of integers) and executed | |
| 1726 by the CCL interpreter embedded in Emacs. The CCL interpreter | |
| 1727 implements a virtual machine with 8 registers called @code{r0}, ..., | |
| 1728 @code{r7}, a number of control structures, and some I/O operators. Take | |
| 1729 care when using registers @code{r0} (used in implicit @dfn{set} | |
| 1730 statements) and especially @code{r7} (used internally by several | |
| 444 | 1731 statements and operations, especially for multiple return values and I/O |
| 428 | 1732 operations). |
| 1733 | |
| 442 | 1734 CCL is used for code conversion during process I/O and file I/O for |
| 428 | 1735 non-ISO2022 coding systems. (It is the only way for a user to specify a |
| 1736 code conversion function.) It is also used for calculating the code | |
| 1737 point of an X11 font from a character code. However, since CCL is | |
| 1738 designed as a powerful programming language, it can be used for more | |
| 1739 generic calculation where efficiency is demanded. A combination of | |
| 1740 three or more arithmetic operations can be calculated faster by CCL than | |
| 1741 by Emacs Lisp. | |
| 1742 | |
| 442 | 1743 @strong{Warning:} The code in @file{src/mule-ccl.c} and |
| 428 | 1744 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive |
| 1745 description of CCL's semantics. The previous version of this section | |
| 1746 contained several typos and obsolete names left from earlier versions of | |
| 1747 MULE, and many may remain. (I am not an experienced CCL programmer; the | |
| 1748 few who know CCL well find writing English painful.) | |
| 1749 | |
| 442 | 1750 A CCL program transforms an input data stream into an output data |
| 428 | 1751 stream. The input stream, held in a buffer of constant bytes, is left |
| 1752 unchanged. The buffer may be filled by an external input operation, | |
| 1753 taken from an Emacs buffer, or taken from a Lisp string. The output | |
| 1754 buffer is a dynamic array of bytes, which can be written by an external | |
| 1755 output operation, inserted into an Emacs buffer, or returned as a Lisp | |
| 1756 string. | |
| 1757 | |
| 442 | 1758 A CCL program is a (Lisp) list containing two or three members. The |
| 428 | 1759 first member is the @dfn{buffer magnification}, which indicates the |
| 1760 required minimum size of the output buffer as a multiple of the input | |
| 1761 buffer. It is followed by the @dfn{main block} which executes while | |
| 1762 there is input remaining, and an optional @dfn{EOF block} which is | |
| 1763 executed when the input is exhausted. Both the main block and the EOF | |
| 1764 block are CCL blocks. | |
| 1765 | |
| 442 | 1766 A @dfn{CCL block} is either a CCL statement or a list of CCL statements. |
| 444 | 1767 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer |
| 428 | 1768 or an @dfn{assignment}, which is a list of a register to receive the |
| 444 | 1769 assignment, an assignment operator, and an expression) or a @dfn{control |
| 428 | 1770 statement} (a list starting with a keyword, whose allowable syntax |
| 1771 depends on the keyword). | |
| 1772 | |
| 1773 @menu | |
| 1774 * CCL Syntax:: CCL program syntax in BNF notation. | |
| 1775 * CCL Statements:: Semantics of CCL statements. | |
| 1776 * CCL Expressions:: Operators and expressions in CCL. | |
| 1777 * Calling CCL:: Running CCL programs. | |
| 2640 | 1778 * CCL Example:: A trivial program to transform the Web's URL encoding. |
| 428 | 1779 @end menu |
| 1780 | |
| 442 | 1781 @node CCL Syntax, CCL Statements, , CCL |
| 428 | 1782 @comment Node, Next, Previous, Up |
| 1783 @subsection CCL Syntax | |
| 1784 | |
| 442 | 1785 The full syntax of a CCL program in BNF notation: |
| 428 | 1786 |
| 1787 @format | |
| 1788 CCL_PROGRAM := | |
| 1789 (BUFFER_MAGNIFICATION | |
| 1790 CCL_MAIN_BLOCK | |
| 1791 [ CCL_EOF_BLOCK ]) | |
| 1792 | |
| 1793 BUFFER_MAGNIFICATION := integer | |
| 1794 CCL_MAIN_BLOCK := CCL_BLOCK | |
| 1795 CCL_EOF_BLOCK := CCL_BLOCK | |
| 1796 | |
| 1797 CCL_BLOCK := | |
| 1798 STATEMENT | (STATEMENT [STATEMENT ...]) | |
| 1799 STATEMENT := | |
| 2367 | 1800 SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE | CALL |
| 1801 | TRANSLATE | MAP | END | |
| 428 | 1802 |
| 1803 SET := | |
| 1804 (REG = EXPRESSION) | |
| 1805 | (REG ASSIGNMENT_OPERATOR EXPRESSION) | |
| 2367 | 1806 | INT-OR-CHAR |
| 428 | 1807 |
| 1808 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG) | |
| 1809 | |
| 1810 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK]) | |
| 1811 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...]) | |
| 1812 LOOP := (loop STATEMENT [STATEMENT ...]) | |
| 1813 BREAK := (break) | |
| 1814 REPEAT := | |
| 1815 (repeat) | |
| 2367 | 1816 | (write-repeat [REG | INT-OR-CHAR | string]) |
| 1817 | (write-read-repeat REG [INT-OR-CHAR | ARRAY]) | |
| 428 | 1818 READ := |
| 1819 (read REG ...) | |
| 2367 | 1820 | (read-if (REG OPERATOR ARG) CCL_BLOCK [CCL_BLOCK]) |
| 428 | 1821 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...]) |
| 1822 WRITE := | |
| 1823 (write REG ...) | |
| 1824 | (write EXPRESSION) | |
| 2367 | 1825 | (write INT-OR-CHAR) | (write string) | (write REG ARRAY) |
| 428 | 1826 | string |
| 1827 CALL := (call ccl-program-name) | |
| 3439 | 1828 |
| 1829 | |
| 1830 TRANSLATE := ;; Not implemented under XEmacs, except mule-to-unicode and | |
| 1831 ;; unicode-to-mule. | |
| 1832 (translate-character REG(table) REG(charset) REG(codepoint)) | |
| 1833 | (translate-character SYMBOL REG(charset) REG(codepoint)) | |
| 1834 | (mule-to-unicode REG(charset) REG(codepoint)) | |
| 1835 | (unicode-to-mule REG(unicode,code) REG(CHARSET)) | |
| 1836 | |
| 428 | 1837 END := (end) |
| 1838 | |
| 1839 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7 | |
| 2367 | 1840 ARG := REG | INT-OR-CHAR |
| 428 | 1841 OPERATOR := |
| 1842 + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | // | |
| 1843 | < | > | == | <= | >= | != | de-sjis | en-sjis | |
| 1844 ASSIGNMENT_OPERATOR := | |
| 1845 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>= | |
| 2367 | 1846 ARRAY := '[' INT-OR-CHAR ... ']' |
| 1847 INT-OR-CHAR := integer | character | |
| 1848 | |
| 428 | 1849 @end format |
| 1850 | |
| 1851 @node CCL Statements, CCL Expressions, CCL Syntax, CCL | |
| 1852 @comment Node, Next, Previous, Up | |
| 1853 @subsection CCL Statements | |
| 1854 | |
| 442 | 1855 The Emacs Code Conversion Language provides the following statement |
| 428 | 1856 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, |
| 3439 | 1857 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, @dfn{translate} and |
| 1858 @dfn{end}. | |
| 428 | 1859 |
| 1860 @heading Set statement: | |
| 1861 | |
| 442 | 1862 The @dfn{set} statement has three variants with the syntaxes |
| 428 | 1863 @samp{(@var{reg} = @var{expression})}, |
| 1864 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and | |
| 1865 @samp{@var{integer}}. The assignment operator variation of the | |
| 1866 @dfn{set} statement works the same way as the corresponding C expression | |
| 1867 statement does. The assignment operators are @code{+=}, @code{-=}, | |
| 1868 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=}, | |
| 1869 @code{<<=}, and @code{>>=}, and they have the same meanings as in C. A | |
| 1870 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of | |
| 1871 the form @code{(r0 = @var{integer})}. | |
| 1872 | |
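The three variants might look like this inside a CCL block.  This is
only a sketch; the list is quoted so that it is ordinary Lisp data, and
the register choices and values are arbitrary.

@example
'((r1 = 10)           ; plain assignment
  (r1 += 2)           ; r1 becomes 12, as in C
  (r2 = (r1 << 4))    ; assignment from an infix expression
  65)                 ; naked integer, equivalent to (r0 = 65)
@end example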
| 1873 @heading I/O statements: | |
| 1874 | |
| 442 | 1875 The @dfn{read} statement takes one or more registers as arguments. It |
| 444 | 1876 reads one byte (a C char) from the input into each register in turn. |
| 428 | 1877 |
| 442 | 1878 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg} |
| 428 | 1879 ...)} it takes one or more registers as arguments and writes each in |
| 1880 turn to the output. The integer in a register (interpreted as an | |
| 2367 | 1881 Ichar) is encoded to multibyte form (i.e., Ibytes) and written to the |
| 428 | 1882 current output buffer. If it is less than 256, it is written as is. |
| 1883 The forms @samp{(write @var{expression})} and @samp{(write | |
| 1884 @var{integer})} are treated analogously. The form @samp{(write | |
| 1885 @var{string})} writes the constant string to the output. A | |
| 1886 "naked string" @samp{@var{string}} is equivalent to the statement @samp{(write | |
| 1887 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes | |
| 1888 the @var{reg}th element of the @var{array} to the output. | |
| 1889 | |
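The I/O forms described above can be sketched as the following quoted
block.  The registers, the string and the array contents are arbitrary
choices made for illustration.

@example
'((read r0 r1)             ; one byte into r0, the next into r1
  (write r0 r1)            ; write the contents of r0, then of r1
  (write (r0 + 1))         ; write the value of an expression
  (write "done")           ; write a constant string
  "done"                   ; naked string, equivalent to the line above
  (write r2 [?a ?b ?c]))   ; write element r2 of the array
@end example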
| 1890 @heading Conditional statements: | |
| 1891 | |
| 442 | 1892 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and |
| 428 | 1893 an optional @var{second CCL block} as arguments. If the |
| 1894 @var{expression} evaluates to non-zero, the first @var{CCL block} is | |
| 1895 executed. Otherwise, if there is a @var{second CCL block}, it is | |
| 1896 executed. | |
| 1897 | |
| 442 | 1898 The @dfn{read-if} variant of the @dfn{if} statement takes an |
| 428 | 1899 @var{expression}, a @var{CCL block}, and an optional @var{second CCL |
| 1900 block} as arguments. The @var{expression} must have the form | |
| 1901 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is | |
| 1902 a register or an integer). The @code{read-if} statement first reads | |
| 1903 from the input into the first register operand in the @var{expression}, | |
| 1904 then conditionally executes a CCL block just as the @code{if} statement | |
| 1905 does. | |
| 1906 | |
| 442 | 1907 The @dfn{branch} statement takes an @var{expression} and one or more CCL |
| 428 | 1908 blocks as arguments. The CCL blocks are treated as a zero-indexed |
| 1909 array, and the @code{branch} statement uses the @var{expression} as the | |
| 1910 index of the CCL block to execute. Null CCL blocks may be used as | |
| 1911 no-ops, continuing execution with the statement following the | |
| 1912 @code{branch} statement in the containing CCL block. Out-of-range | |
| 444 | 1913 values for the @var{expression} are also treated as no-ops. |
| 428 | 1914 |
| 442 | 1915 The @dfn{read-branch} variant of the @dfn{branch} statement takes a |
| 428 | 1916 @var{register} and one or more CCL blocks as arguments. The |
| 1917 @code{read-branch} statement first reads from the input into the | |
| 1918 @var{register}, then uses the value read as the index of the CCL block | |
| 1919 to execute, just as the @code{branch} statement does. | |
| 1920 | |
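A sketch of the conditional statements, written as a quoted block; the
tested values and the strings written are arbitrary.

@example
'((if (r0 == 32)
      (write "+")               ; executed when r0 is 32
    (write r0))                 ; executed otherwise
  (branch r1
    (write "zero")              ; executed when r1 is 0
    (write "one")               ; executed when r1 is 1
    (write "two"))              ; executed when r1 is 2
  (read-if (r0 > 127)           ; read into r0, then test it
    (write "?")
    (write r0)))
@end example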
| 1921 @heading Loop control statements: | |
| 1922 | |
| 442 | 1923 The @dfn{loop} statement creates a block with an implied jump from the |
| 444 | 1924 end of the block back to its head. The loop is exited on a @code{break} |
| 428 | 1925 statement, and continued without executing the tail by a @code{repeat} |
| 1926 statement. | |
| 1927 | |
| 442 | 1928 The @dfn{break} statement, written @samp{(break)}, terminates the |
| 428 | 1929 current loop and continues with the next statement in the current |
| 444 | 1930 block. |
| 428 | 1931 |
| 442 | 1932 The @dfn{repeat} statement has three variants, @code{repeat}, |
| 428 | 1933 @code{write-repeat}, and @code{write-read-repeat}. Each continues the |
| 1934 current loop from its head, possibly after performing I/O. | |
| 1935 @code{repeat} takes no arguments and does no I/O before jumping. | |
| 444 | 1936 @code{write-repeat} takes a single argument (a register, an |
| 428 | 1937 integer, or a string), writes it to the output, then jumps. |
| 1938 @code{write-read-repeat} takes one or two arguments. The first must | |
| 1939 be a register. The second may be an integer or an array; if absent, it | |
| 1940 is implicitly set to the first (register) argument. | |
| 1941 @code{write-read-repeat} writes its second argument to the output, then | |
| 1942 reads from the input into the register, and finally jumps. See the | |
| 1943 @code{write} and @code{read} statements for the semantics of the I/O | |
| 1944 operations for each type of argument. | |
| 1945 | |
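The loop control statements can be combined as in the following sketch,
which copies input to output until it meets a zero byte.  The stopping
condition is an arbitrary choice for illustration.

@example
'((read r0)
  (loop
    (if (r0 == 0)
        (break))               ; leave the loop at a zero byte
    (write-read-repeat r0)))   ; write r0, read the next byte into r0,
                               ; and jump back to the head of the loop
@end example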
| 3439 | 1946 @heading Other statements: |
| 428 | 1947 |
| 442 | 1948 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, |
| 428 | 1949 executes a CCL program as a subroutine. It does not return a value to |
| 1950 the caller, but can modify the register status. | |
| 1951 | |
| 3439 | 1952 The @dfn{mule-to-unicode} statement translates an XEmacs character into a |
| 1953 UCS code point, using U+FFFD REPLACEMENT CHARACTER if the given XEmacs | |
| 1954 character has no known corresponding code point. It takes two | |
| 1955 arguments; the first is a register in which is stored the character set | |
| 1956 ID of the character to be translated, and into which the UCS code is | |
| 1957 stored. The second is a register which stores the XEmacs code of the | |
| 1958 character in question; if it is from a multidimensional character set, | |
| 1959 like most of the East Asian national sets, it's stored as @samp{((c1 << | |
| 1960 8) | c2)}, where @samp{c1} is the first code, and @samp{c2} the second. | |
| 1961 (That is, as a single integer, the high-order eight bits of which encode | |
| 1962 the first position code, and the low order bits of which encode the | |
| 1963 second.) | |
| 1964 | |
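The packing described in the parenthetical above (first position code in
the high-order bits, second in the low-order eight bits) can be checked
in ordinary Lisp.  The sample codes @samp{#x30} and @samp{#x21} are
arbitrary values chosen for illustration.

@example
(let ((c1 #x30)                 ; first position code
      (c2 #x21))                ; second position code
  (logior (ash c1 8) c2))       ; #x3021
@end example
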
| 1965 The @dfn{unicode-to-mule} statement translates a Unicode code point | |
| 1966 (an integer) into an XEmacs character. Its first argument is a register | |
| 1967 containing the UCS code point; the code for the corresponding character | |
| 1968 will be written into this register, in the same format as for | |
| 1969 @samp{mule-to-unicode}. The second argument is a register into which will | |
| 1970 be written the character set ID of the converted character. | |
| 1971 | |
| 442 | 1972 The @dfn{end} statement, written @samp{(end)}, terminates the CCL |
| 428 | 1973 program successfully, and returns to caller (which may be a CCL |
| 1974 program). It does not alter the status of the registers. | |
| 1975 | |
| 1976 @node CCL Expressions, Calling CCL, CCL Statements, CCL | |
| 1977 @comment Node, Next, Previous, Up | |
| 1978 @subsection CCL Expressions | |
| 1979 | |
| 442 | 1980 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions |
| 428 | 1981 consist of a single @var{operand}, either a register (one of @code{r0}, |
| 1982 ..., @code{r7}) or an integer. Complex expressions are lists of the | |
| 1983 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike | |
| 1984 C, assignments are not expressions. | |
| 1985 | |
| 442 | 1986 In the following table, @var{X} is the target register for a @dfn{set}. |
| 428 | 1987 In subexpressions, this is implicitly @code{r7}. This means that |
| 1988 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used | |
| 1989 freely in subexpressions, since they return parts of their values in | |
| 1990 @code{r7}. @var{Y} may be an expression, register, or integer, while | |
| 1991 @var{Z} must be a register or an integer. | |
| 1992 | |
| 1993 @multitable @columnfractions .22 .14 .09 .55 | |
| 1994 @item Name @tab Operator @tab Code @tab C-like Description | |
| 1995 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z | |
| 1996 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z | |
| 1997 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z | |
| 1998 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z | |
| 1999 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z | |
| 2000 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z | |
| 2001 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z | |
| 2002 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z | |
| 2003 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z | |
| 2004 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z | |
| 2005 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z | |
| 2006 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF | |
| 2007 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z | |
| 2008 @item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z) | |
| 2009 @item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z) | |
| 2010 @item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z) | |
| 2011 @item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z) | |
| 2012 @item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z) | |
| 2013 @item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z) | |
| 2014 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z)) | |
| 2015 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) | |
| 2016 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) | |
| 2017 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) | |
| 2018 @end multitable | |
| 2019 | |
| 442 | 2020 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, |
| 428 | 2021 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS |
| 2022 and CCL_DECODE_SJIS operators treat their first and second operands as the high and | |
| 2023 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an | |
| 2024 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a | |
| 2025 complicated transformation of the Japanese standard JIS encoding to | |
| 2026 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to | |
| 2027 represent the SJIS operations in infix form. | |
| 2028 | |
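The operators that have no direct C counterpart can be modelled in
ordinary Lisp arithmetic, following the C-like descriptions in the table
above.  This is only a model of the arithmetic, not of the CCL
interpreter, and the operand values are arbitrary.

@example
(let ((y #x1234)
      (z #x56))
  (list (logior (ash y 8) z)    ; <8  gives #x123456
        (ash y -8)              ; >8  gives #x12 ...
        (logand y #xFF)         ;     ... and leaves #x34 in r7
        (/ 17 5)                ; //  gives 3 ...
        (% 17 5)))              ;     ... and leaves 2 in r7
@end example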
| 2640 | 2029 @node Calling CCL, CCL Example, CCL Expressions, CCL |
| 428 | 2030 @comment Node, Next, Previous, Up |
| 2031 @subsection Calling CCL | |
| 2032 | |
| 442 | 2033 CCL programs are called automatically during Emacs buffer I/O when the |
| 428 | 2034 external representation has a coding system type of @code{shift-jis}, |
| 2035 @code{big5}, or @code{ccl}. The program is specified by the coding | |
| 2036 system (@pxref{Coding Systems}). You can also call CCL programs from | |
| 2037 other CCL programs, and from Lisp using these functions: | |
| 2038 | |
| 2039 @defun ccl-execute ccl-program status | |
| 2040 Execute @var{ccl-program} with registers initialized by | |
| 2041 @var{status}. @var{ccl-program} is a vector of compiled CCL code | |
| 444 | 2042 created by @code{ccl-compile}. It is an error for the program to try to |
| 428 | 2043 execute a CCL I/O command. @var{status} must be a vector of nine |
| 2044 values, specifying the initial value for the R0, R1 .. R7 registers and | |
| 2045 for the instruction counter IC. A @code{nil} value for a register | |
| 2046 initializer causes the register to be set to 0. A @code{nil} value for | |
| 2047 the IC initializer causes execution to start at the beginning of the | |
| 2048 program. When the program is done, @var{status} is modified (by | |
| 2049 side-effect) to contain the ending values for the corresponding | |
| 444 | 2050 registers and IC. |
| 428 | 2051 @end defun |
| 2052 | |
| 444 | 2053 @defun ccl-execute-on-string ccl-program status string &optional continue |
| 428 | 2054 Execute @var{ccl-program} with initial @var{status} on |
| 2055 @var{string}. @var{ccl-program} is a vector of compiled CCL code | |
| 2056 created by @code{ccl-compile}. @var{status} must be a vector of nine | |
| 2057 values, specifying the initial value for the R0, R1 .. R7 registers and | |
| 2058 for the instruction counter IC. A @code{nil} value for a register | |
| 2059 initializer causes the register to be set to 0. A @code{nil} value for | |
| 2060 the IC initializer causes execution to start at the beginning of the | |
| 444 | 2061 program. An optional fourth argument @var{continue}, if non-@code{nil}, causes |
| 428 | 2062 the IC to |
| 2063 remain on the unsatisfied read operation if the program terminates due | |
| 2064 to exhaustion of the input buffer. Otherwise the IC is set to the end | |
| 444 | 2065 of the program. When the program is done, @var{status} is modified (by |
| 428 | 2066 side-effect) to contain the ending values for the corresponding |
| 2067 registers and IC. Returns the resulting string. | |
| 2068 @end defun | |
| 2069 | |
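A hedged sketch of calling a CCL program from Lisp, following the
description above: compile a trivial copying program with
@code{ccl-compile}, build a nine-element status vector of @code{nil}
values (so that all registers start at 0 and execution starts at the
beginning of the program), and run the program on a string.  The
program and the input string are illustrative; going by the description
of the function, the call should return a copy of the input.

@example
(let ((program (ccl-compile
                '(1 ((read r0)
                     (loop
                       (write r0)
                       (read r0)
                       (repeat))))))
      (status (make-vector 9 nil)))
  (ccl-execute-on-string program status "hello"))
@end example
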
| 442 | 2070 To call a CCL program from another CCL program, it must first be |
| 428 | 2071 registered: |
| 2072 | |
| 2073 @defun register-ccl-program name ccl-program | |
| 444 | 2074 Register @var{name} for CCL program @var{ccl-program} in |
| 2075 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of | |
| 2076 a CCL program, or @code{nil}. Returns the index number of the registered CCL | |
| 428 | 2077 program. |
| 2078 @end defun | |
| 2079 | |
| 442 | 2080 Information about the processor time used by the CCL interpreter can be |
| 428 | 2081 obtained using these functions: |
| 2082 | |
| 2083 @defun ccl-elapsed-time | |
| 2084 Returns the elapsed processor time of the CCL interpreter as a cons of | |
| 2085 user and system time, as | |
| 2086 floating point numbers measured in seconds. If only one | |
| 2087 overall value can be determined, the return value will be a cons of that | |
| 2088 value and 0. | |
| 2089 @end defun | |
| 2090 | |
| 2091 @defun ccl-reset-elapsed-time | |
| 2092 Resets the CCL interpreter's internal elapsed time registers. | |
| 2093 @end defun | |
| 2094 | |
| 2640 | 2095 @node CCL Example, , Calling CCL, CCL |
| 428 | 2096 @comment Node, Next, Previous, Up |
| 2640 | 2097 @subsection CCL Example |
| 2098 | |
| 2099 In this section, we describe the implementation of a trivial coding | |
| 2100 system to transform from the Web's URL encoding to XEmacs' internal | |
| 2101 coding. Many people will have been first exposed to URL encoding when | |
| 2102 they saw ``%20'' where they expected a space in a file's name on their | |
| 2103 local hard disk; this can happen when a browser saves a file from the | |
| 2104 web and doesn't encode the name, as passed from the server, properly. | |
| 2105 | |
| 2106 URL encoding itself is underspecified with regard to encodings beyond | |
| 2107 ASCII. The relevant document, RFC 1738, explicitly doesn't give any | |
| 2108 information on how to encode non-ASCII characters, and the ``obvious'' | |
| 2109 way---use the %xx values for the octets of the eight bit MIME character | |
| 2110 set in which the page was served---breaks when a user types a character | |
| 2111 outside that character set. Best practice for web development is to | |
| 2112 serve all pages as UTF-8 and treat incoming form data as using that | |
| 2113 coding system. (Oh, and gamble that your clients won't ever want to | |
| 2114 type anything outside Unicode. But that's not so much of a gamble with | |
| 2115 today's client operating systems.) We don't treat non-ASCII in this | |
| 2116 example, as dealing with @samp{(read-multibyte-character ...)} and | |
| 2117 errors therewith would make it much harder to understand. | |
| 2118 | |
| 2119 Since CCL isn't a very rich language, we move much of the logic that | |
| 2120 would ordinarily be computed from operations like @code{(member ..)}, | |
| 2121 @code{(and ...)} and @code{(or ...)} into tables, from which register | |
| 2122 values are read and written, and on which @code{if} statements are | |
| 2123 predicated. Much more of the implementation of this coding system is | |
| 2124 occupied with constructing these tables---in normal Emacs Lisp---than it | |
| 2125 is with actual CCL code. | |
| 2126 | |
| 2127 All the @code{defvar} statements we deal with in the next few sections | |
| 2128 are surrounded by a @code{(eval-and-compile ...)}, which means that the | |
| 2129 logic which initializes these variables executes at compile time, and if | |
| 2130 XEmacs loads the compiled version of the file, these variables are | |
| 2131 initialized as constants. | |
| 2132 | |
| 2133 @menu | |
| 2134 * Four bits to ASCII:: Two tables used for getting hex digits from ASCII. | |
| 2135 * URI Encoding constants:: Useful predefined characters. | |
| 2136 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL. | |
| 2137 * Characters to be preserved:: No transformation needed for these characters. | |
| 2138 * The program to decode to internal format:: . | |
| 2139 * The program to encode from internal format:: . | |
| 2690 | 2140 * The actual coding system:: . |
| 2640 | 2141 @end menu |
| 2142 | |
| 2143 @node Four bits to ASCII, URI Encoding constants, , CCL Example | |
| 2144 @subsubsection Four bits to ASCII | |
| 2145 | |
| 2146 The first @code{defvar} is for | |
| 2147 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that | |
| 2148 maps from an octet's value to the ASCII encoding for the hex value of | |
| 2149 its most significant four bits. That might sound complex, but it isn't; | |
| 2150 for decimal 65, hex value @samp{#x41}, the entry in the table is the | |
| 2151 ASCII encoding of `4'. For decimal 122, ASCII `z', hex value | |
| 2152 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)} | |
| 2153 after this file is loaded gives the ASCII encoding of 7. | |
| 2154 | |
| 2155 @example | |
| 2156 (defvar url-coding-high-order-nybble-as-ascii | |
| 2157 (let ((val (make-vector 256 0)) | |
| 2158 (i 0)) | |
| 2159 (while (< i (length val)) | |
| 2690 | 2160 (aset val i (char-to-int (aref (format "%02X" i) 0))) |
| 2640 | 2161 (setq i (1+ i))) |
| 2162 val) | |
| 2163 "Table to find an ASCII version of an octet's most significant 4 bits.") | |
| 2164 @end example | |
| 2165 | |
| 2166 The next table, @code{url-coding-low-order-nybble-as-ascii} is almost | |
| 2167 the same thing, but this time it has a map for the hex encoding of the | |
| 2690 | 2168 low-order four bits. So the entry at offset @samp{#x41} (decimal 65) is |
| 2640 | 2169 the ASCII encoding of `1', and the entry at offset @samp{#x7a} (decimal |
| 2170 122) is the ASCII encoding of `A'. | |
| 2171 | |
| 2172 @example | |
| 2173 (defvar url-coding-low-order-nybble-as-ascii | |
| 2174 (let ((val (make-vector 256 0)) | |
| 2175 (i 0)) | |
| 2176 (while (< i (length val)) | |
| 2690 | 2177 (aset val i (char-to-int (aref (format "%02X" i) 1))) |
| 2640 | 2178 (setq i (1+ i))) |
| 2179 val) | |
| 2180 "Table to find an ASCII version of an octet's least significant 4 bits.") | |
| 2181 @end example | |
| 2182 | |
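Both tables rest on the same @code{format} call, so a single entry can
be checked by hand.  The following sketch does so for the octet
@samp{#x7a} used in the text above.

@example
(let ((hex (format "%02X" #x7a)))   ; the string "7A"
  (list (aref hex 0)                ; ?7, the high-order nybble
        (aref hex 1)))              ; ?A, the low-order nybble
@end example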
| 2183 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example | |
| 2184 @subsubsection URI Encoding constants | |
| 2185 | |
| 2186 Next, we have a couple of variables that make the CCL code more | |
| 2187 readable. The first is the ASCII encoding of the percentage sign; this | |
| 2188 character is used as an escape code, to start the encoding of a | |
| 2189 non-printable character. For historical reasons, URL encoding allows | |
| 2190 the space character to be encoded as a plus sign--it does make typing | |
| 2191 URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and | |
| 2192 as such, we have to check when decoding for this value, and map it to | |
| 2193 the space character. When doing this in CCL, we use the | |
| 2194 @code{url-coding-escaped-space-code} variable. | |
| 2195 | |
| 2196 @example | |
| 2690 | 2197 (defvar url-coding-escape-character-code (char-to-int ?%) |
| 2640 | 2198 "The code point for the percentage sign, in ASCII.") |
| 2199 | |
| 2690 | 2200 (defvar url-coding-escaped-space-code (char-to-int ?+) |
| 2640 | 2201 "The URL-encoded value of the space character, that is, +.") |
| 2202 @end example | |
| 2203 | |
| 2690 | 2204 @node Numeric to ASCII-hexadecimal conversion, Characters to be preserved, URI Encoding constants, CCL Example |
| 2640 | 2205 @subsubsection Numeric to ASCII-hexadecimal conversion |
| 2206 | |
| 2207 Now, we have a couple of utility tables that wouldn't be necessary in | |
| 2208 a more expressive programming language than is CCL. The first is sixteen | |
| 2209 in length, and maps a hexadecimal number to the ASCII encoding of that | |
| 2210 number; so zero maps to ASCII `0', ten maps to ASCII `A.' The second | |
| 2211 does the reverse; that is, it maps an ASCII character to its value when | |
| 2212 interpreted as a hexadecimal digit. ('A' => 10, 'c' => 12, '2' => 2, as | |
| 2213 a few examples.) | |
| 2214 | |
| 2215 @example | |
| 2216 (defvar url-coding-hex-digit-table | |
| 2217 (let ((i 0) | |
| 2218 (val (make-vector 16 0))) | |
| 2219 (while (< i 16) | |
| 2690 | 2220 (aset val i (char-to-int (aref (format "%X" i) 0))) |
| 2640 | 2221 (setq i (1+ i))) |
| 2222 val) | |
| 2223 "A map from a hexadecimal digit's numeric value to its encoding in ASCII.") | |
| 2224 | |
| 2225 (defvar url-coding-latin-1-as-hex-table | |
| 2226 (let ((val (make-vector 256 0)) | |
| 2227 (i 0)) | |
| 2228 (while (< i (length val)) | |
| 2229 ;; Get a hex val for this ASCII character. | |
| 2230 (aset val i (string-to-int (format "%c" i) 16)) | |
| 2231 (setq i (1+ i))) | |
| 2232 val) | |
| 2233 "A map from Latin 1 code points to their values as hexadecimal digits.") | |
| 2234 @end example | |
| 2235 | |
| 2690 | 2236 @node Characters to be preserved, The program to decode to internal format, Numeric to ASCII-hexadecimal conversion, CCL Example |
| 2640 | 2237 @subsubsection Characters to be preserved |
| 2238 | |
| 2239 And finally, the last of these tables. URL encoding says that | |
| 2240 alphanumeric characters, the underscore, hyphen and the full stop | |
| 2241 @footnote{That's what the standards call it, though my North American | |
| 2242 readers will be more familiar with it as the period character.} retain | |
| 2243 their ASCII encoding, and don't undergo transformation. | |
| 2244 @code{url-coding-should-preserve-table} is an array in which the entries | |
| 2245 are one if the corresponding ASCII character should be left as-is, and | |
| 2246 zero if they should be transformed. So the entries for all the control | |
| 2247 and most of the punctuation characters are zero. Lisp programmers will | |
| 2248 observe that this initialization is particularly inefficient, but | |
| 2249 they'll also be aware that this is a long way from an inner loop where | |
| 2250 every nanosecond counts. | |
| 2251 | |
| 2252 @example | |
| 2253 (defvar url-coding-should-preserve-table | |
| 2254 (let ((preserve | |
| 2255 (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o | |
| 2256 ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G | |
| 2257 ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y | |
| 2258 ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9)) | |
| 2259 (i 0) | |
| 2260 (res (make-vector 256 0))) | |
| 2261 (while (< i 256) | |
| 2262 (when (member (int-char i) preserve) | |
| 2263 (aset res i 1)) | |
| 2264 (setq i (1+ i))) | |
| 2265 res) | |
| 2266 "A 256-entry array of flags, indicating whether or not to preserve an | |
| 2267 octet as its ASCII encoding.") | |
| 2268 @end example | |
| 2269 | |
| 2690 | 2270 @node The program to decode to internal format, The program to encode from internal format, Characters to be preserved, CCL Example |
| 2640 | 2271 @subsubsection The program to decode to internal format |
| 2272 | |
| 2273 After the almost interminable tables, we get to the CCL. The first | |
| 2274 CCL program, @code{ccl-decode-urlcoding}, decodes from the URL coding to | |
| 2275 our internal format; since this version of CCL doesn't have support for | |
| 2276 error checking on the input, we don't do any verification on it. | |
| 2277 | |
| 2278 The buffer magnification--approximate ratio of the size of the output | |
| 2279 buffer to the size of the input buffer--is declared as one, because | |
| 2280 fractional values aren't allowed. (Since all those %20's will map to | |
| 2281 ` ', the length of the output text will be less than that of the input | |
| 2282 text.) | |
| 2283 | |
| 2284 So, first we read an octet from the input buffer into register | |
| 2285 @samp{r0}, to set up the loop. Next, we start the loop, with a | |
| 2286 @code{(loop ...)} statement, and we check if the value in @samp{r0} is a | |
| 2287 percentage sign. (Note the comma before | |
| 2288 @code{url-coding-escape-character-code}; since CCL is a Lisp macro | |
| 2289 language, we can break out of the macro evaluation with a comma, and as | |
| 2290 such, ``@code{,url-coding-escape-character-code}'' will be evaluated as a | |
| 2291 literal `37.') | |
| 2292 | |
| 2293 If it is a percentage sign, we read the next two octets into @samp{r2} | |
| 2294 and @samp{r3}, and convert them into their hexadecimal numeric values, | |
| 2295 using the @code{url-coding-latin-1-as-hex-table} array declared above. | |
| 2296 (But again, it'll be interpreted as a literal array.) We then left | |
| 2297 shift the first by four bits, OR the two together, and write the | |
| 2298 result to the output buffer. | |
| 2299 | |
| 2300 If it isn't a percentage sign, and it is a `+' sign, we write a | |
| 2301 space--hexadecimal 20--to the output buffer. | |
| 2302 | |
| 2303 If none of those things are true, we pass the octet to the output buffer | |
| 2304 untransformed. (This could be a place to put error checking, in a more | |
| 2305 expressive language.) We then read one more octet from the input | |
| 2306 buffer, and move to the next iteration of the loop. | |
| 2307 | |
| 2308 @example | |
| 2309 (define-ccl-program ccl-decode-urlcoding | |
| 2310 `(1 | |
| 2311 ((read r0) | |
| 2312 (loop | |
| 2313 (if (r0 == ,url-coding-escape-character-code) | |
| 2314 ((read r2 r3) | |
| 2315 ;; Replace r2 and r3 with the values at those offsets in | |
| 2316 ;; url-coding-latin-1-as-hex-table. | |
| 2317 (r2 = r2 ,url-coding-latin-1-as-hex-table) | |
| 2318 (r3 = r3 ,url-coding-latin-1-as-hex-table) | |
| 2319 (r2 <<= 4) | |
| 2320 (r3 |= r2) | |
| 2321 (write r3)) | |
| 2322 (if (r0 == ,url-coding-escaped-space-code) | |
| 2323 (write #x20) | |
| 2324 (write r0))) | |
| 2325 (read r0) | |
| 2326 (repeat)))) | |
| 2327 "CCL program to take URI-encoded ASCII text and transform it to our | |
| 2328 internal encoding. ") | |
| 2329 @end example | |
| 2330 | |
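The arithmetic of the escape branch can be traced in ordinary Lisp.
The sketch below follows the input @samp{%41}, using
@code{string-to-int} in place of the table lookup, as the table's
initialization does; the input is an arbitrary example.

@example
(let ((r2 (string-to-int "4" 16))    ; first hex digit after the percent sign
      (r3 (string-to-int "1" 16)))   ; second hex digit
  (logior (ash r2 4) r3))            ; #x41, the code for `A'
@end example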
| 2690 | 2331 @node The program to encode from internal format, The actual coding system, The program to decode to internal format, CCL Example |
| 2640 | 2332 @subsubsection The program to encode from internal format |
| 2333 | |
| 2334 Next, we see the CCL program to encode ASCII text as URL coded text. | |
| 2335 Here, the buffer magnification is specified as three, to account for ` ' | |
| 2336 mapping to %20, etc. As before, we read an octet from the input into | |
| 2337 @samp{r0}, and move into the body of the loop. Next, we check if we | |
| 2338 should preserve the value of this octet, by reading from offset | |
| 2339 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}. | |
| 2340 Then we have an @samp{if} statement predicated on the value in | |
| 2341 @samp{r1}; for the true branch, we write the input octet directly. For | |
| 2342 the false branch, we write a percentage sign, the ASCII encoding of the | |
| 2343 high four bits in hex, and then the ASCII encoding of the low four bits | |
| 2344 in hex. | |
| 2345 | |
| 2346 We then read an octet from the input into @samp{r0}, and repeat the loop. | |
| 2347 | |
| 2348 @example | |
| 2349 (define-ccl-program ccl-encode-urlcoding | |
| 2350 `(3 | |
| 2351 ((read r0) | |
| 2352 (loop | |
| 2353 (r1 = r0 ,url-coding-should-preserve-table) | |
| 2354 ;; If we should preserve the value, just write the octet directly. | |
| 2355 (if r1 | |
| 2356 (write r0) | |
| 2357 ;; else, write a percentage sign, and the hex value of the octet, in | |
| 2358 ;; an ASCII-friendly format. | |
| 2359 ((write ,url-coding-escape-character-code) | |
| 2360 (write r0 ,url-coding-high-order-nybble-as-ascii) | |
| 2361 (write r0 ,url-coding-low-order-nybble-as-ascii))) | |
| 2362 (read r0) | |
| 2363 (repeat)))) | |
| 2364 "CCL program to encode octets (almost) according to RFC 1738") | |
| 2365 @end example | |
| 428 | 2366 |
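What the false branch writes for a given octet can be reproduced with
@code{format} in ordinary Lisp.  This only models the output of the CCL
program and does not use its tables.

@example
(list (format "%%%02X" #x20)    ; "%20" for the space character
      (format "%%%02X" #x7e))   ; "%7E" for the tilde
@end example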
| 2690 | 2367 @node The actual coding system, , The program to encode from internal format, CCL Example |
| 2368 @subsubsection The actual coding system | |
| 2369 | |
| 2370 To actually create the coding system, we call | |
| 2371 @samp{make-coding-system}. The first argument is the symbol that is to | |
| 2372 be the name of the coding system, in our case @samp{url-coding}. The | |
| 2373 second specifies that the coding system is to be of type | |
| 2374 @samp{ccl}---there are several other coding system types available; | |
| 2375 see the documentation for @samp{make-coding-system} for the | |
| 2376 full list. Then there's a documentation string describing the wherefore | |
| 2377 and caveats of the coding system, and the final argument is a property | |
| 2378 list giving information about the CCL programs and the coding system's | |
| 2379 mnemonic. | |
| 2380 | |
| 2381 @example | |
| 2382 (make-coding-system | |
| 2383 'url-coding 'ccl | |
| 2384 "The coding used by application/x-www-form-urlencoded HTTP applications. | |
| 2385 This coding form doesn't specify anything about non-ASCII characters, so | |
| 2386 make sure you've transformed to a seven-bit coding system first." | |
| 2387 '(decode ccl-decode-urlcoding | |
| 2388 encode ccl-encode-urlcoding | |
| 2389 mnemonic "URLenc")) | |
| 2390 @end example | |
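| | |
| Once the coding system exists, it can be exercised like any other. | |
| The following is only a sketch: it assumes the CCL programs and tables | |
| from this example have already been evaluated, and the sample string is | |
| arbitrary. | |
| | |
| @example | |
| (let* ((plain "a b&c") | |
|        (encoded (encode-coding-string plain 'url-coding)) | |
|        (decoded (decode-coding-string encoded 'url-coding))) | |
|   ;; If the two CCL programs are inverses of each other, the round | |
|   ;; trip should give back the original string. | |
|   (equal plain decoded)) | |
| @end example | |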
| 2391 | |
| 2392 If you're lucky, the @samp{url-coding} coding system described here | |
| 2393 should be available in the XEmacs package system. Otherwise, downloading | |
| 2394 it from @samp{http://www.parhasard.net/url-coding.el} should work for | |
| 2395 the foreseeable future. | |
| 2396 | |
| 775 | 2397 @node Category Tables, Unicode Support, CCL, MULE |
| 428 | 2398 @section Category Tables |
| 2399 | |
| 2400 A category table is a type of char table used for keeping track of | |
| 2401 categories. Categories are used for classifying characters for use in | |
| 440 | 2402 regexps---you can refer to a category rather than having to use a |
| 428 | 2403 complicated [] expression (and category lookups are significantly |
| 2404 faster). | |
| 2405 | |
| 2406 There are 95 different categories available, one for each printable | |
| 2407 character (including space) in the ASCII charset. Each category is | |
| 2408 designated by one such character, called a @dfn{category designator}. | |
| 2409 They are specified in a regexp using the syntax @samp{\cX}, where X is a | |
| 2410 category designator. (This is not yet implemented.) | |
| 2411 | |
| 2412 A category table specifies, for each character, the categories that | |
| 2413 the character is in. Note that a character can be in more than one | |
| 2414 category. More specifically, a category table maps from a character to | |
| 2415 either the value @code{nil} (meaning the character is in no categories) | |
| 2416 or a 95-element bit vector, specifying for each of the 95 categories | |
| 2417 whether the character is in that category. | |
| 2418 | |
| 2419 Special Lisp functions are provided that abstract this, so you do not | |
| 2420 have to directly manipulate bit vectors. | |
| 2421 | |
| 444 | 2422 @defun category-table-p object |
| 2423 This function returns @code{t} if @var{object} is a category table. | |
| 428 | 2424 @end defun |
| 2425 | |
| 2426 @defun category-table &optional buffer | |
| 2427 This function returns the current category table. This is the one | |
| 2428 specified by the current buffer, or by @var{buffer} if it is | |
| 2429 non-@code{nil}. | |
| 2430 @end defun | |
| 2431 | |
| 2432 @defun standard-category-table | |
| 2433 This function returns the standard category table. This is the one used | |
| 2434 for new buffers. | |
| 2435 @end defun | |
| 2436 | |
| 444 | 2437 @defun copy-category-table &optional category-table |
| 2438 This function returns a new category table which is a copy of | |
| 2439 @var{category-table}, which defaults to the standard category table. | |
| 428 | 2440 @end defun |
| 2441 | |
| 444 | 2442 @defun set-category-table category-table &optional buffer |
| 2443 This function selects @var{category-table} as the new category table for | |
| 2444 @var{buffer}. @var{buffer} defaults to the current buffer if omitted. | |
| 428 | 2445 @end defun |
| 2446 | |
| 444 | 2447 @defun category-designator-p object |
| 2448 This function returns @code{t} if @var{object} is a category designator (a | |
| 428 | 2449 char in the range @samp{' '} to @samp{'~'}). |
| 2450 @end defun | |
| 2451 | |
| 444 | 2452 @defun category-table-value-p object |
| 2453 This function returns @code{t} if @var{object} is a category table value. | |
| 428 | 2454 Valid values are @code{nil} or a bit vector of size 95. |
| 2455 @end defun | |
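| | |
| As a rough illustration of how these functions fit together, the sketch | |
| below installs a private copy of the standard category table in a | |
| temporary buffer and then applies the predicates. It uses only the | |
| functions described above; per those descriptions each test is expected | |
| to succeed, but this has not been checked against every XEmacs version. | |
| | |
| @example | |
| (with-temp-buffer | |
|   ;; Work on a private copy so the standard table is left untouched. | |
|   (let ((table (copy-category-table))) | |
|     (set-category-table table) | |
|     (list (category-table-p (category-table)) | |
|           ;; ?a is a printable ASCII character. | |
|           (category-designator-p ?a) | |
|           ;; nil means "in no categories". | |
|           (category-table-value-p nil)))) | |
| @end example | |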
| 2456 | |
| 775 | 2457 |
| 2458 @c Added 2002-03-13 sjt | |
| 1183 | 2459 @node Unicode Support, Charset Unification, Category Tables, MULE |
| 775 | 2460 @section Unicode Support |
| 2461 @cindex unicode | |
| 2462 @cindex utf-8 | |
| 2463 @cindex utf-16 | |
| 2464 @cindex ucs-2 | |
| 2465 @cindex ucs-4 | |
| 2466 @cindex bmp | |
| 2467 @cindex basic multilingual plane | |
| 2468 | |
| 2469 Unicode support was added by Ben Wing to XEmacs 21.5.6. | |
| 2470 | |
| 2471 @defun set-language-unicode-precedence-list list | |
| 2472 Set the language-specific precedence list used for Unicode decoding. | |
| 2473 This is a list of charsets, which are consulted in order for a translation | |
| 2474 matching a given Unicode character. If no matches are found, the charsets | |
| 2475 in the default precedence list (see | |
| 2476 @code{set-default-unicode-precedence-list}) are consulted, and then all | |
| 2477 remaining charsets, in some arbitrary order. | |
| 2478 | |
| 2479 The language-specific precedence list is meant to be set as part of the | |
| 2480 language environment initialization; the default precedence list is meant | |
| 2481 to be set by the user. | |
| 2482 @end defun | |
| 2483 | |
| 2484 @defun language-unicode-precedence-list | |
| 2485 Return the language-specific precedence list used for Unicode decoding. | |
| 2486 See @code{set-language-unicode-precedence-list} for more information. | |
| 2487 @end defun | |
| 2488 | |
| 2489 @defun set-default-unicode-precedence-list list | |
| 2490 Set the default precedence list used for Unicode decoding. | |
| 2491 This is meant to be set by the user. See | |
| 2492 @code{set-language-unicode-precedence-list} for more information. | |
| 2493 @end defun | |
| 2494 | |
| 2495 @defun default-unicode-precedence-list | |
| 2496 Return the default precedence list used for Unicode decoding. | |
| 2497 See @code{set-language-unicode-precedence-list} for more information. | |
| 2498 @end defun | |
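| | |
| For instance, a user who wants Unicode characters decoded into ISO | |
| 8859-15 in preference to ISO 8859-1 might try something like the | |
| following. This is a sketch only; it assumes the charsets | |
| @code{latin-iso8859-15} and @code{latin-iso8859-1} exist in the running | |
| XEmacs. | |
| | |
| @example | |
| ;; Consult ISO 8859-15 first, then ISO 8859-1. | |
| (set-default-unicode-precedence-list | |
|  '(latin-iso8859-15 latin-iso8859-1)) | |
| ;; Inspect the list that is now in effect. | |
| (default-unicode-precedence-list) | |
| @end example | |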
| 2499 | |
| 2500 @defun set-unicode-conversion character code | |
| 2501 Add conversion information between Unicode codepoints and characters. | |
| 2502 @var{character} is one of the following: | |
| 2503 | |
| 2504 @itemize @bullet | |
| 2505 @item A character (in which case @var{code} must be a non-negative integer) | |
| 2506 @item A vector of characters (in which case @var{code} must be a vector of | |
| 2507 non-negative integers of the same length) | |
| @end itemize | |
| 2508 | |
| 2509 Values of @var{code} above 2^20 - 1 are allowed for the purpose of specifying | |
| 2510 private characters, but will cause errors when converted to UTF-16 or UTF-32. | |
| 2511 UCS-4 and UTF-8 can handle values to 2^31 - 1, but XEmacs Lisp integers top | |
| 2512 out at 2^30 - 1. | |
| 2513 @end defun | |
| 2514 | |
| 2515 @defun character-to-unicode character | |
| 2516 Convert @var{character} to Unicode codepoint. | |
| 2517 When there is no international support (i.e. MULE is not defined), | |
| 2518 this function simply does @code{char-to-int}. | |
| 2519 @end defun | |
| 2520 | |
| 2521 @defun unicode-to-character code &optional charsets | |
| 2522 Convert Unicode codepoint @var{code} to character. | |
| 2523 @var{code} should be a non-negative integer. | |
| 2524 If @var{charsets} is given, it should be a list of charsets, and only those | |
| 2525 charsets will be consulted, in the given order, for a translation. | |
| 2526 Otherwise, the default ordering of all charsets will be used (see | |
| 2527 @code{set-unicode-charset-precedence}). | |
| 2528 | |
| 2529 When there is no international support (i.e. MULE is not defined), | |
| 2530 this function simply does @code{int-to-char} and ignores the | |
| 2531 @var{charsets} argument. | |
| 2532 @end defun | |
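| | |
| The two conversion functions are meant to be inverses where a | |
| translation exists. A minimal sketch of a round trip, using only the | |
| functions described above: | |
| | |
| @example | |
| (let* ((code (character-to-unicode ?a)) | |
|        (char (unicode-to-character code))) | |
|   ;; Non-nil if the round trip gave back the original character. | |
|   (equal char ?a)) | |
| @end example | |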
| 2533 | |
| 2534 @defun parse-unicode-translation-table filename charset &optional start end offset flags | |
| 2535 Parse Unicode translation data in @var{filename} for MULE @var{charset}. | |
| 2536 Data is text, in the form of one translation per line -- charset | |
| 2537 codepoint followed by Unicode codepoint. Numbers are decimal or hex | |
| 2538 (preceded by 0x). Comments are marked with a #. Charset codepoints | |
| 2539 for two-dimensional charsets should have the first octet stored in the | |
| 2540 high 8 bits of the hex number and the second in the low 8 bits. | |
| 2541 | |
| 2542 If @var{start} and @var{end} are given, only charset codepoints within | |
| 2543 the given range will be processed. If @var{offset} is given, that value | |
| 2544 will be added to all charset codepoints in the file to obtain the | |
| 2545 internal charset codepoint. @var{start} and @var{end} apply to the | |
| 2546 codepoints in the file, before @var{offset} is applied. | |
| 2547 | |
| 2548 (Note that, as usual, we assume that octets are in the range 32 to | |
| 2549 127 or 33 to 126. If you have a table in kuten form, with octets in | |
| 2550 the range 1 to 94, you will have to use an offset of 8224, | |
| 2551 i.e. 0x2020.) | |
| 2552 | |
| 2553 @var{flags}, if specified, control further how the tables are interpreted | |
| 2554 and are used to special-case certain known table weirdnesses in the | |
| 2555 Unicode tables: | |
| 2556 | |
| 2557 @table @code | |
| 2558 @item ignore-first-column | |
| 2559 Exactly as it sounds. The JIS X 0208 tables have 3 columns of data instead | |
| 2560 of 2; the first is the Shift-JIS codepoint. | |
| 2561 | |
| 2562 @item big5 | |
| 2563 The charset codepoint is a Big Five codepoint; convert it to the | |
| 2564 proper hacked-up codepoint in @code{chinese-big5-1} or @code{chinese-big5-2}. | |
| 2565 @end table | |
| 2566 @end defun | |
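| | |
| To illustrate the @var{offset} argument, the sketch below loads a | |
| kuten-form table. The file name is invented for the example, | |
| @code{japanese-jisx0208} is assumed to be the target charset, and | |
| passing @code{nil} for @var{start} and @var{end} is assumed to mean | |
| that no range restriction applies. | |
| | |
| @example | |
| (parse-unicode-translation-table | |
|  "/path/to/jisx0208-kuten.txt"   ; hypothetical file name | |
|  'japanese-jisx0208 | |
|  nil nil                         ; assumed: no START/END restriction | |
|  #x2020)                         ; adds 32 to each of the two octets | |
| @end example | |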
| 2567 | |
| 1183 | 2568 |
| 2569 @node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE | |
| 2570 @section Character Set Unification | |
| 2571 | |
| 2572 Mule suffers from a design defect that causes it to consider the ISO | |
| 2573 Latin character sets to be disjoint. This results in oddities such as | |
| 2574 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO | |
| 2575 2022 control sequences to switch between them, as well as more plausible | |
| 2576 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2. | |
| 2577 This can be very annoying when sending messages or even in simple | |
| 2578 editing on a single host. Unification works around the problem by | |
| 2579 converting as many characters as possible to use a single Latin coded | |
| 2580 character set before saving the buffer. | |
| 2581 | |
| 2582 This node and its children were ripp'd untimely from | |
| 2583 @file{latin-unity.texi}, and have been quickly converted for use here. | |
| 2584 However as APIs are likely to diverge, beware of inaccuracies. Please | |
| 2585 report any you discover with @kbd{M-x report-xemacs-bug RET}, as well | |
| 2586 as any ambiguities or downright unintelligible passages. | |
| 2587 | |
| 2588 A lot of the stuff here doesn't belong here; it belongs in the | |
| 2589 @ref{Top, , , xemacs, XEmacs User's Manual}. Report those as bugs, | |
| 2590 too, preferably with patches. | |
| 2591 | |
| 2592 @menu | |
| 2593 * Overview:: Unification history and general information. | |
| 2594 * Usage:: An overview of the operation of Unification. | |
| 2595 * Configuration:: Configuring Unification for use. | |
| 2596 * Theory of Operation:: How Unification works. | |
| 2597 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. | |
| 2598 * Charsets and Coding Systems:: Reference lists with annotations. | |
| 1188 | 2599 * Unification Internals:: Utilities and implementation details. |
| 1183 | 2600 @end menu |
| 2601 | |
| 2602 @node Overview, Usage, Charset Unification, Charset Unification | |
| 2603 @subsection An Overview of Unification | |
| 2604 | |
| 2605 Mule suffers from a design defect that causes it to consider the ISO | |
| 2606 Latin character sets to be disjoint. This manifests itself when a user | |
| 2607 enters characters using input methods associated with different coded | |
| 2608 character sets into a single buffer. | |
| 2609 | |
| 2610 A very important example involves email. Many sites, especially in the | |
| 2611 U.S., default to use of the ISO 8859/1 coded character set (also called | |
| 2612 ``Latin 1,'' though these are somewhat different concepts). However, | |
| 2613 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the | |
| 2614 Euro has become the official currency of most countries in Europe, this | |
| 2615 is unsatisfactory (and in practice, useless). So Europeans generally | |
| 2616 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most | |
| 2617 languages, except that it substitutes EURO SIGN for CURRENCY SIGN. | |
| 2618 | |
| 2619 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
| 2620 into a message composition buffer, and enters some text including the | |
| 2621 Euro sign. Then Mule will consider the buffer to contain both ISO | |
| 2622 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
| 2623 programmed) send the message as a multipart mixed MIME body! | |
| 2624 | |
| 2625 This is clearly stupid. What is not as obvious is that, just as any | |
| 2626 European can include American English in their text because ASCII is a | |
| 2627 subset of ISO 8859/15, most European languages which use Latin | |
| 2628 characters (eg, German and Polish) can typically be mixed while using | |
| 2629 only one Latin coded character set (in this case, ISO 8859/2). However, | |
| 2630 this often depends on exactly what text is to be encoded. | |
| 2631 | |
| 2632 Unification works around the problem by converting as many characters as | |
| 2633 possible to use a single Latin coded character set before saving the | |
| 2634 buffer. | |
| 2635 | |
| 2636 @node Usage, Configuration, Overview, Charset Unification | |
| 2637 @subsection Operation of Unification | |
| 2638 | |
| 2639 Normally, Unification works in the background by installing | |
| 2640 @code{unity-sanity-check} on @code{write-region-pre-hook}. This is | |
| 2641 done by default for the ISO 8859 Latin family of character sets. The | |
| 2642 user activates this functionality for other character set families by | |
| 2643 invoking @code{enable-unification}, either interactively or in her | |
| 2644 init file. @xref{Init File, , , xemacs}. Unification can be | |
| 2645 deactivated by invoking @code{disable-unification}. | |
| 2646 | |
| 2647 Unification also provides a few functions for remapping or recoding the | |
| 2648 buffer by hand. To @dfn{remap} a character means to change the buffer | |
| 2649 representation of the character by using another coded character set. | |
| 2650 Remapping never changes the identity of the character, but may involve | |
| 2651 altering the code point of the character. To @dfn{recode} a character | |
| 2652 means to simply change the coded character set. Recoding never alters | |
| 2653 the code point of the character, but may change the identity of the | |
| 2654 character. @xref{Theory of Operation}. | |
| 2655 | |
| 2656 There are a few variables which determine which coding systems are | |
| 2657 always acceptable to Unification: @code{unity-ucs-list}, | |
| 2658 @code{unity-preferred-coding-system-list}, and | |
| 2659 @code{unity-preapproved-coding-system-list}. The latter two default | |
| 2660 to @code{()}, and should probably be avoided because they short-circuit | |
| 2661 the sanity check. If you find you need to use them, consider reporting | |
| 2662 it as a bug or request for enhancement. Because they seem unsafe, the | |
| 2663 recommended interface is likely to change. | |
| 2664 | |
| 2665 @menu | |
| 2666 * Basic Functionality:: User interface and customization. | |
| 2667 * Interactive Usage:: Treating text by hand. | |
| 2668 Also documents the hook function(s). | |
| 2669 @end menu | |
| 2670 | |
| 2671 | |
| 2672 @node Basic Functionality, Interactive Usage, , Usage | |
| 2673 @subsubsection Basic Functionality | |
| 2674 | |
| 2675 These functions and user options initialize and configure Unification. | |
| 2676 In normal use, none of these should be needed. | |
| 2677 | |
| 2678 @strong{These APIs are certain to change.} | |
| 2679 | |
| 2680 @defun enable-unification | |
| 2681 Set up hooks and initialize variables for latin-unity. | |
| 2682 | |
| 2683 There are no arguments. | |
| 2684 | |
| 2685 This function is idempotent. It will reinitialize any hooks or variables | |
| 2686 that are not in initial state. | |
| 2687 @end defun | |
| 2688 | |
| 2689 @defun disable-unification | |
| 2690 Clean up hooks and void variables used by latin-unity. | |
| 2691 | |
| 2692 There are no arguments. | |
| 2693 @end defun | |
| 2694 | |
| 2695 @defopt unity-ucs-list | |
| 2696 List of coding systems considered to be universal. | |
| 2697 | |
| 2698 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}. | |
| 2699 | |
| 2700 Order matters; coding systems earlier in the list will be preferred when | |
| 2701 recommending a coding system. These coding systems will not be used | |
| 2702 without querying the user (unless they are also present in | |
| 2703 @code{unity-preapproved-coding-system-list}), and follow the | |
| 2704 @code{unity-preferred-coding-system-list} in the list of suggested | |
| 2705 coding systems. | |
| 2706 | |
| 2707 If none of the preferred coding systems are feasible, the first in | |
| 2708 this list will be the default. | |
| 2709 | |
| 2710 Notes on certain coding systems: @code{escape-quoted} is a special | |
| 2711 coding system used for autosaves and compiled Lisp in Mule. You should | |
| 2712 @c #### fix in latin-unity.texi | |
| 2713 never delete this, although it is rare that a user would want to use it | |
| 2714 directly. Unification does not try to be ``smart'' about other general | |
| 2715 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized | |
| 2716 as equivalent to @code{iso-2022-7}.) If your preferred coding system is | |
| 2717 one of these, you may consider adding it to @code{unity-ucs-list}. | |
| 2718 However, this will typically have the side effect that (eg) ISO 8859/1 | |
| 2719 files will be saved in 7-bit form with ISO 2022 escape sequences. | |
| 2720 @end defopt | |
| 2721 | |
| 2722 Coding systems which are not Latin and not in | |
| 2723 @code{unity-ucs-list} are handled by short circuiting checks of | |
| 2724 coding system against the next two variables. | |
| 2725 | |
| 2726 @defopt unity-preapproved-coding-system-list | |
| 2727 List of coding systems used without querying the user if feasible. | |
| 2728 | |
| 2729 The default value is @samp{(buffer-default preferred)}. | |
| 2730 | |
| 2731 The first feasible coding system in this list is used. The special values | |
| 2732 @samp{preferred} and @samp{buffer-default} may be present: | |
| 2733 | |
| 2734 @table @code | |
| 2735 @item buffer-default | |
| 2736 Use the coding system used by @samp{write-region}, if feasible. | |
| 2737 | |
| 2738 @item preferred | |
| 2739 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
| 2740 @end table | |
| 2741 | |
| 2742 ``Feasible'' means that all characters in the buffer can be represented by | |
| 2743 the coding system. Coding systems in @samp{unity-ucs-list} are | |
| 2744 always considered feasible. Other feasible coding systems are computed | |
| 2745 by @samp{unity-representations-feasible-region}. | |
| 2746 | |
| 2747 Note that the first universal coding system in this list shadows all | |
| 2748 other coding systems. In particular, if your preferred coding system is | |
| 2749 a universal coding system, and @code{preferred} is a member of this | |
| 2750 list, unification will blithely convert all your files to that coding | |
| 2751 system. This is considered a feature, but it may surprise most users. | |
| 2752 Users who don't like this behavior should put @code{preferred} in | |
| 2753 @code{unity-preferred-coding-system-list}. | |
| 2754 @end defopt | |
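| | |
| The last suggestion might be written as follows in an init file. This | |
| is a sketch under the assumption that the variables are set after | |
| Unification has been loaded; the remaining list members are simply the | |
| documented default of @code{unity-preferred-coding-system-list}. | |
| | |
| @example | |
| ;; Never silently pick the user's preferred coding system ... | |
| (setq unity-preapproved-coding-system-list '(buffer-default)) | |
| ;; ... but offer it first when the user has to be asked. | |
| (setq unity-preferred-coding-system-list | |
|       '(preferred iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 | |
|         iso-8859-4 iso-8859-9)) | |
| @end example | |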
| 2755 | |
| 2756 @defopt unity-preferred-coding-system-list | |
| 2757 @c #### fix in latin-unity.texi | |
| 2758 List of coding systems suggested to the user if feasible. | |
| 2759 | |
| 2760 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 | |
| 2761 iso-8859-4 iso-8859-9)}. | |
| 2762 | |
| 2763 If none of the coding systems in | |
| 2764 @c #### fix in latin-unity.texi | |
| 2765 @code{unity-preapproved-coding-system-list} are feasible, this list | |
| 2766 will be recommended to the user, followed by the | |
| 2767 @code{unity-ucs-list}. The first coding system in this list is the default. The | |
| 2768 special values @samp{preferred} and @samp{buffer-default} may be | |
| 2769 present: | |
| 2770 | |
| 2771 @table @code | |
| 2772 @item buffer-default | |
| 2773 Use the coding system used by @samp{write-region}, if feasible. | |
| 2774 | |
| 2775 @item preferred | |
| 2776 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
| 2777 @end table | |
| 2778 | |
| 2779 ``Feasible'' means that all characters in the buffer can be represented by | |
| 2780 the coding system. Coding systems in @samp{unity-ucs-list} are | |
| 2781 always considered feasible. Other feasible coding systems are computed | |
| 2782 by @samp{unity-representations-feasible-region}. | |
| 2783 @end defopt | |
| 2784 | |
| 2785 | |
| 2786 @defvar unity-iso-8859-1-aliases | |
| 2787 List of coding systems to be treated as aliases of ISO 8859/1. | |
| 2788 | |
| 2789 The default value is @code{'(iso-8859-1)}. | |
| 2790 | |
| 2791 This is not a user variable; to customize input of coding systems or | |
| 2792 charsets, use @samp{unity-coding-system-alias-alist} or | |
| 2793 @samp{unity-charset-alias-alist}. | |
| 2794 @end defvar | |
| 2795 | |
| 2796 | |
| 2797 @node Interactive Usage, , Basic Functionality, Usage | |
| 2798 @subsubsection Interactive Usage | |
| 2799 | |
| 2800 First, the hook function @code{unity-sanity-check} is documented. | |
| 2801 (It is placed here because it is not an interactive function, and there | |
| 2802 is not yet a programmer's section of the manual.) | |
| 2803 | |
| 2804 These functions provide access to internal functionality (such as the | |
| 2805 remapping function) and to extra functionality (the recoding functions | |
| 2806 and the test function). | |
| 2807 | |
| 2808 | |
| 2809 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system | |
| 2810 | |
| 2811 Check if @var{coding-system} can represent all characters between | |
| 2812 @var{begin} and @var{end}. | |
| 2813 | |
| 2814 For compatibility with old broken versions of @code{write-region}, | |
| 2815 @var{coding-system} defaults to @code{buffer-file-coding-system}. | |
| 2816 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are | |
| 2817 ignored. | |
| 2818 | |
| 2819 Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible) | |
| 2820 Latin. If @code{buffer-file-coding-system} is safe for the charsets | |
| 2821 actually present in the buffer, return it. Otherwise, ask the user to | |
| 2822 choose a coding system, and return that. | |
| 2823 | |
| 2824 This function does @emph{not} do the safe thing when | |
| 2825 @code{buffer-file-coding-system} is @code{nil} (aka @code{no-conversion}). It | |
| 2826 considers that ``non-Latin,'' and passes it on to the Mule detection | |
| 2827 mechanism. | |
| 2828 | |
| 2829 This function is intended for use as a @code{write-region-pre-hook}. It | |
| 2830 does nothing except return @var{coding-system} if @code{write-region} | |
| 2831 handlers are inhibited. | |
| 2832 @end defun | |
| 2833 | |
| 2834 @defun unity-buffer-representations-feasible | |
| 2835 | |
| 2836 There are no arguments. | |
| 2837 | |
| 2838 Apply @code{unity-region-representations-feasible} to the current buffer. | |
| 2839 @end defun | |
| 2840 | |
| 2841 @defun unity-region-representations-feasible begin end &optional buf | |
| 2842 | |
| 2843 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}. | |
| 2844 | |
| 2845 @var{buf} defaults to the current buffer. Called interactively, will be | |
| 2846 applied to the region. Function assumes @var{begin} <= @var{end}. | |
| 2847 | |
| 2848 The return value is a cons. The car is the list of character sets | |
| 2849 that can individually represent all of the non-ASCII portion of the | |
| 2850 buffer, and the cdr is the list of character sets that can | |
| 2851 individually represent all of the ASCII portion. | |
| 2852 | |
| 2853 The following is taken from a comment in the source. Please refer to | |
| 2854 the source to be sure of an accurate description. | |
| 2855 | |
| 2856 The basic algorithm is to map over the region, compute the set of | |
| 2857 charsets that can represent each character (the ``feasible charset''), | |
| 2858 and take the intersection of those sets. | |
| 2859 | |
| 2860 The current implementation takes advantage of the fact that ASCII | |
| 2861 characters are common and cannot change asciisets. Then using | |
| 2862 skip-chars-forward makes motion over ASCII subregions very fast. | |
| 2863 | |
| 2864 This same strategy could be applied generally by precomputing classes | |
| 2865 of characters equivalent according to their effect on latinsets, and | |
| 2866 adding a whole class to the skip-chars-forward string once a member is | |
| 2867 found. | |
| 2868 | |
| 2869 Probably efficiency is a function of the number of characters matched, | |
| 2870 or maybe the length of the match string? With @code{skip-category-forward} | |
| 2871 over a precomputed category table it should be really fast. In practice | |
| 2872 for Latin character sets there are only 29 classes. | |
| 2873 @end defun | |
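| | |
| Because the return value is a cons, callers normally take it apart | |
| with @code{car} and @code{cdr}. A minimal sketch, applied to the whole | |
| current buffer (the variable names are chosen for the example only): | |
| | |
| @example | |
| (let* ((result (unity-region-representations-feasible | |
|                 (point-min) (point-max))) | |
|        ;; Charsets able to represent all non-ASCII text. | |
|        (non-ascii-sets (car result)) | |
|        ;; Charsets able to represent all ASCII text. | |
|        (ascii-sets (cdr result))) | |
|   (message "non-ASCII: %S; ASCII: %S" non-ascii-sets ascii-sets)) | |
| @end example | |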
| 2874 | |
| 2875 @defun unity-remap-region begin end character-set &optional coding-system | |
| 2876 | |
| 2877 Remap characters between @var{begin} and @var{end} to equivalents in | |
| 2878 @var{character-set}. Optional argument @var{coding-system} may be a | |
| 2879 coding system name (a symbol) or nil. Characters with no equivalent are | |
| 2880 left as-is. | |
| 2881 | |
| 2882 When called interactively, @var{begin} and @var{end} are set to the | |
| 2883 beginning and end, respectively, of the active region, and the function | |
| 2884 prompts for @var{character-set}. The function does completion, knows | |
| 2885 how to guess a character set name from a coding system name, and also | |
| 2886 provides some common aliases. See @code{unity-guess-charset}. | |
| 2887 There is no way to specify @var{coding-system}, as it has no useful | |
| 2888 function interactively. | |
| 2889 | |
| 2890 Return @var{coding-system} if @var{coding-system} can encode all | |
| 2891 characters in the region, t if @var{coding-system} is nil and the coding | |
| 2892 system with G0 = 'ascii and G1 = @var{character-set} can encode all | |
| 2893 characters, and otherwise nil. Note that a non-null return does | |
| 2894 @emph{not} mean it is safe to write the file, only the specified region. | |
| 2895 (This behavior is useful for multipart MIME encoding and the like.) | |
| 2896 | |
| 2897 Note: by default this function is quite fascist about universal coding | |
| 2898 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and | |
| 2899 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change | |
| 2900 this. | |
| 2901 | |
| 2902 This function remaps characters that are artificially distinguished by Mule | |
| 2903 internal code. It may change the code point as well as the character set. | |
| 2904 To recode characters that were decoded in the wrong coding system, use | |
| 2905 @code{unity-recode-region}. | |
| 2906 @end defun | |
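| | |
| From Lisp, the return value can be used to decide whether the region | |
| is now representable. The following sketch remaps the whole buffer to | |
| ISO 8859-15; it assumes the charset @code{latin-iso8859-15} is | |
| available, and omits @var{coding-system}. | |
| | |
| @example | |
| (if (unity-remap-region (point-min) (point-max) 'latin-iso8859-15) | |
|     (message "Every character fits ASCII plus latin-iso8859-15") | |
|   (message "Some characters have no equivalent; left as-is")) | |
| @end example | |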
| 2907 | |
| 2908 @defun unity-recode-region begin end wrong-cs right-cs | |
| 2909 | |
| 2910 Recode characters between @var{begin} and @var{end} from @var{wrong-cs} | |
| 2911 to @var{right-cs}. | |
| 2912 | |
| 2913 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain | |
| 2914 the same code point but the character set is changed. Only characters | |
| 2915 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the | |
| 2916 character may change. Note that this could be dangerous, if characters | |
| 2917 whose identities you do not want changed are included in the region. | |
| 2918 This function cannot guess which characters you want changed, and which | |
| 2919 should be left alone. | |
| 2920 | |
| 2921 When called interactively, @var{begin} and @var{end} are set to the | |
| 2922 beginning and end, respectively, of the active region, and the function | |
| 2923 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
| 2924 completion, knows how to guess a character set name from a coding system | |
| 2925 name, and also provides some common aliases. See | |
| 2926 @code{unity-guess-charset}. | |
| 2927 | |
| 2928 Another way to accomplish this, but using coding systems rather than | |
| 2929 character sets to specify the desired recoding, is | |
| 2930 @samp{unity-recode-coding-region}. That function may be faster | |
| 2931 but is somewhat more dangerous, because it may recode more than one | |
| 2932 character set. | |
| 2933 | |
| 2934 To change from one Mule representation to another without changing identity | |
| 2935 of any characters, use @samp{unity-remap-region}. | |
| 2936 @end defun | |
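| | |
| A typical case is text that was decoded as ISO 8859/1 but was really | |
| ISO 8859/15, so that a CURRENCY SIGN appears where a EURO SIGN was | |
| intended. A sketch of the corresponding call on the active region, | |
| assuming both charsets are available: | |
| | |
| @example | |
| ;; Same code points, different charset: characters whose meaning | |
| ;; differs between the two charsets change identity. | |
| (unity-recode-region (region-beginning) (region-end) | |
|                      'latin-iso8859-1 'latin-iso8859-15) | |
| @end example | |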
| 2937 | |
| 2938 @defun unity-recode-coding-region begin end wrong-cs right-cs | |
| 2939 | |
| 2940 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to | |
| 2941 @var{right-cs}. | |
| 2942 | |
| 2943 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain | |
| 2944 the same code point but the character set is changed. The identity of | |
| 2945 characters may change. This is an inherently dangerous function; | |
| 2946 multilingual text may be recoded in unexpected ways. #### It's also | |
| 2947 dangerous because the coding systems are not sanity-checked in the | |
| 2948 current implementation. | |
| 2949 | |
| 2950 When called interactively, @var{begin} and @var{end} are set to the | |
| 2951 beginning and end, respectively, of the active region, and the function | |
| 2952 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
| 2953 completion, knows how to guess a coding system name from a character set | |
| 2954 name, and also provides some common aliases. See | |
| 2955 @code{unity-guess-coding-system}. | |
| 2956 | |
| 2957 Another, safer, way to accomplish this, using character sets rather | |
| 2958 than coding systems to specify the desired recoding, is to use | |
| 2959 @c #### fixme in latin-unity.texi | |
| 2960 @code{unity-recode-region}. | |
| 2961 | |
| 2962 To change from one Mule representation to another without changing identity | |
| 2963 of any characters, use @code{unity-remap-region}. | |
| 2964 @end defun | |
| 2965 | |
| 2966 Helper functions for input of coding system and character set names. | |
| 2967 | |
| 2968 @defun unity-guess-charset candidate | |
| 2969 Guess a charset based on the symbol @var{candidate}. | |
| 2970 | |
| 2971 @var{candidate} itself is not tried as the value. | |
| 2972 | |
| 2973 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
| 2974 the values in @samp{unity-charset-alias-alist}. | |
| 2975 @end defun | |
| 2976 | |
| 2977 @defun unity-guess-coding-system candidate | |
| 2978 Guess a coding system based on the symbol @var{candidate}. | |
| 2979 | |
| 2980 @var{candidate} itself is not tried as the value. | |
| 2981 | |
| 2982 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
| 2983 the values in @samp{unity-coding-system-alias-alist}. | |
| 2984 @end defun | |
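
The two helpers can also be called directly from Lisp.  The following is
a minimal sketch, assuming the Unification library is already loaded.
The symbols passed in are only illustrations, and the return values
depend on the alists named in the descriptions of these functions and on
the charsets and coding systems defined in your XEmacs, so none are
shown.

@example
;; Assumes the Unification library is loaded.
;; Guess a charset from an alias symbol.
(unity-guess-charset 'latin-2)

;; Guess a coding system from a charset name.
(unity-guess-coding-system 'latin-iso8859-2)
@end example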
| 2985 | |
| 2986 @defun unity-example | |
| 2987 | |
| 2988 A cheesy example for Unification. | |
| 2989 | |
| 2990 At present it just makes a multilingual buffer. To test, setq | |
| 2991 buffer-file-coding-system to some value, make the buffer dirty (eg | |
| 2992 with RET BackSpace), and save. | |
| 2993 @end defun | |
| 2994 | |
| 2995 | |
| 2996 @node Configuration, Theory of Operation, Usage, Charset Unification | |
| 2997 @subsection Configuring Unification for Use | |
| 2998 | |
| 2999 If you want Unification to be automatically initialized, invoke | |
| 3000 @samp{enable-unification} with no arguments in your init file. | |
| 3001 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs | |
| 3002 earlier than 21.1, you should also load @file{auto-autoloads} using the | |
| 3003 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries). | |
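
A minimal init file fragment might look like the following sketch.  The
@code{fboundp} guard is an addition for illustration, not part of the
documented setup; it merely keeps the init file loadable when
Unification is not installed.

@example
;; Initialize Unification at startup, but only if the function is
;; defined (or autoloaded) in this XEmacs.
(if (fboundp 'enable-unification)
    (enable-unification))
@end example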
| 3004 | |
| 3005 You may wish to define aliases for commonly used character sets and | |
| 3006 coding systems for convenience in input. | |
| 3007 | |
| 3008 @defopt unity-charset-alias-alist | |
| 3009 Alist mapping aliases to Mule charset names (symbols). | |
| 3010 | |
| 3011 The default value is | |
| 3012 @example | |
| 3013 ((latin-1 . latin-iso8859-1) | |
| 3014 (latin-2 . latin-iso8859-2) | |
| 3015 (latin-3 . latin-iso8859-3) | |
| 3016 (latin-4 . latin-iso8859-4) | |
| 3017 (latin-5 . latin-iso8859-9) | |
| 3018 (latin-9 . latin-iso8859-15) | |
| 3019 (latin-10 . latin-iso8859-16)) | |
| 3020 @end example | |
| 3021 | |
| 3022 If a charset does not exist on your system, it will not complete and you | |
| 3023 will not be able to enter it in response to prompts. A real charset | |
| 3024 with the same name as an alias in this list will shadow the alias. | |
| 3025 @end defopt | |
| 3026 | |
| 3027 @defopt unity-coding-system-alias-alist | |
| 3028 Alist mapping aliases to Mule coding system names (symbols). | |
| 3029 | |
| 3030 The default value is @samp{nil}. | |
| 3031 @end defopt | |
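
As an illustration, aliases could be defined in the init file along the
following lines.  This is a sketch: the alias names @samp{l1} and
@samp{l2} are hypothetical, and the entries are assumed to take the same
@code{(@var{alias} . @var{name})} form as the default value of
@code{unity-charset-alias-alist} shown above.

@example
;; Hypothetical short aliases for two coding systems.
;; Each entry is assumed to be (ALIAS . CODING-SYSTEM-NAME).
(setq unity-coding-system-alias-alist
      '((l1 . iso-8859-1)
        (l2 . iso-8859-2)))
@end example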
| 3032 | |
| 3033 | |
| 3034 @node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification | |
| 3035 @subsection Theory of Operation | |
| 3036 | |
| 3037 Standard encodings suffer from the design defect that they do not | |
| 3038 provide a reliable way to recognize which coded character sets are in use. | |
| 3039 @xref{What Unification Cannot Do for You}. There are scores of | |
| 3040 character sets which can be represented by a single octet (8-bit byte), | |
| 3041 whose union contains many hundreds of characters. Obviously this | |
| 3042 results in great confusion, since you can't tell the players without a | |
| 3043 scorecard, and there is no scorecard. | |
| 3044 | |
| 3045 There are two ways to solve this problem. The first is to create a | |
| 3046 universal coded character set. This is the concept behind Unicode. | |
| 3047 However, there have been satisfactory (nearly) universal character sets | |
| 3048 for several decades, but even today many Westerners resist using Unicode | |
| 3049 because they consider its space requirements excessive. On the other | |
| 3050 hand, Asians dislike Unicode because they consider it to be incomplete. | |
| 3051 (This is partly, but not entirely, political.) | |
| 3052 | |
| 3053 In any case, Unicode only solves the internal representation problem. | |
| 3054 Many data sets will contain files in ``legacy'' encodings, and Unicode | |
| 3055 does not help distinguish among them. | |
| 3056 | |
| 3057 The second approach is to embed information about the encodings used in | |
| 3058 a document in its text. This approach is taken by the ISO 2022 | |
| 3059 standard. This would solve the problem completely from the users' point of | |
| 3060 view, except that ISO 2022 is basically not implemented at all, in the | |
| 3061 sense that few applications or systems implement more than a small | |
| 3062 subset of ISO 2022 functionality. This is due to the fact that | |
| 3063 mono-literate users object to the presence of escape sequences in their | |
| 3064 texts (which they, with some justification, consider data corruption). | |
| 3065 Programmers are more than willing to cater to these users, since | |
| 3066 implementing ISO 2022 is a painstaking task. | |
| 3067 | |
| 3068 In fact, Emacs/Mule adopts both of these approaches. Internally it uses | |
| 3069 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022 | |
| 3070 techniques both to save files in forms robust to encoding issues, and as | |
| 3071 hints when attempting to ``guess'' an unknown encoding. However, Mule | |
| 3072 suffers from a design defect, namely it embeds the character set | |
| 3073 information that ISO 2022 attaches to runs of characters by introducing | |
| 3074 them with a control sequence in each character. That causes Mule to | |
| 3075 consider the ISO Latin character sets to be disjoint. This manifests | |
| 3076 itself when a user enters characters using input methods associated with | |
| 3077 different coded character sets into a single buffer. | |
| 3078 | |
| 3079 There are two problems stemming from this design. First, Mule | |
| 1188 | 3080 represents the same character in different ways. Abstractly, 'ó' |
| 1183 | 3081 (LATIN SMALL LETTER O WITH ACUTE) can get represented as |
| 3082 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like | |
| 1188 | 3083 'óó' in the display might actually be represented [latin-iso8859-1 |
| 1183 | 3084 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B |
| 3085 #xF3 ESC - A] in the file. In some cases this treatment would be | |
| 3086 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 | |
| 3087 (the CJK ideographic character meaning ``one'')), and although arguably | |
| 3088 incorrect it is convenient when mixing the CJK scripts. But in the case | |
| 3089 of the Latin scripts this is wrong. | |
| 3090 | |
| 3091 Worse yet, it is very likely to occur when mixing ``different'' encodings | |
| 3092 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code | |
| 3093 points that are almost never used. A very important example involves | |
| 3094 email. Many sites, especially in the U.S., default to use of the ISO | |
| 3095 8859/1 coded character set (also called ``Latin 1,'' though these are | |
| 3096 somewhat different concepts). However, ISO 8859/1 provides a generic | |
| 3097 CURRENCY SIGN character. Now that the Euro has become the official | |
| 3098 currency of most countries in Europe, this is unsatisfactory (and in | |
| 3099 practice, useless). So Europeans generally use ISO 8859/15, which is | |
| 3100 nearly identical to ISO 8859/1 for most languages, except that it | |
| 3101 substitutes EURO SIGN for CURRENCY SIGN. | |
| 3102 | |
| 3103 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
| 3104 into a message composition buffer, and enters some text including the | |
| 3105 Euro sign. Then Mule will consider the buffer to contain both ISO | |
| 3106 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
| 3107 programmed) send the message as a multipart mixed MIME body! | |
| 3108 | |
| 3109 This is clearly stupid. What is not as obvious is that, just as any | |
| 3110 European can include American English in their text because ASCII is a | |
| 3111 subset of ISO 8859/15, most European languages which use Latin | |
| 3112 characters (eg, German and Polish) can typically be mixed while using | |
| 3113 only one Latin coded character set (in the case of German and Polish, | |
| 3114 ISO 8859/2). However, this often depends on exactly what text is to be | |
| 3115 encoded (even for the same pair of languages). | |
| 3116 | |
| 3117 Unification works around the problem by converting as many characters as | |
| 3118 possible to use a single Latin coded character set before saving the | |
| 3119 buffer. | |
| 3120 | |
|
| 3121 Because the problem is rarely noticeable in editing a buffer, but tends | |
| 1183 | 3122 to manifest when that buffer is exported to a file or process, the |
| 3123 Unification package uses the strategy of examining the buffer prior to | |
| 3124 export. If use of multiple Latin coded character sets is detected, | |
| 3125 Unification attempts to unify them by finding a single coded character | |
| 3126 set which contains all of the Latin characters in the buffer. | |
| 3127 | |
| 3128 The primary purpose of Unification is to fix the problem by giving the | |
| 3129 user the choice to change the representation of all characters to one | |
| 3130 character set and give sensible recommendations based on context. In | |
| 1188 | 3131 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and |
| 1183 | 3132 both will be suggested. In the EURO SIGN example, only ISO 8859/15 |
| 3133 makes sense, and that is what will be recommended. In both cases, the | |
| 3134 user will be reminded that there are universal encodings available. | |
| 3135 | |
| 3136 I call this @dfn{remapping} (from the universal character set to a | |
| 3137 particular ISO 8859 coded character set). It is mere accident that this | |
| 3138 letter has the same code point in both character sets. (Not entirely, | |
| 3139 but there are many examples of Latin characters that have different code | |
| 3140 points in different Latin-X sets.) | |
| 3141 | |
| 1188 | 3142 Note that, in the 'ó' example, treating the buffer in this way will |
| 1183 | 3143 result in a representation such as [latin-iso8859-2 |
| 3144 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. | |
| 3145 This is guaranteed to occasionally result in the second problem, | |
| 3146 to which we now turn. | |
| 3147 | |
| 3148 This problem is that, although the file is intended to be an | |
| 3149 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX | |
| 3150 compliant program---this is required by the standard, obvious if you | |
| 3151 think a bit, @pxref{What Unification Cannot Do for You}) will read that | |
| 3152 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this | |
| 3153 is no problem if all of the characters in the file are contained in ISO | |
| 3154 8859/1, but suppose there are some which are not, but are contained in | |
| 3155 the (intended) ISO 8859/2. | |
| 3156 | |
| 3157 You now want to fix this, but not by finding the same character in | |
| 3158 another set. Instead, you want to simply change the character set that | |
| 3159 Mule associates with that buffer position without changing the code. | |
| 3160 (This is conceptually somewhat distinct from the first problem, and | |
| 3161 logically ought to be handled in the code that defines coding systems. | |
| 3162 However, unification is not an unreasonable place for it.) Unification | |
| 3163 provides two functions (one fast and dangerous, the other slow and | |
| 3164 careful) to handle this. I call this @dfn{recoding}, because the | |
| 3165 transformation actually involves @emph{encoding} the buffer to file | |
| 3166 representation, then @emph{decoding} it to buffer representation (in a | |
| 3167 different character set). This cannot be done automatically because | |
| 3168 Mule can have no idea what the correct encoding is---after all, it | |
| 3169 already gave you its best guess. @xref{What Unification Cannot Do for | |
| 3170 You}. So these functions must be invoked by the user. @xref{Interactive | |
| 3171 Usage}. | |
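
The encode-then-decode round trip can be sketched on a string.  This
illustrates the idea only and is not the package's implementation.  The
octet #xB1 is PLUS-MINUS SIGN in ISO 8859/1 but LATIN SMALL LETTER A
WITH OGONEK in ISO 8859/2, while #xF3 is LATIN SMALL LETTER O WITH ACUTE
in both.  The variable names are hypothetical, and the sketch assumes
that both coding systems are defined in your XEmacs.

@example
(let* ((octets "\363\261")
       ;; What a reader in an ISO 8859/1 locale sees.
       (misread (decode-coding-string octets 'iso-8859-1))
       ;; Recode: encode back to the original octets, then decode
       ;; them in the intended coding system.
       (recoded (decode-coding-string
                 (encode-coding-string misread 'iso-8859-1)
                 'iso-8859-2)))
  (list misread recoded))
@end example

The code points never change in the round trip; only the character set
used to interpret them does, which is why the choice of target coding
system has to come from the user.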
| 3172 | |
| 3173 | |
| 3174 @node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification | |
| 3175 @subsection What Unification Cannot Do for You | |
| 3176 | |
| 3177 Unification @strong{cannot} save you if you insist on exporting data in | |
| 3178 8-bit encodings in a multilingual environment. @emph{You will | |
| 3179 eventually corrupt data if you do this.} It is not Mule's, or any | |
| 3180 application's, fault. You will have only yourself to blame; consider | |
| 3181 yourself warned. (It is true that Mule has bugs, which make Mule | |
| 3182 somewhat more dangerous and inconvenient than some naive applications. | |
| 3183 We're working to address those, but no application can remedy the | |
| 3184 inherent defect of 8-bit encodings.) | |
| 3185 | |
| 3186 Use standard universal encodings, preferably Unicode (UTF-8) unless | |
| 3187 applicable standards indicate otherwise. The most important such case | |
| 3188 is Internet messages, where MIME should be used, whether or not the | |
| 3189 subordinate encoding is a universal encoding. (Note that since one of | |
| 3190 the important provisions of MIME is the @samp{Content-Type} header, | |
| 3191 which has the charset parameter, MIME is to be considered a universal | |
| 3192 encoding for the purposes of this manual. Of course, technically | |
| 3193 speaking it's neither a coded character set nor a coding extension | |
| 3194 technique compliant with ISO 2022.) | |
| 3195 | |
| 3196 As mentioned earlier, the problem is that standard encodings suffer from | |
| 3197 the design defect that they do not provide a reliable way to recognize | |
| 3198 which coded character sets are in use. There are scores of character | |
| 3199 sets which can be represented by a single octet (8-bit byte), whose | |
| 3200 union contains many hundreds of characters. Thus any 8-bit coded | |
| 3201 character set must contain characters that share code points used for | |
| 3202 different characters in other coded character sets. | |
| 3203 | |
| 3204 This means that a given file's intended encoding cannot be identified | |
| 3205 with 100% reliability unless it contains encoding markers such as those | |
| 3206 provided by MIME or ISO 2022. | |
| 3207 | |
| 3208 Unification actually makes it more likely that you will have problems of | |
| 3209 this kind. Traditionally Mule has been ``helpful'' by simply using an | |
| 3210 ISO 2022 universal coding system when the current buffer coding system | |
| 3211 cannot handle all the characters in the buffer. This has the effect | |
| 3212 that, because the file contains control sequences, it is not recognized | |
| 3213 as being in the locale's normal 8-bit encoding. It may be annoying if | |
| 3214 you are not a Mule expert, but your data is automatically recoverable | |
| 3215 with a tool you already have: Mule. | |
| 3216 | |
| 3217 However, with unification, Mule converts to a single 8-bit character set | |
| 3218 when possible. But typically this will @emph{not} be in your usual | |
| 3219 locale. Ie, the times that an ISO 8859/1 user will need Unification are | |
| 3220 when there are ISO 8859/2 characters in the buffer. But then most | |
| 3221 likely the file will be saved in a pure 8-bit encoding that is not ISO | |
| 3222 8859/1, ie, ISO 8859/2. Mule's autorecognizer (which is probably the | |
| 3223 most sophisticated yet available) cannot tell the difference between ISO | |
| 3224 8859/1 and ISO 8859/2, and in a Western European locale will choose the | |
| 3225 former even though the latter was intended. Even the extension | |
| 3226 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at | |
| 3227 all accurate in the case of mixed codes. | |
| 3228 | |
| 3229 So now consider adding some additional ISO 8859/1 text to the buffer. | |
| 3230 If it includes any ISO 8859/1 codes that are used by different | |
| 3231 characters in ISO 8859/2, you now have a file that cannot be | |
| 3232 mechanically disentangled. You need a human being who can recognize | |
| 3233 that @emph{this is German and Swedish} and stays in Latin-1, while | |
| 3234 @emph{that is Polish} and needs to be recoded to Latin-2. | |
| 3235 | |
| 3236 Moral: switch to a universal coded character set, preferably Unicode | |
| 3237 using the UTF-8 transformation format. If you really need the space, | |
| 3238 compress your files. | |
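
For a single buffer this can be done as sketched below.  The sketch
assumes a coding system named @samp{utf-8} is defined in your XEmacs (in
some versions it is provided by a separate package), and checks with
@code{find-coding-system} before switching.

@example
;; Save the current buffer as UTF-8 if that coding system is defined.
(if (find-coding-system 'utf-8)
    (set-buffer-file-coding-system 'utf-8)
  (message "No utf-8 coding system available"))
@end example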
| 3239 | |
| 3240 | |
| 3241 @node Unification Internals, , What Unification Cannot Do for You, Charset Unification | |
| 3242 @subsection Internals | |
| 3243 | |
| 3244 No internals documentation yet. | |
| 3245 | |
| 3246 @file{unity-utils.el} provides one utility function. | |
| 3247 | |
| 3248 @defun unity-dump-tables | |
| 3249 | |
| 3250 Dump the temporary table created by loading @file{unity-utils.el} | |
| 3251 to @file{unity-tables.el}. Loading the latter file initializes | |
| 3252 @samp{unity-equivalences}. | |
| 3253 @end defun | |
| 3254 | |
| 3255 | |
| 3256 @node Charsets and Coding Systems, , Charset Unification, MULE | |
| 3257 @subsection Charsets and Coding Systems | |
| 3258 | |
| 3259 This section provides reference lists of Mule charsets and coding | |
| 3260 systems. Mule charsets are typically named by character set and | |
| 3261 standard. | |
| 3262 | |
| 3263 @table @strong | |
| 3264 @item ASCII variants | |
| 3265 | |
| 3266 Identification of equivalent characters in these sets is not properly | |
| 3267 implemented. Unification does not distinguish the two charsets. | |
| 3268 | |
| 3269 @samp{ascii} @samp{latin-jisx0201} | |
| 3270 | |
| 3271 @item Extended Latin | |
| 3272 | |
| 3273 Characters from the following ISO 2022 conformant charsets are | |
| 3274 identified with equivalents in other charsets in the group by | |
| 3275 Unification. | |
| 3276 | |
| 3277 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2} | |
| 3278 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9} | |
| 3279 @samp{latin-iso8859-13} @samp{latin-iso8859-16} | |
| 3280 | |
| 3281 The following charsets are Latin variants which are not understood by | |
| 3282 Unification. In addition, many of the Asian language standards provide | |
| 3283 ASCII, at least, and sometimes other Latin characters. None of these | |
| 3284 are identified with their ISO 8859 equivalents. | |
| 3285 | |
| 3286 @samp{vietnamese-viscii-lower} | |
| 3287 @samp{vietnamese-viscii-upper} | |
| 3288 | |
| 3289 @item Other character sets | |
| 3290 | |
| 3291 @samp{arabic-1-column} | |
| 3292 @samp{arabic-2-column} | |
| 3293 @samp{arabic-digit} | |
| 3294 @samp{arabic-iso8859-6} | |
| 3295 @samp{chinese-big5-1} | |
| 3296 @samp{chinese-big5-2} | |
| 3297 @samp{chinese-cns11643-1} | |
| 3298 @samp{chinese-cns11643-2} | |
| 3299 @samp{chinese-cns11643-3} | |
| 3300 @samp{chinese-cns11643-4} | |
| 3301 @samp{chinese-cns11643-5} | |
| 3302 @samp{chinese-cns11643-6} | |
| 3303 @samp{chinese-cns11643-7} | |
| 3304 @samp{chinese-gb2312} | |
| 3305 @samp{chinese-isoir165} | |
| 3306 @samp{cyrillic-iso8859-5} | |
| 3307 @samp{ethiopic} | |
| 3308 @samp{greek-iso8859-7} | |
| 3309 @samp{hebrew-iso8859-8} | |
| 3310 @samp{ipa} | |
| 3311 @samp{japanese-jisx0208} | |
| 3312 @samp{japanese-jisx0208-1978} | |
| 3313 @samp{japanese-jisx0212} | |
| 3314 @samp{katakana-jisx0201} | |
| 3315 @samp{korean-ksc5601} | |
| 3316 @samp{sisheng} | |
| 3317 @samp{thai-tis620} | |
| 3318 @samp{thai-xtis} | |
| 3319 | |
| 3320 @item Non-graphic charsets | |
| 3321 | |
| 3322 @samp{control-1} | |
| 3323 @end table | |
| 3324 | |
| 3325 @table @strong | |
| 3326 @item No conversion | |
| 3327 | |
| 3328 Some of these coding systems may specify EOL conventions. Note that | |
| 3329 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 | |
| 3330 coding system. Although unification attempts to compensate for this, it | |
| 3331 is possible that the @samp{iso-8859-1} coding system will behave | |
| 3332 differently from other ISO 8859 coding systems. | |
| 3333 | |
| 3334 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} | |
| 3335 | |
| 3336 @item Latin coding systems | |
| 3337 | |
| 3338 These coding systems are all single-byte, 8-bit ISO 2022 coding systems, | |
| 3339 combining ASCII in the GL register (bytes with high-bit clear) and an | |
| 3340 extended Latin character set in the GR register (bytes with high-bit set). | |
| 3341 | |
| 3342 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} | |
| 3343 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} | |
| 3344 | |
| 3345 These coding systems are single-byte, 8-bit coding systems that do not | |
| 3346 conform to international standards. They should be avoided in all | |
| 3347 potentially multilingual contexts, including any text distributed over | |
| 3348 the Internet and World Wide Web. | |
| 3349 | |
| 3350 @samp{windows-1251} | |
| 3351 | |
| 3352 @item Multilingual coding systems | |
| 3353 | |
| 3354 The following ISO-2022-based coding systems are useful for multilingual | |
| 3355 text. | |
| 3356 | |
| 3357 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} | |
| 3358 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} | |
| 3359 | |
| 3360 XEmacs also supports Unicode with the Mule-UCS package. These are the | |
| 3361 preferred coding systems for multilingual use. (There is a possible | |
| 3362 exception for texts that mix several Asian ideographic character sets.) | |
| 3363 | |
| 3364 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} | |
| 3365 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} | |
| 3366 @samp{utf-8} @samp{utf-8-ws} | |
| 3367 | |
| 3368 Development versions of XEmacs (the 21.5 series) support Unicode | |
| 3369 internally, with (at least) the following coding systems implemented: | |
| 3370 | |
| 3371 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} | |
| 3372 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} | |
| 3373 | |
| 3374 @item Asian ideographic languages | |
| 3375 | |
| 3376 The following coding systems are based on ISO 2022, and are more or less | |
| 3377 suitable for encoding multilingual texts. They all can represent ASCII | |
| 3378 at least, and sometimes several other foreign character sets, without | |
| 3379 resort to arbitrary ISO 2022 designations. However, these subsets are | |
| 3380 not identified with the corresponding national standards in XEmacs Mule. | |
| 3381 | |
| 3382 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} | |
| 3383 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} | |
| 3384 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} | |
| 3385 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} | |
| 3386 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} | |
| 3387 | |
| 3388 The following coding systems cannot be used for general multilingual | |
| 3389 text and do not cooperate well with other coding systems. | |
| 3390 | |
| 3391 @samp{big5} @samp{shift_jis} | |
| 3392 | |
| 3393 @item Other languages | |
| 3394 | |
| 3395 The following coding systems are based on ISO 2022. Though none of them | |
| 3396 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up | |
| 3397 to 21.4 defaults to) use of ISO 2022 control sequences to designate | |
| 3398 other character sets for inclusion in the text. | |
| 3399 | |
| 3400 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8} | |
| 3401 @samp{ctext-hebrew} | |
| 3402 | |
| 3403 The following coding systems do not conform to ISO 2022 and | |
| 3404 thus cannot be safely used in a multilingual context. | |
| 3405 | |
| 3406 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr} | |
| 3407 @samp{viscii} @samp{vscii} | |
| 3408 | |
| 3409 @item Special coding systems | |
| 3410 | |
| 3411 Mule uses the following coding systems for special purposes. | |
| 3412 | |
| 3413 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted} | |
| 3414 | |
| 3415 @samp{escape-quoted} is especially important, as it is used internally | |
| 3416 as the coding system for autosaved data. | |
| 3417 | |
| 3418 The following coding systems are aliases for others, and are used for | |
| 3419 communication with the host operating system. | |
| 3420 | |
| 3421 @samp{file-name} @samp{keyboard} @samp{terminal} | |
| 3422 | |
| 3423 @end table | |
| 3424 | |
| 3425 Mule detection of coding systems is actually limited to detection of | |
| 3426 classes of coding systems called @dfn{coding categories}. These coding | |
| 3427 categories are identified by the ISO 2022 control sequences they use, if | |
| 3428 any, by their conformance to ISO 2022 restrictions on code points that | |
| 3429 may be used, and by characteristic patterns of use of 8-bit code points. | |
| 3430 | |
| 3431 @samp{no-conversion} | |
| 3432 @samp{utf-8} | |
| 3433 @samp{ucs-4} | |
| 3434 @samp{iso-7} | |
| 3435 @samp{iso-lock-shift} | |
| 3436 @samp{iso-8-1} | |
| 3437 @samp{iso-8-2} | |
| 3438 @samp{iso-8-designate} | |
| 3439 @samp{shift-jis} | |
| 3440 @samp{big5} | |
| 3441 | |
| 3442 | |
| 3443 @c end of mule.texi | |
| 3444 |
