@c -*-texinfo-*-
@c This is part of the XEmacs Lisp Reference Manual.
@c Copyright (C) 1996 Ben Wing, 2001-2002 Free Software Foundation.
@c See the file lispref.texi for copying conditions.
@setfilename ../../info/internationalization.info
@node MULE, Tips, Internationalization, top
@chapter MULE

@dfn{MULE} is the name originally given to the version of GNU Emacs
extended for multi-lingual (and in particular Asian-language) support.
``MULE'' is short for ``MUlti-Lingual Emacs''.  It is an extension and
complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the
Japanese word for ``Japan''), which only provided support for Japanese.
XEmacs refers to its multi-lingual support as @dfn{MULE support} since
it is based on @dfn{MULE}.

@menu
* Internationalization Terminology::
                        Definition of various internationalization terms.
* Charsets::            Sets of related characters.
* MULE Characters::     Working with characters in XEmacs/MULE.
* Composite Characters::  Making new characters by overstriking other ones.
* Coding Systems::      Ways of representing a string of chars using integers.
* CCL::                 A special language for writing fast converters.
* Category Tables::     Subdividing charsets into groups.
* Unicode Support::     The universal coded character set.
* Charset Unification:: Handling overlapping character sets.
* Charsets and Coding Systems:: Tables and reference information.
@end menu

@node Internationalization Terminology, Charsets, , MULE
@section Internationalization Terminology

In internationalization terminology, a string of text is divided up
into @dfn{characters}, which are the printable units that make up the
text.  A single character is (for example) a capital @samp{A}, the
number @samp{2}, a Katakana character, a Hangul character, a Kanji
ideograph (an @dfn{ideograph} is a ``picture'' character, such as is
used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
are thousands of such ideographs in each language), etc.  The basic
property of a character is that it is the smallest unit of text with
semantic significance in text processing---i.e., characters are abstract
units defined by their meaning, not by their exact appearance.

Human beings normally process text visually, so to a first approximation
a character may be identified with its shape.  Note that the same
character may be drawn by two different people (or in two different
fonts) in slightly different ways, although the ``basic shape'' will be the
same.  But consider the works of Scott Kim; human beings can recognize
hugely variant shapes as the ``same'' character.  Sometimes, especially
where characters are extremely complicated to write, completely
different shapes may be defined as the ``same'' character in national
standards.  The Taiwanese variant of Hanzi is generally the most
complicated; over the centuries, the Japanese, Koreans, and the People's
Republic of China have adopted simplifications of the shape, but the
line of descent from the original shape is recorded, and the meanings
and pronunciation of different forms of the same character are
considered to be identical within each language.  (Of course, it may
take a specialist to recognize the related form; the point is that the
relations are standardized, despite the differing shapes.)

In some cases, the differences will be significant enough that it is
actually possible to identify two or more distinct shapes that both
represent the same character.  For example, the lowercase letters
@samp{a} and @samp{g} each have two distinct possible shapes---the
@samp{a} can optionally have a curved tail projecting off the top, and
the @samp{g} can be formed either of two loops, or of one loop and a
tail hanging off the bottom.  Such distinct possible shapes of a
character are called @dfn{glyphs}.  The important characteristic of two
glyphs making up the same character is that the choice between one or
the other is purely stylistic and has no linguistic effect on a word
(this is the reason why a capital @samp{A} and lowercase @samp{a}
are different characters rather than different glyphs---e.g.
@samp{Aspen} is a city while @samp{aspen} is a kind of tree).

Note that @dfn{character} and @dfn{glyph} are used differently
here than elsewhere in XEmacs.

A @dfn{character set} is essentially a set of related characters.  ASCII,
for example, is a set of 94 characters (or 128, if you count
non-printing characters).  Other character sets are ISO8859-1 (ASCII
plus various accented characters and other international symbols),
JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
GB2312 (Mainland Chinese Hanzi), etc.

The definition of a character set will implicitly or explicitly give
it an @dfn{ordering}, a way of assigning a number to each character in
the set.  For many character sets, there is a natural ordering, for
example the ``ABC'' ordering of the Roman letters.  But it is not clear
whether digits should come before or after the letters, and in fact
different European languages treat the ordering of accented characters
differently.  It is useful to use the natural order where available, of
course.  The number assigned to any particular character is called the
character's @dfn{code point}.  (Within a given character set, each
character has a unique code point.  Thus the word ``set'' is ill-chosen;
different orderings of the same characters are different character sets.
Identifying characters is simple enough for alphabetic character sets,
but the difference in ordering can cause great headaches when the same
thousands of characters are used by different cultures as in the Hanzi.)

It's important to understand that a character is defined not by any
number attached to it, but by its meaning.  For example, ASCII and
EBCDIC are two charsets containing exactly the same characters
(lowercase and uppercase letters, numbers 0 through 9, particular
punctuation marks) but with different numberings.  The @samp{comma}
character in ASCII and EBCDIC, for instance, is the same character
despite having a different numbering.  Conversely, when comparing ASCII
and JIS-Roman, which look the same except that the latter has a yen sign
substituted for the backslash, we would say that the backslash and yen
sign are @emph{not} the same characters, despite having the same number
(92) and despite the fact that all other characters are present in both
charsets, with the same numbering.  ASCII and JIS-Roman, then, do
@emph{not} have exactly the same characters in them (ASCII has a
backslash character but no yen-sign character, and vice-versa for
JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII
and JIS-Roman are closer.

Sometimes, a code point is not a single number, but instead a group of
numbers, called @dfn{position codes}.  In such cases, the number of
position codes required to index a particular character in a character
set is called the @dfn{dimension} of the character set.  Character sets
indexed by more than one position code typically use byte-sized position
codes.  Small character sets, e.g. ASCII, invariably use a single
position code, but for larger character sets, the choice of whether to
use multiple position codes or a single large (16-bit or 32-bit) number
is arbitrary.  Unicode typically uses a single large number, but
language-specific or ``national'' character sets often use multiple
(usually two) position codes.  For example, JIS X 0208, i.e. Japanese
Kanji, has thousands of characters, and is of dimension two -- every
character is indexed by two position codes, each in the range 1 through
94.  (This number ``94'' is not a coincidence; it is the same as the
number of printable characters in ASCII, and was chosen so that JIS
characters could be directly encoded using two printable ASCII
characters.)  Note that the choice of the range here is somewhat
arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc.
In fact, the range for JIS position codes (and for other character sets
modeled after it) is often given as range 33 through 126, so as to
directly match ASCII printing characters.

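As a rough illustration of the two indexing styles, the following sketch
converts between a single zero-based index and a pair of position codes
for a 94x94 character set.  The function names are hypothetical (they are
not part of XEmacs), and the zero-based index is an assumption made only
for this example; real standards such as JIS X 0208 also leave some
positions unassigned.

@example
;; Hypothetical helpers for illustration; not part of XEmacs.
(defun example-index-to-position-codes (index)
  "Split INDEX (0 through 8835) into two position codes, 1 through 94."
  (list (1+ (/ index 94)) (1+ (% index 94))))

(defun example-position-codes-to-index (code1 code2)
  "Combine position codes CODE1 and CODE2 into a single index."
  (+ (* (1- code1) 94) (1- code2)))

(example-index-to-position-codes 8835)
     @result{} (94 94)
(example-position-codes-to-index 4 2)
     @result{} 283
@end example
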
An @dfn{encoding} is a way of numerically representing characters from
one or more character sets as a stream of like-sized numerical values
called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or
32-bit quantities.  It's very important to clearly distinguish between
charsets and encodings.  For a simple charset like ASCII, there is only
one encoding normally used -- each character is represented by a single
byte, with the same value as its code point.  For more complicated
charsets, however, or when a single encoding needs to represent more
than one charset, things are not so obvious.  Unicode version 2, for
example, is a large charset with thousands of characters, each indexed
by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew
letter ``aleph''.  One obvious encoding (actually two encodings, depending
on which of the two possible byte orderings is chosen) simply uses two
bytes per character.  This encoding is convenient for internal
processing of Unicode text; however, it's incompatible with ASCII, and
thus external text (files, e-mail, etc.) that is encoded this way is
completely uninterpretable by programs lacking Unicode support.  For
this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is
usually used for external text.  UTF-8 represents Unicode characters
with one to three bytes (often extended to six bytes to handle
characters with up to 31-bit indices).  Unicode characters 00 to 7F
(identical with ASCII) are directly represented with one byte, and other
characters with two or more bytes, each in the range 80 to FF.
Applications that don't understand Unicode will still be able to process
ASCII characters represented in UTF-8-encoded text, and will typically
ignore (and hopefully preserve) the high-bit characters.

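The one-to-three-byte scheme can be sketched in a few lines of Lisp.
This is only an illustration for 16-bit code points; the function name
is hypothetical and is not part of XEmacs, and the sketch does no error
checking.

@example
;; Hypothetical helper for illustration; not part of XEmacs.
(defun example-utf-8-bytes (code)
  "Return the list of UTF-8 bytes for the 16-bit code point CODE."
  (cond ((< code #x80)                  ; 00 to 7F: one byte, as in ASCII
         (list code))
        ((< code #x800)                 ; two bytes, both 80 or above
         (list (logior #xC0 (lsh code -6))
               (logior #x80 (logand code #x3F))))
        (t                              ; three bytes, all 80 or above
         (list (logior #xE0 (lsh code -12))
               (logior #x80 (logand (lsh code -6) #x3F))
               (logior #x80 (logand code #x3F))))))

(example-utf-8-bytes #x41)              ; the letter A
     @result{} (65)
(example-utf-8-bytes #x05D0)            ; Hebrew aleph, bytes D7 90
     @result{} (215 144)
@end example
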
Naive use of code points is also not possible if more than one
character set is to be used in the encoding.  For example, printed
Japanese text typically requires characters from multiple character sets
-- ASCII, JIS X 0208, and JIS X 0212, to be specific.  Each of these is
indexed using one or more position codes in the range 1 through 94 (or
33 through 126), so the position codes could not be used directly or
there would be no way to tell which character was meant.  Different
Japanese encodings handle this differently -- JIS uses special escape
characters to denote different character sets; EUC sets the high bit of
the position codes for JIS X 0208 and JIS X 0212, and puts a special
extra byte before each JIS X 0212 character; etc.

The encodings described above are all 7-bit or 8-bit encodings.  The
fixed-width Unicode encoding previously described, however, is sometimes
considered to be a 16-bit encoding, in which case the issue of byte
ordering does not come up.  (Imagine, for example, that the text is
represented as an array of shorts.)  Similarly, Unicode version 3 (which
has characters with indices above 0xFFFF), and other very large
character sets, may be represented internally as 32-bit encodings,
i.e. arrays of ints.  However, it does not make too much sense to talk
about 16-bit or 32-bit encodings for external data, since nowadays 8-bit
data is a universal standard -- the closest you can get is fixed-width
encodings using two or four bytes to encode 16-bit or 32-bit values.  (A
``7-bit'' encoding is used when it cannot be guaranteed that the high bit
of 8-bit data will be correctly preserved.  Some e-mail gateways, for
example, strip the high bit of text passing through them.  These same
gateways often handle non-printable characters incorrectly, and so 7-bit
encodings usually avoid using bytes with such values.)

A general method of handling text using multiple character sets
(whether for multilingual text, or simply text in an extremely
complicated single language like Japanese) is defined in the
international standard ISO 2022.  ISO 2022 will be discussed in more
detail later (@pxref{ISO 2022}), but for now suffice it to say that text
needs control functions (at least spacing), and if escape sequences are
to be used, an escape sequence introducer.  It was decided to make all
text streams compatible with ASCII in the sense that the codes 0--31
(and 128--159) would always be control codes, never graphic characters,
and where defined by the character set the @samp{SPC} character would be
assigned code 32, and @samp{DEL} would be assigned 127.  Thus there are
94 code points remaining if 7 bits are used.  This is the reason that
most character sets are defined using position codes in the range 1
through 94.  Then ISO 2022 compatible encodings are produced by shifting
the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
codes are available) into character codes 161 to 254.

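The shift described above is a simple addition.  A minimal sketch, using
a hypothetical function name that is not part of XEmacs:

@example
;; Hypothetical helper for illustration; not part of XEmacs.
(defun example-position-code-to-character-code (code &optional eight-bit)
  "Shift position code CODE (1 through 94) into an ISO 2022 character code.
The result lies in 33 to 126, or in 161 to 254 if EIGHT-BIT is non-nil."
  (+ code (if eight-bit 160 32)))

(example-position-code-to-character-code 1)
     @result{} 33
(example-position-code-to-character-code 94 t)
     @result{} 254
@end example
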
Encodings are classified as either @dfn{modal} or @dfn{non-modal}.  In
a @dfn{modal encoding}, there are multiple states that the encoding can
be in, and the interpretation of the values in the stream depends on the
current global state of the encoding.  Special values in the encoding,
called @dfn{escape sequences}, are used to change the global state.
JIS, for example, is a modal encoding.  The bytes @samp{ESC $ B}
indicate that, from then on, bytes are to be interpreted as position
codes for JIS X 0208, rather than as ASCII.  This effect is cancelled
using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''.  To switch to JIS X 0212, the escape
sequence @samp{ESC $ ( D} is used.  (Note that here, as is common, the
escape sequences do in fact begin with @samp{ESC}.  This is not
necessarily the case, however.  Some encodings use control characters
called ``locking shifts'' (effect persists until cancelled) to switch
character sets.)

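The role of the global state can be made concrete with a small decoder.
The following is only a sketch under simplifying assumptions: it handles
just the three escape sequences named above, takes its input as a list
of integers, and does no error checking.  The function names are
hypothetical and are not part of XEmacs.

@example
;; Hypothetical sketch for illustration; not part of XEmacs.
(defun example-prefix-p (prefix list)
  "Return non-nil if LIST begins with the elements of PREFIX."
  (while (and prefix list (eq (car prefix) (car list)))
    (setq prefix (cdr prefix)
          list (cdr list)))
  (null prefix))

(defun example-decode-jis (bytes)
  "Decode BYTES, a list of integers, tracking the current charset.
Return a list of entries of the form (CHARSET CODE ...)."
  (let ((state 'ascii)                  ; the global state
        (result nil))
    (while bytes
      (cond ((example-prefix-p '(27 36 66) bytes)     ; ESC $ B
             (setq state 'japanese-jisx0208
                   bytes (nthcdr 3 bytes)))
            ((example-prefix-p '(27 40 66) bytes)     ; ESC ( B
             (setq state 'ascii
                   bytes (nthcdr 3 bytes)))
            ((example-prefix-p '(27 36 40 68) bytes)  ; ESC $ ( D
             (setq state 'japanese-jisx0212
                   bytes (nthcdr 4 bytes)))
            ((eq state 'ascii)          ; one byte per character
             (setq result (cons (list 'ascii (car bytes)) result)
                   bytes (cdr bytes)))
            (t                          ; two bytes per character
             (setq result (cons (list state (nth 0 bytes) (nth 1 bytes))
                                result)
                   bytes (nthcdr 2 bytes)))))
    (nreverse result)))

;; The byte 65 means the letter A only while the state is ASCII.
(example-decode-jis '(65 27 36 66 65 65 27 40 66 65))
     @result{} ((ascii 65) (japanese-jisx0208 65 65) (ascii 65))
@end example
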
A @dfn{non-modal encoding} has no global state that extends past the
character currently being interpreted.  EUC, for example, is a
non-modal encoding.  Characters in JIS X 0208 are encoded by setting
the high bit of the position codes, and characters in JIS X 0212 are
encoded by doing the same but also prefixing the character with the
byte 0x8F.

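The EUC rules just described can be sketched as follows.  The sketch
assumes that the position codes are already in the range 33 through 126
and covers only the three character sets mentioned; the function name is
hypothetical and is not part of XEmacs.

@example
;; Hypothetical helper for illustration; not part of XEmacs.
(defun example-euc-bytes (charset code1 &optional code2)
  "Return the EUC bytes for a character of CHARSET.
CODE1 and CODE2 are position codes in the range 33 through 126."
  (cond ((eq charset 'ascii)
         (list code1))
        ((eq charset 'japanese-jisx0208)
         ;; Set the high bit of both position codes.
         (list (logior code1 #x80) (logior code2 #x80)))
        ((eq charset 'japanese-jisx0212)
         ;; Same, but prefixed with the byte 0x8F.
         (list #x8F (logior code1 #x80) (logior code2 #x80)))
        (t
         (error "Charset not handled by this sketch: %s" charset))))

(example-euc-bytes 'japanese-jisx0208 36 34)
     @result{} (164 162)
(example-euc-bytes 'japanese-jisx0212 36 34)
     @result{} (143 164 162)
@end example
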
The advantage of a modal encoding is that it is generally more
space-efficient, and is easily extensible because essentially an
arbitrary number of escape sequences can be created.  The
disadvantage, however, is that it is much more difficult to work with
if it is not being processed in a sequential manner.  In the non-modal
EUC encoding, for example, the byte 0x41 always refers to the letter
@samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
one of the two position codes in a JIS X 0208 character, or one of the
two position codes in a JIS X 0212 character.  Determining exactly which
one is meant could be difficult and time-consuming if the previous
bytes in the string have not already been processed, or impossible if
they are drawn from an external stream that cannot be rewound.

Non-modal encodings are further divided into @dfn{fixed-width} and
@dfn{variable-width} formats.  A fixed-width encoding always uses
the same number of words per character, whereas a variable-width
encoding does not.  EUC is a good example of a variable-width
encoding: one to three bytes are used per character, depending on
the character set.  16-bit and 32-bit encodings are nearly always
fixed-width, and this is in fact one of the main reasons for using
an encoding with a larger word size.  The advantage of fixed-width
encodings is simplicity: the position of any character can be computed
directly from its index.  The advantages of variable-width
encodings are that they are generally more space-efficient and allow
for compatibility with existing 8-bit encodings such as ASCII.  (By
contrast, in a fixed-width 16-bit Unicode encoding, ASCII characters are
simply promoted to a 16-bit representation.  That means that every ASCII
character contains a @samp{NUL} byte; evidently all of the standard
string manipulation functions will lose badly in a fixed-width Unicode
environment.)

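The @samp{NUL} byte problem, and the two possible byte orderings of the
fixed-width 16-bit encoding, can be seen in a short sketch.  The
function name is hypothetical and is not part of XEmacs.

@example
;; Hypothetical helper for illustration; not part of XEmacs.
(defun example-16-bit-bytes (code &optional little-endian)
  "Return the two bytes of the 16-bit value CODE as a list.
The high byte comes first unless LITTLE-ENDIAN is non-nil."
  (let ((high (lsh code -8))
        (low (logand code #xFF)))
    (if little-endian
        (list low high)
      (list high low))))

(example-16-bit-bytes 65)               ; the letter A gains a NUL byte
     @result{} (0 65)
(example-16-bit-bytes 65 t)
     @result{} (65 0)
(example-16-bit-bytes #x05D0)           ; Hebrew aleph
     @result{} (5 208)
@end example
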
The bytes in an 8-bit encoding are often referred to as @dfn{octets}
rather than simply as bytes.  This terminology dates back to the days
before 8-bit bytes were universal, when some computers had 9-bit bytes,
others had 10-bit bytes, etc.

@node Charsets, MULE Characters, Internationalization Terminology, MULE
@section Charsets

A @dfn{charset} in MULE is an object that encapsulates a
particular character set as well as an ordering of those characters.
Charsets are permanent objects and are named using symbols, like
faces.

@defun charsetp object
This function returns non-@code{nil} if @var{object} is a charset.
@end defun

@menu
* Charset Properties::          Properties of a charset.
* Basic Charset Functions::     Functions for working with charsets.
* Charset Property Functions::  Functions for accessing charset properties.
* Predefined Charsets::         Predefined charset objects.
@end menu

@node Charset Properties, Basic Charset Functions, , Charsets
@subsection Charset Properties

Charsets have the following properties:

@table @code
@item name
A symbol naming the charset.  Every charset must have a different name;
this allows a charset to be referred to using its name rather than
the actual charset object.
@item doc-string
A documentation string describing the charset.
@item registry
A regular expression matching the font registry field for this character
set.  For example, both the @code{ascii} and @code{latin-iso8859-1}
charsets use the registry @code{"ISO8859-1"}.  This field is used to
choose an appropriate font when the user gives a general font
specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
14-point upright medium-weight Courier font.
@item dimension
Number of position codes used to index a character in the character set.
XEmacs/MULE can only handle character sets of dimension 1 or 2.
This property defaults to 1.
@item chars
Number of characters in each dimension.  In XEmacs/MULE, the only
allowed values are 94 or 96.  (There are a couple of pre-defined
character sets, such as ASCII, that do not follow this, but you cannot
define new ones like this.)  Defaults to 94.  Note that if the dimension
is 2, the character set thus described is 94x94 or 96x96.
@item columns
Number of columns used to display a character in this charset.
Only used in TTY mode.  (Under X, the actual width of a character
can be derived from the font used to display the characters.)
If unspecified, defaults to the dimension.  (This is almost
always the correct value, because character sets with dimension 2
are usually ideograph character sets, which need two columns to
display the intricate ideographs.)
@item direction
A symbol, either @code{l2r} (left-to-right) or @code{r2l}
(right-to-left).  Defaults to @code{l2r}.  This specifies the
direction that the text should be displayed in, and will be
left-to-right for most charsets but right-to-left for Hebrew
and Arabic.  (Right-to-left display is not currently implemented.)
@item final
Final byte of the standard ISO 2022 escape sequence designating this
charset.  Must be supplied.  Each combination of (@var{dimension},
@var{chars}) defines a separate namespace for final bytes, and each
charset within a particular namespace must have a different final byte.
Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
dimension == 1, and 0x30 - 0x5F if dimension == 2.  Note also that final
bytes in the range 0x30 - 0x3F are reserved for user-defined (not
official) character sets.  For more information on ISO 2022, see @ref{Coding
Systems}.
@item graphic
0 (use left half of font on output) or 1 (use right half of font on
output).  Defaults to 0.  This specifies how to convert the position
codes that index a character in a character set into an index into the
font used to display the character set.  With @code{graphic} set to 0,
position codes 33 through 126 map to font indices 33 through 126; with
it set to 1, position codes 33 through 126 map to font indices 161
through 254 (i.e. the same number but with the high bit set).  For
example, for a font whose registry is ISO8859-1, the left half of the
font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
@item ccl-program
A compiled CCL program used to convert a character in this charset into
an index into the font.  This is in addition to the @code{graphic}
property.  If a CCL program is defined, the position codes of a
character will first be processed according to @code{graphic} and
then passed through the CCL program, with the resulting values used
to index the font.

This is used, for example, in the Big5 character set (used in Taiwan).
This character set is not ISO-2022-compliant, and its size (94x157) does
not fit within the maximum 96x96 size of ISO-2022-compliant character
sets.  As a result, XEmacs/MULE splits it (in a rather complex fashion,
so as to group the most commonly used characters together) into two
charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each
of size 94x94, and each charset object uses a CCL program to convert the
modified position codes back into standard Big5 indices to retrieve a
character from a Big5 font.
@end table

Most of the above properties can only be set when the charset is
initialized, and cannot be changed later.
@xref{Charset Property Functions}.

@node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets
@subsection Basic Charset Functions

@defun find-charset charset-or-name
This function retrieves the charset of the given name.  If
@var{charset-or-name} is a charset object, it is simply returned.
Otherwise, @var{charset-or-name} should be a symbol.  If there is no
such charset, @code{nil} is returned.  Otherwise the associated charset
object is returned.
@end defun

@defun get-charset name
This function retrieves the charset of the given name.  Same as
@code{find-charset} except an error is signalled if there is no such
charset instead of returning @code{nil}.
@end defun

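The difference between the two functions shows up only when the charset
does not exist.  In the following sketch, @code{no-such-charset} is a
hypothetical name that is assumed not to be defined; the comments
restate the descriptions above and are not output captured from a
running XEmacs.

@example
;; `no-such-charset' is a hypothetical name, assumed to be undefined.
(find-charset 'ascii)              ; the charset object for ASCII
(find-charset 'no-such-charset)    ; nil, no error

(condition-case err
    (get-charset 'no-such-charset) ; signals an error instead
  (error (message "get-charset signalled: %S" err)))
@end example
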
@defun charset-list
This function returns a list of the names of all defined charsets.
@end defun

@defun make-charset name doc-string props
This function defines a new character set.  This function is for use
with MULE support.  @var{name} is a symbol, the name by which the
character set is normally referred to.  @var{doc-string} is a string
describing the character set.  @var{props} is a property list,
describing the specific nature of the character set.  The recognized
properties are @code{registry}, @code{dimension}, @code{columns},
@code{chars}, @code{final}, @code{graphic}, @code{direction}, and
@code{ccl-program}, as previously described.
@end defun

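A call might look like the following sketch.  Everything specific in it
is an assumption made for illustration: the name @code{example-charset},
the registry @code{"Example-0"}, and the final byte @samp{9}, which lies
in the range reserved for user-defined character sets and is assumed not
to be taken by another charset of the same dimension and size.  The
property list is assumed to alternate property names and values.

@example
;; Hypothetical charset for illustration only; all values are assumptions.
(make-charset 'example-charset
              "A hypothetical 94-character set, for illustration only."
              '(registry "Example-0"
                dimension 1
                chars 94
                columns 1
                direction l2r
                final ?9
                graphic 0))
@end example
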
@defun make-reverse-direction-charset charset new-name
This function makes a charset equivalent to @var{charset} but which goes
in the opposite direction.  @var{new-name} is the name of the new
charset.  The new charset is returned.
@end defun

@defun charset-from-attributes dimension chars final &optional direction
This function returns a charset with the given @var{dimension},
@var{chars}, @var{final}, and @var{direction}.  If @var{direction} is
omitted, both directions will be checked (left-to-right will be returned
if character sets exist for both directions).
@end defun

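For instance, a lookup by ISO 2022 attributes might be written as in the
sketch below.  It assumes that @var{final} is given as a character; going
by the table of predefined charsets (@pxref{Predefined Charsets}), a
dimension of 2, 94 characters per dimension and final byte @samp{B}
should identify @code{japanese-jisx0208}, but this has not been checked
against a running XEmacs.

@example
;; Look up a charset by dimension, size and final byte,
;; then ask for its name.
(charset-name (charset-from-attributes 2 94 ?B))
@end example
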
@defun charset-reverse-direction-charset charset
This function returns the charset (if any) with the same dimension,
number of characters, and final byte as @var{charset}, but which is
displayed in the opposite direction.
@end defun

@node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets
@subsection Charset Property Functions

All of these functions accept either a charset name or charset object.

@defun charset-property charset prop
This function returns property @var{prop} of @var{charset}.
@xref{Charset Properties}.
@end defun

Convenience functions are also provided for retrieving individual
properties of a charset.

@defun charset-name charset
This function returns the name of @var{charset}.  This will be a symbol.
@end defun

@defun charset-description charset
This function returns the documentation string of @var{charset}.
@end defun

@defun charset-registry charset
This function returns the registry of @var{charset}.
@end defun

@defun charset-dimension charset
This function returns the dimension of @var{charset}.
@end defun

@defun charset-chars charset
This function returns the number of characters per dimension of
@var{charset}.
@end defun

@defun charset-width charset
This function returns the number of display columns per character (in
TTY mode) of @var{charset}.
@end defun

@defun charset-direction charset
This function returns the display direction of @var{charset}---either
@code{l2r} or @code{r2l}.
@end defun

@defun charset-iso-final-char charset
This function returns the final byte of the ISO 2022 escape sequence
designating @var{charset}.
@end defun

@defun charset-iso-graphic-plane charset
This function returns either 0 or 1, depending on whether the position
codes of characters in @var{charset} map to the left or right half
of their font, respectively.
@end defun

@defun charset-ccl-program charset
This function returns the CCL program, if any, for converting
position codes of characters in @var{charset} into font indices.
@end defun

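The accessors can be tried on the predefined charsets.  The comments in
the sketch below give the values implied by the table of predefined
charsets (@pxref{Predefined Charsets}); they are not output captured
from a running XEmacs.

@example
;; Expected values are taken from the table of predefined charsets.
(charset-dimension 'japanese-jisx0208)          ; 2 per the table
(charset-chars 'japanese-jisx0208)              ; 94 per the table
(charset-direction 'hebrew-iso8859-8)           ; r2l per the table
(charset-iso-graphic-plane 'latin-iso8859-2)    ; 1 per the table

;; The general accessor takes the property name as a symbol.
(charset-property 'japanese-jisx0208 'dimension)
@end example
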
The two properties of a charset that can currently be set after the
charset has been created are the CCL program and the font registry.

@defun set-charset-ccl-program charset ccl-program
This function sets the @code{ccl-program} property of @var{charset} to
@var{ccl-program}.
@end defun

@defun set-charset-registry charset registry
This function sets the @code{registry} property of @var{charset} to
@var{registry}.
@end defun

@node Predefined Charsets, , Charset Property Functions, Charsets
@subsection Predefined Charsets

The following charsets are predefined in the C code.

@example
Name                     Type  Fi Gr Dir Registry
--------------------------------------------------------------
ascii                    94    B  0  l2r ISO8859-1
control-1                94       0  l2r ---
latin-iso8859-1          96    A  1  l2r ISO8859-1
latin-iso8859-2          96    B  1  l2r ISO8859-2
latin-iso8859-3          96    C  1  l2r ISO8859-3
latin-iso8859-4          96    D  1  l2r ISO8859-4
cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
arabic-iso8859-6         96    G  1  r2l ISO8859-6
greek-iso8859-7          96    F  1  l2r ISO8859-7
hebrew-iso8859-8         96    H  1  r2l ISO8859-8
latin-iso8859-9          96    M  1  l2r ISO8859-9
thai-tis620              96    T  1  l2r TIS620
katakana-jisx0201        94    I  1  l2r JISX0201.1976
latin-jisx0201           94    J  0  l2r JISX0201.1976
japanese-jisx0208-1978   94x94 @@  0  l2r JISX0208.1978
japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
japanese-jisx0212        94x94 D  0  l2r JISX0212
chinese-gb2312           94x94 A  0  l2r GB2312
chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
chinese-big5-1           94x94 0  0  l2r Big5
chinese-big5-2           94x94 1  0  l2r Big5
korean-ksc5601           94x94 C  0  l2r KSC5601
composite                96x96    0  l2r ---
@end example

The following charsets are predefined in the Lisp code.

@example
Name                     Type  Fi Gr Dir Registry
--------------------------------------------------------------
arabic-digit             94    2  0  l2r MuleArabic-0
arabic-1-column          94    3  0  r2l MuleArabic-1
arabic-2-column          94    4  0  r2l MuleArabic-2
sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
ethiopic                 94x94 2  0  l2r Ethio
ascii-r2l                94    B  0  r2l ISO8859-1
ipa                      96    0  1  l2r MuleIPA
vietnamese-viscii-lower  96    1  1  l2r VISCII1.1
vietnamese-viscii-upper  96    2  1  l2r VISCII1.1
@end example

For all of the above charsets, the dimension and number of columns are
the same.

Note that ASCII, Control-1, and Composite are handled specially.
This is why some of the fields are blank; and some of the filled-in
fields (e.g. the type) are not really accurate.

562 @node MULE Characters, Composite Characters, Charsets, MULE
|
428
|
563 @section MULE Characters
|
|
564
|
|
565 @defun make-char charset arg1 &optional arg2
|
|
566 This function makes a multi-byte character from @var{charset} and octets
|
|
567 @var{arg1} and @var{arg2}.
|
|
568 @end defun
|
|
569
|
444
|
570 @defun char-charset character
|
|
571 This function returns the character set of char @var{character}.
|
428
|
572 @end defun
|
|
573
|
444
|
574 @defun char-octet character &optional n
|
428
|
575 This function returns the octet (i.e. position code) numbered @var{n}
|
444
|
576 (should be 0 or 1) of char @var{character}. @var{n} defaults to 0 if omitted.
|
428
|
577 @end defun
|
|
578
|
|
579 @defun find-charset-region start end &optional buffer
|
|
580 This function returns a list of the charsets in the region between
|
|
581 @var{start} and @var{end}. @var{buffer} defaults to the current buffer
|
|
582 if omitted.
|
|
583 @end defun
|
|
584
|
|
585 @defun find-charset-string string
|
|
586 This function returns a list of the charsets in @var{string}.
|
|
587 @end defun
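
The functions in this section can be combined to build a character and
take it apart again.  The following is only a sketch: it assumes a
MULE-enabled XEmacs, the variable name @code{my-char} is hypothetical,
and the position codes 36 and 34 are an illustrative choice rather than
values taken from this manual.

@example
;; Build a character in the two-dimensional charset japanese-jisx0208
;; from two position codes (octets).
(setq my-char (make-char 'japanese-jisx0208 36 34))

;; Recover the charset and each octet of the character.
(char-charset my-char)
(char-octet my-char 0)
(char-octet my-char 1)

;; List the charsets that occur in a string containing the character.
(find-charset-string (concat "abc" (char-to-string my-char)))
@end example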
|
|
588
|
442
|
589 @node Composite Characters, Coding Systems, MULE Characters, MULE
|
428
|
590 @section Composite Characters
|
|
591
|
442
|
592 Composite characters are not yet completely implemented.
|
428
|
593
|
|
594 @defun make-composite-char string
|
|
595 This function converts a string into a single composite character. The
|
|
596 character is the result of overstriking all the characters in the
|
|
597 string.
|
|
598 @end defun
|
|
599
|
444
|
600 @defun composite-char-string character
|
428
|
601 This function returns a string of the characters comprising a composite
|
|
602 character.
|
|
603 @end defun
|
|
604
|
|
605 @defun compose-region start end &optional buffer
|
|
606 This function composes the characters in the region from @var{start} to
|
|
607 @var{end} in @var{buffer} into one composite character. The composite
|
|
608 character replaces the composed characters. @var{buffer} defaults to
|
|
609 the current buffer if omitted.
|
|
610 @end defun
|
|
611
|
|
612 @defun decompose-region start end &optional buffer
|
|
613 This function decomposes any composite characters in the region from
|
|
614 @var{start} to @var{end} in @var{buffer}. This converts each composite
|
|
615 character into one or more characters, the individual characters out of
|
|
616 which the composite character was formed. Non-composite characters are
|
|
617 left as-is. @var{buffer} defaults to the current buffer if omitted.
|
|
618 @end defun
|
|
619
|
442
|
620 @node Coding Systems, CCL, Composite Characters, MULE
|
|
621 @section Coding Systems
|
|
622
|
|
623 A coding system is an object that defines how text containing multiple
|
|
624 character sets is encoded into a stream of (typically 8-bit) bytes. The
|
|
625 coding system is used to decode the stream into a series of characters
|
|
626 (which may be from multiple charsets) when the text is read from a file
|
|
627 or process, and is used to encode the text back into the same format
|
|
628 when it is written out to a file or process.
|
|
629
|
|
630 For example, many ISO-2022-compliant coding systems (such as Compound
|
|
631 Text, which is used for inter-client data under the X Window System) use
|
|
632 escape sequences to switch between different charsets -- Japanese Kanji,
|
|
633 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
|
|
634 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
|
|
635 @code{make-coding-system} for more information.
|
|
636
|
|
637 Coding systems are normally identified using a symbol, and the symbol is
|
|
638 accepted in place of the actual coding system object whenever a coding
|
|
639 system is called for. (This is similar to how faces and charsets work.)
|
|
640
|
|
641 @defun coding-system-p object
|
|
642 This function returns non-@code{nil} if @var{object} is a coding system.
|
|
643 @end defun
|
428
|
644
|
442
|
645 @menu
|
|
646 * Coding System Types:: Classifying coding systems.
|
|
647 * ISO 2022:: An international standard for
|
|
648 charsets and encodings.
|
|
649 * EOL Conversion:: Dealing with different ways of denoting
|
|
650 the end of a line.
|
|
651 * Coding System Properties:: Properties of a coding system.
|
|
652 * Basic Coding System Functions:: Working with coding systems.
|
|
653 * Coding System Property Functions:: Retrieving a coding system's properties.
|
|
654 * Encoding and Decoding Text:: Encoding and decoding text.
|
|
655 * Detection of Textual Encoding:: Determining how text is encoded.
|
|
656 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
|
|
657 encodings.
|
|
658 * Predefined Coding Systems:: Coding systems implemented by MULE.
|
|
659 @end menu
|
428
|
660
|
442
|
661 @node Coding System Types, ISO 2022, , Coding Systems
|
|
662 @subsection Coding System Types
|
|
663
|
|
664 The coding system type determines the basic algorithm XEmacs will use to
|
|
665 decode or encode a data stream. Character encodings will be converted
|
|
666 to the MULE encoding, escape sequences processed, and newline sequences
|
|
667 converted to XEmacs's internal representation. There are three basic
|
|
668 classes of coding system type: no-conversion, ISO-2022, and special.
|
|
669
|
|
670 No conversion allows you to look at the file's internal representation.
|
|
671 Since XEmacs is basically a text editor, ``no conversion'' does convert
|
|
672 newline conventions by default. (Use the @code{binary} coding system if this
|
|
673 is not desired.)
|
428
|
674
|
442
|
675 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating
|
|
676 use of ``coded character sets for the exchange of data'', i.e., text
|
|
677 streams. ISO 2022 contains functions that make it possible to encode
|
|
678 text streams to comply with restrictions of the Internet mail system and
|
|
679 de facto restrictions of most file systems (e.g., use of the separator
|
|
680 character in file names). Coding systems which are not ISO 2022
|
|
681 conformant can be difficult to handle. Perhaps more important, they are
|
|
682 not adaptable to multilingual information interchange, with the obvious
|
|
683 exception of ISO 10646 (Unicode). (Unicode is partially supported by
|
|
684 XEmacs with the addition of the Lisp package ucs-conv.)
|
|
685
|
|
686 The special class of coding systems includes automatic detection, CCL (a
|
|
687 "little language" embedded as an interpreter, useful for translating
|
|
688 between variants of a single character set), non-ISO-2022-conformant
|
|
689 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding.
|
|
690 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly
|
|
691 for other versions of XEmacs and for GNU Emacs 20.)
|
|
692
|
|
693 @table @code
|
|
694 @item no-conversion
|
|
695 No conversion, for binary files, and a few special cases of non-ISO-2022
|
|
696 coding systems where conversion is done by hook functions (usually
|
|
697 implemented in CCL). On output, graphic characters that are not in
|
|
698 ASCII or Latin-1 will be replaced by a @samp{?}. (For a
|
|
699 no-conversion-encoded buffer, these characters will only be present if
|
|
700 you explicitly insert them.)
|
|
701 @item iso2022
|
|
702 Any ISO-2022-compliant encoding. Among others, this includes JIS (the
|
|
703 Japanese encoding commonly used for e-mail), national variants of EUC
|
|
704 (the standard Unix encoding for Japanese and other languages), and
|
|
705 Compound Text (an encoding used in X11). You can specify more specific
|
|
706 information about the conversion with the @var{flags} argument.
|
|
707 @item ucs-4
|
|
708 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode.
|
|
709 @item utf-8
|
|
710 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format
|
|
711 that can be used with both UCS-4 and Unicode.
|
|
712 @item undecided
|
|
713 Automatic conversion. XEmacs attempts to detect the coding system used
|
|
714 in the file.
|
|
715 @item shift-jis
|
|
716 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
|
|
717 @item big5
|
|
718 Big5 (an encoding commonly used for Traditional Chinese, especially in Taiwan).
|
|
719 @item ccl
|
|
720 The conversion is performed using a user-written pseudo-code program.
|
|
721 CCL (Code Conversion Language) is the name of this pseudo-code. For
|
|
722 example, CCL is used to map KOI8-R characters (an encoding for Russian
|
|
723 Cyrillic) to ISO8859-5 (the form used internally by MULE).
|
|
724 @item internal
|
|
725 Write out or read in the raw contents of the memory representing the
|
|
726 buffer's text. This is primarily useful for debugging purposes, and is
|
|
727 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
|
|
728 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
|
|
729 file using @code{internal} conversion can result in an internal
|
|
730 inconsistency in the memory representing a buffer's text, which will
|
|
731 produce unpredictable results and may cause XEmacs to crash. Under
|
|
732 normal circumstances you should never use @code{internal} conversion.
|
428
|
733 @end table
|
|
734
|
442
|
735 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems
|
|
736 @subsection ISO 2022
|
|
737
|
|
738 This section briefly describes the ISO 2022 encoding standard. A more
|
|
739 thorough treatment is available in the original document of ISO
|
|
740 2022 as well as various national standards (such as JIS X 0202).
|
428
|
741
|
442
|
742 Character sets (@dfn{charsets}) are classified into the following four
|
|
743 categories, according to the number of characters in the charset:
|
|
744 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
|
|
745 that although an ISO 2022 coding system may have variable width
|
|
746 characters, each charset used is fixed-width (in contrast to the MULE
|
|
747 character set and UTF-8, for example).
|
|
748
|
|
749 ISO 2022 provides for switching between character sets via escape
|
|
750 sequences. This switching is somewhat complicated, because ISO 2022
|
|
751 provides for both legacy applications like Internet mail that accept
|
444
|
752 only 7 significant bits in some contexts (RFC 822 headers, for example),
|
442
|
753 and more modern "8-bit clean" applications. It also provides for
|
|
754 compact and transparent representation of languages like Japanese which
|
|
755 mix ASCII and a national script (even outside of computer programs).
|
428
|
756
|
442
|
757 First, ISO 2022 codified prevailing practice by dividing the code space
|
|
758 into "control" and "graphic" regions. The code points 0x00-0x1F and
|
|
759 0x80-0x9F are reserved for "control characters", while "graphic
|
|
760 characters" must be assigned to code points in the regions 0x20-0x7F and
|
|
761 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
|
|
762 circumstances must be assigned the graphic character "ASCII SPACE" and
|
|
763 the control character "ASCII DEL" respectively.
|
428
|
764
|
442
|
765 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F),
|
|
766 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left"
|
|
767 and "graphic right", respectively, because of the standard method of
|
|
768 displaying graphic character sets in tables with the high nibble indexing
|
444
|
769 columns and the low nibble indexing rows. I don't find it very intuitive,
|
442
|
770 but these are called "registers".
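
As a minimal sketch of the division just described, the following
function classifies a single byte by region.  The name
@code{my-iso2022-region} is hypothetical and is not part of XEmacs.

@example
(defun my-iso2022-region (byte)
  "Return the ISO 2022 region (c0, gl, c1 or gr) containing BYTE.
BYTE is assumed to be an integer between 0 and 255."
  (cond ((<= byte #x1F) 'c0)    ; 0x00-0x1F: control characters
        ((<= byte #x7F) 'gl)    ; 0x20-0x7F: graphic left
        ((<= byte #x9F) 'c1)    ; 0x80-0x9F: control characters
        (t 'gr)))               ; 0xA0-0xFF: graphic right

(my-iso2022-region #x41)
     @result{} gl
(my-iso2022-region #xA4)
     @result{} gr
@end example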
|
|
771
|
|
772 An ISO 2022-conformant encoding for a graphic character set must use a
|
|
773 fixed number of bytes per character, and the values must fit into a
|
|
774 single register; that is, each byte must range over either 0x20-0x7F, or
|
|
775 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
|
|
776 character set by using both ranges at the same time. This is why a standard
|
|
777 character set such as ISO 8859-1 is actually considered by ISO 2022 to
|
|
778 be an aggregation of two character sets, ASCII and LATIN-1, and why it
|
|
779 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
|
|
780 single character's bytes must all be drawn from the same register; this
|
|
781 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
|
|
782 2022-compatible encodings.
|
428
|
783
|
442
|
784 The reason for this restriction becomes clear when you attempt to define
|
|
785 an efficient, robust encoding for a language like Japanese. Like ISO
|
|
786 8859, Japanese encodings are aggregations of several character sets. In
|
|
787 practice, the vast majority of characters are drawn from the "JIS Roman"
|
|
788 character set (a derivative of ASCII; it won't hurt to think of it as
|
|
789 ASCII) and the JIS X 0208 standard "basic Japanese" character set
|
|
790 including not only ideographic characters ("kanji") but syllabic
|
|
791 Japanese characters ("kana"), a wide variety of symbols, and many
|
|
792 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although
|
|
793 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not
|
|
794 suited to programming; thus the inclusion of ASCII in the standard
|
|
795 Japanese encodings.
|
428
|
796
|
442
|
797 For normal Japanese text such as in newspapers, a broad repertoire of
|
|
798 approximately 3000 characters is used. Evidently this won't fit into
|
|
799 one byte; two must be used. But much of the text processed by Japanese
|
|
800 computers is computer source code, nearly all of which is ASCII. A not
|
|
801 insignificant portion of ordinary text is English (as such or as
|
|
802 borrowed Japanese vocabulary) or other languages which can be represented
|
|
803 at least approximately in ASCII, as well. It seems reasonable then to
|
|
804 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
|
|
805 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
|
|
806 invoked to the GL register, and JIS X 0208 is invoked to the GR
|
|
807 register. Thus, each byte can be tested for its character set by
|
|
808 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
|
|
809 Furthermore, since control characters like newline can never be part of
|
|
810 a graphic character, even in the case of corruption in transmission the
|
|
811 stream will be resynchronized at every line break, on the order of 60-80
|
|
812 bytes. This coding system requires no escape sequences or special
|
|
813 control codes to represent 99.9% of all Japanese text.
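
The high-bit test can be sketched in a few lines of Lisp.  This is an
illustration only, not how XEmacs decodes EUC-JP: the function name
@code{my-euc-jp-split} is hypothetical, the input is assumed to be a
well-formed list of byte values, and only ASCII and JIS X 0208 are
handled (single shifts and other control codes are ignored).

@example
(defun my-euc-jp-split (bytes)
  "Group BYTES, a list of integers, into a list of characters.
A byte with the high bit clear is one ASCII character; a byte with
the high bit set starts a two-byte JIS X 0208 character."
  (let (result)
    (while bytes
      (if (>= (car bytes) #x80)
          ;; High bit set: take two bytes and strip the high bits
          ;; to recover the two position codes.
          (progn
            (setq result (cons (list 'japanese-jisx0208
                                     (logand (car bytes) #x7F)
                                     (logand (car (cdr bytes)) #x7F))
                               result))
            (setq bytes (cdr (cdr bytes))))
        ;; High bit clear: a single ASCII byte.
        (setq result (cons (list 'ascii (car bytes)) result))
        (setq bytes (cdr bytes))))
    (nreverse result)))

(my-euc-jp-split '(#x41 #xA4 #xA2))
     @result{} ((ascii 65) (japanese-jisx0208 36 34))
@end example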
|
428
|
814
|
442
|
815 Note carefully the distinction between the character sets (ASCII and JIS
|
|
816 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The
|
|
817 JIS X 0208 character set is used in three different encodings for
|
|
818 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
|
|
819 always clear), in EUC-JP it is invoked into GR (setting the high bit in
|
|
820 the process), and in Shift JIS the high bit may be set or reset, and the
|
|
821 significant bits are shifted within the 16-bit character so that the two
|
|
822 main character sets can coexist with a third (the "halfwidth katakana"
|
|
823 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
|
|
824 version of the ISO-2022 coding system.
|
428
|
825
|
442
|
826 In order to systematically treat subsidiary character sets (like the
|
|
827 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
|
|
828 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
|
|
829 Unlike GL and GR, they are not logically distinguished by internal
|
|
830 format. Instead, the process of "invocation" mentioned earlier is
|
|
831 broken into two steps: first, a character set is @dfn{designated} to one
|
|
832 of the registers G0-G3 by use of an @dfn{escape sequence} of the form:
|
428
|
833
|
|
834 @example
|
440
|
835 ESC [@var{I}] @var{I} @var{F}
|
428
|
836 @end example
|
|
837
|
442
|
838 where @var{I} is an intermediate character or characters in the range
|
|
839 0x20-0x2F, and @var{F}, from the range 0x30-0x7E, is the final
|
|
840 character identifying this charset. (Final characters in the range
|
|
841 0x30-0x3F are reserved for private use and will never have a publicly
|
|
842 registered meaning.)
|
|
843
|
|
844 Then that register is @dfn{invoked} to either GL or GR, either
|
|
845 automatically (designations to G0 normally involve invocation to GL as
|
|
846 well), or by use of shifting (affecting only the following character in
|
|
847 the data stream) or locking (effective until the next designation or
|
|
848 locking) control sequences. An encoding conformant to ISO 2022 is
|
|
849 typically defined by designating the initial contents of the G0-G3
|
901
|
850 registers, specifying a 7 or 8 bit environment, and specifying whether
|
442
|
851 further designations will be recognized.
|
|
852
|
|
853 Some examples of character sets and the registered final characters
|
|
854 @var{F} used to designate them:
|
428
|
855
|
442
|
856 @need 1000
|
|
857 @table @asis
|
|
858 @item 94-charset
|
|
859 ASCII (B), left (J) and right (I) half of JIS X 0201, ...
|
|
860 @item 96-charset
|
|
861 Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
|
|
862 @item 94x94-charset
|
|
863 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
|
|
864 @item 96x96-charset
|
|
865 none for the moment
|
|
866 @end table
|
|
867
|
|
868 The meanings of the various characters in these sequences, where not
|
|
869 specified by the ISO 2022 standard (such as the ESC character), are
|
|
870 assigned by @dfn{ECMA}, the European Computer Manufacturers Association.
|
|
871
|
|
872 The meanings of the intermediate characters are:
|
428
|
873
|
|
874 @example
|
|
875 @group
|
440
|
876 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
|
|
877 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
|
|
878 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
|
|
879 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
|
|
880 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
|
442
|
881 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
|
440
|
882 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
|
|
883 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
|
|
884 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
|
428
|
885 @end group
|
|
886 @end example
|
|
887
|
442
|
888 The comma may be used in files read and written only by MULE, as a MULE
|
|
889 extension, but this is illegal in ISO 2022. (The reason is that in ISO
|
|
890 2022 G0 must be a 94-member character set, with 0x20 assigned the value
|
|
891 SPACE, and 0x7F assigned the value DEL.)
|
428
|
892
|
442
|
893 Here are examples of designations:
|
428
|
894
|
|
895 @example
|
|
896 @group
|
440
|
897 ESC ( B : designate to G0 ASCII
|
|
898 ESC - A : designate to G1 Latin-1
|
|
899 ESC $ ( A or ESC $ A : designate to G0 GB2312
|
|
900 ESC $ ( B or ESC $ B : designate to G0 JISX0208
|
|
901 ESC $ ) C : designate to G1 KSC5601
|
428
|
902 @end group
|
|
903 @end example
|
|
904
|
442
|
905 (The short forms used to designate GB2312 and JIS X 0208 are for
|
|
906 backwards compatibility; the long forms are preferred.)
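
The designation sequences above follow a regular pattern, which the
following sketch makes explicit.  The function name
@code{my-iso2022-designation} and its argument conventions are
hypothetical; only the long forms are produced.

@example
(defun my-iso2022-designation (register chars dimension final)
  "Return an escape sequence designating a charset to REGISTER.
REGISTER is 0..3 (for G0..G3), CHARS is 94 or 96, DIMENSION is 1 or 2,
and FINAL is the final character of the charset."
  ;; Pick the intermediate character from the table of intermediates:
  ;; ( ) * + for 94-charsets and , - . / for 96-charsets.
  (let ((intermediate (aref (if (= chars 94) "()*+" ",-./") register)))
    (concat "\e"
            (if (= dimension 2) "$" "")
            (char-to-string intermediate)
            (char-to-string final))))

(my-iso2022-designation 0 94 2 ?B)   ; ESC $ ( B, JIS X 0208 to G0
(my-iso2022-designation 1 96 1 ?A)   ; ESC - A, Latin-1 to G1
@end example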
|
|
907
|
|
908 To use a charset designated to G2 or G3, and to use a charset designated
|
428
|
909 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
|
|
910 into GL. There are two types of invocation, Locking Shift (forever) and
|
|
911 Single Shift (one character only).
|
|
912
|
442
|
913 Locking Shift is done as follows:
|
428
|
914
|
|
915 @example
|
440
|
916 LS0 or SI (0x0F): invoke G0 into GL
|
|
917 LS1 or SO (0x0E): invoke G1 into GL
|
|
918 LS2: invoke G2 into GL
|
|
919 LS3: invoke G3 into GL
|
|
920 LS1R: invoke G1 into GR
|
|
921 LS2R: invoke G2 into GR
|
|
922 LS3R: invoke G3 into GR
|
428
|
923 @end example
|
|
924
|
442
|
925 Single Shift is done as follows:
|
428
|
926
|
|
927 @example
|
|
928 @group
|
440
|
929 SS2 or ESC N: invoke G2 into GL
|
|
930 SS3 or ESC O: invoke G3 into GL
|
428
|
931 @end group
|
|
932 @end example
|
|
933
|
442
|
934 The shift functions (such as LS1R and SS3) are represented by control
|
|
935 characters (from C1) in 8 bit environments and by escape sequences in 7
|
|
936 bit environments.
|
|
937
|
428
|
938 (#### Ben says: I think the above is slightly incorrect. It appears that
|
|
939 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
|
444
|
940 ESC O behave as indicated. The above definitions will not parse
|
428
|
941 EUC-encoded text correctly, and it looks like the code in mule-coding.c
|
|
942 has similar problems.)
|
|
943
|
442
|
944 Evidently there are a lot of ISO-2022-compliant ways of encoding
|
|
945 multilingual text. Many such coding systems are in use today,
|
|
946 such as X11's Compound Text, Japanese JUNET code, and so-called EUC
|
|
947 (Extended UNIX Code); all of these are variants of ISO 2022.
|
428
|
948
|
442
|
949 In MULE, we characterize a version of ISO 2022 by the following
|
|
950 attributes:
|
428
|
951
|
|
952 @enumerate
|
|
953 @item
|
442
|
954 The character sets initially designated to G0 through G3.
|
428
|
955 @item
|
442
|
956 Whether short form designations are allowed for Japanese and Chinese.
|
428
|
957 @item
|
442
|
958 Whether ASCII should be designated to G0 before control characters.
|
428
|
959 @item
|
442
|
960 Whether ASCII should be designated to G0 at the end of line.
|
428
|
961 @item
|
|
962 7-bit environment or 8-bit environment.
|
|
963 @item
|
442
|
964 Whether Locking Shifts are used or not.
|
428
|
965 @item
|
442
|
966 Whether to use ASCII or the variant JIS X 0201-1976-Roman.
|
428
|
967 @item
|
442
|
968 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1978.
|
428
|
969 @end enumerate
|
|
970
|
|
971 (The last two are only for Japanese.)
|
|
972
|
442
|
973 By specifying these attributes, you can create any variant
|
428
|
974 of ISO 2022.
|
|
975
|
442
|
976 Here are several examples:
|
428
|
977
|
|
978 @example
|
|
979 @group
|
442
|
980 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
|
440
|
981 1. G0 <- ASCII, G1..3 <- never used
|
|
982 2. Yes.
|
|
983 3. Yes.
|
|
984 4. Yes.
|
|
985 5. 7-bit environment
|
|
986 6. No.
|
|
987 7. Use ASCII
|
442
|
988 8. Use JIS X 0208-1983
|
428
|
989 @end group
|
|
990
|
|
991 @group
|
442
|
992 ctext -- X11 Compound Text
|
|
993 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
|
440
|
994 2. No.
|
|
995 3. No.
|
|
996 4. Yes.
|
442
|
997 5. 8-bit environment.
|
440
|
998 6. No.
|
442
|
999 7. Use ASCII.
|
|
1000 8. Use JIS X 0208-1983.
|
428
|
1001 @end group
|
|
1002
|
|
1003 @group
|
442
|
1004 euc-china -- Chinese EUC. Often called the "GB encoding", but that is
|
|
1005 technically incorrect.
|
|
1006 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
|
440
|
1007 2. No.
|
|
1008 3. Yes.
|
|
1009 4. Yes.
|
442
|
1010 5. 8-bit environment.
|
440
|
1011 6. No.
|
442
|
1012 7. Use ASCII.
|
|
1013 8. Use JIS X 0208-1983.
|
428
|
1014 @end group
|
|
1015
|
|
1016 @group
|
442
|
1017 ISO-2022-KR -- Coding system used in Korean email.
|
|
1018 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
|
440
|
1019 2. No.
|
|
1020 3. Yes.
|
|
1021 4. Yes.
|
442
|
1022 5. 7-bit environment.
|
440
|
1023 6. Yes.
|
442
|
1024 7. Use ASCII.
|
|
1025 8. Use JIS X 0208-1983.
|
428
|
1026 @end group
|
|
1027 @end example
|
|
1028
|
442
|
1029 MULE creates all of these coding systems by default.
|
428
|
1030
|
442
|
1031 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems
|
428
|
1032 @subsection EOL Conversion
|
|
1033
|
|
1034 @table @code
|
|
1035 @item nil
|
|
1036 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
|
|
1037 generate subsidiary coding systems named @code{@var{name}-unix},
|
|
1038 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
|
|
1039 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
|
|
1040 and @code{cr}, respectively.
|
|
1041 @item lf
|
|
1042 The end of a line is marked externally using ASCII LF. Since this is
|
|
1043 also the way that XEmacs represents an end-of-line internally,
|
|
1044 specifying this option results in no end-of-line conversion. This is
|
|
1045 the standard format for Unix text files.
|
|
1046 @item crlf
|
|
1047 The end of a line is marked externally using ASCII CRLF. This is the
|
|
1048 standard format for MS-DOS text files.
|
|
1049 @item cr
|
|
1050 The end of a line is marked externally using ASCII CR. This is the
|
|
1051 standard format for Macintosh text files.
|
|
1052 @item t
|
|
1053 Automatically detect the end-of-line type but do not generate subsidiary
|
|
1054 coding systems. (This value is converted to @code{nil} when stored
|
|
1055 internally, and @code{coding-system-property} will return @code{nil}.)
|
|
1056 @end table
|
|
1057
|
442
|
1058 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems
|
428
|
1059 @subsection Coding System Properties
|
|
1060
|
|
1061 @table @code
|
|
1062 @item mnemonic
|
|
1063 String to be displayed in the modeline when this coding system is
|
|
1064 active.
|
|
1065
|
|
1066 @item eol-type
|
|
1067 End-of-line conversion to be used. It should be one of the types
|
|
1068 listed in @ref{EOL Conversion}.
|
|
1069
|
442
|
1070 @item eol-lf
|
444
|
1071 The coding system which is the same as this one, except that it uses the
|
442
|
1072 Unix line-breaking convention.
|
|
1073
|
|
1074 @item eol-crlf
|
444
|
1075 The coding system which is the same as this one, except that it uses the
|
442
|
1076 DOS line-breaking convention.
|
|
1077
|
|
1078 @item eol-cr
|
444
|
1079 The coding system which is the same as this one, except that it uses the
|
442
|
1080 Macintosh line-breaking convention.
|
|
1081
|
428
|
1082 @item post-read-conversion
|
|
1083 Function called after a file has been read in, to perform the decoding.
|
444
|
1084 Called with two arguments, @var{start} and @var{end}, denoting a region of
|
428
|
1085 the current buffer to be decoded.
|
|
1086
|
|
1087 @item pre-write-conversion
|
|
1088 Function called before a file is written out, to perform the encoding.
|
444
|
1089 Called with two arguments, @var{start} and @var{end}, denoting a region of
|
428
|
1090 the current buffer to be encoded.
|
|
1091 @end table
|
|
1092
|
442
|
1093 The following additional properties are recognized if @var{type} is
|
428
|
1094 @code{iso2022}:
|
|
1095
|
|
1096 @table @code
|
|
1097 @item charset-g0
|
|
1098 @itemx charset-g1
|
|
1099 @itemx charset-g2
|
|
1100 @itemx charset-g3
|
|
1101 The character set initially designated to the G0 - G3 registers.
|
|
1102 The value should be one of
|
|
1103
|
|
1104 @itemize @bullet
|
|
1105 @item
|
|
1106 A charset object (designate that character set)
|
|
1107 @item
|
|
1108 @code{nil} (do not ever use this register)
|
|
1109 @item
|
|
1110 @code{t} (no character set is initially designated to the register, but
|
|
1111 may be later on; this automatically sets the corresponding
|
|
1112 @code{force-g*-on-output} property)
|
|
1113 @end itemize
|
|
1114
|
|
1115 @item force-g0-on-output
|
|
1116 @itemx force-g1-on-output
|
|
1117 @itemx force-g2-on-output
|
|
1118 @itemx force-g3-on-output
|
|
1119 If non-@code{nil}, send an explicit designation sequence on output
|
|
1120 before using the specified register.
|
|
1121
|
|
1122 @item short
|
|
1123 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
|
|
1124 and @samp{ESC $ B} on output in place of the full designation sequences
|
|
1125 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
|
|
1126
|
|
1127 @item no-ascii-eol
|
|
1128 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
|
|
1129 output. Setting this to non-@code{nil} also suppresses other
|
|
1130 state-resetting that normally happens at the end of a line.
|
|
1131
|
|
1132 @item no-ascii-cntl
|
|
1133 If non-@code{nil}, don't designate ASCII to G0 before control chars on
|
|
1134 output.
|
|
1135
|
|
1136 @item seven
|
|
1137 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
|
|
1138 environment.
|
|
1139
|
|
1140 @item lock-shift
|
|
1141 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
|
|
1142 designation by escape sequence.
|
|
1143
|
|
1144 @item no-iso6429
|
|
1145 If non-@code{nil}, don't use ISO6429's direction specification.
|
|
1146
|
|
1147 @item escape-quoted
|
444
|
1148 If non-@code{nil}, literal control characters that are the same as the
|
428
|
1149 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
|
|
1150 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
|
|
1151 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
|
|
1152 be properly distinguished from an escape sequence. (Note that doing
|
|
1153 this results in a non-portable encoding.) This encoding flag is used for
|
|
1154 byte-compiled files. Note that ESC is a good choice for a quoting
|
|
1155 character because there are no escape sequences whose second byte is a
|
|
1156 character from the Control-0 or Control-1 character sets; this is
|
|
1157 explicitly disallowed by the ISO 2022 standard.
|
|
1158
|
|
1159 @item input-charset-conversion
|
|
1160 A list of conversion specifications, specifying conversion of characters
|
|
1161 in one charset to another when decoding is performed. Each
|
|
1162 specification is a list of two elements: the source charset, and the
|
|
1163 destination charset.
|
|
1164
|
|
1165 @item output-charset-conversion
|
|
1166 A list of conversion specifications, specifying conversion of characters
|
|
1167 in one charset to another when encoding is performed. The form of each
|
|
1168 specification is the same as for @code{input-charset-conversion}.
|
|
1169 @end table
|
|
1170
|
442
|
1171 The following additional properties are recognized (and required) if
|
428
|
1172 @var{type} is @code{ccl}:
|
|
1173
|
|
1174 @table @code
|
|
1175 @item decode
|
|
1176 CCL program used for decoding (converting to internal format).
|
|
1177
|
|
1178 @item encode
|
|
1179 CCL program used for encoding (converting to external format).
|
|
1180 @end table
|
|
1181
|
442
|
1182 The following properties are used internally: @var{eol-cr},
|
|
1183 @var{eol-crlf}, @var{eol-lf}, and @var{base}.
|
|
1184
|
|
1185 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems
|
428
|
1186 @subsection Basic Coding System Functions
|
|
1187
|
|
1188 @defun find-coding-system coding-system-or-name
|
|
1189 This function retrieves the coding system of the given name.
|
|
1190
|
442
|
1191 If @var{coding-system-or-name} is a coding-system object, it is simply
|
428
|
1192 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
|
|
1193 If there is no such coding system, @code{nil} is returned. Otherwise
|
|
1194 the associated coding system object is returned.
|
|
1195 @end defun
|
|
1196
|
|
1197 @defun get-coding-system name
|
|
1198 This function retrieves the coding system of the given name. Same as
|
|
1199 @code{find-coding-system} except an error is signalled if there is no
|
|
1200 such coding system instead of returning @code{nil}.
|
|
1201 @end defun
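
As a brief sketch of the difference between the two lookup functions,
assuming a MULE-enabled XEmacs (the symbol @code{no-such-coding-system}
is a made-up name used only for illustration):

@example
;; `no-such-coding-system' is a hypothetical, undefined name.
(find-coding-system 'no-such-coding-system)
     @result{} nil

;; get-coding-system signals an error instead, so trap it here.
(condition-case nil
    (get-coding-system 'no-such-coding-system)
  (error 'not-found))
     @result{} not-found
@end example

Use @code{find-coding-system} when a missing coding system is an
expected case, and @code{get-coding-system} when it indicates a bug.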
|
|
1202
|
|
1203 @defun coding-system-list
|
|
1204 This function returns a list of the names of all defined coding systems.
|
|
1205 @end defun
|
|
1206
|
|
1207 @defun coding-system-name coding-system
|
|
1208 This function returns the name of the given coding system.
|
|
1209 @end defun
|
|
1210
|
442
|
1211 @defun coding-system-base coding-system
|
|
1212 This function returns the base coding system of @var{coding-system},
|
|
1213 that is, the variant whose EOL convention is undecided.
|
|
1214 @end defun
|
|
1215
|
428
|
1216 @defun make-coding-system name type &optional doc-string props
|
|
1217 This function registers symbol @var{name} as a coding system.
|
|
1218
|
|
1219 @var{type} describes the conversion method used and should be one of
|
|
1220 the types listed in @ref{Coding System Types}.
|
|
1221
|
|
1222 @var{doc-string} is a string describing the coding system.
|
|
1223
|
|
1224 @var{props} is a property list, describing the specific nature of the
|
|
1225 coding system.  Recognized properties are as in @ref{Coding System
|
|
1226 Properties}.
|
|
1227 @end defun
|
|
1228
|
|
1229 @defun copy-coding-system old-coding-system new-name
|
|
1230 This function copies @var{old-coding-system} to @var{new-name}. If
|
|
1231 @var{new-name} does not name an existing coding system, a new one will
|
|
1232 be created.
|
|
1233 @end defun
|
|
1234
|
|
1235 @defun subsidiary-coding-system coding-system eol-type
|
|
1236 This function returns the subsidiary coding system of
|
|
1237 @var{coding-system} with eol type @var{eol-type}.
|
|
1238 @end defun
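
The following sketch relates a base coding system to one of its
subsidiaries.  It assumes that @code{euc-jp} is defined and that
@code{crlf} is an accepted value for @var{eol-type}; the names in the
comments are what the naming convention suggests, not verified output.

@example
(coding-system-name
 (subsidiary-coding-system (get-coding-system 'euc-jp) 'crlf))
;; expected to name the "-dos" variant, e.g. euc-jp-dos

(coding-system-name
 (coding-system-base (get-coding-system 'euc-jp-dos)))
;; expected to name the base coding system, e.g. euc-jp
@end example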
|
|
1239
|
442
|
1240 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems
|
428
|
1241 @subsection Coding System Property Functions
|
|
1242
|
|
1243 @defun coding-system-doc-string coding-system
|
|
1244 This function returns the doc string for @var{coding-system}.
|
|
1245 @end defun
|
|
1246
|
|
1247 @defun coding-system-type coding-system
|
|
1248 This function returns the type of @var{coding-system}.
|
|
1249 @end defun
|
|
1250
|
|
1251 @defun coding-system-property coding-system prop
|
|
1252 This function returns the @var{prop} property of @var{coding-system}.
|
|
1253 @end defun
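
A minimal sketch of querying a coding system, assuming that
@code{euc-jp} is defined and that @code{mnemonic} is among its
properties (both are assumptions about the running XEmacs):

@example
(let ((cs (get-coding-system 'euc-jp)))
  (list (coding-system-type cs)          ; the conversion type
        (coding-system-doc-string cs)    ; the descriptive string
        (coding-system-property cs 'mnemonic)))
@end example

The result is a three-element list holding the type, the doc string,
and the value of the requested property.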
|
|
1254
|
442
|
1255 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems
|
428
|
1256 @subsection Encoding and Decoding Text
|
|
1257
|
|
1258 @defun decode-coding-region start end coding-system &optional buffer
|
|
1259 This function decodes the text between @var{start} and @var{end} which
|
|
1260 is encoded in @var{coding-system}. This is useful if you've read in
|
|
1261 encoded text from a file without decoding it (e.g. you read in a
|
|
1262 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
|
|
1263 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
|
|
1264 decoded text is returned.  @var{buffer} defaults to the current buffer
|
|
1265 if unspecified.
|
|
1266 @end defun
|
|
1267
|
|
1268 @defun encode-coding-region start end coding-system &optional buffer
|
|
1269 This function encodes the text between @var{start} and @var{end} using
|
|
1270 @var{coding-system}. This will, for example, convert Japanese
|
|
1271 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
|
|
1272 encoding. The length of the encoded text is returned. @var{buffer}
|
|
1273 defaults to the current buffer if unspecified.
|
|
1274 @end defun
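
As an illustration of the situation described above, the following
sketch decodes a whole buffer that was read in without conversion.  The
command name @code{my-redecode-buffer-as-jis} is hypothetical, and
@code{iso-2022-jp} is assumed to be the encoding actually used by the
file.

@example
;; Hypothetical helper, not part of XEmacs.
(defun my-redecode-buffer-as-jis ()
  "Decode the whole current buffer as ISO-2022-JP text."
  (interactive)
  (decode-coding-region (point-min) (point-max) 'iso-2022-jp))
@end example

The inverse operation would call @code{encode-coding-region} with the
same arguments.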
|
|
1275
|
442
|
1276 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems
|
428
|
1277 @subsection Detection of Textual Encoding
|
|
1278
|
|
1279 @defun coding-category-list
|
|
1280 This function returns a list of all recognized coding categories.
|
|
1281 @end defun
|
|
1282
|
|
1283 @defun set-coding-priority-list list
|
|
1284 This function changes the priority order of the coding categories.
|
|
1285 @var{list} should be a list of coding categories, in descending order of
|
|
1286 priority. Unspecified coding categories will be lower in priority than
|
|
1287 all specified ones, in the same relative order they were in previously.
|
|
1288 @end defun
|
|
1289
|
|
1290 @defun coding-priority-list
|
|
1291 This function returns a list of coding categories in descending order of
|
|
1292 priority.
|
|
1293 @end defun
|
|
1294
|
|
1295 @defun set-coding-category-system coding-category coding-system
|
|
1296 This function changes the coding system associated with a coding category.
|
|
1297 @end defun
|
|
1298
|
|
1299 @defun coding-category-system coding-category
|
|
1300 This function returns the coding system associated with a coding category.
|
|
1301 @end defun
|
|
1302
|
|
1303 @defun detect-coding-region start end &optional buffer
|
|
1304 This function detects the coding system of the text in the region between
|
|
1305 @var{start} and @var{end}.  The returned value is a list of possible coding
|
|
1306 systems ordered by priority.  If only ASCII characters are found, it
|
|
1307 returns @code{autodetect} or one of its subsidiary coding systems
|
|
1308 according to the detected end-of-line type.  Optional argument @var{buffer}
|
|
1309 defaults to the current buffer.
|
|
1310 @end defun
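
Because detection depends on the current priority order, code that
changes the order temporarily should restore it afterwards.  The sketch
below assumes that @code{shift-jis} is one of the categories returned by
@code{coding-category-list}; substitute any category that is.

@example
(let ((saved (coding-priority-list)))
  (unwind-protect
      (progn
        ;; Give one category top priority; the rest keep their
        ;; previous relative order.
        (set-coding-priority-list '(shift-jis))
        (detect-coding-region (point-min) (point-max)))
    ;; Restore the original order even if detection fails.
    (set-coding-priority-list saved)))
@end example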
|
|
1311
|
442
|
1312 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems
|
428
|
1313 @subsection Big5 and Shift-JIS Functions
|
|
1314
|
442
|
1315 These are special functions for working with the non-standard
|
428
|
1316 Shift-JIS and Big5 encodings.
|
|
1317
|
|
1318 @defun decode-shift-jis-char code
|
442
|
1319 This function decodes a JIS X 0208 character of Shift-JIS coding-system.
|
428
|
1320 @var{code} is the character code in Shift-JIS as a cons of two bytes.
|
|
1321 The corresponding character is returned.
|
|
1322 @end defun
|
|
1323
|
444
|
1324 @defun encode-shift-jis-char character
|
|
1325 This function encodes a JIS X 0208 character @var{character} to
|
|
1326 SHIFT-JIS coding-system. The corresponding character code in SHIFT-JIS
|
|
1327 is returned as a cons of two bytes.
|
428
|
1328 @end defun
|
|
1329
|
|
1330 @defun decode-big5-char code
|
|
1331 This function decodes a Big5 character @var{code} of BIG5 coding-system.
|
|
1332 @var{code} is the character code in BIG5. The corresponding character
|
|
1333 is returned.
|
|
1334 @end defun
|
|
1335
|
444
|
1336 @defun encode-big5-char character
|
|
1337 This function encodes the Big5 character @var{character} to BIG5
|
428
|
1338 coding-system. The corresponding character code in Big5 is returned.
|
|
1339 @end defun
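
A small round-trip sketch for the Shift-JIS functions.  It assumes that
the byte pair 130, 160 (0x82 0xA0, Hiragana ``a'' in Shift-JIS) maps to
a JIS X 0208 character in the running XEmacs.

@example
(let ((ch (decode-shift-jis-char '(130 . 160))))
  ;; Encoding the decoded character should give back the same
  ;; pair of bytes.
  (encode-shift-jis-char ch))
@end example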
|
|
1340
|
442
|
1341 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems
|
|
1342 @subsection Coding Systems Implemented
|
|
1343
|
|
1344 MULE initializes most of the commonly used coding systems at XEmacs's
|
|
1345 startup. A few others are initialized only when the relevant language
|
|
1346 environment is selected and support libraries are loaded. (NB: The
|
444
|
1347 following list is based on XEmacs 21.2.19, the development branch at the
|
442
|
1348 time of writing. The list may be somewhat different for other
|
|
1349 versions. Recent versions of GNU Emacs 20 implement a few more rare
|
|
1350 coding systems; work is being done to port these to XEmacs.)
|
|
1351
|
444
|
1352 Unfortunately, there is not a consistent naming convention for character
|
|
1353 sets, and for practical purposes coding systems often take their name
|
442
|
1354 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others
|
444
|
1355 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
|
|
1356 from their non-text usages (internal, binary). To provide for this, and
|
442
|
1357 for the fact that many coding systems have several common names, an
|
|
1358 aliasing system is provided. Finally, some effort has been made to use
|
|
1359 names that are registered as MIME charsets (this is why the name
|
|
1360 'shift_jis contains that un-Lisp-y underscore).
|
|
1361
|
|
1362 There is a systematic naming convention regarding end-of-line (EOL)
|
|
1363 conventions for different systems. A coding system whose name ends in
|
|
1364 "-unix" forces the assumption that lines are broken by newlines (0x0A).
|
|
1365 A coding system whose name ends in "-mac" forces the assumption that
|
|
1366 lines are broken by ASCII CRs (0x0D).  A coding system whose name ends
|
|
1367 in "-dos" forces the assumption that lines are broken by CRLF sequences
|
|
1368 (0x0D 0x0A). These subsidiary coding systems are automatically derived
|
|
1369 from a base coding system. Use of the base coding system implies
|
|
1370 autodetection of the text file convention. (The fact that the -unix,
|
|
1371 -mac, and -dos are derived from a base system results in them showing up
|
|
1372 as "aliases" in `list-coding-systems'.) These subsidiaries have a
|
|
1373 consistent modeline indicator as well. "-dos" coding systems have ":T"
|
|
1374 appended to their modeline indicator, while "-mac" coding systems have
|
|
1375 ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
|
|
1376
|
|
1377 In the following table, each coding system is given with its mode line
|
|
1378 indicator in parentheses. Non-textual coding systems are listed first,
|
|
1379 followed by textual coding systems and their aliases. (The coding system
|
|
1380 subsidiary modeline indicators ":T" and ":t" will be omitted from the
|
|
1381 table of coding systems.)
|
|
1382
|
|
1383 ### SJT 1999-08-23 Maybe should order these by language? Definitely
|
|
1384 need language usage for the ISO-8859 family.
|
|
1385
|
|
1386 Note that although true coding system aliases have been implemented for
|
444
|
1387 XEmacs 21.2, the coding system initialization has not yet been converted
|
442
|
1388 as of 21.2.19. So coding systems described as aliases have the same
|
|
1389 properties as the aliased coding system, but will not be equal as Lisp
|
|
1390 objects.
|
|
1391
|
|
1392 @table @code
|
|
1393
|
|
1394 @item automatic-conversion
|
|
1395 @itemx undecided
|
|
1396 @itemx undecided-dos
|
|
1397 @itemx undecided-mac
|
|
1398 @itemx undecided-unix
|
|
1399
|
|
1400 Modeline indicator: @code{Auto}. A type @code{undecided} coding system.
|
|
1401 Attempts to determine an appropriate coding system from file contents or
|
|
1402 the environment.
|
|
1403
|
|
1404 @item raw-text
|
|
1405 @itemx no-conversion
|
|
1406 @itemx raw-text-dos
|
|
1407 @itemx raw-text-mac
|
|
1408 @itemx raw-text-unix
|
|
1409 @itemx no-conversion-dos
|
|
1410 @itemx no-conversion-mac
|
|
1411 @itemx no-conversion-unix
|
|
1412
|
|
1413 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system,
|
|
1414 which converts only line-break-codes. An implementation quirk means
|
|
1415 that this coding system is also used for ISO8859-1.
|
|
1416
|
|
1417 @item binary
|
|
1418 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding
|
|
1419 system which does no character coding or EOL conversions. An alias for
|
|
1420 @code{raw-text-unix}.
|
|
1421
|
|
1422 @item alternativnyj
|
|
1423 @itemx alternativnyj-dos
|
|
1424 @itemx alternativnyj-mac
|
|
1425 @itemx alternativnyj-unix
|
|
1426
|
|
1427 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for
|
|
1428 Alternativnyj, an encoding of the Cyrillic alphabet.
|
|
1429
|
|
1430 @item big5
|
|
1431 @itemx big5-dos
|
|
1432 @itemx big5-mac
|
|
1433 @itemx big5-unix
|
|
1434
|
|
1435 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for
|
|
1436 BIG5, the most common encoding of traditional Chinese as used in Taiwan.
|
|
1437
|
|
1438 @item cn-gb-2312
|
|
1439 @itemx cn-gb-2312-dos
|
|
1440 @itemx cn-gb-2312-mac
|
|
1441 @itemx cn-gb-2312-unix
|
|
1442
|
|
1443 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used
|
|
1444 for simplified Chinese (as used in the People's Republic of China), with
|
|
1445 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng}
|
|
1446 (G2) character sets initially designated. Chinese EUC (Extended Unix
|
|
1447 Code).
|
|
1448
|
|
1449 @item ctext-hebrew
|
|
1450 @itemx ctext-hebrew-dos
|
|
1451 @itemx ctext-hebrew-mac
|
|
1452 @itemx ctext-hebrew-unix
|
|
1453
|
|
1454 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system
|
|
1455 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character
|
|
1456 sets initially designated for Hebrew.
|
|
1457
|
|
1458 @item ctext
|
|
1459 @itemx ctext-dos
|
|
1460 @itemx ctext-mac
|
|
1461 @itemx ctext-unix
|
|
1462
|
|
1463 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system
|
|
1464 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
|
|
1465 sets initially designated. X11 Compound Text Encoding. Often
|
|
1466 mistakenly recognized instead of EUC encodings; usual cause is
|
|
1467 inappropriate setting of @code{coding-priority-list}.
|
|
1468
|
|
1469 @item escape-quoted
|
|
1470
|
|
1471 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding
|
|
1472 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
|
|
1473 character sets initially designated and escape quoting. Unix EOL
|
|
1474 conversion (ie, no conversion). It is used for .ELC files.
|
|
1475
|
|
1476 @item euc-jp
|
|
1477 @itemx euc-jp-dos
|
|
1478 @itemx euc-jp-mac
|
|
1479 @itemx euc-jp-unix
|
|
1480
|
|
1481 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system
|
|
1482 with @code{ascii} (G0), @code{japanese-jisx0208} (G1),
|
|
1483 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3)
|
|
1484 initially designated. Japanese EUC (Extended Unix Code).
|
|
1485
|
|
1486 @item euc-kr
|
|
1487 @itemx euc-kr-dos
|
|
1488 @itemx euc-kr-mac
|
|
1489 @itemx euc-kr-unix
|
|
1490
|
|
1491 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system
|
|
1492 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
|
|
1493 designated. Korean EUC (Extended Unix Code).
|
|
1494
|
|
1495 @item hz-gb-2312
|
|
1496 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding
|
|
1497 system with Unix EOL convention (ie, no conversion) using
|
|
1498 post-read-decode and pre-write-encode functions to translate the Hz/ZW
|
|
1499 coding system used for Chinese.
|
|
1500
|
|
1501 @item iso-2022-7bit
|
|
1502 @itemx iso-2022-7bit-unix
|
|
1503 @itemx iso-2022-7bit-dos
|
|
1504 @itemx iso-2022-7bit-mac
|
|
1505 @itemx iso-2022-7
|
|
1506
|
|
1507 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system
|
|
1508 with @code{ascii} (G0) initially designated. Other character sets must
|
|
1509 be explicitly designated to be used.
|
|
1510
|
|
1511 @item iso-2022-7bit-ss2
|
|
1512 @itemx iso-2022-7bit-ss2-dos
|
|
1513 @itemx iso-2022-7bit-ss2-mac
|
|
1514 @itemx iso-2022-7bit-ss2-unix
|
|
1515
|
|
1516 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system
|
|
1517 with @code{ascii} (G0) initially designated. Other character sets must
|
|
1518 be explicitly designated to be used. SS2 is used to invoke a
|
|
1519 96-charset, one character at a time.
|
|
1520
|
|
1521 @item iso-2022-8
|
|
1522 @itemx iso-2022-8-dos
|
|
1523 @itemx iso-2022-8-mac
|
|
1524 @itemx iso-2022-8-unix
|
|
1525
|
|
1526 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system
|
|
1527 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
|
|
1528 designated. Other character sets must be explicitly designated to be
|
|
1529 used. No single-shift or locking-shift.
|
|
1530
|
|
1531 @item iso-2022-8bit-ss2
|
|
1532 @itemx iso-2022-8bit-ss2-dos
|
|
1533 @itemx iso-2022-8bit-ss2-mac
|
|
1534 @itemx iso-2022-8bit-ss2-unix
|
|
1535
|
|
1536 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system
|
|
1537 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
|
|
1538 designated. Other character sets must be explicitly designated to be
|
|
1539 used. SS2 is used to invoke a 96-charset, one character at a time.
|
|
1540
|
|
1541 @item iso-2022-int-1
|
|
1542 @itemx iso-2022-int-1-dos
|
|
1543 @itemx iso-2022-int-1-mac
|
|
1544 @itemx iso-2022-int-1-unix
|
|
1545
|
|
1546 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system
|
|
1547 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
|
|
1548 designated. ISO-2022-INT-1.
|
|
1549
|
|
1550 @item iso-2022-jp-1978-irv
|
|
1551 @itemx iso-2022-jp-1978-irv-dos
|
|
1552 @itemx iso-2022-jp-1978-irv-mac
|
|
1553 @itemx iso-2022-jp-1978-irv-unix
|
|
1554
|
|
1555 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding
|
|
1556 system. For compatibility with old Japanese terminals; if you need to
|
|
1557 know, look at the source.
|
|
1558
|
|
1559 @item iso-2022-jp
|
|
1560 @itemx iso-2022-jp-2 (ISO7/SS)
|
|
1561 @itemx iso-2022-jp-dos
|
|
1562 @itemx iso-2022-jp-mac
|
|
1563 @itemx iso-2022-jp-unix
|
|
1564 @itemx iso-2022-jp-2-dos
|
|
1565 @itemx iso-2022-jp-2-mac
|
|
1566 @itemx iso-2022-jp-2-unix
|
|
1567
|
|
1568 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding
|
|
1569 system with @code{ascii} (G0) initially designated, and complex
|
|
1570 specifications to ensure backward compatibility with old Japanese
|
|
1571 systems. Used for communication with mail and news in Japan. The "-2"
|
|
1572 versions also use SS2 to invoke a 96-charset one character at a time.
|
|
1573
|
|
1574 @item iso-2022-kr
|
|
1575 Modeline indicator: @code{Ko/7bit}.  A type @code{iso2022} 7-bit coding
|
|
1576 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
|
|
1577 designated. Used for e-mail in Korea.
|
|
1578
|
|
1579 @item iso-2022-lock
|
|
1580 @itemx iso-2022-lock-dos
|
|
1581 @itemx iso-2022-lock-mac
|
|
1582 @itemx iso-2022-lock-unix
|
|
1583
|
|
1584 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding
|
|
1585 system with @code{ascii} (G0) initially designated, using Locking-Shift
|
|
1586 to invoke a 96-charset.
|
|
1587
|
|
1588 @item iso-8859-1
|
|
1589 @itemx iso-8859-1-dos
|
|
1590 @itemx iso-8859-1-mac
|
|
1591 @itemx iso-8859-1-unix
|
|
1592
|
|
1593 Due to implementation, this is not a type @code{iso2022} coding system,
|
|
1594 but rather an alias for the @code{raw-text} coding system.
|
|
1595
|
|
1596 @item iso-8859-2
|
|
1597 @itemx iso-8859-2-dos
|
|
1598 @itemx iso-8859-2-mac
|
|
1599 @itemx iso-8859-2-unix
|
|
1600
|
|
1601 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding
|
|
1602 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially
|
|
1603 invoked.
|
|
1604
|
|
1605 @item iso-8859-3
|
|
1606 @itemx iso-8859-3-dos
|
|
1607 @itemx iso-8859-3-mac
|
|
1608 @itemx iso-8859-3-unix
|
|
1609
|
|
1610 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system
|
|
1611 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially
|
|
1612 invoked.
|
|
1613
|
|
1614 @item iso-8859-4
|
|
1615 @itemx iso-8859-4-dos
|
|
1616 @itemx iso-8859-4-mac
|
|
1617 @itemx iso-8859-4-unix
|
|
1618
|
|
1619 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system
|
|
1620 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially
|
|
1621 invoked.
|
|
1622
|
|
1623 @item iso-8859-5
|
|
1624 @itemx iso-8859-5-dos
|
|
1625 @itemx iso-8859-5-mac
|
|
1626 @itemx iso-8859-5-unix
|
|
1627
|
|
1628 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with
|
|
1629 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked.
|
|
1630
|
|
1631 @item iso-8859-7
|
|
1632 @itemx iso-8859-7-dos
|
|
1633 @itemx iso-8859-7-mac
|
|
1634 @itemx iso-8859-7-unix
|
|
1635
|
|
1636 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with
|
|
1637 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked.
|
|
1638
|
|
1639 @item iso-8859-8
|
|
1640 @itemx iso-8859-8-dos
|
|
1641 @itemx iso-8859-8-mac
|
|
1642 @itemx iso-8859-8-unix
|
|
1643
|
|
1644 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with
|
|
1645 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked.
|
|
1646
|
|
1647 @item iso-8859-9
|
|
1648 @itemx iso-8859-9-dos
|
|
1649 @itemx iso-8859-9-mac
|
|
1650 @itemx iso-8859-9-unix
|
|
1651
|
|
1652 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system
|
|
1653 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially
|
|
1654 invoked.
|
|
1655
|
|
1656 @item koi8-r
|
|
1657 @itemx koi8-r-dos
|
|
1658 @itemx koi8-r-mac
|
|
1659 @itemx koi8-r-unix
|
|
1660
|
|
1661 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for
|
|
1662 KOI8-R, an encoding of the Cyrillic alphabet.
|
|
1663
|
|
1664 @item shift_jis
|
|
1665 @itemx shift_jis-dos
|
|
1666 @itemx shift_jis-mac
|
|
1667 @itemx shift_jis-unix
|
|
1668
|
|
1669 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system
|
|
1670 implementing the Shift-JIS encoding for Japanese. The underscore is to
|
|
1671 conform to the MIME charset implementing this encoding.
|
|
1672
|
|
1673 @item tis-620
|
|
1674 @itemx tis-620-dos
|
|
1675 @itemx tis-620-mac
|
|
1676 @itemx tis-620-unix
|
|
1677
|
|
1678 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The
|
|
1679 external encoding is defined by TIS620, the internal encoding is
|
|
1680 peculiar to MULE, and called @code{thai-xtis}.
|
|
1681
|
|
1682 @item viqr
|
|
1683
|
|
1684 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding
|
|
1685 system with Unix EOL convention (ie, no conversion) using
|
|
1686 post-read-decode and pre-write-encode functions to translate the VIQR
|
|
1687 coding system for Vietnamese.
|
|
1688
|
|
1689 @item viscii
|
|
1690 @itemx viscii-dos
|
|
1691 @itemx viscii-mac
|
|
1692 @itemx viscii-unix
|
|
1693
|
|
1694 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used
|
|
1695 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
|
|
1696 given priority by XEmacs.
|
|
1697
|
|
1698 @item vscii
|
|
1699 @itemx vscii-dos
|
|
1700 @itemx vscii-mac
|
|
1701 @itemx vscii-unix
|
|
1702
|
|
1703 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used
|
|
1704 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
|
|
1705 given priority by XEmacs. Use
|
|
1706 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII.
|
|
1707
|
|
1708 @end table
|
|
1709
|
428
|
1710 @node CCL, Category Tables, Coding Systems, MULE
|
|
1711 @section CCL
|
|
1712
|
442
|
1713 CCL (Code Conversion Language) is a simple structured programming
|
428
|
1714 language designed for character coding conversions. A CCL program is
|
|
1715 compiled to CCL code (represented by a vector of integers) and executed
|
|
1716 by the CCL interpreter embedded in Emacs. The CCL interpreter
|
|
1717 implements a virtual machine with 8 registers called @code{r0}, ...,
|
|
1718 @code{r7}, a number of control structures, and some I/O operators. Take
|
|
1719 care when using registers @code{r0} (used in implicit @dfn{set}
|
|
1720 statements) and especially @code{r7} (used internally by several
|
444
|
1721 statements and operations, especially for multiple return values and I/O
|
428
|
1722 operations).
|
|
1723
|
442
|
1724 CCL is used for code conversion during process I/O and file I/O for
|
428
|
1725 non-ISO2022 coding systems. (It is the only way for a user to specify a
|
|
1726 code conversion function.) It is also used for calculating the code
|
|
1727 point of an X11 font from a character code. However, since CCL is
|
|
1728 designed as a powerful programming language, it can be used for more
|
|
1729 generic calculation where efficiency is demanded. A combination of
|
|
1730 three or more arithmetic operations can be calculated faster by CCL than
|
|
1731 by Emacs Lisp.
|
|
1732
|
442
|
1733 @strong{Warning:} The code in @file{src/mule-ccl.c} and
|
428
|
1734 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
|
|
1735 description of CCL's semantics. The previous version of this section
|
|
1736 contained several typos and obsolete names left from earlier versions of
|
|
1737 MULE, and many may remain. (I am not an experienced CCL programmer; the
|
|
1738 few who know CCL well find writing English painful.)
|
|
1739
|
442
|
1740 A CCL program transforms an input data stream into an output data
|
428
|
1741 stream. The input stream, held in a buffer of constant bytes, is left
|
|
1742 unchanged. The buffer may be filled by an external input operation,
|
|
1743 taken from an Emacs buffer, or taken from a Lisp string. The output
|
|
1744 buffer is a dynamic array of bytes, which can be written by an external
|
|
1745 output operation, inserted into an Emacs buffer, or returned as a Lisp
|
|
1746 string.
|
|
1747
|
442
|
1748 A CCL program is a (Lisp) list containing two or three members. The
|
428
|
1749 first member is the @dfn{buffer magnification}, which indicates the
|
|
1750 required minimum size of the output buffer as a multiple of the input
|
|
1751 buffer. It is followed by the @dfn{main block} which executes while
|
|
1752 there is input remaining, and an optional @dfn{EOF block} which is
|
|
1753 executed when the input is exhausted. Both the main block and the EOF
|
|
1754 block are CCL blocks.
|
|
1755
|
442
|
1756 A @dfn{CCL block} is either a CCL statement or list of CCL statements.
|
444
|
1757 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
|
428
|
1758 or an @dfn{assignment}, which is a list of a register to receive the
|
444
|
1759 assignment, an assignment operator, and an expression) or a @dfn{control
|
428
|
1760 statement} (a list starting with a keyword, whose allowable syntax
|
|
1761 depends on the keyword).
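
To make the structure concrete, here is a minimal sketch of a complete
CCL program written as a quoted Lisp list.  It is an illustration
assembled from the description above, not a program shipped with XEmacs.

@example
'(1                       ; buffer magnification
  ((loop                  ; main block: copy each input byte
     (read r0)
     (write r0)
     (repeat)))
  ((write "\n")))         ; optional EOF block: append a newline
@end example

The magnification of 1 asks for an output buffer at least as large as
the input buffer.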
|
|
1762
|
|
1763 @menu
|
|
1764 * CCL Syntax:: CCL program syntax in BNF notation.
|
|
1765 * CCL Statements:: Semantics of CCL statements.
|
|
1766 * CCL Expressions:: Operators and expressions in CCL.
|
|
1767 * Calling CCL:: Running CCL programs.
|
2640
|
1768 * CCL Example:: A trivial program to transform the Web's URL encoding.
|
428
|
1769 @end menu
|
|
1770
|
442
|
1771 @node CCL Syntax, CCL Statements, , CCL
|
428
|
1772 @comment Node, Next, Previous, Up
|
|
1773 @subsection CCL Syntax
|
|
1774
|
442
|
1775 The full syntax of a CCL program in BNF notation:
|
428
|
1776
|
|
1777 @format
|
|
1778 CCL_PROGRAM :=
|
|
1779 (BUFFER_MAGNIFICATION
|
|
1780 CCL_MAIN_BLOCK
|
|
1781 [ CCL_EOF_BLOCK ])
|
|
1782
|
|
1783 BUFFER_MAGNIFICATION := integer
|
|
1784 CCL_MAIN_BLOCK := CCL_BLOCK
|
|
1785 CCL_EOF_BLOCK := CCL_BLOCK
|
|
1786
|
|
1787 CCL_BLOCK :=
|
|
1788 STATEMENT | (STATEMENT [STATEMENT ...])
|
|
1789 STATEMENT :=
|
2367
|
1790 SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE | CALL
|
|
1791 | TRANSLATE | MAP | END
|
428
|
1792
|
|
1793 SET :=
|
|
1794 (REG = EXPRESSION)
|
|
1795 | (REG ASSIGNMENT_OPERATOR EXPRESSION)
|
2367
|
1796 | INT-OR-CHAR
|
428
|
1797
|
|
1798 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
|
|
1799
|
|
1800 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
|
|
1801 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
|
|
1802 LOOP := (loop STATEMENT [STATEMENT ...])
|
|
1803 BREAK := (break)
|
|
1804 REPEAT :=
|
|
1805 (repeat)
|
2367
|
1806 | (write-repeat [REG | INT-OR-CHAR | string])
|
|
1807 | (write-read-repeat REG [INT-OR-CHAR | ARRAY])
|
428
|
1808 READ :=
|
|
1809 (read REG ...)
|
2367
|
1810 | (read-if (REG OPERATOR ARG) CCL_BLOCK [CCL_BLOCK])
|
428
|
1811 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
|
|
1812 WRITE :=
|
|
1813 (write REG ...)
|
|
1814 | (write EXPRESSION)
|
2367
|
1815 | (write INT-OR-CHAR) | (write string) | (write REG ARRAY)
|
428
|
1816 | string
|
|
1817 CALL := (call ccl-program-name)
|
|
1818 END := (end)
|
|
1819
|
|
1820 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
|
2367
|
1821 ARG := REG | INT-OR-CHAR
|
428
|
1822 OPERATOR :=
|
|
1823 + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
|
|
1824 | < | > | == | <= | >= | != | de-sjis | en-sjis
|
|
1825 ASSIGNMENT_OPERATOR :=
|
|
1826 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
|
2367
|
1827 ARRAY := '[' INT-OR-CHAR ... ']'
|
|
1828 INT-OR-CHAR := integer | character
|
|
1829
|
428
|
1830 @end format
|
|
1831
|
|
1832 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
|
|
1833 @comment Node, Next, Previous, Up
|
|
1834 @subsection CCL Statements
|
|
1835
|
442
|
1836 The Emacs Code Conversion Language provides the following statement
|
428
|
1837 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
|
|
1838 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
|
|
1839
|
|
1840 @heading Set statement:
|
|
1841
|
442
|
1842 The @dfn{set} statement has three variants with the syntaxes
|
428
|
1843 @samp{(@var{reg} = @var{expression})},
|
|
1844 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
|
|
1845 @samp{@var{integer}}. The assignment operator variation of the
|
|
1846 @dfn{set} statement works the same way as the corresponding C expression
|
|
1847 statement does. The assignment operators are @code{+=}, @code{-=},
|
|
1848 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
|
|
1849 @code{<<=}, and @code{>>=}, and they have the same meanings as in C. A
|
|
1850 "naked integer" @var{integer} is equivalent to a @dfn{set} statement of
|
|
1851 the form @code{(r0 = @var{integer})}.
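
A short sketch of the three variants in one CCL block (the register
choices and values are arbitrary):

@example
((r0 = 65)          ; plain assignment: r0 is 65
 (r0 += 1)          ; assignment operator, as in C: r0 is 66
 (r1 = (r0 << 8))   ; expression on the right: r1 is 16896
 0)                 ; naked integer: same as (r0 = 0)
@end example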
|
|
1852
|
|
1853 @heading I/O statements:
|
|
1854
|
442
|
1855 The @dfn{read} statement takes one or more registers as arguments. It
|
444
|
1856 reads one byte (a C char) from the input into each register in turn.
|
428
|
1857
|
442
|
1858 The @dfn{write} statement takes several forms.  In the form @samp{(write @var{reg}
|
428
|
1859 ...)} it takes one or more registers as arguments and writes each in
|
|
1860 turn to the output. The integer in a register (interpreted as an
|
2367
|
1861 Ichar) is encoded to multibyte form (ie, Ibytes) and written to the
|
428
|
1862 current output buffer. If it is less than 256, it is written as is.
|
|
1863 The forms @samp{(write @var{expression})} and @samp{(write
|
|
1864 @var{integer})} are treated analogously. The form @samp{(write
|
|
1865 @var{string})} writes the constant string to the output. A
|
|
1866 "naked string" @samp{@var{string}} is equivalent to the statement @samp{(write
|
|
1867 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
|
|
1868 the @var{reg}th element of the @var{array} to the output.
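
The following block sketches each I/O form.  It assumes that array
elements are counted from zero; the register choices are arbitrary.

@example
((read r0 r1)               ; two input bytes into r0, then r1
 (write r1 r0)              ; write them in the opposite order
 (write "--")               ; write a constant string
 (r2 = 1)
 (write r2 [120 121 122]))  ; write element r2 of the array
@end example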
|
|
1869
|
|
1870 @heading Conditional statements:
|
|
1871
|
442
|
1872 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
|
428
|
1873 an optional @var{second CCL block} as arguments. If the
|
|
1874 @var{expression} evaluates to non-zero, the first @var{CCL block} is
|
|
1875 executed. Otherwise, if there is a @var{second CCL block}, it is
|
|
1876 executed.
|
|
1877
|
442
|
1878 The @dfn{read-if} variant of the @dfn{if} statement takes an
|
428
|
1879 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
|
|
1880 block} as arguments. The @var{expression} must have the form
|
|
1881 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
|
|
1882 a register or an integer). The @code{read-if} statement first reads
|
|
1883 from the input into the first register operand in the @var{expression},
|
|
1884 then conditionally executes a CCL block just as the @code{if} statement
|
|
1885 does.
|
|
1886
|
442
|
1887 The @dfn{branch} statement takes an @var{expression} and one or more CCL
|
428
|
1888 blocks as arguments. The CCL blocks are treated as a zero-indexed
|
|
1889 array, and the @code{branch} statement uses the @var{expression} as the
|
|
1890 index of the CCL block to execute. Null CCL blocks may be used as
|
|
1891 no-ops, continuing execution with the statement following the
|
|
1892 @code{branch} statement in the containing CCL block. Out-of-range
|
444
|
1893 values for the @var{expression} are also treated as no-ops.
|
428
|
1894
|
442
|
1895 The @dfn{read-branch} variant of the @dfn{branch} statement takes a
|
428
|
1896 @var{register} and one or more CCL blocks as arguments. The
|
|
1897 @code{read-branch} statement first reads from
|
|
1898 the input into the @var{register}, then conditionally executes a CCL
|
|
1899 block just as the @code{branch} statement does.
|
|
1900
|
|
1901 @heading Loop control statements:
|
|
1902
|
442
|
1903 The @dfn{loop} statement creates a block with an implied jump from the
|
444
|
1904 end of the block back to its head. The loop is exited on a @code{break}
|
428
|
1905 statement, and continued without executing the tail by a @code{repeat}
|
|
1906 statement.
|
|
1907
|
442
|
1908 The @dfn{break} statement, written @samp{(break)}, terminates the
|
428
|
1909 current loop and continues with the next statement in the current
|
444
|
1910 block.
|
428
|
1911
|
442
|
1912 The @dfn{repeat} statement has three variants, @code{repeat},
|
428
|
1913 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
|
|
1914 current loop from its head, possibly after performing I/O.
|
|
1915 @code{repeat} takes no arguments and does no I/O before jumping.
|
444
|
1916 @code{write-repeat} takes a single argument (a register, an
|
428
|
1917 integer, or a string), writes it to the output, then jumps.
|
|
1918 @code{write-read-repeat} takes one or two arguments. The first must
|
|
1919 be a register. The second may be an integer or an array; if absent, it
|
|
1920 is implicitly set to the first (register) argument.
|
|
1921 @code{write-read-repeat} writes its second argument to the output, then
|
|
1922 reads from the input into the register, and finally jumps. See the
|
|
1923 @code{write} and @code{read} statements for the semantics of the I/O
|
|
1924 operations for each type of argument.
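
As an illustration of the loop statements above, here is a minimal
sketch of a CCL program, held as quoted Lisp data, that copies its input
to its output. The variable name @code{my-ccl-copy-program} is
hypothetical; the leading @samp{1} is the buffer magnification. The
sketch assumes that the program simply ends once the input is exhausted.

@example
(defvar my-ccl-copy-program
  '(1
    ((read r0)
     (loop
       ;; Write r0, read the next input unit into r0, then jump back
       ;; to the head of the loop.
       (write-read-repeat r0))))
  "Hypothetical CCL source: copy each input unit to the output unchanged.")
@end example

A form like this is source text only; it would normally be handed to
@code{ccl-compile} or wrapped in @code{define-ccl-program} before use.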
|
|
1925
|
|
1926 @heading Other control statements:
|
|
1927
|
442
|
1928 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
|
428
|
1929 executes a CCL program as a subroutine. It does not return a value to
|
|
1930 the caller, but can modify the register status.
|
|
1931
|
442
|
1932 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
|
428
|
1933 program successfully, and returns to caller (which may be a CCL
|
|
1934 program). It does not alter the status of the registers.
|
|
1935
|
|
1936 @node CCL Expressions, Calling CCL, CCL Statements, CCL
|
|
1937 @comment Node, Next, Previous, Up
|
|
1938 @subsection CCL Expressions
|
|
1939
|
442
|
1940 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
|
428
|
1941 consist of a single @var{operand}, either a register (one of @code{r0},
|
|
1942 ..., @code{r7}) or an integer. Complex expressions are lists of the
|
|
1943 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
|
|
1944 C, assignments are not expressions.
|
|
1945
|
442
|
1946 In the following table, @var{X} is the target register for a @dfn{set}.
|
428
|
1947 In subexpressions, this is implicitly @code{r7}. This means that
|
|
1948 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
|
|
1949 freely in subexpressions, since they return parts of their values in
|
|
1950 @code{r7}. @var{Y} may be an expression, register, or integer, while
|
|
1951 @var{Z} must be a register or an integer.
|
|
1952
|
|
1953 @multitable @columnfractions .22 .14 .09 .55
|
|
1954 @item Name @tab Operator @tab Code @tab C-like Description
|
|
1955 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
|
|
1956 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
|
|
1957 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
|
|
1958 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
|
|
1959 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
|
|
1960 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
|
|
1961 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
|
|
1962 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
|
|
1963 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
|
|
1964 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
|
|
1965 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
|
|
1966 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
|
|
1967 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
|
|
1968 @item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
|
|
1969 @item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
|
|
1970 @item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
|
|
1971 @item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
|
|
1972 @item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
|
|
1973 @item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
|
|
1974 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
|
|
1975 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
|
|
1976 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
|
|
1977 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
|
|
1978 @end multitable
|
|
1979
|
442
|
1980 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
|
428
|
1981 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
|
|
1982 and CCL_DECODE_SJIS treat their first and second operands as the high and
|
|
1983 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an
|
|
1984 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
|
|
1985 complicated transformation of the Japanese standard JIS encoding to
|
|
1986 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
|
|
1987 represent the SJIS operations in infix form.
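
The three non-C arithmetic operators in the table can be modelled in
plain Lisp. This is only a sketch of the arithmetic described above for
non-negative operands, not the interpreter's code; the function names
are hypothetical, and the second value of each cons stands for what CCL
leaves in @code{r7}.

@example
(defun my-ccl-model-lsh8 (y z)
  "Model of the CCL `<8' operator: (Y << 8) | Z."
  (logior (ash y 8) z))

(defun my-ccl-model-rsh8 (y)
  "Model of the CCL `>8' operator.
Return a cons of the result (Y >> 8) and the r7 value (Y & 0xFF)."
  (cons (ash y -8) (logand y 255)))

(defun my-ccl-model-divmod (y z)
  "Model of the CCL `//' operator.
Return a cons of the quotient Y / Z and the r7 value Y % Z."
  (cons (/ y z) (% y z)))

(my-ccl-model-lsh8 #x12 #x34)    ; #x1234
(my-ccl-model-rsh8 #x1234)       ; (#x12 . #x34)
(my-ccl-model-divmod 17 5)       ; (3 . 2)
@end example

Because @code{>8} and @code{//} deliver half of their result through
@code{r7}, the models return a pair where CCL would overwrite a register.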
|
|
1988
|
2640
|
1989 @node Calling CCL, CCL Example, CCL Expressions, CCL
|
428
|
1990 @comment Node, Next, Previous, Up
|
|
1991 @subsection Calling CCL
|
|
1992
|
442
|
1993 CCL programs are called automatically during Emacs buffer I/O when the
|
428
|
1994 external representation has a coding system type of @code{shift-jis},
|
|
1995 @code{big5}, or @code{ccl}. The program is specified by the coding
|
|
1996 system (@pxref{Coding Systems}). You can also call CCL programs from
|
|
1997 other CCL programs, and from Lisp using these functions:
|
|
1998
|
|
1999 @defun ccl-execute ccl-program status
|
|
2000 Execute @var{ccl-program} with registers initialized by
|
|
2001 @var{status}. @var{ccl-program} is a vector of compiled CCL code
|
444
|
2002 created by @code{ccl-compile}. It is an error for the program to try to
|
428
|
2003 execute a CCL I/O command. @var{status} must be a vector of nine
|
|
2004 values, specifying the initial value for the R0, R1 .. R7 registers and
|
|
2005 for the instruction counter IC. A @code{nil} value for a register
|
|
2006 initializer causes the register to be set to 0. A @code{nil} value for
|
|
2007 the IC initializer causes execution to start at the beginning of the
|
|
2008 program. When the program is done, @var{status} is modified (by
|
|
2009 side-effect) to contain the ending values for the corresponding
|
444
|
2010 registers and IC.
|
428
|
2011 @end defun
|
|
2012
|
444
|
2013 @defun ccl-execute-on-string ccl-program status string &optional continue
|
428
|
2014 Execute @var{ccl-program} with initial @var{status} on
|
|
2015 @var{string}. @var{ccl-program} is a vector of compiled CCL code
|
|
2016 created by @code{ccl-compile}. @var{status} must be a vector of nine
|
|
2017 values, specifying the initial value for the R0, R1 .. R7 registers and
|
|
2018 for the instruction counter IC. A @code{nil} value for a register
|
|
2019 initializer causes the register to be set to 0. A @code{nil} value for
|
|
2020 the IC initializer causes execution to start at the beginning of the
|
444
|
2021 program. An optional fourth argument @var{continue}, if non-@code{nil}, causes
|
428
|
2022 the IC to
|
|
2023 remain on the unsatisfied read operation if the program terminates due
|
|
2024 to exhaustion of the input buffer. Otherwise the IC is set to the end
|
444
|
2025 of the program. When the program is done, @var{status} is modified (by
|
428
|
2026 side-effect) to contain the ending values for the corresponding
|
|
2027 registers and IC. Returns the resulting string.
|
|
2028 @end defun
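
A call to @code{ccl-execute-on-string} might look like the following
sketch. The program name @code{my-ccl-plus-to-space} is hypothetical,
and the sketch assumes that @code{define-ccl-program} (used in the
example later in this chapter) leaves the compiled vector as the value
of that variable. The status vector has nine @code{nil} entries, so all
registers start at 0 and execution starts at the beginning of the
program.

@example
(define-ccl-program my-ccl-plus-to-space
  '(1
    ((read r0)
     (loop
       ;; 43 is ASCII `+', 32 is the ASCII space.
       (if (r0 == 43)
           ((r0 = 32)))
       (write-read-repeat r0))))
  "Hypothetical CCL program: replace each `+' in the input with a space.")

(ccl-execute-on-string my-ccl-plus-to-space
                       (make-vector 9 nil)
                       "a+b+c")
@end example

Passing a fresh vector for each call keeps the ending register values of
one run from leaking into the next, since the vector is modified by
side-effect.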
|
|
2029
|
442
|
2030 To call a CCL program from another CCL program, it must first be
|
428
|
2031 registered:
|
|
2032
|
|
2033 @defun register-ccl-program name ccl-program
|
444
|
2034 Register @var{name} for CCL program @var{ccl-program} in
|
|
2035 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of
|
|
2036 a CCL program, or @code{nil}. Return index number of the registered CCL
|
428
|
2037 program.
|
|
2038 @end defun
|
|
2039
|
442
|
2040 Information about the processor time used by the CCL interpreter can be
|
428
|
2041 obtained using these functions:
|
|
2042
|
|
2043 @defun ccl-elapsed-time
|
|
2044 Returns the elapsed processor time of the CCL interpreter as a cons of
|
|
2045 user and system time, as
|
|
2046 floating point numbers measured in seconds. If only one
|
|
2047 overall value can be determined, the return value will be a cons of that
|
|
2048 value and 0.
|
|
2049 @end defun
|
|
2050
|
|
2051 @defun ccl-reset-elapsed-time
|
|
2052 Resets the CCL interpreter's internal elapsed time registers.
|
|
2053 @end defun
|
|
2054
|
2640
|
2055 @node CCL Example, , Calling CCL, CCL
|
428
|
2056 @comment Node, Next, Previous, Up
|
2640
|
2057 @subsection CCL Example
|
|
2058
|
|
2059 In this section, we describe the implementation of a trivial coding
|
|
2060 system to transform from the Web's URL encoding to XEmacs' internal
|
|
2061 coding. Many people will have been first exposed to URL encoding when
|
|
2062 they saw ``%20'' where they expected a space in a file's name on their
|
|
2063 local hard disk; this can happen when a browser saves a file from the
|
|
2064 web and doesn't encode the name, as passed from the server, properly.
|
|
2065
|
|
2066 URL encoding itself is underspecified with regard to encodings beyond
|
|
2067 ASCII. The relevant document, RFC 1738, explicitly doesn't give any
|
|
2068 information on how to encode non-ASCII characters, and the ``obvious''
|
|
2069 way---use the %xx values for the octets of the eight bit MIME character
|
|
2070 set in which the page was served---breaks when a user types a character
|
|
2071 outside that character set. Best practice for web development is to
|
|
2072 serve all pages as UTF-8 and treat incoming form data as using that
|
|
2073 coding system. (Oh, and gamble that your clients won't ever want to
|
|
2074 type anything outside Unicode. But that's not so much of a gamble with
|
|
2075 today's client operating systems.) We don't treat non-ASCII in this
|
|
2076 example, as dealing with @samp{(read-multibyte-character ...)} and
|
|
2077 errors therewith would make it much harder to understand.
|
|
2078
|
|
2079 Since CCL isn't a very rich language, we move much of the logic that
|
|
2080 would ordinarily be computed from operations like @code{(member ..)},
|
|
2081 @code{(and ...)} and @code{(or ...)} into tables, from which register
|
|
2082 values are read and written, and on which @code{if} statements are
|
|
2083 predicated. Much more of the implementation of this coding system is
|
|
2084 occupied with constructing these tables---in normal Emacs Lisp---than it
|
|
2085 is with actual CCL code.
|
|
2086
|
|
2087 All the @code{defvar} statements we deal with in the next few sections
|
|
2088 are surrounded by a @code{(eval-and-compile ...)}, which means that the
|
|
2089 logic which initializes these variables executes at compile time, and if
|
|
2090 XEmacs loads the compiled version of the file, these variables are
|
|
2091 initialized as constants.
|
|
2092
|
|
2093 @menu
|
|
2094 * Four bits to ASCII:: Two tables used for getting hex digits from ASCII.
|
|
2095 * URI Encoding constants:: Useful predefined characters.
|
|
2096 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL.
|
|
2097 * Characters to be preserved:: No transformation needed for these characters.
|
|
2098 * The program to decode to internal format:: .
|
|
2099 * The program to encode from internal format:: .
|
2690
|
2100 * The actual coding system:: .
|
2640
|
2101 @end menu
|
|
2102
|
|
2103 @node Four bits to ASCII, URI Encoding constants, , CCL Example
|
|
2104 @subsubsection Four bits to ASCII
|
|
2105
|
|
2106 The first @code{defvar} is for
|
|
2107 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that
|
|
2108 maps from an octet's value to the ASCII encoding for the hex value of
|
|
2109 its most significant four bits. That might sound complex, but it isn't;
|
|
2110 for decimal 65, hex value @samp{#x41}, the entry in the table is the
|
|
2111 ASCII encoding of `4'. For decimal 122, ASCII `z', hex value
|
|
2112 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)}
|
|
2113 after this file is loaded gives the ASCII encoding of `7'.
|
|
2114
|
|
2115 @example
|
|
2116 (defvar url-coding-high-order-nybble-as-ascii
|
|
2117 (let ((val (make-vector 256 0))
|
|
2118 (i 0))
|
|
2119 (while (< i (length val))
|
2690
|
2120 (aset val i (char-to-int (aref (format "%02X" i) 0)))
|
2640
|
2121 (setq i (1+ i)))
|
|
2122 val)
|
|
2123 "Table to find an ASCII version of an octet's most significant 4 bits.")
|
|
2124 @end example
|
|
2125
|
|
2126 The next table, @code{url-coding-low-order-nybble-as-ascii} is almost
|
|
2127 the same thing, but this time it has a map for the hex encoding of the
|
2690
|
2128 low-order four bits. So the entry at offset 65 (@samp{#x41}) is
|
2640
|
2129 the ASCII encoding of `1', and the entry at offset 122
|
|
2130 (@samp{#x7a}) is the ASCII encoding of `A'.
|
|
2131
|
|
2132 @example
|
|
2133 (defvar url-coding-low-order-nybble-as-ascii
|
|
2134 (let ((val (make-vector 256 0))
|
|
2135 (i 0))
|
|
2136 (while (< i (length val))
|
2690
|
2137 (aset val i (char-to-int (aref (format "%02X" i) 1)))
|
2640
|
2138 (setq i (1+ i)))
|
|
2139 val)
|
|
2140 "Table to find an ASCII version of an octet's least significant 4 bits.")
|
|
2141 @end example
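
To see what the two tables hold, the same digits can be computed
directly with shifts and masks. The following is a sketch in ordinary
Lisp, with a hypothetical function name; it is the arithmetic the tables
spare the CCL program from doing, not part of the coding system itself.

@example
(defun my-url-coding-octet-to-hex-digits (octet)
  "Return a two-character string: OCTET's high and low nybbles in hex."
  (let ((digits "0123456789ABCDEF"))
    (format "%c%c"
            ;; The most significant four bits select the first digit.
            (aref digits (logand (ash octet -4) 15))
            ;; The least significant four bits select the second digit.
            (aref digits (logand octet 15)))))

(my-url-coding-octet-to-hex-digits #x41)    ; "41"
(my-url-coding-octet-to-hex-digits #x7a)    ; "7A"
@end example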
|
|
2142
|
|
2143 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example
|
|
2144 @subsubsection URI Encoding constants
|
|
2145
|
|
2146 Next, we have a couple of variables that make the CCL code more
|
|
2147 readable. The first is the ASCII encoding of the percentage sign; this
|
|
2148 character is used as an escape code, to start the encoding of a
|
|
2149 non-printable character. For historical reasons, URL encoding allows
|
|
2150 the space character to be encoded as a plus sign--it does make typing
|
|
2151 URLs like @samp{http://google.com/search?q=XEmacs+home+page} easier--and
|
|
2152 as such, we have to check when decoding for this value, and map it to
|
|
2153 the space character. When doing this in CCL, we use the
|
|
2154 @code{url-coding-escaped-space-code} variable.
|
|
2155
|
|
2156 @example
|
2690
|
2157 (defvar url-coding-escape-character-code (char-to-int ?%)
|
2640
|
2158 "The code point for the percentage sign, in ASCII.")
|
|
2159
|
2690
|
2160 (defvar url-coding-escaped-space-code (char-to-int ?+)
|
2640
|
2161 "The URL-encoded value of the space character, that is, +.")
|
|
2162 @end example
|
|
2163
|
2690
|
2164 @node Numeric to ASCII-hexadecimal conversion, Characters to be preserved, URI Encoding constants, CCL Example
|
2640
|
2165 @subsubsection Numeric to ASCII-hexadecimal conversion
|
|
2166
|
|
2167 Now, we have a couple of utility tables that wouldn't be necessary in
|
|
2168 a more expressive programming language than is CCL. The first is sixteen
|
|
2169 in length, and maps a hexadecimal number to the ASCII encoding of that
|
|
2170 number; so zero maps to ASCII `0', ten maps to ASCII `A'. The second
|
|
2171 does the reverse; that is, it maps an ASCII character to its value when
|
|
2172 interpreted as a hexadecimal digit. ('A' => 10, 'c' => 12, '2' => 2, as
|
|
2173 a few examples.)
|
|
2174
|
|
2175 @example
|
|
2176 (defvar url-coding-hex-digit-table
|
|
2177 (let ((i 0)
|
|
2178 (val (make-vector 16 0)))
|
|
2179 (while (< i 16)
|
2690
|
2180 (aset val i (char-to-int (aref (format "%X" i) 0)))
|
2640
|
2181 (setq i (1+ i)))
|
|
2182 val)
|
|
2183 "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")
|
|
2184
|
|
2185 (defvar url-coding-latin-1-as-hex-table
|
|
2186 (let ((val (make-vector 256 0))
|
|
2187 (i 0))
|
|
2188 (while (< i (length val))
|
|
2189 ;; Get a hex val for this ASCII character.
|
|
2190 (aset val i (string-to-int (format "%c" i) 16))
|
|
2191 (setq i (1+ i)))
|
|
2192 val)
|
|
2193 "A map from Latin 1 code points to their values as hexadecimal digits.")
|
|
2194 @end example
|
|
2195
|
2690
|
2196 @node Characters to be preserved, The program to decode to internal format, Numeric to ASCII-hexadecimal conversion, CCL Example
|
2640
|
2197 @subsubsection Characters to be preserved
|
|
2198
|
|
2199 And finally, the last of these tables. URL encoding says that
|
|
2200 alphanumeric characters, the underscore, hyphen and the full stop
|
|
2201 @footnote{That's what the standards call it, though my North American
|
|
2202 readers will be more familiar with it as the period character.} retain
|
|
2203 their ASCII encoding, and don't undergo transformation.
|
|
2204 @code{url-coding-should-preserve-table} is an array in which the entries
|
|
2205 are one if the corresponding ASCII character should be left as-is, and
|
|
2206 zero if they should be transformed. So the entries for all the control
|
|
2207 and most of the punctuation characters are zero. Lisp programmers will
|
|
2208 observe that this initialization is particularly inefficient, but
|
|
2209 they'll also be aware that this is a long way from an inner loop where
|
|
2210 every nanosecond counts.
|
|
2211
|
|
2212 @example
|
|
2213 (defvar url-coding-should-preserve-table
|
|
2214 (let ((preserve
|
|
2215 (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o
|
|
2216 ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
|
|
2217 ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
|
|
2218 ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
|
|
2219 (i 0)
|
|
2220 (res (make-vector 256 0)))
|
|
2221 (while (< i 256)
|
|
2222 (when (member (int-char i) preserve)
|
|
2223 (aset res i 1))
|
|
2224 (setq i (1+ i)))
|
|
2225 res)
|
|
2226 "A 256-entry array of flags, indicating whether or not to preserve an
|
|
2227 octet as its ASCII encoding.")
|
|
2228 @end example
|
|
2229
|
2690
|
2230 @node The program to decode to internal format, The program to encode from internal format, Characters to be preserved, CCL Example
|
2640
|
2231 @subsubsection The program to decode to internal format
|
|
2232
|
|
2233 After the almost interminable tables, we get to the CCL. The first
|
|
2234 CCL program, @code{ccl-decode-urlcoding} decodes from the URL coding to
|
|
2235 our internal format; since this version of CCL doesn't have support for
|
|
2236 error checking on the input, we don't do any verification on it.
|
|
2237
|
|
2238 The buffer magnification--approximate ratio of the size of the output
|
|
2239 buffer to the size of the input buffer--is declared as one, because
|
|
2240 fractional values aren't allowed. (Since all those %20's will map to
|
|
2241 ` ', the length of the output text will be less than that of the input
|
|
2242 text.)
|
|
2243
|
|
2244 So, first we read an octet from the input buffer into register
|
|
2245 @samp{r0}, to set up the loop. Next, we start the loop, with a
|
|
2246 @code{(loop ...)} statement, and we check if the value in @samp{r0} is a
|
|
2247 percentage sign. (Note the comma before
|
|
2248 @code{url-coding-escape-character-code}; since CCL is a Lisp macro
|
|
2249 language, we can break out of the macro evaluation with a comma, and as
|
|
2250 such, ``@code{,url-coding-escape-character-code}'' will be evaluated as a
|
|
2251 literal `37'.)
|
|
2252
|
|
2253 If it is a percentage sign, we read the next two octets into @samp{r2}
|
|
2254 and @samp{r3}, and convert them into their hexadecimal numeric values,
|
|
2255 using the @code{url-coding-latin-1-as-hex-table} array declared above.
|
|
2256 (But again, it'll be interpreted as a literal array.) We then left
|
|
2257 shift the first by four bits, bitwise-OR the two together, and write the
|
|
2258 result to the output buffer.
|
|
2259
|
|
2260 If it isn't a percentage sign, and it is a `+' sign, we write a
|
|
2261 space--hexadecimal 20--to the output buffer.
|
|
2262
|
|
2263 If none of those things are true, we pass the octet to the output buffer
|
|
2264 untransformed. (This could be a place to put error checking, in a more
|
|
2265 expressive language.) We then read one more octet from the input
|
|
2266 buffer, and move to the next iteration of the loop.
|
|
2267
|
|
2268 @example
|
|
2269 (define-ccl-program ccl-decode-urlcoding
|
|
2270 `(1
|
|
2271 ((read r0)
|
|
2272 (loop
|
|
2273 (if (r0 == ,url-coding-escape-character-code)
|
|
2274 ((read r2 r3)
|
|
2275 ;; Replace r2 with the value at offset r2 in
|
|
2276 ;; url-coding-latin-1-as-hex-table; likewise for r3.
|
|
2277 (r2 = r2 ,url-coding-latin-1-as-hex-table)
|
|
2278 (r3 = r3 ,url-coding-latin-1-as-hex-table)
|
|
2279 (r2 <<= 4)
|
|
2280 (r3 |= r2)
|
|
2281 (write r3))
|
|
2282 (if (r0 == ,url-coding-escaped-space-code)
|
|
2283 (write #x20)
|
|
2284 (write r0)))
|
|
2285 (read r0)
|
|
2286 (repeat))))
|
|
2287 "CCL program to take URI-encoded ASCII text and transform it to our
|
|
2288 internal encoding. ")
|
|
2289 @end example
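
For cross-checking the CCL program on small inputs, the same decoding
logic can be written in ordinary Lisp. This is a sketch with a
hypothetical function name, restricted to ASCII like the CCL version. It
differs in one assumed detail: a @samp{%} with fewer than two characters
after it is passed through unchanged. Like the CCL version, it does not
validate the hexadecimal digits.

@example
(defun my-url-decode-ascii (string)
  "Decode URL-encoded ASCII STRING; a plain Lisp model of the CCL decoder."
  (let ((i 0)
        (len (length string))
        (result ""))
    (while (< i len)
      (let ((ch (aref string i)))
        (cond
         ;; An escape with two more characters available: decode %XX.
         ((and (eq ch ?%) (< (+ i 2) len))
          (setq result (concat result
                               (format "%c"
                                       (string-to-number
                                        (substring string (+ i 1) (+ i 3))
                                        16)))
                i (+ i 3)))
         ;; A plus sign stands for a space.
         ((eq ch ?+)
          (setq result (concat result " ")
                i (+ i 1)))
         ;; Anything else passes through unchanged.
         (t
          (setq result (concat result (substring string i (+ i 1)))
                i (+ i 1))))))
    result))

(my-url-decode-ascii "a%20b+c")    ; "a b c"
@end example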
|
|
2290
|
2690
|
2291 @node The program to encode from internal format, The actual coding system, The program to decode to internal format, CCL Example
|
2640
|
2292 @subsubsection The program to encode from internal format
|
|
2293
|
|
2294 Next, we see the CCL program to encode ASCII text as URL coded text.
|
|
2295 Here, the buffer magnification is specified as three, to account for ` '
|
|
2296 mapping to %20, etc. As before, we read an octet from the input into
|
|
2297 @samp{r0}, and move into the body of the loop. Next, we check if we
|
|
2298 should preserve the value of this octet, by reading from offset
|
|
2299 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}.
|
|
2300 Then we have an @samp{if} statement predicated on the value in
|
|
2301 @samp{r1}; for the true branch, we write the input octet directly. For
|
|
2302 the false branch, we write a percentage sign, the ASCII encoding of the
|
|
2303 high four bits in hex, and then the ASCII encoding of the low four bits
|
|
2304 in hex.
|
|
2305
|
|
2306 We then read an octet from the input into @samp{r0}, and repeat the loop.
|
|
2307
|
|
2308 @example
|
|
2309 (define-ccl-program ccl-encode-urlcoding
|
|
2310 `(3
|
|
2311 ((read r0)
|
|
2312 (loop
|
|
2313 (r1 = r0 ,url-coding-should-preserve-table)
|
|
2314 ;; If we should preserve the value, just write the octet directly.
|
|
2315 (if r1
|
|
2316 (write r0)
|
|
2317 ;; else, write a percentage sign, and the hex value of the octet, in
|
|
2318 ;; an ASCII-friendly format.
|
|
2319 ((write ,url-coding-escape-character-code)
|
|
2320 (write r0 ,url-coding-high-order-nybble-as-ascii)
|
|
2321 (write r0 ,url-coding-low-order-nybble-as-ascii)))
|
|
2322 (read r0)
|
|
2323 (repeat))))
|
|
2324 "CCL program to encode octets (almost) according to RFC 1738")
|
|
2325 @end example
|
428
|
2326
|
2690
|
2327 @node The actual coding system, , The program to encode from internal format, CCL Example
|
|
2328 @subsubsection The actual coding system
|
|
2329
|
|
2330 To actually create the coding system, we call
|
|
2331 @samp{make-coding-system}. The first argument is the symbol that is to
|
|
2332 be the name of the coding system, in our case @samp{url-coding}. The
|
|
2333 second specifies that the coding system is to be of type
|
|
2334 @samp{ccl}---there are several other coding system types available;
|
|
2335 see the documentation for @samp{make-coding-system} for the
|
|
2336 full list. Then there's a documentation string describing the wherefore
|
|
2337 and caveats of the coding system, and the final argument is a property
|
|
2338 list giving information about the CCL programs and the coding system's
|
|
2339 mnemonic.
|
|
2340
|
|
2341 @example
|
|
2342 (make-coding-system
|
|
2343 'url-coding 'ccl
|
|
2344 "The coding used by application/x-www-form-urlencoded HTTP applications.
|
|
2345 This coding form doesn't specify anything about non-ASCII characters, so
|
|
2346 make sure you've transformed to a seven-bit coding system first."
|
|
2347 '(decode ccl-decode-urlcoding
|
|
2348 encode ccl-encode-urlcoding
|
|
2349 mnemonic "URLenc"))
|
|
2350 @end example
|
|
2351
|
|
2352 If you're lucky, the @samp{url-coding} coding system described here
|
|
2353 should be available in the XEmacs package system. Otherwise, downloading
|
|
2354 it from @samp{http://www.parhasard.net/url-coding.el} should work for
|
|
2355 the foreseeable future.
|
|
2356
|
775
|
2357 @node Category Tables, Unicode Support, CCL, MULE
|
428
|
2358 @section Category Tables
|
|
2359
|
|
2360 A category table is a type of char table used for keeping track of
|
|
2361 categories. Categories are used for classifying characters for use in
|
440
|
2362 regexps---you can refer to a category rather than having to use a
|
428
|
2363 complicated [] expression (and category lookups are significantly
|
|
2364 faster).
|
|
2365
|
|
2366 There are 95 different categories available, one for each printable
|
|
2367 character (including space) in the ASCII charset. Each category is
|
|
2368 designated by one such character, called a @dfn{category designator}.
|
|
2369 They are specified in a regexp using the syntax @samp{\cX}, where X is a
|
|
2370 category designator. (This is not yet implemented.)
|
|
2371
|
|
2372 A category table specifies, for each character, the categories that
|
|
2373 the character is in. Note that a character can be in more than one
|
|
2374 category. More specifically, a category table maps from a character to
|
|
2375 either the value @code{nil} (meaning the character is in no categories)
|
|
2376 or a 95-element bit vector, specifying for each of the 95 categories
|
|
2377 whether the character is in that category.
|
|
2378
|
|
2379 Special Lisp functions are provided that abstract this, so you do not
|
|
2380 have to directly manipulate bit vectors.
|
|
2381
|
444
|
2382 @defun category-table-p object
|
|
2383 This function returns @code{t} if @var{object} is a category table.
|
428
|
2384 @end defun
|
|
2385
|
|
2386 @defun category-table &optional buffer
|
|
2387 This function returns the current category table. This is the one
|
|
2388 specified by the current buffer, or by @var{buffer} if it is
|
|
2389 non-@code{nil}.
|
|
2390 @end defun
|
|
2391
|
|
2392 @defun standard-category-table
|
|
2393 This function returns the standard category table. This is the one used
|
|
2394 for new buffers.
|
|
2395 @end defun
|
|
2396
|
444
|
2397 @defun copy-category-table &optional category-table
|
|
2398 This function returns a new category table which is a copy of
|
|
2399 @var{category-table}, which defaults to the standard category table.
|
428
|
2400 @end defun
|
|
2401
|
444
|
2402 @defun set-category-table category-table &optional buffer
|
|
2403 This function selects @var{category-table} as the new category table for
|
|
2404 @var{buffer}. @var{buffer} defaults to the current buffer if omitted.
|
428
|
2405 @end defun
|
|
2406
|
444
|
2407 @defun category-designator-p object
|
|
2408 This function returns @code{t} if @var{object} is a category designator (a
|
428
|
2409 char in the range @samp{' '} to @samp{'~'}).
|
|
2410 @end defun
|
|
2411
|
444
|
2412 @defun category-table-value-p object
|
|
2413 This function returns @code{t} if @var{object} is a category table value.
|
428
|
2414 Valid values are @code{nil} or a bit vector of size 95.
|
|
2415 @end defun
|
|
2416
|
775
|
2417
|
|
2418 @c Added 2002-03-13 sjt
|
1183
|
2419 @node Unicode Support, Charset Unification, Category Tables, MULE
|
775
|
2420 @section Unicode Support
|
|
2421 @cindex unicode
|
|
2422 @cindex utf-8
|
|
2423 @cindex utf-16
|
|
2424 @cindex ucs-2
|
|
2425 @cindex ucs-4
|
|
2426 @cindex bmp
|
|
2427 @cindex basic multilingual plane
|
|
2428
|
|
2429 Unicode support was added by Ben Wing to XEmacs 21.5.6.
|
|
2430
|
|
2431 @defun set-language-unicode-precedence-list list
|
|
2432 Set the language-specific precedence list used for Unicode decoding.
|
|
2433 This is a list of charsets, which are consulted in order for a translation
|
|
2434 matching a given Unicode character. If no matches are found, the charsets
|
|
2435 in the default precedence list (see
|
|
2436 @code{set-default-unicode-precedence-list}) are consulted, and then all
|
|
2437 remaining charsets, in some arbitrary order.
|
|
2438
|
|
2439 The language-specific precedence list is meant to be set as part of the
|
|
2440 language environment initialization; the default precedence list is meant
|
|
2441 to be set by the user.
|
|
2442 @end defun
|
|
2443
|
|
2444 @defun language-unicode-precedence-list
|
|
2445 Return the language-specific precedence list used for Unicode decoding.
|
|
2446 See @code{set-language-unicode-precedence-list} for more information.
|
|
2447 @end defun
|
|
2448
|
|
2449 @defun set-default-unicode-precedence-list list
|
|
2450 Set the default precedence list used for Unicode decoding.
|
|
2451 This is meant to be set by the user. See
|
|
2452 @code{set-language-unicode-precedence-list} for more information.
|
|
2453 @end defun
|
|
2454
|
|
2455 @defun default-unicode-precedence-list
|
|
2456 Return the default precedence list used for Unicode decoding.
|
|
2457 See @code{set-language-unicode-precedence-list} for more information.
|
|
2458 @end defun
|
|
2459
|
|
2460 @defun set-unicode-conversion character code
|
|
2461 Add conversion information between Unicode codepoints and characters.
|
|
2462 @var{character} is one of the following:
|
|
2463
|
|
2464 @c #### fix this markup
|
|
2465 -- A character (in which case @var{code} must be a non-negative integer)
|
|
2466 -- A vector of characters (in which case @var{code} must be a vector of
|
|
2467 non-negative integers of the same length)
|
|
2468
|
|
2469 Values of @var{code} above 2^20 - 1 are allowed for the purpose of specifying
|
|
2470 private characters, but will cause errors when converted to UTF-16 or UTF-32.
|
|
2471 UCS-4 and UTF-8 can handle values to 2^31 - 1, but XEmacs Lisp integers top
|
|
2472 out at 2^30 - 1.
|
|
2473 @end defun
|
|
2474
|
|
2475 @defun character-to-unicode character
|
|
2476 Convert @var{character} to Unicode codepoint.
|
|
2477 When there is no international support (i.e. MULE is not defined),
|
|
2478 this function simply does @code{char-to-int}.
|
|
2479 @end defun
|
|
2480
|
|
2481 @defun unicode-to-character code &optional charsets
|
|
2482 Convert Unicode codepoint @var{code} to character.
|
|
2483 @var{code} should be a non-negative integer.
|
|
2484 If @var{charsets} is given, it should be a list of charsets, and only those
|
|
2485 charsets will be consulted, in the given order, for a translation.
|
|
2486 Otherwise, the default ordering of all charsets will be given (see
|
|
2487 @code{set-unicode-charset-precedence}).
|
|
2488
|
|
2489 When there is no international support (i.e. MULE is not defined),
|
|
2490 this function simply does @code{int-to-char} and ignores the
|
|
2491 @var{charsets} argument.
|
|
2492 @end defun
|
|
2493
|
|
2494 @defun parse-unicode-translation-table filename charset start end offset flags
|
|
2495 Parse Unicode translation data in @var{filename} for MULE @var{charset}.
|
|
2496 Data is text, in the form of one translation per line -- charset
|
|
2497 codepoint followed by Unicode codepoint. Numbers are decimal or hex
|
|
2498 (preceded by 0x). Comments are marked with a #. Charset codepoints
|
|
2499 for two-dimensional charsets should have the first octet stored in the
|
|
2500 high 8 bits of the hex number and the second in the low 8 bits.
|
|
2501
|
|
2502 If @var{start} and @var{end} are given, only charset codepoints within
|
|
2503 the given range will be processed. If @var{offset} is given, that value
|
|
2504 will be added to all charset codepoints in the file to obtain the
|
|
2505 internal charset codepoint. @var{start} and @var{end} apply to the
|
|
2506 codepoints in the file, before @var{offset} is applied.
|
|
2507
|
|
2508 (Note that, as usual, we assume that octets are in the range 32 to
|
|
2509 127 or 33 to 126. If you have a table in kuten form, with octets in
|
|
2510 the range 1 to 94, you will have to use an offset of 8224,
|
|
2511 i.e. 0x2020.)
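
The kuten offset follows from adding 32 (0x20) to each of the two octets.
A short Python sketch checks the arithmetic; @code{kuten_to_codepoint} is
a hypothetical helper introduced only for this illustration.

```python
# Sketch only: kuten_to_codepoint is a hypothetical helper used to
# check the arithmetic; it is not an XEmacs function.

KUTEN_OFFSET = 0x2020  # 32 added to each of the two octets


def kuten_to_codepoint(ku, ten):
    """Pack a kuten pair (each 1..94) into a 16-bit charset code point
    whose octets lie in the range 33..126."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must be in the range 1 to 94")
    return ((ku << 8) | ten) + KUTEN_OFFSET


first = kuten_to_codepoint(1, 1)      # lowest cell
last = kuten_to_codepoint(94, 94)     # highest cell
```

In decimal, 0x2020 is 8224; the lowest cell maps to 0x2121 and the
highest to 0x7E7E.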
|
|
2512
|
|
2513 @var{flags}, if specified, control further how the tables are interpreted
|
|
2514 and are used to special-case certain known table weirdnesses in the
|
|
2515 Unicode tables:
|
|
2516
|
|
2517 @table @code
|
|
2518 @item ignore-first-column
|
|
2519 Exactly as it sounds. The JIS X 0208 tables have 3 columns of data instead
|
|
2520 of 2; the first is the Shift-JIS codepoint.
|
|
2521
|
|
2522 @item big5
|
|
2523 The charset codepoint is a Big Five codepoint; convert it to the
|
|
2524 proper hacked-up codepoint in @code{chinese-big5-1} or @code{chinese-big5-2}.
|
|
2525 @end table
|
|
2526 @end defun
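
The table format described for @code{parse-unicode-translation-table}
can be sketched as a stand-alone reader. The Python below is an
illustration under assumptions, not the XEmacs implementation:
@code{parse_translation_lines} and @code{parse_number} are hypothetical
names, the range test is assumed to be inclusive, and only the
@code{ignore-first-column} flag is modelled.

```python
# Sketch only: a stand-alone reader for the table format described
# above.  The function names are hypothetical; the real parsing
# happens inside XEmacs.

def parse_number(text):
    """Parse a decimal number or a hex number preceded by 0x."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)


def parse_translation_lines(lines, start=None, end=None, offset=0,
                            ignore_first_column=False):
    """Return a list of (charset_codepoint, unicode_codepoint) pairs."""
    pairs = []
    for line in lines:
        fields = line.split("#", 1)[0].split()  # drop comments
        if ignore_first_column:
            fields = fields[1:]
        if len(fields) < 2:
            continue  # blank, comment-only or incomplete line
        charset_code = parse_number(fields[0])
        unicode_code = parse_number(fields[1])
        # start and end are compared before the offset is applied
        # (assumed inclusive here)
        if start is not None and charset_code < start:
            continue
        if end is not None and charset_code > end:
            continue
        pairs.append((charset_code + offset, unicode_code))
    return pairs


sample = [
    "# charset  unicode",
    "0xA4 0x20AC  # EURO SIGN",
    "166 0x0160",
    "",
]
table = parse_translation_lines(sample)
```

A table in kuten form would be read with @code{offset=0x2020}, and a
three-column JIS X 0208 table with @code{ignore_first_column=True}.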
|
|
2527
|
1183
|
2528
|
|
2529 @node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE
|
|
2530 @section Character Set Unification
|
|
2531
|
|
2532 Mule suffers from a design defect that causes it to consider the ISO
|
|
2533 Latin character sets to be disjoint. This results in oddities such as
|
|
2534 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
|
|
2535 2022 control sequences to switch between them, as well as more plausible
|
|
2536 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
|
|
2537 This can be very annoying when sending messages or even in simple
|
|
2538 editing on a single host. Unification works around the problem by
|
|
2539 converting as many characters as possible to use a single Latin coded
|
|
2540 character set before saving the buffer.
|
|
2541
|
|
2542 This node and its children were ripp'd untimely from
|
|
2543 @file{latin-unity.texi}, and have been quickly converted for use here.
|
|
2544 However as APIs are likely to diverge, beware of inaccuracies. Please
|
|
2545 report any you discover with @kbd{M-x report-xemacs-bug RET}, as well
|
|
2546 as any ambiguities or downright unintelligible passages.
|
|
2547
|
|
2548 A lot of the stuff here doesn't belong here; it belongs in the
|
|
2549 @ref{Top, , , xemacs, XEmacs User's Manual}. Report those as bugs,
|
|
2550 too, preferably with patches.
|
|
2551
|
|
2552 @menu
|
|
2553 * Overview:: Unification history and general information.
|
|
2554 * Usage:: An overview of the operation of Unification.
|
|
2555 * Configuration:: Configuring Unification for use.
|
|
2556 * Theory of Operation:: How Unification works.
|
|
2557 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets.
|
|
2558 * Charsets and Coding Systems:: Reference lists with annotations.
|
1188
|
2559 * Unification Internals:: Utilities and implementation details.
|
1183
|
2560 @end menu
|
|
2561
|
|
2562 @node Overview, Usage, Charset Unification, Charset Unification
|
|
2563 @subsection An Overview of Unification
|
|
2564
|
|
2565 Mule suffers from a design defect that causes it to consider the ISO
|
|
2566 Latin character sets to be disjoint. This manifests itself when a user
|
|
2567 enters characters using input methods associated with different coded
|
|
2568 character sets into a single buffer.
|
|
2569
|
|
2570 A very important example involves email. Many sites, especially in the
|
|
2571 U.S., default to use of the ISO 8859/1 coded character set (also called
|
|
2572 ``Latin 1,'' though these are somewhat different concepts). However,
|
|
2573 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the
|
|
2574 Euro has become the official currency of most countries in Europe, this
|
|
2575 is unsatisfactory (and in practice, useless). So Europeans generally
|
|
2576 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
|
|
2577 languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
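
How close the two coded character sets are can be checked outside
XEmacs. The following Python sketch assumes that Python's
@code{iso8859_1} and @code{iso8859_15} codecs are acceptable stand-ins
for the two standards; it lists every octet that decodes to a different
character.

```python
# Sketch only: Python's iso8859_1 and iso8859_15 codecs stand in for
# the two coded character sets.

differing = [
    octet for octet in range(0x100)
    if bytes([octet]).decode("iso8859_1")
    != bytes([octet]).decode("iso8859_15")
]

currency = bytes([0xA4]).decode("iso8859_1")    # CURRENCY SIGN
euro = bytes([0xA4]).decode("iso8859_15")       # EURO SIGN
```

Only eight octets differ, 0xA4 among them, so most text is
byte-for-byte identical in the two encodings.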
|
|
2578
|
|
2579 Suppose a European user yanks text from a post encoded in ISO 8859/1
|
|
2580 into a message composition buffer, and enters some text including the
|
|
2581 Euro sign. Then Mule will consider the buffer to contain both ISO
|
|
2582 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
|
|
2583 programmed) send the message as a multipart mixed MIME body!
|
|
2584
|
|
2585 This is clearly stupid. What is not as obvious is that, just as any
|
|
2586 European can include American English in their text because ASCII is a
|
|
2587 subset of ISO 8859/15, most European languages which use Latin
|
|
2588 characters (eg, German and Polish) can typically be mixed while using
|
|
2589 only one Latin coded character set (in this case, ISO 8859/2). However,
|
|
2590 this often depends on exactly what text is to be encoded.
|
|
2591
|
|
2592 Unification works around the problem by converting as many characters as
|
|
2593 possible to use a single Latin coded character set before saving the
|
|
2594 buffer.
|
|
2595
|
|
2596 @node Usage, Configuration, Overview, Charset Unification
|
|
2597 @subsection Operation of Unification
|
|
2598
|
|
2599 Normally, Unification works in the background by installing
|
|
2600 @code{unity-sanity-check} on @code{write-region-pre-hook}. This is
|
|
2601 done by default for the ISO 8859 Latin family of character sets. The
|
|
2602 user activates this functionality for other character set families by
|
|
2603 invoking @code{enable-unification}, either interactively or in her
|
|
2604 init file. @xref{Init File, , , xemacs}. Unification can be
|
|
2605 deactivated by invoking @code{disable-unification}.
|
|
2606
|
|
2607 Unification also provides a few functions for remapping or recoding the
|
|
2608 buffer by hand. To @dfn{remap} a character means to change the buffer
|
|
2609 representation of the character by using another coded character set.
|
|
2610 Remapping never changes the identity of the character, but may involve
|
|
2611 altering the code point of the character. To @dfn{recode} a character
|
|
2612 means to simply change the coded character set. Recoding never alters
|
|
2613 the code point of the character, but may change the identity of the
|
|
2614 character. @xref{Theory of Operation}.
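
The difference between remapping and recoding can be illustrated with a
small Python sketch. It is an analogy under stated assumptions: Python
codecs stand in for Mule coded character sets, and @code{remap} and
@code{recode} are hypothetical helpers, not the Unification functions.

```python
# Sketch only: Python codecs stand in for Mule coded character sets;
# remap and recode are hypothetical helpers, not the XEmacs functions.

def remap(octets, old_charset, new_charset):
    """Keep the characters, re-express them in another charset.
    The octets (code points) may change."""
    return octets.decode(old_charset).encode(new_charset)


def recode(text, wrong_charset, right_charset):
    """Keep the octets, reinterpret them in another charset.
    The characters (identities) may change."""
    return text.encode(wrong_charset).decode(right_charset)


# s with caron is 0xB9 in ISO 8859/2 but 0xA8 in ISO 8859/15.
remapped = remap(b"\xb9", "iso8859_2", "iso8859_15")

# Octet 0xB9 taken as ISO 8859/1 is SUPERSCRIPT ONE; recoding to
# ISO 8859/2 keeps the octet and turns it into s with caron.
recoded = recode("\u00b9", "iso8859_1", "iso8859_2")
```

Remapping changed the octet and kept the letter; recoding kept the octet
and changed the letter.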
|
|
2615
|
|
2616 There are a few variables which determine which coding systems are
|
|
2617 always acceptable to Unification: @code{unity-ucs-list},
|
|
2618 @code{unity-preferred-coding-system-list}, and
|
|
2619 @code{unity-preapproved-coding-system-list}. The latter two default
|
|
2620 to @code{()}, and should probably be avoided because they short-circuit
|
|
2621 the sanity check. If you find you need to use them, consider reporting
|
|
2622 it as a bug or request for enhancement. Because they seem unsafe, the
|
|
2623 recommended interface is likely to change.
|
|
2624
|
|
2625 @menu
|
|
2626 * Basic Functionality:: User interface and customization.
|
|
2627 * Interactive Usage:: Treating text by hand.
|
|
2628 Also documents the hook function(s).
|
|
2629 @end menu
|
|
2630
|
|
2631
|
|
2632 @node Basic Functionality, Interactive Usage, , Usage
|
|
2633 @subsubsection Basic Functionality
|
|
2634
|
|
2635 These functions and user options initialize and configure Unification.
|
|
2636 In normal use, none of these should be needed.
|
|
2637
|
|
2638 @strong{These APIs are certain to change.}
|
|
2639
|
|
2640 @defun enable-unification
|
|
2641 Set up hooks and initialize variables for latin-unity.
|
|
2642
|
|
2643 There are no arguments.
|
|
2644
|
|
2645 This function is idempotent. It will reinitialize any hooks or variables
|
|
2646 that are not in initial state.
|
|
2647 @end defun
|
|
2648
|
|
2649 @defun disable-unification
|
|
2650 There are no arguments.
|
|
2651
|
|
2652 Clean up hooks and void variables used by latin-unity.
|
|
2653 @end defun
|
|
2654
|
|
2655 @defopt unity-ucs-list
|
|
2656 List of coding systems considered to be universal.
|
|
2657
|
|
2658 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
|
|
2659
|
|
2660 Order matters; coding systems earlier in the list will be preferred when
|
|
2661 recommending a coding system. These coding systems will not be used
|
|
2662 without querying the user (unless they are also present in
|
|
2663 @code{unity-preapproved-coding-system-list}), and follow the
|
|
2664 @code{unity-preferred-coding-system-list} in the list of suggested
|
|
2665 coding systems.
|
|
2666
|
|
2667 If none of the preferred coding systems are feasible, the first in
|
|
2668 this list will be the default.
|
|
2669
|
|
2670 Notes on certain coding systems: @code{escape-quoted} is a special
|
|
2671 coding system used for autosaves and compiled Lisp in Mule. You should
|
|
2672 @c #### fix in latin-unity.texi
|
|
2673 never delete this, although it is rare that a user would want to use it
|
|
2674 directly. Unification does not try to be ``smart'' about other general
|
|
2675 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized
|
|
2676 as equivalent to @code{iso-2022-7}.) If your preferred coding system is
|
|
2677 one of these, you may consider adding it to @code{unity-ucs-list}.
|
|
2678 However, this will typically have the side effect that (eg) ISO 8859/1
|
|
2679 files will be saved in 7-bit form with ISO 2022 escape sequences.
|
|
2680 @end defopt
|
|
2681
|
|
2682 Coding systems which are not Latin and not in
|
|
2683 @code{unity-ucs-list} are handled by short circuiting checks of
|
|
2684 coding system against the next two variables.
|
|
2685
|
|
2686 @defopt unity-preapproved-coding-system-list
|
|
2687 List of coding systems used without querying the user if feasible.
|
|
2688
|
|
2689 The default value is @samp{(buffer-default preferred)}.
|
|
2690
|
|
2691 The first feasible coding system in this list is used. The special values
|
|
2692 @samp{preferred} and @samp{buffer-default} may be present:
|
|
2693
|
|
2694 @table @code
|
|
2695 @item buffer-default
|
|
2696 Use the coding system used by @samp{write-region}, if feasible.
|
|
2697
|
|
2698 @item preferred
|
|
2699 Use the coding system specified by @samp{prefer-coding-system} if feasible.
|
|
2700 @end table
|
|
2701
|
|
2702 ``Feasible'' means that all characters in the buffer can be represented by
|
|
2703 the coding system. Coding systems in @samp{unity-ucs-list} are
|
|
2704 always considered feasible. Other feasible coding systems are computed
|
|
2705 by @samp{unity-representations-feasible-region}.
|
|
2706
|
|
2707 Note that the first universal coding system in this list shadows all
|
|
2708 other coding systems. In particular, if your preferred coding system is
|
|
2709 a universal coding system, and @code{preferred} is a member of this
|
|
2710 list, unification will blithely convert all your files to that coding
|
|
2711 system. This is considered a feature, but it may surprise most users.
|
|
2712 Users who don't like this behavior should put @code{preferred} in
|
|
2713 @code{unity-preferred-coding-system-list}.
|
|
2714 @end defopt
|
|
2715
|
|
2716 @defopt unity-preferred-coding-system-list
|
|
2717 @c #### fix in latin-unity.texi
|
|
2718 List of coding systems suggested to the user if feasible.
|
|
2719
|
|
2720 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
|
|
2721 iso-8859-4 iso-8859-9)}.
|
|
2722
|
|
2723 If none of the coding systems in
|
|
2724 @c #### fix in latin-unity.texi
|
|
2725 @code{unity-preapproved-coding-system-list} are feasible, this list
|
|
2726 will be recommended to the user, followed by the
|
|
2727 @code{unity-ucs-list}. The first coding system in this list is default. The
|
|
2728 special values @samp{preferred} and @samp{buffer-default} may be
|
|
2729 present:
|
|
2730
|
|
2731 @table @code
|
|
2732 @item buffer-default
|
|
2733 Use the coding system used by @samp{write-region}, if feasible.
|
|
2734
|
|
2735 @item preferred
|
|
2736 Use the coding system specified by @samp{prefer-coding-system} if feasible.
|
|
2737 @end table
|
|
2738
|
|
2739 ``Feasible'' means that all characters in the buffer can be represented by
|
|
2740 the coding system. Coding systems in @samp{unity-ucs-list} are
|
|
2741 always considered feasible. Other feasible coding systems are computed
|
|
2742 by @samp{unity-representations-feasible-region}.
|
|
2743 @end defopt
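
The interplay of the three lists can be summarized in a Python sketch.
It models only the order of consultation described above, under loud
assumptions: the function names are hypothetical, Python codecs stand in
for coding systems, and the special values @code{buffer-default} and
@code{preferred} are not modelled.

```python
# Sketch only: models the selection order described above.  The
# function names are hypothetical, Python codecs stand in for coding
# systems, and buffer-default and preferred are not modelled.

def is_feasible(text, coding_system, ucs_list):
    """True if every character of TEXT can be encoded."""
    if coding_system in ucs_list:
        return True  # universal coding systems are always feasible
    try:
        text.encode(coding_system)
    except UnicodeEncodeError:
        return False
    return True


def choose_coding_system(text, preapproved, preferred, ucs_list):
    """Return (chosen, suggestions).

    CHOSEN is the first feasible preapproved coding system, used
    without a query, or None.  When it is None, SUGGESTIONS lists the
    feasible preferred coding systems followed by the universal ones.
    """
    for coding_system in preapproved:
        if is_feasible(text, coding_system, ucs_list):
            return coding_system, []
    suggestions = [cs for cs in preferred
                   if is_feasible(text, cs, ucs_list)]
    suggestions.extend(cs for cs in ucs_list if cs not in suggestions)
    return None, suggestions


ucs = ["utf_8"]
preferred = ["iso8859_1", "iso8859_15", "iso8859_2"]
plain = choose_coding_system("caf\u00e9", ["iso8859_1"], preferred, ucs)
euro = choose_coding_system("5 \u20ac", ["iso8859_1"], preferred, ucs)
```

Text that fits the preapproved coding system is written without a
query; text containing the Euro sign produces a list of suggestions
instead, with the universal coding system last.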
|
|
2744
|
|
2745
|
|
2746 @defvar unity-iso-8859-1-aliases
|
|
2747 List of coding systems to be treated as aliases of ISO 8859/1.
|
|
2748
|
|
2749 The default value is @code{'(iso-8859-1)}.
|
|
2750
|
|
2751 This is not a user variable; to customize input of coding systems or
|
|
2752 charsets, use @samp{unity-coding-system-alias-alist} or
|
|
2753 @samp{unity-charset-alias-alist}.
|
|
2754 @end defvar
|
|
2755
|
|
2756
|
|
2757 @node Interactive Usage, , Basic Functionality, Usage
|
|
2758 @subsubsection Interactive Usage
|
|
2759
|
|
2760 First, the hook function @code{unity-sanity-check} is documented.
|
|
2761 (It is placed here because it is not an interactive function, and there
|
|
2762 is not yet a programmer's section of the manual.)
|
|
2763
|
|
2764 These functions provide access to internal functionality (such as the
|
|
2765 remapping function) and to extra functionality (the recoding functions
|
|
2766 and the test function).
|
|
2767
|
|
2768
|
|
2769 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system
|
|
2770
|
|
2771 Check if @var{coding-system} can represent all characters between
|
|
2772 @var{begin} and @var{end}.
|
|
2773
|
|
2774 For compatibility with old broken versions of @code{write-region},
|
|
2775 @var{coding-system} defaults to @code{buffer-file-coding-system}.
|
|
2776 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are
|
|
2777 ignored.
|
|
2778
|
|
2779 Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
|
|
2780 Latin. If @code{buffer-file-coding-system} is safe for the charsets
|
|
2781 actually present in the buffer, return it. Otherwise, ask the user to
|
|
2782 choose a coding system, and return that.
|
|
2783
|
|
2784 This function does @emph{not} do the safe thing when
|
|
2785 @code{buffer-file-coding-system} is nil (aka no-conversion). It
|
|
2786 considers that ``non-Latin,'' and passes it on to the Mule detection
|
|
2787 mechanism.
|
|
2788
|
|
2789 This function is intended for use as a @code{write-region-pre-hook}. It
|
|
2790 does nothing except return @var{coding-system} if @code{write-region}
|
|
2791 handlers are inhibited.
|
|
2792 @end defun
|
|
2793
|
|
2794 @defun unity-buffer-representations-feasible
|
|
2795
|
|
2796 There are no arguments.
|
|
2797
|
|
2798 Apply @code{unity-region-representations-feasible} to the current buffer.
|
|
2799 @end defun
|
|
2800
|
|
2801 @defun unity-region-representations-feasible begin end &optional buf
|
|
2802
|
|
2803 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}.
|
|
2804
|
|
2805 @var{buf} defaults to the current buffer. Called interactively, will be
|
|
2806 applied to the region. Function assumes @var{begin} <= @var{end}.
|
|
2807
|
|
2808 The return value is a cons. The car is the list of character sets
|
|
2809 that can individually represent all of the non-ASCII portion of the
|
|
2810 buffer, and the cdr is the list of character sets that can
|
|
2811 individually represent all of the ASCII portion.
|
|
2812
|
|
2813 The following is taken from a comment in the source. Please refer to
|
|
2814 the source to be sure of an accurate description.
|
|
2815
|
|
2816 The basic algorithm is to map over the region, compute the set of
|
|
2817 charsets that can represent each character (the ``feasible charset''),
|
|
2818 and take the intersection of those sets.
|
|
2819
|
|
2820 The current implementation takes advantage of the fact that ASCII
|
|
2821 characters are common and cannot change asciisets. Then using
|
|
2822 skip-chars-forward makes motion over ASCII subregions very fast.
|
|
2823
|
|
2824 This same strategy could be applied generally by precomputing classes
|
|
2825 of characters equivalent according to their effect on latinsets, and
|
|
2826 adding a whole class to the skip-chars-forward string once a member is
|
|
2827 found.
|
|
2828
|
|
2829 Probably efficiency is a function of the number of characters matched,
|
|
2830 or maybe the length of the match string? With @code{skip-category-forward}
|
|
2831 over a precomputed category table it should be really fast. In practice
|
|
2832 for Latin character sets there are only 29 classes.
|
|
2833 @end defun
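
The basic algorithm described for
@code{unity-region-representations-feasible} (intersect the feasible
charsets of each character, skipping ASCII) can be sketched in Python.
This is an illustration under assumptions, not the XEmacs code: Python
codecs stand in for Mule charsets, and @code{feasible_charsets} is a
hypothetical name.

```python
# Sketch only: Python codecs stand in for Mule charsets and
# feasible_charsets is a hypothetical name.

def can_represent(char, charset):
    """True if CHARSET contains CHAR."""
    try:
        char.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


def feasible_charsets(text, candidates):
    """Intersect, over all non-ASCII characters of TEXT, the sets of
    CANDIDATES able to represent each character."""
    feasible = set(candidates)
    for char in text:
        if ord(char) < 0x80:
            continue  # ASCII is in every candidate; the set cannot shrink
        feasible &= set(c for c in candidates if can_represent(char, c))
        if not feasible:
            break  # nothing left to intersect
    return feasible


latin = ["iso8859_1", "iso8859_2", "iso8859_15"]
german = feasible_charsets("Gr\u00fc\u00dfe", latin)
mixed = feasible_charsets("Gr\u00fc\u00dfe, \u0141\u00f3d\u017a", latin)
with_euro = feasible_charsets("\u00f3 costs 5 \u20ac", latin)
```

German text alone leaves all three candidates feasible, German mixed
with Polish leaves only ISO 8859/2, and any text with the Euro sign
leaves only ISO 8859/15.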
|
|
2834
|
|
2835 @defun unity-remap-region begin end character-set &optional coding-system
|
|
2836
|
|
2837 Remap characters between @var{begin} and @var{end} to equivalents in
|
|
2838 @var{character-set}. Optional argument @var{coding-system} may be a
|
|
2839 coding system name (a symbol) or nil. Characters with no equivalent are
|
|
2840 left as-is.
|
|
2841
|
|
2842 When called interactively, @var{begin} and @var{end} are set to the
|
|
2843 beginning and end, respectively, of the active region, and the function
|
|
2844 prompts for @var{character-set}. The function does completion, knows
|
|
2845 how to guess a character set name from a coding system name, and also
|
|
2846 provides some common aliases. See @code{unity-guess-charset}.
|
|
2847 There is no way to specify @var{coding-system}, as it has no useful
|
|
2848 function interactively.
|
|
2849
|
|
2850 Return @var{coding-system} if @var{coding-system} can encode all
|
|
2851 characters in the region, t if @var{coding-system} is nil and the coding
|
|
2852 system with G0 = 'ascii and G1 = @var{character-set} can encode all
|
|
2853 characters, and otherwise nil. Note that a non-null return does
|
|
2854 @emph{not} mean it is safe to write the file, only the specified region.
|
|
2855 (This behavior is useful for multipart MIME encoding and the like.)
|
|
2856
|
|
2857 Note: by default this function is quite strict about universal coding
|
|
2858 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
|
|
2859 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change
|
|
2860 this.
|
|
2861
|
|
2862 This function remaps characters that are artificially distinguished by Mule
|
|
2863 internal code. It may change the code point as well as the character set.
|
|
2864 To recode characters that were decoded in the wrong coding system, use
|
|
2865 @code{unity-recode-region}.
|
|
2866 @end defun
|
|
2867
|
|
2868 @defun unity-recode-region begin end wrong-cs right-cs
|
|
2869
|
|
2870 Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
|
|
2871 to @var{right-cs}.
|
|
2872
|
|
2873 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain
|
|
2874 the same code point but the character set is changed. Only characters
|
|
2875 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
|
|
2876 character may change. Note that this could be dangerous, if characters
|
|
2877 whose identities you do not want changed are included in the region.
|
|
2878 This function cannot guess which characters you want changed, and which
|
|
2879 should be left alone.
|
|
2880
|
|
2881 When called interactively, @var{begin} and @var{end} are set to the
|
|
2882 beginning and end, respectively, of the active region, and the function
|
|
2883 prompts for @var{wrong-cs} and @var{right-cs}. The function does
|
|
2884 completion, knows how to guess a character set name from a coding system
|
|
2885 name, and also provides some common aliases. See
|
|
2886 @code{unity-guess-charset}.
|
|
2887
|
|
2888 Another way to accomplish this, but using coding systems rather than
|
|
2889 character sets to specify the desired recoding, is
|
|
2890 @samp{unity-recode-coding-region}. That function may be faster
|
|
2891 but is somewhat more dangerous, because it may recode more than one
|
|
2892 character set.
|
|
2893
|
|
2894 To change from one Mule representation to another without changing identity
|
|
2895 of any characters, use @samp{unity-remap-region}.
|
|
2896 @end defun
|
|
2897
|
|
2898 @defun unity-recode-coding-region begin end wrong-cs right-cs
|
|
2899
|
|
2900 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
|
|
2901 @var{right-cs}.
|
|
2902
|
|
2903 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
|
|
2904 the same code point but the character set is changed. The identity of
|
|
2905 characters may change. This is an inherently dangerous function;
|
|
2906 multilingual text may be recoded in unexpected ways. #### It's also
|
|
2907 dangerous because the coding systems are not sanity-checked in the
|
|
2908 current implementation.
|
|
2909
|
|
2910 When called interactively, @var{begin} and @var{end} are set to the
|
|
2911 beginning and end, respectively, of the active region, and the function
|
|
2912 prompts for @var{wrong-cs} and @var{right-cs}. The function does
|
|
2913 completion, knows how to guess a coding system name from a character set
|
|
2914 name, and also provides some common aliases. See
|
|
2915 @code{unity-guess-coding-system}.
|
|
2916
|
|
2917 Another, safer, way to accomplish this, using character sets rather
|
|
2918 than coding systems to specify the desired recoding, is to use
|
|
2919 @c #### fixme in latin-unity.texi
|
|
2920 @code{unity-recode-region}.
|
|
2921
|
|
2922 To change from one Mule representation to another without changing identity
|
|
2923 of any characters, use @code{unity-remap-region}.
|
|
2924 @end defun
|
|
2925
|
|
2926 Helper functions for input of coding system and character set names.
|
|
2927
|
|
2928 @defun unity-guess-charset candidate
|
|
2929 Guess a charset based on the symbol @var{candidate}.
|
|
2930
|
|
2931 @var{candidate} itself is not tried as the value.
|
|
2932
|
|
2933 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
|
|
2934 the values in @samp{unity-charset-alias-alist}.
|
|
2935 @end defun
|
|
2936
|
|
2937 @defun unity-guess-coding-system candidate
|
|
2938 Guess a coding system based on the symbol @var{candidate}.
|
|
2939
|
|
2940 @var{candidate} itself is not tried as the value.
|
|
2941
|
|
2942 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
|
|
2943 the values in @samp{unity-coding-system-alias-alist}.
|
|
2944 @end defun
|
|
2945
|
|
2946 @defun unity-example
|
|
2947
|
|
2948 A cheesy example for Unification.
|
|
2949
|
|
2950 At present it just makes a multilingual buffer. To test, setq
|
|
2951 buffer-file-coding-system to some value, make the buffer dirty (eg
|
|
2952 with RET BackSpace), and save.
|
|
2953 @end defun
|
|
2954
|
|
2955
|
|
2956 @node Configuration, Theory of Operation, Usage, Charset Unification
|
|
2957 @subsection Configuring Unification for Use
|
|
2958
|
|
2959 If you want Unification to be automatically initialized, invoke
|
|
2960 @samp{enable-unification} with no arguments in your init file.
|
|
2961 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
|
|
2962 earlier than 21.1, you should also load @file{auto-autoloads} using the
|
|
2963 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
|
|
2964
|
|
2965 You may wish to define aliases for commonly used character sets and
|
|
2966 coding systems for convenience in input.
|
|
2967
|
|
2968 @defopt unity-charset-alias-alist
|
|
2969 Alist mapping aliases to Mule charset names (symbols).
|
|
2970
|
|
2971 The default value is
|
|
2972 @example
|
|
2973 ((latin-1 . latin-iso8859-1)
|
|
2974 (latin-2 . latin-iso8859-2)
|
|
2975 (latin-3 . latin-iso8859-3)
|
|
2976 (latin-4 . latin-iso8859-4)
|
|
2977 (latin-5 . latin-iso8859-9)
|
|
2978 (latin-9 . latin-iso8859-15)
|
|
2979 (latin-10 . latin-iso8859-16))
|
|
2980 @end example
|
|
2981
|
|
2982 If a charset does not exist on your system, it will not complete and you
|
|
2983 will not be able to enter it in response to prompts. A real charset
|
|
2984 with the same name as an alias in this list will shadow the alias.
|
|
2985 @end defopt
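
The shadowing rule stated for @code{unity-charset-alias-alist} can be
modelled in a few lines of Python. This is a sketch under assumptions:
@code{resolve_charset} is a hypothetical helper, the alias table is an
excerpt of the default value shown above, and the list of installed
charsets is invented for the illustration.

```python
# Sketch only: models alias lookup with shadowing.  resolve_charset
# is a hypothetical helper, not the XEmacs implementation.

CHARSET_ALIASES = dict([
    ("latin-1", "latin-iso8859-1"),
    ("latin-2", "latin-iso8859-2"),
    ("latin-9", "latin-iso8859-15"),
])


def resolve_charset(name, real_charsets, aliases=CHARSET_ALIASES):
    """Return the charset NAME refers to, or None if it is unknown.

    A real charset shadows an alias of the same name, and an alias
    whose target does not exist on this system is rejected.
    """
    if name in real_charsets:
        return name
    target = aliases.get(name)
    if target in real_charsets:
        return target
    return None


# Invented list of installed charsets, for illustration only.
installed = ["ascii", "latin-iso8859-1", "latin-iso8859-2"]
```

With this list, @code{latin-1} resolves to @code{latin-iso8859-1},
while @code{latin-9} is rejected because its target charset is not
installed.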
|
|
2986
|
|
2987 @defopt unity-coding-system-alias-alist
|
|
2988 Alist mapping aliases to Mule coding system names (symbols).
|
|
2989
|
|
2990 The default value is @samp{nil}.
|
|
2991 @end defopt
|
|
2992
|
|
2993
|
|
2994 @node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification
|
|
2995 @subsection Theory of Operation
|
|
2996
|
|
2997 Standard encodings suffer from the design defect that they do not
|
|
2998 provide a reliable way to recognize which coded character sets are in use.
|
|
2999 @xref{What Unification Cannot Do for You}. There are scores of
|
|
3000 character sets which can be represented by a single octet (8-bit byte),
|
|
3001 whose union contains many hundreds of characters. Obviously this
|
|
3002 results in great confusion, since you can't tell the players without a
|
|
3003 scorecard, and there is no scorecard.
|
|
3004
|
|
3005 There are two ways to solve this problem. The first is to create a
|
|
3006 universal coded character set. This is the concept behind Unicode.
|
|
3007 However, there have been satisfactory (nearly) universal character sets
|
|
3008 for several decades, but even today many Westerners resist using Unicode
|
|
3009 because they consider its space requirements excessive. On the other
|
|
3010 hand, Asians dislike Unicode because they consider it to be incomplete.
|
|
3011 (This is partly, but not entirely, political.)
|
|
3012
|
|
3013 In any case, Unicode only solves the internal representation problem.
|
|
3014 Many data sets will contain files in ``legacy'' encodings, and Unicode
|
|
3015 does not help distinguish among them.
|
|
3016
|
|
3017 The second approach is to embed information about the encodings used in
|
|
3018 a document in its text. This approach is taken by the ISO 2022
|
|
3019 standard. This would solve the problem completely from the users' point of
|
|
3020 view, except that ISO 2022 is basically not implemented at all, in the
|
|
3021 sense that few applications or systems implement more than a small
|
|
3022 subset of ISO 2022 functionality. This is due to the fact that
|
|
3023 mono-literate users object to the presence of escape sequences in their
|
|
3024 texts (which they, with some justification, consider data corruption).
|
|
3025 Programmers are more than willing to cater to these users, since
|
|
3026 implementing ISO 2022 is a painstaking task.
|
|
3027
|
|
3028 In fact, Emacs/Mule adopts both of these approaches. Internally it uses
|
|
3029 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
|
|
3030 techniques both to save files in forms robust to encoding issues, and as
|
|
3031 hints when attempting to ``guess'' an unknown encoding. However, Mule
|
|
3032 suffers from a design defect, namely it embeds the character set
|
|
3033 information that ISO 2022 attaches to runs of characters by introducing
|
|
3034 them with a control sequence in each character. That causes Mule to
|
|
3035 consider the ISO Latin character sets to be disjoint. This manifests
|
|
3036 itself when a user enters characters using input methods associated with
|
|
3037 different coded character sets into a single buffer.
|
|
3038
|
|
3039 There are two problems stemming from this design. First, Mule
|
1188
|
3040 represents the same character in different ways. Abstractly, 'ó'
|
1183
|
3041 (LATIN SMALL LETTER O WITH ACUTE) can get represented as
|
|
3042 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
|
1188
|
3043 'óó' in the display might actually be represented [latin-iso8859-1
|
1183
|
3044 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
|
|
3045 #xF3 ESC - A] in the file. In some cases this treatment would be
|
|
3046 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
|
|
3047 (the CJK ideographic character meaning ``one'')), and although arguably
|
|
3048 incorrect it is convenient when mixing the CJK scripts. But in the case
|
|
3049 of the Latin scripts this is wrong.
|
|
3050
|
|
3051 Worse yet, it is very likely to occur when mixing ``different'' encodings
|
|
3052 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
|
|
3053 points that are almost never used. A very important example involves
|
|
3054 email. Many sites, especially in the U.S., default to use of the ISO
|
|
3055 8859/1 coded character set (also called ``Latin 1,'' though these are
|
|
3056 somewhat different concepts). However, ISO 8859/1 provides a generic
|
|
3057 CURRENCY SIGN character. Now that the Euro has become the official
|
|
3058 currency of most countries in Europe, this is unsatisfactory (and in
|
|
3059 practice, useless). So Europeans generally use ISO 8859/15, which is
|
|
3060 nearly identical to ISO 8859/1 for most languages, except that it
|
|
3061 substitutes EURO SIGN for CURRENCY SIGN.
|
|
3062
|
|
3063 Suppose a European user yanks text from a post encoded in ISO 8859/1
|
|
3064 into a message composition buffer, and enters some text including the
|
|
3065 Euro sign. Then Mule will consider the buffer to contain both ISO
|
|
3066 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
|
|
3067 programmed) send the message as a multipart mixed MIME body!
|
|
3068
|
|
3069 This is clearly stupid. What is not as obvious is that, just as any
|
|
3070 European can include American English in their text because ASCII is a
|
|
3071 subset of ISO 8859/15, most European languages which use Latin
|
|
3072 characters (eg, German and Polish) can typically be mixed while using
|
|
3073 only one Latin coded character set (in the case of German and Polish,
|
|
3074 ISO 8859/2). However, this often depends on exactly what text is to be
|
|
3075 encoded (even for the same pair of languages).
|
|
3076
|
|
3077 Unification works around the problem by converting as many characters as
|
|
3078 possible to use a single Latin coded character set before saving the
|
|
3079 buffer.
|
|
3080
|
|
3081 Because the problem is rarely noticeable in editing a buffer, but tends
|
|
3082 to manifest when that buffer is exported to a file or process, the
|
|
3083 Unification package uses the strategy of examining the buffer prior to
|
|
3084 export. If use of multiple Latin coded character sets is detected,
|
|
3085 Unification attempts to unify them by finding a single coded character
|
|
3086 set which contains all of the Latin characters in the buffer.
|
|
3087
|
|
3088 The primary purpose of Unification is to fix the problem by giving the
|
|
3089 user the choice to change the representation of all characters to one
|
|
3090 character set and give sensible recommendations based on context. In
|
1188
|
3091 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
|
1183
|
3092 both will be suggested. In the EURO SIGN example, only ISO 8859/15
|
|
3093 makes sense, and that is what will be recommended. In both cases, the
|
|
3094 user will be reminded that there are universal encodings available.
|
|
3095
|
|
3096 I call this @dfn{remapping} (from the universal character set to a
|
|
3097 particular ISO 8859 coded character set). It is mere accident that this
|
|
3098 letter has the same code point in both character sets. (Not entirely,
|
|
3099 but there are many examples of Latin characters that have different code
|
|
3100 points in different Latin-X sets.)
|
|
3101
|
1188
|
3102 Note that, in the 'ó' example, treating the buffer in this way will
|
1183
|
3103 result in a representation such as [latin-iso8859-2
|
|
3104 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
|
|
3105 This is guaranteed to occasionally result in the second problem you
|
|
3106 observed, to which we now turn.
|
|
3107
|
|
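The relation between the buffer notation and the octets written to the file can be illustrated as follows. This Python sketch assumes that the #x73 in the notation above is simply the file octet with its high bit cleared (the position code within the 96-character set); the helper name position_code is hypothetical.

```python
def position_code(octet):
    """Clear the high bit of a GR octet to get its 7-bit position code."""
    return octet & 0x7F

# How a Latin-2 file stores the letter: a single octet.
octet = "ó".encode("iso8859_2")[0]

print(hex(octet), hex(position_code(octet)))
```

Under that assumption the same letter yields the same octet and the same position code in Latin-1, which is why the charset label, not the code, is what distinguishes the two representations.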
3108 This problem is that, although the file is intended to be an
|
|
3109 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
|
|
3110 compliant program---this is required by the standard, obvious if you
|
|
3111 think a bit, @pxref{What Unification Cannot Do for You}) will read that
|
|
3112 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this
|
|
3113 is no problem if all of the characters in the file are contained in ISO
|
|
3114 8859/1, but suppose there are some which are not, but are contained in
|
|
3115 the (intended) ISO 8859/2.
|
|
3116
|
|
3117 You now want to fix this, but not by finding the same character in
|
|
3118 another set. Instead, you want to simply change the character set that
|
|
3119 Mule associates with that buffer position without changing the code.
|
|
3120 (This is conceptually somewhat distinct from the first problem, and
|
|
3121 logically ought to be handled in the code that defines coding systems.
|
|
3122 However, unification is not an unreasonable place for it.) Unification
|
|
3123 provides two functions (one fast and dangerous, the other slow and
|
|
3124 careful) to handle this. I call this @dfn{recoding}, because the
|
|
3125 transformation actually involves @emph{encoding} the buffer to file
|
|
3126 representation, then @emph{decoding} it to buffer representation (in a
|
|
3127 different character set). This cannot be done automatically because
|
|
3128 Mule can have no idea what the correct encoding is---after all, it
|
|
3129 already gave you its best guess. @xref{What Unification Cannot Do for
|
|
3130 You}. So these functions must be invoked by the user. @xref{Interactive
|
|
3131 Usage}.
|
|
3132
|
|
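Recoding can be demonstrated on plain byte strings. The sketch below is in Python, uses Python's codec names, and the function recode is a hypothetical stand-in, not one of the two functions provided by the Unification package. It encodes the misread text back to its file octets and then decodes those octets with the intended coded character set.

```python
def recode(text, assumed, intended):
    """Re-encode text with the wrongly assumed codec, then decode the
    resulting octets with the intended one."""
    return text.encode(assumed).decode(intended)

# A Polish phrase saved as ISO 8859/2 ...
raw = "mały stół".encode("iso8859_2")
# ... and read back in an ISO 8859/1 locale.
misread = raw.decode("iso8859_1")
# Recoding repairs it, but only because a human supplied the right answer.
fixed = recode(misread, "iso8859_1", "iso8859_2")

print(misread)
print(fixed)
```

Note that the 'ó' survives the misreading untouched, while the other accented letter does not; nothing in the octets themselves says which interpretation was intended.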
3133
|
|
3134 @node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification
|
|
3135 @subsection What Unification Cannot Do for You
|
|
3136
|
|
3137 Unification @strong{cannot} save you if you insist on exporting data in
|
|
3138 8-bit encodings in a multilingual environment. @emph{You will
|
|
3139 eventually corrupt data if you do this.} It is not Mule's, or any
|
|
3140 application's, fault. You will have only yourself to blame; consider
|
|
3141 yourself warned. (It is true that Mule has bugs, which make Mule
|
|
3142 somewhat more dangerous and inconvenient than some naive applications.
|
|
3143 We're working to address those, but no application can remedy the
|
|
3144 inherent defect of 8-bit encodings.)
|
|
3145
|
|
3146 Use standard universal encodings, preferably Unicode (UTF-8) unless
|
|
3147 applicable standards indicate otherwise. The most important such case
|
|
3148 is Internet messages, where MIME should be used, whether or not the
|
|
3149 subordinate encoding is a universal encoding. (Note that since one of
|
|
3150 the important provisions of MIME is the @samp{Content-Type} header,
|
|
3151 which has the charset parameter, MIME is to be considered a universal
|
|
3152 encoding for the purposes of this manual. Of course, technically
|
|
3153 speaking it's neither a coded character set nor a coding extension
|
|
3154 technique compliant with ISO 2022.)
|
|
3155
|
|
3156 As mentioned earlier, the problem is that standard encodings suffer from
|
|
3157 the design defect that they do not provide a reliable way to recognize
|
|
3158 which coded character sets are in use. There are scores of character
|
|
3159 sets which can be represented by a single octet (8-bit byte), whose
|
|
3160 union contains many hundreds of characters. Thus any 8-bit coded
|
|
3161 character set must contain characters that share code points used for
|
|
3162 different characters in other coded character sets.
|
|
3163
|
|
3164 This means that a given file's intended encoding cannot be identified
|
|
3165 with 100% reliability unless it contains encoding markers such as those
|
|
3166 provided by MIME or ISO 2022.
|
|
3167
|
|
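The overlap is easy to observe directly. The following Python sketch (the function name is invented for illustration, and the codec names are Python's) lists the octets with the high bit set that stand for different characters in two 8-bit coded character sets. It assumes both codecs define every octet in the range 0xA0 to 0xFF, which holds for the two used here.

```python
def differing_octets(codec_a, codec_b):
    """List GR octets (0xA0-0xFF) that decode to different characters."""
    differing = []
    for octet in range(0xA0, 0x100):
        raw = bytes([octet])
        if raw.decode(codec_a) != raw.decode(codec_b):
            differing.append(octet)
    return differing

clash = differing_octets("iso8859_1", "iso8859_2")
print([hex(octet) for octet in clash])
```

Any octet in the resulting list is ambiguous between Latin-1 and Latin-2 unless the file carries an explicit marker of its encoding.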
3168 Unification actually makes it more likely that you will have problems of
|
|
3169 this kind. Traditionally Mule has been ``helpful'' by simply using an
|
|
3170 ISO 2022 universal coding system when the current buffer coding system
|
|
3171 cannot handle all the characters in the buffer. This has the effect
|
|
3172 that, because the file contains control sequences, it is not recognized
|
|
3173 as being in the locale's normal 8-bit encoding. It may be annoying if
|
|
3174 you are not a Mule expert, but your data is automatically recoverable
|
|
3175 with a tool you already have: Mule.
|
|
3176
|
|
3177 However, with unification, Mule converts to a single 8-bit character set
|
|
3178 when possible. But typically this will @emph{not} be in your usual
|
|
3179 locale. That is, the time an ISO 8859/1 user will need Unification is
|
|
3180 when there are ISO 8859/2 characters in the buffer. But then most
|
|
3181 likely the file will be saved in a pure 8-bit encoding that is not ISO
|
|
3182 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the
|
|
3183 most sophisticated yet available) cannot tell the difference between ISO
|
|
3184 8859/1 and ISO 8859/2, and in a Western European locale will choose the
|
|
3185 former even though the latter was intended. Even the extension
|
|
3186 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at
|
|
3187 all accurate in the case of mixed codes.
|
|
3188
|
|
3189 So now consider adding some additional ISO 8859/1 text to the buffer.
|
|
3190 If it includes any ISO 8859/1 codes that are used by different
|
|
3191 characters in ISO 8859/2, you now have a file that cannot be
|
|
3192 mechanically disentangled. You need a human being who can recognize
|
|
3193 that @emph{this is German and Swedish} and stays in Latin-1, while
|
|
3194 @emph{that is Polish} and needs to be recoded to Latin-2.
|
|
3195
|
|
3196 Moral: switch to a universal coded character set, preferably Unicode
|
|
3197 using the UTF-8 transformation format. If you really need the space,
|
|
3198 compress your files.
|
|
3199
|
|
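As a small illustration of the moral, the Python sketch below (not XEmacs code; the sample text and variable names are made up) stores text mixing German, Polish, Greek and the EURO SIGN in UTF-8, shows that one of the 8-bit sets cannot hold it, and compresses the result with the standard zlib module.

```python
import zlib

text = "Grüße, mały stół, 10 €, Ελληνικά"

# UTF-8 holds all of it and round-trips exactly.
utf8 = text.encode("utf-8")
restored = utf8.decode("utf-8")

# A single 8-bit Latin set cannot represent the whole text.
try:
    text.encode("iso8859_2")
    fits_latin2 = True
except UnicodeEncodeError:
    fits_latin2 = False

# If space matters, compress the UTF-8 octets.
packed = zlib.compress(utf8)
unpacked = zlib.decompress(packed).decode("utf-8")
```

On a string this short compression gains nothing; the point is only that the round trip is lossless at every step.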
3200
|
|
3201 @node Unification Internals, , What Unification Cannot Do for You, Charset Unification
|
|
3202 @subsection Internals
|
|
3203
|
|
3204 No internals documentation yet.
|
|
3205
|
|
3206 @file{unity-utils.el} provides one utility function.
|
|
3207
|
|
3208 @defun unity-dump-tables
|
|
3209
|
|
3210 Dump the temporary table created by loading @file{unity-utils.el}
|
|
3211 to @file{unity-tables.el}. Loading the latter file initializes
|
|
3212 @samp{unity-equivalences}.
|
|
3213 @end defun
|
|
3214
|
|
3215
|
|
3216 @node Charsets and Coding Systems, , Charset Unification, MULE
|
|
3217 @section Charsets and Coding Systems
|
|
3218
|
|
3219 This section provides reference lists of Mule charsets and coding
|
|
3220 systems. Mule charsets are typically named by character set and
|
|
3221 standard.
|
|
3222
|
|
3223 @table @strong
|
|
3224 @item ASCII variants
|
|
3225
|
|
3226 Identification of equivalent characters in these sets is not properly
|
|
3227 implemented. Unification does not distinguish the two charsets.
|
|
3228
|
|
3229 @samp{ascii} @samp{latin-jisx0201}
|
|
3230
|
|
3231 @item Extended Latin
|
|
3232
|
|
3233 Characters from the following ISO 2022 conformant charsets are
|
|
3234 identified with equivalents in other charsets in the group by
|
|
3235 Unification.
|
|
3236
|
|
3237 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
|
|
3238 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
|
|
3239 @samp{latin-iso8859-13} @samp{latin-iso8859-16}
|
|
3240
|
|
3241 The following charsets are Latin variants which are not understood by
|
|
3242 Unification. In addition, many of the Asian language standards provide
|
|
3243 ASCII, at least, and sometimes other Latin characters. None of these
|
|
3244 are identified with their ISO 8859 equivalents.
|
|
3245
|
|
3246 @samp{vietnamese-viscii-lower}
|
|
3247 @samp{vietnamese-viscii-upper}
|
|
3248
|
|
3249 @item Other character sets
|
|
3250
|
|
3251 @samp{arabic-1-column}
|
|
3252 @samp{arabic-2-column}
|
|
3253 @samp{arabic-digit}
|
|
3254 @samp{arabic-iso8859-6}
|
|
3255 @samp{chinese-big5-1}
|
|
3256 @samp{chinese-big5-2}
|
|
3257 @samp{chinese-cns11643-1}
|
|
3258 @samp{chinese-cns11643-2}
|
|
3259 @samp{chinese-cns11643-3}
|
|
3260 @samp{chinese-cns11643-4}
|
|
3261 @samp{chinese-cns11643-5}
|
|
3262 @samp{chinese-cns11643-6}
|
|
3263 @samp{chinese-cns11643-7}
|
|
3264 @samp{chinese-gb2312}
|
|
3265 @samp{chinese-isoir165}
|
|
3266 @samp{cyrillic-iso8859-5}
|
|
3267 @samp{ethiopic}
|
|
3268 @samp{greek-iso8859-7}
|
|
3269 @samp{hebrew-iso8859-8}
|
|
3270 @samp{ipa}
|
|
3271 @samp{japanese-jisx0208}
|
|
3272 @samp{japanese-jisx0208-1978}
|
|
3273 @samp{japanese-jisx0212}
|
|
3274 @samp{katakana-jisx0201}
|
|
3275 @samp{korean-ksc5601}
|
|
3276 @samp{sisheng}
|
|
3277 @samp{thai-tis620}
|
|
3278 @samp{thai-xtis}
|
|
3279
|
|
3280 @item Non-graphic charsets
|
|
3281
|
|
3282 @samp{control-1}
|
|
3283 @end table
|
|
3284
|
|
3285 @table @strong
|
|
3286 @item No conversion
|
|
3287
|
|
3288 Some of these coding systems may specify EOL conventions. Note that
|
|
3289 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
|
|
3290 coding system. Although unification attempts to compensate for this, it
|
|
3291 is possible that the @samp{iso-8859-1} coding system will behave
|
|
3292 differently from other ISO 8859 coding systems.
|
|
3293
|
|
3294 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
|
|
3295
|
|
3296 @item Latin coding systems
|
|
3297
|
|
3298 These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
|
|
3299 combining ASCII in the GL register (bytes with high-bit clear) and an
|
|
3300 extended Latin character set in the GR register (bytes with high-bit set).
|
|
3301
|
|
3302 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
|
|
3303 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
|
|
3304
|
|
3305 These coding systems are single-byte, 8-bit coding systems that do not
|
|
3306 conform to international standards. They should be avoided in all
|
|
3307 potentially multilingual contexts, including any text distributed over
|
|
3308 the Internet and World Wide Web.
|
|
3309
|
|
3310 @samp{windows-1251}
|
|
3311
|
|
3312 @item Multilingual coding systems
|
|
3313
|
|
3314 The following ISO-2022-based coding systems are useful for multilingual
|
|
3315 text.
|
|
3316
|
|
3317 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
|
|
3318 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
|
|
3319
|
|
3320 XEmacs also supports Unicode with the Mule-UCS package. The following are the
|
|
3321 preferred coding systems for multilingual use. (There is a possible
|
|
3322 exception for texts that mix several Asian ideographic character sets.)
|
|
3323
|
|
3324 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
|
|
3325 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
|
|
3326 @samp{utf-8} @samp{utf-8-ws}
|
|
3327
|
|
3328 Development versions of XEmacs (the 21.5 series) support Unicode
|
|
3329 internally, with (at least) the following coding systems implemented:
|
|
3330
|
|
3331 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
|
|
3332 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
|
|
3333
|
|
3334 @item Asian ideographic languages
|
|
3335
|
|
3336 The following coding systems are based on ISO 2022, and are more or less
|
|
3337 suitable for encoding multilingual texts. They all can represent ASCII
|
|
3338 at least, and sometimes several other foreign character sets, without
|
|
3339 resort to arbitrary ISO 2022 designations. However, these subsets are
|
|
3340 not identified with the corresponding national standards in XEmacs Mule.
|
|
3341
|
|
3342 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
|
|
3343 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
|
|
3344 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
|
|
3345 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
|
|
3346 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
|
|
3347
|
|
3348 The following coding systems cannot be used for general multilingual
|
|
3349 text and do not cooperate well with other coding systems.
|
|
3350
|
|
3351 @samp{big5} @samp{shift_jis}
|
|
3352
|
|
3353 @item Other languages
|
|
3354
|
|
3355 The following coding systems are based on ISO 2022. Though none of them
|
|
3356 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
|
|
3357 to 21.4 defaults to) use of ISO 2022 control sequences to designate
|
|
3358 other character sets for inclusion in the text.
|
|
3359
|
|
3360 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
|
|
3361 @samp{ctext-hebrew}
|
|
3362
|
|
3363 The following are coding systems that do not conform to ISO 2022 and
|
|
3364 thus cannot be safely used in a multilingual context.
|
|
3365
|
|
3366 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
|
|
3367 @samp{viscii} @samp{vscii}
|
|
3368
|
|
3369 @item Special coding systems
|
|
3370
|
|
3371 Mule uses the following coding systems for special purposes.
|
|
3372
|
|
3373 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
|
|
3374
|
|
3375 @samp{escape-quoted} is especially important, as it is used internally
|
|
3376 as the coding system for autosaved data.
|
|
3377
|
|
3378 The following coding systems are aliases for others, and are used for
|
|
3379 communication with the host operating system.
|
|
3380
|
|
3381 @samp{file-name} @samp{keyboard} @samp{terminal}
|
|
3382
|
|
3383 @end table
|
|
3384
|
|
3385 Mule detection of coding systems is actually limited to detection of
|
|
3386 classes of coding systems called @dfn{coding categories}. These coding
|
|
3387 categories are identified by the ISO 2022 control sequences they use, if
|
|
3388 any, by their conformance to ISO 2022 restrictions on code points that
|
|
3389 may be used, and by characteristic patterns of use of 8-bit code points.
|
|
3390
|
|
3391 @samp{no-conversion}
|
|
3392 @samp{utf-8}
|
|
3393 @samp{ucs-4}
|
|
3394 @samp{iso-7}
|
|
3395 @samp{iso-lock-shift}
|
|
3396 @samp{iso-8-1}
|
|
3397 @samp{iso-8-2}
|
|
3398 @samp{iso-8-designate}
|
|
3399 @samp{shift-jis}
|
|
3400 @samp{big5}
|
|
3401
|
|
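To give a feeling for what detection by category means, here is a deliberately crude Python sketch. It is not Mule's detector, which distinguishes many more categories than this; the function name and the labels it returns are invented for this example and do not correspond one-to-one to the coding categories listed above. It looks only at three of the cues mentioned: the presence of ISO 2022 escape sequences, the use of 8-bit code points, and whether those 8-bit octets form valid UTF-8.

```python
ESC = 0x1B  # first octet of an ISO 2022 escape sequence

def rough_category(data):
    """Classify a byte string by very coarse, hypothetical categories."""
    has_high = any(octet >= 0x80 for octet in data)
    if not has_high:
        if ESC in data:
            return "7-bit with escapes"
        return "7-bit plain"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "8-bit, unidentified"
    return "utf-8"

print(rough_category("stół".encode("iso8859_2")))
```

The last label is the important one: a pure 8-bit file offers no cue that would tell ISO 8859/1 from ISO 8859/2, so a detector of this kind can only fall back on the locale.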
3402
|
|
3403 @c end of mule.texi
|
|
3404
|