comparison man/lispref/mule.texi @ 442:abe6d1db359e r21-2-36

Import from CVS: tag r21-2-36
author cvs
date Mon, 13 Aug 2007 11:35:02 +0200
parents 8de8e3f6228a
children 576fb035e263
441:72a7cfa4a488 442:abe6d1db359e
4 @c See the file lispref.texi for copying conditions. 4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info 5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top 6 @node MULE, Tips, Internationalization, top
7 @chapter MULE 7 @chapter MULE
8 8
9 @dfn{MULE} is the name originally given to the version of GNU Emacs 9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support. 10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called 11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for 12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the
13 ``Japan''), when it only provided support for Japanese. XEmacs 13 Japanese word for ``Japan''), which only provided support for Japanese.
14 refers to its multi-lingual support as @dfn{MULE support} since it 14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since
15 is based on @dfn{MULE}. 15 it is based on @dfn{MULE}.
16 16
17 @menu 17 @menu
18 * Internationalization Terminology:: 18 * Internationalization Terminology::
19 Definition of various internationalization terms. 19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters. 20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE. 21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones. 22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
24 * Coding Systems:: Ways of representing a string of chars using integers. 23 * Coding Systems:: Ways of representing a string of chars using integers.
25 * CCL:: A special language for writing fast converters. 24 * CCL:: A special language for writing fast converters.
26 * Category Tables:: Subdividing charsets into groups. 25 * Category Tables:: Subdividing charsets into groups.
27 @end menu 26 @end menu
28 27
29 @node Internationalization Terminology 28 @node Internationalization Terminology, Charsets, , MULE
30 @section Internationalization Terminology 29 @section Internationalization Terminology
31 30
32 In internationalization terminology, a string of text is divided up 31 In internationalization terminology, a string of text is divided up
33 into @dfn{characters}, which are the printable units that make up the 32 into @dfn{characters}, which are the printable units that make up the
34 text. A single character is (for example) a capital @samp{A}, the 33 text. A single character is (for example) a capital @samp{A}, the
35 number @samp{2}, a Katakana character, a Kanji ideograph (an 34 number @samp{2}, a Katakana character, a Hangul character, a Kanji
36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese 35 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is
37 Kanji, Chinese Hanzi, and Korean Hangul; typically there are thousands 36 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
38 of such ideographs in each language), etc. The basic property of a 37 are thousands of such ideographs in each language), etc. The basic
39 character is its shape. Note that the same character may be drawn by 38 property of a character is that it is the smallest unit of text with
40 two different people (or in two different fonts) in slightly different 39 semantic significance in text processing.
41 ways, although the basic shape will be the same. 40
41 Human beings normally process text visually, so to a first approximation
42 a character may be identified with its shape. Note that the same
43 character may be drawn by two different people (or in two different
44 fonts) in slightly different ways, although the ``basic shape'' will be the
45 same. But consider the works of Scott Kim; human beings can recognize
46 hugely variant shapes as the ``same'' character. Sometimes, especially
47 where characters are extremely complicated to write, completely
48 different shapes may be defined as the ``same'' character in national
49 standards. The Taiwanese variant of Hanzi is generally the most
50 complicated; over the centuries, the Japanese, Koreans, and the People's
51 Republic of China have adopted simplifications of the shape, but the
52 line of descent from the original shape is recorded, and the meanings
53 and pronunciation of different forms of the same character are
54 considered to be identical within each language. (Of course, it may
55 take a specialist to recognize the related form; the point is that the
56 relations are standardized, despite the differing shapes.)
42 57
43 In some cases, the differences will be significant enough that it is 58 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both 59 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters 60 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes---the 61 @samp{a} and @samp{g} each have two distinct possible shapes---the
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). 70 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
56 71
57 Note that @dfn{character} and @dfn{glyph} are used differently 72 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs. 73 here than elsewhere in XEmacs.
59 74
60 A @dfn{character set} is simply a set of related characters. ASCII, 75 A @dfn{character set} is essentially a set of related characters. ASCII,
61 for example, is a set of 94 characters (or 128, if you count 76 for example, is a set of 94 characters (or 128, if you count
62 non-printing characters). Other character sets are ISO8859-1 (ASCII 77 non-printing characters). Other character sets are ISO8859-1 (ASCII
63 plus various accented characters and other international symbols), 78 plus various accented characters and other international symbols),
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208 79 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji), 80 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
66 GB2312 (Mainland Chinese Hanzi), etc. 81 GB2312 (Mainland Chinese Hanzi), etc.
67 82
68 Every character set has one or more @dfn{orderings}, which can be 83 The definition of a character set will implicitly or explicitly give
69 viewed as a way of assigning a number (or set of numbers) to each 84 it an @dfn{ordering}, a way of assigning a number to each character in
70 character in the set. For most character sets, there is a standard 85 the set. For many character sets, there is a natural ordering, for
71 ordering, and in fact all of the character sets mentioned above define a 86 example the ``ABC'' ordering of the Roman letters. But it is not clear
72 particular ordering. ASCII, for example, places letters in their 87 whether digits should come before or after the letters, and in fact
73 ``natural'' order, puts uppercase letters before lowercase letters, 88 different European languages treat the ordering of accented characters
74 numbers before letters, etc. Note that for many of the Asian character 89 differently. It is useful to use the natural order where available, of
75 sets, there is no natural ordering of the characters. The actual 90 course. The number assigned to any particular character is called the
76 orderings are based on one or more salient characteristic, of which 91 character's @dfn{code point}. (Within a given character set, each
77 there are many to choose from---e.g. number of strokes, common 92 character has a unique code point. Thus the word ``set'' is ill-chosen;
78 radicals, phonetic ordering, etc. 93 different orderings of the same characters are different character sets.
79 94 Identifying characters is simple enough for alphabetic character sets,
80 The set of numbers assigned to any particular character are called 95 but the difference in ordering can cause great headaches when the same
81 the character's @dfn{position codes}. The number of position codes 96 thousands of characters are used by different cultures as in the Hanzi.)
82 required to index a particular character in a character set is called 97
83 the @dfn{dimension} of the character set. ASCII, being a relatively 98 A code point may be broken into a number of @dfn{position codes}. The
84 small character set, is of dimension one, and each character in the 99 number of position codes required to index a particular character in a
85 set is indexed using a single position code, in the range 0 through 100 character set is called the @dfn{dimension} of the character set. For
86 127 (if non-printing characters are included) or 33 through 126 101 practical purposes, a position code may be thought of as a byte-sized
87 (if only the printing characters are considered). JISX0208, i.e. 102 index. The printing characters of ASCII form a relatively small
88 Japanese Kanji, has thousands of characters, and is of dimension two -- 103 character set of dimension one, and each character in the set is
89 every character is indexed by two position codes, each in the range 104 indexed using a single position code, in the range 1 through 94. Use of
90 33 through 126. (Note that the choice of the range here is somewhat 105 this unusual range, rather than the familiar 33 through 126, is an
91 arbitrary. Although a character set such as JISX0208 defines an 106 intentional abstraction; to understand the programming issues you must
92 @emph{ordering} of all its characters, it does not define the actual 107 break the equation between character sets and encodings.
93 mapping between numbers and characters. You could just as easily 108
94 index the characters in JISX0208 using numbers in the range 0 through 109 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range 110 of dimension two -- every character is indexed by two position codes,
96 chosen is so that the position codes match up with the actual values 111 each in the range 1 through 94. (This number ``94'' is not a
97 used in the common encodings.) 112 coincidence; we shall see that the JIS position codes were chosen so
113 that JIS kanji could be encoded without using codes that in ASCII are
114 associated with device control functions.) Note that the choice of the
115 range here is somewhat arbitrary. You could just as easily index the
116 printing characters in ASCII using numbers in the range 0 through 93, 2
117 through 95, 3 through 96, etc. In fact, the standardized
118 @emph{encoding} for the ASCII @emph{character set} uses the range 33
119 through 126.
98 120
99 An @dfn{encoding} is a way of numerically representing characters from 121 An @dfn{encoding} is a way of numerically representing characters from
100 one or more character sets into a stream of like-sized numerical values 122 one or more character sets into a stream of like-sized numerical values
101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit 123 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
102 quantities. If an encoding encompasses only one character set, then the 124 quantities. If an encoding encompasses only one character set, then the
103 position codes for the characters in that character set could be used 125 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do 126 directly. (This is the case with the trivial cipher used by children,
105 not understand the difference between a character set and an encoding.) 127 assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
106 This is not possible, however, if more than one character set is to be 128 other considerations intrude. For example, why are the upper- and
107 used in the encoding. For example, printed Japanese text typically 129 lowercase alphabets separated by 6 characters? Why do the digits start
108 requires characters from multiple character sets---ASCII, JISX0208, and 130 with `0' being assigned the code 48? In both cases, it is because semantically
109 JISX0212, to be specific. Each of these is indexed using one or more 131 interesting operations (case conversion and numerical value extraction)
110 position codes in the range 33 through 126, so the position codes could 132 become convenient masking operations. Other artificial aspects (the
111 not be used directly or there would be no way to tell which character 133 control characters being assigned to codes 0--31 and 127) are historical
112 was meant. Different Japanese encodings handle this differently---JIS 134 accidents. (The use of 127 for @samp{DEL} is an artifact of the ``punch
113 uses special escape characters to denote different character sets; EUC 135 once'' nature of paper tape, for example.)
114 sets the high bit of the position codes for JISX0208 and JISX0212, and 136
115 puts a special extra byte before each JISX0212 character; etc. (JIS, 137 Naive use of the position code is not possible, however, if more than
116 EUC, and most of the other encodings you will encounter are 7-bit or 138 one character set is to be used in the encoding. For example, printed
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; 139 Japanese text typically requires characters from multiple character sets
118 this strives to represent all the world's characters in a single large 140 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
119 character set. 32-bit encodings are generally used internally in 141 indexed using one or more position codes in the range 1 through 94, so
120 programs to simplify the code that manipulates them; however, they are 142 the position codes could not be used directly or there would be no way
121 not much used externally because they are not very space-efficient.) 143 to tell which character was meant. Different Japanese encodings handle
144 this differently -- JIS uses special escape characters to denote
145 different character sets; EUC sets the high bit of the position codes
146 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
147 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
148 you will encounter in files are 7-bit or 8-bit encodings. There is one
149 common 16-bit encoding, which is Unicode; this strives to represent all
150 the world's characters in a single large character set. 32-bit
151 encodings are often used internally in programs, such as XEmacs with
152 MULE support, to simplify the code that manipulates them; however, they
153 are not used externally because they are not very space-efficient.)
154
155 A general method of handling text using multiple character sets
156 (whether for multilingual text, or simply text in an extremely
157 complicated single language like Japanese) is defined in the
158 international standard ISO 2022. ISO 2022 will be discussed in more
159 detail later (@pxref{ISO 2022}), but for now suffice it to say that text
160 needs control functions (at least spacing), and if escape sequences are
161 to be used, an escape sequence introducer. It was decided to make all
162 text streams compatible with ASCII in the sense that the codes 0--31
163 (and 128--159) would always be control codes, never graphic characters,
164 and where defined by the character set the @samp{SPC} character would be
165 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are
166 94 code points remaining if 7 bits are used. This is the reason that
167 most character sets are defined using position codes in the range 1
168 through 94. Then ISO 2022 compatible encodings are produced by shifting
169 the position codes 1 to 94 into character codes 33 to 126, or (if 8-bit
170 codes are available) into character codes 161 to 254.
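The shift from position codes to character codes described in the new paragraph is simple arithmetic. The following Python sketch illustrates it under the assumption that position codes run from 1 through 94, as the revised text states; the function names are hypothetical.

```python
# Sketch only: maps an abstract position code (1..94) onto the two
# character-code ranges named in the text. Function names are hypothetical.


def to_7bit_code(position_code: int) -> int:
    """Shift a position code 1..94 into the range 33..126."""
    if not 1 <= position_code <= 94:
        raise ValueError("position code must be in 1..94")
    return position_code + 32


def to_8bit_code(position_code: int) -> int:
    """Shift a position code 1..94 into the range 161..254 (high bit set)."""
    return to_7bit_code(position_code) | 0x80


print(to_7bit_code(1), to_7bit_code(94))  # 33 126
print(to_8bit_code(1), to_8bit_code(94))  # 161 254
```

Codes 0 through 32, 127, and 128 through 160 are never produced, which is the point of the 94-position convention.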
122 171
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In 172 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in, 173 a @dfn{modal encoding}, there are multiple states that the encoding can
125 and the interpretation of the values in the stream depends on the 174 be in, and the interpretation of the values in the stream depends on the
126 current global state of the encoding. Special values in the encoding, 175 current global state of the encoding. Special values in the encoding,
127 called @dfn{escape sequences}, are used to change the global state. 176 called @dfn{escape sequences}, are used to change the global state.
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} 177 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
129 indicate that, from then on, bytes are to be interpreted as position 178 indicate that, from then on, bytes are to be interpreted as position
130 codes for JISX0208, rather than as ASCII. This effect is cancelled 179 codes for JIS X 0208, rather than as ASCII. This effect is cancelled
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the 180 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
132 current state is to ASCII''. To switch to JISX0212, the escape sequence 181 current state is to ASCII''. To switch to JIS X 0212, the escape
133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do 182 sequence @samp{ESC $ ( D} is used. (Note that here, as is common, the escape
134 in fact begin with @samp{ESC}. This is not necessarily the case, 183 sequences do in fact begin with @samp{ESC}. This is not necessarily the
135 however.) 184 case, however. Some encodings use control characters called ``locking
136 185 shifts'' (effect persists until cancelled) to switch character sets.)
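The modal behaviour of JIS can be observed with Python's standard `iso2022_jp` codec, used here purely as an illustration (it is not part of XEmacs). The helper `split_by_state` and the table `SWITCH` are hypothetical, and the sketch handles only the two escape sequences quoted above.

```python
# Sketch only: a tiny state tracker for the two escape sequences named in
# the text. split_by_state and SWITCH are hypothetical names.

SWITCH = {
    b"\x1b$B": "jisx0208",  # ESC $ B: following bytes are JIS X 0208 pairs
    b"\x1b(B": "ascii",     # ESC ( B: back to ASCII
}


def split_by_state(data: bytes):
    """Return (state, bytes) runs, one per character, tracking the global state."""
    state = "ascii"
    runs = []
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in SWITCH:
            state = SWITCH[seq]
            i += 3
            continue
        width = 2 if state == "jisx0208" else 1
        runs.append((state, data[i:i + width]))
        i += width
    return runs


# 'A', HIRAGANA LETTER A, 'B'
encoded = "A\u3042B".encode("iso2022_jp")
print(encoded)                  # b'A\x1b$B$"\x1b(BB'
print(split_by_state(encoded))  # [('ascii', b'A'), ('jisx0208', b'$"'), ('ascii', b'B')]
```

The same two bytes `$"` would be read as two ASCII punctuation characters if the preceding escape sequence were missed, which is what "global state" means in practice.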
137 A @dfn{non-modal encoding} has no global state that extends past the 186
187 A @dfn{non-modal encoding} has no global state that extends past the
138 character currently being interpreted. EUC, for example, is a 188 character currently being interpreted. EUC, for example, is a
139 non-modal encoding. Characters in JISX0208 are encoded by setting 189 non-modal encoding. Characters in JIS X 0208 are encoded by setting
140 the high bit of the position codes, and characters in JISX0212 are 190 the high bit of the position codes, and characters in JIS X 0212 are
141 encoded by doing the same but also prefixing the character with the 191 encoded by doing the same but also prefixing the character with the
142 byte 0x8F. 192 byte 0x8F.
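The EUC rule just described can be sketched in a few lines of Python. The sketch assumes position codes in the range 1 through 94 (the convention of the revised text); the function names are hypothetical, and the JIS X 0212 position codes passed in the last call are arbitrary sample values.

```python
# Sketch only: builds EUC byte sequences from position codes using the rule
# in the text. Function names are hypothetical.


def _high_bit_pair(pc1: int, pc2: int) -> bytes:
    """Shift two position codes (1..94) to 33..126, then set the high bit."""
    for pc in (pc1, pc2):
        if not 1 <= pc <= 94:
            raise ValueError("position code must be in 1..94")
    return bytes([(pc1 + 32) | 0x80, (pc2 + 32) | 0x80])


def euc_from_jisx0208(pc1: int, pc2: int) -> bytes:
    """JIS X 0208: two bytes, both with the high bit set."""
    return _high_bit_pair(pc1, pc2)


def euc_from_jisx0212(pc1: int, pc2: int) -> bytes:
    """JIS X 0212: the same two bytes, prefixed with the byte 0x8F."""
    return b"\x8f" + _high_bit_pair(pc1, pc2)


# HIRAGANA LETTER A sits at position codes (4, 2) in JIS X 0208.
print(euc_from_jisx0208(4, 2))                              # b'\xa4\xa2'
print(euc_from_jisx0208(4, 2) == "\u3042".encode("euc_jp"))  # True
print(euc_from_jisx0212(16, 1))                             # b'\x8f\xb0\xa1'
```

Because every non-ASCII byte has its high bit set, a decoder can classify any byte without knowing what preceded the current character.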
143 193
144 The advantage of a modal encoding is that it is generally more 194 The advantage of a modal encoding is that it is generally more
145 space-efficient, and is easily extendable because there are essentially 195 space-efficient, and is easily extendible because there are essentially
146 an arbitrary number of escape sequences that can be created. The 196 an arbitrary number of escape sequences that can be created. The
147 disadvantage, however, is that it is much more difficult to work with 197 disadvantage, however, is that it is much more difficult to work with
148 if it is not being processed in a sequential manner. In the non-modal 198 if it is not being processed in a sequential manner. In the non-modal
149 EUC encoding, for example, the byte 0x41 always refers to the letter 199 EUC encoding, for example, the byte 0x41 always refers to the letter
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or 200 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
151 one of the two position codes in a JISX0208 character, or one of the 201 one of the two position codes in a JIS X 0208 character, or one of the
152 two position codes in a JISX0212 character. Determining exactly which 202 two position codes in a JIS X 0212 character. Determining exactly which
153 one is meant could be difficult and time-consuming if the previous 203 one is meant could be difficult and time-consuming if the previous
154 bytes in the string have not already been processed. 204 bytes in the string have not already been processed, or impossible if
205 they are drawn from an external stream that cannot be rewound.
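The ambiguity of the byte 0x41 can be demonstrated with Python's standard `iso2022_jp` and `euc_jp` codecs, used here only as an illustration. The sample string is an assumption chosen because the second JIS X 0208 byte of HIRAGANA LETTER TI happens to be 0x41.

```python
# Sketch only: counts occurrences of the byte 0x41 ('A') in two encodings
# of the same text.

text = "A\u3061"  # 'A' followed by HIRAGANA LETTER TI

jis = text.encode("iso2022_jp")
euc = text.encode("euc_jp")

print(jis)             # b'A\x1b$B$A\x1b(B'
print(euc)             # b'A\xa4\xc1'
print(jis.count(b"A"))  # 2
print(euc.count(b"A"))  # 1
```

In the JIS stream only one of the two 0x41 bytes is the letter `A`, and telling them apart requires scanning back to the most recent escape sequence. In the EUC stream the only 0x41 byte is the letter itself.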
155 206
156 Non-modal encodings are further divided into @dfn{fixed-width} and 207 Non-modal encodings are further divided into @dfn{fixed-width} and
157 @dfn{variable-width} formats. A fixed-width encoding always uses 208 @dfn{variable-width} formats. A fixed-width encoding always uses
158 the same number of words per character, whereas a variable-width 209 the same number of words per character, whereas a variable-width
159 encoding does not. EUC is a good example of a variable-width 210 encoding does not. EUC is a good example of a variable-width
161 the character set. 16-bit and 32-bit encodings are nearly always 212 the character set. 16-bit and 32-bit encodings are nearly always
162 fixed-width, and this is in fact one of the main reasons for using 213 fixed-width, and this is in fact one of the main reasons for using
163 an encoding with a larger word size. The advantages of fixed-width 214 an encoding with a larger word size. The advantages of fixed-width
164 encodings should be obvious. The advantages of variable-width 215 encodings should be obvious. The advantages of variable-width
165 encodings are that they are generally more space-efficient and allow 216 encodings are that they are generally more space-efficient and allow
166 for compatibility with existing 8-bit encodings such as ASCII. 217 for compatibility with existing 8-bit encodings such as ASCII. (For
167 218 example, in Unicode ASCII characters are simply promoted to a 16-bit
168 Note that the bytes in an 8-bit encoding are often referred to 219 representation. That means that every ASCII character contains a
169 as @dfn{octets} rather than simply as bytes. This terminology 220 @samp{NUL} byte; evidently all of the standard string manipulation
170 dates back to the days before 8-bit bytes were universal, when 221 functions will lose badly in a fixed-width Unicode environment.)
171 some computers had 9-bit bytes, others had 10-bit bytes, etc. 222
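The remark about @samp{NUL} bytes can be checked directly. This Python sketch uses the standard `utf-16-be` codec as a stand-in for a fixed-width 16-bit representation; the variable `c_view` is a hypothetical name for what a routine that stops at the first zero byte would see.

```python
# Sketch only: every ASCII character promoted to 16 bits carries a zero byte.

wide = "Hi".encode("utf-16-be")
print(wide)       # b'\x00H\x00i'
print(len(wide))  # 4

# A NUL-terminated string routine stops at the first zero byte.
c_view = wide.split(b"\x00", 1)[0]
print(len(c_view))  # 0
```

With big-endian byte order the very first byte is zero, so such a routine sees an empty string; with little-endian order it would see only the first character.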
172 223 The bytes in an 8-bit encoding are often referred to as @dfn{octets}
173 @node Charsets 224 rather than simply as bytes. This terminology dates back to the days
225 before 8-bit bytes were universal, when some computers had 9-bit bytes,
226 others had 10-bit bytes, etc.
227
228 @node Charsets, MULE Characters, Internationalization Terminology, MULE
174 @section Charsets 229 @section Charsets
175 230
176 A @dfn{charset} in MULE is an object that encapsulates a 231 A @dfn{charset} in MULE is an object that encapsulates a
177 particular character set as well as an ordering of those characters. 232 particular character set as well as an ordering of those characters.
178 Charsets are permanent objects and are named using symbols, like 233 Charsets are permanent objects and are named using symbols, like
187 * Basic Charset Functions:: Functions for working with charsets. 242 * Basic Charset Functions:: Functions for working with charsets.
188 * Charset Property Functions:: Functions for accessing charset properties. 243 * Charset Property Functions:: Functions for accessing charset properties.
189 * Predefined Charsets:: Predefined charset objects. 244 * Predefined Charsets:: Predefined charset objects.
190 @end menu 245 @end menu
191 246
192 @node Charset Properties 247 @node Charset Properties, Basic Charset Functions, , Charsets
193 @subsection Charset Properties 248 @subsection Charset Properties
194 249
195 Charsets have the following properties: 250 Charsets have the following properties:
196 251
197 @table @code 252 @table @code
259 property. If a CCL program is defined, the position codes of a 314 property. If a CCL program is defined, the position codes of a
260 character will first be processed according to @code{graphic} and 315 character will first be processed according to @code{graphic} and
261 then passed through the CCL program, with the resulting values used 316 then passed through the CCL program, with the resulting values used
262 to index the font. 317 to index the font.
263 318
264 This is used, for example, in the Big5 character set (used in Taiwan). 319 This is used, for example, in the Big5 character set (used in Taiwan).
265 This character set is not ISO-2022-compliant, and its size (94x157) does 320 This character set is not ISO-2022-compliant, and its size (94x157) does
266 not fit within the maximum 96x96 size of ISO-2022-compliant character 321 not fit within the maximum 96x96 size of ISO-2022-compliant character
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, 322 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
268 so as to group the most commonly used characters together) into two 323 so as to group the most commonly used characters together) into two
269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, 324 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
270 and each charset object uses a CCL program to convert the modified 325 and each charset object uses a CCL program to convert the modified
271 position codes back into standard Big5 indices to retrieve a character 326 position codes back into standard Big5 indices to retrieve a character
272 from a Big5 font. 327 from a Big5 font.
273 @end table 328 @end table
274 329
275 Most of the above properties can only be changed when the charset 330 Most of the above properties can only be set when the charset is
276 is created. @xref{Charset Property Functions}. 331 initialized, and cannot be changed later.
277 332 @xref{Charset Property Functions}.
278 @node Basic Charset Functions 333
334 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets
279 @subsection Basic Charset Functions 335 @subsection Basic Charset Functions
280 336
281 @defun find-charset charset-or-name 337 @defun find-charset charset-or-name
282 This function retrieves the charset of the given name. If 338 This function retrieves the charset of the given name. If
283 @var{charset-or-name} is a charset object, it is simply returned. 339 @var{charset-or-name} is a charset object, it is simply returned.
296 This function returns a list of the names of all defined charsets. 352 This function returns a list of the names of all defined charsets.
297 @end defun 353 @end defun
298 354
299 @defun make-charset name doc-string props 355 @defun make-charset name doc-string props
300 This function defines a new character set. This function is for use 356 This function defines a new character set. This function is for use
301 with Mule support. @var{name} is a symbol, the name by which the 357 with MULE support. @var{name} is a symbol, the name by which the
302 character set is normally referred. @var{doc-string} is a string 358 character set is normally referred. @var{doc-string} is a string
303 describing the character set. @var{props} is a property list, 359 describing the character set. @var{props} is a property list,
304 describing the specific nature of the character set. The recognized 360 describing the specific nature of the character set. The recognized
305 properties are @code{registry}, @code{dimension}, @code{columns}, 361 properties are @code{registry}, @code{dimension}, @code{columns},
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and 362 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
324 This function returns the charset (if any) with the same dimension, 380 This function returns the charset (if any) with the same dimension,
325 number of characters, and final byte as @var{charset}, but which is 381 number of characters, and final byte as @var{charset}, but which is
326 displayed in the opposite direction. 382 displayed in the opposite direction.
327 @end defun 383 @end defun
328 384
329 @node Charset Property Functions 385 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets
330 @subsection Charset Property Functions 386 @subsection Charset Property Functions
331 387
332 All of these functions accept either a charset name or charset object. 388 All of these functions accept either a charset name or charset object.
333 389
334 @defun charset-property charset prop 390 @defun charset-property charset prop
335 This function returns property @var{prop} of @var{charset}. 391 This function returns property @var{prop} of @var{charset}.
336 @xref{Charset Properties}. 392 @xref{Charset Properties}.
337 @end defun 393 @end defun
338 394
339 Convenience functions are also provided for retrieving individual 395 Convenience functions are also provided for retrieving individual
340 properties of a charset. 396 properties of a charset.
341 397
342 @defun charset-name charset 398 @defun charset-name charset
343 This function returns the name of @var{charset}. This will be a symbol. 399 This function returns the name of @var{charset}. This will be a symbol.
344 @end defun 400 @end defun
384 @defun charset-ccl-program charset 440 @defun charset-ccl-program charset
385 This function returns the CCL program, if any, for converting 441 This function returns the CCL program, if any, for converting
386 position codes of characters in @var{charset} into font indices. 442 position codes of characters in @var{charset} into font indices.
387 @end defun 443 @end defun
388 444
389 The only property of a charset that can currently be set after 445 The only property of a charset that can currently be set after
390 the charset has been created is the CCL program. 446 the charset has been created is the CCL program.
391 447
392 @defun set-charset-ccl-program charset ccl-program 448 @defun set-charset-ccl-program charset ccl-program
393 This function sets the @code{ccl-program} property of @var{charset} to 449 This function sets the @code{ccl-program} property of @var{charset} to
394 @var{ccl-program}. 450 @var{ccl-program}.
395 @end defun 451 @end defun
396 452
397 @node Predefined Charsets 453 @node Predefined Charsets, , Charset Property Functions, Charsets
398 @subsection Predefined Charsets 454 @subsection Predefined Charsets
399 455
400 The following charsets are predefined in the C code. 456 The following charsets are predefined in the C code.
401 457
402 @example 458 @example
403 Name Type Fi Gr Dir Registry 459 Name Type Fi Gr Dir Registry
404 -------------------------------------------------------------- 460 --------------------------------------------------------------
405 ascii 94 B 0 l2r ISO8859-1 461 ascii 94 B 0 l2r ISO8859-1
426 chinese-big5-2 94x94 1 0 l2r Big5 482 chinese-big5-2 94x94 1 0 l2r Big5
427 korean-ksc5601 94x94 C 0 l2r KSC5601 483 korean-ksc5601 94x94 C 0 l2r KSC5601
428 composite 96x96 0 l2r --- 484 composite 96x96 0 l2r ---
429 @end example 485 @end example
430 486
431 The following charsets are predefined in the Lisp code. 487 The following charsets are predefined in the Lisp code.
432 488
433 @example 489 @example
434 Name Type Fi Gr Dir Registry 490 Name Type Fi Gr Dir Registry
435 -------------------------------------------------------------- 491 --------------------------------------------------------------
436 arabic-digit 94 2 0 l2r MuleArabic-0 492 arabic-digit 94 2 0 l2r MuleArabic-0
450 @end example 506 @end example
451 507
452 For all of the above charsets, the dimension and number of columns are 508 For all of the above charsets, the dimension and number of columns are
453 the same. 509 the same.
454 510
455 Note that ASCII, Control-1, and Composite are handled specially. 511 Note that ASCII, Control-1, and Composite are handled specially.
456 This is why some of the fields are blank; and some of the filled-in 512 This is why some of the fields are blank; and some of the filled-in
457 fields (e.g. the type) are not really accurate. 513 fields (e.g. the type) are not really accurate.
458 514
459 @node MULE Characters 515 @node MULE Characters, Composite Characters, Charsets, MULE
460 @section MULE Characters 516 @section MULE Characters
461 517
462 @defun make-char charset arg1 &optional arg2 518 @defun make-char charset arg1 &optional arg2
463 This function makes a multi-byte character from @var{charset} and octets 519 This function makes a multi-byte character from @var{charset} and octets
464 @var{arg1} and @var{arg2}. 520 @var{arg1} and @var{arg2}.
481 537
482 @defun find-charset-string string 538 @defun find-charset-string string
483 This function returns a list of the charsets in @var{string}. 539 This function returns a list of the charsets in @var{string}.
484 @end defun 540 @end defun
485 541
486 @node Composite Characters 542 @node Composite Characters, Coding Systems, MULE Characters, MULE
487 @section Composite Characters 543 @section Composite Characters
488 544
489 Composite characters are not yet completely implemented. 545 Composite characters are not yet completely implemented.
490 546
491 @defun make-composite-char string 547 @defun make-composite-char string
492 This function converts a string into a single composite character. The 548 This function converts a string into a single composite character. The
493 character is the result of overstriking all the characters in the 549 character is the result of overstriking all the characters in the
494 string. 550 string.
512 character into one or more characters, the individual characters out of 568 character into one or more characters, the individual characters out of
513 which the composite character was formed. Non-composite characters are 569 which the composite character was formed. Non-composite characters are
514 left as-is. @var{buffer} defaults to the current buffer if omitted. 570 left as-is. @var{buffer} defaults to the current buffer if omitted.
515 @end defun 571 @end defun
516 572
517 @node ISO 2022 573 @node Coding Systems, CCL, Composite Characters, MULE
518 @section ISO 2022
519
520 This section briefly describes the ISO 2022 encoding standard. For more
521 thorough understanding, please refer to the original document of ISO
522 2022.
523
524 Character sets (@dfn{charsets}) are classified into the following four
525 categories, according to the number of characters of charset:
526 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
527
528 @need 1000
529 @table @asis
530 @item 94-charset
531 ASCII(B), left(J) and right(I) half of JISX0201, ...
532 @item 96-charset
533 Latin-1(A), Latin-2(B), Latin-3(C), ...
534 @item 94x94-charset
535 GB2312(A), JISX0208(B), KSC5601(C), ...
536 @item 96x96-charset
537 none for the moment
538 @end table
539
540 The character in parentheses after the name of each charset
541 is the @dfn{final character} @var{F}, which can be regarded as
542 the identifier of the charset. ECMA allocates @var{F} to each
543 charset. @var{F} is in the range of 0x30..0x7F, but 0x30..0x3F
544 are only for private use.
545
546 Note: @dfn{ECMA} = European Computer Manufacturers Association
547
548 There are four @dfn{registers of charsets}, called G0 thru G3.
549 You can designate (or assign) any charset to one of these
550 registers.
551
552 The code space contained within one octet (of size 256) is divided into
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
554 register of charset can be invoked into.
555
556 @example
557 @group
558 C0: 0x00 - 0x1F
559 GL: 0x20 - 0x7F
560 C1: 0x80 - 0x9F
561 GR: 0xA0 - 0xFF
562 @end group
563 @end example
564
565 Usually, in the initial state, G0 is invoked into GL, and G1
566 is invoked into GR.
567
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
569 7-bit environments, only C0 and GL are used.
570
571 Charset designation is done by escape sequences of the form:
572
573 @example
574 ESC [@var{I}] @var{I} @var{F}
575 @end example
576
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset.
579
580 The meaning of intermediate characters are:
581
582 @example
583 @group
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
592 @end group
593 @end example
594
595 The following rule is not allowed in ISO 2022 but can be used in Mule.
596
597 @example
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
599 @end example
600
601 Here are examples of designations:
602
603 @example
604 @group
605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601
610 @end group
611 @end example
612
613 To use a charset designated to G2 or G3, and to use a charset designated
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
615 into GL. There are two types of invocation, Locking Shift (forever) and
616 Single Shift (one character only).
617
618 Locking Shift is done as follows:
619
620 @example
621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR
628 @end example
629
630 Single Shift is done as follows:
631
632 @example
633 @group
634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL
636 @end group
637 @end example
638
639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
641 ESC O behave as indicated. The above definitions will not parse
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
643 has similar problems.)
644
645 You may realize that there are a lot of ISO-2022-compliant ways of
646 encoding multilingual text. Now, in the world, there exist many coding
647 systems such as X11's Compound Text, Japanese JUNET code, and so-called
648 EUC (Extended UNIX Code); all of these are variants of ISO 2022.
649
650 In Mule, we characterize ISO 2022 by the following attributes:
651
652 @enumerate
653 @item
654 Initial designation to G0 thru G3.
655 @item
656 Allow designation of short form for Japanese and Chinese.
657 @item
658 Should we designate ASCII to G0 before control characters?
659 @item
660 Should we designate ASCII to G0 at the end of line?
661 @item
662 7-bit environment or 8-bit environment.
663 @item
664 Use Locking Shift or not.
665 @item
666 Use ASCII or JIS0201-1976-Roman.
667 @item
668 Use JISX0208-1983 or JISX0208-1976.
669 @end enumerate
670
671 (The last two are only for Japanese.)
672
673 By specifying these attributes, you can create any variant
674 of ISO 2022.
675
676 Here are several examples:
677
678 @example
679 @group
680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used
682 2. Yes.
683 3. Yes.
684 4. Yes.
685 5. 7-bit environment
686 6. No.
687 7. Use ASCII
688 8. Use JISX0208-1983
689 @end group
690
691 @group
692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
694 2. No.
695 3. No.
696 4. Yes.
697 5. 8-bit environment
698 6. No.
699 7. Use ASCII
700 8. Use JISX0208-1983
701 @end group
702
703 @group
704 euc-china -- Chinese EUC. Although many people call this
705 as "GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
707 2. No.
708 3. Yes.
709 4. Yes.
710 5. 8-bit environment
711 6. No.
712 7. Use ASCII
713 8. Use JISX0208-1983
714 @end group
715
716 @group
717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
719 2. No.
720 3. Yes.
721 4. Yes.
722 5. 7-bit environment
723 6. Yes.
724 7. No.
725 8. No.
726 @end group
727 @end example
728
729 Mule creates all these coding systems by default.
730
731 @node Coding Systems
732 @section Coding Systems 574 @section Coding Systems
733 575
734 A coding system is an object that defines how text containing multiple 576 A coding system is an object that defines how text containing multiple
735 character sets is encoded into a stream of (typically 8-bit) bytes. The 577 character sets is encoded into a stream of (typically 8-bit) bytes. The
736 coding system is used to decode the stream into a series of characters 578 coding system is used to decode the stream into a series of characters
737 (which may be from multiple charsets) when the text is read from a file 579 (which may be from multiple charsets) when the text is read from a file
738 or process, and is used to encode the text back into the same format 580 or process, and is used to encode the text back into the same format
739 when it is written out to a file or process. 581 when it is written out to a file or process.
740 582
741 For example, many ISO-2022-compliant coding systems (such as Compound 583 For example, many ISO-2022-compliant coding systems (such as Compound
742 Text, which is used for inter-client data under the X Window System) use 584 Text, which is used for inter-client data under the X Window System) use
743 escape sequences to switch between different charsets---Japanese Kanji, 585 escape sequences to switch between different charsets -- Japanese Kanji,
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with 586 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See 587 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
746 @code{make-coding-system} for more information. 588 @code{make-coding-system} for more information.
747 589
748 Coding systems are normally identified using a symbol, and the symbol is 590 Coding systems are normally identified using a symbol, and the symbol is
749 accepted in place of the actual coding system object whenever a coding 591 accepted in place of the actual coding system object whenever a coding
750 system is called for. (This is similar to how faces and charsets work.) 592 system is called for. (This is similar to how faces and charsets work.)
751 593
752 @defun coding-system-p object 594 @defun coding-system-p object
753 This function returns non-@code{nil} if @var{object} is a coding system. 595 This function returns non-@code{nil} if @var{object} is a coding system.
754 @end defun 596 @end defun
755 597
756 @menu 598 @menu
757 * Coding System Types:: Classifying coding systems. 599 * Coding System Types:: Classifying coding systems.
600 * ISO 2022:: An international standard for
601 charsets and encodings.
758 * EOL Conversion:: Dealing with different ways of denoting 602 * EOL Conversion:: Dealing with different ways of denoting
759 the end of a line. 603 the end of a line.
760 * Coding System Properties:: Properties of a coding system. 604 * Coding System Properties:: Properties of a coding system.
761 * Basic Coding System Functions:: Working with coding systems. 605 * Basic Coding System Functions:: Working with coding systems.
762 * Coding System Property Functions:: Retrieving a coding system's properties. 606 * Coding System Property Functions:: Retrieving a coding system's properties.
763 * Encoding and Decoding Text:: Encoding and decoding text. 607 * Encoding and Decoding Text:: Encoding and decoding text.
764 * Detection of Textual Encoding:: Determining how text is encoded. 608 * Detection of Textual Encoding:: Determining how text is encoded.
765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard 609 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
766 encodings. 610 encodings.
611 * Predefined Coding Systems:: Coding systems implemented by MULE.
767 @end menu 612 @end menu
768 613
769 @node Coding System Types 614 @node Coding System Types, ISO 2022, , Coding Systems
770 @subsection Coding System Types 615 @subsection Coding System Types
771 616
617 The coding system type determines the basic algorithm XEmacs will use to
618 decode or encode a data stream. Character encodings will be converted
619 to the MULE encoding, escape sequences processed, and newline sequences
620 converted to XEmacs's internal representation. There are three basic
621 classes of coding system type: no-conversion, ISO-2022, and special.
622
623 No conversion allows you to look at the file's internal representation.
624 Since XEmacs is basically a text editor, "no conversion" does convert
625 newline conventions by default. (Use the @code{binary} coding system if this
626 is not desired.)
627
628 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating
629 use of "coded character sets for the exchange of data", i.e., text
630 streams. ISO 2022 contains functions that make it possible to encode
631 text streams to comply with restrictions of the Internet mail system and
632 de facto restrictions of most file systems (e.g., use of the separator
633 character in file names). Coding systems which are not ISO 2022
634 conformant can be difficult to handle. Perhaps more important, they are
635 not adaptable to multilingual information interchange, with the obvious
636 exception of ISO 10646 (Unicode). (Unicode is partially supported by
637 XEmacs with the addition of the Lisp package ucs-conv.)
638
639 The special class of coding systems includes automatic detection, CCL (a
640 "little language" embedded as an interpreter, useful for translating
641 between variants of a single character set), non-ISO-2022-conformant
642 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding.
643 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly
644 for other versions of XEmacs and for GNU Emacs 20.)
645
772 @table @code 646 @table @code
773 @item nil 647 @item no-conversion
774 @itemx autodetect 648 No conversion, for binary files, and a few special cases of non-ISO-2022
649 coding systems where conversion is done by hook functions (usually
650 implemented in CCL). On output, graphic characters that are not in
651 ASCII or Latin-1 will be replaced by a @samp{?}. (For a
652 no-conversion-encoded buffer, these characters will only be present if
653 you explicitly insert them.)
654 @item iso2022
655 Any ISO-2022-compliant encoding. Among others, this includes JIS (the
656 Japanese encoding commonly used for e-mail), national variants of EUC
657 (the standard Unix encoding for Japanese and other languages), and
658 Compound Text (an encoding used in X11). You can specify more specific
659 information about the conversion with the @var{flags} argument.
660 @item ucs-4
661 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode.
662 @item utf-8
663 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format
664 that can be used with both UCS-4 and Unicode.
665 @item undecided
775 Automatic conversion. XEmacs attempts to detect the coding system used 666 Automatic conversion. XEmacs attempts to detect the coding system used
776 in the file. 667 in the file.
777 @item no-conversion
778 No conversion. Use this for binary files and such. On output, graphic
779 characters that are not in ASCII or Latin-1 will be replaced by a
780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
781 only be present if you explicitly insert them.)
782 @item shift-jis 668 @item shift-jis
783 Shift-JIS (a Japanese encoding commonly used in PC operating systems). 669 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
784 @item iso2022
785 Any ISO-2022-compliant encoding. Among other things, this includes JIS
786 (the Japanese encoding commonly used for e-mail), national variants of
787 EUC (the standard Unix encoding for Japanese and other languages), and
788 Compound Text (an encoding used in X11). You can specify more specific
789 information about the conversion with the @var{flags} argument.
790 @item big5 670 @item big5
791 Big5 (the encoding commonly used for Taiwanese). 671 Big5 (the encoding commonly used for Taiwanese).
792 @item ccl 672 @item ccl
793 The conversion is performed using a user-written pseudo-code program. 673 The conversion is performed using a user-written pseudo-code program.
794 CCL (Code Conversion Language) is the name of this pseudo-code. 674 CCL (Code Conversion Language) is the name of this pseudo-code. For
675 example, CCL is used to map KOI8-R characters (an encoding for Russian
676 Cyrillic) to ISO8859-5 (the form used internally by MULE).
795 @item internal 677 @item internal
796 Write out or read in the raw contents of the memory representing the 678 Write out or read in the raw contents of the memory representing the
797 buffer's text. This is primarily useful for debugging purposes, and is 679 buffer's text. This is primarily useful for debugging purposes, and is
798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set 680 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a 681 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
801 inconsistency in the memory representing a buffer's text, which will 683 inconsistency in the memory representing a buffer's text, which will
802 produce unpredictable results and may cause XEmacs to crash. Under 684 produce unpredictable results and may cause XEmacs to crash. Under
803 normal circumstances you should never use @code{internal} conversion. 685 normal circumstances you should never use @code{internal} conversion.
804 @end table 686 @end table
805 687
806 @node EOL Conversion 688 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems
689 @subsection ISO 2022
690
691 This section briefly describes the ISO 2022 encoding standard. A more
692 thorough treatment is available in the original document of ISO
693 2022 as well as various national standards (such as JIS X 0202).
694
695 Character sets (@dfn{charsets}) are classified into the following four
696 categories, according to the number of characters in the charset:
697 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
698 that although an ISO 2022 coding system may have variable width
699 characters, each charset used is fixed-width (in contrast to the MULE
700 character set and UTF-8, for example).
701
702 ISO 2022 provides for switching between character sets via escape
703 sequences. This switching is somewhat complicated, because ISO 2022
704 provides for both legacy applications like Internet mail that accept
705 only 7 significant bits in some contexts (RFC 822 headers, for example),
706 and more modern "8-bit clean" applications. It also provides for
707 compact and transparent representation of languages like Japanese which
708 mix ASCII and a national script (even outside of computer programs).
709
710 First, ISO 2022 codified prevailing practice by dividing the code space
711 into "control" and "graphic" regions. The code points 0x00-0x1F and
712 0x80-0x9F are reserved for "control characters", while "graphic
713 characters" must be assigned to code points in the regions 0x20-0x7F and
714 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
715 circumstances must be assigned the graphic character "ASCII SPACE" and
716 the control character "ASCII DEL" respectively.
717
718 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F),
719 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left"
720 and "graphic right", respectively, because of the standard method of
721 displaying graphic character sets in tables with the high nibble indexing
722 columns and the low nibble indexing rows. I don't find it very intuitive,
723 but these are called "registers".
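The division of the code space into four regions can be sketched outside of XEmacs as a small Python function. This is only an illustration of the ranges given above; the name `iso2022_region` is hypothetical and is not part of MULE or of any library.

```python
def iso2022_region(byte):
    """Return the name of the ISO 2022 code-space region containing byte.

    Illustrative helper only; the ranges are the ones listed in the text.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a single octet")
    if byte <= 0x1F:
        return "C0"  # control characters, low half
    if byte <= 0x7F:
        return "GL"  # graphic left (0x20 and 0x7F are special positions)
    if byte <= 0x9F:
        return "C1"  # control characters, high half
    return "GR"      # graphic right
```

For instance, the ESC character (0x1B) falls in C0 and the letter `A` (0x41) falls in GL.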
724
725 An ISO 2022-conformant encoding for a graphic character set must use a
726 fixed number of bytes per character, and the values must fit into a
727 single register; that is, each byte must range over either 0x20-0x7F, or
728 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
729 character set by using both ranges at the same time. This is why a standard
730 character set such as ISO 8859-1 is actually considered by ISO 2022 to
731 be an aggregation of two character sets, ASCII and LATIN-1, and why it
732 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
733 single character's bytes must all be drawn from the same register; this
734 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
735 2022-compatible encodings.
736
737 The reason for this restriction becomes clear when you attempt to define
738 an efficient, robust encoding for a language like Japanese. Like ISO
739 8859, Japanese encodings are aggregations of several character sets. In
740 practice, the vast majority of characters are drawn from the "JIS Roman"
741 character set (a derivative of ASCII; it won't hurt to think of it as
742 ASCII) and the JIS X 0208 standard "basic Japanese" character set
743 including not only ideographic characters ("kanji") but syllabic
744 Japanese characters ("kana"), a wide variety of symbols, and many
745 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although
746 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not
747 suited to programming; thus the inclusion of ASCII in the standard
748 Japanese encodings.
749
750 For normal Japanese text such as in newspapers, a broad repertoire of
751 approximately 3000 characters is used. Evidently this won't fit into
752 one byte; two must be used. But much of the text processed by Japanese
753 computers is computer source code, nearly all of which is ASCII. A not
754 insignificant portion of ordinary text is English (as such or as
755 borrowed Japanese vocabulary) or other languages which can be represented
756 at least approximately in ASCII, as well. It seems reasonable then to
757 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
758 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
759 invoked to the GL register, and JIS X 0208 is invoked to the GR
760 register. Thus, each byte can be tested for its character set by
761 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
762 Furthermore, since control characters like newline can never be part of
763 a graphic character, even in the case of corruption in transmission the
764 stream will be resynchronized at every line break, on the order of 60-80
765 bytes. This coding system requires no escape sequences or special
766 control codes to represent 99.9% of all Japanese text.
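The high-bit test described above can be sketched in Python, assuming a stream that contains only ASCII (in GL) and JIS X 0208 (in GR). The function name `split_euc_runs` is hypothetical, and the sketch deliberately rejects C1 bytes, so the single-shift codes that real EUC-JP uses for its subsidiary character sets are not handled.

```python
def split_euc_runs(data):
    """Split EUC-JP-style bytes into runs of ("ascii" | "jisx0208", bytes).

    Simplifying assumption: only G0 = ASCII (one byte, high bit clear) and
    G1 = JIS X 0208 (two bytes, both in GR) appear in the stream.
    """
    runs = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte >= 0xA0:
            # A GR byte starts a two-byte character; its partner must be GR too.
            if i + 1 >= len(data) or data[i + 1] < 0xA0:
                raise ValueError("truncated or malformed two-byte character")
            kind, width = "jisx0208", 2
        elif byte >= 0x80:
            raise ValueError("C1 byte (such as a single shift) is not handled here")
        else:
            kind, width = "ascii", 1
        chunk = data[i:i + width]
        if runs and runs[-1][0] == kind:
            # Extend the current run of the same character set.
            runs[-1] = (kind, runs[-1][1] + chunk)
        else:
            runs.append((kind, chunk))
        i += width
    return runs
```

Because a newline is a C0 control and can never be half of a two-byte character, a decoder written this way regains synchronization at each line break even after a corrupted byte.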
767
768 Note carefully the distinction between the character sets (ASCII and JIS
769 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The
770 JIS X 0208 character set is used in three different encodings for
771 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
772 always clear), in EUC-JP it is invoked into GR (setting the high bit in
773 the process), and in Shift JIS the high bit may be set or reset, and the
774 significant bits are shifted within the 16-bit character so that the two
775 main character sets can coexist with a third (the "halfwidth katakana"
776 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
777 version of the ISO-2022 coding system.
778
779 In order to systematically treat subsidiary character sets (like the
780 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
781 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
782 Unlike GL and GR, they are not logically distinguished by internal
783 format. Instead, the process of "invocation" mentioned earlier is
784 broken into two steps: first, a character set is @dfn{designated} to one
785 of the registers G0-G3 by use of an @dfn{escape sequence} of the form:
786
787 @example
788 ESC [@var{I}] @var{I} @var{F}
789 @end example
790
791 where @var{I} is an intermediate character or characters in the range
792 0x20 - 0x2F, and @var{F}, from the range 0x30-0x7E, is the final
793 character identifying this charset. (Final characters in the range
794 0x30-0x3F are reserved for private use and will never have a publicly
795 registered meaning.)
796
797 Then that register is @dfn{invoked} to either GL or GR, either
798 automatically (designations to G0 normally involve invocation to GL as
799 well), or by use of shifting (affecting only the following character in
800 the data stream) or locking (effective until the next designation or
801 locking) control sequences. An encoding conformant to ISO 2022 is
802 typically defined by designating the initial contents of the G0-G3
803 registers, specifying a 7- or 8-bit environment, and specifying whether
804 further designations will be recognized.
805
806 Some examples of character sets and the registered final characters
807 @var{F} used to designate them:
808
809 @need 1000
810 @table @asis
811 @item 94-charset
812 ASCII (B), left (J) and right (I) half of JIS X 0201, ...
813 @item 96-charset
814 Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
815 @item 94x94-charset
816 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
817 @item 96x96-charset
818 none for the moment
819 @end table
820
821 The meanings of the various characters in these sequences, where not
822 specified by the ISO 2022 standard (such as the ESC character), are
823 assigned by @dfn{ECMA}, the European Computer Manufacturers Association.
824
825 The meanings of the intermediate characters are:
826
827 @example
828 @group
829 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
830 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
831 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
832 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
833 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
834 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
835 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
836 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
837 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
838 @end group
839 @end example
840
841 The comma may be used in files read and written only by MULE, as a MULE
842 extension, but this is illegal in ISO 2022. (The reason is that in ISO
843 2022 G0 must be a 94-member character set, with 0x20 assigned the value
844 SPACE, and 0x7F assigned the value DEL.)
845
846 Here are examples of designations:
847
848 @example
849 @group
850 ESC ( B : designate to G0 ASCII
851 ESC - A : designate to G1 Latin-1
852 ESC $ ( A or ESC $ A : designate to G0 GB2312
853 ESC $ ( B or ESC $ B : designate to G0 JISX0208
854 ESC $ ) C : designate to G1 KSC5601
855 @end group
856 @end example
857
858 (The short forms used to designate GB2312 and JIS X 0208 are for
859 backwards compatibility; the long forms are preferred.)
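The designation rules above can be sketched as a small Python parser. This is not how MULE implements them; `parse_designation` is a hypothetical name, the sketch accepts the MULE-only comma form, and it assumes that a short form may carry any final character even though the text mentions short forms only for GB2312 and JIS X 0208.

```python
def parse_designation(seq):
    """Interpret a designation of the form ESC [I] I F.

    Returns (register, charset_size, dimension, final), for example
    ("G0", 94, 2, "B") for ESC $ ( B. Illustrative sketch only.
    """
    if len(seq) < 3 or seq[0] != "\x1b":
        raise ValueError("not a designation sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == "$":
        # "$" marks a charset of dimension 2 (94x94 or 96x96).
        dimension = 2
        body = body[1:]
    if len(body) == 1 and dimension == 2:
        # Short form such as ESC $ B: a 94x94 charset designated to G0.
        register, size, final = 0, 94, body
    elif len(body) == 2 and body[0] in "()*+":
        # ( ) * + designate a 94-charset to G0, G1, G2, G3 respectively.
        register, size, final = "()*+".index(body[0]), 94, body[1]
    elif len(body) == 2 and body[0] in ",-./":
        # , - . / designate a 96-charset; the comma (G0) is a MULE extension.
        register, size, final = ",-./".index(body[0]), 96, body[1]
    else:
        raise ValueError("unrecognized designation")
    # 0x7F is DEL, so this sketch accepts finals only up to 0x7E.
    if not 0x30 <= ord(final) <= 0x7E:
        raise ValueError("final character out of range")
    return ("G%d" % register, size, dimension, final)
```

Applied to the example sequences listed above, `parse_designation("\x1b-A")` yields `("G1", 96, 1, "A")`, and both the long form `"\x1b$(B"` and the short form `"\x1b$B"` yield `("G0", 94, 2, "B")`.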
860
861 To use a charset designated to G2 or G3, and to use a charset designated
862 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
863 into GL. There are two types of invocation, Locking Shift (forever) and
864 Single Shift (one character only).
865
866 Locking Shift is done as follows:
867
868 @example
869 LS0 or SI (0x0F): invoke G0 into GL
870 LS1 or SO (0x0E): invoke G1 into GL
871 LS2: invoke G2 into GL
872 LS3: invoke G3 into GL
873 LS1R: invoke G1 into GR
874 LS2R: invoke G2 into GR
875 LS3R: invoke G3 into GR
876 @end example
877
878 Single Shift is done as follows:
879
880 @example
881 @group
882 SS2 or ESC N: invoke G2 into GL
883 SS3 or ESC O: invoke G3 into GL
884 @end group
885 @end example
886
887 The shift functions (such as LS1R and SS3) are represented by control
888 characters (from C1) in 8 bit environments and by escape sequences in 7
889 bit environments.
890
891 (#### Ben says: I think the above is slightly incorrect. It appears that
892 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
893 ESC O behave as indicated. The above definitions will not parse
894 EUC-encoded text correctly, and it looks like the code in mule-coding.c
895 has similar problems.)
896
897 Evidently there are many ISO-2022-compliant ways of encoding
898 multilingual text. Many coding systems in current use, such as X11's
899 Compound Text, the Japanese JUNET code, and the so-called EUC
900 (Extended UNIX Code), are variants of ISO 2022.
901
902 In MULE, we characterize a version of ISO 2022 by the following
903 attributes:
904
905 @enumerate
906 @item
907 The character sets initially designated to G0 through G3.
908 @item
909 Whether short form designations are allowed for Japanese and Chinese.
910 @item
911 Whether ASCII should be designated to G0 before control characters.
912 @item
913 Whether ASCII should be designated to G0 at the end of line.
914 @item
915 7-bit environment or 8-bit environment.
916 @item
917 Whether Locking Shifts are used or not.
918 @item
919 Whether to use ASCII or the variant JIS X 0201-1976-Roman.
920 @item
921 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.
922 @end enumerate
923
924 (The last two are only for Japanese.)
925
926 By specifying these attributes, you can create any variant
927 of ISO 2022.
928
929 Here are several examples:
930
931 @example
932 @group
933 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
934 1. G0 <- ASCII, G1..3 <- never used
935 2. Yes.
936 3. Yes.
937 4. Yes.
938 5. 7-bit environment
939 6. No.
940 7. Use ASCII
941 8. Use JIS X 0208-1983
942 @end group
943
944 @group
945 ctext -- X11 Compound Text
946 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
947 2. No.
948 3. No.
949 4. Yes.
950 5. 8-bit environment.
951 6. No.
952 7. Use ASCII.
953 8. Use JIS X 0208-1983.
954 @end group
955
956 @group
957 euc-china -- Chinese EUC. Often called the "GB encoding", but that is
958 technically incorrect.
959 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
960 2. No.
961 3. Yes.
962 4. Yes.
963 5. 8-bit environment.
964 6. No.
965 7. Use ASCII.
966 8. Use JIS X 0208-1983.
967 @end group
968
969 @group
970 ISO-2022-KR -- Coding system used in Korean email.
971 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
972 2. No.
973 3. Yes.
974 4. Yes.
975 5. 7-bit environment.
976 6. Yes.
977 7. Use ASCII.
978 8. Use JIS X 0208-1983.
979 @end group
980 @end example
981
982 MULE creates all of these coding systems by default.
983
984 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems
807 @subsection EOL Conversion 985 @subsection EOL Conversion
808 986
809 @table @code 987 @table @code
810 @item nil 988 @item nil
811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also 989 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
828 Automatically detect the end-of-line type but do not generate subsidiary 1006 Automatically detect the end-of-line type but do not generate subsidiary
829 coding systems. (This value is converted to @code{nil} when stored 1007 coding systems. (This value is converted to @code{nil} when stored
830 internally, and @code{coding-system-property} will return @code{nil}.) 1008 internally, and @code{coding-system-property} will return @code{nil}.)
831 @end table 1009 @end table
832 1010
833 @node Coding System Properties 1011 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems
834 @subsection Coding System Properties 1012 @subsection Coding System Properties
835 1013
836 @table @code 1014 @table @code
837 @item mnemonic 1015 @item mnemonic
838 String to be displayed in the modeline when this coding system is 1016 String to be displayed in the modeline when this coding system is
839 active. 1017 active.
840 1018
841 @item eol-type 1019 @item eol-type
842 End-of-line conversion to be used. It should be one of the types 1020 End-of-line conversion to be used. It should be one of the types
843 listed in @ref{EOL Conversion}. 1021 listed in @ref{EOL Conversion}.
1022
1023 @item eol-lf
1024 The coding system which is the same as this one, except that it uses the
1025 Unix line-breaking convention.
1026
1027 @item eol-crlf
1028 The coding system which is the same as this one, except that it uses the
1029 DOS line-breaking convention.
1030
1031 @item eol-cr
1032 The coding system which is the same as this one, except that it uses the
1033 Macintosh line-breaking convention.
844 1034
845 @item post-read-conversion 1035 @item post-read-conversion
846 Function called after a file has been read in, to perform the decoding. 1036 Function called after a file has been read in, to perform the decoding.
847 Called with two arguments, @var{beg} and @var{end}, denoting a region of 1037 Called with two arguments, @var{beg} and @var{end}, denoting a region of
848 the current buffer to be decoded. 1038 the current buffer to be decoded.
851 Function called before a file is written out, to perform the encoding. 1041 Function called before a file is written out, to perform the encoding.
852 Called with two arguments, @var{beg} and @var{end}, denoting a region of 1042 Called with two arguments, @var{beg} and @var{end}, denoting a region of
853 the current buffer to be encoded. 1043 the current buffer to be encoded.
854 @end table 1044 @end table
855 1045
856 The following additional properties are recognized if @var{type} is 1046 The following additional properties are recognized if @var{type} is
857 @code{iso2022}: 1047 @code{iso2022}:
858 1048
859 @table @code 1049 @table @code
860 @item charset-g0 1050 @item charset-g0
861 @itemx charset-g1 1051 @itemx charset-g1
929 A list of conversion specifications, specifying conversion of characters 1119 A list of conversion specifications, specifying conversion of characters
930 in one charset to another when encoding is performed. The form of each 1120 in one charset to another when encoding is performed. The form of each
931 specification is the same as for @code{input-charset-conversion}. 1121 specification is the same as for @code{input-charset-conversion}.
932 @end table 1122 @end table
933 1123
934 The following additional properties are recognized (and required) if 1124 The following additional properties are recognized (and required) if
935 @var{type} is @code{ccl}: 1125 @var{type} is @code{ccl}:
936 1126
937 @table @code 1127 @table @code
938 @item decode 1128 @item decode
939 CCL program used for decoding (converting to internal format). 1129 CCL program used for decoding (converting to internal format).
940 1130
941 @item encode 1131 @item encode
942 CCL program used for encoding (converting to external format). 1132 CCL program used for encoding (converting to external format).
943 @end table 1133 @end table
944 1134
945 @node Basic Coding System Functions 1135 The following properties are used internally: @var{eol-cr},
1136 @var{eol-crlf}, @var{eol-lf}, and @var{base}.
1137
1138 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems
946 @subsection Basic Coding System Functions 1139 @subsection Basic Coding System Functions
947 1140
948 @defun find-coding-system coding-system-or-name 1141 @defun find-coding-system coding-system-or-name
949 This function retrieves the coding system of the given name. 1142 This function retrieves the coding system of the given name.
950 1143
951 If @var{coding-system-or-name} is a coding-system object, it is simply 1144 If @var{coding-system-or-name} is a coding-system object, it is simply
952 returned. Otherwise, @var{coding-system-or-name} should be a symbol. 1145 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
953 If there is no such coding system, @code{nil} is returned. Otherwise 1146 If there is no such coding system, @code{nil} is returned. Otherwise
954 the associated coding system object is returned. 1147 the associated coding system object is returned.
955 @end defun 1148 @end defun
956 1149
966 1159
967 @defun coding-system-name coding-system 1160 @defun coding-system-name coding-system
968 This function returns the name of the given coding system. 1161 This function returns the name of the given coding system.
969 @end defun 1162 @end defun
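
The two lookups above can be combined as in the following minimal
sketch. It assumes that a coding system named @code{iso-2022-jp} has
been defined (as it normally is in a MULE-enabled XEmacs); the helper
name @code{my-coding-system-name-or-nil} is hypothetical and not part of
XEmacs.

@example
@group
;; Return the name of the coding system called NAME, or nil if no
;; coding system of that name exists.
(defun my-coding-system-name-or-nil (name)
  (let ((cs (find-coding-system name)))
    (and cs (coding-system-name cs))))

(my-coding-system-name-or-nil 'iso-2022-jp)
(my-coding-system-name-or-nil 'no-such-coding-system)
@end group
@end example

Because @code{find-coding-system} returns @code{nil} for an unknown
name, the second call returns @code{nil} instead of signaling an error.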
970 1163
1164 @defun coding-system-base coding-system
1165 This function returns the base coding system of @var{coding-system}, that
1166 is, the variant with an undecided EOL convention.
1167 @end defun
1168
971 @defun make-coding-system name type &optional doc-string props 1169 @defun make-coding-system name type &optional doc-string props
972 This function registers symbol @var{name} as a coding system. 1170 This function registers symbol @var{name} as a coding system.
973 1171
974 @var{type} describes the conversion method used and should be one of 1172 @var{type} describes the conversion method used and should be one of
975 the types listed in @ref{Coding System Types}. 1173 the types listed in @ref{Coding System Types}.
990 @defun subsidiary-coding-system coding-system eol-type 1188 @defun subsidiary-coding-system coding-system eol-type
991 This function returns the subsidiary coding system of 1189 This function returns the subsidiary coding system of
992 @var{coding-system} with eol type @var{eol-type}. 1190 @var{coding-system} with eol type @var{eol-type}.
993 @end defun 1191 @end defun
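
As an illustration, the following sketch asks for the CRLF variant of
@code{euc-jp}. It assumes that @code{crlf} is one of the EOL types
listed in @ref{EOL Conversion} and that @code{euc-jp} is defined in the
running XEmacs.

@example
@group
;; The subsidiary of euc-jp that uses the DOS line-breaking convention.
(subsidiary-coding-system (find-coding-system 'euc-jp) 'crlf)
@end group
@end example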
994 1192
995 @node Coding System Property Functions 1193 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems
996 @subsection Coding System Property Functions 1194 @subsection Coding System Property Functions
997 1195
998 @defun coding-system-doc-string coding-system 1196 @defun coding-system-doc-string coding-system
999 This function returns the doc string for @var{coding-system}. 1197 This function returns the doc string for @var{coding-system}.
1000 @end defun 1198 @end defun
1005 1203
1006 @defun coding-system-property coding-system prop 1204 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}. 1205 This function returns the @var{prop} property of @var{coding-system}.
1008 @end defun 1206 @end defun
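
For example, the sketch below collects two of the properties described
in @ref{Coding System Properties}. It assumes that @code{euc-jp} is
defined; the values returned depend on how that coding system was
created.

@example
@group
;; Collect the modeline mnemonic and the EOL type of euc-jp.
(let ((cs (find-coding-system 'euc-jp)))
  (list (coding-system-property cs 'mnemonic)
        (coding-system-property cs 'eol-type)))
@end group
@end example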
1009 1207
1010 @node Encoding and Decoding Text 1208 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems
1011 @subsection Encoding and Decoding Text 1209 @subsection Encoding and Decoding Text
1012 1210
1013 @defun decode-coding-region start end coding-system &optional buffer 1211 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which 1212 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}. This is useful if you've read in 1213 is encoded in @var{coding-system}. This is useful if you've read in
1026 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS 1224 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding. The length of the encoded text is returned. @var{buffer} 1225 encoding. The length of the encoded text is returned. @var{buffer}
1028 defaults to the current buffer if unspecified. 1226 defaults to the current buffer if unspecified.
1029 @end defun 1227 @end defun
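
A typical use, sketched under the assumption that the current buffer
was read without conversion and really contains EUC-JP text, is to
decode the whole buffer in place:

@example
@group
;; Reinterpret the entire current buffer as EUC-JP text.
(let ((cs (find-coding-system 'euc-jp)))
  (when cs
    (decode-coding-region (point-min) (point-max) cs)))
@end group
@end example

Calling @code{encode-coding-region} with the same arguments performs the
opposite conversion.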
1030 1228
1031 @node Detection of Textual Encoding 1229 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems
1032 @subsection Detection of Textual Encoding 1230 @subsection Detection of Textual Encoding
1033 1231
1034 @defun coding-category-list 1232 @defun coding-category-list
1035 This function returns a list of all recognized coding categories. 1233 This function returns a list of all recognized coding categories.
1036 @end defun 1234 @end defun
1062 returns @code{autodetect} or one of its subsidiary coding systems 1260 returns @code{autodetect} or one of its subsidiary coding systems
1063 according to a detected end-of-line type. Optional arg @var{buffer} 1261 according to a detected end-of-line type. Optional arg @var{buffer}
1064 defaults to the current buffer. 1262 defaults to the current buffer.
1065 @end defun 1263 @end defun
1066 1264
1067 @node Big5 and Shift-JIS Functions 1265 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems
1068 @subsection Big5 and Shift-JIS Functions 1266 @subsection Big5 and Shift-JIS Functions
1069 1267
1070 These are special functions for working with the non-standard 1268 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings. 1269 Shift-JIS and Big5 encodings.
1072 1270
1073 @defun decode-shift-jis-char code 1271 @defun decode-shift-jis-char code
1074 This function decodes a JISX0208 character of Shift-JIS coding-system. 1272 This function decodes a JIS X 0208 character of Shift-JIS coding-system.
1075 @var{code} is the character code in Shift-JIS as a cons of two bytes. 1273 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1076 The corresponding character is returned. 1274 The corresponding character is returned.
1077 @end defun 1275 @end defun
1078 1276
1079 @defun encode-shift-jis-char ch 1277 @defun encode-shift-jis-char ch
1080 This function encodes a JISX0208 character @var{ch} to SHIFT-JIS 1278 This function encodes a JIS X 0208 character @var{ch} to SHIFT-JIS
1081 coding-system. The corresponding character code in SHIFT-JIS is 1279 coding-system. The corresponding character code in SHIFT-JIS is
1082 returned as a cons of two bytes. 1280 returned as a cons of two bytes.
1083 @end defun 1281 @end defun
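
The two Shift-JIS functions are inverses of each other, as the following
sketch tries to show. It assumes that @code{make-char} accepts the
charset @code{japanese-jisx0208} with two position codes; the codes 36
and 34 are chosen only for illustration.

@example
@group
;; Build a JIS X 0208 character, encode it as Shift-JIS, and decode
;; the resulting cons of two bytes again.
(let* ((ch (make-char 'japanese-jisx0208 36 34))
       (sjis (encode-shift-jis-char ch)))
  (eq ch (decode-shift-jis-char sjis)))
@end group
@end example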
1084 1282
1085 @defun decode-big5-char code 1283 @defun decode-big5-char code
1091 @defun encode-big5-char ch 1289 @defun encode-big5-char ch
1092 This function encodes the Big5 character @var{ch} to BIG5 1290 This function encodes the Big5 character @var{ch} to BIG5
1093 coding-system. The corresponding character code in Big5 is returned. 1291 coding-system. The corresponding character code in Big5 is returned.
1094 @end defun 1292 @end defun
1095 1293
1294 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems
1295 @subsection Coding Systems Implemented
1296
1297 MULE initializes most of the commonly used coding systems at XEmacs's
1298 startup. A few others are initialized only when the relevant language
1299 environment is selected and support libraries are loaded. (NB: The
1300 following list is based on XEmacs 21.2.19, the development branch at the
1301 time of writing. The list may be somewhat different for other
1302 versions. Recent versions of GNU Emacs 20 implement a few more rare
1303 coding systems; work is being done to port these to XEmacs.)
1304
1305 Unfortunately, there is not a consistent naming convention for character
1306 sets, and for practical purposes coding systems often take their name
1307 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others
1308 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
1309 from their non-text usages (internal, binary). To provide for this, and
1310 for the fact that many coding systems have several common names, an
1311 aliasing system is provided. Finally, some effort has been made to use
1312 names that are registered as MIME charsets (this is why the name
1313 'shift_jis contains that un-Lisp-y underscore).
1314
1315 There is a systematic naming convention regarding end-of-line (EOL)
1316 conventions for different systems. A coding system whose name ends in
1317 "-unix" forces the assumption that lines are broken by newlines (0x0A).
1318 A coding system whose name ends in "-mac" forces the assumption that
1319 lines are broken by ASCII CRs (0x0D). A coding system whose name ends
1320 in "-dos" forces the assumption that lines are broken by CRLF sequences
1321 (0x0D 0x0A). These subsidiary coding systems are automatically derived
1322 from a base coding system. Use of the base coding system implies
1323 autodetection of the text file convention. (The fact that the -unix,
1324 -mac, and -dos are derived from a base system results in them showing up
1325 as "aliases" in `list-coding-systems'.) These subsidiaries have a
1326 consistent modeline indicator as well. "-dos" coding systems have ":T"
1327 appended to their modeline indicator, while "-mac" coding systems have
1328 ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
1329
1330 In the following table, each coding system is given with its mode line
1331 indicator in parentheses. Non-textual coding systems are listed first,
1332 followed by textual coding systems and their aliases. (The coding system
1333 subsidiary modeline indicators ":T" and ":t" will be omitted from the
1334 table of coding systems.)
1335
1336 ### SJT 1999-08-23 Maybe should order these by language? Definitely
1337 need language usage for the ISO-8859 family.
1338
1339 Note that although true coding system aliases have been implemented for
1340 XEmacs 21.2, the coding system initialization has not yet been converted
1341 as of 21.2.19. So coding systems described as aliases have the same
1342 properties as the aliased coding system, but will not be equal as Lisp
1343 objects.
1344
1345 @table @code
1346
1347 @item automatic-conversion
1348 @itemx undecided
1349 @itemx undecided-dos
1350 @itemx undecided-mac
1351 @itemx undecided-unix
1352
1353 Modeline indicator: @code{Auto}. A type @code{undecided} coding system.
1354 Attempts to determine an appropriate coding system from file contents or
1355 the environment.
1356
1357 @item raw-text
1358 @itemx no-conversion
1359 @itemx raw-text-dos
1360 @itemx raw-text-mac
1361 @itemx raw-text-unix
1362 @itemx no-conversion-dos
1363 @itemx no-conversion-mac
1364 @itemx no-conversion-unix
1365
1366 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system,
1367 which converts only line-break-codes. An implementation quirk means
1368 that this coding system is also used for ISO8859-1.
1369
1370 @item binary
1371 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding
1372 system which does no character coding or EOL conversions. An alias for
1373 @code{raw-text-unix}.
1374
1375 @item alternativnyj
1376 @itemx alternativnyj-dos
1377 @itemx alternativnyj-mac
1378 @itemx alternativnyj-unix
1379
1380 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for
1381 Alternativnyj, an encoding of the Cyrillic alphabet.
1382
1383 @item big5
1384 @itemx big5-dos
1385 @itemx big5-mac
1386 @itemx big5-unix
1387
1388 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for
1389 BIG5, the most common encoding of traditional Chinese as used in Taiwan.
1390
1391 @item cn-gb-2312
1392 @itemx cn-gb-2312-dos
1393 @itemx cn-gb-2312-mac
1394 @itemx cn-gb-2312-unix
1395
1396 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used
1397 for simplified Chinese (as used in the People's Republic of China), with
1398 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng}
1399 (G2) character sets initially designated. Chinese EUC (Extended Unix
1400 Code).
1401
1402 @item ctext-hebrew
1403 @itemx ctext-hebrew-dos
1404 @itemx ctext-hebrew-mac
1405 @itemx ctext-hebrew-unix
1406
1407 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system
1408 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character
1409 sets initially designated for Hebrew.
1410
1411 @item ctext
1412 @itemx ctext-dos
1413 @itemx ctext-mac
1414 @itemx ctext-unix
1415
1416 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system
1417 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
1418 sets initially designated. X11 Compound Text Encoding. Often
1419 mistakenly recognized instead of EUC encodings; usual cause is
1420 inappropriate setting of @code{coding-priority-list}.
1421
1422 @item escape-quoted
1423
1424 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding
1425 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
1426 character sets initially designated and escape quoting. Unix EOL
1427 conversion (i.e., no conversion). It is used for @file{.elc} files.
1428
1429 @item euc-jp
1430 @itemx euc-jp-dos
1431 @itemx euc-jp-mac
1432 @itemx euc-jp-unix
1433
1434 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system
1435 with @code{ascii} (G0), @code{japanese-jisx0208} (G1),
1436 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3)
1437 initially designated. Japanese EUC (Extended Unix Code).
1438
1439 @item euc-kr
1440 @itemx euc-kr-dos
1441 @itemx euc-kr-mac
1442 @itemx euc-kr-unix
1443
1444 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system
1445 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1446 designated. Korean EUC (Extended Unix Code).
1447
1448 @item hz-gb-2312
1449 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding
1450 system with Unix EOL convention (i.e., no conversion) using
1451 post-read-decode and pre-write-encode functions to translate the Hz/ZW
1452 coding system used for Chinese.
1453
1454 @item iso-2022-7bit
1455 @itemx iso-2022-7bit-unix
1456 @itemx iso-2022-7bit-dos
1457 @itemx iso-2022-7bit-mac
1458 @itemx iso-2022-7
1459
1460 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system
1461 with @code{ascii} (G0) initially designated. Other character sets must
1462 be explicitly designated to be used.
1463
1464 @item iso-2022-7bit-ss2
1465 @itemx iso-2022-7bit-ss2-dos
1466 @itemx iso-2022-7bit-ss2-mac
1467 @itemx iso-2022-7bit-ss2-unix
1468
1469 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system
1470 with @code{ascii} (G0) initially designated. Other character sets must
1471 be explicitly designated to be used. SS2 is used to invoke a
1472 96-charset, one character at a time.
1473
1474 @item iso-2022-8
1475 @itemx iso-2022-8-dos
1476 @itemx iso-2022-8-mac
1477 @itemx iso-2022-8-unix
1478
1479 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system
1480 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1481 designated. Other character sets must be explicitly designated to be
1482 used. No single-shift or locking-shift.
1483
1484 @item iso-2022-8bit-ss2
1485 @itemx iso-2022-8bit-ss2-dos
1486 @itemx iso-2022-8bit-ss2-mac
1487 @itemx iso-2022-8bit-ss2-unix
1488
1489 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system
1490 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1491 designated. Other character sets must be explicitly designated to be
1492 used. SS2 is used to invoke a 96-charset, one character at a time.
1493
1494 @item iso-2022-int-1
1495 @itemx iso-2022-int-1-dos
1496 @itemx iso-2022-int-1-mac
1497 @itemx iso-2022-int-1-unix
1498
1499 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system
1500 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1501 designated. ISO-2022-INT-1.
1502
1503 @item iso-2022-jp-1978-irv
1504 @itemx iso-2022-jp-1978-irv-dos
1505 @itemx iso-2022-jp-1978-irv-mac
1506 @itemx iso-2022-jp-1978-irv-unix
1507
1508 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding
1509 system. For compatibility with old Japanese terminals; if you need to
1510 know, look at the source.
1511
1512 @item iso-2022-jp
1513 @itemx iso-2022-jp-2 (ISO7/SS)
1514 @itemx iso-2022-jp-dos
1515 @itemx iso-2022-jp-mac
1516 @itemx iso-2022-jp-unix
1517 @itemx iso-2022-jp-2-dos
1518 @itemx iso-2022-jp-2-mac
1519 @itemx iso-2022-jp-2-unix
1520
1521 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding
1522 system with @code{ascii} (G0) initially designated, and complex
1523 specifications to ensure backward compatibility with old Japanese
1524 systems. Used for communication with mail and news in Japan. The "-2"
1525 versions also use SS2 to invoke a 96-charset one character at a time.
1526
1527 @item iso-2022-kr
1528 Modeline indicator: @code{Ko/7bit}. A type @code{iso2022} 7-bit coding
1529 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1530 designated. Used for e-mail in Korea.
1531
1532 @item iso-2022-lock
1533 @itemx iso-2022-lock-dos
1534 @itemx iso-2022-lock-mac
1535 @itemx iso-2022-lock-unix
1536
1537 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding
1538 system with @code{ascii} (G0) initially designated, using Locking-Shift
1539 to invoke a 96-charset.
1540
1541 @item iso-8859-1
1542 @itemx iso-8859-1-dos
1543 @itemx iso-8859-1-mac
1544 @itemx iso-8859-1-unix
1545
1546 Due to implementation, this is not a type @code{iso2022} coding system,
1547 but rather an alias for the @code{raw-text} coding system.
1548
1549 @item iso-8859-2
1550 @itemx iso-8859-2-dos
1551 @itemx iso-8859-2-mac
1552 @itemx iso-8859-2-unix
1553
1554 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding
1555 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially
1556 invoked.
1557
1558 @item iso-8859-3
1559 @itemx iso-8859-3-dos
1560 @itemx iso-8859-3-mac
1561 @itemx iso-8859-3-unix
1562
1563 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system
1564 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially
1565 invoked.
1566
1567 @item iso-8859-4
1568 @itemx iso-8859-4-dos
1569 @itemx iso-8859-4-mac
1570 @itemx iso-8859-4-unix
1571
1572 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system
1573 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially
1574 invoked.
1575
1576 @item iso-8859-5
1577 @itemx iso-8859-5-dos
1578 @itemx iso-8859-5-mac
1579 @itemx iso-8859-5-unix
1580
1581 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with
1582 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked.
1583
1584 @item iso-8859-7
1585 @itemx iso-8859-7-dos
1586 @itemx iso-8859-7-mac
1587 @itemx iso-8859-7-unix
1588
1589 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with
1590 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked.
1591
1592 @item iso-8859-8
1593 @itemx iso-8859-8-dos
1594 @itemx iso-8859-8-mac
1595 @itemx iso-8859-8-unix
1596
1597 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with
1598 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked.
1599
1600 @item iso-8859-9
1601 @itemx iso-8859-9-dos
1602 @itemx iso-8859-9-mac
1603 @itemx iso-8859-9-unix
1604
1605 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system
1606 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially
1607 invoked.
1608
1609 @item koi8-r
1610 @itemx koi8-r-dos
1611 @itemx koi8-r-mac
1612 @itemx koi8-r-unix
1613
1614 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for
1615 KOI8-R, an encoding of the Cyrillic alphabet.
1616
1617 @item shift_jis
1618 @itemx shift_jis-dos
1619 @itemx shift_jis-mac
1620 @itemx shift_jis-unix
1621
1622 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system
1623 implementing the Shift-JIS encoding for Japanese. The underscore is to
1624 conform to the MIME charset implementing this encoding.
1625
1626 @item tis-620
1627 @itemx tis-620-dos
1628 @itemx tis-620-mac
1629 @itemx tis-620-unix
1630
1631 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The
1632 external encoding is defined by TIS620, the internal encoding is
1633 peculiar to MULE, and called @code{thai-xtis}.
1634
1635 @item viqr
1636
1637 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding
1638 system with Unix EOL convention (i.e., no conversion) using
1639 post-read-decode and pre-write-encode functions to translate the VIQR
1640 coding system for Vietnamese.
1641
1642 @item viscii
1643 @itemx viscii-dos
1644 @itemx viscii-mac
1645 @itemx viscii-unix
1646
1647 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used
1648 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
1649 given priority by XEmacs.
1650
1651 @item vscii
1652 @itemx vscii-dos
1653 @itemx vscii-mac
1654 @itemx vscii-unix
1655
1656 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used
1657 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
1658 given priority by XEmacs. Use
1659 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII.
1660
1661 @end table
1662
1096 @node CCL, Category Tables, Coding Systems, MULE 1663 @node CCL, Category Tables, Coding Systems, MULE
1097 @section CCL 1664 @section CCL
1098 1665
1099 CCL (Code Conversion Language) is a simple structured programming 1666 CCL (Code Conversion Language) is a simple structured programming
1100 language designed for character coding conversions. A CCL program is 1667 language designed for character coding conversions. A CCL program is
1101 compiled to CCL code (represented by a vector of integers) and executed 1668 compiled to CCL code (represented by a vector of integers) and executed
1102 by the CCL interpreter embedded in Emacs. The CCL interpreter 1669 by the CCL interpreter embedded in Emacs. The CCL interpreter
1103 implements a virtual machine with 8 registers called @code{r0}, ..., 1670 implements a virtual machine with 8 registers called @code{r0}, ...,
1104 @code{r7}, a number of control structures, and some I/O operators. Take 1671 @code{r7}, a number of control structures, and some I/O operators. Take
1105 care when using registers @code{r0} (used in implicit @dfn{set} 1672 care when using registers @code{r0} (used in implicit @dfn{set}
1106 statements) and especially @code{r7} (used internally by several 1673 statements) and especially @code{r7} (used internally by several
1107 statements and operations, especially for multiple return values and I/O 1674 statements and operations, especially for multiple return values and I/O
1108 operations). 1675 operations).
1109 1676
1110 CCL is used for code conversion during process I/O and file I/O for 1677 CCL is used for code conversion during process I/O and file I/O for
1111 non-ISO2022 coding systems. (It is the only way for a user to specify a 1678 non-ISO2022 coding systems. (It is the only way for a user to specify a
1112 code conversion function.) It is also used for calculating the code 1679 code conversion function.) It is also used for calculating the code
1113 point of an X11 font from a character code. However, since CCL is 1680 point of an X11 font from a character code. However, since CCL is
1114 designed as a powerful programming language, it can be used for more 1681 designed as a powerful programming language, it can be used for more
1115 generic calculation where efficiency is demanded. A combination of 1682 generic calculation where efficiency is demanded. A combination of
1116 three or more arithmetic operations can be calculated faster by CCL than 1683 three or more arithmetic operations can be calculated faster by CCL than
1117 by Emacs Lisp. 1684 by Emacs Lisp.
1118 1685
1119 @strong{Warning:} The code in @file{src/mule-ccl.c} and 1686 @strong{Warning:} The code in @file{src/mule-ccl.c} and
1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive 1687 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1121 description of CCL's semantics. The previous version of this section 1688 description of CCL's semantics. The previous version of this section
1122 contained several typos and obsolete names left from earlier versions of 1689 contained several typos and obsolete names left from earlier versions of
1123 MULE, and many may remain. (I am not an experienced CCL programmer; the 1690 MULE, and many may remain. (I am not an experienced CCL programmer; the
1124 few who know CCL well find writing English painful.) 1691 few who know CCL well find writing English painful.)
1125 1692
1126 A CCL program transforms an input data stream into an output data 1693 A CCL program transforms an input data stream into an output data
1127 stream. The input stream, held in a buffer of constant bytes, is left 1694 stream. The input stream, held in a buffer of constant bytes, is left
1128 unchanged. The buffer may be filled by an external input operation, 1695 unchanged. The buffer may be filled by an external input operation,
1129 taken from an Emacs buffer, or taken from a Lisp string. The output 1696 taken from an Emacs buffer, or taken from a Lisp string. The output
1130 buffer is a dynamic array of bytes, which can be written by an external 1697 buffer is a dynamic array of bytes, which can be written by an external
1131 output operation, inserted into an Emacs buffer, or returned as a Lisp 1698 output operation, inserted into an Emacs buffer, or returned as a Lisp
1132 string. 1699 string.
1133 1700
1134 A CCL program is a (Lisp) list containing two or three members. The 1701 A CCL program is a (Lisp) list containing two or three members. The
1135 first member is the @dfn{buffer magnification}, which indicates the 1702 first member is the @dfn{buffer magnification}, which indicates the
1136 required minimum size of the output buffer as a multiple of the input 1703 required minimum size of the output buffer as a multiple of the input
1137 buffer. It is followed by the @dfn{main block} which executes while 1704 buffer. It is followed by the @dfn{main block} which executes while
1138 there is input remaining, and an optional @dfn{EOF block} which is 1705 there is input remaining, and an optional @dfn{EOF block} which is
1139 executed when the input is exhausted. Both the main block and the EOF 1706 executed when the input is exhausted. Both the main block and the EOF
1140 block are CCL blocks. 1707 block are CCL blocks.
1141 1708
1142 A @dfn{CCL block} is either a CCL statement or list of CCL statements. 1709 A @dfn{CCL block} is either a CCL statement or list of CCL statements.
1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer 1710 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1144 or an @dfn{assignment}, which is a list of a register to receive the 1711 or an @dfn{assignment}, which is a list of a register to receive the
1145 assignment, an assignment operator, and an expression) or a @dfn{control 1712 assignment, an assignment operator, and an expression) or a @dfn{control
1146 statement} (a list starting with a keyword, whose allowable syntax 1713 statement} (a list starting with a keyword, whose allowable syntax
1147 depends on the keyword). 1714 depends on the keyword).
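
To make the structure concrete, here is a minimal sketch of a CCL
program written as a Lisp list. It is meant only to show the shape
described above (a buffer magnification followed by a main block) and
has not been checked against @file{src/mule-ccl.c}; the variable name
@code{my-ccl-copy-program} is hypothetical.

@example
@group
;; Buffer magnification 1: the output needs no more room than the input.
;; Main block: a single loop that copies each input byte to the output.
(defconst my-ccl-copy-program
  '(1
    ((loop
      (read r0)      ; read one byte into register r0
      (write r0)     ; write it unchanged
      (repeat)))))   ; jump back to the start of the loop
@end group
@end example

No EOF block is given, so nothing further happens once the input is
exhausted.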
1152 * CCL Expressions:: Operators and expressions in CCL. 1719 * CCL Expressions:: Operators and expressions in CCL.
1153 * Calling CCL:: Running CCL programs. 1720 * Calling CCL:: Running CCL programs.
1154 * CCL Examples:: The encoding functions for Big5 and KOI-8. 1721 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1155 @end menu 1722 @end menu
1156 1723
1157 @node CCL Syntax, CCL Statements, CCL, CCL 1724 @node CCL Syntax, CCL Statements, , CCL
1158 @comment Node, Next, Previous, Up 1725 @comment Node, Next, Previous, Up
1159 @subsection CCL Syntax 1726 @subsection CCL Syntax
1160 1727
1161 The full syntax of a CCL program in BNF notation: 1728 The full syntax of a CCL program in BNF notation:
1162 1729
1163 @format 1730 @format
1164 CCL_PROGRAM := 1731 CCL_PROGRAM :=
1165 (BUFFER_MAGNIFICATION 1732 (BUFFER_MAGNIFICATION
1166 CCL_MAIN_BLOCK 1733 CCL_MAIN_BLOCK
1215 1782
1216 @node CCL Statements, CCL Expressions, CCL Syntax, CCL 1783 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
1217 @comment Node, Next, Previous, Up 1784 @comment Node, Next, Previous, Up
1218 @subsection CCL Statements 1785 @subsection CCL Statements
1219 1786
1220 The Emacs Code Conversion Language provides the following statement 1787 The Emacs Code Conversion Language provides the following statement
1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, 1788 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}. 1789 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1223 1790
1224 @heading Set statement: 1791 @heading Set statement:
1225 1792
1226 The @dfn{set} statement has three variants with the syntaxes 1793 The @dfn{set} statement has three variants with the syntaxes
1227 @samp{(@var{reg} = @var{expression})}, 1794 @samp{(@var{reg} = @var{expression})},
1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and 1795 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1229 @samp{@var{integer}}. The assignment operator variation of the 1796 @samp{@var{integer}}. The assignment operator variation of the
1230 @dfn{set} statement works the same way as the corresponding C expression 1797 @dfn{set} statement works the same way as the corresponding C expression
1231 statement does. The assignment operators are @code{+=}, @code{-=}, 1798 statement does. The assignment operators are @code{+=}, @code{-=},
1234 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of 1801 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
1235 the form @code{(r0 = @var{integer})}. 1802 the form @code{(r0 = @var{integer})}.
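
As a minimal sketch (the registers and values here are arbitrary
illustrations, not taken from any real coding system), a CCL block that
uses each of the three variants might look like this:

@example
((r1 = 65)      ; plain assignment: r1 becomes 65
 (r1 += 1)      ; assignment operator, as in C: r1 becomes 66
 7)             ; naked integer, equivalent to (r0 = 7)
@end example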
1236 1803
1237 @heading I/O statements: 1804 @heading I/O statements:
1238 1805
1239 The @dfn{read} statement takes one or more registers as arguments. It 1806 The @dfn{read} statement takes one or more registers as arguments. It
1240 reads one byte (a C char) from the input into each register in turn. 1807 reads one byte (a C char) from the input into each register in turn.
1241 1808
1242 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg} 1809 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg}
1243 ...)} it takes one or more registers as arguments and writes each in 1810 ...)} it takes one or more registers as arguments and writes each in
1244 turn to the output. The integer in a register (interpreted as an 1811 turn to the output. The integer in a register (interpreted as an
1245 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the 1812 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the
1246 current output buffer. If it is less than 256, it is written as is. 1813 current output buffer. If it is less than 256, it is written as is.
1247 The forms @samp{(write @var{expression})} and @samp{(write 1814 The forms @samp{(write @var{expression})} and @samp{(write
1251 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes 1818 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
1252 the @var{reg}th element of the @var{array} to the output. 1819 the @var{reg}th element of the @var{array} to the output.
1253 1820
1254 @heading Conditional statements: 1821 @heading Conditional statements:
1255 1822
1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and 1823 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1257 an optional @var{second CCL block} as arguments. If the 1824 an optional @var{second CCL block} as arguments. If the
1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is 1825 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1259 executed. Otherwise, if there is a @var{second CCL block}, it is 1826 executed. Otherwise, if there is a @var{second CCL block}, it is
1260 executed. 1827 executed.
1261 1828
1262 The @dfn{read-if} variant of the @dfn{if} statement takes an 1829 The @dfn{read-if} variant of the @dfn{if} statement takes an
1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL 1830 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1264 block} as arguments. The @var{expression} must have the form 1831 block} as arguments. The @var{expression} must have the form
1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is 1832 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1266 a register or an integer). The @code{read-if} statement first reads 1833 a register or an integer). The @code{read-if} statement first reads
1267 from the input into the first register operand in the @var{expression}, 1834 from the input into the first register operand in the @var{expression},
1268 then conditionally executes a CCL block just as the @code{if} statement 1835 then conditionally executes a CCL block just as the @code{if} statement
1269 does. 1836 does.
1270 1837
1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL 1838 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1272 blocks as arguments. The CCL blocks are treated as a zero-indexed 1839 blocks as arguments. The CCL blocks are treated as a zero-indexed
1273 array, and the @code{branch} statement uses the @var{expression} as the 1840 array, and the @code{branch} statement uses the @var{expression} as the
1274 index of the CCL block to execute. Null CCL blocks may be used as 1841 index of the CCL block to execute. Null CCL blocks may be used as
1275 no-ops, continuing execution with the statement following the 1842 no-ops, continuing execution with the statement following the
1276 @code{branch} statement in the containing CCL block. Out-of-range 1843 @code{branch} statement in the containing CCL block. Out-of-range
1277 values for the @var{expression} are also treated as no-ops. 1844 values for the @var{expression} are also treated as no-ops.
1278 1845
1279 The @dfn{read-branch} variant of the @dfn{branch} statement takes a 1846 The @dfn{read-branch} variant of the @dfn{branch} statement takes a
1280 @var{register} and one or more CCL blocks as arguments. The 1847 @var{register} and one or more CCL blocks as arguments. The
1281 @code{read-branch} statement first reads from the input into the 1848 @code{read-branch} statement first reads from the input into the
1282 @var{register}, then uses its value to select the CCL block to execute, 1849 @var{register}, then uses its value to select the CCL block to execute,
1283 just as the @code{branch} statement does. 1850 just as the @code{branch} statement does.
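
The following fragments sketch the conditional statements described
above. The register numbers, the threshold 128, and the output strings
are assumptions chosen only for illustration.

@example
;; Write r0 if it is below 128; otherwise write 63 (ASCII `?').
(if (r0 < 128)
    (write r0)
  (write 63))

;; Same test, but first read the next input byte into r0.
(read-if (r0 < 128)
    (write r0)
  (write 63))

;; Select a block by the value of r1 (zero-indexed).
;; Any other value of r1 falls through as a no-op.
(branch r1
  (write "zero")
  (write "one")
  (write "two"))
@end example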
1284 1851
1285 @heading Loop control statements: 1852 @heading Loop control statements:
1286 1853
1287 The @dfn{loop} statement creates a block with an implied jump from the 1854 The @dfn{loop} statement creates a block with an implied jump from the
1288 end of the block back to its head. The loop is exited on a @code{break} 1855 end of the block back to its head. The loop is exited on a @code{break}
1289 statement, and continued without executing the tail by a @code{repeat} 1856 statement, and continued without executing the tail by a @code{repeat}
1290 statement. 1857 statement.
1291 1858
1292 The @dfn{break} statement, written @samp{(break)}, terminates the 1859 The @dfn{break} statement, written @samp{(break)}, terminates the
1293 current loop and continues with the next statement in the current 1860 current loop and continues with the next statement in the current
1294 block. 1861 block.
1295 1862
1296 The @dfn{repeat} statement has three variants, @code{repeat}, 1863 The @dfn{repeat} statement has three variants, @code{repeat},
1297 @code{write-repeat}, and @code{write-read-repeat}. Each continues the 1864 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
1298 current loop from its head, possibly after performing I/O. 1865 current loop from its head, possibly after performing I/O.
1299 @code{repeat} takes no arguments and does no I/O before jumping. 1866 @code{repeat} takes no arguments and does no I/O before jumping.
1300 @code{write-repeat} takes a single argument (a register, an 1867 @code{write-repeat} takes a single argument (a register, an
1301 integer, or a string), writes it to the output, then jumps. 1868 integer, or a string), writes it to the output, then jumps.
1307 @code{write} and @code{read} statements for the semantics of the I/O 1874 @code{write} and @code{read} statements for the semantics of the I/O
1308 operations for each type of argument. 1875 operations for each type of argument.
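
As a sketch of how these statements combine (the choice of a zero byte
as terminator is an assumption made for illustration), the following
loop copies input bytes to the output until it reads a zero byte:

@example
(loop
  (read r0)            ; fetch the next input byte
  (if (r0 == 0)
      (break))         ; leave the loop on a zero byte
  (write-repeat r0))   ; write r0, then jump back to the head
@end example

Because @code{write-repeat} jumps back to the head of the loop, any
statement placed after it inside the loop would not be reached on that
path.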
1309 1876
1310 @heading Other control statements: 1877 @heading Other control statements:
1311 1878
1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, 1879 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1313 executes a CCL program as a subroutine. It does not return a value to 1880 executes a CCL program as a subroutine. It does not return a value to
1314 the caller, but can modify the register status. 1881 the caller, but can modify the register status.
1315 1882
1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL 1883 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1317 program successfully, and returns to the caller (which may be a CCL 1884 program successfully, and returns to the caller (which may be a CCL
1318 program). It does not alter the status of the registers. 1885 program). It does not alter the status of the registers.
1319 1886
1320 @node CCL Expressions, Calling CCL, CCL Statements, CCL 1887 @node CCL Expressions, Calling CCL, CCL Statements, CCL
1321 @comment Node, Next, Previous, Up 1888 @comment Node, Next, Previous, Up
1322 @subsection CCL Expressions 1889 @subsection CCL Expressions
1323 1890
1324 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions 1891 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
1325 consist of a single @var{operand}, either a register (one of @code{r0}, 1892 consist of a single @var{operand}, either a register (one of @code{r0},
1326 ..., @code{r7}) or an integer. Complex expressions are lists of the 1893 ..., @code{r7}) or an integer. Complex expressions are lists of the
1327 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike 1894 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
1328 C, assignments are not expressions. 1895 C, assignments are not expressions.
1329 1896
1330 In the following table, @var{X} is the target register for a @dfn{set}. 1897 In the following table, @var{X} is the target register for a @dfn{set}.
1331 In subexpressions, this is implicitly @code{r7}. This means that 1898 In subexpressions, this is implicitly @code{r7}. This means that
1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used 1899 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1333 freely in subexpressions, since they return parts of their values in 1900 freely in subexpressions, since they return parts of their values in
1334 @code{r7}. @var{Y} may be an expression, register, or integer, while 1901 @code{r7}. @var{Y} may be an expression, register, or integer, while
1335 @var{Z} must be a register or an integer. 1902 @var{Z} must be a register or an integer.
1359 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) 1926 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) 1927 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) 1928 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1362 @end multitable 1929 @end multitable
1363 1930
1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, 1931 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
1365 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS 1932 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
1366 and CCL_DECODE_SJIS treat their first and second bytes as the high and 1933 and CCL_DECODE_SJIS treat their first and second bytes as the high and
1367 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an 1934 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an
1368 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a 1935 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1369 complicated transformation of the Japanese standard JIS encoding to 1936 complicated transformation of the Japanese standard JIS encoding to
1370 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to 1937 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1371 represent the SJIS operations in infix form. 1938 represent the SJIS operations in infix form.
1372 1939
1373 @node Calling CCL, CCL Examples, CCL Expressions, CCL 1940 @node Calling CCL, CCL Examples, CCL Expressions, CCL
1374 @comment Node, Next, Previous, Up 1941 @comment Node, Next, Previous, Up
1375 @subsection Calling CCL 1942 @subsection Calling CCL
1376 1943
1377 CCL programs are called automatically during Emacs buffer I/O when the 1944 CCL programs are called automatically during Emacs buffer I/O when the
1378 external representation has a coding system type of @code{shift-jis}, 1945 external representation has a coding system type of @code{shift-jis},
1379 @code{big5}, or @code{ccl}. The program is specified by the coding 1946 @code{big5}, or @code{ccl}. The program is specified by the coding
1380 system (@pxref{Coding Systems}). You can also call CCL programs from 1947 system (@pxref{Coding Systems}). You can also call CCL programs from
1381 other CCL programs, and from Lisp using these functions: 1948 other CCL programs, and from Lisp using these functions:
1382 1949
1409 of the program. When the program is done, @var{status} is modified (by 1976 of the program. When the program is done, @var{status} is modified (by
1410 side-effect) to contain the ending values for the corresponding 1977 side-effect) to contain the ending values for the corresponding
1411 registers and IC. Returns the resulting string. 1978 registers and IC. Returns the resulting string.
1412 @end defun 1979 @end defun
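
A minimal sketch of calling CCL from Lisp follows. It assumes that the
@code{ccl-compile} function from the CCL compiler library is available
to turn the list form into compiled code, and that @var{status} is a
nine-element vector (one slot per register plus IC) in which @code{nil}
entries are treated as zero. The program simply copies its input to its
output.

@example
(ccl-execute-on-string
 ;; Assumption: ccl-compile yields the compiled form of the program.
 (ccl-compile '(1 ((loop (read r0) (write-repeat r0)))))
 (make-vector 9 nil)    ; initial values for the registers and IC
 "abc")
@end example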
1413 1980
1414 To call a CCL program from another CCL program, it must first be 1981 To call a CCL program from another CCL program, it must first be
1415 registered: 1982 registered:
1416 1983
1417 @defun register-ccl-program name ccl-program 1984 @defun register-ccl-program name ccl-program
1418 Register @var{name} for CCL program @var{ccl-program} in 1985 Register @var{name} for CCL program @var{ccl-program} in
1419 @code{ccl-program-table}. @var{ccl-program} should be the compiled form 1986 @code{ccl-program-table}. @var{ccl-program} should be the compiled form
1420 of a CCL program, or @code{nil}. Returns the index number of the 1987 of a CCL program, or @code{nil}. Returns the index number of the
1421 registered CCL program. 1988 registered CCL program.
1422 @end defun 1989 @end defun
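
The following sketch shows one CCL program calling another. It assumes
the @code{define-ccl-program} macro from the CCL compiler library, which
compiles a program and registers it under the given name in one step.
The names @code{my-write-bang} and @code{my-copy-bang} are hypothetical.

@example
;; Subroutine: writes 33 (ASCII `!') and returns.
(define-ccl-program my-write-bang
  '(1 ((write 33))))

;; Caller: copies each input byte and follows it with `!'.
;; The buffer magnification is 2 because each input byte
;; produces two output bytes.
(define-ccl-program my-copy-bang
  '(2
    ((loop
      (read r0)
      (write r0)
      (call my-write-bang)
      (repeat)))))
@end example

The subroutine is defined first so that its name is already registered
when the calling program is compiled.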
1423 1990
1424 Information about the processor time used by the CCL interpreter can be 1991 Information about the processor time used by the CCL interpreter can be
1425 obtained using these functions: 1992 obtained using these functions:
1426 1993
1427 @defun ccl-elapsed-time 1994 @defun ccl-elapsed-time
1428 Returns the elapsed processor time of the CCL interpreter as a cons of 1995 Returns the elapsed processor time of the CCL interpreter as a cons of
1429 user and system time, as 1996 user and system time, as
1434 2001
1435 @defun ccl-reset-elapsed-time 2002 @defun ccl-reset-elapsed-time
1436 Resets the CCL interpreter's internal elapsed time registers. 2003 Resets the CCL interpreter's internal elapsed time registers.
1437 @end defun 2004 @end defun
1438 2005
1439 @node CCL Examples, , Calling CCL, CCL 2006 @node CCL Examples, , Calling CCL, CCL
1440 @comment Node, Next, Previous, Up 2007 @comment Node, Next, Previous, Up
1441 @subsection CCL Examples 2008 @subsection CCL Examples
1442 2009
1443 This section is not yet written. 2010 This section is not yet written.
1444 2011
1445 @node Category Tables, , CCL, MULE 2012 @node Category Tables, , CCL, MULE
1446 @section Category Tables 2013 @section Category Tables
1447 2014
1448 A category table is a type of char table used for keeping track of 2015 A category table is a type of char table used for keeping track of