comparison man/lispref/mule.texi @ 412:697ef44129c6 r21-2-14

Import from CVS: tag r21-2-14
author cvs
date Mon, 13 Aug 2007 11:20:41 +0200
parents de805c49cfc1
411:12e008d41344 412:697ef44129c6
4 @c See the file lispref.texi for copying conditions. 4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info 5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top 6 @node MULE, Tips, Internationalization, top
7 @chapter MULE 7 @chapter MULE
8 8
9 @dfn{MULE} is the name originally given to the version of GNU Emacs 9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support. 10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and 11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called
12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the 12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
13 Japanese word for ``Japan''), which only provided support for Japanese. 13 ``Japan''), when it only provided support for Japanese. XEmacs
14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since 14 refers to its multi-lingual support as @dfn{MULE support} since it
15 it is based on @dfn{MULE}. 15 is based on @dfn{MULE}.
16 16
17 @menu 17 @menu
18 * Internationalization Terminology:: 18 * Internationalization Terminology::
19 Definition of various internationalization terms. 19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters. 20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE. 21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones. 22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
23 * Coding Systems:: Ways of representing a string of chars using integers. 24 * Coding Systems:: Ways of representing a string of chars using integers.
24 * CCL:: A special language for writing fast converters. 25 * CCL:: A special language for writing fast converters.
25 * Category Tables:: Subdividing charsets into groups. 26 * Category Tables:: Subdividing charsets into groups.
26 @end menu 27 @end menu
27 28
28 @node Internationalization Terminology, Charsets, , MULE 29 @node Internationalization Terminology
29 @section Internationalization Terminology 30 @section Internationalization Terminology
30 31
31 In internationalization terminology, a string of text is divided up 32 In internationalization terminology, a string of text is divided up
32 into @dfn{characters}, which are the printable units that make up the 33 into @dfn{characters}, which are the printable units that make up the
33 text. A single character is (for example) a capital @samp{A}, the 34 text. A single character is (for example) a capital @samp{A}, the
34 number @samp{2}, a Katakana character, a Hangul character, a Kanji 35 number @samp{2}, a Katakana character, a Kanji ideograph (an
35 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is 36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese
36 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there 37 Kanji, Chinese Hanzi, and Korean Hangul; typically there are thousands
37 are thousands of such ideographs in each language), etc. The basic 38 of such ideographs in each language), etc. The basic property of a
38 property of a character is that it is the smallest unit of text with 39 character is its shape. Note that the same character may be drawn by
39 semantic significance in text processing. 40 two different people (or in two different fonts) in slightly different
40 41 ways, although the basic shape will be the same.
41 Human beings normally process text visually, so to a first approximation
42 a character may be identified with its shape. Note that the same
43 character may be drawn by two different people (or in two different
44 fonts) in slightly different ways, although the "basic shape" will be the
45 same. But consider the works of Scott Kim; human beings can recognize
46 hugely variant shapes as the "same" character. Sometimes, especially
47 where characters are extremely complicated to write, completely
48 different shapes may be defined as the "same" character in national
49 standards. The Taiwanese variant of Hanzi is generally the most
50 complicated; over the centuries, the Japanese, Koreans, and the People's
51 Republic of China have adopted simplifications of the shape, but the
52 line of descent from the original shape is recorded, and the meanings
53 and pronunciation of different forms of the same character are
54 considered to be identical within each language. (Of course, it may
55 take a specialist to recognize the related form; the point is that the
56 relations are standardized, despite the differing shapes.)
57 42
58 In some cases, the differences will be significant enough that it is 43 In some cases, the differences will be significant enough that it is
59 actually possible to identify two or more distinct shapes that both 44 actually possible to identify two or more distinct shapes that both
60 represent the same character. For example, the lowercase letters 45 represent the same character. For example, the lowercase letters
61 @samp{a} and @samp{g} each have two distinct possible shapes---the 46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
62 @samp{a} can optionally have a curved tail projecting off the top, and 47 @samp{a} can optionally have a curved tail projecting off the top, and
63 the @samp{g} can be formed either of two loops, or of one loop and a 48 the @samp{g} can be formed either of two loops, or of one loop and a
64 tail hanging off the bottom. Such distinct possible shapes of a 49 tail hanging off the bottom. Such distinct possible shapes of a
65 character are called @dfn{glyphs}. The important characteristic of two 50 character are called @dfn{glyphs}. The important characteristic of two
66 glyphs making up the same character is that the choice between one or 51 glyphs making up the same character is that the choice between one or
67 the other is purely stylistic and has no linguistic effect on a word 52 the other is purely stylistic and has no linguistic effect on a word
68 (this is the reason why a capital @samp{A} and lowercase @samp{a} 53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
69 are different characters rather than different glyphs---e.g. 54 are different characters rather than different glyphs -- e.g.
70 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). 55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
71 56
72 Note that @dfn{character} and @dfn{glyph} are used differently 57 Note that @dfn{character} and @dfn{glyph} are used differently
73 here than elsewhere in XEmacs. 58 here than elsewhere in XEmacs.
74 59
75 A @dfn{character set} is essentially a set of related characters. ASCII, 60 A @dfn{character set} is simply a set of related characters. ASCII,
76 for example, is a set of 94 characters (or 128, if you count 61 for example, is a set of 94 characters (or 128, if you count
77 non-printing characters). Other character sets are ISO8859-1 (ASCII 62 non-printing characters). Other character sets are ISO8859-1 (ASCII
78 plus various accented characters and other international symbols), 63 plus various accented characters and other international symbols),
79 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
80 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), 65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
81 GB2312 (Mainland Chinese Hanzi), etc. 66 GB2312 (Mainland Chinese Hanzi), etc.
82 67
83 The definition of a character set will implicitly or explicitly give 68 Every character set has one or more @dfn{orderings}, which can be
84 it an @dfn{ordering}, a way of assigning a number to each character in 69 viewed as a way of assigning a number (or set of numbers) to each
85 the set. For many character sets, there is a natural ordering, for 70 character in the set. For most character sets, there is a standard
86 example the ``ABC'' ordering of the Roman letters. But it is not clear 71 ordering, and in fact all of the character sets mentioned above define a
87 whether digits should come before or after the letters, and in fact 72 particular ordering. ASCII, for example, places letters in their
88 different European languages treat the ordering of accented characters 73 ``natural'' order, puts uppercase letters before lowercase letters,
89 differently. It is useful to use the natural order where available, of 74 numbers before letters, etc. Note that for many of the Asian character
90 course. The number assigned to any particular character is called the 75 sets, there is no natural ordering of the characters. The actual
91 character's @dfn{code point}. (Within a given character set, each 76 orderings are based on one or more salient characteristic, of which
92 character has a unique code point. Thus the word "set" is ill-chosen; 77 there are many to choose from -- e.g. number of strokes, common
93 different orderings of the same characters are different character sets. 78 radicals, phonetic ordering, etc.
94 Identifying characters is simple enough for alphabetic character sets, 79
95 but the difference in ordering can cause great headaches when the same 80 The set of numbers assigned to any particular character are called
96 thousands of characters are used by different cultures as in the Hanzi.) 81 the character's @dfn{position codes}. The number of position codes
97 82 required to index a particular character in a character set is called
98 A code point may be broken into a number of @dfn{position codes}. The 83 the @dfn{dimension} of the character set. ASCII, being a relatively
99 number of position codes required to index a particular character in a 84 small character set, is of dimension one, and each character in the
100 character set is called the @dfn{dimension} of the character set. For 85 set is indexed using a single position code, in the range 0 through
101 practical purposes, a position code may be thought of as a byte-sized 86 127 (if non-printing characters are included) or 33 through 126
102 index. The printing characters of ASCII, being a relatively small 87 (if only the printing characters are considered). JISX0208, i.e.
103 character set, is of dimension one, and each character in the set is 88 Japanese Kanji, has thousands of characters, and is of dimension two --
104 indexed using a single position code, in the range 1 through 94. Use of 89 every character is indexed by two position codes, each in the range
105 this unusual range, rather than the familiar 33 through 126, is an 90 33 through 126. (Note that the choice of the range here is somewhat
106 intentional abstraction; to understand the programming issues you must 91 arbitrary. Although a character set such as JISX0208 defines an
107 break the equation between character sets and encodings. 92 @emph{ordering} of all its characters, it does not define the actual
108 93 mapping between numbers and characters. You could just as easily
109 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is 94 index the characters in JISX0208 using numbers in the range 0 through
110 of dimension two -- every character is indexed by two position codes, 95 93, 1 through 94, 2 through 95, etc. The reason for the actual range
111 each in the range 1 through 94. (This number ``94'' is not a 96 chosen is so that the position codes match up with the actual values
112 coincidence; we shall see that the JIS position codes were chosen so 97 used in the common encodings.)
113 that JIS kanji could be encoded without using codes that in ASCII are
114 associated with device control functions.) Note that the choice of the
115 range here is somewhat arbitrary. You could just as easily index the
116 printing characters in ASCII using numbers in the range 0 through 93, 2
117 through 95, 3 through 96, etc. In fact, the standardized
118 @emph{encoding} for the ASCII @emph{character set} uses the range 33
119 through 126.
120 98
121 An @dfn{encoding} is a way of numerically representing characters from 99 An @dfn{encoding} is a way of numerically representing characters from
122 one or more character sets into a stream of like-sized numerical values 100 one or more character sets into a stream of like-sized numerical values
123 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit 101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
124 quantities. If an encoding encompasses only one character set, then the 102 quantities. If an encoding encompasses only one character set, then the
125 position codes for the characters in that character set could be used 103 position codes for the characters in that character set could be used
126 directly. (This is the case with the trivial cipher used by children, 104 directly. (This is the case with ASCII, and as a result, most people do
127 assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII, 105 not understand the difference between a character set and an encoding.)
128 other considerations intrude. For example, why are the upper- and 106 This is not possible, however, if more than one character set is to be
129 lowercase alphabets separated by 8 characters? Why do the digits start 107 used in the encoding. For example, printed Japanese text typically
130 with `0' being assigned the code 48? In both cases because semantically 108 requires characters from multiple character sets -- ASCII, JISX0208, and
131 interesting operations (case conversion and numerical value extraction) 109 JISX0212, to be specific. Each of these is indexed using one or more
132 become convenient masking operations. Other artificial aspects (the 110 position codes in the range 33 through 126, so the position codes could
133 control characters being assigned to codes 0--31 and 127) are historical 111 not be used directly or there would be no way to tell which character
134 accidents. (The use of 127 for @samp{DEL} is an artifact of the "punch 112 was meant. Different Japanese encodings handle this differently -- JIS
135 once" nature of paper tape, for example.) 113 uses special escape characters to denote different character sets; EUC
136 114 sets the high bit of the position codes for JISX0208 and JISX0212, and
137 Naive use of the position code is not possible, however, if more than 115 puts a special extra byte before each JISX0212 character; etc. (JIS,
138 one character set is to be used in the encoding. For example, printed 116 EUC, and most of the other encodings you will encounter are 7-bit or
139 Japanese text typically requires characters from multiple character sets 117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
140 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is 118 this strives to represent all the world's characters in a single large
141 indexed using one or more position codes in the range 1 through 94, so 119 character set. 32-bit encodings are generally used internally in
142 the position codes could not be used directly or there would be no way 120 programs to simplify the code that manipulates them; however, they are
143 to tell which character was meant. Different Japanese encodings handle 121 not much used externally because they are not very space-efficient.)
144 this differently -- JIS uses special escape characters to denote
145 different character sets; EUC sets the high bit of the position codes
146 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
147 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
148 you will encounter in files are 7-bit or 8-bit encodings. There is one
149 common 16-bit encoding, which is Unicode; this strives to represent all
150 the world's characters in a single large character set. 32-bit
151 encodings are often used internally in programs, such as XEmacs with
152 MULE support, to simplify the code that manipulates them; however, they
153 are not used externally because they are not very space-efficient.)
154
155 A general method of handling text using multiple character sets
156 (whether for multilingual text, or simply text in an extremely
157 complicated single language like Japanese) is defined in the
158 international standard ISO 2022. ISO 2022 will be discussed in more
159 detail later (@pxref{ISO 2022}), but for now suffice it to say that text
160 needs control functions (at least spacing), and if escape sequences are
161 to be used, an escape sequence introducer. It was decided to make all
162 text streams compatible with ASCII in the sense that the codes 0--31
163 (and 128--159) would always be control codes, never graphic characters,
164 and where defined by the character set the @samp{SPC} character would be
165 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are
166 94 code points remaining if 7 bits are used. This is the reason that
167 most character sets are defined using position codes in the range 1
168 through 94. Then ISO 2022 compatible encodings are produced by shifting
169 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
170 codes are available) into character codes 161 to 254.
171 122
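The shift from position codes to ISO 2022 byte values described above can be sketched in a few lines. This is only an illustration in Python, not part of XEmacs; the helper names `to_7bit` and `to_8bit` are hypothetical, and the example character (Hiragana A at row 4, cell 2 of JIS X 0208) is checked against Python's standard `euc_jp` codec.

```python
# Illustration only: hypothetical helpers, not an XEmacs/MULE API.

def _check(position_codes):
    """Reject anything outside the 1..94 range used by 94-character sets."""
    for p in position_codes:
        if not 1 <= p <= 94:
            raise ValueError("position code out of range 1..94: %r" % (p,))

def to_7bit(position_codes):
    """Shift position codes 1..94 into byte values 33..126."""
    _check(position_codes)
    return bytes(p + 32 for p in position_codes)

def to_8bit(position_codes):
    """Shift position codes 1..94 into byte values 161..254 (high bit set)."""
    _check(position_codes)
    return bytes(p + 160 for p in position_codes)

hiragana_a = (4, 2)                 # row 4, cell 2 in JIS X 0208
seven_bit = to_7bit(hiragana_a)     # b'$"'  (0x24 0x22)
eight_bit = to_8bit(hiragana_a)     # b'\xa4\xa2'
```

The 8-bit form is exactly what EUC uses for JIS X 0208, so `eight_bit.decode("euc_jp")` yields the original character.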
172 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In 123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
173 a @dfn{modal encoding}, there are multiple states that the encoding can 124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
174 be in, and the interpretation of the values in the stream depends on the 125 and the interpretation of the values in the stream depends on the
175 current global state of the encoding. Special values in the encoding, 126 current global state of the encoding. Special values in the encoding,
176 called @dfn{escape sequences}, are used to change the global state. 127 called @dfn{escape sequences}, are used to change the global state.
177 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} 128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
178 indicate that, from then on, bytes are to be interpreted as position 129 indicate that, from then on, bytes are to be interpreted as position
179 codes for JIS X 0208, rather than as ASCII. This effect is cancelled 130 codes for JISX0208, rather than as ASCII. This effect is cancelled
180 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the 131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
181 current state is to ASCII''. To switch to JIS X 0212, the escape 132 current state is to ASCII''. To switch to JISX0212, the escape sequence
182 sequence @samp{ESC $ ( D} is used. (Note that here, as is common, the escape 133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do
183 sequences do in fact begin with @samp{ESC}. This is not necessarily the 134 in fact begin with @samp{ESC}. This is not necessarily the case,
184 case, however. Some encodings use control characters called "locking 135 however.)
185 shifts" (effect persists until cancelled) to switch character sets.) 136
186 137 A @dfn{non-modal encoding} has no global state that extends past the
187 A @dfn{non-modal encoding} has no global state that extends past the
188 character currently being interpreted. EUC, for example, is a 138 character currently being interpreted. EUC, for example, is a
189 non-modal encoding. Characters in JIS X 0208 are encoded by setting 139 non-modal encoding. Characters in JISX0208 are encoded by setting
190 the high bit of the position codes, and characters in JIS X 0212 are 140 the high bit of the position codes, and characters in JISX0212 are
191 encoded by doing the same but also prefixing the character with the 141 encoded by doing the same but also prefixing the character with the
192 byte 0x8F. 142 byte 0x8F.
193 143
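The difference between the modal JIS encoding and the non-modal EUC encoding can be seen by encoding the same short string both ways. The sketch below uses Python's standard `iso2022_jp` and `euc_jp` codecs as stand-ins for the encodings discussed here; it is an illustration outside XEmacs, and the sample characters are arbitrary choices.

```python
# Illustration only: Python's standard codecs, not the XEmacs/MULE API.
text = "A\u3042A"   # Latin A, Hiragana A (a JIS X 0208 character), Latin A

# Modal: the stream switches state with ESC $ B and back with ESC ( B.
jis = text.encode("iso2022_jp")
# jis == b'A\x1b$B$"\x1b(BA'

# Non-modal: each JIS X 0208 position code simply has its high bit set.
euc = text.encode("euc_jp")
# euc == b'A\xa4\xa2A'

# A JIS X 0212 character (here U+4E02) is additionally prefixed with 0x8F.
euc_0212 = "\u4e02".encode("euc_jp")
```

Note that the two middle bytes of `jis` are the same position codes as in `euc`, only without the high bit; the escape sequences are what tell a reader how to interpret them.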
194 The advantage of a modal encoding is that it is generally more 144 The advantage of a modal encoding is that it is generally more
195 space-efficient, and is easily extendable because there are essentially 145 space-efficient, and is easily extendable because there are essentially
196 an arbitrary number of escape sequences that can be created. The 146 an arbitrary number of escape sequences that can be created. The
197 disadvantage, however, is that it is much more difficult to work with 147 disadvantage, however, is that it is much more difficult to work with
198 if it is not being processed in a sequential manner. In the non-modal 148 if it is not being processed in a sequential manner. In the non-modal
199 EUC encoding, for example, the byte 0x41 always refers to the letter 149 EUC encoding, for example, the byte 0x41 always refers to the letter
200 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or 150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
201 one of the two position codes in a JIS X 0208 character, or one of the 151 one of the two position codes in a JISX0208 character, or one of the
202 two position codes in a JIS X 0212 character. Determining exactly which 152 two position codes in a JISX0212 character. Determining exactly which
203 one is meant could be difficult and time-consuming if the previous 153 one is meant could be difficult and time-consuming if the previous
204 bytes in the string have not already been processed, or impossible if 154 bytes in the string have not already been processed.
205 they are drawn from an external stream that cannot be rewound.
206 155
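The ambiguity of the byte 0x41 in a modal encoding can be demonstrated directly. In this sketch (again using Python's standard codecs purely as an illustration), the same two bytes decode to two letters or to a single Kanji depending on the escape sequence that preceded them, while in EUC they can only be letters.

```python
# Illustration only: Python's standard codecs, not the XEmacs/MULE API.
ESC_TO_JISX0208 = b"\x1b$B"   # ESC $ B
ESC_TO_ASCII = b"\x1b(B"      # ESC ( B

# In the initial (ASCII) state, 0x41 0x41 is the text "AA".
plain = b"AA".decode("iso2022_jp")

# After ESC $ B, the same two bytes are the two position codes
# of one JIS X 0208 character.
shifted = (ESC_TO_JISX0208 + b"AA" + ESC_TO_ASCII).decode("iso2022_jp")

# In non-modal EUC, 0x41 is always the letter A.
euc = b"AA".decode("euc_jp")
```

A program that lands in the middle of a JIS stream therefore has to scan backwards for the most recent escape sequence before it can interpret a byte, which is the cost described above.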
207 Non-modal encodings are further divided into @dfn{fixed-width} and 156 Non-modal encodings are further divided into @dfn{fixed-width} and
208 @dfn{variable-width} formats. A fixed-width encoding always uses 157 @dfn{variable-width} formats. A fixed-width encoding always uses
209 the same number of words per character, whereas a variable-width 158 the same number of words per character, whereas a variable-width
210 encoding does not. EUC is a good example of a variable-width 159 encoding does not. EUC is a good example of a variable-width
212 the character set. 16-bit and 32-bit encodings are nearly always 161 the character set. 16-bit and 32-bit encodings are nearly always
213 fixed-width, and this is in fact one of the main reasons for using 162 fixed-width, and this is in fact one of the main reasons for using
214 an encoding with a larger word size. The advantages of fixed-width 163 an encoding with a larger word size. The advantages of fixed-width
215 encodings should be obvious. The advantages of variable-width 164 encodings should be obvious. The advantages of variable-width
216 encodings are that they are generally more space-efficient and allow 165 encodings are that they are generally more space-efficient and allow
217 for compatibility with existing 8-bit encodings such as ASCII. (For 166 for compatibility with existing 8-bit encodings such as ASCII.
218 example, in Unicode ASCII characters are simply promoted to a 16-bit 167
219 representation. That means that every ASCII character contains a 168 Note that the bytes in an 8-bit encoding are often referred to
220 @samp{NUL} byte; evidently all of the standard string manipulation 169 as @dfn{octets} rather than simply as bytes. This terminology
221 functions will lose badly in a fixed-width Unicode environment.) 170 dates back to the days before 8-bit bytes were universal, when
222 171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
223 The bytes in an 8-bit encoding are often referred to as @dfn{octets} 172
224 rather than simply as bytes. This terminology dates back to the days 173 @node Charsets
225 before 8-bit bytes were universal, when some computers had 9-bit bytes,
226 others had 10-bit bytes, etc.
227
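The trade-off between variable-width and fixed-width encodings can be made concrete by counting bytes. The sketch below is an illustration in Python, not XEmacs code; it uses `utf-16-be` as a stand-in for a fixed-width 16-bit Unicode encoding (true only for characters that fit in one 16-bit word, as all three samples do).

```python
# Illustration only: Python's standard codecs, not the XEmacs/MULE API.
samples = [
    "A",        # ASCII
    "\u3042",   # Hiragana A, in JIS X 0208
    "\u4e02",   # a Kanji from JIS X 0212
]

# Variable width: EUC spends one, two, or three bytes per character.
euc_widths = [len(s.encode("euc_jp")) for s in samples]

# Fixed width: every sample occupies exactly one 16-bit word.
utf16_widths = [len(s.encode("utf-16-be")) for s in samples]

# An ASCII character promoted to 16 bits contains a NUL byte.
ascii_as_16_bit = "A".encode("utf-16-be")
```

Here `euc_widths` is `[1, 2, 3]` and `utf16_widths` is `[2, 2, 2]`, and `ascii_as_16_bit` is a NUL byte followed by `A`, which is why byte-oriented string functions that stop at NUL cannot be used on such text.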
228 @node Charsets, MULE Characters, Internationalization Terminology, MULE
229 @section Charsets 174 @section Charsets
230 175
231 A @dfn{charset} in MULE is an object that encapsulates a 176 A @dfn{charset} in MULE is an object that encapsulates a
232 particular character set as well as an ordering of those characters. 177 particular character set as well as an ordering of those characters.
233 Charsets are permanent objects and are named using symbols, like 178 Charsets are permanent objects and are named using symbols, like
242 * Basic Charset Functions:: Functions for working with charsets. 187 * Basic Charset Functions:: Functions for working with charsets.
243 * Charset Property Functions:: Functions for accessing charset properties. 188 * Charset Property Functions:: Functions for accessing charset properties.
244 * Predefined Charsets:: Predefined charset objects. 189 * Predefined Charsets:: Predefined charset objects.
245 @end menu 190 @end menu
246 191
247 @node Charset Properties, Basic Charset Functions, , Charsets 192 @node Charset Properties
248 @subsection Charset Properties 193 @subsection Charset Properties
249 194
250 Charsets have the following properties: 195 Charsets have the following properties:
251 196
252 @table @code 197 @table @code
314 property. If a CCL program is defined, the position codes of a 259 property. If a CCL program is defined, the position codes of a
315 character will first be processed according to @code{graphic} and 260 character will first be processed according to @code{graphic} and
316 then passed through the CCL program, with the resulting values used 261 then passed through the CCL program, with the resulting values used
317 to index the font. 262 to index the font.
318 263
319 This is used, for example, in the Big5 character set (used in Taiwan). 264 This is used, for example, in the Big5 character set (used in Taiwan).
320 This character set is not ISO-2022-compliant, and its size (94x157) does 265 This character set is not ISO-2022-compliant, and its size (94x157) does
321 not fit within the maximum 96x96 size of ISO-2022-compliant character 266 not fit within the maximum 96x96 size of ISO-2022-compliant character
322 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, 267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
323 so as to group the most commonly used characters together) into two 268 so as to group the most commonly used characters together) into two
324 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, 269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
325 and each charset object uses a CCL program to convert the modified 270 and each charset object uses a CCL program to convert the modified
326 position codes back into standard Big5 indices to retrieve a character 271 position codes back into standard Big5 indices to retrieve a character
327 from a Big5 font. 272 from a Big5 font.
328 @end table 273 @end table
329 274
330 Most of the above properties can only be set when the charset is 275 Most of the above properties can only be changed when the charset
331 initialized, and cannot be changed later. 276 is created. @xref{Charset Property Functions}.
332 @xref{Charset Property Functions}. 277
333 278 @node Basic Charset Functions
334 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets
335 @subsection Basic Charset Functions 279 @subsection Basic Charset Functions
336 280
337 @defun find-charset charset-or-name 281 @defun find-charset charset-or-name
338 This function retrieves the charset of the given name. If 282 This function retrieves the charset of the given name. If
339 @var{charset-or-name} is a charset object, it is simply returned. 283 @var{charset-or-name} is a charset object, it is simply returned.
352 This function returns a list of the names of all defined charsets. 296 This function returns a list of the names of all defined charsets.
353 @end defun 297 @end defun
354 298
355 @defun make-charset name doc-string props 299 @defun make-charset name doc-string props
356 This function defines a new character set. This function is for use 300 This function defines a new character set. This function is for use
357 with MULE support. @var{name} is a symbol, the name by which the 301 with Mule support. @var{name} is a symbol, the name by which the
358 character set is normally referred. @var{doc-string} is a string 302 character set is normally referred. @var{doc-string} is a string
359 describing the character set. @var{props} is a property list, 303 describing the character set. @var{props} is a property list,
360 describing the specific nature of the character set. The recognized 304 describing the specific nature of the character set. The recognized
361 properties are @code{registry}, @code{dimension}, @code{columns}, 305 properties are @code{registry}, @code{dimension}, @code{columns},
362 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and 306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
380 This function returns the charset (if any) with the same dimension, 324 This function returns the charset (if any) with the same dimension,
381 number of characters, and final byte as @var{charset}, but which is 325 number of characters, and final byte as @var{charset}, but which is
382 displayed in the opposite direction. 326 displayed in the opposite direction.
383 @end defun 327 @end defun
384 328
385 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets 329 @node Charset Property Functions
386 @subsection Charset Property Functions 330 @subsection Charset Property Functions
387 331
388 All of these functions accept either a charset name or charset object. 332 All of these functions accept either a charset name or charset object.
389 333
390 @defun charset-property charset prop 334 @defun charset-property charset prop
391 This function returns property @var{prop} of @var{charset}. 335 This function returns property @var{prop} of @var{charset}.
392 @xref{Charset Properties}. 336 @xref{Charset Properties}.
393 @end defun 337 @end defun
394 338
395 Convenience functions are also provided for retrieving individual 339 Convenience functions are also provided for retrieving individual
396 properties of a charset. 340 properties of a charset.
397 341
398 @defun charset-name charset 342 @defun charset-name charset
399 This function returns the name of @var{charset}. This will be a symbol. 343 This function returns the name of @var{charset}. This will be a symbol.
400 @end defun 344 @end defun
420 This function returns the number of display columns per character (in 364 This function returns the number of display columns per character (in
421 TTY mode) of @var{charset}. 365 TTY mode) of @var{charset}.
422 @end defun 366 @end defun
423 367
424 @defun charset-direction charset 368 @defun charset-direction charset
425 This function returns the display direction of @var{charset}---either 369 This function returns the display direction of @var{charset} -- either
426 @code{l2r} or @code{r2l}. 370 @code{l2r} or @code{r2l}.
427 @end defun 371 @end defun
428 372
429 @defun charset-final charset 373 @defun charset-final charset
430 This function returns the final byte of the ISO 2022 escape sequence 374 This function returns the final byte of the ISO 2022 escape sequence
440 @defun charset-ccl-program charset 384 @defun charset-ccl-program charset
441 This function returns the CCL program, if any, for converting 385 This function returns the CCL program, if any, for converting
442 position codes of characters in @var{charset} into font indices. 386 position codes of characters in @var{charset} into font indices.
443 @end defun 387 @end defun
444 388
445 The only property of a charset that can currently be set after 389 The only property of a charset that can currently be set after
446 the charset has been created is the CCL program. 390 the charset has been created is the CCL program.
447 391
448 @defun set-charset-ccl-program charset ccl-program 392 @defun set-charset-ccl-program charset ccl-program
449 This function sets the @code{ccl-program} property of @var{charset} to 393 This function sets the @code{ccl-program} property of @var{charset} to
450 @var{ccl-program}. 394 @var{ccl-program}.
451 @end defun 395 @end defun
452 396
453 @node Predefined Charsets, , Charset Property Functions, Charsets 397 @node Predefined Charsets
454 @subsection Predefined Charsets 398 @subsection Predefined Charsets
455 399
456 The following charsets are predefined in the C code. 400 The following charsets are predefined in the C code.
457 401
458 @example 402 @example
459 Name Type Fi Gr Dir Registry 403 Name Type Fi Gr Dir Registry
460 -------------------------------------------------------------- 404 --------------------------------------------------------------
461 ascii 94 B 0 l2r ISO8859-1 405 ascii 94 B 0 l2r ISO8859-1
482 chinese-big5-2 94x94 1 0 l2r Big5 426 chinese-big5-2 94x94 1 0 l2r Big5
483 korean-ksc5601 94x94 C 0 l2r KSC5601 427 korean-ksc5601 94x94 C 0 l2r KSC5601
484 composite 96x96 0 l2r --- 428 composite 96x96 0 l2r ---
485 @end example 429 @end example
486 430
487 The following charsets are predefined in the Lisp code. 431 The following charsets are predefined in the Lisp code.
488 432
489 @example 433 @example
490 Name Type Fi Gr Dir Registry 434 Name Type Fi Gr Dir Registry
491 -------------------------------------------------------------- 435 --------------------------------------------------------------
492 arabic-digit 94 2 0 l2r MuleArabic-0 436 arabic-digit 94 2 0 l2r MuleArabic-0
506 @end example 450 @end example
507 451
508 For all of the above charsets, the dimension and number of columns are 452 For all of the above charsets, the dimension and number of columns are
509 the same. 453 the same.
510 454
511 Note that ASCII, Control-1, and Composite are handled specially. 455 Note that ASCII, Control-1, and Composite are handled specially.
512 This is why some of the fields are blank; and some of the filled-in 456 This is why some of the fields are blank; and some of the filled-in
513 fields (e.g. the type) are not really accurate. 457 fields (e.g. the type) are not really accurate.
514 458
515 @node MULE Characters, Composite Characters, Charsets, MULE 459 @node MULE Characters
516 @section MULE Characters 460 @section MULE Characters
517 461
518 @defun make-char charset arg1 &optional arg2 462 @defun make-char charset arg1 &optional arg2
519 This function makes a multi-byte character from @var{charset} and octets 463 This function makes a multi-byte character from @var{charset} and octets
520 @var{arg1} and @var{arg2}. 464 @var{arg1} and @var{arg2}.
537 481
538 @defun find-charset-string string 482 @defun find-charset-string string
539 This function returns a list of the charsets in @var{string}. 483 This function returns a list of the charsets in @var{string}.
540 @end defun 484 @end defun
541 485
542 @node Composite Characters, Coding Systems, MULE Characters, MULE 486 @node Composite Characters
543 @section Composite Characters 487 @section Composite Characters
544 488
545 Composite characters are not yet completely implemented. 489 Composite characters are not yet completely implemented.
546 490
547 @defun make-composite-char string 491 @defun make-composite-char string
548 This function converts a string into a single composite character. The 492 This function converts a string into a single composite character. The
549 character is the result of overstriking all the characters in the 493 character is the result of overstriking all the characters in the
550 string. 494 string.
568 character into one or more characters, the individual characters out of 512 character into one or more characters, the individual characters out of
569 which the composite character was formed. Non-composite characters are 513 which the composite character was formed. Non-composite characters are
570 left as-is. @var{buffer} defaults to the current buffer if omitted. 514 left as-is. @var{buffer} defaults to the current buffer if omitted.
571 @end defun 515 @end defun
572 516
573 @node Coding Systems, CCL, Composite Characters, MULE 517 @node ISO 2022
518 @section ISO 2022
519
520 This section briefly describes the ISO 2022 encoding standard. For a more
521 thorough understanding, please refer to the ISO 2022 standard
522 itself.
523
524 Character sets (@dfn{charsets}) are classified into the following four
525 categories, according to the number of characters in the charset:
526 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
527
528 @need 1000
529 @table @asis
530 @item 94-charset
531 ASCII(B), left(J) and right(I) half of JISX0201, ...
532 @item 96-charset
533 Latin-1(A), Latin-2(B), Latin-3(C), ...
534 @item 94x94-charset
535 GB2312(A), JISX0208(B), KSC5601(C), ...
536 @item 96x96-charset
537 none for the moment
538 @end table
539
540 The character in parentheses after the name of each charset
541 is the @dfn{final character} @var{F}, which can be regarded as
542 the identifier of the charset. ECMA allocates @var{F} to each
543 charset. @var{F} is in the range of 0x30..0x7F, but 0x30..0x3F
544 are only for private use.
545
546 Note: @dfn{ECMA} = European Computer Manufacturers Association
547
548 There are four @dfn{registers of charsets}, called G0 thru G3.
549 You can designate (or assign) any charset to one of these
550 registers.
551
552 The code space contained within one octet (of size 256) is divided into
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
554 register of charsets can be invoked.
555
556 @example
557 @group
558 C0: 0x00 - 0x1F
559 GL: 0x20 - 0x7F
560 C1: 0x80 - 0x9F
561 GR: 0xA0 - 0xFF
562 @end group
563 @end example
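
The four areas listed above can be expressed as a small lookup. The following is only an illustrative sketch in Python, not part of MULE; the function name `code_area` is hypothetical.

```python
def code_area(byte: int) -> str:
    """Return the ISO 2022 area (C0, GL, C1 or GR) that an octet falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single octet (0..255)")
    if byte <= 0x1F:
        return "C0"   # control characters, 0x00 - 0x1F
    if byte <= 0x7F:
        return "GL"   # graphic left, 0x20 - 0x7F
    if byte <= 0x9F:
        return "C1"   # control characters, 0x80 - 0x9F
    return "GR"       # graphic right, 0xA0 - 0xFF


for sample in (0x1B, 0x41, 0x8E, 0xC6):
    print(f"0x{sample:02X} -> {code_area(sample)}")
```

ESC (0x1B) falls in C0, an ASCII letter in GL, and an octet with the high bit set in either C1 or GR.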
564
565 Usually, in the initial state, G0 is invoked into GL, and G1
566 is invoked into GR.
567
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
569 7-bit environments, only C0 and GL are used.
570
571 Charset designation is done by escape sequences of the form:
572
573 @example
574 ESC [@var{I}] @var{I} @var{F}
575 @end example
576
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset.
579
580 The meanings of the intermediate characters are:
581
582 @example
583 @group
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
592 @end group
593 @end example
594
595 The following rule is not allowed in ISO 2022 but can be used in Mule.
596
597 @example
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
599 @end example
600
601 Here are examples of designations:
602
603 @example
604 @group
605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601
610 @end group
611 @end example
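
The designation rules can be checked mechanically. The sketch below, in Python, builds an escape sequence from a register number, a charset size, a dimension, and a final character, following the table of intermediate characters given earlier. The names `designation` and `INTERMEDIATES` are hypothetical and are not MULE functions; the short forms (such as ESC $ A) are deliberately not generated.

```python
# Intermediate characters, keyed by (charset size, register number).
# The (96, 0) entry "," is the Mule extension, not permitted by ISO 2022.
INTERMEDIATES = {
    (94, 0): "(", (94, 1): ")", (94, 2): "*", (94, 3): "+",
    (96, 0): ",", (96, 1): "-", (96, 2): ".", (96, 3): "/",
}


def designation(register: int, size: int, dimension: int, final: str) -> bytes:
    """Build the long-form escape sequence designating a charset to G0..G3."""
    if (size, register) not in INTERMEDIATES:
        raise ValueError("size must be 94 or 96 and register 0..3")
    if dimension not in (1, 2):
        raise ValueError("dimension must be 1 or 2")
    # Range of final characters as stated in the text.
    if len(final) != 1 or not 0x30 <= ord(final) <= 0x7F:
        raise ValueError("final character must be in 0x30..0x7F")
    prefix = "$" if dimension == 2 else ""   # "$" marks a dimension-2 charset
    return ("\x1b" + prefix + INTERMEDIATES[(size, register)] + final).encode("ascii")


print(designation(0, 94, 1, "B"))   # ASCII to G0
print(designation(1, 96, 1, "A"))   # Latin-1 to G1
print(designation(0, 94, 2, "B"))   # JISX0208 to G0 (long form)
print(designation(1, 94, 2, "C"))   # KSC5601 to G1
```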
612
613 To use a charset designated to G2 or G3, and to use a charset designated
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
615 into GL. There are two types of invocation, Locking Shift (forever) and
616 Single Shift (one character only).
617
618 Locking Shift is done as follows:
619
620 @example
621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR
628 @end example
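
As a rough illustration of locking shifts in a 7-bit environment, the Python sketch below splits a byte string into runs according to which register is currently invoked into GL. It handles only SI (LS0) and SO (LS1), and it assumes any designation escape sequences have already been removed. The function name `split_by_locking_shift` is hypothetical.

```python
from typing import List, Tuple

SO, SI = 0x0E, 0x0F   # LS1 and LS0 respectively


def split_by_locking_shift(data: bytes) -> List[Tuple[str, bytes]]:
    """Split 7-bit data into (register, bytes) runs, tracking SO/SI."""
    runs: List[Tuple[str, bytes]] = []
    current = "G0"            # G0 is invoked into GL in the initial state
    buf = bytearray()
    for byte in data:
        if byte in (SO, SI):
            if buf:           # close the run that the shift terminates
                runs.append((current, bytes(buf)))
                buf.clear()
            current = "G1" if byte == SO else "G0"
        else:
            buf.append(byte)
    if buf:
        runs.append((current, bytes(buf)))
    return runs


print(split_by_locking_shift(b"abc\x0eGQ1[\x0fdef"))
```

Because a locking shift stays in effect until the next one, the bytes between SO and SI all belong to whatever charset is designated to G1.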
629
630 Single Shift is done as follows:
631
632 @example
633 @group
634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL
636 @end group
637 @end example
638
639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
641 ESC O behave as indicated. The above definitions will not parse
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
643 has similar problems.)
644
645 As you can see, there are many ISO-2022-compliant ways of
646 encoding multilingual text. Many coding systems in actual use,
647 such as X11's Compound Text, Japanese JUNET code, and so-called
648 EUC (Extended UNIX Code), are variants of ISO 2022.
649
650 In Mule, we characterize ISO 2022 by the following attributes:
651
652 @enumerate
653 @item
654 Initial designation to G0 thru G3.
655 @item
656 Allow designation of short form for Japanese and Chinese.
657 @item
658 Should we designate ASCII to G0 before control characters?
659 @item
660 Should we designate ASCII to G0 at the end of line?
661 @item
662 7-bit environment or 8-bit environment.
663 @item
664 Use Locking Shift or not.
665 @item
666 Use ASCII or JIS0201-1976-Roman.
667 @item
668 Use JISX0208-1983 or JISX0208-1976.
669 @end enumerate
670
671 (The last two are only for Japanese.)
672
673 By specifying these attributes, you can create any variant
674 of ISO 2022.
675
676 Here are several examples:
677
678 @example
679 @group
680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used
682 2. Yes.
683 3. Yes.
684 4. Yes.
685 5. 7-bit environment
686 6. No.
687 7. Use ASCII
688 8. Use JISX0208-1983
689 @end group
690
691 @group
692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
694 2. No.
695 3. No.
696 4. Yes.
697 5. 8-bit environment
698 6. No.
699 7. Use ASCII
700 8. Use JISX0208-1983
701 @end group
702
703 @group
704 euc-china -- Chinese EUC. Although many people call this
705 "GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
707 2. No.
708 3. Yes.
709 4. Yes.
710 5. 8-bit environment
711 6. No.
712 7. Use ASCII
713 8. Use JISX0208-1983
714 @end group
715
716 @group
717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
719 2. No.
720 3. Yes.
721 4. Yes.
722 5. 7-bit environment
723 6. Yes.
724 7. No.
725 8. No.
726 @end group
727 @end example
728
729 Mule creates all these coding systems by default.
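
To make the eight attributes concrete, they can be modeled as a record. This is a hypothetical Python sketch for illustration only; the class and field names are assumptions and do not correspond to Mule's actual data structures. The two instances are transcribed from the junet and ctext listings above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Iso2022Variant:
    # 1. Charsets initially designated to G0..G3 (None = never used).
    initial: Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]
    short_form: bool             # 2. allow short-form designations
    ascii_before_control: bool   # 3. designate ASCII to G0 before controls
    ascii_at_eol: bool           # 4. designate ASCII to G0 at end of line
    seven_bit: bool              # 5. True = 7-bit, False = 8-bit environment
    locking_shift: bool          # 6. use Locking Shift
    jis_roman: bool              # 7. True = JIS Roman instead of ASCII
    jisx0208_1976: bool          # 8. True = JISX0208-1976, False = -1983


JUNET = Iso2022Variant(
    initial=("ASCII", None, None, None),
    short_form=True, ascii_before_control=True, ascii_at_eol=True,
    seven_bit=True, locking_shift=False,
    jis_roman=False, jisx0208_1976=False,
)

CTEXT = Iso2022Variant(
    initial=("ASCII", "Latin-1", None, None),
    short_form=False, ascii_before_control=False, ascii_at_eol=True,
    seven_bit=False, locking_shift=False,
    jis_roman=False, jisx0208_1976=False,
)

print(JUNET)
print(CTEXT)
```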
730
731 @node Coding Systems
574 @section Coding Systems 732 @section Coding Systems
575 733
576 A coding system is an object that defines how text containing multiple 734 A coding system is an object that defines how text containing multiple
577 character sets is encoded into a stream of (typically 8-bit) bytes. The 735 character sets is encoded into a stream of (typically 8-bit) bytes. The
578 coding system is used to decode the stream into a series of characters 736 coding system is used to decode the stream into a series of characters
579 (which may be from multiple charsets) when the text is read from a file 737 (which may be from multiple charsets) when the text is read from a file
580 or process, and is used to encode the text back into the same format 738 or process, and is used to encode the text back into the same format
581 when it is written out to a file or process. 739 when it is written out to a file or process.
582 740
583 For example, many ISO-2022-compliant coding systems (such as Compound 741 For example, many ISO-2022-compliant coding systems (such as Compound
584 Text, which is used for inter-client data under the X Window System) use 742 Text, which is used for inter-client data under the X Window System) use
585 escape sequences to switch between different charsets -- Japanese Kanji, 743 escape sequences to switch between different charsets -- Japanese Kanji,
586 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
587 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
588 @code{make-coding-system} for more information. 746 @code{make-coding-system} for more information.
589 747
590 Coding systems are normally identified using a symbol, and the symbol is 748 Coding systems are normally identified using a symbol, and the symbol is
591 accepted in place of the actual coding system object whenever a coding 749 accepted in place of the actual coding system object whenever a coding
592 system is called for. (This is similar to how faces and charsets work.) 750 system is called for. (This is similar to how faces and charsets work.)
593 751
594 @defun coding-system-p object 752 @defun coding-system-p object
595 This function returns non-@code{nil} if @var{object} is a coding system. 753 This function returns non-@code{nil} if @var{object} is a coding system.
596 @end defun 754 @end defun
597 755
598 @menu 756 @menu
599 * Coding System Types:: Classifying coding systems. 757 * Coding System Types:: Classifying coding systems.
600 * ISO 2022:: An international standard for
601 charsets and encodings.
602 * EOL Conversion:: Dealing with different ways of denoting 758 * EOL Conversion:: Dealing with different ways of denoting
603 the end of a line. 759 the end of a line.
604 * Coding System Properties:: Properties of a coding system. 760 * Coding System Properties:: Properties of a coding system.
605 * Basic Coding System Functions:: Working with coding systems. 761 * Basic Coding System Functions:: Working with coding systems.
606 * Coding System Property Functions:: Retrieving a coding system's properties. 762 * Coding System Property Functions:: Retrieving a coding system's properties.
607 * Encoding and Decoding Text:: Encoding and decoding text. 763 * Encoding and Decoding Text:: Encoding and decoding text.
608 * Detection of Textual Encoding:: Determining how text is encoded. 764 * Detection of Textual Encoding:: Determining how text is encoded.
609 * Big5 and Shift-JIS Functions:: Special functions for these non-standard 765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
610 encodings. 766 encodings.
611 * Predefined Coding Systems:: Coding systems implemented by MULE.
612 @end menu 767 @end menu
613 768
614 @node Coding System Types, ISO 2022, , Coding Systems 769 @node Coding System Types
615 @subsection Coding System Types 770 @subsection Coding System Types
616 771
617 The coding system type determines the basic algorithm XEmacs will use to
618 decode or encode a data stream. Character encodings will be converted
619 to the MULE encoding, escape sequences processed, and newline sequences
620 converted to XEmacs's internal representation. There are three basic
621 classes of coding system type: no-conversion, ISO-2022, and special.
622
623 No conversion allows you to look at the file's internal representation.
624 Since XEmacs is basically a text editor, "no conversion" does convert
625 newline conventions by default. (Use the 'binary coding-system if this
626 is not desired.)
627
628 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating
629 use of "coded character sets for the exchange of data", i.e., text
630 streams. ISO 2022 contains functions that make it possible to encode
631 text streams to comply with restrictions of the Internet mail system and
632 de facto restrictions of most file systems (e.g., use of the separator
633 character in file names). Coding systems which are not ISO 2022
634 conformant can be difficult to handle. Perhaps more important, they are
635 not adaptable to multilingual information interchange, with the obvious
636 exception of ISO 10646 (Unicode). (Unicode is partially supported by
637 XEmacs with the addition of the Lisp package ucs-conv.)
638
639 The special class of coding systems includes automatic detection, CCL (a
640 "little language" embedded as an interpreter, useful for translating
641 between variants of a single character set), non-ISO-2022-conformant
642 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding.
643 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly
644 for other versions of XEmacs and for GNU Emacs 20.)
645
646 @table @code 772 @table @code
773 @item nil
774 @itemx autodetect
775 Automatic conversion. XEmacs attempts to detect the coding system used
776 in the file.
647 @item no-conversion 777 @item no-conversion
648 No conversion, for binary files, and a few special cases of non-ISO-2022 778 No conversion. Use this for binary files and such. On output, graphic
649 coding systems where conversion is done by hook functions (usually 779 characters that are not in ASCII or Latin-1 will be replaced by a
650 implemented in CCL). On output, graphic characters that are not in 780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
651 ASCII or Latin-1 will be replaced by a @samp{?}. (For a 781 only be present if you explicitly insert them.)
652 no-conversion-encoded buffer, these characters will only be present if 782 @item shift-jis
653 you explicitly insert them.) 783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
654 @item iso2022 784 @item iso2022
655 Any ISO-2022-compliant encoding. Among others, this includes JIS (the 785 Any ISO-2022-compliant encoding. Among other things, this includes JIS
656 Japanese encoding commonly used for e-mail), national variants of EUC 786 (the Japanese encoding commonly used for e-mail), national variants of
657 (the standard Unix encoding for Japanese and other languages), and 787 EUC (the standard Unix encoding for Japanese and other languages), and
658 Compound Text (an encoding used in X11). You can specify more specific 788 Compound Text (an encoding used in X11). You can specify more specific
659 information about the conversion with the @var{flags} argument. 789 information about the conversion with the @var{flags} argument.
660 @item ucs-4
661 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode.
662 @item utf-8
663 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format
664 that can be used with both UCS-4 and Unicode.
665 @item undecided
666 Automatic conversion. XEmacs attempts to detect the coding system used
667 in the file.
668 @item shift-jis
669 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
670 @item big5 790 @item big5
671 Big5 (the encoding commonly used for Taiwanese). 791 Big5 (the encoding commonly used for Taiwanese).
672 @item ccl 792 @item ccl
673 The conversion is performed using a user-written pseudo-code program. 793 The conversion is performed using a user-written pseudo-code program.
674 CCL (Code Conversion Language) is the name of this pseudo-code. For 794 CCL (Code Conversion Language) is the name of this pseudo-code.
675 example, CCL is used to map KOI8-R characters (an encoding for Russian
676 Cyrillic) to ISO8859-5 (the form used internally by MULE).
677 @item internal 795 @item internal
678 Write out or read in the raw contents of the memory representing the 796 Write out or read in the raw contents of the memory representing the
679 buffer's text. This is primarily useful for debugging purposes, and is 797 buffer's text. This is primarily useful for debugging purposes, and is
680 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set 798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
681 (the @samp{--debug} configure option). @strong{Warning}: Reading in a 799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
683 inconsistency in the memory representing a buffer's text, which will 801 inconsistency in the memory representing a buffer's text, which will
684 produce unpredictable results and may cause XEmacs to crash. Under 802 produce unpredictable results and may cause XEmacs to crash. Under
685 normal circumstances you should never use @code{internal} conversion. 803 normal circumstances you should never use @code{internal} conversion.
686 @end table 804 @end table
687 805
688 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems 806 @node EOL Conversion
689 @section ISO 2022
690
691 This section briefly describes the ISO 2022 encoding standard. A more
692 thorough treatment is available in the original document of ISO
693 2022 as well as various national standards (such as JIS X 0202).
694
695 Character sets (@dfn{charsets}) are classified into the following four
696 categories, according to the number of characters in the charset:
697 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
698 that although an ISO 2022 coding system may have variable width
699 characters, each charset used is fixed-width (in contrast to the MULE
700 character set and UTF-8, for example).
701
702 ISO 2022 provides for switching between character sets via escape
703 sequences. This switching is somewhat complicated, because ISO 2022
704 provides for both legacy applications like Internet mail that accept
705 only 7 significant bits in some contexts (RFC 822 headers, for example),
706 and more modern "8-bit clean" applications. It also provides for
707 compact and transparent representation of languages like Japanese which
708 mix ASCII and a national script (even outside of computer programs).
709
710 First, ISO 2022 codified prevailing practice by dividing the code space
711 into "control" and "graphic" regions. The code points 0x00-0x1F and
712 0x80-0x9F are reserved for "control characters", while "graphic
713 characters" must be assigned to code points in the regions 0x20-0x7F and
714 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
715 circumstances must be assigned the graphic character "ASCII SPACE" and
716 the control character "ASCII DEL" respectively.
717
718 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F),
719 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left"
720 and "graphic right", respectively, because of the standard method of
721 displaying graphic character sets in tables with the high byte indexing
722 columns and the low byte indexing rows. I don't find it very intuitive,
723 but these are called "registers".
724
725 An ISO 2022-conformant encoding for a graphic character set must use a
726 fixed number of bytes per character, and the values must fit into a
727 single register; that is, each byte must range over either 0x20-0x7F, or
728 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
729 character set by using both ranges at the same time. This is why a standard
730 character set such as ISO 8859-1 is actually considered by ISO 2022 to
731 be an aggregation of two character sets, ASCII and LATIN-1, and why it
732 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
733 single character's bytes must all be drawn from the same register; this
734 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
735 2022-compatible encodings.
736
737 The reason for this restriction becomes clear when you attempt to define
738 an efficient, robust encoding for a language like Japanese. Like ISO
739 8859, Japanese encodings are aggregations of several character sets. In
740 practice, the vast majority of characters are drawn from the "JIS Roman"
741 character set (a derivative of ASCII; it won't hurt to think of it as
742 ASCII) and the JIS X 0208 standard "basic Japanese" character set
743 including not only ideographic characters ("kanji") but syllabic
744 Japanese characters ("kana"), a wide variety of symbols, and many
745 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although
746 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not
747 suited to programming; thus the inclusion of ASCII in the standard
748 Japanese encodings.
749
750 For normal Japanese text such as in newspapers, a broad repertoire of
751 approximately 3000 characters is used. Evidently this won't fit into
752 one byte; two must be used. But much of the text processed by Japanese
753 computers is computer source code, nearly all of which is ASCII. A not
754 insignificant portion of ordinary text is English (as such or as
755 borrowed Japanese vocabulary) or other languages which can be represented
756 at least approximately in ASCII, as well. It seems reasonable then to
757 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
758 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
759 invoked to the GL register, and JIS X 0208 is invoked to the GR
760 register. Thus, each byte can be tested for its character set by
761 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
762 Furthermore, since control characters like newline can never be part of
763 a graphic character, even in the case of corruption in transmission the
764 stream will be resynchronized at every line break, on the order of 60-80
765 bytes. This coding system requires no escape sequences or special
766 control codes to represent 99.9% of all Japanese text.
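
The high-bit test described above can be tried with Python's standard `euc_jp` codec. The sketch below is only an illustration: the function name `split_euc_jp` is hypothetical, and it ignores the single-shift bytes that EUC-JP uses for halfwidth katakana and JIS X 0212, so every byte with the high bit set is treated as JIS X 0208.

```python
from typing import List, Tuple


def split_euc_jp(data: bytes) -> List[Tuple[str, bytes]]:
    """Group EUC-JP bytes into ASCII and JIS X 0208 runs by the high bit."""
    runs: List[Tuple[str, bytes]] = []
    for byte in data:
        kind = "JISX0208" if byte & 0x80 else "ASCII"
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + bytes([byte]))   # extend the run
        else:
            runs.append((kind, bytes([byte])))               # start a new run
    return runs


encoded = "XEmacs 日本語".encode("euc_jp")
for kind, chunk in split_euc_jp(encoded):
    print(kind, chunk.hex(" "))
```

No escape sequence appears anywhere in the encoded bytes; the charset of each byte is recoverable from the byte alone.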
767
768 Note carefully the distinction between the character sets (ASCII and JIS
769 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The
770 JIS X 0208 character set is used in three different encodings for
771 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
772 always clear), in EUC-JP it is invoked into GR (setting the high bit in
773 the process), and in Shift JIS the high bit may be set or reset, and the
774 significant bits are shifted within the 16-bit character so that the two
775 main character sets can coexist with a third (the "halfwidth katakana"
776 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
777 version of the ISO-2022 coding system.
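
The relationship between the GL and GR invocations can be observed with Python's standard codecs. This is an illustrative sketch that relies on the behavior of Python's `iso2022_jp` and `euc_jp` codecs for a string consisting only of JIS X 0208 characters; it says nothing about how MULE implements these encodings.

```python
text = "日本語"

iso = text.encode("iso2022_jp")   # JIS X 0208 invoked into GL
euc = text.encode("euc_jp")       # JIS X 0208 invoked into GR
sjis = text.encode("shift_jis")   # not an ISO 2022 encoding

# Strip the leading designation (ESC $ B) and the trailing return
# to ASCII (ESC ( B); what remains is the two-byte character data.
payload = iso[3:-3]

print("ISO-2022-JP:", iso)
print("EUC-JP     :", euc.hex(" "))
print("Shift JIS  :", sjis.hex(" "))

# Setting the high bit of each ISO-2022-JP payload byte yields EUC-JP.
print(bytes(b | 0x80 for b in payload) == euc)
```

The Shift JIS bytes are printed for comparison only; they are not related to the other two by a simple bit flip.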
778
779 In order to systematically treat subsidiary character sets (like the
780 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
781 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
782 Unlike GL and GR, they are not logically distinguished by internal
783 format. Instead, the process of "invocation" mentioned earlier is
784 broken into two steps: first, a character set is @dfn{designated} to one
785 of the registers G0-G3 by use of an @dfn{escape sequence} of the form:
786
787 @example
788 ESC [@var{I}] @var{I} @var{F}
789 @end example
790
791 where @var{I} is an intermediate character or characters in the range
792 0x20 - 0x2F, and @var{F}, from the range 0x30-0x7F, is the final
793 character identifying this charset. (Final characters in the range
794 0x30-0x3F are reserved for private use and will never have a publicly
795 registered meaning.)
796
797 Then that register is @dfn{invoked} to either GL or GR, either
798 automatically (designations to G0 normally involve invocation to GL as
799 well), or by use of shifting (affecting only the following character in
800 the data stream) or locking (effective until the next designation or
801 locking) control sequences. An encoding conformant to ISO 2022 is
802 typically defined by designating the initial contents of the G0-G3
803 registers, specifying a 7- or 8-bit environment, and specifying whether
804 further designations will be recognized.
805
806 Some examples of character sets and the registered final characters
807 @var{F} used to designate them:
808
809 @need 1000
810 @table @asis
811 @item 94-charset
812 ASCII (B), left (J) and right (I) half of JIS X 0201, ...
813 @item 96-charset
814 Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
815 @item 94x94-charset
816 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
817 @item 96x96-charset
818 none for the moment
819 @end table
820
821 The meanings of the various characters in these sequences, where not
822 specified by the ISO 2022 standard (such as the ESC character), are
823 assigned by @dfn{ECMA}, the European Computer Manufacturers Association.
824
825 The meanings of the intermediate characters are:
826
827 @example
828 @group
829 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
830 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
831 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
832 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
833 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
834 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
835 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
836 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
837 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
838 @end group
839 @end example
840
841 The comma may be used in files read and written only by MULE, as a MULE
842 extension, but this is illegal in ISO 2022. (The reason is that in ISO
843 2022 G0 must be a 94-member character set, with 0x20 assigned the value
844 SPACE, and 0x7F assigned the value DEL.)
845
846 Here are examples of designations:
847
848 @example
849 @group
850 ESC ( B : designate to G0 ASCII
851 ESC - A : designate to G1 Latin-1
852 ESC $ ( A or ESC $ A : designate to G0 GB2312
853 ESC $ ( B or ESC $ B : designate to G0 JISX0208
854 ESC $ ) C : designate to G1 KSC5601
855 @end group
856 @end example
857
858 (The short forms used to designate GB2312 and JIS X 0208 are for
859 backwards compatibility; the long forms are preferred.)
860
861 To use a charset designated to G2 or G3, and to use a charset designated
862 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
863 into GL. There are two types of invocation, Locking Shift (forever) and
864 Single Shift (one character only).
865
866 Locking Shift is done as follows:
867
868 @example
869 LS0 or SI (0x0F): invoke G0 into GL
870 LS1 or SO (0x0E): invoke G1 into GL
871 LS2: invoke G2 into GL
872 LS3: invoke G3 into GL
873 LS1R: invoke G1 into GR
874 LS2R: invoke G2 into GR
875 LS3R: invoke G3 into GR
876 @end example
877
878 Single Shift is done as follows:
879
880 @example
881 @group
882 SS2 or ESC N: invoke G2 into GL
883 SS3 or ESC O: invoke G3 into GL
884 @end group
885 @end example
886
887 The shift functions (such as LS1R and SS3) are represented by control
888 characters (from C1) in 8 bit environments and by escape sequences in 7
889 bit environments.
890
891 (#### Ben says: I think the above is slightly incorrect. It appears that
892 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
893 ESC O behave as indicated. The above definitions will not parse
894 EUC-encoded text correctly, and it looks like the code in mule-coding.c
895 has similar problems.)
896
897 Evidently there are a lot of ISO-2022-compliant ways of encoding
898 multilingual text. Many coding systems in actual use,
899 such as X11's Compound Text, Japanese JUNET code, and so-called EUC
900 (Extended UNIX Code), are variants of ISO 2022.
901
902 In MULE, we characterize a version of ISO 2022 by the following
903 attributes:
904
905 @enumerate
906 @item
907 The character sets initially designated to G0 thru G3.
908 @item
909 Whether short form designations are allowed for Japanese and Chinese.
910 @item
911 Whether ASCII should be designated to G0 before control characters.
912 @item
913 Whether ASCII should be designated to G0 at the end of line.
914 @item
915 7-bit environment or 8-bit environment.
916 @item
917 Whether Locking Shifts are used or not.
918 @item
919 Whether to use ASCII or the variant JIS X 0201-1976-Roman.
920 @item
921 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.
922 @end enumerate
923
924 (The last two are only for Japanese.)
925
926 By specifying these attributes, you can create any variant
927 of ISO 2022.
928
929 Here are several examples:
930
931 @example
932 @group
933 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
934 1. G0 <- ASCII, G1..3 <- never used
935 2. Yes.
936 3. Yes.
937 4. Yes.
938 5. 7-bit environment
939 6. No.
940 7. Use ASCII
941 8. Use JIS X 0208-1983
942 @end group
943
944 @group
945 ctext -- X11 Compound Text
946 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
947 2. No.
948 3. No.
949 4. Yes.
950 5. 8-bit environment.
951 6. No.
952 7. Use ASCII.
953 8. Use JIS X 0208-1983.
954 @end group
955
956 @group
957 euc-china -- Chinese EUC. Often called the "GB encoding", but that is
958 technically incorrect.
959 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
960 2. No.
961 3. Yes.
962 4. Yes.
963 5. 8-bit environment.
964 6. No.
965 7. Use ASCII.
966 8. Use JIS X 0208-1983.
967 @end group
968
969 @group
970 ISO-2022-KR -- Coding system used in Korean email.
971 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
972 2. No.
973 3. Yes.
974 4. Yes.
975 5. 7-bit environment.
976 6. Yes.
977 7. Use ASCII.
978 8. Use JIS X 0208-1983.
979 @end group
980 @end example
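To make some of these attributes concrete, here is a small sketch that inspects the bytes produced for ISO-2022-JP and for Chinese EUC. It assumes Python's standard @code{iso2022_jp} and @code{gb2312} codecs as stand-ins; it does not use or describe the MULE implementation, and the variable names are hypothetical.

```python
# Designation escape sequences used by ISO-2022-JP.
DESIGNATE_JISX0208 = b"\x1b$B"   # short form; the long form is ESC $ ( B
DESIGNATE_ASCII = b"\x1b(B"

# Two kanji separated by a newline.
jp = "\u65e5\n\u672c".encode("iso2022_jp")
newline_at = jp.index(b"\n")

# Attribute 4: ASCII is designated to G0 again before the line ends,
# and once more at the end of the text.
ascii_before_newline = jp[:newline_at].endswith(DESIGNATE_ASCII)
ascii_at_end = jp.endswith(DESIGNATE_ASCII)

# Attribute 5: a 7-bit environment never sets the high bit.
seven_bit = all(b < 0x80 for b in jp)

# Chinese EUC is an 8-bit variant: GB 2312 is reached through GR, so a
# Chinese character needs neither an escape sequence nor a shift.
cn = "\u4e2d".encode("gb2312")
eight_bit_no_escape = all(b >= 0xA1 for b in cn) and b"\x1b" not in cn

print(jp, cn)
```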
981
982 MULE creates all of these coding systems by default.
983
984 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems
985 @subsection EOL Conversion 807 @subsection EOL Conversion
986 808
987 @table @code 809 @table @code
988 @item nil 810 @item nil
989 Automatically detect the end-of-line type (LF, CRLF, or CR). Also 811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
1006 Automatically detect the end-of-line type but do not generate subsidiary 828 Automatically detect the end-of-line type but do not generate subsidiary
1007 coding systems. (This value is converted to @code{nil} when stored 829 coding systems. (This value is converted to @code{nil} when stored
1008 internally, and @code{coding-system-property} will return @code{nil}.) 830 internally, and @code{coding-system-property} will return @code{nil}.)
1009 @end table 831 @end table
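Automatic detection of the end-of-line type can be pictured as a scan for the first line break. The function below is a minimal sketch of that idea in Python; @code{detect_eol} is a hypothetical helper, and the actual XEmacs detection code may well work differently.

```python
from typing import Optional


def detect_eol(data: bytes) -> Optional[str]:
    """Classify the first line break in data as 'crlf', 'cr' or 'lf'.

    Returns None when data contains no line break at all.
    """
    for i, byte in enumerate(data):
        if byte == 0x0D:  # CR: either a lone CR or the start of CRLF
            return "crlf" if data[i + 1:i + 2] == b"\n" else "cr"
        if byte == 0x0A:  # LF
            return "lf"
    return None


for sample in (b"unix\nfile", b"dos\r\nfile", b"mac\rfile", b"one line"):
    print(sample, detect_eol(sample))
```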
1010 832
1011 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems 833 @node Coding System Properties
1012 @subsection Coding System Properties 834 @subsection Coding System Properties
1013 835
1014 @table @code 836 @table @code
1015 @item mnemonic 837 @item mnemonic
1016 String to be displayed in the modeline when this coding system is 838 String to be displayed in the modeline when this coding system is
1017 active. 839 active.
1018 840
1019 @item eol-type 841 @item eol-type
1020 End-of-line conversion to be used. It should be one of the types 842 End-of-line conversion to be used. It should be one of the types
1021 listed in @ref{EOL Conversion}. 843 listed in @ref{EOL Conversion}.
1022
1023 @item eol-lf
1024 The coding system which is the same as this one, except that it uses the
1025 Unix line-breaking convention.
1026
1027 @item eol-crlf
1028 The coding system which is the same as this one, except that it uses the
1029 DOS line-breaking convention.
1030
1031 @item eol-cr
1032 The coding system which is the same as this one, except that it uses the
1033 Macintosh line-breaking convention.
1034 844
1035 @item post-read-conversion 845 @item post-read-conversion
1036 Function called after a file has been read in, to perform the decoding. 846 Function called after a file has been read in, to perform the decoding.
1037 Called with two arguments, @var{beg} and @var{end}, denoting a region of 847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
1038 the current buffer to be decoded. 848 the current buffer to be decoded.
1041 Function called before a file is written out, to perform the encoding. 851 Function called before a file is written out, to perform the encoding.
1042 Called with two arguments, @var{beg} and @var{end}, denoting a region of 852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
1043 the current buffer to be encoded. 853 the current buffer to be encoded.
1044 @end table 854 @end table
1045 855
1046 The following additional properties are recognized if @var{type} is 856 The following additional properties are recognized if @var{type} is
1047 @code{iso2022}: 857 @code{iso2022}:
1048 858
1049 @table @code 859 @table @code
1050 @item charset-g0 860 @item charset-g0
1051 @itemx charset-g1 861 @itemx charset-g1
1119 A list of conversion specifications, specifying conversion of characters 929 A list of conversion specifications, specifying conversion of characters
1120 in one charset to another when encoding is performed. The form of each 930 in one charset to another when encoding is performed. The form of each
1121 specification is the same as for @code{input-charset-conversion}. 931 specification is the same as for @code{input-charset-conversion}.
1122 @end table 932 @end table
1123 933
1124 The following additional properties are recognized (and required) if 934 The following additional properties are recognized (and required) if
1125 @var{type} is @code{ccl}: 935 @var{type} is @code{ccl}:
1126 936
1127 @table @code 937 @table @code
1128 @item decode 938 @item decode
1129 CCL program used for decoding (converting to internal format). 939 CCL program used for decoding (converting to internal format).
1130 940
1131 @item encode 941 @item encode
1132 CCL program used for encoding (converting to external format). 942 CCL program used for encoding (converting to external format).
1133 @end table 943 @end table
1134 944
1135 The following properties are used internally: @var{eol-cr}, 945 @node Basic Coding System Functions
1136 @var{eol-crlf}, @var{eol-lf}, and @var{base}.
1137
1138 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems
1139 @subsection Basic Coding System Functions 946 @subsection Basic Coding System Functions
1140 947
1141 @defun find-coding-system coding-system-or-name 948 @defun find-coding-system coding-system-or-name
1142 This function retrieves the coding system of the given name. 949 This function retrieves the coding system of the given name.
1143 950
1144 If @var{coding-system-or-name} is a coding-system object, it is simply 951 If @var{coding-system-or-name} is a coding-system object, it is simply
1145 returned. Otherwise, @var{coding-system-or-name} should be a symbol. 952 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
1146 If there is no such coding system, @code{nil} is returned. Otherwise 953 If there is no such coding system, @code{nil} is returned. Otherwise
1147 the associated coding system object is returned. 954 the associated coding system object is returned.
1148 @end defun 955 @end defun
1149 956
1159 966
1160 @defun coding-system-name coding-system 967 @defun coding-system-name coding-system
1161 This function returns the name of the given coding system. 968 This function returns the name of the given coding system.
1162 @end defun 969 @end defun
1163 970
1164 @defun coding-system-base coding-system
1165 This function returns the base coding system of @var{coding-system},
1166 that is, the one with an undecided EOL convention.
1167 @end defun
1168
1169 @defun make-coding-system name type &optional doc-string props 971 @defun make-coding-system name type &optional doc-string props
1170 This function registers symbol @var{name} as a coding system. 972 This function registers symbol @var{name} as a coding system.
1171 973
1172 @var{type} describes the conversion method used and should be one of 974 @var{type} describes the conversion method used and should be one of
1173 the types listed in @ref{Coding System Types}. 975 the types listed in @ref{Coding System Types}.
1188 @defun subsidiary-coding-system coding-system eol-type 990 @defun subsidiary-coding-system coding-system eol-type
1189 This function returns the subsidiary coding system of 991 This function returns the subsidiary coding system of
1190 @var{coding-system} with eol type @var{eol-type}. 992 @var{coding-system} with eol type @var{eol-type}.
1191 @end defun 993 @end defun
1192 994
1193 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems 995 @node Coding System Property Functions
1194 @subsection Coding System Property Functions 996 @subsection Coding System Property Functions
1195 997
1196 @defun coding-system-doc-string coding-system 998 @defun coding-system-doc-string coding-system
1197 This function returns the doc string for @var{coding-system}. 999 This function returns the doc string for @var{coding-system}.
1198 @end defun 1000 @end defun
1203 1005
1204 @defun coding-system-property coding-system prop 1006 @defun coding-system-property coding-system prop
1205 This function returns the @var{prop} property of @var{coding-system}. 1007 This function returns the @var{prop} property of @var{coding-system}.
1206 @end defun 1008 @end defun
1207 1009
1208 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems 1010 @node Encoding and Decoding Text
1209 @subsection Encoding and Decoding Text 1011 @subsection Encoding and Decoding Text
1210 1012
1211 @defun decode-coding-region start end coding-system &optional buffer 1013 @defun decode-coding-region start end coding-system &optional buffer
1212 This function decodes the text between @var{start} and @var{end} which 1014 This function decodes the text between @var{start} and @var{end} which
1213 is encoded in @var{coding-system}. This is useful if you've read in 1015 is encoded in @var{coding-system}. This is useful if you've read in
1224 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS 1026 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
1225 encoding. The length of the encoded text is returned. @var{buffer} 1027 encoding. The length of the encoded text is returned. @var{buffer}
1226 defaults to the current buffer if unspecified. 1028 defaults to the current buffer if unspecified.
1227 @end defun 1029 @end defun
1228 1030
1229 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems 1031 @node Detection of Textual Encoding
1230 @subsection Detection of Textual Encoding 1032 @subsection Detection of Textual Encoding
1231 1033
1232 @defun coding-category-list 1034 @defun coding-category-list
1233 This function returns a list of all recognized coding categories. 1035 This function returns a list of all recognized coding categories.
1234 @end defun 1036 @end defun
1260 returns @code{autodetect} or one of its subsidiary coding systems 1062 returns @code{autodetect} or one of its subsidiary coding systems
1261 according to a detected end-of-line type. Optional arg @var{buffer} 1063 according to a detected end-of-line type. Optional arg @var{buffer}
1262 defaults to the current buffer. 1064 defaults to the current buffer.
1263 @end defun 1065 @end defun
1264 1066
1265 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems 1067 @node Big5 and Shift-JIS Functions
1266 @subsection Big5 and Shift-JIS Functions 1068 @subsection Big5 and Shift-JIS Functions
1267 1069
1268 These are special functions for working with the non-standard 1070 These are special functions for working with the non-standard
1269 Shift-JIS and Big5 encodings. 1071 Shift-JIS and Big5 encodings.
1270 1072
1271 @defun decode-shift-jis-char code 1073 @defun decode-shift-jis-char code
1272 This function decodes a JIS X 0208 character of Shift-JIS coding-system. 1074 This function decodes a JISX0208 character of Shift-JIS coding-system.
1273 @var{code} is the character code in Shift-JIS as a cons of two bytes. 1075 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1274 The corresponding character is returned. 1076 The corresponding character is returned.
1275 @end defun 1077 @end defun
1276 1078
1277 @defun encode-shift-jis-char ch 1079 @defun encode-shift-jis-char ch
1278 This function encodes a JIS X 0208 character @var{ch} to SHIFT-JIS 1080 This function encodes a JISX0208 character @var{ch} to SHIFT-JIS
1279 coding-system. The corresponding character code in SHIFT-JIS is 1081 coding-system. The corresponding character code in SHIFT-JIS is
1280 returned as a cons of two bytes. 1082 returned as a cons of two bytes.
1281 @end defun 1083 @end defun
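The relation between a JIS X 0208 code point and its Shift-JIS bytes is purely arithmetic. The sketch below shows the usual mapping in Python and checks it against Python's standard @code{euc_jp} and @code{shift_jis} codecs. The function name is hypothetical, and this is not the code used by @code{encode-shift-jis-char}.

```python
from typing import Tuple


def jisx0208_to_shift_jis(j1: int, j2: int) -> Tuple[int, int]:
    """Map a JIS X 0208 code point to its two Shift-JIS bytes.

    j1 and j2 are the 7-bit row and column bytes, each in 0x21..0x7E.
    """
    # Two JIS rows share one Shift-JIS lead byte; lead bytes skip the
    # range 0xA0..0xDF, hence the second offset.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd rows fill the lower half of the trail-byte range.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:  # 0x7F is never used as a trail byte
            s2 += 1
    else:
        # Even rows fill the upper half.
        s2 = j2 + 0x7E
    return s1, s2


# Cross-check: EUC-JP bytes are the JIS X 0208 bytes with the high bit set.
for ch in "\u4e9c\u65e5\u672c\u8a9e\u30e0":
    e = ch.encode("euc_jp")
    pair = jisx0208_to_shift_jis(e[0] & 0x7F, e[1] & 0x7F)
    print(hex(ord(ch)), [hex(b) for b in pair],
          bytes(pair) == ch.encode("shift_jis"))
```

The pair returned here plays the role of the cons of two bytes described above.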
1282 1084
1283 @defun decode-big5-char code 1085 @defun decode-big5-char code
1289 @defun encode-big5-char ch 1091 @defun encode-big5-char ch
1290 This function encodes the Big5 character @var{ch} to BIG5 1092 This function encodes the Big5 character @var{ch} to BIG5
1291 coding-system. The corresponding character code in Big5 is returned. 1093 coding-system. The corresponding character code in Big5 is returned.
1292 @end defun 1094 @end defun
1293 1095
1294 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems
1295 @subsection Coding Systems Implemented
1296
1297 MULE initializes most of the commonly used coding systems at XEmacs's
1298 startup. A few others are initialized only when the relevant language
1299 environment is selected and support libraries are loaded. (NB: The
1300 following list is based on XEmacs 21.2.19, the development branch at the
1301 time of writing. The list may be somewhat different for other
1302 versions. Recent versions of GNU Emacs 20 implement a few more rare
1303 coding systems; work is being done to port these to XEmacs.)
1304
1305 Unfortunately, there is not a consistent naming convention for character
1306 sets, and for practical purposes coding systems often take their name
1307 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others
1308 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
1309 from their non-text usages (internal, binary). To provide for this, and
1310 for the fact that many coding systems have several common names, an
1311 aliasing system is provided. Finally, some effort has been made to use
1312 names that are registered as MIME charsets (this is why the name
1313 'shift_jis contains that un-Lisp-y underscore).
1314
1315 There is a systematic naming convention regarding end-of-line (EOL)
1316 conventions for different systems. A coding system whose name ends in
1317 "-unix" forces the assumption that lines are broken by newlines (0x0A).
1318 A coding system whose name ends in "-mac" forces the assumption that
1319 lines are broken by ASCII CRs (0x0D). A coding system whose name ends
1320 in "-dos" forces the assumption that lines are broken by CRLF sequences
1321 (0x0D 0x0A). These subsidiary coding systems are automatically derived
1322 from a base coding system. Use of the base coding system implies
1323 autodetection of the text file convention. (The fact that the -unix,
1324 -mac, and -dos are derived from a base system results in them showing up
1325 as "aliases" in `list-coding-systems'.) These subsidiaries have a
1326 consistent modeline indicator as well. "-dos" coding systems have ":T"
1327 appended to their modeline indicator, while "-mac" coding systems have
1328 ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
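The naming scheme for subsidiaries is mechanical, as the following Python sketch shows. The helper @code{subsidiaries} and its table are hypothetical, and the empty modeline suffix for the "-unix" case is an assumption, since the text above only specifies the suffixes for "-dos" and "-mac".

```python
# EOL type -> (name suffix, modeline suffix).
# The empty modeline suffix for "lf" is an assumption of this sketch.
EOL_SUFFIXES = {
    "lf": ("-unix", ""),
    "crlf": ("-dos", ":T"),
    "cr": ("-mac", ":t"),
}


def subsidiaries(base: str, indicator: str) -> dict:
    """Return {eol type: (subsidiary name, modeline indicator)} for base."""
    return {
        eol: (base + name_suffix, indicator + mode_suffix)
        for eol, (name_suffix, mode_suffix) in EOL_SUFFIXES.items()
    }


for eol, (name, mode) in subsidiaries("iso-2022-8", "ISO8").items():
    print(eol, name, mode)
```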
1329
1330 In the following table, each coding system is given with its mode line
1331 indicator in parentheses. Non-textual coding systems are listed first,
1332 followed by textual coding systems and their aliases. (The coding system
1333 subsidiary modeline indicators ":T" and ":t" will be omitted from the
1334 table of coding systems.)
1335
1336 ### SJT 1999-08-23 Maybe should order these by language? Definitely
1337 need language usage for the ISO-8859 family.
1338
1339 Note that although true coding system aliases have been implemented for
1340 XEmacs 21.2, the coding system initialization has not yet been converted
1341 as of 21.2.19. So coding systems described as aliases have the same
1342 properties as the aliased coding system, but will not be equal as Lisp
1343 objects.
1344
1345 @table @code
1346
1347 @item automatic-conversion
1348 @itemx undecided
1349 @itemx undecided-dos
1350 @itemx undecided-mac
1351 @itemx undecided-unix
1352
1353 Modeline indicator: @code{Auto}. A type @code{undecided} coding system.
1354 Attempts to determine an appropriate coding system from file contents or
1355 the environment.
1356
1357 @item raw-text
1358 @itemx no-conversion
1359 @itemx raw-text-dos
1360 @itemx raw-text-mac
1361 @itemx raw-text-unix
1362 @itemx no-conversion-dos
1363 @itemx no-conversion-mac
1364 @itemx no-conversion-unix
1365
1366 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system,
1367 which converts only line-break-codes. An implementation quirk means
1368 that this coding system is also used for ISO8859-1.
1369
1370 @item binary
1371 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding
1372 system which does no character coding or EOL conversions. An alias for
1373 @code{raw-text-unix}.
1374
1375 @item alternativnyj
1376 @itemx alternativnyj-dos
1377 @itemx alternativnyj-mac
1378 @itemx alternativnyj-unix
1379
1380 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for
1381 Alternativnyj, an encoding of the Cyrillic alphabet.
1382
1383 @item big5
1384 @itemx big5-dos
1385 @itemx big5-mac
1386 @itemx big5-unix
1387
1388 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for
1389 BIG5, the most common encoding of traditional Chinese as used in Taiwan.
1390
1391 @item cn-gb-2312
1392 @itemx cn-gb-2312-dos
1393 @itemx cn-gb-2312-mac
1394 @itemx cn-gb-2312-unix
1395
1396 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used
1397 for simplified Chinese (as used in the People's Republic of China), with
1398 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng}
1399 (G2) character sets initially designated. Chinese EUC (Extended Unix
1400 Code).
1401
1402 @item ctext-hebrew
1403 @itemx ctext-hebrew-dos
1404 @itemx ctext-hebrew-mac
1405 @itemx ctext-hebrew-unix
1406
1407 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system
1408 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character
1409 sets initially designated for Hebrew.
1410
1411 @item ctext
1412 @itemx ctext-dos
1413 @itemx ctext-mac
1414 @itemx ctext-unix
1415
1416 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system
1417 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
1418 sets initially designated. X11 Compound Text Encoding. Often
1419 mistakenly recognized instead of EUC encodings; usual cause is
1420 inappropriate setting of @code{coding-priority-list}.
1421
1422 @item escape-quoted
1423
1424 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding
1425 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
1426 character sets initially designated and escape quoting. Unix EOL
1427 conversion (i.e., no conversion). It is used for .ELC files.
1428
1429 @item euc-jp
1430 @itemx euc-jp-dos
1431 @itemx euc-jp-mac
1432 @itemx euc-jp-unix
1433
1434 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system
1435 with @code{ascii} (G0), @code{japanese-jisx0208} (G1),
1436 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3)
1437 initially designated. Japanese EUC (Extended Unix Code).
1438
1439 @item euc-kr
1440 @itemx euc-kr-dos
1441 @itemx euc-kr-mac
1442 @itemx euc-kr-unix
1443
1444 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system
1445 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1446 designated. Korean EUC (Extended Unix Code).
1447
1448 @item hz-gb-2312
1449 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding
1450 system with Unix EOL convention (i.e., no conversion) using
1451 post-read-decode and pre-write-encode functions to translate the Hz/ZW
1452 coding system used for Chinese.
1453
1454 @item iso-2022-7bit
1455 @itemx iso-2022-7bit-unix
1456 @itemx iso-2022-7bit-dos
1457 @itemx iso-2022-7bit-mac
1458 @itemx iso-2022-7
1459
1460 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system
1461 with @code{ascii} (G0) initially designated. Other character sets must
1462 be explicitly designated to be used.
1463
1464 @item iso-2022-7bit-ss2
1465 @itemx iso-2022-7bit-ss2-dos
1466 @itemx iso-2022-7bit-ss2-mac
1467 @itemx iso-2022-7bit-ss2-unix
1468
1469 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system
1470 with @code{ascii} (G0) initially designated. Other character sets must
1471 be explicitly designated to be used. SS2 is used to invoke a
1472 96-charset, one character at a time.
1473
1474 @item iso-2022-8
1475 @itemx iso-2022-8-dos
1476 @itemx iso-2022-8-mac
1477 @itemx iso-2022-8-unix
1478
1479 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system
1480 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1481 designated. Other character sets must be explicitly designated to be
1482 used. No single-shift or locking-shift.
1483
1484 @item iso-2022-8bit-ss2
1485 @itemx iso-2022-8bit-ss2-dos
1486 @itemx iso-2022-8bit-ss2-mac
1487 @itemx iso-2022-8bit-ss2-unix
1488
1489 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system
1490 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1491 designated. Other character sets must be explicitly designated to be
1492 used. SS2 is used to invoke a 96-charset, one character at a time.
1493
1494 @item iso-2022-int-1
1495 @itemx iso-2022-int-1-dos
1496 @itemx iso-2022-int-1-mac
1497 @itemx iso-2022-int-1-unix
1498
1499 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system
1500 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1501 designated. ISO-2022-INT-1.
1502
1503 @item iso-2022-jp-1978-irv
1504 @itemx iso-2022-jp-1978-irv-dos
1505 @itemx iso-2022-jp-1978-irv-mac
1506 @itemx iso-2022-jp-1978-irv-unix
1507
1508 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding
1509 system. For compatibility with old Japanese terminals; if you need to
1510 know, look at the source.
1511
1512 @item iso-2022-jp
1513 @itemx iso-2022-jp-2 (ISO7/SS)
1514 @itemx iso-2022-jp-dos
1515 @itemx iso-2022-jp-mac
1516 @itemx iso-2022-jp-unix
1517 @itemx iso-2022-jp-2-dos
1518 @itemx iso-2022-jp-2-mac
1519 @itemx iso-2022-jp-2-unix
1520
1521 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding
1522 system with @code{ascii} (G0) initially designated, and complex
1523 specifications to ensure backward compatibility with old Japanese
1524 systems. Used for communication with mail and news in Japan. The "-2"
1525 versions also use SS2 to invoke a 96-charset one character at a time.
1526
1527 @item iso-2022-kr
1528 Modeline indicator: @code{Ko/7bit}. A type @code{iso2022} 7-bit coding
1529 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1530 designated. Used for e-mail in Korea.
1531
1532 @item iso-2022-lock
1533 @itemx iso-2022-lock-dos
1534 @itemx iso-2022-lock-mac
1535 @itemx iso-2022-lock-unix
1536
1537 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding
1538 system with @code{ascii} (G0) initially designated, using Locking-Shift
1539 to invoke a 96-charset.
1540
1541 @item iso-8859-1
1542 @itemx iso-8859-1-dos
1543 @itemx iso-8859-1-mac
1544 @itemx iso-8859-1-unix
1545
1546 Due to implementation, this is not a type @code{iso2022} coding system,
1547 but rather an alias for the @code{raw-text} coding system.
1548
1549 @item iso-8859-2
1550 @itemx iso-8859-2-dos
1551 @itemx iso-8859-2-mac
1552 @itemx iso-8859-2-unix
1553
1554 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding
1555 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially
1556 invoked.
1557
1558 @item iso-8859-3
1559 @itemx iso-8859-3-dos
1560 @itemx iso-8859-3-mac
1561 @itemx iso-8859-3-unix
1562
1563 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system
1564 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially
1565 invoked.
1566
1567 @item iso-8859-4
1568 @itemx iso-8859-4-dos
1569 @itemx iso-8859-4-mac
1570 @itemx iso-8859-4-unix
1571
1572 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system
1573 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially
1574 invoked.
1575
1576 @item iso-8859-5
1577 @itemx iso-8859-5-dos
1578 @itemx iso-8859-5-mac
1579 @itemx iso-8859-5-unix
1580
1581 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with
1582 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked.
1583
1584 @item iso-8859-7
1585 @itemx iso-8859-7-dos
1586 @itemx iso-8859-7-mac
1587 @itemx iso-8859-7-unix
1588
1589 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with
1590 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked.
1591
1592 @item iso-8859-8
1593 @itemx iso-8859-8-dos
1594 @itemx iso-8859-8-mac
1595 @itemx iso-8859-8-unix
1596
1597 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with
1598 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked.
1599
1600 @item iso-8859-9
1601 @itemx iso-8859-9-dos
1602 @itemx iso-8859-9-mac
1603 @itemx iso-8859-9-unix
1604
1605 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system
1606 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially
1607 invoked.
1608
1609 @item koi8-r
1610 @itemx koi8-r-dos
1611 @itemx koi8-r-mac
1612 @itemx koi8-r-unix
1613
1614 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for
1615 KOI8-R, an encoding of the Cyrillic alphabet.
1616
1617 @item shift_jis
1618 @itemx shift_jis-dos
1619 @itemx shift_jis-mac
1620 @itemx shift_jis-unix
1621
1622 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system
1623 implementing the Shift-JIS encoding for Japanese. The underscore is to
1624 conform to the MIME charset implementing this encoding.
1625
1626 @item tis-620
1627 @itemx tis-620-dos
1628 @itemx tis-620-mac
1629 @itemx tis-620-unix
1630
1631 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The
1632 external encoding is defined by TIS620, the internal encoding is
1633 peculiar to MULE, and called @code{thai-xtis}.
1634
1635 @item viqr
1636
1637 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding
1638 system with Unix EOL convention (i.e., no conversion) using
1639 post-read-decode and pre-write-encode functions to translate the VIQR
1640 coding system for Vietnamese.
1641
1642 @item viscii
1643 @itemx viscii-dos
1644 @itemx viscii-mac
1645 @itemx viscii-unix
1646
1647 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used
1648 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
1649 given priority by XEmacs.
1650
1651 @item vscii
1652 @itemx vscii-dos
1653 @itemx vscii-mac
1654 @itemx vscii-unix
1655
1656 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used
1657 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
1658 given priority by XEmacs. Use
1659 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII.
1660
1661 @end table
1662
1663 @node CCL, Category Tables, Coding Systems, MULE 1096 @node CCL, Category Tables, Coding Systems, MULE
1664 @section CCL 1097 @section CCL
1665 1098
1666 CCL (Code Conversion Language) is a simple structured programming 1099 CCL (Code Conversion Language) is a simple structured programming
1667 language designed for character coding conversions. A CCL program is 1100 language designed for character coding conversions. A CCL program is
1668 compiled to CCL code (represented by a vector of integers) and executed 1101 compiled to CCL code (represented by a vector of integers) and executed
1669 by the CCL interpreter embedded in Emacs. The CCL interpreter 1102 by the CCL interpreter embedded in Emacs. The CCL interpreter
1670 implements a virtual machine with 8 registers called @code{r0}, ..., 1103 implements a virtual machine with 8 registers called @code{r0}, ...,
1671 @code{r7}, a number of control structures, and some I/O operators. Take 1104 @code{r7}, a number of control structures, and some I/O operators. Take
1672 care when using registers @code{r0} (used in implicit @dfn{set} 1105 care when using registers @code{r0} (used in implicit @dfn{set}
1673 statements) and especially @code{r7} (used internally by several 1106 statements) and especially @code{r7} (used internally by several
1674 statements and operations, especially for multiple return values and I/O 1107 statements and operations, especially for multiple return values and I/O
1675 operations). 1108 operations).
1676 1109
1677 CCL is used for code conversion during process I/O and file I/O for 1110 CCL is used for code conversion during process I/O and file I/O for
1678 non-ISO2022 coding systems. (It is the only way for a user to specify a 1111 non-ISO2022 coding systems. (It is the only way for a user to specify a
1679 code conversion function.) It is also used for calculating the code 1112 code conversion function.) It is also used for calculating the code
1680 point of an X11 font from a character code. However, since CCL is 1113 point of an X11 font from a character code. However, since CCL is
1681 designed as a powerful programming language, it can be used for more 1114 designed as a powerful programming language, it can be used for more
1682 generic calculation where efficiency is demanded. A combination of 1115 generic calculation where efficiency is demanded. A combination of
1683 three or more arithmetic operations can be calculated faster by CCL than 1116 three or more arithmetic operations can be calculated faster by CCL than
1684 by Emacs Lisp. 1117 by Emacs Lisp.
1685 1118
1686 @strong{Warning:} The code in @file{src/mule-ccl.c} and 1119 @strong{Warning:} The code in @file{src/mule-ccl.c} and
1687 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive 1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1688 description of CCL's semantics. The previous version of this section 1121 description of CCL's semantics. The previous version of this section
1689 contained several typos and obsolete names left from earlier versions of 1122 contained several typos and obsolete names left from earlier versions of
1690 MULE, and many may remain. (I am not an experienced CCL programmer; the 1123 MULE, and many may remain. (I am not an experienced CCL programmer; the
1691 few who know CCL well find writing English painful.) 1124 few who know CCL well find writing English painful.)
1692 1125
1693 A CCL program transforms an input data stream into an output data 1126 A CCL program transforms an input data stream into an output data
1694 stream. The input stream, held in a buffer of constant bytes, is left 1127 stream. The input stream, held in a buffer of constant bytes, is left
1695 unchanged. The buffer may be filled by an external input operation, 1128 unchanged. The buffer may be filled by an external input operation,
1696 taken from an Emacs buffer, or taken from a Lisp string. The output 1129 taken from an Emacs buffer, or taken from a Lisp string. The output
1697 buffer is a dynamic array of bytes, which can be written by an external 1130 buffer is a dynamic array of bytes, which can be written by an external
1698 output operation, inserted into an Emacs buffer, or returned as a Lisp 1131 output operation, inserted into an Emacs buffer, or returned as a Lisp
1699 string. 1132 string.
1700 1133
1701 A CCL program is a (Lisp) list containing two or three members. The 1134 A CCL program is a (Lisp) list containing two or three members. The
1702 first member is the @dfn{buffer magnification}, which indicates the 1135 first member is the @dfn{buffer magnification}, which indicates the
1703 required minimum size of the output buffer as a multiple of the input 1136 required minimum size of the output buffer as a multiple of the input
1704 buffer. It is followed by the @dfn{main block} which executes while 1137 buffer. It is followed by the @dfn{main block} which executes while
1705 there is input remaining, and an optional @dfn{EOF block} which is 1138 there is input remaining, and an optional @dfn{EOF block} which is
1706 executed when the input is exhausted. Both the main block and the EOF 1139 executed when the input is exhausted. Both the main block and the EOF
1707 block are CCL blocks. 1140 block are CCL blocks.
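
As an illustration of this structure, here is a minimal sketch (written
for this section, not taken from the XEmacs sources) of a CCL program
with a buffer magnification of 1, a main block that copies each input
byte to the output, and no EOF block:

@example
(1                   ; buffer magnification
 ((loop              ; main block: a list holding one statement
   (read r0)         ; read one input byte into r0
   (write r0)        ; write it to the output
   (repeat))))       ; jump back to the head of the loop
@end example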
1708 1141
1709 A @dfn{CCL block} is either a CCL statement or a list of CCL statements. 1142 A @dfn{CCL block} is either a CCL statement or a list of CCL statements.
1710 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer 1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1711 or an @dfn{assignment}, which is a list of a register to receive the 1144 or an @dfn{assignment}, which is a list of a register to receive the
1712 assignment, an assignment operator, and an expression) or a @dfn{control 1145 assignment, an assignment operator, and an expression) or a @dfn{control
1713 statement} (a list starting with a keyword, whose allowable syntax 1146 statement} (a list starting with a keyword, whose allowable syntax
1714 depends on the keyword). 1147 depends on the keyword).
1719 * CCL Expressions:: Operators and expressions in CCL. 1152 * CCL Expressions:: Operators and expressions in CCL.
1720 * Calling CCL:: Running CCL programs. 1153 * Calling CCL:: Running CCL programs.
1721 * CCL Examples:: The encoding functions for Big5 and KOI-8. 1154 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1722 @end menu 1155 @end menu
1723 1156
1724 @node CCL Syntax, CCL Statements, , CCL 1157 @node CCL Syntax, CCL Statements, CCL, CCL
1725 @comment Node, Next, Previous, Up 1158 @comment Node, Next, Previous, Up
1726 @subsection CCL Syntax 1159 @subsection CCL Syntax
1727 1160
1728 The full syntax of a CCL program in BNF notation: 1161 The full syntax of a CCL program in BNF notation:
1729 1162
1730 @format 1163 @format
1731 CCL_PROGRAM := 1164 CCL_PROGRAM :=
1732 (BUFFER_MAGNIFICATION 1165 (BUFFER_MAGNIFICATION
1733 CCL_MAIN_BLOCK 1166 CCL_MAIN_BLOCK
1782 1215
1783 @node CCL Statements, CCL Expressions, CCL Syntax, CCL 1216 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
1784 @comment Node, Next, Previous, Up 1217 @comment Node, Next, Previous, Up
1785 @subsection CCL Statements 1218 @subsection CCL Statements
1786 1219
1787 The Emacs Code Conversion Language provides the following statement 1220 The Emacs Code Conversion Language provides the following statement
1788 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, 1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1789 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}. 1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1790 1223
1791 @heading Set statement: 1224 @heading Set statement:
1792 1225
1793 The @dfn{set} statement has three variants with the syntaxes 1226 The @dfn{set} statement has three variants with the syntaxes
1794 @samp{(@var{reg} = @var{expression})}, 1227 @samp{(@var{reg} = @var{expression})},
1795 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and 1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1796 @samp{@var{integer}}. The assignment operator variation of the 1229 @samp{@var{integer}}. The assignment operator variation of the
1797 @dfn{set} statement works the same way as the corresponding C expression 1230 @dfn{set} statement works the same way as the corresponding C expression
1798 statement does. The assignment operators are @code{+=}, @code{-=}, 1231 statement does. The assignment operators are @code{+=}, @code{-=},
1801 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of 1234 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
1802 the form @code{(r0 = @var{integer})}. 1235 the form @code{(r0 = @var{integer})}.
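
The three variants might be written as follows. This is only a sketch:
the register choices and constants are arbitrary, and the nested form of
the second statement assumes the expression syntax described under CCL
Expressions.

@example
(r0 = 65)          ; plain assignment of an integer
(r1 = (r0 + 1))    ; assignment of an expression's value
(r1 += 2)          ; assignment operator, as in C
65                 ; naked integer, equivalent to (r0 = 65)
@end example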
1803 1236
1804 @heading I/O statements: 1237 @heading I/O statements:
1805 1238
1806 The @dfn{read} statement takes one or more registers as arguments. It 1239 The @dfn{read} statement takes one or more registers as arguments. It
1807 reads one byte (a C char) from the input into each register in turn. 1240 reads one byte (a C char) from the input into each register in turn.
1808 1241
1809 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg} 1242 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg}
1810 ...)} it takes one or more registers as arguments and writes each in 1243 ...)} it takes one or more registers as arguments and writes each in
1811 turn to the output. The integer in a register (interpreted as an 1244 turn to the output. The integer in a register (interpreted as an
1812 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the 1245 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the
1813 current output buffer. If it is less than 256, it is written as is. 1246 current output buffer. If it is less than 256, it is written as is.
1814 The forms @samp{(write @var{expression})} and @samp{(write 1247 The forms @samp{(write @var{expression})} and @samp{(write
1818 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes 1251 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
1819 the @var{reg}th element of the @var{array} to the output. 1252 the @var{reg}th element of the @var{array} to the output.
1820 1253
1821 @heading Conditional statements: 1254 @heading Conditional statements:
1822 1255
1823 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and 1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1824 an optional @var{second CCL block} as arguments. If the 1257 an optional @var{second CCL block} as arguments. If the
1825 @var{expression} evaluates to non-zero, the first @var{CCL block} is 1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1826 executed. Otherwise, if there is a @var{second CCL block}, it is 1259 executed. Otherwise, if there is a @var{second CCL block}, it is
1827 executed. 1260 executed.
1828 1261
1829 The @dfn{read-if} variant of the @dfn{if} statement takes an 1262 The @dfn{read-if} variant of the @dfn{if} statement takes an
1830 @var{expression}, a @var{CCL block}, and an optional @var{second CCL 1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1831 block} as arguments. The @var{expression} must have the form 1264 block} as arguments. The @var{expression} must have the form
1832 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is 1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1833 a register or an integer). The @code{read-if} statement first reads 1266 a register or an integer). The @code{read-if} statement first reads
1834 from the input into the first register operand in the @var{expression}, 1267 from the input into the first register operand in the @var{expression},
1835 then conditionally executes a CCL block just as the @code{if} statement 1268 then conditionally executes a CCL block just as the @code{if} statement
1836 does. 1269 does.
1837 1270
1838 The @dfn{branch} statement takes an @var{expression} and one or more CCL 1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1839 blocks as arguments. The CCL blocks are treated as a zero-indexed 1272 blocks as arguments. The CCL blocks are treated as a zero-indexed
1840 array, and the @code{branch} statement uses the @var{expression} as the 1273 array, and the @code{branch} statement uses the @var{expression} as the
1841 index of the CCL block to execute. Null CCL blocks may be used as 1274 index of the CCL block to execute. Null CCL blocks may be used as
1842 no-ops, continuing execution with the statement following the 1275 no-ops, continuing execution with the statement following the
1843 @code{branch} statement in the containing CCL block. Out-of-range 1276 @code{branch} statement in the containing CCL block. Out-of-range
1844 values for the @var{expression} are also treated as no-ops. 1277 values for the @var{expression} are also treated as no-ops.
1845 1278
1846 The @dfn{read-branch} variant of the @dfn{branch} statement takes a 1279 The @dfn{read-branch} variant of the @dfn{branch} statement takes a
1847 @var{register} and one or more CCL blocks as arguments. The 1280 @var{register} and one or more CCL blocks as arguments. The
1848 @code{read-branch} statement first reads from the input into the 1281 @code{read-branch} statement first reads from the input into the
1849 @var{register}, then uses the value read as the index of the CCL block 1282 @var{register}, then uses the value read as the index of the CCL block
1850 to execute, just as the @code{branch} statement does. 1283 to execute, just as the @code{branch} statement does.
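
A hedged sketch of the @code{if} and @code{branch} statements follows.
The registers and constants are arbitrary, and writing the null block as
@code{nil} is an assumption based on the description above; check
@file{mule-ccl.el} before relying on it.

@example
(if (r0 == 27)       ; non-zero result selects the first block
    (r1 = 1)
  (r1 = 0))          ; optional second block

(branch r0
  (r1 = 10)          ; executed when r0 is 0
  nil                ; r0 is 1: null block, a no-op
  (r1 = 30))         ; executed when r0 is 2
                     ; any other value of r0 is also a no-op
@end example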
1851 1284
1852 @heading Loop control statements: 1285 @heading Loop control statements:
1853 1286
1854 The @dfn{loop} statement creates a block with an implied jump from the 1287 The @dfn{loop} statement creates a block with an implied jump from the
1855 end of the block back to its head. The loop is exited on a @code{break} 1288 end of the block back to its head. The loop is exited on a @code{break}
1856 statement, and continued without executing the tail by a @code{repeat} 1289 statement, and continued without executing the tail by a @code{repeat}
1857 statement. 1290 statement.
1858 1291
1859 The @dfn{break} statement, written @samp{(break)}, terminates the 1292 The @dfn{break} statement, written @samp{(break)}, terminates the
1860 current loop and continues with the next statement in the current 1293 current loop and continues with the next statement in the current
1861 block. 1294 block.
1862 1295
1863 The @dfn{repeat} statement has three variants, @code{repeat}, 1296 The @dfn{repeat} statement has three variants, @code{repeat},
1864 @code{write-repeat}, and @code{write-read-repeat}. Each continues the 1297 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
1865 current loop from its head, possibly after performing I/O. 1298 current loop from its head, possibly after performing I/O.
1866 @code{repeat} takes no arguments and does no I/O before jumping. 1299 @code{repeat} takes no arguments and does no I/O before jumping.
1867 @code{write-repeat} takes a single argument (a register, an 1300 @code{write-repeat} takes a single argument (a register, an
1868 integer, or a string), writes it to the output, then jumps. 1301 integer, or a string), writes it to the output, then jumps.
1874 @code{write} and @code{read} statements for the semantics of the I/O 1307 @code{write} and @code{read} statements for the semantics of the I/O
1875 operations for each type of argument. 1308 operations for each type of argument.
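
The loop control statements might be combined as in the following
sketch of a main block, which copies input bytes to the output until it
reads a linefeed (10). The stopping condition is only an illustration.

@example
((loop
  (read r0)
  (if (r0 == 10)     ; linefeed read:
      (break))       ;   leave the loop
  (write r0)
  (repeat)))         ; otherwise continue from the head
@end example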
1876 1309
1877 @heading Other control statements: 1310 @heading Other control statements:
1878 1311
1879 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, 1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1880 executes a CCL program as a subroutine. It does not return a value to 1313 executes a CCL program as a subroutine. It does not return a value to
1881 the caller, but can modify the register status. 1314 the caller, but can modify the register status.
1882 1315
1883 The @dfn{end} statement, written @samp{(end)}, terminates the CCL 1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1884 program successfully, and returns to the caller (which may be a CCL 1317 program successfully, and returns to the caller (which may be a CCL
1885 program). It does not alter the status of the registers. 1318 program). It does not alter the status of the registers.
1886 1319
1887 @node CCL Expressions, Calling CCL, CCL Statements, CCL 1320 @node CCL Expressions, Calling CCL, CCL Statements, CCL
1888 @comment Node, Next, Previous, Up 1321 @comment Node, Next, Previous, Up
1889 @subsection CCL Expressions 1322 @subsection CCL Expressions
1890 1323
1891 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions 1324 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
1892 consist of a single @var{operand}, either a register (one of @code{r0}, 1325 consist of a single @var{operand}, either a register (one of @code{r0},
1893 ..., @code{r7}) or an integer. Complex expressions are lists of the 1326 ..., @code{r7}) or an integer. Complex expressions are lists of the
1894 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike 1327 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
1895 C, assignments are not expressions. 1328 C, assignments are not expressions.
1896 1329
1897 In the following table, @var{X} is the target register for a @dfn{set}. 1330 In the following table, @var{X} is the target register for a @dfn{set}.
1898 In subexpressions, this is implicitly @code{r7}. This means that 1331 In subexpressions, this is implicitly @code{r7}. This means that
1899 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used 1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1900 freely in subexpressions, since they return parts of their values in 1333 freely in subexpressions, since they return parts of their values in
1901 @code{r7}. @var{Y} may be an expression, register, or integer, while 1334 @code{r7}. @var{Y} may be an expression, register, or integer, while
1902 @var{Z} must be a register or an integer. 1335 @var{Z} must be a register or an integer.
1926 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) 1359 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1927 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) 1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1928 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) 1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1929 @end multitable 1362 @end multitable
1930 1363
1931 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, 1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
1932 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS 1365 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
1933 and CCL_DECODE_SJIS treat their first and second operands as the high and 1366 and CCL_DECODE_SJIS treat their first and second operands as the high and
1934 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an 1367 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an
1935 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a 1368 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1936 complicated transformation of the Japanese standard JIS encoding to 1369 complicated transformation of the Japanese standard JIS encoding to
1937 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to 1370 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1938 represent the SJIS operations in infix form. 1371 represent the SJIS operations in infix form.
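
As a sketch of how infix expressions nest, the statements below combine
two values into one integer and split it again. The constant 94 and the
register choices are arbitrary. The second statement assumes that
@code{//} leaves the quotient in the target register and the remainder
in @code{r7}; consult @file{src/mule-ccl.c} to confirm.

@example
(r2 = ((r0 * 94) + r1))  ; left operand is a subexpression,
                         ; right operand a register
(r0 = (r2 // 94))        ; quotient in r0, remainder in r7
(r1 = r7)                ; save the remainder before r7 is reused
@end example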
1939 1372
1940 @node Calling CCL, CCL Examples, CCL Expressions, CCL 1373 @node Calling CCL, CCL Examples, CCL Expressions, CCL
1941 @comment Node, Next, Previous, Up 1374 @comment Node, Next, Previous, Up
1942 @subsection Calling CCL 1375 @subsection Calling CCL
1943 1376
1944 CCL programs are called automatically during Emacs buffer I/O when the 1377 CCL programs are called automatically during Emacs buffer I/O when the
1945 external representation has a coding system type of @code{shift-jis}, 1378 external representation has a coding system type of @code{shift-jis},
1946 @code{big5}, or @code{ccl}. The program is specified by the coding 1379 @code{big5}, or @code{ccl}. The program is specified by the coding
1947 system (@pxref{Coding Systems}). You can also call CCL programs from 1380 system (@pxref{Coding Systems}). You can also call CCL programs from
1948 other CCL programs, and from Lisp using these functions: 1381 other CCL programs, and from Lisp using these functions:
1949 1382
1976 of the program. When the program is done, @var{status} is modified (by 1409 of the program. When the program is done, @var{status} is modified (by
1977 side-effect) to contain the ending values for the corresponding 1410 side-effect) to contain the ending values for the corresponding
1978 registers and IC. Returns the resulting string. 1411 registers and IC. Returns the resulting string.
1979 @end defun 1412 @end defun
1980 1413
1981 To call a CCL program from another CCL program, it must first be 1414 To call a CCL program from another CCL program, it must first be
1982 registered: 1415 registered:
1983 1416
1984 @defun register-ccl-program name ccl-program 1417 @defun register-ccl-program name ccl-program
1985 Register @var{name} for CCL program @var{ccl-program} in 1418 Register @var{name} for CCL program @var{ccl-program} in
1986 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of 1419 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of
1987 a CCL program, or @code{nil}. Returns the index number of the registered CCL 1420 a CCL program, or @code{nil}. Returns the index number of the registered CCL
1988 program. 1421 program.
1989 @end defun 1422 @end defun
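
The functions above might be used together as in the following sketch.
The name @code{my-ccl-copy} is hypothetical. The sketch also assumes
that @code{ccl-compile} from @file{mule-ccl.el} produces the compiled
form, and that @var{status} is a vector of nine elements (eight
registers plus IC) in which @code{nil} means ``use the default''.

@example
(let ((compiled (ccl-compile
                 '(1 ((loop (read r0) (write r0) (repeat))))))
      (status (make-vector 9 nil)))
  ;; Make the program callable from other CCL programs.
  (register-ccl-program 'my-ccl-copy compiled)
  ;; Run it on a Lisp string; STATUS is modified by side effect.
  (ccl-execute-on-string compiled status "abc"))
@end example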
1990 1423
1991 Information about the processor time used by the CCL interpreter can be 1424 Information about the processor time used by the CCL interpreter can be
1992 obtained using these functions: 1425 obtained using these functions:
1993 1426
1994 @defun ccl-elapsed-time 1427 @defun ccl-elapsed-time
1995 Returns the elapsed processor time of the CCL interpreter as a cons of 1428 Returns the elapsed processor time of the CCL interpreter as a cons of
1996 user and system time, as 1429 user and system time, as
2001 1434
2002 @defun ccl-reset-elapsed-time 1435 @defun ccl-reset-elapsed-time
2003 Resets the CCL interpreter's internal elapsed time registers. 1436 Resets the CCL interpreter's internal elapsed time registers.
2004 @end defun 1437 @end defun
2005 1438
2006 @node CCL Examples, , Calling CCL, CCL 1439 @node CCL Examples, , Calling CCL, CCL
2007 @comment Node, Next, Previous, Up 1440 @comment Node, Next, Previous, Up
2008 @subsection CCL Examples 1441 @subsection CCL Examples
2009 1442
2010 This section is not yet written. 1443 This section is not yet written.
2011 1444
2012 @node Category Tables, , CCL, MULE 1445 @node Category Tables, , CCL, MULE
2013 @section Category Tables 1446 @section Category Tables
2014 1447
2015 A category table is a type of char table used for keeping track of 1448 A category table is a type of char table used for keeping track of
2016 categories. Categories are used for classifying characters for use in 1449 categories. Categories are used for classifying characters for use in
2017 regexps---you can refer to a category rather than having to use a 1450 regexps -- you can refer to a category rather than having to use a
2018 complicated [] expression (and category lookups are significantly 1451 complicated [] expression (and category lookups are significantly
2019 faster). 1452 faster).
2020 1453
2021 There are 95 different categories available, one for each printable 1454 There are 95 different categories available, one for each printable
2022 character (including space) in the ASCII charset. Each category is 1455 character (including space) in the ASCII charset. Each category is