comparison man/lispref/mule.texi @ 412:697ef44129c6 r21-2-14
Import from CVS: tag r21-2-14
author | cvs |
---|---|
date | Mon, 13 Aug 2007 11:20:41 +0200 |
parents | de805c49cfc1 |
children |
comparison
411:12e008d41344 | 412:697ef44129c6 |
---|---|
4 @c See the file lispref.texi for copying conditions. | 4 @c See the file lispref.texi for copying conditions. |
5 @setfilename ../../info/internationalization.info | 5 @setfilename ../../info/internationalization.info |
6 @node MULE, Tips, Internationalization, top | 6 @node MULE, Tips, Internationalization, top |
7 @chapter MULE | 7 @chapter MULE |
8 | 8 |
9 @dfn{MULE} is the name originally given to the version of GNU Emacs | 9 @dfn{MULE} is the name originally given to the version of GNU Emacs |
10 extended for multi-lingual (and in particular Asian-language) support. | 10 extended for multi-lingual (and in particular Asian-language) support. |
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and | 11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called |
12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the | 12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for |
13 Japanese word for ``Japan''), which only provided support for Japanese. | 13 ``Japan''), when it only provided support for Japanese. XEmacs |
14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since | 14 refers to its multi-lingual support as @dfn{MULE support} since it |
15 it is based on @dfn{MULE}. | 15 is based on @dfn{MULE}. |
16 | 16 |
17 @menu | 17 @menu |
18 * Internationalization Terminology:: | 18 * Internationalization Terminology:: |
19 Definition of various internationalization terms. | 19 Definition of various internationalization terms. |
20 * Charsets:: Sets of related characters. | 20 * Charsets:: Sets of related characters. |
21 * MULE Characters:: Working with characters in XEmacs/MULE. | 21 * MULE Characters:: Working with characters in XEmacs/MULE. |
22 * Composite Characters:: Making new characters by overstriking other ones. | 22 * Composite Characters:: Making new characters by overstriking other ones. |
23 * ISO 2022:: An international standard for charsets and encodings. | |
23 * Coding Systems:: Ways of representing a string of chars using integers. | 24 * Coding Systems:: Ways of representing a string of chars using integers. |
24 * CCL:: A special language for writing fast converters. | 25 * CCL:: A special language for writing fast converters. |
25 * Category Tables:: Subdividing charsets into groups. | 26 * Category Tables:: Subdividing charsets into groups. |
26 @end menu | 27 @end menu |
27 | 28 |
28 @node Internationalization Terminology, Charsets, , MULE | 29 @node Internationalization Terminology |
29 @section Internationalization Terminology | 30 @section Internationalization Terminology |
30 | 31 |
31 In internationalization terminology, a string of text is divided up | 32 In internationalization terminology, a string of text is divided up |
32 into @dfn{characters}, which are the printable units that make up the | 33 into @dfn{characters}, which are the printable units that make up the |
33 text. A single character is (for example) a capital @samp{A}, the | 34 text. A single character is (for example) a capital @samp{A}, the |
34 number @samp{2}, a Katakana character, a Hangul character, a Kanji | 35 number @samp{2}, a Katakana character, a Kanji ideograph (an |
35 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is | 36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese |
36 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there | 37 Kanji, Chinese Hanzi, and Korean Hangul; typically there are thousands |
37 are thousands of such ideographs in each language), etc. The basic | 38 of such ideographs in each language), etc. The basic property of a |
38 property of a character is that it is the smallest unit of text with | 39 character is its shape. Note that the same character may be drawn by |
39 semantic significance in text processing. | 40 two different people (or in two different fonts) in slightly different |
40 | 41 ways, although the basic shape will be the same. |
41 Human beings normally process text visually, so to a first approximation | |
42 a character may be identified with its shape. Note that the same | |
43 character may be drawn by two different people (or in two different | |
44 fonts) in slightly different ways, although the "basic shape" will be the | |
45 same. But consider the works of Scott Kim; human beings can recognize | |
46 hugely variant shapes as the "same" character. Sometimes, especially | |
47 where characters are extremely complicated to write, completely | |
48 different shapes may be defined as the "same" character in national | |
49 standards. The Taiwanese variant of Hanzi is generally the most | |
50 complicated; over the centuries, the Japanese, Koreans, and the People's | |
51 Republic of China have adopted simplifications of the shape, but the | |
52 line of descent from the original shape is recorded, and the meanings | |
53 and pronunciation of different forms of the same character are | |
54 considered to be identical within each language. (Of course, it may | |
55 take a specialist to recognize the related form; the point is that the | |
56 relations are standardized, despite the differing shapes.) | |
57 | 42 |
58 In some cases, the differences will be significant enough that it is | 43 In some cases, the differences will be significant enough that it is |
59 actually possible to identify two or more distinct shapes that both | 44 actually possible to identify two or more distinct shapes that both |
60 represent the same character. For example, the lowercase letters | 45 represent the same character. For example, the lowercase letters |
61 @samp{a} and @samp{g} each have two distinct possible shapes---the | 46 @samp{a} and @samp{g} each have two distinct possible shapes -- the |
62 @samp{a} can optionally have a curved tail projecting off the top, and | 47 @samp{a} can optionally have a curved tail projecting off the top, and |
63 the @samp{g} can be formed either of two loops, or of one loop and a | 48 the @samp{g} can be formed either of two loops, or of one loop and a |
64 tail hanging off the bottom. Such distinct possible shapes of a | 49 tail hanging off the bottom. Such distinct possible shapes of a |
65 character are called @dfn{glyphs}. The important characteristic of two | 50 character are called @dfn{glyphs}. The important characteristic of two |
66 glyphs making up the same character is that the choice between one or | 51 glyphs making up the same character is that the choice between one or |
67 the other is purely stylistic and has no linguistic effect on a word | 52 the other is purely stylistic and has no linguistic effect on a word |
68 (this is the reason why a capital @samp{A} and lowercase @samp{a} | 53 (this is the reason why a capital @samp{A} and lowercase @samp{a} |
69 are different characters rather than different glyphs---e.g. | 54 are different characters rather than different glyphs -- e.g. |
70 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). | 55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). |
71 | 56 |
72 Note that @dfn{character} and @dfn{glyph} are used differently | 57 Note that @dfn{character} and @dfn{glyph} are used differently |
73 here than elsewhere in XEmacs. | 58 here than elsewhere in XEmacs. |
74 | 59 |
75 A @dfn{character set} is essentially a set of related characters. ASCII, | 60 A @dfn{character set} is simply a set of related characters. ASCII, |
76 for example, is a set of 94 characters (or 128, if you count | 61 for example, is a set of 94 characters (or 128, if you count |
77 non-printing characters). Other character sets are ISO8859-1 (ASCII | 62 non-printing characters). Other character sets are ISO8859-1 (ASCII |
78 plus various accented characters and other international symbols), | 63 plus various accented characters and other international symbols), |
79 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 | 64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208 |
80 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), | 65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji), |
81 GB2312 (Mainland Chinese Hanzi), etc. | 66 GB2312 (Mainland Chinese Hanzi), etc. |
82 | 67 |
83 The definition of a character set will implicitly or explicitly give | 68 Every character set has one or more @dfn{orderings}, which can be |
84 it an @dfn{ordering}, a way of assigning a number to each character in | 69 viewed as a way of assigning a number (or set of numbers) to each |
85 the set. For many character sets, there is a natural ordering, for | 70 character in the set. For most character sets, there is a standard |
86 example the ``ABC'' ordering of the Roman letters. But it is not clear | 71 ordering, and in fact all of the character sets mentioned above define a |
87 whether digits should come before or after the letters, and in fact | 72 particular ordering. ASCII, for example, places letters in their |
88 different European languages treat the ordering of accented characters | 73 ``natural'' order, puts uppercase letters before lowercase letters, |
89 differently. It is useful to use the natural order where available, of | 74 numbers before letters, etc. Note that for many of the Asian character |
90 course. The number assigned to any particular character is called the | 75 sets, there is no natural ordering of the characters. The actual |
91 character's @dfn{code point}. (Within a given character set, each | 76 orderings are based on one or more salient characteristic, of which |
92 character has a unique code point. Thus the word "set" is ill-chosen; | 77 there are many to choose from -- e.g. number of strokes, common |
93 different orderings of the same characters are different character sets. | 78 radicals, phonetic ordering, etc. |
94 Identifying characters is simple enough for alphabetic character sets, | 79 |
95 but the difference in ordering can cause great headaches when the same | 80 The set of numbers assigned to any particular character are called |
96 thousands of characters are used by different cultures as in the Hanzi.) | 81 the character's @dfn{position codes}. The number of position codes |
97 | 82 required to index a particular character in a character set is called |
98 A code point may be broken into a number of @dfn{position codes}. The | 83 the @dfn{dimension} of the character set. ASCII, being a relatively |
99 number of position codes required to index a particular character in a | 84 small character set, is of dimension one, and each character in the |
100 character set is called the @dfn{dimension} of the character set. For | 85 set is indexed using a single position code, in the range 0 through |
101 practical purposes, a position code may be thought of as a byte-sized | 86 127 (if non-printing characters are included) or 33 through 126 |
102 index. The printing characters of ASCII, being a relatively small | 87 (if only the printing characters are considered). JISX0208, i.e. |
103 character set, is of dimension one, and each character in the set is | 88 Japanese Kanji, has thousands of characters, and is of dimension two -- |
104 indexed using a single position code, in the range 1 through 94. Use of | 89 every character is indexed by two position codes, each in the range |
105 this unusual range, rather than the familiar 33 through 126, is an | 90 33 through 126. (Note that the choice of the range here is somewhat |
106 intentional abstraction; to understand the programming issues you must | 91 arbitrary. Although a character set such as JISX0208 defines an |
107 break the equation between character sets and encodings. | 92 @emph{ordering} of all its characters, it does not define the actual |
108 | 93 mapping between numbers and characters. You could just as easily |
109 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is | 94 index the characters in JISX0208 using numbers in the range 0 through |
110 of dimension two -- every character is indexed by two position codes, | 95 93, 1 through 94, 2 through 95, etc. The reason for the actual range |
111 each in the range 1 through 94. (This number ``94'' is not a | 96 chosen is so that the position codes match up with the actual values |
112 coincidence; we shall see that the JIS position codes were chosen so | 97 used in the common encodings.) |
113 that JIS kanji could be encoded without using codes that in ASCII are | |
114 associated with device control functions.) Note that the choice of the | |
115 range here is somewhat arbitrary. You could just as easily index the | |
116 printing characters in ASCII using numbers in the range 0 through 93, 2 | |
117 through 95, 3 through 96, etc. In fact, the standardized | |
118 @emph{encoding} for the ASCII @emph{character set} uses the range 33 | |
119 through 126. | |
120 | 98 |
121 An @dfn{encoding} is a way of numerically representing characters from | 99 An @dfn{encoding} is a way of numerically representing characters from |
122 one or more character sets into a stream of like-sized numerical values | 100 one or more character sets into a stream of like-sized numerical values |
123 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit | 101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit |
124 quantities. If an encoding encompasses only one character set, then the | 102 quantities. If an encoding encompasses only one character set, then the |
125 position codes for the characters in that character set could be used | 103 position codes for the characters in that character set could be used |
126 directly. (This is the case with the trivial cipher used by children, | 104 directly. (This is the case with ASCII, and as a result, most people do |
127 assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII, | 105 not understand the difference between a character set and an encoding.) |
128 other considerations intrude. For example, why are the upper- and | 106 This is not possible, however, if more than one character set is to be |
129 lowercase alphabets separated by 8 characters? Why do the digits start | 107 used in the encoding. For example, printed Japanese text typically |
130 with `0' being assigned the code 48? In both cases because semantically | 108 requires characters from multiple character sets -- ASCII, JISX0208, and |
131 interesting operations (case conversion and numerical value extraction) | 109 JISX0212, to be specific. Each of these is indexed using one or more |
132 become convenient masking operations. Other artificial aspects (the | 110 position codes in the range 33 through 126, so the position codes could |
133 control characters being assigned to codes 0--31 and 127) are historical | 111 not be used directly or there would be no way to tell which character |
134 accidents. (The use of 127 for @samp{DEL} is an artifact of the "punch | 112 was meant. Different Japanese encodings handle this differently -- JIS |
135 once" nature of paper tape, for example.) | 113 uses special escape characters to denote different character sets; EUC |
136 | 114 sets the high bit of the position codes for JISX0208 and JISX0212, and |
137 Naive use of the position code is not possible, however, if more than | 115 puts a special extra byte before each JISX0212 character; etc. (JIS, |
138 one character set is to be used in the encoding. For example, printed | 116 EUC, and most of the other encodings you will encounter are 7-bit or |
139 Japanese text typically requires characters from multiple character sets | 117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; |
140 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | 118 this strives to represent all the world's characters in a single large |
141 indexed using one or more position codes in the range 1 through 94, so | 119 character set. 32-bit encodings are generally used internally in |
142 the position codes could not be used directly or there would be no way | 120 programs to simplify the code that manipulates them; however, they are |
143 to tell which character was meant. Different Japanese encodings handle | 121 not much used externally because they are not very space-efficient.) |
144 this differently -- JIS uses special escape characters to denote | |
145 different character sets; EUC sets the high bit of the position codes | |
146 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each | |
147 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings | |
148 you will encounter in files are 7-bit or 8-bit encodings. There is one | |
149 common 16-bit encoding, which is Unicode; this strives to represent all | |
150 the world's characters in a single large character set. 32-bit | |
151 encodings are often used internally in programs, such as XEmacs with | |
152 MULE support, to simplify the code that manipulates them; however, they | |
153 are not used externally because they are not very space-efficient.) | |
154 | |
155 A general method of handling text using multiple character sets | |
156 (whether for multilingual text, or simply text in an extremely | |
157 complicated single language like Japanese) is defined in the | |
158 international standard ISO 2022. ISO 2022 will be discussed in more | |
159 detail later (@pxref{ISO 2022}), but for now suffice it to say that text | |
160 needs control functions (at least spacing), and if escape sequences are | |
161 to be used, an escape sequence introducer. It was decided to make all | |
162 text streams compatible with ASCII in the sense that the codes 0--31 | |
163 (and 128-159) would always be control codes, never graphic characters, | |
164 and where defined by the character set the @samp{SPC} character would be | |
165 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are | |
166 94 code points remaining if 7 bits are used. This is the reason that | |
167 most character sets are defined using position codes in the range 1 | |
168 through 94. Then ISO 2022 compatible encodings are produced by shifting | |
169 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit | |
170 codes are available) into character codes 161 to 254. | |
171 | 122 |
172 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In | 123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In |
173 a @dfn{modal encoding}, there are multiple states that the encoding can | 124 a @dfn{modal encoding}, there are multiple states that the encoding can be in, |
174 be in, and the interpretation of the values in the stream depends on the | 125 and the interpretation of the values in the stream depends on the |
175 current global state of the encoding. Special values in the encoding, | 126 current global state of the encoding. Special values in the encoding, |
176 called @dfn{escape sequences}, are used to change the global state. | 127 called @dfn{escape sequences}, are used to change the global state. |
177 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} | 128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} |
178 indicate that, from then on, bytes are to be interpreted as position | 129 indicate that, from then on, bytes are to be interpreted as position |
179 codes for JIS X 0208, rather than as ASCII. This effect is cancelled | 130 codes for JISX0208, rather than as ASCII. This effect is cancelled |
180 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the | 131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the |
181 current state is to ASCII''. To switch to JIS X 0212, the escape | 132 current state is to ASCII''. To switch to JISX0212, the escape sequence |
182 sequence @samp{ESC $ ( D}. (Note that here, as is common, the escape | 133 @samp{ESC $ ( D}. (Note that here, as is common, the escape sequences do |
183 sequences do in fact begin with @samp{ESC}. This is not necessarily the | 134 in fact begin with @samp{ESC}. This is not necessarily the case, |
184 case, however. Some encodings use control characters called "locking | 135 however.) |
185 shifts" (effect persists until cancelled) to switch character sets.) | 136 |
186 | 137 A @dfn{non-modal encoding} has no global state that extends past the |
187 A @dfn{non-modal encoding} has no global state that extends past the | |
188 character currently being interpreted. EUC, for example, is a | 138 character currently being interpreted. EUC, for example, is a |
189 non-modal encoding. Characters in JIS X 0208 are encoded by setting | 139 non-modal encoding. Characters in JISX0208 are encoded by setting |
190 the high bit of the position codes, and characters in JIS X 0212 are | 140 the high bit of the position codes, and characters in JISX0212 are |
191 encoded by doing the same but also prefixing the character with the | 141 encoded by doing the same but also prefixing the character with the |
192 byte 0x8F. | 142 byte 0x8F. |
193 | 143 |
194 The advantage of a modal encoding is that it is generally more | 144 The advantage of a modal encoding is that it is generally more |
195 space-efficient, and is easily extendable because there are essentially | 145 space-efficient, and is easily extendable because there are essentially |
196 an arbitrary number of escape sequences that can be created. The | 146 an arbitrary number of escape sequences that can be created. The |
197 disadvantage, however, is that it is much more difficult to work with | 147 disadvantage, however, is that it is much more difficult to work with |
198 if it is not being processed in a sequential manner. In the non-modal | 148 if it is not being processed in a sequential manner. In the non-modal |
199 EUC encoding, for example, the byte 0x41 always refers to the letter | 149 EUC encoding, for example, the byte 0x41 always refers to the letter |
200 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or | 150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or |
201 one of the two position codes in a JIS X 0208 character, or one of the | 151 one of the two position codes in a JISX0208 character, or one of the |
202 two position codes in a JIS X 0212 character. Determining exactly which | 152 two position codes in a JISX0212 character. Determining exactly which |
203 one is meant could be difficult and time-consuming if the previous | 153 one is meant could be difficult and time-consuming if the previous |
204 bytes in the string have not already been processed, or impossible if | 154 bytes in the string have not already been processed. |
205 they are drawn from an external stream that cannot be rewound. | |
206 | 155 |
207 Non-modal encodings are further divided into @dfn{fixed-width} and | 156 Non-modal encodings are further divided into @dfn{fixed-width} and |
208 @dfn{variable-width} formats. A fixed-width encoding always uses | 157 @dfn{variable-width} formats. A fixed-width encoding always uses |
209 the same number of words per character, whereas a variable-width | 158 the same number of words per character, whereas a variable-width |
210 encoding does not. EUC is a good example of a variable-width | 159 encoding does not. EUC is a good example of a variable-width |
212 the character set. 16-bit and 32-bit encodings are nearly always | 161 the character set. 16-bit and 32-bit encodings are nearly always |
213 fixed-width, and this is in fact one of the main reasons for using | 162 fixed-width, and this is in fact one of the main reasons for using |
214 an encoding with a larger word size. The advantages of fixed-width | 163 an encoding with a larger word size. The advantages of fixed-width |
215 encodings should be obvious. The advantages of variable-width | 164 encodings should be obvious. The advantages of variable-width |
216 encodings are that they are generally more space-efficient and allow | 165 encodings are that they are generally more space-efficient and allow |
217 for compatibility with existing 8-bit encodings such as ASCII. (For | 166 for compatibility with existing 8-bit encodings such as ASCII. |
218 example, in Unicode ASCII characters are simply promoted to a 16-bit | 167 |
219 representation. That means that every ASCII character contains a | 168 Note that the bytes in an 8-bit encoding are often referred to |
220 @samp{NUL} byte; evidently all of the standard string manipulation | 169 as @dfn{octets} rather than simply as bytes. This terminology |
221 functions will lose badly in a fixed-width Unicode environment.) | 170 dates back to the days before 8-bit bytes were universal, when |
222 | 171 some computers had 9-bit bytes, others had 10-bit bytes, etc. |
223 The bytes in an 8-bit encoding are often referred to as @dfn{octets} | 172 |
224 rather than simply as bytes. This terminology dates back to the days | 173 @node Charsets |
225 before 8-bit bytes were universal, when some computers had 9-bit bytes, | |
226 others had 10-bit bytes, etc. | |
227 | |
228 @node Charsets, MULE Characters, Internationalization Terminology, MULE | |
229 @section Charsets | 174 @section Charsets |
230 | 175 |
231 A @dfn{charset} in MULE is an object that encapsulates a | 176 A @dfn{charset} in MULE is an object that encapsulates a |
232 particular character set as well as an ordering of those characters. | 177 particular character set as well as an ordering of those characters. |
233 Charsets are permanent objects and are named using symbols, like | 178 Charsets are permanent objects and are named using symbols, like |
242 * Basic Charset Functions:: Functions for working with charsets. | 187 * Basic Charset Functions:: Functions for working with charsets. |
243 * Charset Property Functions:: Functions for accessing charset properties. | 188 * Charset Property Functions:: Functions for accessing charset properties. |
244 * Predefined Charsets:: Predefined charset objects. | 189 * Predefined Charsets:: Predefined charset objects. |
245 @end menu | 190 @end menu |
246 | 191 |
247 @node Charset Properties, Basic Charset Functions, , Charsets | 192 @node Charset Properties |
248 @subsection Charset Properties | 193 @subsection Charset Properties |
249 | 194 |
250 Charsets have the following properties: | 195 Charsets have the following properties: |
251 | 196 |
252 @table @code | 197 @table @code |
314 property. If a CCL program is defined, the position codes of a | 259 property. If a CCL program is defined, the position codes of a |
315 character will first be processed according to @code{graphic} and | 260 character will first be processed according to @code{graphic} and |
316 then passed through the CCL program, with the resulting values used | 261 then passed through the CCL program, with the resulting values used |
317 to index the font. | 262 to index the font. |
318 | 263 |
319 This is used, for example, in the Big5 character set (used in Taiwan). | 264 This is used, for example, in the Big5 character set (used in Taiwan). |
320 This character set is not ISO-2022-compliant, and its size (94x157) does | 265 This character set is not ISO-2022-compliant, and its size (94x157) does |
321 not fit within the maximum 96x96 size of ISO-2022-compliant character | 266 not fit within the maximum 96x96 size of ISO-2022-compliant character |
322 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, | 267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, |
323 so as to group the most commonly used characters together) into two | 268 so as to group the most commonly used characters together) into two |
324 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, | 269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, |
325 and each charset object uses a CCL program to convert the modified | 270 and each charset object uses a CCL program to convert the modified |
326 position codes back into standard Big5 indices to retrieve a character | 271 position codes back into standard Big5 indices to retrieve a character |
327 from a Big5 font. | 272 from a Big5 font. |
328 @end table | 273 @end table |
329 | 274 |
330 Most of the above properties can only be set when the charset is | 275 Most of the above properties can only be changed when the charset |
331 initialized, and cannot be changed later. | 276 is created. @xref{Charset Property Functions}. |
332 @xref{Charset Property Functions}. | 277 |
333 | 278 @node Basic Charset Functions |
334 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets | |
335 @subsection Basic Charset Functions | 279 @subsection Basic Charset Functions |
336 | 280 |
337 @defun find-charset charset-or-name | 281 @defun find-charset charset-or-name |
338 This function retrieves the charset of the given name. If | 282 This function retrieves the charset of the given name. If |
339 @var{charset-or-name} is a charset object, it is simply returned. | 283 @var{charset-or-name} is a charset object, it is simply returned. |
352 This function returns a list of the names of all defined charsets. | 296 This function returns a list of the names of all defined charsets. |
353 @end defun | 297 @end defun |
354 | 298 |
355 @defun make-charset name doc-string props | 299 @defun make-charset name doc-string props |
356 This function defines a new character set. This function is for use | 300 This function defines a new character set. This function is for use |
357 with MULE support. @var{name} is a symbol, the name by which the | 301 with Mule support. @var{name} is a symbol, the name by which the |
358 character set is normally referred. @var{doc-string} is a string | 302 character set is normally referred. @var{doc-string} is a string |
359 describing the character set. @var{props} is a property list, | 303 describing the character set. @var{props} is a property list, |
360 describing the specific nature of the character set. The recognized | 304 describing the specific nature of the character set. The recognized |
361 properties are @code{registry}, @code{dimension}, @code{columns}, | 305 properties are @code{registry}, @code{dimension}, @code{columns}, |
362 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and | 306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and |
380 This function returns the charset (if any) with the same dimension, | 324 This function returns the charset (if any) with the same dimension, |
381 number of characters, and final byte as @var{charset}, but which is | 325 number of characters, and final byte as @var{charset}, but which is |
382 displayed in the opposite direction. | 326 displayed in the opposite direction. |
383 @end defun | 327 @end defun |
384 | 328 |
385 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets | 329 @node Charset Property Functions |
386 @subsection Charset Property Functions | 330 @subsection Charset Property Functions |
387 | 331 |
388 All of these functions accept either a charset name or charset object. | 332 All of these functions accept either a charset name or charset object. |
389 | 333 |
390 @defun charset-property charset prop | 334 @defun charset-property charset prop |
391 This function returns property @var{prop} of @var{charset}. | 335 This function returns property @var{prop} of @var{charset}. |
392 @xref{Charset Properties}. | 336 @xref{Charset Properties}. |
393 @end defun | 337 @end defun |
394 | 338 |
395 Convenience functions are also provided for retrieving individual | 339 Convenience functions are also provided for retrieving individual |
396 properties of a charset. | 340 properties of a charset. |
397 | 341 |
398 @defun charset-name charset | 342 @defun charset-name charset |
399 This function returns the name of @var{charset}. This will be a symbol. | 343 This function returns the name of @var{charset}. This will be a symbol. |
400 @end defun | 344 @end defun |
420 This function returns the number of display columns per character (in | 364 This function returns the number of display columns per character (in |
421 TTY mode) of @var{charset}. | 365 TTY mode) of @var{charset}. |
422 @end defun | 366 @end defun |
423 | 367 |
424 @defun charset-direction charset | 368 @defun charset-direction charset |
425 This function returns the display direction of @var{charset}---either | 369 This function returns the display direction of @var{charset} -- either |
426 @code{l2r} or @code{r2l}. | 370 @code{l2r} or @code{r2l}. |
427 @end defun | 371 @end defun |
428 | 372 |
429 @defun charset-final charset | 373 @defun charset-final charset |
430 This function returns the final byte of the ISO 2022 escape sequence | 374 This function returns the final byte of the ISO 2022 escape sequence |
440 @defun charset-ccl-program charset | 384 @defun charset-ccl-program charset |
441 This function returns the CCL program, if any, for converting | 385 This function returns the CCL program, if any, for converting |
442 position codes of characters in @var{charset} into font indices. | 386 position codes of characters in @var{charset} into font indices. |
443 @end defun | 387 @end defun |
444 | 388 |
445 The only property of a charset that can currently be set after | 389 The only property of a charset that can currently be set after |
446 the charset has been created is the CCL program. | 390 the charset has been created is the CCL program. |
447 | 391 |
448 @defun set-charset-ccl-program charset ccl-program | 392 @defun set-charset-ccl-program charset ccl-program |
449 This function sets the @code{ccl-program} property of @var{charset} to | 393 This function sets the @code{ccl-program} property of @var{charset} to |
450 @var{ccl-program}. | 394 @var{ccl-program}. |
451 @end defun | 395 @end defun |
452 | 396 |
453 @node Predefined Charsets, , Charset Property Functions, Charsets | 397 @node Predefined Charsets |
454 @subsection Predefined Charsets | 398 @subsection Predefined Charsets |
455 | 399 |
456 The following charsets are predefined in the C code. | 400 The following charsets are predefined in the C code. |
457 | 401 |
458 @example | 402 @example |
459 Name Type Fi Gr Dir Registry | 403 Name Type Fi Gr Dir Registry |
460 -------------------------------------------------------------- | 404 -------------------------------------------------------------- |
461 ascii 94 B 0 l2r ISO8859-1 | 405 ascii 94 B 0 l2r ISO8859-1 |
482 chinese-big5-2 94x94 1 0 l2r Big5 | 426 chinese-big5-2 94x94 1 0 l2r Big5 |
483 korean-ksc5601 94x94 C 0 l2r KSC5601 | 427 korean-ksc5601 94x94 C 0 l2r KSC5601 |
484 composite 96x96 0 l2r --- | 428 composite 96x96 0 l2r --- |
485 @end example | 429 @end example |
486 | 430 |
487 The following charsets are predefined in the Lisp code. | 431 The following charsets are predefined in the Lisp code. |
488 | 432 |
489 @example | 433 @example |
490 Name Type Fi Gr Dir Registry | 434 Name Type Fi Gr Dir Registry |
491 -------------------------------------------------------------- | 435 -------------------------------------------------------------- |
492 arabic-digit 94 2 0 l2r MuleArabic-0 | 436 arabic-digit 94 2 0 l2r MuleArabic-0 |
506 @end example | 450 @end example |
507 | 451 |
508 For all of the above charsets, the dimension and number of columns are | 452 For all of the above charsets, the dimension and number of columns are |
509 the same. | 453 the same. |
510 | 454 |
511 Note that ASCII, Control-1, and Composite are handled specially. | 455 Note that ASCII, Control-1, and Composite are handled specially. |
512 This is why some of the fields are blank; and some of the filled-in | 456 This is why some of the fields are blank; and some of the filled-in |
513 fields (e.g. the type) are not really accurate. | 457 fields (e.g. the type) are not really accurate. |
514 | 458 |
515 @node MULE Characters, Composite Characters, Charsets, MULE | 459 @node MULE Characters |
516 @section MULE Characters | 460 @section MULE Characters |
517 | 461 |
518 @defun make-char charset arg1 &optional arg2 | 462 @defun make-char charset arg1 &optional arg2 |
519 This function makes a multi-byte character from @var{charset} and octets | 463 This function makes a multi-byte character from @var{charset} and octets |
520 @var{arg1} and @var{arg2}. | 464 @var{arg1} and @var{arg2}. |
537 | 481 |
538 @defun find-charset-string string | 482 @defun find-charset-string string |
539 This function returns a list of the charsets in @var{string}. | 483 This function returns a list of the charsets in @var{string}. |
540 @end defun | 484 @end defun |
541 | 485 |
542 @node Composite Characters, Coding Systems, MULE Characters, MULE | 486 @node Composite Characters |
543 @section Composite Characters | 487 @section Composite Characters |
544 | 488 |
545 Composite characters are not yet completely implemented. | 489 Composite characters are not yet completely implemented. |
546 | 490 |
547 @defun make-composite-char string | 491 @defun make-composite-char string |
548 This function converts a string into a single composite character. The | 492 This function converts a string into a single composite character. The |
549 character is the result of overstriking all the characters in the | 493 character is the result of overstriking all the characters in the |
550 string. | 494 string. |
568 character into one or more characters, the individual characters out of | 512 character into one or more characters, the individual characters out of |
569 which the composite character was formed. Non-composite characters are | 513 which the composite character was formed. Non-composite characters are |
570 left as-is. @var{buffer} defaults to the current buffer if omitted. | 514 left as-is. @var{buffer} defaults to the current buffer if omitted. |
571 @end defun | 515 @end defun |
572 | 516 |
573 @node Coding Systems, CCL, Composite Characters, MULE | 517 @node ISO 2022 |
518 @section ISO 2022 | |
519 | |
520 This section briefly describes the ISO 2022 encoding standard. For a more | |
521 thorough understanding, please refer to the original document of ISO | |
522 2022. | |
523 | |
524 Character sets (@dfn{charsets}) are classified into the following four | |
525 categories, according to the number of characters in the charset: | |
526 94-charset, 96-charset, 94x94-charset, and 96x96-charset. | |
527 | |
528 @need 1000 | |
529 @table @asis | |
530 @item 94-charset | |
531 ASCII(B), left(J) and right(I) half of JISX0201, ... | |
532 @item 96-charset | |
533 Latin-1(A), Latin-2(B), Latin-3(C), ... | |
534 @item 94x94-charset | |
535 GB2312(A), JISX0208(B), KSC5601(C), ... | |
536 @item 96x96-charset | |
537 none for the moment | |
538 @end table | |
539 | |
540 The character in parentheses after the name of each charset | |
541 is the @dfn{final character} @var{F}, which can be regarded as | |
542 the identifier of the charset. ECMA allocates @var{F} to each | |
543 charset. @var{F} is in the range of 0x30..0x7F, but 0x30..0x3F | |
544 are only for private use. | |
545 | |
546 Note: @dfn{ECMA} = European Computer Manufacturers Association | |
547 | |
548 There are four @dfn{registers of charsets}, called G0 thru G3. | |
549 You can designate (or assign) any charset to one of these | |
550 registers. | |
551 | |
552 The code space contained within one octet (of size 256) is divided into | |
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a | |
554 register of charset can be invoked. | |
555 | |
556 @example | |
557 @group | |
558 C0: 0x00 - 0x1F | |
559 GL: 0x20 - 0x7F | |
560 C1: 0x80 - 0x9F | |
561 GR: 0xA0 - 0xFF | |
562 @end group | |
563 @end example | |
564 | |
565 Usually, in the initial state, G0 is invoked into GL, and G1 | |
566 is invoked into GR. | |
567 | |
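The division of the octet code space described above can be sketched in a few lines of Python. This is only an illustration of the four ranges listed in the table; the function name `code_area` is hypothetical and is not part of XEmacs or MULE.

```python
def code_area(byte: int) -> str:
    """Return the ISO 2022 code-space area (C0, GL, C1, or GR) of one octet."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("an octet must be in the range 0x00..0xFF")
    if byte <= 0x1F:
        return "C0"   # control characters, left half
    if byte <= 0x7F:
        return "GL"   # graphic characters, left half
    if byte <= 0x9F:
        return "C1"   # control characters, right half
    return "GR"       # graphic characters, right half
```

In a 7-bit environment only the first two branches can ever be taken, which is why a charset designated to G1 must then be explicitly invoked into GL.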
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In | |
569 7-bit environments, only C0 and GL are used. | |
570 | |
571 Charset designation is done by escape sequences of the form: | |
572 | |
573 @example | |
574 ESC [@var{I}] @var{I} @var{F} | |
575 @end example | |
576 | |
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and | |
578 @var{F} is the final character identifying this charset. | |
579 | |
580 The meanings of the intermediate characters are: | |
581 | |
582 @example | |
583 @group | |
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). | |
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | |
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | |
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | |
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | |
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. | |
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | |
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | |
592 @end group | |
593 @end example | |
594 | |
595 The following rule is not allowed in ISO 2022 but can be used in Mule. | |
596 | |
597 @example | |
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. | |
599 @end example | |
600 | |
601 Here are examples of designations: | |
602 | |
603 @example | |
604 @group | |
605 ESC ( B : designate to G0 ASCII | |
606 ESC - A : designate to G1 Latin-1 | |
607 ESC $ ( A or ESC $ A : designate to G0 GB2312 | |
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | |
609 ESC $ ) C : designate to G1 KSC5601 | |
610 @end group | |
611 @end example | |
612 | |
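The designation rules and examples above can be checked with a small parser. The following Python sketch is an assumption-laden illustration, not the code MULE uses: the names `DESIGNATORS` and `parse_designation` are hypothetical, the `,` entry is the Mule-only extension mentioned above, and the short form `ESC $ F` is accepted for any final character even though the text only shows it for GB2312 and JISX0208.

```python
ESC = 0x1B

# Intermediate byte -> (register, charset size); 0x2C (",") is the Mule extension.
DESIGNATORS = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Parse ESC [$] I F (or the short form ESC $ F) into (register, type, final)."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    two_byte = body[0] == 0x24          # "$" marks a charset of dimension 2
    if two_byte:
        body = body[1:]
    if two_byte and len(body) == 1:
        # Short form: ESC $ F designates a 94x94 charset to G0.
        register, size, final = "G0", 94, body[0]
    elif len(body) == 2 and body[0] in DESIGNATORS:
        register, size = DESIGNATORS[body[0]]
        final = body[1]
    else:
        raise ValueError("unrecognized designation")
    if not 0x30 <= final <= 0x7F:
        raise ValueError("final character out of range")
    kind = f"{size}x{size}" if two_byte else str(size)
    return register, kind, chr(final)
```

For example, `parse_designation(b"\x1b$)C")` reports a 94x94 charset with final character `C` designated to G1, which matches the KSC5601 line in the list of examples.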
613 To use a charset designated to G2 or G3, and to use a charset designated | |
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 | |
615 into GL. There are two types of invocation, Locking Shift (forever) and | |
616 Single Shift (one character only). | |
617 | |
618 Locking Shift is done as follows: | |
619 | |
620 @example | |
621 LS0 or SI (0x0F): invoke G0 into GL | |
622 LS1 or SO (0x0E): invoke G1 into GL | |
623 LS2: invoke G2 into GL | |
624 LS3: invoke G3 into GL | |
625 LS1R: invoke G1 into GR | |
626 LS2R: invoke G2 into GR | |
627 LS3R: invoke G3 into GR | |
628 @end example | |
629 | |
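The effect of the two single-byte locking shifts can be sketched as a tiny state machine. This Python fragment is a simplified illustration that assumes a 7-bit environment and handles only SI and SO; the function name `gl_register_trace` is hypothetical, and the escape-sequence forms of LS2, LS3, and the right-hand shifts are deliberately left out.

```python
SO = 0x0E  # LS1: invoke G1 into GL
SI = 0x0F  # LS0: invoke G0 into GL


def gl_register_trace(data: bytes):
    """Pair each non-shift byte with the register invoked into GL when it is read."""
    current = "G0"                 # usual initial state: G0 is invoked into GL
    trace = []
    for byte in data:
        if byte == SO:
            current = "G1"         # stays in effect until the next locking shift
        elif byte == SI:
            current = "G0"
        else:
            trace.append((byte, current))
    return trace
```

Because the shift is locking, every byte between SO and SI is interpreted through G1, however many characters that covers.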
630 Single Shift is done as follows: | |
631 | |
632 @example | |
633 @group | |
634 SS2 or ESC N: invoke G2 into GL | |
635 SS3 or ESC O: invoke G3 into GL | |
636 @end group | |
637 @end example | |
638 | |
639 (#### Ben says: I think the above is slightly incorrect. It appears that | |
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | |
641 ESC O behave as indicated. The above definitions will not parse | |
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c | |
643 has similar problems.) | |
644 | |
645 As you may realize, there are many ISO-2022-compliant ways of | |
646 encoding multilingual text. Many coding systems are in use, such as | |
647 X11's Compound Text, Japanese JUNET code, and so-called | |
648 EUC (Extended UNIX Code); all of these are variants of ISO 2022. | |
649 | |
650 In Mule, we characterize ISO 2022 by the following attributes: | |
651 | |
652 @enumerate | |
653 @item | |
654 Initial designation to G0 thru G3. | |
655 @item | |
656 Allow designation of short form for Japanese and Chinese. | |
657 @item | |
658 Should we designate ASCII to G0 before control characters? | |
659 @item | |
660 Should we designate ASCII to G0 at the end of line? | |
661 @item | |
662 7-bit environment or 8-bit environment. | |
663 @item | |
664 Use Locking Shift or not. | |
665 @item | |
666 Use ASCII or JIS0201-1976-Roman. | |
667 @item | |
668 Use JISX0208-1983 or JISX0208-1976. | |
669 @end enumerate | |
670 | |
671 (The last two are only for Japanese.) | |
672 | |
673 By specifying these attributes, you can create any variant | |
674 of ISO 2022. | |
675 | |
676 Here are several examples: | |
677 | |
678 @example | |
679 @group | |
680 junet -- Coding system used in JUNET. | |
681 1. G0 <- ASCII, G1..3 <- never used | |
682 2. Yes. | |
683 3. Yes. | |
684 4. Yes. | |
685 5. 7-bit environment | |
686 6. No. | |
687 7. Use ASCII | |
688 8. Use JISX0208-1983 | |
689 @end group | |
690 | |
691 @group | |
692 ctext -- Compound Text | |
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used | |
694 2. No. | |
695 3. No. | |
696 4. Yes. | |
697 5. 8-bit environment | |
698 6. No. | |
699 7. Use ASCII | |
700 8. Use JISX0208-1983 | |
701 @end group | |
702 | |
703 @group | |
704 euc-china -- Chinese EUC. Although many people call this | |
705 "GB encoding", the name may cause misunderstanding. | |
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used | |
707 2. No. | |
708 3. Yes. | |
709 4. Yes. | |
710 5. 8-bit environment | |
711 6. No. | |
712 7. Use ASCII | |
713 8. Use JISX0208-1983 | |
714 @end group | |
715 | |
716 @group | |
717 korean-mail -- Coding system used in Korean network. | |
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used | |
719 2. No. | |
720 3. Yes. | |
721 4. Yes. | |
722 5. 7-bit environment | |
723 6. Yes. | |
724 7. No. | |
725 8. No. | |
726 @end group | |
727 @end example | |
728 | |
729 Mule creates all these coding systems by default. | |
730 | |
731 @node Coding Systems | |
574 @section Coding Systems | 732 @section Coding Systems |
575 | 733 |
576 A coding system is an object that defines how text containing multiple | 734 A coding system is an object that defines how text containing multiple |
577 character sets is encoded into a stream of (typically 8-bit) bytes. The | 735 character sets is encoded into a stream of (typically 8-bit) bytes. The |
578 coding system is used to decode the stream into a series of characters | 736 coding system is used to decode the stream into a series of characters |
579 (which may be from multiple charsets) when the text is read from a file | 737 (which may be from multiple charsets) when the text is read from a file |
580 or process, and is used to encode the text back into the same format | 738 or process, and is used to encode the text back into the same format |
581 when it is written out to a file or process. | 739 when it is written out to a file or process. |
582 | 740 |
583 For example, many ISO-2022-compliant coding systems (such as Compound | 741 For example, many ISO-2022-compliant coding systems (such as Compound |
584 Text, which is used for inter-client data under the X Window System) use | 742 Text, which is used for inter-client data under the X Window System) use |
585 escape sequences to switch between different charsets -- Japanese Kanji, | 743 escape sequences to switch between different charsets -- Japanese Kanji, |
586 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with | 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with |
587 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See | 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See |
588 @code{make-coding-system} for more information. | 746 @code{make-coding-system} for more information. |
589 | 747 |
590 Coding systems are normally identified using a symbol, and the symbol is | 748 Coding systems are normally identified using a symbol, and the symbol is |
591 accepted in place of the actual coding system object whenever a coding | 749 accepted in place of the actual coding system object whenever a coding |
592 system is called for. (This is similar to how faces and charsets work.) | 750 system is called for. (This is similar to how faces and charsets work.) |
593 | 751 |
594 @defun coding-system-p object | 752 @defun coding-system-p object |
595 This function returns non-@code{nil} if @var{object} is a coding system. | 753 This function returns non-@code{nil} if @var{object} is a coding system. |
596 @end defun | 754 @end defun |
597 | 755 |
598 @menu | 756 @menu |
599 * Coding System Types:: Classifying coding systems. | 757 * Coding System Types:: Classifying coding systems. |
600 * ISO 2022:: An international standard for | |
601 charsets and encodings. | |
602 * EOL Conversion:: Dealing with different ways of denoting | 758 * EOL Conversion:: Dealing with different ways of denoting |
603 the end of a line. | 759 the end of a line. |
604 * Coding System Properties:: Properties of a coding system. | 760 * Coding System Properties:: Properties of a coding system. |
605 * Basic Coding System Functions:: Working with coding systems. | 761 * Basic Coding System Functions:: Working with coding systems. |
606 * Coding System Property Functions:: Retrieving a coding system's properties. | 762 * Coding System Property Functions:: Retrieving a coding system's properties. |
607 * Encoding and Decoding Text:: Encoding and decoding text. | 763 * Encoding and Decoding Text:: Encoding and decoding text. |
608 * Detection of Textual Encoding:: Determining how text is encoded. | 764 * Detection of Textual Encoding:: Determining how text is encoded. |
609 * Big5 and Shift-JIS Functions:: Special functions for these non-standard | 765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard |
610 encodings. | 766 encodings. |
611 * Predefined Coding Systems:: Coding systems implemented by MULE. | |
612 @end menu | 767 @end menu |
613 | 768 |
614 @node Coding System Types, ISO 2022, , Coding Systems | 769 @node Coding System Types |
615 @subsection Coding System Types | 770 @subsection Coding System Types |
616 | 771 |
617 The coding system type determines the basic algorithm XEmacs will use to | |
618 decode or encode a data stream. Character encodings will be converted | |
619 to the MULE encoding, escape sequences processed, and newline sequences | |
620 converted to XEmacs's internal representation. There are three basic | |
621 classes of coding system type: no-conversion, ISO-2022, and special. | |
622 | |
623 No conversion allows you to look at the file's internal representation. | |
624 Since XEmacs is basically a text editor, "no conversion" does convert | |
625 newline conventions by default. (Use the 'binary coding-system if this | |
626 is not desired.) | |
627 | |
628 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating | |
629 use of "coded character sets for the exchange of data", i.e., text | |
630 streams. ISO 2022 contains functions that make it possible to encode | |
631 text streams to comply with restrictions of the Internet mail system and | |
632 de facto restrictions of most file systems (e.g., use of the separator | |
633 character in file names). Coding systems which are not ISO 2022 | |
634 conformant can be difficult to handle. Perhaps more important, they are | |
635 not adaptable to multilingual information interchange, with the obvious | |
636 exception of ISO 10646 (Unicode). (Unicode is partially supported by | |
637 XEmacs with the addition of the Lisp package ucs-conv.) | |
638 | |
639 The special class of coding systems includes automatic detection, CCL (a | |
640 "little language" embedded as an interpreter, useful for translating | |
641 between variants of a single character set), non-ISO-2022-conformant | |
642 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding. | |
643 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly | |
644 for other versions of XEmacs and for GNU Emacs 20.) | |
645 | |
646 @table @code | 772 @table @code |
773 @item nil | |
774 @itemx autodetect | |
775 Automatic conversion. XEmacs attempts to detect the coding system used | |
776 in the file. | |
647 @item no-conversion | 777 @item no-conversion |
648 No conversion, for binary files, and a few special cases of non-ISO-2022 | 778 No conversion. Use this for binary files and such. On output, graphic |
649 coding systems where conversion is done by hook functions (usually | 779 characters that are not in ASCII or Latin-1 will be replaced by a |
650 implemented in CCL). On output, graphic characters that are not in | 780 @samp{?}. (For a no-conversion-encoded buffer, these characters will |
651 ASCII or Latin-1 will be replaced by a @samp{?}. (For a | 781 only be present if you explicitly insert them.) |
652 no-conversion-encoded buffer, these characters will only be present if | 782 @item shift-jis |
653 you explicitly insert them.) | 783 Shift-JIS (a Japanese encoding commonly used in PC operating systems). |
654 @item iso2022 | 784 @item iso2022 |
655 Any ISO-2022-compliant encoding. Among others, this includes JIS (the | 785 Any ISO-2022-compliant encoding. Among other things, this includes JIS |
656 Japanese encoding commonly used for e-mail), national variants of EUC | 786 (the Japanese encoding commonly used for e-mail), national variants of |
657 (the standard Unix encoding for Japanese and other languages), and | 787 EUC (the standard Unix encoding for Japanese and other languages), and |
658 Compound Text (an encoding used in X11). You can specify more specific | 788 Compound Text (an encoding used in X11). You can specify more specific |
659 information about the conversion with the @var{flags} argument. | 789 information about the conversion with the @var{flags} argument. |
660 @item ucs-4 | |
661 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode. | |
662 @item utf-8 | |
663 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format | |
664 that can be used with both UCS-4 and Unicode. | |
665 @item undecided | |
666 Automatic conversion. XEmacs attempts to detect the coding system used | |
667 in the file. | |
668 @item shift-jis | |
669 Shift-JIS (a Japanese encoding commonly used in PC operating systems). | |
670 @item big5 | 790 @item big5 |
671 Big5 (the encoding commonly used for Taiwanese). | 791 Big5 (the encoding commonly used for Taiwanese). |
672 @item ccl | 792 @item ccl |
673 The conversion is performed using a user-written pseudo-code program. | 793 The conversion is performed using a user-written pseudo-code program. |
674 CCL (Code Conversion Language) is the name of this pseudo-code. For | 794 CCL (Code Conversion Language) is the name of this pseudo-code. |
675 example, CCL is used to map KOI8-R characters (an encoding for Russian | |
676 Cyrillic) to ISO8859-5 (the form used internally by MULE). | |
677 @item internal | 795 @item internal |
678 Write out or read in the raw contents of the memory representing the | 796 Write out or read in the raw contents of the memory representing the |
679 buffer's text. This is primarily useful for debugging purposes, and is | 797 buffer's text. This is primarily useful for debugging purposes, and is |
680 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set | 798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set |
681 (the @samp{--debug} configure option). @strong{Warning}: Reading in a | 799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a |
683 inconsistency in the memory representing a buffer's text, which will | 801 inconsistency in the memory representing a buffer's text, which will |
684 produce unpredictable results and may cause XEmacs to crash. Under | 802 produce unpredictable results and may cause XEmacs to crash. Under |
685 normal circumstances you should never use @code{internal} conversion. | 803 normal circumstances you should never use @code{internal} conversion. |
686 @end table | 804 @end table |
687 | 805 |
688 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems | 806 @node EOL Conversion |
689 @section ISO 2022 | |
690 | |
691 This section briefly describes the ISO 2022 encoding standard. A more | |
692 thorough treatment is available in the original document of ISO | |
693 2022 as well as various national standards (such as JIS X 0202). | |
694 | |
695 Character sets (@dfn{charsets}) are classified into the following four | |
696 categories, according to the number of characters in the charset: | |
697 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means | |
698 that although an ISO 2022 coding system may have variable width | |
699 characters, each charset used is fixed-width (in contrast to the MULE | |
700 character set and UTF-8, for example). | |
701 | |
702 ISO 2022 provides for switching between character sets via escape | |
703 sequences. This switching is somewhat complicated, because ISO 2022 | |
704 provides for both legacy applications like Internet mail that accept | |
705 only 7 significant bits in some contexts (RFC 822 headers, for example), | |
706 and more modern "8-bit clean" applications. It also provides for | |
707 compact and transparent representation of languages like Japanese which | |
708 mix ASCII and a national script (even outside of computer programs). | |
709 | |
710 First, ISO 2022 codified prevailing practice by dividing the code space | |
711 into "control" and "graphic" regions. The code points 0x00-0x1F and | |
712 0x80-0x9F are reserved for "control characters", while "graphic | |
713 characters" must be assigned to code points in the regions 0x20-0x7F and | |
714 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some | |
715 circumstances must be assigned the graphic character "ASCII SPACE" and | |
716 the control character "ASCII DEL" respectively. | |
717 | |
718 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F), | |
719 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" | |
720 and "graphic right", respectively, because of the standard method of | |
721 displaying graphic character sets in tables with the high byte indexing | |
722 columns and the low byte indexing rows. I don't find it very intuitive, | |
723 but these are called "registers". | |
724 | |
725 An ISO 2022-conformant encoding for a graphic character set must use a | |
726 fixed number of bytes per character, and the values must fit into a | |
727 single register; that is, each byte must range over either 0x20-0x7F, or | |
728 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a | |
729 character set by using both ranges at the same time. This is why a standard | |
730 character set such as ISO 8859-1 is actually considered by ISO 2022 to | |
731 be an aggregation of two character sets, ASCII and LATIN-1, and why it | |
732 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a | |
733 single character's bytes must all be drawn from the same register; this | |
734 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO | |
735 2022-compatible encodings. | |
736 | |
737 The reason for this restriction becomes clear when you attempt to define | |
738 an efficient, robust encoding for a language like Japanese. Like ISO | |
739 8859, Japanese encodings are aggregations of several character sets. In | |
740 practice, the vast majority of characters are drawn from the "JIS Roman" | |
741 character set (a derivative of ASCII; it won't hurt to think of it as | |
742 ASCII) and the JIS X 0208 standard "basic Japanese" character set | |
743 including not only ideographic characters ("kanji") but syllabic | |
744 Japanese characters ("kana"), a wide variety of symbols, and many | |
745 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although | |
746 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not | |
747 suited to programming; thus the inclusion of ASCII in the standard | |
748 Japanese encodings. | |
749 | |
750 For normal Japanese text such as in newspapers, a broad repertoire of | |
751 approximately 3000 characters is used. Evidently this won't fit into | |
752 one byte; two must be used. But much of the text processed by Japanese | |
753 computers is computer source code, nearly all of which is ASCII. A not | |
754 insignificant portion of ordinary text is English (as such or as | |
755 borrowed Japanese vocabulary) or other languages which can be represented | |
756 at least approximately in ASCII, as well. It seems reasonable then to | |
757 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly | |
758 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is | |
759 invoked to the GL register, and JIS X 0208 is invoked to the GR | |
760 register. Thus, each byte can be tested for its character set by | |
761 looking at the high bit; if set, it is Japanese, if clear, it is ASCII. | |
762 Furthermore, since control characters like newline can never be part of | |
763 a graphic character, even in the case of corruption in transmission the | |
764 stream will be resynchronized at every line break, on the order of 60-80 | |
765 bytes. This coding system requires no escape sequences or special | |
766 control codes to represent 99.9% of all Japanese text. | |
767 | |
768 Note carefully the distinction between the character sets (ASCII and JIS | |
769 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The | |
770 JIS X 0208 character set is used in three different encodings for | |
771 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is | |
772 always clear), in EUC-JP it is invoked into GR (setting the high bit in | |
773 the process), and in Shift JIS the high bit may be set or reset, and the | |
774 significant bits are shifted within the 16-bit character so that the two | |
775 main character sets can coexist with a third (the "halfwidth katakana" | |
776 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a | |
777 version of the ISO-2022 coding system. | |
778 | |
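The difference between invoking JIS X 0208 into GL and into GR can be observed with Python's standard `iso2022_jp` and `euc_jp` codecs. This is an illustrative sketch under the assumption that those codecs behave like the encodings described here; it is not taken from XEmacs, and the variable names are arbitrary.

```python
text = "日本"
euc = text.encode("euc_jp")        # JIS X 0208 invoked into GR
jis = text.encode("iso2022_jp")    # JIS X 0208 invoked into GL

# Drop the designations ESC $ B (JIS X 0208 to G0) and ESC ( B (ASCII to G0),
# leaving only the bytes that carry the characters themselves.
payload = jis.replace(b"\x1b$B", b"").replace(b"\x1b(B", b"")

# Same character set, different invocation: setting the high bit of each
# ISO-2022-JP payload byte yields the EUC-JP bytes.
same_code_points = bytes(b | 0x80 for b in payload) == euc
```

The character set is identical in both cases; only the register it is read through, and therefore the high bit, differs.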
779 In order to systematically treat subsidiary character sets (like the | |
780 "halfwidth katakana" already mentioned, and the "supplementary kanji" of | |
781 JIS X 0212), four further registers are defined: G0, G1, G2, and G3. | |
782 Unlike GL and GR, they are not logically distinguished by internal | |
783 format. Instead, the process of "invocation" mentioned earlier is | |
784 broken into two steps: first, a character set is @dfn{designated} to one | |
785 of the registers G0-G3 by use of an @dfn{escape sequence} of the form: | |
786 | |
787 @example | |
788 ESC [@var{I}] @var{I} @var{F} | |
789 @end example | |
790 | |
791 where @var{I} is an intermediate character or characters in the range | |
792 0x20 - 0x2F, and @var{F}, from the range 0x30-0x7F, is the final | |
793 character identifying this charset. (Final characters in the range | |
794 0x30-0x3F are reserved for private use and will never have a publicly | |
795 registered meaning.) | |
796 | |
797 Then that register is @dfn{invoked} to either GL or GR, either | |
798 automatically (designations to G0 normally involve invocation to GL as | |
799 well), or by use of shifting (affecting only the following character in | |
800 the data stream) or locking (effective until the next designation or | |
801 locking) control sequences. An encoding conformant to ISO 2022 is | |
802 typically defined by designating the initial contents of the G0-G3 | |
803 registers, specifying a 7- or 8-bit environment, and specifying whether | |
804 further designations will be recognized. | |
805 | |
806 Some examples of character sets and the registered final characters | |
807 @var{F} used to designate them: | |
808 | |
809 @need 1000 | |
810 @table @asis | |
811 @item 94-charset | |
812 ASCII (B), left (J) and right (I) half of JIS X 0201, ... | |
813 @item 96-charset | |
814 Latin-1 (A), Latin-2 (B), Latin-3 (C), ... | |
815 @item 94x94-charset | |
816 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ... | |
817 @item 96x96-charset | |
818 none for the moment | |
819 @end table | |
820 | |
821 The meanings of the various characters in these sequences, where not | |
822 specified by the ISO 2022 standard (such as the ESC character), are | |
823 assigned by @dfn{ECMA}, the European Computer Manufacturers Association. | |
824 | |
825 The meanings of the intermediate characters are: | |
826 | |
827 @example | |
828 @group | |
829 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). | |
830 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | |
831 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | |
832 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | |
833 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | |
834 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. | |
835 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. | |
836 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | |
837 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | |
838 @end group | |
839 @end example | |
840 | |
841 The comma may be used in files read and written only by MULE, as a MULE | |
842 extension, but this is illegal in ISO 2022. (The reason is that in ISO | |
843 2022 G0 must be a 94-member character set, with 0x20 assigned the value | |
844 SPACE, and 0x7F assigned the value DEL.) | |
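
The table of intermediate characters can be expressed as a small lookup.
The following is a minimal Emacs Lisp sketch, not code taken from MULE;
the names @code{my-iso2022-intermediates} and
@code{my-iso2022-designation-target} are hypothetical and chosen only
for illustration.

@example
@group
;; Hypothetical table: intermediate character -> (REGISTER CHARSET-SIZE).
(defconst my-iso2022-intermediates
  '(("(" g0 94) (")" g1 94) ("*" g2 94) ("+" g3 94)
    ("," g0 96) ("-" g1 96) ("." g2 96) ("/" g3 96))
  "Map each designating intermediate character to its register and size.")

(defun my-iso2022-designation-target (intermediate)
  "Return (REGISTER SIZE) for the string INTERMEDIATE, or nil if unknown."
  (cdr (assoc intermediate my-iso2022-intermediates)))

;; "-" designates a 96-charset to G1.
(my-iso2022-designation-target "-")
;; "," is the MULE-only extension (a 96-charset designated to G0).
(my-iso2022-designation-target ",")
@end group
@end example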
845 | |
846 Here are examples of designations: | |
847 | |
848 @example | |
849 @group | |
850 ESC ( B : designate to G0 ASCII | |
851 ESC - A : designate to G1 Latin-1 | |
852 ESC $ ( A or ESC $ A : designate to G0 GB2312 | |
853 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | |
854 ESC $ ) C : designate to G1 KSC5601 | |
855 @end group | |
856 @end example | |
857 | |
858 (The short forms used to designate GB2312 and JIS X 0208 are for | |
859 backwards compatibility; the long forms are preferred.) | |
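
A designation is just a short byte string, so the examples above can be
built directly in Lisp.  This is only a sketch; the helper
@code{my-iso2022-designate} is a hypothetical name and is not part of
MULE.

@example
@group
(defun my-iso2022-designate (intermediates final)
  "Return the designation sequence ESC INTERMEDIATES FINAL.
Both arguments are strings."
  (concat "\e" intermediates final))

(my-iso2022-designate "(" "B")    ; ESC ( B   : ASCII to G0
(my-iso2022-designate "-" "A")    ; ESC - A   : Latin-1 to G1
(my-iso2022-designate "$(" "B")   ; ESC $ ( B : JISX0208 to G0 (long form)
(my-iso2022-designate "$)" "C")   ; ESC $ ) C : KSC5601 to G1
@end group
@end example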
860 | |
861 To use a charset designated to G2 or G3, and to use a charset designated | |
862 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 | |
863 into GL. There are two types of invocation: Locking Shift (in effect until | |
864 the next locking shift) and Single Shift (one character only). | |
865 | |
866 Locking Shift is done as follows: | |
867 | |
868 @example | |
869 LS0 or SI (0x0F): invoke G0 into GL | |
870 LS1 or SO (0x0E): invoke G1 into GL | |
871 LS2: invoke G2 into GL | |
872 LS3: invoke G3 into GL | |
873 LS1R: invoke G1 into GR | |
874 LS2R: invoke G2 into GR | |
875 LS3R: invoke G3 into GR | |
876 @end example | |
877 | |
878 Single Shift is done as follows: | |
879 | |
880 @example | |
881 @group | |
882 SS2 or ESC N: invoke G2 into GL | |
883 SS3 or ESC O: invoke G3 into GL | |
884 @end group | |
885 @end example | |
886 | |
887 The shift functions (such as LS1R and SS3) are represented by control | |
888 characters (from C1) in 8 bit environments and by escape sequences in 7 | |
889 bit environments. | |
890 | |
891 (#### Ben says: I think the above is slightly incorrect. It appears that | |
892 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | |
893 ESC O behave as indicated. The above definitions will not parse | |
894 EUC-encoded text correctly, and it looks like the code in mule-coding.c | |
895 has similar problems.) | |
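
To make the locking shifts concrete, the sketch below assembles by hand
a 7-bit byte stream in the style of ISO-2022-KR: KSC5601 is designated
to G1, and SO and SI switch GL between G1 and G0.  The variable name is
hypothetical, and the byte pair 0x30 0x21 is used on the assumption that
it is a valid KSC5601 code position.

@example
@group
(defconst my-locking-shift-demo
  (concat "\e$)C"    ; ESC $ ) C: designate KSC5601 to G1
          "abc"      ; G0 (ASCII) is invoked into GL
          "\016"     ; SO (0x0E), i.e. LS1: invoke G1 into GL
          "0!"       ; one two-byte character, bytes 0x30 0x21
          "\017"     ; SI (0x0F), i.e. LS0: invoke G0 into GL again
          "def")
  "Hand-built 7-bit stream that uses locking shifts.")
@end group
@end example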
896 | |
897 Evidently there are many ISO-2022-compliant ways of encoding | |
898 multilingual text. Many coding systems in actual use, such as X11's | |
899 Compound Text, the Japanese JUNET code, and the so-called EUC | |
900 (Extended UNIX Code), are variants of ISO 2022. | |
901 | |
902 In MULE, we characterize a version of ISO 2022 by the following | |
903 attributes: | |
904 | |
905 @enumerate | |
906 @item | |
907 The character sets initially designated to G0 through G3. | |
908 @item | |
909 Whether short form designations are allowed for Japanese and Chinese. | |
910 @item | |
911 Whether ASCII should be designated to G0 before control characters. | |
912 @item | |
913 Whether ASCII should be designated to G0 at the end of line. | |
914 @item | |
915 7-bit environment or 8-bit environment. | |
916 @item | |
917 Whether Locking Shifts are used or not. | |
918 @item | |
919 Whether to use ASCII or the variant JIS X 0201-1976-Roman. | |
920 @item | |
921 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976. | |
922 @end enumerate | |
923 | |
924 (The last two are only for Japanese.) | |
925 | |
926 By specifying these attributes, you can create any variant | |
927 of ISO 2022. | |
928 | |
929 Here are several examples: | |
930 | |
931 @example | |
932 @group | |
933 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468). | |
934 1. G0 <- ASCII, G1..3 <- never used | |
935 2. Yes. | |
936 3. Yes. | |
937 4. Yes. | |
938 5. 7-bit environment | |
939 6. No. | |
940 7. Use ASCII | |
941 8. Use JIS X 0208-1983 | |
942 @end group | |
943 | |
944 @group | |
945 ctext -- X11 Compound Text | |
946 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used. | |
947 2. No. | |
948 3. No. | |
949 4. Yes. | |
950 5. 8-bit environment. | |
951 6. No. | |
952 7. Use ASCII. | |
953 8. Use JIS X 0208-1983. | |
954 @end group | |
955 | |
956 @group | |
957 euc-china -- Chinese EUC. Often called the "GB encoding", but that is | |
958 technically incorrect. | |
959 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used. | |
960 2. No. | |
961 3. Yes. | |
962 4. Yes. | |
963 5. 8-bit environment. | |
964 6. No. | |
965 7. Use ASCII. | |
966 8. Use JIS X 0208-1983. | |
967 @end group | |
968 | |
969 @group | |
970 ISO-2022-KR -- Coding system used in Korean email. | |
971 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used. | |
972 2. No. | |
973 3. Yes. | |
974 4. Yes. | |
975 5. 7-bit environment. | |
976 6. Yes. | |
977 7. Use ASCII. | |
978 8. Use JIS X 0208-1983. | |
979 @end group | |
980 @end example | |
981 | |
982 MULE creates all of these coding systems by default. | |
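
Attribute 1 (the initial designations) can be inspected from Lisp.  The
sketch below uses only @code{find-coding-system} and
@code{coding-system-property} together with the @code{charset-g0} and
@code{charset-g1} properties; it assumes an XEmacs built with MULE, and
it reports @code{not-defined} for any name that is missing in the
running XEmacs rather than signalling an error.

@example
@group
(mapcar
 (lambda (name)
   (let ((cs (find-coding-system name)))
     (if cs
         ;; Collect the initial G0 and G1 designations.
         (list name
               (coding-system-property cs 'charset-g0)
               (coding-system-property cs 'charset-g1))
       (list name 'not-defined))))
 '(iso-2022-jp ctext iso-2022-kr))
@end group
@end example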
983 | |
984 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems | |
985 @subsection EOL Conversion | 807 @subsection EOL Conversion |
986 | 808 |
987 @table @code | 809 @table @code |
988 @item nil | 810 @item nil |
989 Automatically detect the end-of-line type (LF, CRLF, or CR). Also | 811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also |
1006 Automatically detect the end-of-line type but do not generate subsidiary | 828 Automatically detect the end-of-line type but do not generate subsidiary |
1007 coding systems. (This value is converted to @code{nil} when stored | 829 coding systems. (This value is converted to @code{nil} when stored |
1008 internally, and @code{coding-system-property} will return @code{nil}.) | 830 internally, and @code{coding-system-property} will return @code{nil}.) |
1009 @end table | 831 @end table |
1010 | 832 |
1011 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems | 833 @node Coding System Properties |
1012 @subsection Coding System Properties | 834 @subsection Coding System Properties |
1013 | 835 |
1014 @table @code | 836 @table @code |
1015 @item mnemonic | 837 @item mnemonic |
1016 String to be displayed in the modeline when this coding system is | 838 String to be displayed in the modeline when this coding system is |
1017 active. | 839 active. |
1018 | 840 |
1019 @item eol-type | 841 @item eol-type |
1020 End-of-line conversion to be used. It should be one of the types | 842 End-of-line conversion to be used. It should be one of the types |
1021 listed in @ref{EOL Conversion}. | 843 listed in @ref{EOL Conversion}. |
1022 | |
1023 @item eol-lf | |
1024 The coding system which is the same as this one, except that it uses the | |
1025 Unix line-breaking convention. | |
1026 | |
1027 @item eol-crlf | |
1028 The coding system which is the same as this one, except that it uses the | |
1029 DOS line-breaking convention. | |
1030 | |
1031 @item eol-cr | |
1032 The coding system which is the same as this one, except that it uses the | |
1033 Macintosh line-breaking convention. | |
1034 | 844 |
1035 @item post-read-conversion | 845 @item post-read-conversion |
1036 Function called after a file has been read in, to perform the decoding. | 846 Function called after a file has been read in, to perform the decoding. |
1037 Called with two arguments, @var{beg} and @var{end}, denoting a region of | 847 Called with two arguments, @var{beg} and @var{end}, denoting a region of |
1038 the current buffer to be decoded. | 848 the current buffer to be decoded. |
1041 Function called before a file is written out, to perform the encoding. | 851 Function called before a file is written out, to perform the encoding. |
1042 Called with two arguments, @var{beg} and @var{end}, denoting a region of | 852 Called with two arguments, @var{beg} and @var{end}, denoting a region of |
1043 the current buffer to be encoded. | 853 the current buffer to be encoded. |
1044 @end table | 854 @end table |
1045 | 855 |
1046 The following additional properties are recognized if @var{type} is | 856 The following additional properties are recognized if @var{type} is |
1047 @code{iso2022}: | 857 @code{iso2022}: |
1048 | 858 |
1049 @table @code | 859 @table @code |
1050 @item charset-g0 | 860 @item charset-g0 |
1051 @itemx charset-g1 | 861 @itemx charset-g1 |
1119 A list of conversion specifications, specifying conversion of characters | 929 A list of conversion specifications, specifying conversion of characters |
1120 in one charset to another when encoding is performed. The form of each | 930 in one charset to another when encoding is performed. The form of each |
1121 specification is the same as for @code{input-charset-conversion}. | 931 specification is the same as for @code{input-charset-conversion}. |
1122 @end table | 932 @end table |
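
As an illustration of how these properties fit together, here is a
sketch of a call to @code{make-coding-system} for an @code{iso2022}
coding system.  The name @code{my-latin-2-demo}, its doc string, and its
mnemonic are hypothetical; the sketch also assumes that the charset
properties accept the charset names @code{ascii} and
@code{latin-iso8859-2} as symbols.

@example
@group
(make-coding-system
 'my-latin-2-demo 'iso2022
 "Hypothetical ISO 8859-2 coding system, for illustration only."
 '(charset-g0 ascii              ; initially designated to G0
   charset-g1 latin-iso8859-2    ; initially designated to G1
   mnemonic "Demo/L2"))          ; string shown in the modeline
@end group
@end example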
1123 | 933 |
1124 The following additional properties are recognized (and required) if | 934 The following additional properties are recognized (and required) if |
1125 @var{type} is @code{ccl}: | 935 @var{type} is @code{ccl}: |
1126 | 936 |
1127 @table @code | 937 @table @code |
1128 @item decode | 938 @item decode |
1129 CCL program used for decoding (converting to internal format). | 939 CCL program used for decoding (converting to internal format). |
1130 | 940 |
1131 @item encode | 941 @item encode |
1132 CCL program used for encoding (converting to external format). | 942 CCL program used for encoding (converting to external format). |
1133 @end table | 943 @end table |
1134 | 944 |
1135 The following properties are used internally: @var{eol-cr}, | 945 @node Basic Coding System Functions |
1136 @var{eol-crlf}, @var{eol-lf}, and @var{base}. | |
1137 | |
1138 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems | |
1139 @subsection Basic Coding System Functions | 946 @subsection Basic Coding System Functions |
1140 | 947 |
1141 @defun find-coding-system coding-system-or-name | 948 @defun find-coding-system coding-system-or-name |
1142 This function retrieves the coding system of the given name. | 949 This function retrieves the coding system of the given name. |
1143 | 950 |
1144 If @var{coding-system-or-name} is a coding-system object, it is simply | 951 If @var{coding-system-or-name} is a coding-system object, it is simply |
1145 returned. Otherwise, @var{coding-system-or-name} should be a symbol. | 952 returned. Otherwise, @var{coding-system-or-name} should be a symbol. |
1146 If there is no such coding system, @code{nil} is returned. Otherwise | 953 If there is no such coding system, @code{nil} is returned. Otherwise |
1147 the associated coding system object is returned. | 954 the associated coding system object is returned. |
1148 @end defun | 955 @end defun |
1149 | 956 |
1159 | 966 |
1160 @defun coding-system-name coding-system | 967 @defun coding-system-name coding-system |
1161 This function returns the name of the given coding system. | 968 This function returns the name of the given coding system. |
1162 @end defun | 969 @end defun |
1163 | 970 |
1164 @defun coding-system-base coding-system | |
1165 This function returns the base coding system (the one with undecided | |
1166 EOL convention) of @var{coding-system}. | |
1167 @end defun | |
1168 | |
1169 @defun make-coding-system name type &optional doc-string props | 971 @defun make-coding-system name type &optional doc-string props |
1170 This function registers symbol @var{name} as a coding system. | 972 This function registers symbol @var{name} as a coding system. |
1171 | 973 |
1172 @var{type} describes the conversion method used and should be one of | 974 @var{type} describes the conversion method used and should be one of |
1173 the types listed in @ref{Coding System Types}. | 975 the types listed in @ref{Coding System Types}. |
1188 @defun subsidiary-coding-system coding-system eol-type | 990 @defun subsidiary-coding-system coding-system eol-type |
1189 This function returns the subsidiary coding system of | 991 This function returns the subsidiary coding system of |
1190 @var{coding-system} with eol type @var{eol-type}. | 992 @var{coding-system} with eol type @var{eol-type}. |
1191 @end defun | 993 @end defun |
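
A short usage sketch follows.  It assumes that the symbol @code{crlf} is
an acceptable value for @var{eol-type} and that the @code{euc-jp} coding
system is defined in the running XEmacs.

@example
@group
(let ((base (find-coding-system 'euc-jp)))
  (if base
      ;; Name of the subsidiary that forces CRLF line breaks.
      (coding-system-name (subsidiary-coding-system base 'crlf))))
@end group
@end example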
1192 | 994 |
1193 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems | 995 @node Coding System Property Functions |
1194 @subsection Coding System Property Functions | 996 @subsection Coding System Property Functions |
1195 | 997 |
1196 @defun coding-system-doc-string coding-system | 998 @defun coding-system-doc-string coding-system |
1197 This function returns the doc string for @var{coding-system}. | 999 This function returns the doc string for @var{coding-system}. |
1198 @end defun | 1000 @end defun |
1203 | 1005 |
1204 @defun coding-system-property coding-system prop | 1006 @defun coding-system-property coding-system prop |
1205 This function returns the @var{prop} property of @var{coding-system}. | 1007 This function returns the @var{prop} property of @var{coding-system}. |
1206 @end defun | 1008 @end defun |
1207 | 1009 |
1208 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems | 1010 @node Encoding and Decoding Text |
1209 @subsection Encoding and Decoding Text | 1011 @subsection Encoding and Decoding Text |
1210 | 1012 |
1211 @defun decode-coding-region start end coding-system &optional buffer | 1013 @defun decode-coding-region start end coding-system &optional buffer |
1212 This function decodes the text between @var{start} and @var{end} which | 1014 This function decodes the text between @var{start} and @var{end} which |
1213 is encoded in @var{coding-system}. This is useful if you've read in | 1015 is encoded in @var{coding-system}. This is useful if you've read in |
1224 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS | 1026 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS |
1225 encoding. The length of the encoded text is returned. @var{buffer} | 1027 encoding. The length of the encoded text is returned. @var{buffer} |
1226 defaults to the current buffer if unspecified. | 1028 defaults to the current buffer if unspecified. |
1227 @end defun | 1029 @end defun |
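
The two functions can be combined into a round trip.  The following is a
sketch only: it assumes that @code{encode-coding-region} takes the same
arguments as @code{decode-coding-region}, that @code{with-temp-buffer}
is available, and that @code{make-char} (which is not described in this
section) accepts the position codes 48 and 33 for the
@code{japanese-jisx0208} charset.

@example
@group
(let ((cs (find-coding-system 'iso-2022-jp)))
  (with-temp-buffer
    ;; Insert some ASCII text followed by one JIS X 0208 character.
    (insert "abc" (make-char 'japanese-jisx0208 48 33))
    (encode-coding-region (point-min) (point-max) cs)
    ;; The buffer now holds the external (encoded) representation.
    (decode-coding-region (point-min) (point-max) cs)
    (buffer-string)))
@end group
@end example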
1228 | 1030 |
1229 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems | 1031 @node Detection of Textual Encoding |
1230 @subsection Detection of Textual Encoding | 1032 @subsection Detection of Textual Encoding |
1231 | 1033 |
1232 @defun coding-category-list | 1034 @defun coding-category-list |
1233 This function returns a list of all recognized coding categories. | 1035 This function returns a list of all recognized coding categories. |
1234 @end defun | 1036 @end defun |
1260 returns @code{autodetect} or one of its subsidiary coding systems | 1062 returns @code{autodetect} or one of its subsidiary coding systems |
1261 according to a detected end-of-line type. Optional arg @var{buffer} | 1063 according to a detected end-of-line type. Optional arg @var{buffer} |
1262 defaults to the current buffer. | 1064 defaults to the current buffer. |
1263 @end defun | 1065 @end defun |
1264 | 1066 |
1265 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems | 1067 @node Big5 and Shift-JIS Functions |
1266 @subsection Big5 and Shift-JIS Functions | 1068 @subsection Big5 and Shift-JIS Functions |
1267 | 1069 |
1268 These are special functions for working with the non-standard | 1070 These are special functions for working with the non-standard |
1269 Shift-JIS and Big5 encodings. | 1071 Shift-JIS and Big5 encodings. |
1270 | 1072 |
1271 @defun decode-shift-jis-char code | 1073 @defun decode-shift-jis-char code |
1272 This function decodes a JIS X 0208 character of Shift-JIS coding-system. | 1074 This function decodes a JISX0208 character of Shift-JIS coding-system. |
1273 @var{code} is the character code in Shift-JIS as a cons of two bytes. | 1075 @var{code} is the character code in Shift-JIS as a cons of two bytes. |
1274 The corresponding character is returned. | 1076 The corresponding character is returned. |
1275 @end defun | 1077 @end defun |
1276 | 1078 |
1277 @defun encode-shift-jis-char ch | 1079 @defun encode-shift-jis-char ch |
1278 This function encodes a JIS X 0208 character @var{ch} to SHIFT-JIS | 1080 This function encodes a JISX0208 character @var{ch} to SHIFT-JIS |
1279 coding-system. The corresponding character code in SHIFT-JIS is | 1081 coding-system. The corresponding character code in SHIFT-JIS is |
1280 returned as a cons of two bytes. | 1082 returned as a cons of two bytes. |
1281 @end defun | 1083 @end defun |
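
The two Shift-JIS functions are inverses of each other, which the sketch
below checks for a single character.  It assumes that @code{make-char}
(not described in this section) accepts the position codes 48 and 33 for
the @code{japanese-jisx0208} charset.

@example
@group
(let* ((ch (make-char 'japanese-jisx0208 48 33))
       (code (encode-shift-jis-char ch)))   ; a cons of two bytes
  ;; Return the Shift-JIS code and whether decoding gives CH back.
  (list code (eq ch (decode-shift-jis-char code))))
@end group
@end example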
1282 | 1084 |
1283 @defun decode-big5-char code | 1085 @defun decode-big5-char code |
1289 @defun encode-big5-char ch | 1091 @defun encode-big5-char ch |
1290 This function encodes the Big5 character @var{ch} to BIG5 | 1092 This function encodes the Big5 character @var{ch} to BIG5 |
1291 coding-system. The corresponding character code in Big5 is returned. | 1093 coding-system. The corresponding character code in Big5 is returned. |
1292 @end defun | 1094 @end defun |
1293 | 1095 |
1294 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems | |
1295 @subsection Coding Systems Implemented | |
1296 | |
1297 MULE initializes most of the commonly used coding systems at XEmacs's | |
1298 startup. A few others are initialized only when the relevant language | |
1299 environment is selected and support libraries are loaded. (NB: The | |
1300 following list is based on XEmacs 21.2.19, the development branch at the | |
1301 time of writing. The list may be somewhat different for other | |
1302 versions. Recent versions of GNU Emacs 20 implement a few more rare | |
1303 coding systems; work is being done to port these to XEmacs.) | |
1304 | |
1305 Unfortunately, there is not a consistent naming convention for character | |
1306 sets, and for practical purposes coding systems often take their name | |
1307 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others | |
1308 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few | |
1309 from their non-text usages (internal, binary). To provide for this, and | |
1310 for the fact that many coding systems have several common names, an | |
1311 aliasing system is provided. Finally, some effort has been made to use | |
1312 names that are registered as MIME charsets (this is why the name | |
1313 'shift_jis contains that un-Lisp-y underscore). | |
1314 | |
1315 There is a systematic naming convention regarding end-of-line (EOL) | |
1316 conventions for different systems. A coding system whose name ends in | |
1317 "-unix" forces the assumption that lines are broken by newlines (0x0A). | |
1318 A coding system whose name ends in "-mac" forces the assumption that | |
1319 lines are broken by ASCII CRs (0x0D). A coding system whose name ends | |
1320 in "-dos" forces the assumption that lines are broken by CRLF sequences | |
1321 (0x0D 0x0A). These subsidiary coding systems are automatically derived | |
1322 from a base coding system. Use of the base coding system implies | |
1323 autodetection of the text file convention. (The fact that the -unix, | |
1324 -mac, and -dos are derived from a base system results in them showing up | |
1325 as "aliases" in `list-coding-systems'.) These subsidiaries have a | |
1326 consistent modeline indicator as well. "-dos" coding systems have ":T" | |
1327 appended to their modeline indicator, while "-mac" coding systems have | |
1328 ":t" appended (eg, "ISO8:t" for iso-2022-8-mac). | |
1329 | |
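The suffix convention makes it easy to derive the subsidiary names from
a base name.  The sketch below, which assumes the @code{iso-2022-8}
coding system is defined, builds the three names and records whether
each one is known to the running XEmacs.

@example
@group
(mapcar
 (lambda (suffix)
   (let ((name (intern (concat "iso-2022-8" suffix))))
     ;; Pair each derived name with t if it exists, nil otherwise.
     (cons name (and (find-coding-system name) t))))
 '("-unix" "-dos" "-mac"))
@end group
@end example
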
1330 In the following table, each coding system is given with its mode line | |
1331 indicator in parentheses. Non-textual coding systems are listed first, | |
1332 followed by textual coding systems and their aliases. (The coding system | |
1333 subsidiary modeline indicators ":T" and ":t" will be omitted from the | |
1334 table of coding systems.) | |
1335 | |
1336 ### SJT 1999-08-23 Maybe should order these by language? Definitely | |
1337 need language usage for the ISO-8859 family. | |
1338 | |
1339 Note that although true coding system aliases have been implemented for | |
1340 XEmacs 21.2, the coding system initialization has not yet been converted | |
1341 as of 21.2.19. So coding systems described as aliases have the same | |
1342 properties as the aliased coding system, but will not be equal as Lisp | |
1343 objects. | |
1344 | |
1345 @table @code | |
1346 | |
1347 @item automatic-conversion | |
1348 @itemx undecided | |
1349 @itemx undecided-dos | |
1350 @itemx undecided-mac | |
1351 @itemx undecided-unix | |
1352 | |
1353 Modeline indicator: @code{Auto}. A type @code{undecided} coding system. | |
1354 Attempts to determine an appropriate coding system from file contents or | |
1355 the environment. | |
1356 | |
1357 @item raw-text | |
1358 @itemx no-conversion | |
1359 @itemx raw-text-dos | |
1360 @itemx raw-text-mac | |
1361 @itemx raw-text-unix | |
1362 @itemx no-conversion-dos | |
1363 @itemx no-conversion-mac | |
1364 @itemx no-conversion-unix | |
1365 | |
1366 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system, | |
1367 which converts only line-break-codes. An implementation quirk means | |
1368 that this coding system is also used for ISO8859-1. | |
1369 | |
1370 @item binary | |
1371 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding | |
1372 system which does no character coding or EOL conversions. An alias for | |
1373 @code{raw-text-unix}. | |
1374 | |
1375 @item alternativnyj | |
1376 @itemx alternativnyj-dos | |
1377 @itemx alternativnyj-mac | |
1378 @itemx alternativnyj-unix | |
1379 | |
1380 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for | |
1381 Alternativnyj, an encoding of the Cyrillic alphabet. | |
1382 | |
1383 @item big5 | |
1384 @itemx big5-dos | |
1385 @itemx big5-mac | |
1386 @itemx big5-unix | |
1387 | |
1388 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for | |
1389 BIG5, the most common encoding of traditional Chinese as used in Taiwan. | |
1390 | |
1391 @item cn-gb-2312 | |
1392 @itemx cn-gb-2312-dos | |
1393 @itemx cn-gb-2312-mac | |
1394 @itemx cn-gb-2312-unix | |
1395 | |
1396 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used | |
1397 for simplified Chinese (as used in the People's Republic of China), with | |
1398 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng} | |
1399 (G2) character sets initially designated. Chinese EUC (Extended Unix | |
1400 Code). | |
1401 | |
1402 @item ctext-hebrew | |
1403 @itemx ctext-hebrew-dos | |
1404 @itemx ctext-hebrew-mac | |
1405 @itemx ctext-hebrew-unix | |
1406 | |
1407 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system | |
1408 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character | |
1409 sets initially designated for Hebrew. | |
1410 | |
1411 @item ctext | |
1412 @itemx ctext-dos | |
1413 @itemx ctext-mac | |
1414 @itemx ctext-unix | |
1415 | |
1416 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system | |
1417 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character | |
1418 sets initially designated. X11 Compound Text Encoding. Often | |
1419 mistakenly recognized instead of EUC encodings; the usual cause is an | |
1420 inappropriate setting of @code{coding-priority-list}. | |
1421 | |
1422 @item escape-quoted | |
1423 | |
1424 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding | |
1425 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) | |
1426 character sets initially designated and escape quoting. Unix EOL | |
1427 conversion (ie, no conversion). It is used for .ELC files. | |
1428 | |
1429 @item euc-jp | |
1430 @itemx euc-jp-dos | |
1431 @itemx euc-jp-mac | |
1432 @itemx euc-jp-unix | |
1433 | |
1434 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system | |
1435 with @code{ascii} (G0), @code{japanese-jisx0208} (G1), | |
1436 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3) | |
1437 initially designated. Japanese EUC (Extended Unix Code). | |
1438 | |
1439 @item euc-kr | |
1440 @itemx euc-kr-dos | |
1441 @itemx euc-kr-mac | |
1442 @itemx euc-kr-unix | |
1443 | |
1444 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system | |
1445 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
1446 designated. Korean EUC (Extended Unix Code). | |
1447 | |
1448 @item hz-gb-2312 | |
1449 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding | |
1450 system with Unix EOL convention (ie, no conversion) using | |
1451 post-read-decode and pre-write-encode functions to translate the Hz/ZW | |
1452 coding system used for Chinese. | |
1453 | |
1454 @item iso-2022-7bit | |
1455 @itemx iso-2022-7bit-unix | |
1456 @itemx iso-2022-7bit-dos | |
1457 @itemx iso-2022-7bit-mac | |
1458 @itemx iso-2022-7 | |
1459 | |
1460 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system | |
1461 with @code{ascii} (G0) initially designated. Other character sets must | |
1462 be explicitly designated to be used. | |
1463 | |
1464 @item iso-2022-7bit-ss2 | |
1465 @itemx iso-2022-7bit-ss2-dos | |
1466 @itemx iso-2022-7bit-ss2-mac | |
1467 @itemx iso-2022-7bit-ss2-unix | |
1468 | |
1469 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system | |
1470 with @code{ascii} (G0) initially designated. Other character sets must | |
1471 be explicitly designated to be used. SS2 is used to invoke a | |
1472 96-charset, one character at a time. | |
1473 | |
1474 @item iso-2022-8 | |
1475 @itemx iso-2022-8-dos | |
1476 @itemx iso-2022-8-mac | |
1477 @itemx iso-2022-8-unix | |
1478 | |
1479 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system | |
1480 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
1481 designated. Other character sets must be explicitly designated to be | |
1482 used. No single-shift or locking-shift. | |
1483 | |
1484 @item iso-2022-8bit-ss2 | |
1485 @itemx iso-2022-8bit-ss2-dos | |
1486 @itemx iso-2022-8bit-ss2-mac | |
1487 @itemx iso-2022-8bit-ss2-unix | |
1488 | |
1489 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system | |
1490 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially | |
1491 designated. Other character sets must be explicitly designated to be | |
1492 used. SS2 is used to invoke a 96-charset, one character at a time. | |
1493 | |
1494 @item iso-2022-int-1 | |
1495 @itemx iso-2022-int-1-dos | |
1496 @itemx iso-2022-int-1-mac | |
1497 @itemx iso-2022-int-1-unix | |
1498 | |
1499 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system | |
1500 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
1501 designated. ISO-2022-INT-1. | |
1502 | |
1503 @item iso-2022-jp-1978-irv | |
1504 @itemx iso-2022-jp-1978-irv-dos | |
1505 @itemx iso-2022-jp-1978-irv-mac | |
1506 @itemx iso-2022-jp-1978-irv-unix | |
1507 | |
1508 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding | |
1509 system. For compatibility with old Japanese terminals; if you need to | |
1510 know, look at the source. | |
1511 | |
1512 @item iso-2022-jp | |
1513 @itemx iso-2022-jp-2 (ISO7/SS) | |
1514 @itemx iso-2022-jp-dos | |
1515 @itemx iso-2022-jp-mac | |
1516 @itemx iso-2022-jp-unix | |
1517 @itemx iso-2022-jp-2-dos | |
1518 @itemx iso-2022-jp-2-mac | |
1519 @itemx iso-2022-jp-2-unix | |
1520 | |
1521 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding | |
1522 system with @code{ascii} (G0) initially designated, and complex | |
1523 specifications to ensure backward compatibility with old Japanese | |
1524 systems. Used for communication with mail and news in Japan. The "-2" | |
1525 versions also use SS2 to invoke a 96-charset one character at a time. | |
1526 | |
1527 @item iso-2022-kr | |
1528 Modeline indicator: @code{Ko/7bit}. A type @code{iso2022} 7-bit coding | |
1529 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially | |
1530 designated. Used for e-mail in Korea. | |
1531 | |
1532 @item iso-2022-lock | |
1533 @itemx iso-2022-lock-dos | |
1534 @itemx iso-2022-lock-mac | |
1535 @itemx iso-2022-lock-unix | |
1536 | |
1537 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding | |
1538 system with @code{ascii} (G0) initially designated, using Locking-Shift | |
1539 to invoke a 96-charset. | |
1540 | |
1541 @item iso-8859-1 | |
1542 @itemx iso-8859-1-dos | |
1543 @itemx iso-8859-1-mac | |
1544 @itemx iso-8859-1-unix | |
1545 | |
1546 Due to implementation, this is not a type @code{iso2022} coding system, | |
1547 but rather an alias for the @code{raw-text} coding system. | |
1548 | |
1549 @item iso-8859-2 | |
1550 @itemx iso-8859-2-dos | |
1551 @itemx iso-8859-2-mac | |
1552 @itemx iso-8859-2-unix | |
1553 | |
1554 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding | |
1555 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially | |
1556 invoked. | |
1557 | |
1558 @item iso-8859-3 | |
1559 @itemx iso-8859-3-dos | |
1560 @itemx iso-8859-3-mac | |
1561 @itemx iso-8859-3-unix | |
1562 | |
1563 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system | |
1564 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially | |
1565 invoked. | |
1566 | |
1567 @item iso-8859-4 | |
1568 @itemx iso-8859-4-dos | |
1569 @itemx iso-8859-4-mac | |
1570 @itemx iso-8859-4-unix | |
1571 | |
1572 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system | |
1573 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially | |
1574 invoked. | |
1575 | |
1576 @item iso-8859-5 | |
1577 @itemx iso-8859-5-dos | |
1578 @itemx iso-8859-5-mac | |
1579 @itemx iso-8859-5-unix | |
1580 | |
1581 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with | |
1582 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked. | |
1583 | |
1584 @item iso-8859-7 | |
1585 @itemx iso-8859-7-dos | |
1586 @itemx iso-8859-7-mac | |
1587 @itemx iso-8859-7-unix | |
1588 | |
1589 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with | |
1590 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked. | |
1591 | |
1592 @item iso-8859-8 | |
1593 @itemx iso-8859-8-dos | |
1594 @itemx iso-8859-8-mac | |
1595 @itemx iso-8859-8-unix | |
1596 | |
1597 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with | |
1598 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked. | |
1599 | |
1600 @item iso-8859-9 | |
1601 @itemx iso-8859-9-dos | |
1602 @itemx iso-8859-9-mac | |
1603 @itemx iso-8859-9-unix | |
1604 | |
1605 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system | |
1606 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially | |
1607 invoked. | |
1608 | |
1609 @item koi8-r | |
1610 @itemx koi8-r-dos | |
1611 @itemx koi8-r-mac | |
1612 @itemx koi8-r-unix | |
1613 | |
1614 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for | |
1615 KOI8-R, an encoding of the Cyrillic alphabet. | |
1616 | |
1617 @item shift_jis | |
1618 @itemx shift_jis-dos | |
1619 @itemx shift_jis-mac | |
1620 @itemx shift_jis-unix | |
1621 | |
1622 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system | |
1623 implementing the Shift-JIS encoding for Japanese. The underscore in the | |
1624 name conforms to the MIME charset name for this encoding. | |
1625 | |
1626 @item tis-620 | |
1627 @itemx tis-620-dos | |
1628 @itemx tis-620-mac | |
1629 @itemx tis-620-unix | |
1630 | |
1631 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The | |
1632 external encoding is defined by TIS620; the internal encoding is | |
1633 peculiar to MULE and is called @code{thai-xtis}. | |
1634 | |
1635 @item viqr | |
1636 | |
1637 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding | |
1638 system with Unix EOL convention (i.e., no conversion) using | |
1639 post-read-decode and pre-write-encode functions to translate the VIQR | |
1640 coding system for Vietnamese. | |
1641 | |
1642 @item viscii | |
1643 @itemx viscii-dos | |
1644 @itemx viscii-mac | |
1645 @itemx viscii-unix | |
1646 | |
1647 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used | |
1648 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is | |
1649 given priority by XEmacs. | |
1650 | |
1651 @item vscii | |
1652 @itemx vscii-dos | |
1653 @itemx vscii-mac | |
1654 @itemx vscii-unix | |
1655 | |
1656 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used | |
1657 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is | |
1658 given priority by XEmacs. Use | |
1659 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII. | |
1660 | |
1661 @end table | |
1662 | |
1663 @node CCL, Category Tables, Coding Systems, MULE | 1096 @node CCL, Category Tables, Coding Systems, MULE |
1664 @section CCL | 1097 @section CCL |
1665 | 1098 |
1666 CCL (Code Conversion Language) is a simple structured programming | 1099 CCL (Code Conversion Language) is a simple structured programming |
1667 language designed for character coding conversions. A CCL program is | 1100 language designed for character coding conversions. A CCL program is |
1668 compiled to CCL code (represented by a vector of integers) and executed | 1101 compiled to CCL code (represented by a vector of integers) and executed |
1669 by the CCL interpreter embedded in Emacs. The CCL interpreter | 1102 by the CCL interpreter embedded in Emacs. The CCL interpreter |
1670 implements a virtual machine with 8 registers called @code{r0}, ..., | 1103 implements a virtual machine with 8 registers called @code{r0}, ..., |
1671 @code{r7}, a number of control structures, and some I/O operators. Take | 1104 @code{r7}, a number of control structures, and some I/O operators. Take |
1672 care when using registers @code{r0} (used in implicit @dfn{set} | 1105 care when using registers @code{r0} (used in implicit @dfn{set} |
1673 statements) and especially @code{r7} (used internally by several | 1106 statements) and especially @code{r7} (used internally by several |
1674 statements and operations, especially for multiple return values and I/O | 1107 statements and operations, especially for multiple return values and I/O |
1675 operations). | 1108 operations). |
1676 | 1109 |
1677 CCL is used for code conversion during process I/O and file I/O for | 1110 CCL is used for code conversion during process I/O and file I/O for |
1678 non-ISO2022 coding systems. (It is the only way for a user to specify a | 1111 non-ISO2022 coding systems. (It is the only way for a user to specify a |
1679 code conversion function.) It is also used for calculating the code | 1112 code conversion function.) It is also used for calculating the code |
1680 point of an X11 font from a character code. However, since CCL is | 1113 point of an X11 font from a character code. However, since CCL is |
1681 designed as a powerful programming language, it can be used for more | 1114 designed as a powerful programming language, it can be used for more |
1682 generic calculation where efficiency is demanded. A combination of | 1115 generic calculation where efficiency is demanded. A combination of |
1683 three or more arithmetic operations can be calculated faster by CCL than | 1116 three or more arithmetic operations can be calculated faster by CCL than |
1684 by Emacs Lisp. | 1117 by Emacs Lisp. |
1685 | 1118 |
1686 @strong{Warning:} The code in @file{src/mule-ccl.c} and | 1119 @strong{Warning:} The code in @file{src/mule-ccl.c} and |
1687 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive | 1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive |
1688 description of CCL's semantics. The previous version of this section | 1121 description of CCL's semantics. The previous version of this section |
1689 contained several typos and obsolete names left from earlier versions of | 1122 contained several typos and obsolete names left from earlier versions of |
1690 MULE, and many may remain. (I am not an experienced CCL programmer; the | 1123 MULE, and many may remain. (I am not an experienced CCL programmer; the |
1691 few who know CCL well find writing English painful.) | 1124 few who know CCL well find writing English painful.) |
1692 | 1125 |
1693 A CCL program transforms an input data stream into an output data | 1126 A CCL program transforms an input data stream into an output data |
1694 stream. The input stream, held in a buffer of constant bytes, is left | 1127 stream. The input stream, held in a buffer of constant bytes, is left |
1695 unchanged. The buffer may be filled by an external input operation, | 1128 unchanged. The buffer may be filled by an external input operation, |
1696 taken from an Emacs buffer, or taken from a Lisp string. The output | 1129 taken from an Emacs buffer, or taken from a Lisp string. The output |
1697 buffer is a dynamic array of bytes, which can be written by an external | 1130 buffer is a dynamic array of bytes, which can be written by an external |
1698 output operation, inserted into an Emacs buffer, or returned as a Lisp | 1131 output operation, inserted into an Emacs buffer, or returned as a Lisp |
1699 string. | 1132 string. |
1700 | 1133 |
1701 A CCL program is a (Lisp) list containing two or three members. The | 1134 A CCL program is a (Lisp) list containing two or three members. The |
1702 first member is the @dfn{buffer magnification}, which indicates the | 1135 first member is the @dfn{buffer magnification}, which indicates the |
1703 required minimum size of the output buffer as a multiple of the input | 1136 required minimum size of the output buffer as a multiple of the input |
1704 buffer. It is followed by the @dfn{main block} which executes while | 1137 buffer. It is followed by the @dfn{main block} which executes while |
1705 there is input remaining, and an optional @dfn{EOF block} which is | 1138 there is input remaining, and an optional @dfn{EOF block} which is |
1706 executed when the input is exhausted. Both the main block and the EOF | 1139 executed when the input is exhausted. Both the main block and the EOF |
1707 block are CCL blocks. | 1140 block are CCL blocks. |
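
As a minimal sketch of this structure, the following program has a buffer
magnification of 1 and a main block that copies its input to its output one
byte at a time; it has no EOF block. The name @code{my-ccl-copy} is
hypothetical, and the sketch assumes that the @code{define-ccl-program}
macro (not described in this section) is available to compile and register
the program.

```elisp
;; Hypothetical example program; `my-ccl-copy' is not part of XEmacs.
;; Assumes `define-ccl-program' is available to compile and register it.
(define-ccl-program my-ccl-copy
  '(1                        ; buffer magnification: output <= 1 x input
    ((loop                   ; main block: a list containing one statement
      (read r0)              ; read one byte into r0
      (write r0)             ; write it back out unchanged
      (repeat)))))           ; jump back to the head of the loop

;; Run the program on a Lisp string.  The status vector holds the
;; initial values of r0..r7 and IC; nil elements take their defaults.
(ccl-execute-on-string 'my-ccl-copy (make-vector 9 nil) "abc")
```

The call should return a copy of the input string. The program stops when
@code{read} finds no more input.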
1708 | 1141 |
1709 A @dfn{CCL block} is either a CCL statement or list of CCL statements. | 1142 A @dfn{CCL block} is either a CCL statement or list of CCL statements. |
1710 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer | 1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer |
1711 or an @dfn{assignment}, which is a list of a register to receive the | 1144 or an @dfn{assignment}, which is a list of a register to receive the |
1712 assignment, an assignment operator, and an expression) or a @dfn{control | 1145 assignment, an assignment operator, and an expression) or a @dfn{control |
1713 statement} (a list starting with a keyword, whose allowable syntax | 1146 statement} (a list starting with a keyword, whose allowable syntax |
1714 depends on the keyword). | 1147 depends on the keyword). |
1719 * CCL Expressions:: Operators and expressions in CCL. | 1152 * CCL Expressions:: Operators and expressions in CCL. |
1720 * Calling CCL:: Running CCL programs. | 1153 * Calling CCL:: Running CCL programs. |
1721 * CCL Examples:: The encoding functions for Big5 and KOI-8. | 1154 * CCL Examples:: The encoding functions for Big5 and KOI-8. |
1722 @end menu | 1155 @end menu |
1723 | 1156 |
1724 @node CCL Syntax, CCL Statements, , CCL | 1157 @node CCL Syntax, CCL Statements, CCL, CCL |
1725 @comment Node, Next, Previous, Up | 1158 @comment Node, Next, Previous, Up |
1726 @subsection CCL Syntax | 1159 @subsection CCL Syntax |
1727 | 1160 |
1728 The full syntax of a CCL program in BNF notation: | 1161 The full syntax of a CCL program in BNF notation: |
1729 | 1162 |
1730 @format | 1163 @format |
1731 CCL_PROGRAM := | 1164 CCL_PROGRAM := |
1732 (BUFFER_MAGNIFICATION | 1165 (BUFFER_MAGNIFICATION |
1733 CCL_MAIN_BLOCK | 1166 CCL_MAIN_BLOCK |
1782 | 1215 |
1783 @node CCL Statements, CCL Expressions, CCL Syntax, CCL | 1216 @node CCL Statements, CCL Expressions, CCL Syntax, CCL |
1784 @comment Node, Next, Previous, Up | 1217 @comment Node, Next, Previous, Up |
1785 @subsection CCL Statements | 1218 @subsection CCL Statements |
1786 | 1219 |
1787 The Emacs Code Conversion Language provides the following statement | 1220 The Emacs Code Conversion Language provides the following statement |
1788 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, | 1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat}, |
1789 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}. | 1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}. |
1790 | 1223 |
1791 @heading Set statement: | 1224 @heading Set statement: |
1792 | 1225 |
1793 The @dfn{set} statement has three variants with the syntaxes | 1226 The @dfn{set} statement has three variants with the syntaxes |
1794 @samp{(@var{reg} = @var{expression})}, | 1227 @samp{(@var{reg} = @var{expression})}, |
1795 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and | 1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and |
1796 @samp{@var{integer}}. The assignment operator variation of the | 1229 @samp{@var{integer}}. The assignment operator variation of the |
1797 @dfn{set} statement works the same way as the corresponding C expression | 1230 @dfn{set} statement works the same way as the corresponding C expression |
1798 statement does. The assignment operators are @code{+=}, @code{-=}, | 1231 statement does. The assignment operators are @code{+=}, @code{-=}, |
1801 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of | 1234 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of |
1802 the form @code{(r0 = @var{integer})}. | 1235 the form @code{(r0 = @var{integer})}. |
1803 | 1236 |
1804 @heading I/O statements: | 1237 @heading I/O statements: |
1805 | 1238 |
1806 The @dfn{read} statement takes one or more registers as arguments. It | 1239 The @dfn{read} statement takes one or more registers as arguments. It |
1807 reads one byte (a C char) from the input into each register in turn. | 1240 reads one byte (a C char) from the input into each register in turn. |
1808 | 1241 |
1809 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg} | 1242 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg} |
1810 ...)} it takes one or more registers as arguments and writes each in | 1243 ...)} it takes one or more registers as arguments and writes each in |
1811 turn to the output. The integer in a register (interpreted as an | 1244 turn to the output. The integer in a register (interpreted as an |
1812 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the | 1245 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the |
1813 current output buffer. If it is less than 256, it is written as is. | 1246 current output buffer. If it is less than 256, it is written as is. |
1814 The forms @samp{(write @var{expression})} and @samp{(write | 1247 The forms @samp{(write @var{expression})} and @samp{(write |
1818 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes | 1251 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes |
1819 the @var{reg}th element of the @var{array} to the output. | 1252 the @var{reg}th element of the @var{array} to the output. |
1820 | 1253 |
1821 @heading Conditional statements: | 1254 @heading Conditional statements: |
1822 | 1255 |
1823 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and | 1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and |
1824 an optional @var{second CCL block} as arguments. If the | 1257 an optional @var{second CCL block} as arguments. If the |
1825 @var{expression} evaluates to non-zero, the first @var{CCL block} is | 1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is |
1826 executed. Otherwise, if there is a @var{second CCL block}, it is | 1259 executed. Otherwise, if there is a @var{second CCL block}, it is |
1827 executed. | 1260 executed. |
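
A small sketch of nested @code{if} statements: the hypothetical program
@code{my-ccl-upcase} converts lowercase ASCII letters to uppercase and
passes every other byte through unchanged. It assumes that
@code{define-ccl-program} is available.

```elisp
;; Hypothetical example program; assumes `define-ccl-program' is available.
(define-ccl-program my-ccl-upcase
  '(1
    ((loop
      (read r0)
      (if (r0 >= ?a)             ; outer test: at or above `a'
          (if (r0 <= ?z)         ; inner test: at or below `z'
              (r0 -= 32)))       ; ASCII lowercase -> uppercase
      (write r0)
      (repeat)))))

(ccl-execute-on-string 'my-ccl-upcase (make-vector 9 nil) "Hello, world")
```

Neither @code{if} has a second CCL block, so bytes outside the range
@samp{a} to @samp{z} are written unmodified.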
1828 | 1261 |
1829 The @dfn{read-if} variant of the @dfn{if} statement takes an | 1262 The @dfn{read-if} variant of the @dfn{if} statement takes an |
1830 @var{expression}, a @var{CCL block}, and an optional @var{second CCL | 1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL |
1831 block} as arguments. The @var{expression} must have the form | 1264 block} as arguments. The @var{expression} must have the form |
1832 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is | 1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is |
1833 a register or an integer). The @code{read-if} statement first reads | 1266 a register or an integer). The @code{read-if} statement first reads |
1834 from the input into the first register operand in the @var{expression}, | 1267 from the input into the first register operand in the @var{expression}, |
1835 then conditionally executes a CCL block just as the @code{if} statement | 1268 then conditionally executes a CCL block just as the @code{if} statement |
1836 does. | 1269 does. |
1837 | 1270 |
1838 The @dfn{branch} statement takes an @var{expression} and one or more CCL | 1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL |
1839 blocks as arguments. The CCL blocks are treated as a zero-indexed | 1272 blocks as arguments. The CCL blocks are treated as a zero-indexed |
1840 array, and the @code{branch} statement uses the @var{expression} as the | 1273 array, and the @code{branch} statement uses the @var{expression} as the |
1841 index of the CCL block to execute. Null CCL blocks may be used as | 1274 index of the CCL block to execute. Null CCL blocks may be used as |
1842 no-ops, continuing execution with the statement following the | 1275 no-ops, continuing execution with the statement following the |
1843 @code{branch} statement in the containing CCL block. Out-of-range | 1276 @code{branch} statement in the containing CCL block. Out-of-range |
1844 values for the @var{expression} are also treated as no-ops. | 1277 values for the @var{expression} are also treated as no-ops. |
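
The indexing behaviour can be sketched with a program that does no I/O
and is run with @code{ccl-execute} on a vector of eight register values.
The name @code{my-ccl-branch} is hypothetical, and the sketch assumes
@code{define-ccl-program} is available.

```elisp
;; Hypothetical example program; assumes `define-ccl-program' is available.
(define-ccl-program my-ccl-branch
  '(1
    ((branch r0
             (r1 = 10)           ; block 0, chosen when r0 is 0
             (r1 = 20)           ; block 1, chosen when r0 is 1
             (r1 = 30)))))       ; block 2, chosen when r0 is 2

;; r0 = 1 selects block 1, so r1 becomes 20.
(let ((regs (vector 1 0 0 0 0 0 0 0)))
  (ccl-execute 'my-ccl-branch regs)
  (aref regs 1))

;; r0 = 7 is out of range, so no block runs and r1 keeps its value.
(let ((regs (vector 7 99 0 0 0 0 0 0)))
  (ccl-execute 'my-ccl-branch regs)
  (aref regs 1))
```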
1845 | 1278 |
1846 The @dfn{read-branch} variant of the @dfn{branch} statement takes a | 1279 The @dfn{read-branch} variant of the @dfn{branch} statement takes a |
1847 @var{register} and one or more CCL blocks as arguments. The | 1280 @var{register} and one or more CCL blocks as arguments. The |
1848 @code{read-branch} statement first reads from the input into the | 1281 @code{read-branch} statement first reads from the input into the |
1849 @var{register}, then uses its value to select the CCL block to execute, | 1282 @var{register}, then uses its value to select the CCL block to execute, |
1850 just as the @code{branch} statement does. | 1283 just as the @code{branch} statement does. |
1851 | 1284 |
1852 @heading Loop control statements: | 1285 @heading Loop control statements: |
1853 | 1286 |
1854 The @dfn{loop} statement creates a block with an implied jump from the | 1287 The @dfn{loop} statement creates a block with an implied jump from the |
1855 end of the block back to its head. The loop is exited on a @code{break} | 1288 end of the block back to its head. The loop is exited on a @code{break} |
1856 statement, and continued without executing the tail by a @code{repeat} | 1289 statement, and continued without executing the tail by a @code{repeat} |
1857 statement. | 1290 statement. |
1858 | 1291 |
1859 The @dfn{break} statement, written @samp{(break)}, terminates the | 1292 The @dfn{break} statement, written @samp{(break)}, terminates the |
1860 current loop and continues with the next statement in the current | 1293 current loop and continues with the next statement in the current |
1861 block. | 1294 block. |
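
The interplay of @code{loop}, @code{break}, and @code{repeat} can be
sketched as follows. The hypothetical program @code{my-ccl-first-line}
copies bytes until it reads a newline, then leaves the loop; since nothing
follows the loop in the main block, the program then finishes. It assumes
@code{define-ccl-program} is available.

```elisp
;; Hypothetical example program; assumes `define-ccl-program' is available.
(define-ccl-program my-ccl-first-line
  '(1
    ((loop
      (read r0)
      (if (r0 == ?\n)
          (break))               ; leave the loop at the first newline
      (write r0)
      (repeat)))))               ; otherwise continue from the loop head

(ccl-execute-on-string 'my-ccl-first-line (make-vector 9 nil)
                       "first\nsecond")
```

Only the bytes before the first newline should appear in the result.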
1862 | 1295 |
1863 The @dfn{repeat} statement has three variants, @code{repeat}, | 1296 The @dfn{repeat} statement has three variants, @code{repeat}, |
1864 @code{write-repeat}, and @code{write-read-repeat}. Each continues the | 1297 @code{write-repeat}, and @code{write-read-repeat}. Each continues the |
1865 current loop from its head, possibly after performing I/O. | 1298 current loop from its head, possibly after performing I/O. |
1866 @code{repeat} takes no arguments and does no I/O before jumping. | 1299 @code{repeat} takes no arguments and does no I/O before jumping. |
1867 @code{write-repeat} takes a single argument (a register, an | 1300 @code{write-repeat} takes a single argument (a register, an |
1868 integer, or a string), writes it to the output, then jumps. | 1301 integer, or a string), writes it to the output, then jumps. |
1874 @code{write} and @code{read} statements for the semantics of the I/O | 1307 @code{write} and @code{read} statements for the semantics of the I/O |
1875 operations for each type of argument. | 1308 operations for each type of argument. |
1876 | 1309 |
1877 @heading Other control statements: | 1310 @heading Other control statements: |
1878 | 1311 |
1879 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, | 1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})}, |
1880 executes a CCL program as a subroutine. It does not return a value to | 1313 executes a CCL program as a subroutine. It does not return a value to |
1881 the caller, but can modify the register status. | 1314 the caller, but can modify the register status. |
1882 | 1315 |
1883 The @dfn{end} statement, written @samp{(end)}, terminates the CCL | 1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL |
1884 program successfully, and returns to the caller (which may be a CCL | 1317 program successfully, and returns to the caller (which may be a CCL |
1885 program). It does not alter the status of the registers. | 1318 program). It does not alter the status of the registers. |
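
As a sketch of @code{call}, the two hypothetical programs below communicate
only through register @code{r0}: the caller sets it, and the subroutine
modifies it in place. Both names are illustrative, and the sketch assumes
@code{define-ccl-program} is available to register each program under its
name.

```elisp
;; Hypothetical example programs; assume `define-ccl-program' is available.
(define-ccl-program my-ccl-double
  '(1 ((r0 *= 2))))              ; subroutine: double r0 in place

(define-ccl-program my-ccl-main
  '(1 ((r0 = 21)
       (call my-ccl-double))))   ; r0 is shared with the subroutine

(let ((regs (make-vector 8 0)))
  (ccl-execute 'my-ccl-main regs)
  (aref regs 0))                 ; r0 after the subroutine has run
```

The subroutine returns no value; its only effect on the caller is the
change it makes to the registers.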
1886 | 1319 |
1887 @node CCL Expressions, Calling CCL, CCL Statements, CCL | 1320 @node CCL Expressions, Calling CCL, CCL Statements, CCL |
1888 @comment Node, Next, Previous, Up | 1321 @comment Node, Next, Previous, Up |
1889 @subsection CCL Expressions | 1322 @subsection CCL Expressions |
1890 | 1323 |
1891 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions | 1324 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions |
1892 consist of a single @var{operand}, either a register (one of @code{r0}, | 1325 consist of a single @var{operand}, either a register (one of @code{r0}, |
1893 ..., @code{r7}) or an integer. Complex expressions are lists of the | 1326 ..., @code{r7}) or an integer. Complex expressions are lists of the |
1894 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike | 1327 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike |
1895 C, assignments are not expressions. | 1328 C, assignments are not expressions. |
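
A sketch of a nested infix expression, using the hypothetical program name
@code{my-ccl-arith} and assuming @code{define-ccl-program} is available.
The left operand is itself an expression; the right operand is a register.

```elisp
;; Hypothetical example program; assumes `define-ccl-program' is available.
(define-ccl-program my-ccl-arith
  '(1 ((r0 = ((r1 * r2) + r3)))))   ; (EXPRESSION OPERATOR OPERAND)

(let ((regs (vector 0 3 4 5 0 0 0 0)))   ; r1 = 3, r2 = 4, r3 = 5
  (ccl-execute 'my-ccl-arith regs)
  (aref regs 0))                          ; 3 * 4 + 5
```

Evaluating the subexpression may overwrite @code{r7}, so a program written
this way should not keep a live value in that register.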
1896 | 1329 |
1897 In the following table, @var{X} is the target register for a @dfn{set}. | 1330 In the following table, @var{X} is the target register for a @dfn{set}. |
1898 In subexpressions, this is implicitly @code{r7}. This means that | 1331 In subexpressions, this is implicitly @code{r7}. This means that |
1899 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used | 1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used |
1900 freely in subexpressions, since they return parts of their values in | 1333 freely in subexpressions, since they return parts of their values in |
1901 @code{r7}. @var{Y} may be an expression, register, or integer, while | 1334 @code{r7}. @var{Y} may be an expression, register, or integer, while |
1902 @var{Z} must be a register or an integer. | 1335 @var{Z} must be a register or an integer. |
1926 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) | 1359 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z)) |
1927 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) | 1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z)) |
1928 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) | 1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) |
1929 @end multitable | 1362 @end multitable |
1930 | 1363 |
1931 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, | 1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, |
1932 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS | 1365 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS |
1933 and CCL_DECODE_SJIS treat their first and second bytes as the high and | 1366 and CCL_DECODE_SJIS treat their first and second bytes as the high and |
1934 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an | 1367 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an |
1935 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a | 1368 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a |
1936 complicated transformation of the Japanese standard JIS encoding to | 1369 complicated transformation of the Japanese standard JIS encoding to |
1937 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to | 1370 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to |
1938 represent the SJIS operations in infix form. | 1371 represent the SJIS operations in infix form. |
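
The two-result behaviour of CCL_DIVMOD (written @code{//}) can be sketched
as follows: the quotient goes to the target register and the remainder to
@code{r7}. The program name @code{my-ccl-divmod} is hypothetical, and the
sketch assumes @code{define-ccl-program} is available.

```elisp
;; Hypothetical example program; assumes `define-ccl-program' is available.
(define-ccl-program my-ccl-divmod
  '(1 ((r0 = (r1 // 10)))))      ; quotient -> r0, remainder -> r7

(let ((regs (vector 0 47 0 0 0 0 0 0)))   ; r1 = 47
  (ccl-execute 'my-ccl-divmod regs)
  (list (aref regs 0) (aref regs 7)))     ; (quotient remainder)
```

Because the remainder lands in @code{r7}, it must be copied elsewhere
before any later statement that uses @code{r7} internally.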
1939 | 1372 |
1940 @node Calling CCL, CCL Examples, CCL Expressions, CCL | 1373 @node Calling CCL, CCL Examples, CCL Expressions, CCL |
1941 @comment Node, Next, Previous, Up | 1374 @comment Node, Next, Previous, Up |
1942 @subsection Calling CCL | 1375 @subsection Calling CCL |
1943 | 1376 |
1944 CCL programs are called automatically during Emacs buffer I/O when the | 1377 CCL programs are called automatically during Emacs buffer I/O when the |
1945 external representation has a coding system type of @code{shift-jis}, | 1378 external representation has a coding system type of @code{shift-jis}, |
1946 @code{big5}, or @code{ccl}. The program is specified by the coding | 1379 @code{big5}, or @code{ccl}. The program is specified by the coding |
1947 system (@pxref{Coding Systems}). You can also call CCL programs from | 1380 system (@pxref{Coding Systems}). You can also call CCL programs from |
1948 other CCL programs, and from Lisp using these functions: | 1381 other CCL programs, and from Lisp using these functions: |
1949 | 1382 |
1976 of the program. When the program is done, @var{status} is modified (by | 1409 of the program. When the program is done, @var{status} is modified (by |
1977 side-effect) to contain the ending values for the corresponding | 1410 side-effect) to contain the ending values for the corresponding |
1978 registers and IC. Returns the resulting string. | 1411 registers and IC. Returns the resulting string. |
1979 @end defun | 1412 @end defun |
1980 | 1413 |
1981 To call a CCL program from another CCL program, it must first be | 1414 To call a CCL program from another CCL program, it must first be |
1982 registered: | 1415 registered: |
1983 | 1416 |
1984 @defun register-ccl-program name ccl-program | 1417 @defun register-ccl-program name ccl-program |
1985 Register @var{name} for CCL program @var{ccl-program} in | 1418 Register @var{name} for CCL program @var{ccl-program} in |
1986 @code{ccl-program-table}. @var{ccl-program} should be the compiled form | 1419 @code{ccl-program-table}. @var{ccl-program} should be the compiled form |
1987 of a CCL program, or @code{nil}. Returns the index number of the | 1420 of a CCL program, or @code{nil}. Returns the index number of the |
1988 registered CCL program. | 1421 registered CCL program. |
1989 @end defun | 1422 @end defun |
1990 | 1423 |
1991 Information about the processor time used by the CCL interpreter can be | 1424 Information about the processor time used by the CCL interpreter can be |
1992 obtained using these functions: | 1425 obtained using these functions: |
1993 | 1426 |
1994 @defun ccl-elapsed-time | 1427 @defun ccl-elapsed-time |
1995 Returns the elapsed processor time of the CCL interpreter as a cons of | 1428 Returns the elapsed processor time of the CCL interpreter as a cons of |
1996 user and system time, as | 1429 user and system time, as |
2001 | 1434 |
2002 @defun ccl-reset-elapsed-time | 1435 @defun ccl-reset-elapsed-time |
2003 Resets the CCL interpreter's internal elapsed time registers. | 1436 Resets the CCL interpreter's internal elapsed time registers. |
2004 @end defun | 1437 @end defun |
2005 | 1438 |
2006 @node CCL Examples, , Calling CCL, CCL | 1439 @node CCL Examples, , Calling CCL, CCL |
2007 @comment Node, Next, Previous, Up | 1440 @comment Node, Next, Previous, Up |
2008 @subsection CCL Examples | 1441 @subsection CCL Examples |
2009 | 1442 |
2010 This section is not yet written. | 1443 This section is not yet written. |
2011 | 1444 |
2012 @node Category Tables, , CCL, MULE | 1445 @node Category Tables, , CCL, MULE |
2013 @section Category Tables | 1446 @section Category Tables |
2014 | 1447 |
2015 A category table is a type of char table used for keeping track of | 1448 A category table is a type of char table used for keeping track of |
2016 categories. Categories are used for classifying characters for use in | 1449 categories. Categories are used for classifying characters for use in |
2017 regexps---you can refer to a category rather than having to use a | 1450 regexps -- you can refer to a category rather than having to use a |
2018 complicated [] expression (and category lookups are significantly | 1451 complicated [] expression (and category lookups are significantly |
2019 faster). | 1452 faster). |
2020 | 1453 |
2021 There are 95 different categories available, one for each printable | 1454 There are 95 different categories available, one for each printable |
2022 character (including space) in the ASCII charset. Each category is | 1455 character (including space) in the ASCII charset. Each category is |