comparison man/lispref/mule.texi @ 0:376386a54a3c r19-14
Import from CVS: tag r19-14
author | cvs |
---|---|
date | Mon, 13 Aug 2007 08:45:50 +0200 |
parents | |
children | 05472e90ae02 |
-1:000000000000 | 0:376386a54a3c |
---|---|
1 @c -*-texinfo-*- | |
2 @c This is part of the XEmacs Lisp Reference Manual. | |
3 @c Copyright (C) 1996 Ben Wing. | |
4 @c See the file lispref.texi for copying conditions. | |
5 @setfilename ../../info/internationalization.info | |
6 @node MULE, Tips, Internationalization, top | |
7 @chapter MULE | |
8 | |
9 @dfn{MULE} is the name originally given to the version of GNU Emacs | |
10 extended for multi-lingual (and in particular Asian-language) support. | |
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called | |
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for | |
13 ``Japan''), when it only provided support for Japanese. XEmacs | |
14 refers to its multi-lingual support as @dfn{MULE support} since it | |
15 is based on @dfn{MULE}. | |
16 | |
17 @menu | |
18 * Internationalization Terminology:: | |
19 Definition of various internationalization terms. | |
20 * Charsets:: Sets of related characters. | |
21 * MULE Characters:: Working with characters in XEmacs/MULE. | |
22 * Composite Characters:: Making new characters by overstriking other ones. | |
23 * ISO 2022:: An international standard for charsets and encodings. | |
24 * Coding Systems:: Ways of representing a string of chars using integers. | |
25 * CCL:: A special language for writing fast converters. | |
26 * Category Tables:: Subdividing charsets into groups. | |
27 @end menu | |
28 | |
29 @node Internationalization Terminology | |
30 @section Internationalization Terminology | |
31 | |
32 In internationalization terminology, a string of text is divided up | |
33 into @dfn{characters}, which are the printable units that make up the | |
34 text. A single character is (for example) a capital @samp{A}, the | |
35 number @samp{2}, a Katakana character, a Kanji ideograph (an | |
36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese | |
37 Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands | |
38 of such ideographs in each language), etc. The basic property of a | |
39 character is its shape. Note that the same character may be drawn by | |
40 two different people (or in two different fonts) in slightly different | |
41 ways, although the basic shape will be the same. | |
42 | |
43 In some cases, the differences will be significant enough that it is | |
44 actually possible to identify two or more distinct shapes that both | |
45 represent the same character. For example, the lowercase letters | |
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the | |
47 @samp{a} can optionally have a curved tail projecting off the top, and | |
48 the @samp{g} can be formed either of two loops, or of one loop and a | |
49 tail hanging off the bottom. Such distinct possible shapes of a | |
50 character are called @dfn{glyphs}. The important characteristic of two | |
51 glyphs making up the same character is that the choice between one or | |
52 the other is purely stylistic and has no linguistic effect on a word | |
53 (this is the reason why a capital @samp{A} and lowercase @samp{a} | |
54 are different characters rather than different glyphs -- e.g. | |
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). | |
56 | |
57 Note that @dfn{character} and @dfn{glyph} are used differently | |
58 here than elsewhere in XEmacs. | |
59 | |
60 A @dfn{character set} is simply a set of related characters. ASCII, | |
61 for example, is a set of 94 characters (or 128, if you count | |
62 non-printing characters). Other character sets are ISO8859-1 (ASCII | |
63 plus various accented characters and other international symbols), | |
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208 | |
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji), | |
66 GB2312 (Mainland Chinese Hanzi), etc. | |
67 | |
68 Every character set has one or more @dfn{orderings}, which can be | |
69 viewed as a way of assigning a number (or set of numbers) to each | |
70 character in the set. For most character sets, there is a standard | |
71 ordering, and in fact all of the character sets mentioned above define a | |
72 particular ordering. ASCII, for example, places letters in their | |
73 ``natural'' order, puts uppercase letters before lowercase letters, | |
74 numbers before letters, etc. Note that for many of the Asian character | |
75 sets, there is no natural ordering of the characters. The actual | |
76 orderings are based on one or more salient characteristics, of which | |
77 there are many to choose from -- e.g. number of strokes, common | |
78 radicals, phonetic ordering, etc. | |
79 | |
80 The numbers assigned to any particular character are called | |
81 the character's @dfn{position codes}. The number of position codes | |
82 required to index a particular character in a character set is called | |
83 the @dfn{dimension} of the character set. ASCII, being a relatively | |
84 small character set, is of dimension one, and each character in the | |
85 set is indexed using a single position code, in the range 0 through | |
86 127 (if non-printing characters are included) or 33 through 126 | |
87 (if only the printing characters are considered). JISX0208, i.e. | |
88 Japanese Kanji, has thousands of characters, and is of dimension two -- | |
89 every character is indexed by two position codes, each in the range | |
90 33 through 126. (Note that the choice of the range here is somewhat | |
91 arbitrary. Although a character set such as JISX0208 defines an | |
92 @emph{ordering} of all its characters, it does not define the actual | |
93 mapping between numbers and characters. You could just as easily | |
94 index the characters in JISX0208 using numbers in the range 0 through | |
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range | |
96 chosen is so that the position codes match up with the actual values | |
97 used in the common encodings.) | |
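As a rough illustration of position codes and dimension, the following Python sketch recovers the two JISX0208 position codes of a character from its EUC-JP bytes. It assumes Python's standard @code{euc_jp} codec and the EUC convention of setting the high bit of each position code; the helper name @code{jisx0208_position_codes} is made up for this example.

```python
def jisx0208_position_codes(ch):
    """Return the two position codes (each 33..126) of a JISX0208 character."""
    raw = ch.encode("euc_jp")
    # Two bytes starting with 0x8E are half-width Katakana, not JISX0208.
    if len(raw) != 2 or raw[0] == 0x8E:
        raise ValueError("not a JISX0208 character: %r" % ch)
    # EUC-JP stores each position code with the high bit set; clear it.
    return tuple(b & 0x7F for b in raw)

# HIRAGANA LETTER A: dimension two, so two position codes.
print(jisx0208_position_codes("\u3042"))   # (36, 34)
# ASCII has dimension one: the single position code equals the byte value.
print(ord("A"))                            # 65
```

The values 36 and 34 fall in the range 33 through 126, matching the range quoted above for JISX0208.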
98 | |
99 An @dfn{encoding} is a way of numerically representing characters from | |
100 one or more character sets as a stream of like-sized numerical values | |
101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit | |
102 quantities. If an encoding encompasses only one character set, then the | |
103 position codes for the characters in that character set could be used | |
104 directly. (This is the case with ASCII, and as a result, most people do | |
105 not understand the difference between a character set and an encoding.) | |
106 This is not possible, however, if more than one character set is to be | |
107 used in the encoding. For example, printed Japanese text typically | |
108 requires characters from multiple character sets -- ASCII, JISX0208, and | |
109 JISX0212, to be specific. Each of these is indexed using one or more | |
110 position codes in the range 33 through 126, so the position codes could | |
111 not be used directly or there would be no way to tell which character | |
112 was meant. Different Japanese encodings handle this differently -- JIS | |
113 uses special escape characters to denote different character sets; EUC | |
114 sets the high bit of the position codes for JISX0208 and JISX0212, and | |
115 puts a special extra byte before each JISX0212 character; etc. (JIS, | |
116 EUC, and most of the other encodings you will encounter are 7-bit or | |
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; | |
118 this strives to represent all the world's characters in a single large | |
119 character set. 32-bit encodings are generally used internally in | |
120 programs to simplify the code that manipulates them; however, they are | |
121 not much used externally because they are not very space-efficient.) | |
122 | |
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In | |
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in, | |
125 and the interpretation of the values in the stream depends on the | |
126 current global state of the encoding. Special values in the encoding, | |
127 called @dfn{escape sequences}, are used to change the global state. | |
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B} | |
129 indicate that, from then on, bytes are to be interpreted as position | |
130 codes for JISX0208, rather than as ASCII. This effect is cancelled | |
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the | |
132 current state is to ASCII''. To switch to JISX0212, the escape sequence | |
133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do | |
134 in fact begin with @samp{ESC}. This is not necessarily the case, | |
135 however.) | |
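The state changes described above can be observed with a short Python sketch. It assumes that Python's standard @code{iso2022_jp} codec is an acceptable stand-in for the JIS encoding discussed here; the constant names are invented for the example.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"   # ESC $ B: following bytes are JISX0208 position codes
TO_ASCII = ESC + b"(B"      # ESC ( B: switch back to ASCII

# LATIN CAPITAL A, HIRAGANA LETTER A, LATIN CAPITAL B
encoded = "A\u3042B".encode("iso2022_jp")
expected = b"A" + TO_JISX0208 + b"\x24\x22" + TO_ASCII + b"B"
print(encoded == expected)

# The byte pair 0x24 0x22 is read differently in each state.
as_ascii = b"\x24\x22".decode("ascii")
as_jisx0208 = (TO_JISX0208 + b"\x24\x22" + TO_ASCII).decode("iso2022_jp")
print(repr(as_ascii), repr(as_jisx0208))
```

The same two bytes decode to two ASCII punctuation characters in one state and to a single Hiragana character in the other, which is exactly what makes the encoding modal.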
136 | |
137 A @dfn{non-modal encoding} has no global state that extends past the | |
138 character currently being interpreted. EUC, for example, is a | |
139 non-modal encoding. Characters in JISX0208 are encoded by setting | |
140 the high bit of the position codes, and characters in JISX0212 are | |
141 encoded by doing the same but also prefixing the character with the | |
142 byte 0x8F. | |
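The EUC rules just described are simple enough to write down directly. The following Python sketch is only a model of those rules, not code from XEmacs; the function name and the charset labels are hypothetical.

```python
def euc_jp_bytes(charset, position_codes):
    """Encode one character, given its charset label and position codes."""
    if charset == "ascii":
        (code,) = position_codes
        return bytes([code])
    # Set the high bit of every position code.
    high = bytes(code | 0x80 for code in position_codes)
    if charset == "jisx0208":
        return high
    if charset == "jisx0212":
        # JISX0212 characters carry the extra prefix byte 0x8F.
        return b"\x8f" + high
    raise ValueError("unsupported charset: %s" % charset)

print(euc_jp_bytes("ascii", (0x41,)))                   # b'A'
print(euc_jp_bytes("jisx0208", (0x24, 0x22)).hex())     # a4a2
print(euc_jp_bytes("jisx0212", (0x30, 0x21)).hex())     # 8fb0a1
```

Because every byte of a non-ASCII character has its high bit set, no state needs to be carried from one character to the next.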
143 | |
144 The advantage of a modal encoding is that it is generally more | |
145 space-efficient, and is easily extendable because there are essentially | |
146 an arbitrary number of escape sequences that can be created. The | |
147 disadvantage, however, is that it is much more difficult to work with | |
148 if it is not being processed in a sequential manner. In the non-modal | |
149 EUC encoding, for example, the byte 0x41 always refers to the letter | |
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or | |
151 one of the two position codes in a JISX0208 character, or one of the | |
152 two position codes in a JISX0212 character. Determining exactly which | |
153 one is meant could be difficult and time-consuming if the previous | |
154 bytes in the string have not already been processed. | |
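To make the cost of modality concrete, here is a small Python sketch that labels each byte of a JIS stream with the charset in force when it is read. It recognizes only the three escape sequences mentioned earlier in this section, and the function name is hypothetical.

```python
ESC = 0x1B

def classify_jis(data):
    """Return a list of (byte, charset) pairs for the non-escape bytes."""
    sequences = {b"$B": "jisx0208", b"(B": "ascii", b"$(D": "jisx0212"}
    state = "ascii"
    result = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for seq, charset in sequences.items():
                if data.startswith(seq, i + 1):
                    state = charset
                    i += 1 + len(seq)
                    break
            else:
                raise ValueError("unknown escape sequence at offset %d" % i)
            continue
        result.append((data[i], state))
        i += 1
    return result

stream = b"A\x1b$B0A\x1b(BA"
for byte, charset in classify_jis(stream):
    print(hex(byte), charset)
```

The byte 0x41 appears three times in @code{stream}, once as half of a JISX0208 character and twice as the letter @samp{A}; only a scan from the start of the stream can tell them apart.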
155 | |
156 Non-modal encodings are further divided into @dfn{fixed-width} and | |
157 @dfn{variable-width} formats. A fixed-width encoding always uses | |
158 the same number of words per character, whereas a variable-width | |
159 encoding does not. EUC is a good example of a variable-width | |
160 encoding: one to three bytes are used per character, depending on | |
161 the character set. 16-bit and 32-bit encodings are nearly always | |
162 fixed-width, and this is in fact one of the main reasons for using | |
163 an encoding with a larger word size. A fixed-width encoding lets a | |
164 program find any character by simple arithmetic, without scanning the text | |
165 before it. Variable-width encodings are generally more space-efficient and allow | |
166 for compatibility with existing 8-bit encodings such as ASCII. | |
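The difference in width can be measured with Python's standard codecs. In this sketch @code{utf_32_be} merely stands in for ``some fixed-width 32-bit encoding'' (it is not named in the text above), and U+4E02 is assumed to be a JISX0212 character.

```python
samples = [
    ("ASCII letter", "A"),
    ("JISX0208 hiragana", "\u3042"),
    ("JISX0212 kanji", "\u4e02"),
]

# Bytes per character in a variable-width and in a fixed-width encoding.
euc_widths = [len(ch.encode("euc_jp")) for _, ch in samples]
fixed_widths = [len(ch.encode("utf_32_be")) for _, ch in samples]

for (label, _), euc, fixed in zip(samples, euc_widths, fixed_widths):
    print("%-18s euc_jp=%d  utf_32_be=%d" % (label, euc, fixed))
```

EUC spends between one and three bytes per character, while the fixed-width encoding always spends four.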
167 | |
168 Note that the bytes in an 8-bit encoding are often referred to | |
169 as @dfn{octets} rather than simply as bytes. This terminology | |
170 dates back to the days before 8-bit bytes were universal, when | |
171 some computers had 9-bit bytes, others had 10-bit bytes, etc. | |
172 | |
173 @node Charsets | |
174 @section Charsets | |
175 | |
176 A @dfn{charset} in MULE is an object that encapsulates a | |
177 particular character set as well as an ordering of those characters. | |
178 Charsets are permanent objects and are named using symbols, like | |
179 faces. | |
180 | |
181 @defun charsetp object | |
182 This function returns non-@code{nil} if @var{object} is a charset. | |
183 @end defun | |
184 | |
185 @menu | |
186 * Charset Properties:: Properties of a charset. | |
187 * Basic Charset Functions:: Functions for working with charsets. | |
188 * Charset Property Functions:: Functions for accessing charset properties. | |
189 * Predefined Charsets:: Predefined charset objects. | |
190 @end menu | |
191 | |
192 @node Charset Properties | |
193 @subsection Charset Properties | |
194 | |
195 Charsets have the following properties: | |
196 | |
197 @table @code | |
198 @item name | |
199 A symbol naming the charset. Every charset must have a different name; | |
200 this allows a charset to be referred to using its name rather than | |
201 the actual charset object. | |
202 @item doc-string | |
203 A documentation string describing the charset. | |
204 @item registry | |
205 A regular expression matching the font registry field for this character | |
206 set. For example, both the @code{ascii} and @code{latin-1} charsets | |
207 use the registry @code{"ISO8859-1"}. This field is used to choose | |
208 an appropriate font when the user gives a general font specification | |
209 such as @samp{-*-courier-medium-r-*-140-*}, i.e. a 14-point upright | |
210 medium-weight Courier font. | |
211 @item dimension | |
212 Number of position codes used to index a character in the character set. | |
213 XEmacs/MULE can only handle character sets of dimension 1 or 2. | |
214 This property defaults to 1. | |
215 @item chars | |
216 Number of characters in each dimension. In XEmacs/MULE, the only | |
217 allowed values are 94 or 96. (There are a couple of pre-defined | |
218 character sets, such as ASCII, that do not follow this, but you cannot | |
219 define new ones like this.) Defaults to 94. Note that if the dimension | |
220 is 2, the character set thus described is 94x94 or 96x96. | |
221 @item columns | |
222 Number of columns used to display a character in this charset. | |
223 Only used in TTY mode. (Under X, the actual width of a character | |
224 can be derived from the font used to display the characters.) | |
225 If unspecified, defaults to the dimension. (This is almost | |
226 always the correct value, because character sets with dimension 2 | |
227 are usually ideograph character sets, which need two columns to | |
228 display the intricate ideographs.) | |
229 @item direction | |
230 A symbol, either @code{l2r} (left-to-right) or @code{r2l} | |
231 (right-to-left). Defaults to @code{l2r}. This specifies the | |
232 direction that the text should be displayed in, and will be | |
233 left-to-right for most charsets but right-to-left for Hebrew | |
234 and Arabic. (Right-to-left display is not currently implemented.) | |
235 @item final | |
236 Final byte of the standard ISO 2022 escape sequence designating this | |
237 charset. Must be supplied. Each combination of (@var{dimension}, | |
238 @var{chars}) defines a separate namespace for final bytes, and each | |
239 charset within a particular namespace must have a different final byte. | |
240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if | |
241 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final | |
242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not | |
243 official) character sets. For more information on ISO 2022, see @ref{Coding | |
244 Systems}. | |
245 @item graphic | |
246 0 (use left half of font on output) or 1 (use right half of font on | |
247 output). Defaults to 0. This specifies how to convert the position | |
248 codes that index a character in a character set into an index into the | |
249 font used to display the character set. With @code{graphic} set to 0, | |
250 position codes 33 through 126 map to font indices 33 through 126; with | |
251 it set to 1, position codes 33 through 126 map to font indices 161 | |
252 through 254 (i.e. the same number but with the high bit set). For | |
253 example, for a font whose registry is ISO8859-1, the left half of the | |
254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the | |
255 right half (octets 0xA0 - 0xFF) is the @code{latin-1} charset. | |
256 @item ccl-program | |
257 A compiled CCL program used to convert a character in this charset into | |
258 an index into the font. This is in addition to the @code{graphic} | |
259 property. If a CCL program is defined, the position codes of a | |
260 character will first be processed according to @code{graphic} and | |
261 then passed through the CCL program, with the resulting values used | |
262 to index the font. | |
263 | |
264 This is used, for example, in the Big5 character set (used in Taiwan). | |
265 This character set is not ISO-2022-compliant, and its size (94x157) does | |
266 not fit within the maximum 96x96 size of ISO-2022-compliant character | |
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, | |
268 so as to group the most commonly used characters together) into two | |
269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94, | |
270 and each charset object uses a CCL program to convert the modified | |
271 position codes back into standard Big5 indices to retrieve a character | |
272 from a Big5 font. | |
273 @end table | |
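The effect of the @code{graphic} property can be summarized in a few lines of Python. This is a sketch of the mapping as described above, with a hypothetical function name; it does not model the optional CCL program.

```python
def font_index(position_code, graphic=0):
    """Map one position code to a font index according to `graphic`."""
    if not 0x20 <= position_code <= 0x7F:
        raise ValueError("position code out of range: %d" % position_code)
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    # graphic 1 selects the right half of the font by setting the high bit.
    return position_code | 0x80 if graphic == 1 else position_code

print(font_index(33, 0), font_index(126, 0))   # 33 126
print(font_index(33, 1), font_index(126, 1))   # 161 254
```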
274 | |
275 Most of the above properties can only be set when the charset | |
276 is created. @xref{Charset Property Functions}. | |
277 | |
278 @node Basic Charset Functions | |
279 @subsection Basic Charset Functions | |
280 | |
281 @defun find-charset charset-or-name | |
282 This function retrieves the charset of the given name. If | |
283 @var{charset-or-name} is a charset object, it is simply returned. | |
284 Otherwise, @var{charset-or-name} should be a symbol. If there is no | |
285 such charset, @code{nil} is returned. Otherwise the associated charset | |
286 object is returned. | |
287 @end defun | |
288 | |
289 @defun get-charset name | |
290 This function retrieves the charset of the given name. Same as | |
291 @code{find-charset} except an error is signalled if there is no such | |
292 charset instead of returning @code{nil}. | |
293 @end defun | |
294 | |
295 @defun charset-list | |
296 This function returns a list of the names of all defined charsets. | |
297 @end defun | |
298 | |
299 @defun make-charset name doc-string props | |
300 This function defines a new character set. This function is for use | |
301 with Mule support. @var{name} is a symbol, the name by which the | |
302 character set is normally referred. @var{doc-string} is a string | |
303 describing the character set. @var{props} is a property list, | |
304 describing the specific nature of the character set. The recognized | |
305 properties are @code{registry}, @code{dimension}, @code{columns}, | |
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and | |
307 @code{ccl-program}, as previously described. | |
308 @end defun | |
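The property list accepted by @code{make-charset}, together with the defaults listed under Charset Properties, can be modeled as a small record. The Python class below is a hypothetical model for illustration only; it is not how XEmacs stores charsets, and the example charset name is invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CharsetProps:
    name: str
    doc_string: str
    final: str                        # must be supplied
    registry: str = ""
    dimension: int = 1                # defaults to 1
    chars: int = 94                   # defaults to 94
    columns: Optional[int] = None     # defaults to the dimension
    direction: str = "l2r"            # defaults to l2r
    graphic: int = 0                  # defaults to 0
    ccl_program: Optional[object] = None

    def __post_init__(self):
        if self.dimension not in (1, 2):
            raise ValueError("dimension must be 1 or 2")
        if self.chars not in (94, 96):
            raise ValueError("chars must be 94 or 96")
        if self.direction not in ("l2r", "r2l"):
            raise ValueError("direction must be l2r or r2l")
        if self.graphic not in (0, 1):
            raise ValueError("graphic must be 0 or 1")
        if self.columns is None:
            self.columns = self.dimension

props = CharsetProps("example-kanji", "Hypothetical 94x94 charset",
                     final="0", dimension=2)
print(props.chars, props.columns, props.direction, props.graphic)
```

A final byte in the range 0x30 - 0x3F is used in the example because that range is reserved for user-defined character sets.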
309 | |
310 @defun make-reverse-direction-charset charset new-name | |
311 This function makes a charset equivalent to @var{charset} but which goes | |
312 in the opposite direction. @var{new-name} is the name of the new | |
313 charset. The new charset is returned. | |
314 @end defun | |
315 | |
316 @defun charset-from-attributes dimension chars final &optional direction | |
317 This function returns a charset with the given @var{dimension}, | |
318 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is | |
319 omitted, both directions will be checked (left-to-right will be returned | |
320 if character sets exist for both directions). | |
321 @end defun | |
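The lookup rule for an omitted @var{direction} can be sketched as follows. The table holds a few entries taken from the lists of predefined charsets; the Python function is a model only, it returns a name rather than a charset object, and returning @code{None} when nothing matches is an assumption of this sketch.

```python
# Keys are (dimension, chars, final, direction).
CHARSETS = {
    (1, 94, "B", "l2r"): "ascii",
    (1, 94, "B", "r2l"): "ascii-r2l",
    (1, 96, "H", "r2l"): "hebrew",
    (2, 94, "B", "l2r"): "japanese",
}

def charset_from_attributes(dimension, chars, final, direction=None):
    # With no direction given, check both, preferring left-to-right.
    directions = ("l2r", "r2l") if direction is None else (direction,)
    for d in directions:
        name = CHARSETS.get((dimension, chars, final, d))
        if name is not None:
            return name
    return None

print(charset_from_attributes(1, 94, "B"))          # ascii
print(charset_from_attributes(1, 94, "B", "r2l"))   # ascii-r2l
print(charset_from_attributes(1, 96, "H"))          # hebrew
```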
322 | |
323 @defun charset-reverse-direction-charset charset | |
324 This function returns the charset (if any) with the same dimension, | |
325 number of characters, and final byte as @var{charset}, but which is | |
326 displayed in the opposite direction. | |
327 @end defun | |
328 | |
329 @node Charset Property Functions | |
330 @subsection Charset Property Functions | |
331 | |
332 All of these functions accept either a charset name or charset object. | |
333 | |
334 @defun charset-property charset prop | |
335 This function returns property @var{prop} of @var{charset}. | |
336 @xref{Charset Properties}. | |
337 @end defun | |
338 | |
339 Convenience functions are also provided for retrieving individual | |
340 properties of a charset. | |
341 | |
342 @defun charset-name charset | |
343 This function returns the name of @var{charset}. This will be a symbol. | |
344 @end defun | |
345 | |
346 @defun charset-doc-string charset | |
347 This function returns the doc string of @var{charset}. | |
348 @end defun | |
349 | |
350 @defun charset-registry charset | |
351 This function returns the registry of @var{charset}. | |
352 @end defun | |
353 | |
354 @defun charset-dimension charset | |
355 This function returns the dimension of @var{charset}. | |
356 @end defun | |
357 | |
358 @defun charset-chars charset | |
359 This function returns the number of characters per dimension of | |
360 @var{charset}. | |
361 @end defun | |
362 | |
363 @defun charset-columns charset | |
364 This function returns the number of display columns per character (in | |
365 TTY mode) of @var{charset}. | |
366 @end defun | |
367 | |
368 @defun charset-direction charset | |
369 This function returns the display direction of @var{charset} -- either | |
370 @code{l2r} or @code{r2l}. | |
371 @end defun | |
372 | |
373 @defun charset-final charset | |
374 This function returns the final byte of the ISO 2022 escape sequence | |
375 designating @var{charset}. | |
376 @end defun | |
377 | |
378 @defun charset-graphic charset | |
379 This function returns either 0 or 1, depending on whether the position | |
380 codes of characters in @var{charset} map to the left or right half | |
381 of their font, respectively. | |
382 @end defun | |
383 | |
384 @defun charset-ccl-program charset | |
385 This function returns the CCL program, if any, for converting | |
386 position codes of characters in @var{charset} into font indices. | |
387 @end defun | |
388 | |
389 The only property of a charset that can currently be set after | |
390 the charset has been created is the CCL program. | |
391 | |
392 @defun set-charset-ccl-program charset ccl-program | |
393 This function sets the @code{ccl-program} property of @var{charset} to | |
394 @var{ccl-program}. | |
395 @end defun | |
396 | |
397 @node Predefined Charsets | |
398 @subsection Predefined Charsets | |
399 | |
400 The following charsets are predefined in the C code. | |
401 | |
402 @example | |
403 Name           Doc String           Type  Fi Gr Dir Registry | |
404 -------------------------------------------------------------- | |
405 ascii          ASCII                94    B  0  l2r ISO8859-1 | |
406 control-1      Control characters   94       0  l2r --- | |
407 latin-1        Latin-1              96    A  1  l2r ISO8859-1 | |
408 latin-2        Latin-2              96    B  1  l2r ISO8859-2 | |
409 latin-3        Latin-3              96    C  1  l2r ISO8859-3 | |
410 latin-4        Latin-4              96    D  1  l2r ISO8859-4 | |
411 cyrillic       Cyrillic             96    L  1  l2r ISO8859-5 | |
412 arabic         Arabic               96    G  1  r2l ISO8859-6 | |
413 greek          Greek                96    F  1  l2r ISO8859-7 | |
414 hebrew         Hebrew               96    H  1  r2l ISO8859-8 | |
415 latin-5        Latin-5              96    M  1  l2r ISO8859-9 | |
416 thai           Thai                 96    T  1  l2r TIS620 | |
417 japanese-kana  Japanese Katakana    94    I  1  l2r JISX0201.1976 | |
418 japanese-roman Japanese Roman       94    J  0  l2r JISX0201.1976 | |
419 japanese-old   Japanese Old         94x94 @@ 0  l2r JISX0208.1978 | |
420 chinese-gb     Chinese GB           94x94 A  0  l2r GB2312 | |
421 japanese       Japanese             94x94 B  0  l2r JISX0208.19(83|90) | |
422 korean         Korean               94x94 C  0  l2r KSC5601 | |
423 japanese-2     Japanese Supplement  94x94 D  0  l2r JISX0212 | |
424 chinese-cns-1  Chinese CNS Plane 1  94x94 G  0  l2r CNS11643.1 | |
425 chinese-cns-2  Chinese CNS Plane 2  94x94 H  0  l2r CNS11643.2 | |
426 chinese-big5-1 Chinese Big5 Level 1 94x94 0  0  l2r Big5 | |
427 chinese-big5-2 Chinese Big5 Level 2 94x94 1  0  l2r Big5 | |
428 composite      Composite            96x96    0  l2r --- | |
429 @end example | |
430 | |
431 The following charsets are predefined in the Lisp code. | |
432 | |
433 @example | |
434 Name           Doc String           Type  Fi Gr Dir Registry | |
435 -------------------------------------------------------------- | |
436 arabic-0       Arabic digits        94    2  0  l2r MuleArabic-0 | |
437 arabic-1       one-column Arabic    94    3  0  r2l MuleArabic-1 | |
438 arabic-2       one-column Arabic    94    4  0  r2l MuleArabic-2 | |
439 sisheng        PinYin-ZhuYin        94    0  0  l2r sisheng_cwnn\| | |
440                                                     OMRON_UDC_ZH | |
441 chinese-cns-3  Chinese CNS Plane 3  94x94 I  0  l2r CNS11643.1 | |
442 chinese-cns-4  Chinese CNS Plane 4  94x94 J  0  l2r CNS11643.1 | |
443 chinese-cns-5  Chinese CNS Plane 5  94x94 K  0  l2r CNS11643.1 | |
444 chinese-cns-6  Chinese CNS Plane 6  94x94 L  0  l2r CNS11643.1 | |
445 chinese-cns-7  Chinese CNS Plane 7  94x94 M  0  l2r CNS11643.1 | |
446 ethiopic       Ethiopic             94x94 2  0  l2r Ethio | |
447 ascii-r2l      Right-to-Left ASCII  94    B  0  r2l ISO8859-1 | |
448 ipa            IPA for Mule         96    0  1  l2r MuleIPA | |
449 vietnamese-1   VISCII lower         96    1  1  l2r VISCII1.1 | |
450 vietnamese-2   VISCII upper         96    2  1  l2r VISCII1.1 | |
451 @end example | |
452 | |
453 For all of the above charsets, the dimension and number of columns are | |
454 the same. | |
455 | |
456 Note that ASCII, Control-1, and Composite are handled specially. | |
457 This is why some of the fields are blank; and some of the filled-in | |
458 fields (e.g. the type) are not really accurate. | |
459 | |
460 @node MULE Characters | |
461 @section MULE Characters | |
462 | |
463 @defun make-char charset arg1 &optional arg2 | |
464 This function makes a multi-byte character from @var{charset} and octets | |
465 @var{arg1} and @var{arg2}. | |
466 @end defun | |
467 | |
468 @defun char-charset ch | |
469 This function returns the character set of char @var{ch}. | |
470 @end defun | |
471 | |
472 @defun char-octet ch &optional n | |
473 This function returns the octet (i.e. position code) numbered @var{n} | |
474 (should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted. | |
475 @end defun | |
476 | |
477 @defun charsets-in-region start end &optional buffer | |
478 This function returns a list of the charsets in the region between | |
479 @var{start} and @var{end}. @var{buffer} defaults to the current buffer | |
480 if omitted. | |
481 @end defun | |
482 | |
483 @defun charsets-in-string string | |
484 This function returns a list of the charsets in @var{string}. | |
485 @end defun | |
486 | |
487 @node Composite Characters | |
488 @section Composite Characters | |
489 | |
490 Composite characters are not yet completely implemented. | |
491 | |
492 @defun make-composite-char string | |
493 This function converts a string into a single composite character. The | |
494 character is the result of overstriking all the characters in the | |
495 string. | |
496 @end defun | |
497 | |
498 @defun composite-char-string ch | |
499 This function returns a string of the characters comprising a composite | |
500 character. | |
501 @end defun | |
502 | |
503 @defun compose-region start end &optional buffer | |
504 This function composes the characters in the region from @var{start} to | |
505 @var{end} in @var{buffer} into one composite character. The composite | |
506 character replaces the composed characters. @var{buffer} defaults to | |
507 the current buffer if omitted. | |
508 @end defun | |
509 | |
510 @defun decompose-region start end &optional buffer | |
511 This function decomposes any composite characters in the region from | |
512 @var{start} to @var{end} in @var{buffer}. This converts each composite | |
513 character into one or more characters, the individual characters out of | |
514 which the composite character was formed. Non-composite characters are | |
515 left as-is. @var{buffer} defaults to the current buffer if omitted. | |
516 @end defun | |
517 | |
518 @node ISO 2022 | |
519 @section ISO 2022 | |
520 | |
521 This section briefly describes the ISO2022 encoding standard. For a more | |
522 thorough understanding, refer to the text of the standard | |
523 itself. | |
524 | |
525 Character sets (@dfn{charsets}) are classified into the following four | |
526 categories, according to the number of characters in each charset: | |
527 94-charset, 96-charset, 94x94-charset, and 96x96-charset. | |
528 | |
529 @need 1000 | |
530 @table @asis | |
531 @item 94-charset | |
532 ASCII(B), left(J) and right(I) half of JISX0201, ... | |
533 @item 96-charset | |
534 Latin-1(A), Latin-2(B), Latin-3(C), ... | |
535 @item 94x94-charset | |
536 GB2312(A), JISX0208(B), KSC5601(C), ... | |
537 @item 96x96-charset | |
538 none for the moment | |
539 @end table | |
540 | |
541 The character in parentheses after the name of each charset | |
542 is the @dfn{final character} @var{F}, which can be regarded as | |
543 the identifier of the charset. ECMA allocates @var{F} to each | |
544 charset. @var{F} is in the range of 0x30..0x7E, but 0x30..0x3F | |
545 are only for private use. | |
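A check on a candidate final character might look like the Python sketch below. It uses the ranges given for the @code{final} charset property (0x30 - 0x7E for dimension 1, 0x30 - 0x5F for dimension 2, with 0x30 - 0x3F reserved for private use); the function name and the returned labels are hypothetical.

```python
def check_final(final, dimension=1):
    """Classify a final character as 'private' or 'registered'."""
    code = ord(final)
    upper = 0x7E if dimension == 1 else 0x5F
    if not 0x30 <= code <= upper:
        raise ValueError("final byte 0x%02X not allowed for dimension %d"
                         % (code, dimension))
    # 0x30..0x3F is set aside for user-defined character sets.
    return "private" if code <= 0x3F else "registered"

print(check_final("B"))        # registered
print(check_final("0"))        # private
print(check_final("@", 2))     # registered
```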
546 | |
547 Note: @dfn{ECMA} = European Computer Manufacturers Association | |
548 | |
549 There are four @dfn{registers of charsets}, called G0 thru G3. | |
550 You can designate (or assign) any charset to one of these | |
551 registers. | |
552 | |
553 The code space contained within one octet (of size 256) is divided into | |
554 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a | |
555 register of charsets can be invoked. | |
556 | |
557 @example | |
558 @group | |
559 C0: 0x00 - 0x1F | |
560 GL: 0x20 - 0x7F | |
561 C1: 0x80 - 0x9F | |
562 GR: 0xA0 - 0xFF | |
563 @end group | |
564 @end example | |
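The division of the code space shown above translates directly into a classification function. This Python sketch uses a hypothetical function name.

```python
def code_area(octet):
    """Return the ISO2022 code area (C0, GL, C1 or GR) that an octet falls in."""
    if not 0x00 <= octet <= 0xFF:
        raise ValueError("not an octet: %r" % octet)
    if octet <= 0x1F:
        return "C0"
    if octet <= 0x7F:
        return "GL"
    if octet <= 0x9F:
        return "C1"
    return "GR"

for octet in (0x1B, 0x41, 0x8F, 0xA4):
    print(hex(octet), code_area(octet))
```

ESC (0x1B) falls in C0 and the letter @samp{A} (0x41) in GL, while 0x8F falls in C1 and 0xA4 in GR.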
565 | |
566 Usually, in the initial state, G0 is invoked into GL, and G1 | |
567 is invoked into GR. | |
568 | |
569 ISO2022 distinguishes 7-bit environments and 8-bit | |
570 environments. In 7-bit environments, only C0 and GL are used. | |
571 | |
572 Charset designation is done by escape sequences of the form: | |
573 | |
574 @example | |
575 ESC [@var{I}] @var{I} @var{F} | |
576 @end example | |
577 | |
578 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and | |
579 @var{F} is the final character identifying this charset. | |
580 | |
581 The meanings of the intermediate characters are: | |
582 | |
583 @example | |
584 @group | |
585 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). | |
586 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | |
587 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | |
588 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | |
589 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | |
590 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. | |
591 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | |
592 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | |
594 @end group | |
595 @end example | |
596 | |
597 The following rule is not allowed in ISO2022 but can be used | |
598 in Mule. | |
599 | |
600 @example | |
601 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. | |
602 @end example | |
603 | |
604 Here are examples of designations: | |
605 | |
606 @example | |
607 @group | |
608 ESC ( B : designate to G0 ASCII | |
609 ESC - A : designate to G1 Latin-1 | |
610 ESC $ ( A or ESC $ A : designate to G0 GB2312 | |
611 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | |
612 ESC $ ) C : designate to G1 KSC5601 | |
613 @end group | |
614 @end example | |
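The designation sequences listed above can be generated mechanically from the table of intermediate characters. The Python sketch below covers only the standard long forms: it produces neither the short forms @samp{ESC $ A} and @samp{ESC $ B} nor the Mule-only @samp{,} intermediate. The function and table names are hypothetical.

```python
ESC = b"\x1b"

# (chars, register number) -> intermediate character
INTERMEDIATE = {
    (94, 0): b"(", (94, 1): b")", (94, 2): b"*", (94, 3): b"+",
    (96, 1): b"-", (96, 2): b".", (96, 3): b"/",
}

def designate(register, chars, dimension, final):
    """Build the escape sequence designating a charset to G0..G3."""
    try:
        inter = INTERMEDIATE[(chars, register)]
    except KeyError:
        raise ValueError("cannot designate a %d-charset to G%d"
                         % (chars, register))
    # Charsets of dimension 2 add the intermediate character '$'.
    prefix = b"$" if dimension == 2 else b""
    return ESC + prefix + inter + final.encode("ascii")

print(designate(0, 94, 1, "B"))   # ASCII to G0
print(designate(1, 96, 1, "A"))   # Latin-1 to G1
print(designate(0, 94, 2, "B"))   # JISX0208 to G0
print(designate(1, 94, 2, "C"))   # KSC5601 to G1
```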
615 | |
616 To use a charset designated to G2 or G3, and to use a | |
617 charset designated to G1 in a 7-bit environment, you must | |
618 explicitly invoke G1, G2, or G3 into GL. There are two | |
619 types of invocation, Locking Shift (forever) and Single | |
620 Shift (one character only). | |
621 | |
622 Locking Shift is done as follows: | |
623 | |
624 @example | |
625 SI or LS0: invoke G0 into GL | |
626 SO or LS1: invoke G1 into GL | |
627 LS2: invoke G2 into GL | |
628 LS3: invoke G3 into GL | |
629 LS1R: invoke G1 into GR | |
630 LS2R: invoke G2 into GR | |
631 LS3R: invoke G3 into GR | |
632 @end example | |
633 | |
634 Single Shift is done as follows: | |
635 | |
636 @example | |
637 @group | |
638 SS2 or ESC N: invoke G2 into GL | |
639 SS3 or ESC O: invoke G3 into GL | |
640 @end group | |
641 @end example | |
642 | |
643 (#### Ben says: I think the above is slightly incorrect. It appears that | |
644 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | |
645 ESC O behave as indicated. The above definitions will not parse | |
646 EUC-encoded text correctly, and it looks like the code in mule-coding.c | |
647 has similar problems.) | |
648 | |
649 As the rules above suggest, there are many ISO2022-compliant ways of | |
650 encoding multilingual text. Many coding systems in actual use, such as | |
651 X11's Compound Text, the Japanese JUNET code, and so-called EUC | |
652 (Extended UNIX Code), are variants of ISO2022. | |
653 | |
654 In Mule, we characterize ISO2022 by the following attributes: | |
655 | |
656 @enumerate | |
657 @item | |
658 Initial designation to G0 through G3. | |
659 @item | |
660 Allow designation of short form for Japanese and Chinese. | |
661 @item | |
662 Should we designate ASCII to G0 before control characters? | |
663 @item | |
664 Should we designate ASCII to G0 at the end of line? | |
665 @item | |
666 7-bit environment or 8-bit environment. | |
667 @item | |
668 Use Locking Shift or not. | |
669 @item | |
670 Use ASCII or JISX0201-1976-Roman. | |
671 @item | |
672 Use JISX0208-1983 or JISX0208-1976. | |
673 @end enumerate | |
674 | |
675 (The last two are only for Japanese.) | |
676 | |
677 By specifying these attributes, you can create any variant | |
678 of ISO2022. | |
679 | |
680 Here are several examples: | |
681 | |
682 @example | |
683 @group | |
684 junet -- Coding system used in JUNET. | |
685 1. G0 <- ASCII, G1..3 <- never used | |
686 2. Yes. | |
687 3. Yes. | |
688 4. Yes. | |
689 5. 7-bit environment | |
690 6. No. | |
691 7. Use ASCII | |
692 8. Use JISX0208-1983 | |
693 @end group | |
694 | |
695 @group | |
696 ctext -- Compound Text | |
697 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used | |
698 2. No. | |
699 3. No. | |
700 4. Yes. | |
701 5. 8-bit environment | |
702 6. No. | |
703 7. Use ASCII | |
704 8. Use JISX0208-1983 | |
705 @end group | |
706 | |
707 @group | |
708 euc-china -- Chinese EUC. Many people call this | |
709 "GB encoding", but that name may cause misunderstanding. | |
710 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used | |
711 2. No. | |
712 3. Yes. | |
713 4. Yes. | |
714 5. 8-bit environment | |
715 6. No. | |
716 7. Use ASCII | |
717 8. Use JISX0208-1983 | |
718 @end group | |
719 | |
720 @group | |
721 korean-mail -- Coding system used in Korean network. | |
722 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used | |
723 2. No. | |
724 3. Yes. | |
725 4. Yes. | |
726 5. 7-bit environment | |
727 6. Yes. | |
728 7. No. | |
729 8. No. | |
730 @end group | |
731 @end example | |
732 | |
733 Mule creates all these coding systems by default. | |
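
Whether a particular build actually defines them can be checked with
@code{find-coding-system} (@pxref{Basic Coding System Functions}). This
is a minimal sketch, assuming the coding system symbols are spelled
exactly as in the examples above.

@example
@group
;; Pair each name with `defined' or `missing'.
(mapcar (function
         (lambda (name)
           (cons name
                 (if (find-coding-system name) 'defined 'missing))))
        '(junet ctext euc-china korean-mail))
@end group
@end example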
734 | |
735 @node Coding Systems | |
736 @section Coding Systems | |
737 | |
738 A coding system is an object that defines how text containing multiple | |
739 character sets is encoded into a stream of (typically 8-bit) bytes. The | |
740 coding system is used to decode the stream into a series of characters | |
741 (which may be from multiple charsets) when the text is read from a file | |
742 or process, and is used to encode the text back into the same format | |
743 when it is written out to a file or process. | |
744 | |
745 For example, many ISO2022-compliant coding systems (such as Compound | |
746 Text, which is used for inter-client data under the X Window System) use | |
747 escape sequences to switch between different charsets -- Japanese Kanji, | |
748 for example, is designated with @samp{ESC $ ( B}; ASCII is designated with | |
749 @samp{ESC ( B}; and Cyrillic is designated with @samp{ESC - L}. See | |
750 @code{make-coding-system} for more information. | |
751 | |
752 Coding systems are normally identified using a symbol, and the symbol is | |
753 accepted in place of the actual coding system object whenever a coding | |
754 system is called for. (This is similar to how faces and charsets work.) | |
755 | |
756 @defun coding-system-p object | |
757 This function returns non-@code{nil} if @var{object} is a coding system. | |
758 @end defun | |
759 | |
760 @menu | |
761 * Coding System Types:: Classifying coding systems. | |
762 * EOL Conversion:: Dealing with different ways of denoting | |
763 the end of a line. | |
764 * Coding System Properties:: Properties of a coding system. | |
765 * Basic Coding System Functions:: Working with coding systems. | |
766 * Coding System Property Functions:: Retrieving a coding system's properties. | |
767 * Encoding and Decoding Text:: Encoding and decoding text. | |
768 * Detection of Textual Encoding:: Determining how text is encoded. | |
769 * Big5 and Shift-JIS Functions:: Special functions for these non-standard | |
770 encodings. | |
771 @end menu | |
772 | |
773 @node Coding System Types | |
774 @subsection Coding System Types | |
775 | |
776 @table @code | |
777 @item nil | |
778 @itemx autodetect | |
779 Automatic conversion. XEmacs attempts to detect the coding system used | |
780 in the file. | |
781 @item noconv | |
782 No conversion. Use this for binary files and such. On output, graphic | |
783 characters that are not in ASCII or Latin-1 will be replaced by a | |
784 @samp{?}. (For a noconv-encoded buffer, these characters will only be | |
785 present if you explicitly insert them.) | |
786 @item shift-jis | |
787 Shift-JIS (a Japanese encoding commonly used in PC operating systems). | |
788 @item iso2022 | |
789 Any ISO2022-compliant encoding. Among other things, this includes JIS | |
790 (the Japanese encoding commonly used for e-mail), EUC (the standard Unix | |
791 encoding for Japanese and other languages), and Compound Text (the | |
792 encoding used in X11). You can specify more specific information about | |
793 the conversion with the @var{props} argument of @code{make-coding-system}. | |
794 @item big5 | |
795 Big5 (an encoding commonly used for Chinese text in Taiwan). | |
796 @item ccl | |
797 The conversion is performed using a user-written pseudo-code program. | |
798 CCL (Code Conversion Language) is the name of this pseudo-code. | |
799 @item internal | |
800 Write out or read in the raw contents of the memory representing the | |
801 buffer's text. This is primarily useful for debugging purposes, and is | |
802 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set | |
803 (the @samp{--debug} configure option). @strong{Warning}: Reading in a | |
804 file using @code{internal} conversion can result in an internal | |
805 inconsistency in the memory representing a buffer's text, which will | |
806 produce unpredictable results and may cause XEmacs to crash. Under | |
807 normal circumstances you should never use @code{internal} conversion. | |
808 @end table | |
809 | |
810 @node EOL Conversion | |
811 @subsection EOL Conversion | |
812 | |
813 @table @code | |
814 @item nil | |
815 Automatically detect the end-of-line type (LF, CRLF, or CR). Also | |
816 generate subsidiary coding systems named @code{@var{name}-unix}, | |
817 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to | |
818 this coding system but have an @code{eol-type} value of @code{lf}, @code{crlf}, | |
819 and @code{cr}, respectively. | |
820 @item lf | |
821 The end of a line is marked externally using ASCII LF. Since this is | |
822 also the way that XEmacs represents an end-of-line internally, | |
823 specifying this option results in no end-of-line conversion. This is | |
824 the standard format for Unix text files. | |
825 @item crlf | |
826 The end of a line is marked externally using ASCII CRLF. This is the | |
827 standard format for MS-DOS text files. | |
828 @item cr | |
829 The end of a line is marked externally using ASCII CR. This is the | |
830 standard format for Macintosh text files. | |
831 @item t | |
832 Automatically detect the end-of-line type but do not generate subsidiary | |
833 coding systems. (This value is converted to @code{nil} when stored | |
834 internally, and @code{coding-system-property} will return @code{nil}.) | |
835 @end table | |
836 | |
837 @node Coding System Properties | |
838 @subsection Coding System Properties | |
839 | |
840 @table @code | |
841 @item mnemonic | |
842 String to be displayed in the modeline when this coding system is | |
843 active. | |
844 | |
845 @item eol-type | |
846 End-of-line conversion to be used. It should be one of the types | |
847 listed in @ref{EOL Conversion}. | |
848 | |
849 @item post-read-conversion | |
850 Function called after a file has been read in, to perform the decoding. | |
851 Called with two arguments, @var{beg} and @var{end}, denoting a region of | |
852 the current buffer to be decoded. | |
853 | |
854 @item pre-write-conversion | |
855 Function called before a file is written out, to perform the encoding. | |
856 Called with two arguments, @var{beg} and @var{end}, denoting a region of | |
857 the current buffer to be encoded. | |
858 @end table | |
859 | |
860 The following additional properties are recognized if @var{type} is | |
861 @code{iso2022}: | |
862 | |
863 @table @code | |
864 @item charset-g0 | |
865 @itemx charset-g1 | |
866 @itemx charset-g2 | |
867 @itemx charset-g3 | |
868 The character set initially designated to the G0 - G3 registers. | |
869 The value should be one of | |
870 | |
871 @itemize @bullet | |
872 @item | |
873 A charset object (designate that character set) | |
874 @item | |
875 @code{nil} (do not ever use this register) | |
876 @item | |
877 @code{t} (no character set is initially designated to the register, but | |
878 may be later on; this automatically sets the corresponding | |
879 @code{force-g*-on-output} property) | |
880 @end itemize | |
881 | |
882 @item force-g0-on-output | |
883 @itemx force-g1-on-output | |
884 @itemx force-g2-on-output | |
885 @itemx force-g3-on-output | |
886 If non-@code{nil}, send an explicit designation sequence on output | |
887 before using the specified register. | |
888 | |
889 @item short | |
890 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A}, | |
891 and @samp{ESC $ B} on output in place of the full designation sequences | |
892 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}. | |
893 | |
894 @item no-ascii-eol | |
895 If non-@code{nil}, don't designate ASCII to G0 at each end of line on | |
896 output. Setting this to non-@code{nil} also suppresses other | |
897 state-resetting that normally happens at the end of a line. | |
898 | |
899 @item no-ascii-cntl | |
900 If non-@code{nil}, don't designate ASCII to G0 before control chars on | |
901 output. | |
902 | |
903 @item seven | |
904 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit | |
905 environment. | |
906 | |
907 @item lock-shift | |
908 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or | |
909 designation by escape sequence. | |
910 | |
911 @item no-iso6429 | |
912 If non-@code{nil}, don't use ISO6429's direction specification. | |
913 | |
914 @item escape-quoted | |
915 If non-@code{nil}, literal control characters that are the same as the | |
916 beginning of a recognized ISO2022 or ISO6429 escape sequence (in | |
917 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F), | |
918 and CSI (0x9B)) are ``quoted'' with an escape character so that they can | |
919 be properly distinguished from an escape sequence. (Note that doing | |
920 this results in a non-portable encoding.) This encoding flag is used for | |
921 byte-compiled files. Note that ESC is a good choice for a quoting | |
922 character because there are no escape sequences whose second byte is a | |
923 character from the Control-0 or Control-1 character sets; this is | |
924 explicitly disallowed by the ISO2022 standard. | |
925 | |
926 @item input-charset-conversion | |
927 A list of conversion specifications, specifying conversion of characters | |
928 in one charset to another when decoding is performed. Each | |
929 specification is a list of two elements: the source charset, and the | |
930 destination charset. | |
931 | |
932 @item output-charset-conversion | |
933 A list of conversion specifications, specifying conversion of characters | |
934 in one charset to another when encoding is performed. The form of each | |
935 specification is the same as for @code{input-charset-conversion}. | |
936 @end table | |
937 | |
938 The following additional properties are recognized (and required) if | |
939 @var{type} is @code{ccl}: | |
940 | |
941 @table @code | |
942 @item decode | |
943 CCL program used for decoding (converting to internal format). | |
944 | |
945 @item encode | |
946 CCL program used for encoding (converting to external format). | |
947 @end table | |
948 | |
949 @node Basic Coding System Functions | |
950 @subsection Basic Coding System Functions | |
951 | |
952 @defun find-coding-system coding-system-or-name | |
953 This function retrieves the coding system of the given name. | |
954 | |
955 If @var{coding-system-or-name} is a coding-system object, it is simply | |
956 returned. Otherwise, @var{coding-system-or-name} should be a symbol. | |
957 If there is no such coding system, @code{nil} is returned. Otherwise | |
958 the associated coding system object is returned. | |
959 @end defun | |
960 | |
961 @defun get-coding-system name | |
962 This function retrieves the coding system of the given name. Same as | |
963 @code{find-coding-system} except an error is signalled if there is no | |
964 such coding system instead of returning @code{nil}. | |
965 @end defun | |
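
The difference between the two lookups matters mainly for error
handling. A small sketch, using a deliberately nonexistent name
(@code{no-such-coding-system} is made up for the illustration):

@example
@group
;; Documented to return nil rather than signal an error.
(find-coding-system 'no-such-coding-system)

;; Documented to signal an error, which is caught here.
(condition-case err
    (get-coding-system 'no-such-coding-system)
  (error (message "Lookup failed: %S" err)))
@end group
@end example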
966 | |
967 @defun coding-system-list | |
968 This function returns a list of the names of all defined coding systems. | |
969 @end defun | |
970 | |
971 @defun coding-system-name coding-system | |
972 This function returns the name of the given coding system. | |
973 @end defun | |
974 | |
975 @defun make-coding-system name type &optional doc-string props | |
976 This function registers symbol @var{name} as a coding system. | |
977 | |
978 @var{type} describes the conversion method used and should be one of | |
979 the types listed in @ref{Coding System Types}. | |
980 | |
981 @var{doc-string} is a string describing the coding system. | |
982 | |
983 @var{props} is a property list, describing the specific nature of the | |
984 coding system. Recognized properties are as in @ref{Coding System | |
985 Properties}. | |
986 @end defun | |
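
As an illustration, the following sketch defines a 7-bit ISO2022 coding
system whose properties mirror the @code{junet} attributes listed in
@ref{ISO 2022}. The name @code{my-junet-like}, the mnemonic string, and
the use of the symbol @code{ascii} to name the charset are assumptions
made for the example. If a charset symbol is not accepted for
@code{charset-g0}, a charset object would have to be supplied instead.

@example
@group
(make-coding-system
 'my-junet-like 'iso2022
 "7-bit ISO2022 coding system modelled on junet (example only)."
 '(mnemonic "J7"
   charset-g0 ascii     ; assumed charset name
   charset-g1 nil       ; G1 through G3 are never used
   charset-g2 nil
   charset-g3 nil
   short t              ; attribute 2: allow short designations
   seven t              ; attribute 5: 7-bit environment
   lock-shift nil))     ; attribute 6: no locking shift
@end group
@end example

Attributes 3 and 4 (designate ASCII before control characters and at
end of line) correspond to leaving @code{no-ascii-cntl} and
@code{no-ascii-eol} unset.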
987 | |
988 @defun copy-coding-system old-coding-system new-name | |
989 This function copies @var{old-coding-system} to @var{new-name}. If | |
990 @var{new-name} does not name an existing coding system, a new one will | |
991 be created. | |
992 @end defun | |
993 | |
994 @defun subsidiary-coding-system coding-system eol-type | |
995 This function returns the subsidiary coding system of | |
996 @var{coding-system} with eol type @var{eol-type}. | |
997 @end defun | |
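
For example, assuming that the @code{junet} coding system exists and was
created with an @code{eol-type} of @code{nil} (so that subsidiary coding
systems were generated, @pxref{EOL Conversion}), its CRLF variant could
be fetched as follows:

@example
@group
;; By the naming rule for subsidiaries, the result would be
;; expected to be named junet-dos.
(coding-system-name (subsidiary-coding-system 'junet 'crlf))
@end group
@end example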
998 | |
999 @node Coding System Property Functions | |
1000 @subsection Coding System Property Functions | |
1001 | |
1002 @defun coding-system-doc-string coding-system | |
1003 This function returns the doc string for @var{coding-system}. | |
1004 @end defun | |
1005 | |
1006 @defun coding-system-type coding-system | |
1007 This function returns the type of @var{coding-system}. | |
1008 @end defun | |
1009 | |
1010 @defun coding-system-property coding-system prop | |
1011 This function returns the @var{prop} property of @var{coding-system}. | |
1012 @end defun | |
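
These accessors can be combined to summarize a coding system. The
sketch below assumes that a coding system named @code{junet} is defined;
any other defined name would do.

@example
@group
(let ((cs (get-coding-system 'junet)))
  (list (coding-system-name cs)
        (coding-system-type cs)
        (coding-system-doc-string cs)
        (coding-system-property cs 'mnemonic)
        (coding-system-property cs 'eol-type)))
@end group
@end example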
1013 | |
1014 @node Encoding and Decoding Text | |
1015 @subsection Encoding and Decoding Text | |
1016 | |
1017 @defun decode-coding-region start end coding-system &optional buffer | |
1018 This function decodes the text between @var{start} and @var{end} which | |
1019 is encoded in @var{coding-system}. This is useful if you've read in | |
1020 encoded text from a file without decoding it (e.g. you read in a | |
1021 JIS-formatted file but used the @code{binary} or @code{noconv} coding | |
1022 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the | |
1023 decoded text is returned. @var{buffer} defaults to the current buffer | |
1024 if unspecified. | |
1025 @end defun | |
1026 | |
1027 @defun encode-coding-region start end coding-system &optional buffer | |
1028 This function encodes the text between @var{start} and @var{end} using | |
1029 @var{coding-system}. This will, for example, convert Japanese | |
1030 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS | |
1031 encoding. The length of the encoded text is returned. @var{buffer} | |
1032 defaults to the current buffer if unspecified. | |
1033 @end defun | |
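
The following sketch decodes the undecoded JIS sample shown above and
then encodes it again. It assumes that the @code{junet} coding system
is defined; the scratch buffer name is arbitrary.

@example
@group
(save-excursion
  (set-buffer (get-buffer-create " *coding-demo*"))
  (erase-buffer)
  ;; The undecoded sample from the text; "\e" is the ESC byte.
  (insert "\e$B!<!+\e(B")
  ;; Turn the escape sequences into Japanese characters ...
  (decode-coding-region (point-min) (point-max) 'junet)
  ;; ... and convert them back to an escape-sequence form.
  (encode-coding-region (point-min) (point-max) 'junet)
  (buffer-string))
@end group
@end example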
1034 | |
1035 @node Detection of Textual Encoding | |
1036 @subsection Detection of Textual Encoding | |
1037 | |
1038 @defun coding-category-list | |
1039 This function returns a list of all recognized coding categories. | |
1040 @end defun | |
1041 | |
1042 @defun set-coding-priority-list list | |
1043 This function changes the priority order of the coding categories. | |
1044 @var{list} should be a list of coding categories, in descending order of | |
1045 priority. Unspecified coding categories will be lower in priority than | |
1046 all specified ones, in the same relative order they were in previously. | |
1047 @end defun | |
1048 | |
1049 @defun coding-priority-list | |
1050 This function returns a list of coding categories in descending order of | |
1051 priority. | |
1052 @end defun | |
1053 | |
1054 @defun set-coding-category-system coding-category coding-system | |
1055 This function changes the coding system associated with a coding category. | |
1056 @end defun | |
1057 | |
1058 @defun coding-category-system coding-category | |
1059 This function returns the coding system associated with a coding category. | |
1060 @end defun | |
1061 | |
1062 @defun detect-coding-region start end &optional buffer | |
1063 This function detects the coding system of the text in the region between | |
1064 @var{start} and @var{end}. The return value is a list of possible coding | |
1065 systems ordered by priority. If only ASCII characters are found, it | |
1066 returns @code{autodetect} or one of its subsidiary coding systems | |
1067 according to a detected end-of-line type. Optional arg @var{buffer} | |
1068 defaults to the current buffer. | |
1069 @end defun | |
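
A minimal sketch of how detection might be inspected interactively. It
reports the candidates for the whole current buffer next to the current
category priorities, and uses only the functions described in this
section.

@example
@group
(let ((candidates (detect-coding-region (point-min) (point-max))))
  (message "Candidates: %S; priorities: %S"
           candidates (coding-priority-list)))
@end group
@end example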
1070 | |
1071 @node Big5 and Shift-JIS Functions | |
1072 @subsection Big5 and Shift-JIS Functions | |
1073 | |
1074 These are special functions for working with the non-standard | |
1075 Shift-JIS and Big5 encodings. | |
1076 | |
1077 @defun decode-shift-jis-char code | |
1078 This function decodes a JISX0208 character encoded in Shift-JIS. | |
1079 @var{code} is the character code in Shift-JIS as a cons of two bytes. | |
1080 The corresponding character is returned. | |
1081 @end defun | |
1082 | |
1083 @defun encode-shift-jis-char ch | |
1084 This function encodes the JISX0208 character @var{ch} in the Shift-JIS | |
1085 coding system. The corresponding character code in Shift-JIS is | |
1086 returned as a cons of two bytes. | |
1087 @end defun | |
1088 | |
1089 @defun decode-big5-char code | |
1090 This function decodes a character encoded in the Big5 coding system. | |
1091 @var{code} is the character code in Big5. The corresponding character | |
1092 is returned. | |
1093 @end defun | |
1094 | |
1095 @defun encode-big5-char ch | |
1096 This function encodes the Big5 character @var{ch} in the Big5 | |
1097 coding system. The corresponding character code in Big5 is returned. | |
1098 @end defun | |
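
As a sketch of how the Shift-JIS pair fits together, the following
round trip decodes a two-byte code and encodes the result again. The
byte values are an assumption chosen for the example: 136 and 159 are
the Shift-JIS bytes 0x88 0x9F, which should correspond to the first
kanji of JISX0208.

@example
@group
(let* ((code '(136 . 159))
       (ch (decode-shift-jis-char code)))
  ;; Non-nil if encoding the decoded character restores the bytes.
  (equal (encode-shift-jis-char ch) code))
@end group
@end example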
1099 | |
1100 @node CCL | |
1101 @section CCL | |
1102 | |
1103 @defun execute-ccl-program ccl-program status | |
1104 This function executes @var{ccl-program} with registers initialized by | |
1105 @var{status}. @var{ccl-program} is a vector of compiled CCL code | |
1106 created by @code{ccl-compile}. @var{status} must be a vector of nine | |
1107 values, specifying the initial value for the R0, R1 .. R7 registers and | |
1108 for the instruction counter IC. A @code{nil} value for a register | |
1109 initializer causes the register to be set to 0. A @code{nil} value for | |
1110 the IC initializer causes execution to start at the beginning of the | |
1111 program. When the program is done, @var{status} is modified (by | |
1112 side-effect) to contain the ending values for the corresponding | |
1113 registers and IC. | |
1114 @end defun | |
1115 | |
1116 @defun execute-ccl-program-string ccl-program status str | |
1117 This function executes @var{ccl-program} with initial @var{status} on | |
1118 @var{str}. @var{ccl-program} is a vector of compiled CCL code | |
1119 created by @code{ccl-compile}. @var{status} must be a vector of nine | |
1120 values, specifying the initial value for the R0, R1 .. R7 registers and | |
1121 for the instruction counter IC. A @code{nil} value for a register | |
1122 initializer causes the register to be set to 0. A @code{nil} value for | |
1123 the IC initializer causes execution to start at the beginning of the | |
1124 program. When the program is done, @var{status} is modified (by | |
1125 side-effect) to contain the ending values for the corresponding | |
1126 registers and IC. Returns the resulting string. | |
1127 @end defun | |
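
Because @var{status} is modified in place, it is convenient to allocate
a fresh vector for each run. The wrapper below is a hypothetical
helper, not part of XEmacs; it assumes that its first argument is a
vector already produced by @code{ccl-compile}.

@example
@group
(defun my-run-ccl-on-string (ccl-program str)
  ;; Nine nil entries: R0..R7 start at 0, IC at the program start.
  (let ((status (make-vector 9 nil)))
    ;; Return the converted string together with the final
    ;; register and IC values left in STATUS.
    (list (execute-ccl-program-string ccl-program status str)
          status)))
@end group
@end example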
1128 | |
1129 @defun ccl-reset-elapsed-time | |
1130 This function resets the internal counter that records the time spent in | |
1131 the CCL interpreter. | |
1132 @end defun | |
1133 | |
1134 @defun ccl-elapsed-time | |
1135 This function returns the time spent in the CCL interpreter as a cons of | |
1136 user and system time. This measures processor time, not real time. | |
1137 Both values are floating point numbers measured in seconds. If only one | |
1138 overall value can be determined, the return value will be a cons of that | |
1139 value and 0. | |
1140 @end defun | |
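
The two functions are meant to bracket a CCL run. The following sketch
returns the total processor time for one conversion;
@code{my-time-ccl} is a hypothetical helper and @var{ccl-program} is
assumed to be a compiled CCL vector.

@example
@group
(defun my-time-ccl (ccl-program str)
  (ccl-reset-elapsed-time)
  (execute-ccl-program-string ccl-program (make-vector 9 nil) str)
  ;; Sum user and system seconds; the cdr is 0 when only one
  ;; overall value could be determined.
  (let ((times (ccl-elapsed-time)))
    (+ (car times) (cdr times))))
@end group
@end example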
1141 | |
1142 @node Category Tables | |
1143 @section Category Tables | |
1144 | |
1145 A category table is a type of char table used for keeping track of | |
1146 categories. Categories are used for classifying characters for use in | |
1147 regexps -- you can refer to a category rather than having to use a | |
1148 complicated [] expression (and category lookups are significantly | |
1149 faster). | |
1150 | |
1151 There are 95 different categories available, one for each printable | |
1152 character (including space) in the ASCII charset. Each category is | |
1153 designated by one such character, called a @dfn{category designator}. | |
1154 They are specified in a regexp using the syntax @samp{\cX}, where X is a | |
1155 category designator. (This is not yet implemented.) | |
1156 | |
1157 A category table specifies, for each character, the categories that | |
1158 the character is in. Note that a character can be in more than one | |
1159 category. More specifically, a category table maps from a character to | |
1160 either the value @code{nil} (meaning the character is in no categories) | |
1161 or a 95-element bit vector, specifying for each of the 95 categories | |
1162 whether the character is in that category. | |
1163 | |
1164 Special Lisp functions are provided that abstract this, so you do not | |
1165 have to directly manipulate bit vectors. | |
1166 | |
1167 @defun category-table-p obj | |
1168 This function returns @code{t} if @var{obj} is a category table. | |
1169 @end defun | |
1170 | |
1171 @defun category-table &optional buffer | |
1172 This function returns the current category table. This is the one | |
1173 specified by the current buffer, or by @var{buffer} if it is | |
1174 non-@code{nil}. | |
1175 @end defun | |
1176 | |
1177 @defun standard-category-table | |
1178 This function returns the standard category table. This is the one used | |
1179 for new buffers. | |
1180 @end defun | |
1181 | |
1182 @defun copy-category-table &optional table | |
1183 This function constructs a new category table and returns it. It is a | |
1184 copy of the @var{table}, which defaults to the standard category table. | |
1185 @end defun | |
1186 | |
1187 @defun set-category-table table &optional buffer | |
1188 This function selects a new category table for @var{buffer}. | |
1189 @var{table} must be a category table. @var{buffer} defaults to the | |
1190 current buffer if omitted. | |
1191 @end defun | |
1192 | |
1193 @defun category-designator-p obj | |
1194 This function returns @code{t} if @var{obj} is a category designator (a | |
1195 char in the range @samp{' '} to @samp{'~'}). | |
1196 @end defun | |
1197 | |
1198 @defun category-table-value-p obj | |
1199 This function returns @code{t} if @var{obj} is a category table value. | |
1200 Valid values are @code{nil} or a bit vector of size 95. | |
1201 @end defun | |
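
A typical use of these functions is to give one buffer a private copy
of the standard table, so that later changes do not affect other
buffers. A minimal sketch using only the functions described above:

@example
@group
(let ((table (copy-category-table)))   ; copy of the standard table
  (if (category-table-p table)
      (set-category-table table))      ; install in the current buffer
  ;; Non-nil if the current buffer now uses the copy.
  (eq (category-table) table))
@end group
@end example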
1202 |