comparison man/xemacs/mule.texi @ 207:e45d5e7c476e r20-4b2

Import from CVS: tag r20-4b2
author cvs
date Mon, 13 Aug 2007 10:03:52 +0200
parents
children 41ff10fd062f
206:d3e9274cbc4e 207:e45d5e7c476e
1 @c This is part of the Emacs manual.
2 @c Copyright (C) 1997 Free Software Foundation, Inc.
3 @c See file emacs.texi for copying conditions.
4 @node Mule, Major Modes, Windows, Top
5 @chapter World Scripts Support
6 @cindex MULE
7 @cindex international scripts
8 @cindex multibyte characters
9 @cindex encoding of characters
10
11 @cindex Chinese
12 @cindex Devanagari
13 @cindex Hindi
14 @cindex Marathi
15 @cindex Ethiopian
16 @cindex Greek
17 @cindex IPA
18 @cindex Japanese
19 @cindex Korean
20 @cindex Lao
21 @cindex Russian
22 @cindex Thai
23 @cindex Tibetan
24 @cindex Vietnamese
25 If you compile XEmacs with the Mule option, it supports a wide variety of
26 world scripts, including the Latin alphabet (for some European languages
27 and Vietnamese), as well as Arabic, Simplified Chinese (for mainland
28 China), Traditional Chinese (for Taiwan and Hong Kong), Greek, Hebrew,
29 IPA, Japanese (Hiragana, Katakana and Kanji), Korean (Hangul and Hanja)
30 and Cyrillic (Byelorussian, Bulgarian, Russian, Serbian and Ukrainian)
31 scripts. These features have been merged from the modified version of
32 Emacs known as MULE (for ``MULti-lingual Enhancement to GNU Emacs'').
33
34 @menu
35 * Mule Intro:: Basic concepts of Mule.
36 * Language Environments:: Setting things up for the language you use.
37 * Input Methods:: Entering text characters not on your keyboard.
38 * Select Input Method:: Specifying your choice of input methods.
39 * Coding Systems:: Character set conversion when you read and
40 write files, and so on.
41 * Recognize Coding:: How XEmacs figures out which conversion to use.
42 * Specify Coding:: Various ways to choose which conversion to use.
43 @end menu
44
45 @node Mule Intro, Language Environments, Mule, Mule
46 @section Introduction to world scripts
47
48 The users of these scripts have established many more-or-less standard
49 coding systems for storing files.
50 @c XEmacs internally uses a single multibyte character encoding, so that it
51 @c can intermix characters from all these scripts in a single buffer or
52 @c string. This encoding represents each non-ASCII character as a sequence
53 @c of bytes in the range 0200 through 0377.
54 XEmacs translates between the internal character encoding and various
55 other coding systems when reading and writing files, when exchanging
56 data with subprocesses, and (in some cases) in the @kbd{C-q} command
57 (see below).
58
59 @kindex C-h h
60 @findex view-hello-file
61 The command @kbd{C-h h} (@code{view-hello-file}) displays the file
62 @file{etc/HELLO}, which shows how to say ``hello'' in many languages.
63 This illustrates various scripts.
64
65 Keyboards, even in the countries where these character sets are used,
66 generally don't have keys for all the characters in them. So XEmacs
67 supports various @dfn{input methods}, typically one for each script or
68 language, to make it convenient to type them.
69
70 @kindex C-x RET
71 The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
72 to world scripts, coding systems, and input methods.
73
74
75 @node Language Environments, Input Methods, Mule Intro, Mule
76 @section Language Environments
77 @cindex language environments
78
79 All supported character sets are available in XEmacs buffers if it is
80 compiled with Mule; there is no need to select a particular language in
81 order to display its characters in an XEmacs buffer. However, it is
82 important to select a @dfn{language environment} in order to set various
83 defaults. The language environment really represents a choice of
84 preferred script (more or less) rather than a choice of language.
85
86 The language environment controls which coding systems to recognize
87 when reading text (@pxref{Recognize Coding}). This applies to files,
88 incoming mail, netnews, and any other text you read into XEmacs. It may
89 also specify the default coding system to use when you create a file.
90 Each language environment also specifies a default input method.
91
92 @findex set-language-environment
93 The command to select a language environment is @kbd{M-x
94 set-language-environment}. It makes no difference which buffer is
95 current when you use this command, because the effects apply globally to
96 the XEmacs session. The supported language environments include:
97
98 @quotation
99 Chinese-BIG5, Chinese-CNS, Chinese-GB, Cyrillic-ISO, English, Ethiopic,
100 Greek, Japanese, Korean, Latin-1, Latin-2, Latin-3, Latin-4, Latin-5.
101 @end quotation
102
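For instance, a minimal sketch of selecting a language environment from
your init file, assuming the function accepts the environment name as a
string, exactly as you would type it in the minibuffer:

@smallexample
;; Assumption: the argument is one of the names listed above, as a string.
(set-language-environment "Japanese")
@end smallexample
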
103 Some operating systems let you specify the language you are using by
104 setting locale environment variables. XEmacs handles one common special
105 case of this: if your locale name for character types contains the
106 string @samp{8859-@var{n}}, XEmacs automatically selects the
107 corresponding language environment.
108
109 @kindex C-h L
110 @findex describe-language-environment
111 To display information about the effects of a certain language
112 environment @var{lang-env}, use the command @kbd{C-h L @var{lang-env}
113 @key{RET}} (@code{describe-language-environment}). This tells you which
114 languages this language environment is useful for, and lists the
115 character sets, coding systems, and input methods that go with it. It
116 also shows some sample text to illustrate scripts used in this language
117 environment. By default, this command describes the chosen language
118 environment.
119
120 @node Input Methods, Select Input Method, Language Environments, Mule
121 @section Input Methods
122
123 @cindex input methods
124 An @dfn{input method} is a kind of character conversion designed
125 specifically for interactive input. In XEmacs, typically each language
126 has its own input method; sometimes several languages which use the same
127 characters can share one input method. A few languages support several
128 input methods.
129
130 The simplest kind of input method works by mapping ASCII letters into
131 another alphabet. This is how the Greek and Russian input methods work.
132
133 A more powerful technique is composition: converting sequences of
134 characters into one letter. Many European input methods use composition
135 to produce a single non-ASCII letter from a sequence that consists of a
136 letter followed by accent characters. For example, some methods convert
137 the sequence @kbd{'a} into a single accented letter.
138
139 The input methods for syllabic scripts typically use mapping followed
140 by composition. The input methods for Thai and Korean work this way.
141 First, letters are mapped into symbols for particular sounds or tone
142 marks; then, sequences of these which make up a whole syllable are
143 mapped into one syllable sign.
144
145 Chinese and Japanese require more complex methods. In Chinese input
146 methods, first you enter the phonetic spelling of a Chinese word (in
147 input method @code{chinese-py}, among others), or a sequence of portions
148 of the character (input methods @code{chinese-4corner} and
149 @code{chinese-sw}, and others). Since one phonetic spelling typically
150 corresponds to many different Chinese characters, you must select one of
151 the alternatives using special XEmacs commands. Keys such as @kbd{C-f},
152 @kbd{C-b}, @kbd{C-n}, @kbd{C-p}, and digits have special definitions in
153 this situation, used for selecting among the alternatives. @key{TAB}
154 displays a buffer showing all the possibilities.
155
156 In Japanese input methods, first you input a whole word using
157 phonetic spelling; then, after the word is in the buffer, XEmacs
158 converts it into one or more characters using a large dictionary. One
159 phonetic spelling corresponds to many differently written Japanese
160 words, so you must select one of them; use @kbd{C-n} and @kbd{C-p} to
161 cycle through the alternatives.
162
163 Sometimes it is useful to cut off input method processing so that the
164 characters you have just entered will not combine with subsequent
165 characters. For example, in input method @code{latin-1-postfix}, the
166 sequence @kbd{e '} combines to form an @samp{e} with an accent. What if
167 you want to enter them as separate characters?
168
169 One way is to type the accent twice; that is a special feature for
170 entering the separate letter and accent. For example, @kbd{e ' '} gives
171 you the two characters @samp{e'}. Another way is to type another letter
172 after the @kbd{e}---something that won't combine with that---and
173 immediately delete it. For example, you could type @kbd{e e @key{DEL}
174 '} to get separate @samp{e} and @samp{'}.
175
176 Another method, more general but not quite as easy to type, is to use
177 @kbd{C-\ C-\} between two characters to stop them from combining. This
178 is the command @kbd{C-\} (@code{toggle-input-method}) used twice.
179 @ifinfo
180 @xref{Select Input Method}.
181 @end ifinfo
182
183 @kbd{C-\ C-\} is especially useful inside an incremental search,
184 because it stops waiting for more characters to combine, and starts
185 searching for what you have already entered.
186
187 @vindex input-method-verbose-flag
188 @vindex input-method-highlight-flag
189 The variables @code{input-method-highlight-flag} and
190 @code{input-method-verbose-flag} control how input methods explain what
191 is happening. If @code{input-method-highlight-flag} is non-@code{nil},
192 the partial sequence is highlighted in the buffer. If
193 @code{input-method-verbose-flag} is non-@code{nil}, the list of possible
194 characters to type next is displayed in the echo area (but not when you
195 are in the minibuffer).
196
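As an illustration only, these settings in your init file would keep the
highlighting of the partial sequence but suppress the echo-area list of
candidates:

@smallexample
;; Highlight the partial sequence, but do not list possible next characters.
(setq input-method-highlight-flag t
      input-method-verbose-flag nil)
@end smallexample
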
197 @node Select Input Method, Coding Systems, Input Methods, Mule
198 @section Selecting an Input Method
199
200 @table @kbd
201 @item C-\
202 Enable or disable use of the selected input method.
203
204 @item C-x @key{RET} C-\ @var{method} @key{RET}
205 Select a new input method for the current buffer.
206
207 @item C-h I @var{method} @key{RET}
208 @itemx C-h C-\ @var{method} @key{RET}
209 @findex describe-input-method
210 @kindex C-h I
211 @kindex C-h C-\
212 Describe the input method @var{method} (@code{describe-input-method}).
213 By default, it describes the current input method (if any).
214
215 @item M-x list-input-methods
216 Display a list of all the supported input methods.
217 @end table
218
219 @findex select-input-method
220 @vindex current-input-method
221 @kindex C-x RET C-\
222 To choose an input method for the current buffer, use @kbd{C-x
223 @key{RET} C-\} (@code{select-input-method}). This command reads the
224 input method name with the minibuffer; the name normally starts with the
225 language environment that it is meant to be used with. The variable
226 @code{current-input-method} records which input method is selected.
227
228 @findex toggle-input-method
229 @kindex C-\
230 Input methods use various sequences of ASCII characters to stand for
231 non-ASCII characters. Sometimes it is useful to turn off the input
232 method temporarily. To do this, type @kbd{C-\}
233 (@code{toggle-input-method}). To reenable the input method, type
234 @kbd{C-\} again.
235
236 If you type @kbd{C-\} and you have not yet selected an input method,
237 XEmacs prompts you to specify one. This has the same effect as using
238 @kbd{C-x @key{RET} C-\} to specify an input method.
239
240 @vindex default-input-method
241 Selecting a language environment specifies a default input method for
242 use in various buffers. When you have a default input method, you can
243 select it in the current buffer by typing @kbd{C-\}. The variable
244 @code{default-input-method} specifies the default input method
245 (@code{nil} means there is none).
246
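As a sketch, you could also set the default yourself in your init file.
This assumes the variable holds the input method name as a string, which
may not be true of every XEmacs version; @code{latin-1-postfix} is the
method used as an example in @ref{Input Methods}.

@smallexample
;; Assumption: the value is an input method name given as a string.
(setq default-input-method "latin-1-postfix")
@end smallexample
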
247 @findex quail-set-keyboard-layout
248 Some input methods for alphabetic scripts work by (in effect)
249 remapping the keyboard to emulate various keyboard layouts commonly used
250 for those scripts. How to do this remapping properly depends on your
251 actual keyboard layout. To specify which layout your keyboard has, use
252 the command @kbd{M-x quail-set-keyboard-layout}.
253
254 @findex list-input-methods
255 To display a list of all the supported input methods, type @kbd{M-x
256 list-input-methods}. The list gives information about each input
257 method, including the string that stands for it in the mode line.
258
259 @node Coding Systems, Recognize Coding, Select Input Method, Mule
260 @section Coding Systems
261 @cindex coding systems
262
263 Users of various languages have established many more-or-less standard
264 coding systems for representing them. XEmacs does not use these coding
265 systems internally; instead, it converts from various coding systems to
266 its own system when reading data, and converts the internal coding
267 system to other coding systems when writing data. Conversion is
268 possible in reading or writing files, in sending or receiving from the
269 terminal, and in exchanging data with subprocesses.
270
271 XEmacs assigns a name to each coding system. Most coding systems are
272 used for one language, and the name of the coding system starts with the
273 language name. Some coding systems are used for several languages;
274 their names usually start with @samp{iso}. There are also special
275 coding systems @code{binary} and @code{no-conversion} which do not
276 convert printing characters at all.
277
278 In addition to converting various representations of non-ASCII
279 characters, a coding system can perform end-of-line conversion. XEmacs
280 handles three different conventions for how to separate lines in a file:
281 newline, carriage-return linefeed, and just carriage-return.
282
283 @table @kbd
284 @item C-h C @var{coding} @key{RET}
285 Describe coding system @var{coding}.
286
287 @item C-h C @key{RET}
288 Describe the coding systems currently in use.
289
290 @item M-x list-coding-systems
291 Display a list of all the supported coding systems.
292 @end table
293
294 @kindex C-h C
295 @findex describe-coding-system
296 The command @kbd{C-h C} (@code{describe-coding-system}) displays
297 information about particular coding systems. You can specify a coding
298 system name as argument; alternatively, with an empty argument, it
299 describes the coding systems currently selected for various purposes,
300 both in the current buffer and as the defaults, and the priority list
301 for recognizing coding systems (@pxref{Recognize Coding}).
302
303 @findex list-coding-systems
304 To display a list of all the supported coding systems, type @kbd{M-x
305 list-coding-systems}. The list gives information about each coding
306 system, including the letter that stands for it in the mode line
307 (@pxref{Mode Line}).
308
309 Each of the coding systems that appear in this list---except for
310 @code{binary}, which means no conversion of any kind---specifies how and
311 whether to convert printing characters, but leaves the choice of
312 end-of-line conversion to be decided based on the contents of each file.
313 For example, if the file appears to use carriage-return linefeed between
314 lines, that end-of-line conversion will be used.
315
316 Each of the listed coding systems has three variants which specify
317 exactly what to do for end-of-line conversion:
318
319 @table @code
320 @item @dots{}-unix
321 Don't do any end-of-line conversion; assume the file uses
322 newline to separate lines. (This is the convention normally used
323 on Unix and GNU systems.)
324
325 @item @dots{}-dos
326 Assume the file uses carriage-return linefeed to separate lines,
327 and do the appropriate conversion. (This is the convention normally used
328 on Microsoft systems.)
329
330 @item @dots{}-mac
331 Assume the file uses carriage-return to separate lines, and do the
332 appropriate conversion. (This is the convention normally used on the
333 Macintosh system.)
334 @end table
335
336 These variant coding systems are omitted from the
337 @code{list-coding-systems} display for brevity, since they are entirely
338 predictable. For example, the coding system @code{iso-8859-1} has
339 variants @code{iso-8859-1-unix}, @code{iso-8859-1-dos} and
340 @code{iso-8859-1-mac}.
341
342 In contrast, the coding system @code{binary} specifies no character
343 code conversion at all---none for non-Latin-1 byte values and none for
344 end of line. This is useful for reading or writing binary files, tar
345 files, and other files that must be examined verbatim.
346
347 The easiest way to edit a file with no conversion of any kind is with
348 the @kbd{M-x find-file-literally} command. This uses @code{binary}, and
349 also suppresses other XEmacs features that might convert the file
350 contents before you see them. @xref{Visiting}.
351
352 The coding system @code{no-conversion} means that the file contains
353 non-Latin-1 characters stored with the internal XEmacs encoding. It
354 handles end-of-line conversion based on the data encountered, and has
355 the usual three variants to specify the kind of end-of-line conversion.
356
357
358 @node Recognize Coding, Specify Coding, Coding Systems, Mule
359 @section Recognizing Coding Systems
360
361 Most of the time, XEmacs can recognize which coding system to use for
362 any given file, once you have specified your preferences.
363
364 Some coding systems can be recognized or distinguished by which byte
365 sequences appear in the data. However, there are coding systems that
366 cannot be distinguished, not even potentially. For example, there is no
367 way to distinguish between Latin-1 and Latin-2; they use the same byte
368 values with different meanings.
369
370 XEmacs handles this situation by means of a priority list of coding
371 systems. Whenever XEmacs reads a file, if you do not specify the coding
372 system to use, XEmacs checks the data against each coding system,
373 starting with the first in priority and working down the list, until it
374 finds a coding system that fits the data. Then it converts the file
375 contents assuming that they are represented in this coding system.
376
377 The priority list of coding systems depends on the selected language
378 environment (@pxref{Language Environments}). For example, if you use
379 French, you probably want XEmacs to prefer Latin-1 to Latin-2; if you
380 use Czech, you probably want Latin-2 to be preferred. This is one of
381 the reasons to specify a language environment.
382
383 @findex prefer-coding-system
384 However, you can alter the priority list in detail with the command
385 @kbd{M-x prefer-coding-system}. This command reads the name of a coding
386 system from the minibuffer, and adds it to the front of the priority
387 list, so that it is preferred to all others. If you use this command
388 several times, each use adds one element to the front of the priority
389 list.
390
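The same preference can be expressed in Lisp. This sketch assumes that
@code{prefer-coding-system} accepts a coding system symbol when called
from Lisp, and that a coding system named @code{iso-8859-2} exists in
your XEmacs:

@smallexample
;; Put Latin-2 at the front of the priority list (name is an assumption).
(prefer-coding-system 'iso-8859-2)
@end smallexample
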
391 @vindex file-coding-system-alist
392 Sometimes a file name indicates which coding system to use for the
393 file. The variable @code{file-coding-system-alist} specifies this
394 correspondence. There is a special function
395 @code{modify-coding-system-alist} for adding elements to this list. For
396 example, to read and write all @samp{.txt} files using the coding system
397 @code{china-iso-8bit}, you can execute this Lisp expression:
398
399 @smallexample
400 (modify-coding-system-alist 'file "\\.txt\\'" 'china-iso-8bit)
401 @end smallexample
402
403 @noindent
404 The first argument should be @code{file}, the second argument should be
405 a regular expression that determines which files this applies to, and
406 the third argument says which coding system to use for these files.
407
408 @vindex coding
409 You can specify the coding system for a particular file using the
410 @samp{-*-@dots{}-*-} construct at the beginning of a file, or a local
411 variables list at the end (@pxref{File Variables}). You do this by
412 defining a value for the ``variable'' named @code{coding}. XEmacs does
413 not really have a variable @code{coding}; instead of setting a variable,
414 it uses the specified coding system for the file. For example,
415 @samp{-*-mode: C; coding: iso-8859-1;-*-} specifies use of the
416 iso-8859-1 coding system, as well as C mode.
417
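For comparison, a local variables list that selects the same coding
system might look like this at the end of a Lisp file. The @samp{;;}
prefix is only an example; use the comment syntax of the file's own
language.

@smallexample
;; Local Variables:
;; coding: iso-8859-1
;; End:
@end smallexample
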
418 @vindex buffer-file-coding-system
419 Once XEmacs has chosen a coding system for a buffer, it stores that
420 coding system in @code{buffer-file-coding-system} and uses that coding
421 system, by default, for operations that write from this buffer into a
422 file. This includes the commands @code{save-buffer} and
423 @code{write-region}. If you want to write files from this buffer using
424 a different coding system, you can specify a different coding system for
425 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify
426 Coding}).
427
428
429 @node Specify Coding, , Recognize Coding, Mule
430 @section Specifying a Coding System
431
432 In cases where XEmacs does not automatically choose the right coding
433 system, you can use these commands to specify one:
434
435 @table @kbd
436 @item C-x @key{RET} f @var{coding} @key{RET}
437 Use coding system @var{coding} for the visited file
438 in the current buffer.
439
440 @item C-x @key{RET} k @var{coding} @key{RET}
441 Use coding system @var{coding} for keyboard input.
442
443 @item C-x @key{RET} t @var{coding} @key{RET}
444 Use coding system @var{coding} for terminal output.
445
446 @item C-x @key{RET} p @var{coding} @key{RET}
447 Use coding system @var{coding} for subprocess input and output
448 in the current buffer.
449 @end table
450
451 @kindex C-x RET f
452 @findex set-buffer-file-coding-system
453 The command @kbd{C-x RET f} (@code{set-buffer-file-coding-system})
454 specifies the file coding system for the current buffer---in other
455 words, which coding system to use when saving or rereading the visited
456 file. You specify which coding system using the minibuffer. Since this
457 command applies to a file you have already visited, it affects only the
458 way the file is saved.
459
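From Lisp, a sketch of the same operation might look as follows. It
assumes the function takes the coding system as a symbol; the
@samp{-dos} variant name follows the pattern described in @ref{Coding
Systems}.

@smallexample
;; Save the current buffer as Latin-1 with carriage-return linefeed
;; line endings.
(set-buffer-file-coding-system 'iso-8859-1-dos)
@end smallexample
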
460 Another way to specify the coding system for a file is when you visit
461 the file. If you run certain file input commands with a preceding
462 @kbd{C-u}, you can specify in the minibuffer the coding system to read with.
463
464 So if the command you prefix this way is @kbd{C-x C-f}, for example,
465 it reads the file using that coding system (and records the coding
466 system for when the file is saved). Other file commands affected by a
467 specified coding system include @kbd{C-x C-i} and @kbd{C-x C-v}, as well
468 as the other-window variants of @kbd{C-x C-f}.
469
470 @vindex default-buffer-file-coding-system
471 The variable @code{default-buffer-file-coding-system} specifies the
472 choice of coding system to use when you create a new file. It applies
473 when you find a new file, and when you create a buffer and then save it
474 in a file. Selecting a language environment typically sets this
475 variable to a good choice of default coding system for that language
476 environment.
477
478 @kindex C-x RET t
479 @findex set-terminal-coding-system
480 The command @kbd{C-x @key{RET} t} (@code{set-terminal-coding-system})
481 specifies the coding system for terminal output. If you specify a
482 coding system for terminal output, all characters output to the
483 terminal are translated into that coding system.
484
485 This feature is useful for certain character-only terminals built to
486 support specific languages or character sets---for example, European
487 terminals that support one of the ISO Latin character sets.
488
489 By default, output to the terminal is not translated at all.
490
491 @kindex C-x RET k
492 @findex set-keyboard-coding-system
493 The command @kbd{C-x @key{RET} k} (@code{set-keyboard-coding-system})
494 specifies the coding system for keyboard input. Character-code
495 translation of keyboard input is useful for terminals with keys that
496 send non-ASCII graphic characters---for example, some terminals designed
497 for ISO Latin-1 or subsets of it.
498
499 By default, keyboard input is not translated at all.
500
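For a terminal of this kind, an init file might contain something like
the following sketch, which assumes both functions accept a coding
system symbol when called from Lisp:

@smallexample
;; Translate output to, and input from, an ISO Latin-1 terminal.
(set-terminal-coding-system 'iso-8859-1)
(set-keyboard-coding-system 'iso-8859-1)
@end smallexample
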
501 There is a similarity between using a coding system translation for
502 keyboard input, and using an input method: both define sequences of
503 keyboard input that translate into single characters. However, input
504 methods are designed to be convenient for interactive use by humans, and
505 the sequences that are translated are typically sequences of ASCII
506 printing characters. Coding systems typically translate sequences of
507 non-graphic characters.
508
509 @kindex C-x RET p
510 @findex set-buffer-process-coding-system
511 The command @kbd{C-x @key{RET} p} (@code{set-buffer-process-coding-system})
512 specifies the coding system for input and output to a subprocess. This
513 command applies to the current buffer; normally, each subprocess has its
514 own buffer, and thus you can use this command to specify translation to
515 and from a particular subprocess by giving the command in the
516 corresponding buffer.
517
518 By default, process input and output are not translated at all.
519
520 @vindex file-name-coding-system
521 The variable @code{file-name-coding-system} specifies a coding system
522 to use for encoding file names. If you set the variable to a coding
523 system name (as a Lisp symbol or a string), XEmacs encodes file names
524 using that coding system for all file operations. This makes it
525 possible to use non-Latin-1 characters in file names---or, at least,
526 those non-Latin-1 characters which the specified coding system can
527 encode. By default, this variable is @code{nil}, which implies that you
528 cannot use non-Latin-1 characters in file names.