Mercurial > hg > xemacs-beta
comparison man/xemacs/mule.texi @ 1183:c1553814932e
[xemacs-hg @ 2003-01-03 12:12:30 by stephent]
various docs
<873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp>
<87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Fri, 03 Jan 2003 12:12:40 +0000 |
parents | 26f7cf2a4792 |
children | 6b0000935adc |
comparison
1182:7d696106ffe9 | 1183:c1553814932e |
---|---|
13 @cindex IPA | 13 @cindex IPA |
14 @cindex Japanese | 14 @cindex Japanese |
15 @cindex Korean | 15 @cindex Korean |
16 @cindex Cyrillic | 16 @cindex Cyrillic |
17 @cindex Russian | 17 @cindex Russian |
18 @c #### It's a lie that this file tells you about Unicode.... | |
19 @cindex Unicode | |
18 If you build XEmacs using the @code{--with-mule} option, it supports a | 20 If you build XEmacs using the @code{--with-mule} option, it supports a |
19 wide variety of world scripts, including the Latin script, the Arabic | 21 wide variety of world scripts, including the Latin script, the Arabic |
20 script, Simplified Chinese (for mainland of China), Traditional Chinese | 22 script, Simplified Chinese (for mainland of China), Traditional Chinese |
21 (for Taiwan and Hong-Kong), the Greek script, the Hebrew script, IPA | 23 (for Taiwan and Hong-Kong), the Greek script, the Hebrew script, IPA |
22 symbols, Japanese scripts (Hiragana, Katakana and Kanji), Korean scripts | 24 symbols, Japanese scripts (Hiragana, Katakana and Kanji), Korean scripts |
31 * Input Methods:: Entering text characters not on your keyboard. | 33 * Input Methods:: Entering text characters not on your keyboard. |
32 * Select Input Method:: Specifying your choice of input methods. | 34 * Select Input Method:: Specifying your choice of input methods. |
33 * Coding Systems:: Character set conversion when you read and | 35 * Coding Systems:: Character set conversion when you read and |
34 write files, and so on. | 36 write files, and so on. |
35 * Recognize Coding:: How XEmacs figures out which conversion to use. | 37 * Recognize Coding:: How XEmacs figures out which conversion to use. |
38 * Unification:: Integrating overlapping character sets. | |
36 * Specify Coding:: Various ways to choose which conversion to use. | 39 * Specify Coding:: Various ways to choose which conversion to use. |
40 * Charsets and Coding Systems:: Tables and other reference material. | |
37 @end menu | 41 @end menu |
38 | 42 |
39 @node Mule Intro, Language Environments, Mule, Mule | 43 @node Mule Intro, Language Environments, Mule, Mule |
40 @section Introduction to world scripts | 44 @section Introduction: The Wide Variety of Scripts and Codings in Use |
41 | 45 |
42 The users of these scripts have established many more-or-less standard | 46 There are hundreds of scripts in use world-wide. The users of these |
43 coding systems for storing files. | 47 scripts have established many more-or-less standard coding systems for |
44 @c XEmacs internally uses a single multibyte character encoding, so that it | 48 storing text written in them in files. XEmacs translates between its |
45 @c can intermix characters from all these scripts in a single buffer or | 49 internal character encoding and various other coding systems when |
46 @c string. This encoding represents each non-ASCII character as a sequence | 50 reading and writing files, when exchanging data with subprocesses, and |
47 @c of bytes in the range 0200 through 0377. | 51 (in some cases) in the @kbd{C-q} command (see below). |
48 XEmacs translates between the internal character encoding and various | 52 @footnote{Historically the internal encoding was a specially designed |
49 other coding systems when reading and writing files, when exchanging | 53 encoding, called @dfn{Mule encoding}, intended for easy conversion to |
50 data with subprocesses, and (in some cases) in the @kbd{C-q} command | 54 and from versions of ISO 2022. However, this encoding shares many |
51 (see below). | 55 properties with UTF-8, and conversion to UTF-8 as the internal code is |
56 proposed.} | |
52 | 57 |
53 @kindex C-h h | 58 @kindex C-h h |
54 @findex view-hello-file | 59 @findex view-hello-file |
55 The command @kbd{C-h h} (@code{view-hello-file}) displays the file | 60 The command @kbd{C-h h} (@code{view-hello-file}) displays the file |
56 @file{etc/HELLO}, which shows how to say ``hello'' in many languages. | 61 @file{etc/HELLO}, which shows how to say ``hello'' in many languages. |
354 non-Latin-1 characters stored with the internal XEmacs encoding. It | 359 non-Latin-1 characters stored with the internal XEmacs encoding. It |
355 handles end-of-line conversion based on the data encountered, and has | 360 handles end-of-line conversion based on the data encountered, and has |
356 the usual three variants to specify the kind of end-of-line conversion. | 361 the usual three variants to specify the kind of end-of-line conversion. |
357 | 362 |
358 | 363 |
359 @node Recognize Coding, Specify Coding, Coding Systems, Mule | 364 @node Recognize Coding, Unification, Coding Systems, Mule |
360 @section Recognizing Coding Systems | 365 @section Recognizing Coding Systems |
361 | 366 |
362 Most of the time, XEmacs can recognize which coding system to use for | 367 Most of the time, XEmacs can recognize which coding system to use for |
363 any given file--once you have specified your preferences. | 368 any given file--once you have specified your preferences. |
364 | 369 |
425 a different coding system, you can specify a different coding system for | 430 a different coding system, you can specify a different coding system for |
426 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify | 431 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify |
427 Coding}). | 432 Coding}). |
428 | 433 |
429 | 434 |
430 @node Specify Coding, , Recognize Coding, Mule | 435 @node Unification, Specify Coding, Recognize Coding, Mule |
436 @section Character Set Unification | |
437 | |
438 Mule suffers from a design defect that causes it to consider the ISO | |
439 Latin character sets to be disjoint. This results in oddities such as | |
440 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO | |
441 2022 control sequences to switch between them, as well as more | |
442 plausible but often unnecessary combinations like ISO 8859/1 with ISO | |
443 8859/2. This can be very annoying when sending messages or even in | |
444 simple editing on a single host. XEmacs works around the problem by | |
445 converting as many characters as possible to use a single Latin coded | |
446 character set before saving the buffer. | |
447 | |
448 Unification is planned for extension to other character set families, | |
449 in particular the Han family of character sets based on the Chinese | |
450 ideographic characters. At least for the Han sets, however, the | |
451 unification feature will be disabled by default. | |
452 | |
453 This functionality is based on the @file{latin-unity} package by | |
454 Stephen Turnbull @email{stephen@@xemacs.org}, but is somewhat | |
455 divergent. This documentation is also based on the package | |
456 documentation, and is likely to be inaccurate because of the different | |
457 constraints we place on ``core'' and packaged functionality. | |
458 | |
459 @menu | |
460 * Unification Overview:: History and general information. | |
461 * Unification Usage:: An overview of operation. | |
462 * Unification Configuration:: Configuring unification. | |
463 * Unification FAQs:: Questions and answers from the mailing list. | |
464 * Unification Theory:: How unification works. | |
465 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. | |
466 @end menu | |
467 | |
468 @node Unification Overview, Unification Usage, Unification, Unification | |
469 @subsection An Overview of Character Set Unification | |
470 | |
471 Mule suffers from a design defect that causes it to consider the ISO | |
472 Latin character sets to be disjoint. This manifests itself when a user | |
473 enters characters using input methods associated with different coded | |
474 character sets into a single buffer. | |
475 | |
476 A very important example involves email. Many sites, especially in the | |
477 U.S., default to use of the ISO 8859/1 coded character set (also called | |
478 ``Latin 1,'' though these are somewhat different concepts). However, | |
479 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the | |
480 Euro has become the official currency of most countries in Europe, this | |
481 is unsatisfactory (and in practice, useless). So Europeans generally | |
482 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most | |
483 languages, except that it substitutes EURO SIGN for CURRENCY SIGN. | |
484 | |
485 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
486 into a message composition buffer, and enters some text including the | |
487 Euro sign. Then Mule will consider the buffer to contain both ISO | |
488 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
489 programmed) send the message as a multipart mixed MIME body! | |
490 | |
491 This is clearly stupid. What is not as obvious is that, just as any | |
492 European can include American English in their text because ASCII is a | |
493 subset of ISO 8859/15, most European languages which use Latin | |
494 characters (e.g., German and Polish) can typically be mixed while using | |
495 only one Latin coded character set (in this case, ISO 8859/2). However, | |
496 this often depends on exactly what text is to be encoded. | |
497 | |
498 Unification works around the problem by converting as many characters as | |
499 possible to use a single Latin coded character set before saving the | |
500 buffer. | |
501 | |
502 | |
503 @node Unification Usage, Unification Configuration, Unification Overview, Unification | |
504 @subsection Operation of Unification | |
505 | |
506 This is a description of the early hack to include unification in | |
507 XEmacs 21.5. This will almost surely change. | |
508 | |
509 Normally, unification works in the background by installing | |
510 @code{unity-sanity-check} on @code{write-region-pre-hook}. | |
511 Unification is on by default for the ISO-8859 Latin sets. The user | |
512 activates this functionality for other character set families by | |
513 invoking @code{enable-unification}, either interactively or in her | |
514 init file. @xref{Init File, , , xemacs}. Unification can be | |
515 deactivated by invoking @code{disable-unification}. | |
516 | |
517 Unification also provides a few functions for remapping or recoding the | |
518 buffer by hand. To @dfn{remap} a character means to change the buffer | |
519 representation of the character by using another coded character set. | |
520 Remapping never changes the identity of the character, but may involve | |
521 altering the code point of the character. To @dfn{recode} a character | |
522 means to simply change the coded character set. Recoding never alters | |
523 the code point of the character, but may change the identity of the | |
524 character. @xref{Unification Theory}. | |
525 | |
526 There are a few variables which determine which coding systems are | |
527 always acceptable to unification: @code{unity-ucs-list}, | |
528 @code{unity-preferred-coding-system-list}, and | |
529 @code{unity-preapproved-coding-system-list}. The last defaults to | |
530 @code{(buffer-default preferred)}, and you should probably avoid changing it | |
531 because it short-circuits the sanity check. If you find you need to | |
532 use it, consider reporting it as a bug or request for enhancement. | |
533 | |
534 @menu | |
535 * Basic Functionality:: User interface and customization. | |
536 * Interactive Usage:: Treating text by hand. | |
537 Also documents the hook function(s). | |
538 @end menu | |
539 | |
540 | |
541 @node Basic Functionality, Interactive Usage, , Unification Usage | |
542 @subsubsection Basic Functionality | |
543 | |
544 These functions and user options initialize and configure unification. | |
545 In normal use, they are not needed. | |
546 | |
547 @strong{These interfaces will change. Also, the @samp{unity-} prefix | |
548 is likely to be changed for many of the variables and functions, as | |
549 they are of more general usefulness.} | |
550 | |
551 @defun enable-unification | |
552 Set up hooks and initialize variables for unification. | |
553 | |
554 There are no arguments. | |
555 | |
556 This function is idempotent. It will reinitialize any hooks or variables | |
557 that are not in initial state. | |
558 @end defun | |
559 | |
560 @defun disable-unification | |
561 There are no arguments. | |
562 | |
563 Clean up hooks and void variables used by unification. | |
564 @end defun | |
565 | |
566 @c #### several changes should go to latin-unity.texi | |
567 @defopt unity-ucs-list | |
568 List of universal coding systems recommended for character set unification. | |
569 | |
570 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}. | |
571 | |
572 Order matters; coding systems earlier in the list will be preferred when | |
573 recommending a coding system. These coding systems will not be used | |
574 without querying the user (unless they are also present in | |
575 @code{unity-preapproved-coding-system-list}), and follow the | |
576 @code{unity-preferred-coding-system-list} in the list of suggested | |
577 coding systems. | |
578 | |
579 If none of the preferred coding systems are feasible, the first in | |
580 this list will be the default. | |
581 | |
582 Notes on certain coding systems: @code{escape-quoted} is a special | |
583 coding system used for autosaves and compiled Lisp in Mule. You should | |
584 never delete this, although it is rare that a user would want to use it | |
585 directly. Unification does not try to be ``smart'' about other general | |
586 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized | |
587 as equivalent to @code{iso-2022-7}.) If your preferred coding system is | |
588 one of these, you may consider adding it to @code{unity-ucs-list}. | |
589 @end defopt | |
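
As a minimal sketch of the last suggestion in the description of unity-ucs-list, an init file might put an ISO 2022 coding system such as iso-2022-jp first in the list. The choice of iso-2022-jp and its position are assumptions made for illustration; the other entries are the documented defaults, and escape-quoted is kept because Mule needs it for autosaves.

```elisp
;; Init-file sketch, assuming unification is available (XEmacs 21.5 with Mule).
;; iso-2022-jp is an illustrative choice; replace it with your own
;; preferred ISO 2022 coding system.
;; Order matters: earlier entries are recommended first.
;; escape-quoted must stay in the list (used for autosaves and compiled Lisp).
(setq unity-ucs-list '(iso-2022-jp utf-8 iso-2022-7 ctext escape-quoted))
```
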
590 | |
591 Coding systems which are not Latin and not in | |
592 @code{unity-ucs-list} are handled by short circuiting checks of | |
593 coding system against the next two variables. | |
594 | |
595 @defopt unity-preapproved-coding-system-list | |
596 List of coding systems used without querying the user if feasible. | |
597 | |
598 The default value is @samp{(buffer-default preferred)}. | |
599 | |
600 The first feasible coding system in this list is used. The special values | |
601 @samp{preferred} and @samp{buffer-default} may be present: | |
602 | |
603 @table @code | |
604 @item buffer-default | |
605 Use the coding system used by @samp{write-region}, if feasible. | |
606 | |
607 @item preferred | |
608 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
609 @end table | |
610 | |
611 ``Feasible'' means that all characters in the buffer can be represented by | |
612 the coding system. Coding systems in @samp{unity-ucs-list} are | |
613 always considered feasible. Other feasible coding systems are computed | |
614 by @samp{unity-representations-feasible-region}. | |
615 | |
616 Note that, by definition, the first universal coding system in this | |
617 list shadows all other coding systems. In particular, if your | |
618 preferred coding system is a universal coding system, and | |
619 @code{preferred} is a member of this list, unification will blithely | |
620 convert all your files to that coding system. This is considered a | |
621 feature, but it may surprise most users. Users who don't like this | |
622 behavior may put @code{preferred} in | |
623 @code{unity-preferred-coding-system-list}, but not in | |
624 @code{unity-preapproved-coding-system-list}. | |
625 @end defopt | |
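
The workaround described at the end of the unity-preapproved-coding-system-list entry could look roughly like this in an init file. This is a sketch, not a recommended configuration: it assumes unification is already loaded so that unity-preferred-coding-system-list is bound, and it uses only the variable names and special values documented in this section.

```elisp
;; Init-file sketch, assuming both variables are already defined.
;; Keep only buffer-default as preapproved, so a universal preferred
;; coding system is no longer applied without a query.
(setq unity-preapproved-coding-system-list '(buffer-default))

;; Still offer the preferred coding system first when the user is asked.
(setq unity-preferred-coding-system-list
      (cons 'preferred unity-preferred-coding-system-list))
```
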
626 | |
627 | |
628 @defopt unity-preferred-coding-system-list | |
629 List of coding systems suggested to the user if feasible. | |
630 | |
631 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 | |
632 iso-8859-4 iso-8859-9)}. | |
633 | |
634 If none of the coding systems in | |
635 @samp{unity-preapproved-coding-system-list} are feasible, this list | |
636 will be recommended to the user, followed by the | |
637 @samp{unity-ucs-list} (so those coding systems should not be in | |
638 this list). The first coding system in this list is default. The | |
639 special values @samp{preferred} and @samp{buffer-default} may be | |
640 present: | |
641 | |
642 @table @code | |
643 @item buffer-default | |
644 Use the coding system used by @samp{write-region}, if feasible. | |
645 | |
646 @item preferred | |
647 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
648 @end table | |
649 | |
650 ``Feasible'' means that all characters in the buffer can be represented by | |
651 the coding system. Coding systems in @samp{unity-ucs-list} are | |
652 always considered feasible. Other feasible coding systems are computed | |
653 by @samp{unity-representations-feasible-region}. | |
654 @end defopt | |
655 | |
656 | |
657 @defvar unity-iso-8859-1-aliases | |
658 List of coding systems to be treated as aliases of ISO 8859/1. | |
659 | |
660 The default value is @code{'(iso-8859-1)}. | |
661 | |
662 This is not a user variable; to customize input of coding systems or | |
663 charsets, use @samp{unity-coding-system-alias-alist} or | |
664 @samp{unity-charset-alias-alist}. | |
665 @end defvar | |
666 | |
667 | |
668 @node Interactive Usage, , Basic Functionality, Unification Usage | |
669 @subsubsection Interactive Usage | |
670 | |
671 First, the hook function @code{unity-sanity-check} is documented. | |
672 (It is placed here because it is not an interactive function, and there | |
673 is not yet a programmer's section of the manual.) | |
674 | |
675 These functions provide access to internal functionality (such as the | |
676 remapping function) and to extra functionality (the recoding functions | |
677 and the test function). | |
678 | |
679 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system | |
680 | |
681 Check if @var{coding-system} can represent all characters between | |
682 @var{begin} and @var{end}. | |
683 | |
684 For compatibility with old broken versions of @code{write-region}, | |
685 @var{coding-system} defaults to @code{buffer-file-coding-system}. | |
686 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are | |
687 ignored. | |
688 | |
689 Return nil if @code{buffer-file-coding-system} is not (ISO-2022-compatible) | |
690 Latin. If @code{buffer-file-coding-system} is safe for the charsets | |
691 actually present in the buffer, return it. Otherwise, ask the user to | |
692 choose a coding system, and return that. | |
693 | |
694 This function does @emph{not} do the safe thing when | |
695 @code{buffer-file-coding-system} is nil (aka no-conversion). It | |
696 considers that ``non-Latin,'' and passes it on to the Mule detection | |
697 mechanism. | |
698 | |
699 This function is intended for use as a @code{write-region-pre-hook}. It | |
700 does nothing except return @var{coding-system} if @code{write-region} | |
701 handlers are inhibited. | |
702 @end defun | |
703 | |
704 @defun unity-buffer-representations-feasible | |
705 There are no arguments. | |
706 | |
707 Apply @code{unity-region-representations-feasible} to the current buffer. | |
708 @end defun | |
709 | |
710 @defun unity-region-representations-feasible begin end &optional buf | |
711 Return character sets that can represent the text from @var{begin} to | |
712 @var{end} in @var{buf}. | |
713 | |
714 @c #### Fix in latin-unity.texi. | |
715 @var{buf} defaults to the current buffer. Called interactively, it will be | |
716 applied to the region. The function assumes @var{begin} <= @var{end}. | |
717 | |
718 The return value is a cons. The car is the list of character sets | |
719 that can individually represent all of the non-ASCII portion of the | |
720 buffer, and the cdr is the list of character sets that can | |
721 individually represent all of the ASCII portion. | |
722 | |
723 The following is taken from a comment in the source. Please refer to | |
724 the source to be sure of an accurate description. | |
725 | |
726 The basic algorithm is to map over the region, compute the set of | |
727 charsets that can represent each character (the ``feasible charset''), | |
728 and take the intersection of those sets. | |
729 | |
730 The current implementation takes advantage of the fact that ASCII | |
731 characters are common and cannot change asciisets. Then using | |
732 skip-chars-forward makes motion over ASCII subregions very fast. | |
733 | |
734 This same strategy could be applied generally by precomputing classes | |
735 of characters equivalent according to their effect on latinsets, and | |
736 adding a whole class to the skip-chars-forward string once a member is | |
737 found. | |
738 | |
739 Probably efficiency is a function of the number of characters matched, | |
740 or maybe the length of the match string? With @code{skip-category-forward} | |
741 over a precomputed category table it should be really fast. In practice | |
742 for Latin character sets there are only 29 classes. | |
743 @end defun | |
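
The basic algorithm described above (compute the feasible charsets for each character, then intersect them) can be sketched in plain Emacs Lisp. This is not the code used by unification: the function names `my-intersect` and `my-feasible-charsets` are hypothetical, the per-character sets are passed in by hand rather than computed from a buffer, and the ASCII-skipping optimization is left out.

```elisp
;; Hypothetical helper, not part of unification.
(defun my-intersect (a b)
  "Return the elements of list A that are also in list B, in A's order."
  (let (result)
    (while a
      (if (memq (car a) b)
          (setq result (cons (car a) result)))
      (setq a (cdr a)))
    (nreverse result)))

;; Hypothetical sketch of the intersection step only.
(defun my-feasible-charsets (per-char-sets all-charsets)
  "Intersect PER-CHAR-SETS, a list with one feasible-charset list per character.
ALL-CHARSETS is the starting set; it is returned unchanged for empty input."
  (let ((feasible all-charsets))
    (while per-char-sets
      (setq feasible (my-intersect feasible (car per-char-sets)))
      (setq per-char-sets (cdr per-char-sets)))
    feasible))

;; Illustrative input: a-diaeresis is in Latin-1, Latin-2 and Latin-9,
;; while l-with-stroke is in Latin-2 only among these three.
(my-feasible-charsets
 '((latin-iso8859-1 latin-iso8859-2 latin-iso8859-15)
   (latin-iso8859-2))
 '(latin-iso8859-1 latin-iso8859-2 latin-iso8859-15))
;; => (latin-iso8859-2)
```

An empty result would mean that no single charset in the starting set can represent every character, which is the case where a universal coding system has to be suggested instead.
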
744 | |
745 @defun unity-remap-region begin end character-set &optional coding-system | |
746 | |
747 Remap characters between @var{begin} and @var{end} to equivalents in | |
748 @var{character-set}. Optional argument @var{coding-system} may be a | |
749 coding system name (a symbol) or nil. Characters with no equivalent are | |
750 left as-is. | |
751 | |
752 When called interactively, @var{begin} and @var{end} are set to the | |
753 beginning and end, respectively, of the active region, and the function | |
754 prompts for @var{character-set}. The function does completion, knows | |
755 how to guess a character set name from a coding system name, and also | |
756 provides some common aliases. See @code{unity-guess-charset}. | |
757 There is no way to specify @var{coding-system}, as it has no useful | |
758 function interactively. | |
759 | |
760 Return @var{coding-system} if @var{coding-system} can encode all | |
761 characters in the region, t if @var{coding-system} is nil and the coding | |
762 system with G0 = 'ascii and G1 = @var{character-set} can encode all | |
763 characters, and otherwise nil. Note that a non-null return does | |
764 @emph{not} mean it is safe to write the file, only the specified region. | |
765 (This behavior is useful for multipart MIME encoding and the like.) | |
766 | |
767 Note: by default this function is quite fascist about universal coding | |
768 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and | |
769 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change | |
770 this. | |
771 | |
772 This function remaps characters that are artificially distinguished by Mule | |
773 internal code. It may change the code point as well as the character set. | |
774 To recode characters that were decoded in the wrong coding system, use | |
775 @code{unity-recode-region}. | |
776 @end defun | |
777 | |
778 @defun unity-recode-region begin end wrong-cs right-cs | |
779 | |
780 Recode characters between @var{begin} and @var{end} from @var{wrong-cs} | |
781 to @var{right-cs}. | |
782 | |
783 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain | |
784 the same code point but the character set is changed. Only characters | |
785 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the | |
786 character may change. Note that this could be dangerous, if characters | |
787 whose identities you do not want changed are included in the region. | |
788 This function cannot guess which characters you want changed, and which | |
789 should be left alone. | |
790 | |
791 When called interactively, @var{begin} and @var{end} are set to the | |
792 beginning and end, respectively, of the active region, and the function | |
793 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
794 completion, knows how to guess a character set name from a coding system | |
795 name, and also provides some common aliases. See | |
796 @code{unity-guess-charset}. | |
797 | |
798 Another way to accomplish this, but using coding systems rather than | |
799 character sets to specify the desired recoding, is | |
800 @samp{unity-recode-coding-region}. That function may be faster | |
801 but is somewhat more dangerous, because it may recode more than one | |
802 character set. | |
803 | |
804 To change from one Mule representation to another without changing identity | |
805 of any characters, use @samp{unity-remap-region}. | |
806 @end defun | |
807 | |
808 @defun unity-recode-coding-region begin end wrong-cs right-cs | |
809 | |
810 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to | |
811 @var{right-cs}. | |
812 | |
813 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain | |
814 the same code point but the character set is changed. The identity of | |
815 characters may change. This is an inherently dangerous function; | |
816 multilingual text may be recoded in unexpected ways. #### It's also | |
817 dangerous because the coding systems are not sanity-checked in the | |
818 current implementation. | |
819 | |
820 When called interactively, @var{begin} and @var{end} are set to the | |
821 beginning and end, respectively, of the active region, and the function | |
822 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
823 completion, knows how to guess a coding system name from a character set | |
824 name, and also provides some common aliases. See | |
825 @code{unity-guess-coding-system}. | |
826 | |
827 Another, safer, way to accomplish this, using character sets rather | |
828 than coding systems to specify the desired recoding, is to use | |
829 @code{unity-recode-region}. | |
830 | |
831 To change from one Mule representation to another without changing identity | |
832 of any characters, use @code{unity-remap-region}. | |
833 @end defun | |
834 | |
835 Helper functions for input of coding system and character set names. | |
836 | |
837 @defun unity-guess-charset candidate | |
838 Guess a charset based on the symbol @var{candidate}. | |
839 | |
840 @var{candidate} itself is not tried as the value. | |
841 | |
842 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
843 the values in @samp{unity-charset-alias-alist}. | |
844 @end defun | |
845 | |
846 @defun unity-guess-coding-system candidate | |
847 Guess a coding system based on the symbol @var{candidate}. | |
848 | |
849 @var{candidate} itself is not tried as the value. | |
850 | |
851 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
852 the values in @samp{unity-coding-system-alias-alist}. | |
853 @end defun | |
854 | |
855 @defun unity-example | |
856 | |
857 A cheesy example for unification. | |
858 | |
859 At present it just makes a multilingual buffer. To test, @code{setq} | |
860 @code{buffer-file-coding-system} to some value, make the buffer dirty (e.g., | |
861 with RET BackSpace), and save. | |
862 @end defun | |
863 | |
864 | |
865 @node Unification Configuration, Unification FAQs, Unification Usage, Unification | |
866 @subsection Configuring Unification for Use | |
867 | |
868 If you want unification to be automatically initialized, invoke | |
869 @samp{enable-unification} with no arguments in your init file. | |
870 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs | |
871 earlier than 21.1, you should also load @file{auto-autoloads} using the | |
872 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries). | |
873 | |
874 You may wish to define aliases for commonly used character sets and | |
875 coding systems for convenience in input. | |
876 | |
877 @defopt unity-charset-alias-alist | |
878 Alist mapping aliases to Mule charset names (symbols). | |
879 | |
880 The default value is | |
881 @example | |
882 ((latin-1 . latin-iso8859-1) | |
883 (latin-2 . latin-iso8859-2) | |
884 (latin-3 . latin-iso8859-3) | |
885 (latin-4 . latin-iso8859-4) | |
886 (latin-5 . latin-iso8859-9) | |
887 (latin-9 . latin-iso8859-15) | |
888 (latin-10 . latin-iso8859-16)) | |
889 @end example | |
890 | |
891 If a charset does not exist on your system, it will not complete and you | |
892 will not be able to enter it in response to prompts. A real charset | |
893 with the same name as an alias in this list will shadow the alias. | |
894 @end defopt | |
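
For instance, an extra alias could be added from an init file along these lines. The alias name `western` is made up for illustration, and the sketch assumes unity-charset-alias-alist is already bound when the init file runs; latin-iso8859-1 is one of the charset names in the default value above.

```elisp
;; Init-file sketch, assuming unity-charset-alias-alist is already defined.
;; `western' is a hypothetical alias name chosen for this example.
(setq unity-charset-alias-alist
      (cons '(western . latin-iso8859-1) unity-charset-alias-alist))
```

As noted above, a real charset named `western` would shadow this alias.
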
895 | |
896 @defopt unity-coding-system-alias-alist | |
897 Alist mapping aliases to Mule coding system names (symbols). | |
898 | |
899 The default value is @samp{nil}. | |
900 @end defopt | |
901 | |
902 | |
903 @node Unification FAQs, Unification Theory, Unification Configuration, Unification | |
904 @subsection Frequently Asked Questions About Unification | |
905 | |
906 @enumerate | |
907 @item | |
908 I'm smarter than XEmacs's unification feature! How can that be? | |
909 | |
910 Don't be surprised. Trust yourself. | |
911 | |
912 Unification is very young as yet. Teach it what you know by | |
913 Customizing its variables, and report your changes to the maintainer | |
914 (@kbd{M-x report-xemacs-bug RET}). | |
915 | |
916 @item | |
917 What is a UCS? | |
918 | |
919 According to ISO 10646, a Universal Coded character Set. In | |
920 XEmacs, it's a Universal (Mule) Coding System. | |
921 @xref{Coding Systems, , , xemacs}. | |
922 | |
923 @item | |
924 I know @code{utf-16-le-bom} is a UCS, but unification won't use it. | |
925 Why not? | |
926 | |
927 There are an awful lot of UCSes in Mule, and you probably do not want to | |
928 ever use, and definitely not be asked about, most of them. So the | |
929 default set includes a few that the author thought plausible, but | |
930 they're surely not comprehensive or optimal. | |
931 | |
932 Customize @code{unity-ucs-list} to include the ones you use often, and | |
933 report your favorites to the maintainer for consideration for | |
934 inclusion in the defaults using @kbd{M-x report-xemacs-bug RET}. | |
935 (Note that you @emph{must} include @code{escape-quoted} in this list, | |
936 because Mule uses it internally as the coding system for auto-save | |
937 files.) | |
938 | |
939 Alternatively, if you just want to use it this one time, simply type | |
940 it in at the prompt. Unification will confirm that it is a real coding | |
941 system, and then assume that you know what you're doing. | |
942 | |
943 @item | |
944 This is crazy: I can't quit XEmacs, and I get queried on autosaves! Why? | |
945 | |
946 You probably removed @code{escape-quoted} from | |
947 @code{unity-ucs-list}. Put it back. | |
948 | |
949 @item | |
950 Unification is really buggy and I can't get any work done. | |
951 | |
952 First, use @kbd{M-x disable-unification RET}, then report your | |
953 problems as a bug (@kbd{M-x report-xemacs-bug RET}). | |
954 @end enumerate | |
955 | |
956 | |
957 @node Unification Theory, What Unification Cannot Do for You, Unification FAQs, Unification | |
958 @subsection Unification Theory | |
959 | |
960 Standard encodings suffer from the design defect that they do not | |
961 provide a reliable way to recognize which coded character sets are in use. | |
962 @xref{What Unification Cannot Do for You}. There are scores of | |
963 character sets which can be represented by a single octet (8-bit | |
964 byte), whose union contains many hundreds of characters. Obviously | |
965 this results in great confusion, since you can't tell the players | |
966 without a scorecard, and there is no scorecard. | |
967 | |
968 There are two ways to solve this problem. The first is to create a | |
969 universal coded character set. This is the concept behind Unicode. | |
970 However, there have been satisfactory (nearly) universal character | |
971 sets for several decades, but even today many Westerners resist using | |
972 Unicode because they consider its space requirements excessive. On | |
973 the other hand, many Asians dislike Unicode because they consider it | |
974 to be incomplete. (This is partly, but not entirely, political.) | |
975 | |
976 In any case, Unicode only solves the internal representation problem. | |
977 Many data sets will contain files in ``legacy'' encodings, and Unicode | |
978 does not help distinguish among them. | |
979 | |
980 The second approach is to embed information about the encodings used in | |
981 a document in its text. This approach is taken by the ISO 2022 | |
982 standard. This would solve the problem completely from the users' point of | |
983 view, except that ISO 2022 is basically not implemented at all, in the | |
984 sense that few applications or systems implement more than a small | |
985 subset of ISO 2022 functionality. This is due to the fact that | |
986 mono-literate users object to the presence of escape sequences in their | |
987 texts (which they, with some justification, consider data corruption). | |
988 Programmers are more than willing to cater to these users, since | |
989 implementing ISO 2022 is a painstaking task. | |
990 | |
991 In fact, Emacs/Mule adopts both of these approaches. Internally it uses | |
992 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022 | |
993 techniques both to save files in forms robust to encoding issues, and as | |
994 hints when attempting to ``guess'' an unknown encoding. However, Mule | |
995 suffers from a design defect, namely it embeds the character set | |
996 information that ISO 2022 attaches to runs of characters by introducing | |
997 them with a control sequence in each character. That causes Mule to | |
998 consider the ISO Latin character sets to be disjoint. This manifests | |
999 itself when a user enters characters using input methods associated with | |
1000 different coded character sets into a single buffer. | |
1001 | |
1002 There are two problems stemming from this design. First, Mule | |
1003 represents the same character in different ways. Abstractly, 'ó' | |
1004 (LATIN SMALL LETTER O WITH ACUTE) can get represented as | |
1005 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like | |
1006 'óó' in the display might actually be represented [latin-iso8859-1 | |
1007 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B | |
1008 #xF3 ESC - A] in the file. In some cases this treatment would be | |
1009 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 | |
1010 (the CJK ideographic character meaning ``one'')), and although arguably | |
1011 incorrect it is convenient when mixing the CJK scripts. But in the case | |
1012 of the Latin scripts this is wrong. | |
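
The duplicate representation described above can be illustrated with a toy model. This is only a sketch in Python, not Mule's actual internal format: the pair layout, the @code{CODECS} table, and the @code{to_unicode} helper are assumptions made for illustration. Each character is modelled as a (charset, 7-bit code) pair, and the file octet is assumed to be the code with the high bit set.

```python
# Toy model only (an assumption, not Mule's real internals): a character
# is a pair (charset name, 7-bit code), as in [latin-iso8859-1 #x73].
CODECS = [
    ("latin-iso8859-1", "iso8859-1"),
    ("latin-iso8859-2", "iso8859-2"),
]


def to_unicode(char):
    """Return the abstract character that a (charset, code) pair denotes."""
    charset, code = char
    codec = dict(CODECS)[charset]
    # The Latin charset sits in the GR half, so the octet has the high bit set.
    return bytes([code | 0x80]).decode(codec)


a = ("latin-iso8859-1", 0x73)
b = ("latin-iso8859-2", 0x73)

print(a == b)                          # False: the pairs differ
print(to_unicode(a) == to_unicode(b))  # True: both denote o with acute
```

The two pairs compare unequal even though they denote the same letter, which is the mismatch that unification is meant to paper over.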
1013 | |
1014 Worse yet, it is very likely to occur when mixing ``different'' encodings | |
1015 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code | |
1016 points that are almost never used. A very important example involves | |
1017 email. Many sites, especially in the U.S., default to use of the ISO | |
1018 8859/1 coded character set (also called ``Latin 1,'' though these are | |
1019 somewhat different concepts). However, ISO 8859/1 provides a generic | |
1020 CURRENCY SIGN character. Now that the Euro has become the official | |
1021 currency of most countries in Europe, this is unsatisfactory (and in | |
1022 practice, useless). So Europeans generally use ISO 8859/15, which is | |
1023 nearly identical to ISO 8859/1 for most languages, except that it | |
1024 substitutes EURO SIGN for CURRENCY SIGN. | |
1025 | |
1026 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
1027 into a message composition buffer, and enters some text including the | |
1028 Euro sign. Then Mule will consider the buffer to contain both ISO | |
1029 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
1030 programmed) send the message as a multipart mixed MIME body! | |
1031 | |
1032 This is clearly stupid. What is not as obvious is that, just as any | |
1033 European can include American English in their text because ASCII is a | |
1034 subset of ISO 8859/15, most European languages which use Latin | |
1035 characters (e.g., German and Polish) can typically be mixed while using | |
1036 only one Latin coded character set (in the case of German and Polish, | |
1037 ISO 8859/2). However, this often depends on exactly what text is to be | |
1038 encoded (even for the same pair of languages). | |
1039 | |
1040 Unification works around the problem by converting as many characters as | |
1041 possible to use a single Latin coded character set before saving the | |
1042 buffer. | |
1043 | |
1044 Because the problem is rarely noticeable in editing a buffer, but tends | |
1045 to manifest when that buffer is exported to a file or process, | |
1046 unification uses the strategy of examining the buffer prior to export. | |
1047 If use of multiple Latin coded character sets is detected, unification | |
1048 attempts to unify them by finding a single coded character set which | |
1049 contains all of the Latin characters in the buffer. | |
1050 | |
1051 The primary purpose of unification is to fix the problem by giving the | |
1052 user the choice to change the representation of all characters to one | |
1053 character set and giving sensible recommendations based on context. In | |
1054 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and | |
1055 both will be suggested. In the EURO SIGN example, only ISO 8859/15 | |
1056 makes sense, and that is what will be recommended. In both cases, the | |
1057 user will be reminded that there are universal encodings available. | |
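
The search for a single coded character set that covers the whole buffer can be sketched as follows. This is a minimal Python illustration, not the algorithm the unification package actually uses: the function name @code{feasible_charsets} and the short candidate list are assumptions, and Python's codecs stand in for Mule charsets.

```python
# Hypothetical candidate list; the real package considers more charsets.
CANDIDATES = ["iso8859-1", "iso8859-2", "iso8859-15"]


def feasible_charsets(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent every character."""
    feasible = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this charset
        feasible.append(name)
    return feasible


# o with acute alone: every candidate works.
print(feasible_charsets("\u00f3"))
# o with acute plus EURO SIGN: prints ['iso8859-15']
print(feasible_charsets("\u00f3\u20ac"))
# German and Polish words together: prints ['iso8859-2']
print(feasible_charsets("Gr\u00f6\u00dfe \u0142\u00f3d\u017a"))
```

When the returned list is empty, no single Latin charset in the candidate list suffices, and a universal encoding is the only safe choice.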
1058 | |
1059 I call this @dfn{remapping} (from the universal character set to a | |
1060 particular ISO 8859 coded character set). It is mere accident that this | |
1061 letter has the same code point in both character sets. (Not entirely, | |
1062 but there are many examples of Latin characters that have different code | |
1063 points in different Latin-X sets.) | |
1064 | |
1065 Note that, in the 'ó' example, treating the buffer in this way will | |
1066 result in a representation such as [latin-iso8859-2 | |
1067 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. | |
1068 This is guaranteed to occasionally result in the second problem you | |
1069 observed, to which we now turn. | |
1070 | |
1071 This problem is that, although the file is intended to be an | |
1072 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX | |
1073 compliant program---this is required by the standard, obvious if you | |
1074 think a bit, @pxref{What Unification Cannot Do for You}) will read that | |
1075 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this | |
1076 is no problem if all of the characters in the file are contained in ISO | |
1077 8859/1, but suppose there are some which are not, but are contained in | |
1078 the (intended) ISO 8859/2. | |
1079 | |
1080 You now want to fix this, but not by finding the same character in | |
1081 another set. Instead, you want to simply change the character set | |
1082 that Mule associates with that buffer position without changing the | |
1083 code. (This is conceptually somewhat distinct from the first problem, | |
1084 and logically ought to be handled in the code that defines coding | |
1085 systems. However, unification is not an unreasonable place for it.) | |
1086 Unification provides two functions (one fast and dangerous, the other | |
1087 @c #### fix latin-unity.texi | |
1088 slower and careful) to handle this. I call this @dfn{recoding}, because | |
1089 the transformation actually involves @emph{encoding} the buffer to | |
1090 file representation, then @emph{decoding} it to buffer representation | |
1091 (in a different character set). This cannot be done automatically | |
1092 because Mule can have no idea what the correct encoding is---after | |
1093 all, it already gave you its best guess. @xref{What Unification | |
1094 Cannot Do for You}. So these functions must be invoked by the user. | |
1095 @xref{Interactive Usage}. | |
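
The encode-then-decode round trip behind recoding can be sketched outside of Mule. The following Python fragment is only an illustration under stated assumptions: the @code{recode} helper is hypothetical and is not one of the two functions unification provides, and Python's codecs stand in for Mule coding systems.

```python
def recode(text, wrong, intended):
    """Re-encode text to the octets it was read from, then decode them
    in the encoding that was actually intended."""
    return text.encode(wrong).decode(intended)


# A Polish word saved as ISO 8859/2 ...
octets = "\u0142\u00f3d\u017a".encode("iso8859-2")
# ... but read back in an ISO 8859/1 locale: the octets are unchanged,
# only the characters attached to them are wrong.
misread = octets.decode("iso8859-1")
fixed = recode(misread, "iso8859-1", "iso8859-2")

print(misread == "\u00b3\u00f3d\u00bc")   # True: superscript 3 and 1/4 appear
print(fixed == "\u0142\u00f3d\u017a")     # True: the original word is back
```

Note that the caller has to name both encodings; nothing in the octets themselves says which reading is the right one.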
1096 | |
1097 | |
1098 @node What Unification Cannot Do for You, , Unification Theory, Unification | |
1099 @subsection What Unification Cannot Do for You | |
1100 | |
1101 Unification @strong{cannot} save you if you insist on exporting data in | |
1102 8-bit encodings in a multilingual environment. @emph{You will | |
1103 eventually corrupt data if you do this.} It is not Mule's, or any | |
1104 application's, fault. You will have only yourself to blame; consider | |
1105 yourself warned. (It is true that Mule has bugs, which make Mule | |
1106 somewhat more dangerous and inconvenient than some naive applications. | |
1107 We're working to address those, but no application can remedy the | |
1108 inherent defect of 8-bit encodings.) | |
1109 | |
1110 Use standard universal encodings, preferably Unicode (UTF-8) unless | |
1111 applicable standards indicate otherwise. The most important such case | |
1112 is Internet messages, where MIME should be used, whether or not the | |
1113 subordinate encoding is a universal encoding. (Note that since one of | |
1114 the important provisions of MIME is the @samp{Content-Type} header, | |
1115 which has the charset parameter, MIME is to be considered a universal | |
1116 encoding for the purposes of this manual. Of course, technically | |
1117 speaking it's neither a coded character set nor a coding extension | |
1118 technique compliant with ISO 2022.) | |
1119 | |
1120 As mentioned earlier, the problem is that standard encodings suffer from | |
1121 the design defect that they do not provide a reliable way to recognize | |
1122 which coded character sets are in use. There are scores of character | |
1123 sets which can be represented by a single octet (8-bit byte), whose | |
1124 union contains many hundreds of characters. Thus any 8-bit coded | |
1125 character set must contain characters that share code points used for | |
1126 different characters in other coded character sets. | |
1127 | |
1128 This means that a given file's intended encoding cannot be identified | |
1129 with 100% reliability unless it contains encoding markers such as those | |
1130 provided by MIME or ISO 2022. | |
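
The ambiguity can be seen by decoding the same three octets under several 8-bit encodings. This is a small Python sketch, with the encoding list and the @code{readings} helper chosen purely for illustration.

```python
ENCODINGS = ["iso8859-1", "iso8859-2", "iso8859-15"]


def readings(octets):
    """Decode the same octets under each encoding in ENCODINGS."""
    return [(name, octets.decode(name)) for name in ENCODINGS]


# Octets #xA4 #xB1 #xF3 decode without error under all three encodings.
for name, text in readings(bytes([0xA4, 0xB1, 0xF3])):
    print(name, [hex(ord(c)) for c in text])
```

Each encoding accepts the octets, yet the first octet is CURRENCY SIGN in two of them and EURO SIGN in ISO 8859/15, and the second is PLUS-MINUS SIGN everywhere except ISO 8859/2, where it is LATIN SMALL LETTER A WITH OGONEK. Only the third octet reads the same in all three.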
1131 | |
1132 Unification actually makes it more likely that you will have problems of | |
1133 this kind. Traditionally Mule has been ``helpful'' by simply using an | |
1134 ISO 2022 universal coding system when the current buffer coding system | |
1135 cannot handle all the characters in the buffer. This has the effect | |
1136 that, because the file contains control sequences, it is not recognized | |
1137 as being in the locale's normal 8-bit encoding. It may be annoying if | |
1138 @c #### fix in latin-unity.texi | |
1139 you are not a Mule expert, but your data is guaranteed to be recoverable | |
1140 with a tool you already have: Mule. | |
1141 | |
1142 However, with unification, Mule converts to a single 8-bit character set | |
1143 when possible. But typically this will @emph{not} be in your usual | |
1144 locale. I.e., the time that an ISO 8859/1 user will need unification is | |
1145 when there are ISO 8859/2 characters in the buffer. But then most | |
1146 likely the file will be saved in a pure 8-bit encoding that is not ISO | |
1147 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the | |
1148 most sophisticated yet available) cannot tell the difference between ISO | |
1149 8859/1 and ISO 8859/2, and in a Western European locale will choose the | |
1150 former even though the latter was intended. Even the extension | |
1151 @c #### fix in latin-unity.texi | |
1152 (``statistical recognition'') planned for XEmacs 22 is unlikely to be | |
1153 acceptably accurate in the case of mixed codes. | |
1154 | |
1155 So now consider adding some additional ISO 8859/1 text to the buffer. | |
1156 If it includes any ISO 8859/1 codes that are used by different | |
1157 characters in ISO 8859/2, you now have a file that cannot be | |
1158 mechanically disentangled. You need a human being who can recognize | |
1159 that @emph{this is German and Swedish} and stays in Latin-1, while | |
1160 @emph{that is Polish} and needs to be recoded to Latin-2. | |
1161 | |
1162 Moral: switch to a universal coded character set, preferably Unicode | |
1163 using the UTF-8 transformation format. If you really need the space, | |
1164 compress your files. | |
1165 | |
1166 | |
1167 @node Specify Coding, Charsets and Coding Systems, Unification, Mule | |
431 @section Specifying a Coding System | 1168 @section Specifying a Coding System |
432 | 1169 |
433 In cases where XEmacs does not automatically choose the right coding | 1170 In cases where XEmacs does not automatically choose the right coding |
434 system, you can use these commands to specify one: | 1171 system, you can use these commands to specify one: |
435 | 1172 |
547 using that coding system for all file operations. This makes it | 1284 using that coding system for all file operations. This makes it |
548 possible to use non-Latin-1 characters in file names---or, at least, | 1285 possible to use non-Latin-1 characters in file names---or, at least, |
549 those non-Latin-1 characters which the specified coding system can | 1286 those non-Latin-1 characters which the specified coding system can |
550 encode. By default, this variable is @code{nil}, which implies that you | 1287 encode. By default, this variable is @code{nil}, which implies that you |
551 cannot use non-Latin-1 characters in file names. | 1288 cannot use non-Latin-1 characters in file names. |
1289 | |
1290 | |
1291 @node Charsets and Coding Systems, , Specify Coding, Mule | |
1292 @section Charsets and Coding Systems | |
1293 | |
1294 This section provides reference lists of Mule charsets and coding | |
1295 systems. Mule charsets are typically named by character set and | |
1296 standard. | |
1297 | |
1298 @table @strong | |
1299 @item ASCII variants | |
1300 | |
1301 Identification of equivalent characters in these sets is not properly | |
1302 implemented. Unification does not distinguish the two charsets. | |
1303 | |
1304 @samp{ascii} @samp{latin-jisx0201} | |
1305 | |
1306 @item Extended Latin | |
1307 | |
1308 Characters from the following ISO 2022 conformant charsets are | |
1309 identified with equivalents in other charsets in the group by | |
1310 unification. | |
1311 | |
1312 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2} | |
1313 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9} | |
1314 @samp{latin-iso8859-13} @samp{latin-iso8859-16} | |
1315 | |
1316 The following charsets are Latin variants which are not understood by | |
1317 unification. In addition, many of the Asian language standards provide | |
1318 ASCII, at least, and sometimes other Latin characters. None of these | |
1319 are identified with their ISO 8859 equivalents. | |
1320 | |
1321 @samp{vietnamese-viscii-lower} | |
1322 @samp{vietnamese-viscii-upper} | |
1323 | |
1324 @item Other character sets | |
1325 | |
1326 @samp{arabic-1-column} | |
1327 @samp{arabic-2-column} | |
1328 @samp{arabic-digit} | |
1329 @samp{arabic-iso8859-6} | |
1330 @samp{chinese-big5-1} | |
1331 @samp{chinese-big5-2} | |
1332 @samp{chinese-cns11643-1} | |
1333 @samp{chinese-cns11643-2} | |
1334 @samp{chinese-cns11643-3} | |
1335 @samp{chinese-cns11643-4} | |
1336 @samp{chinese-cns11643-5} | |
1337 @samp{chinese-cns11643-6} | |
1338 @samp{chinese-cns11643-7} | |
1339 @samp{chinese-gb2312} | |
1340 @samp{chinese-isoir165} | |
1341 @samp{cyrillic-iso8859-5} | |
1342 @samp{ethiopic} | |
1343 @samp{greek-iso8859-7} | |
1344 @samp{hebrew-iso8859-8} | |
1345 @samp{ipa} | |
1346 @samp{japanese-jisx0208} | |
1347 @samp{japanese-jisx0208-1978} | |
1348 @samp{japanese-jisx0212} | |
1349 @samp{katakana-jisx0201} | |
1350 @samp{korean-ksc5601} | |
1351 @samp{sisheng} | |
1352 @samp{thai-tis620} | |
1353 @samp{thai-xtis} | |
1354 | |
1355 @item Non-graphic charsets | |
1356 | |
1357 @samp{control-1} | |
1358 @end table | |
1359 | |
1360 @table @strong | |
1361 @item No conversion | |
1362 | |
1363 Some of these coding systems may specify EOL conventions. Note that | |
1364 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 | |
1365 coding system. Although unification attempts to compensate for this, it | |
1366 is possible that the @samp{iso-8859-1} coding system will behave | |
1367 differently from other ISO 8859 coding systems. | |
1368 | |
1369 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} | |
1370 | |
1371 @item Latin coding systems | |
1372 | |
1373 These coding systems are all single-byte, 8-bit ISO 2022 coding systems, | |
1374 combining ASCII in the GL register (bytes with high-bit clear) and an | |
1375 extended Latin character set in the GR register (bytes with high-bit set). | |
1376 | |
1377 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} | |
1378 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} | |
1379 | |
1380 These coding systems are single-byte, 8-bit coding systems that do not | |
1381 conform to international standards. They should be avoided in all | |
1382 potentially multilingual contexts, including any text distributed over | |
1383 the Internet and World Wide Web. | |
1384 | |
1385 @samp{windows-1251} | |
1386 | |
1387 @item Multilingual coding systems | |
1388 | |
1389 The following ISO-2022-based coding systems are useful for multilingual | |
1390 text. | |
1391 | |
1392 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} | |
1393 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} | |
1394 | |
1395 XEmacs also supports Unicode with the Mule-UCS package. These are the | |
1396 preferred coding systems for multilingual use. (There is a possible | |
1397 exception for texts that mix several Asian ideographic character sets.) | |
1398 | |
1399 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} | |
1400 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} | |
1401 @samp{utf-8} @samp{utf-8-ws} | |
1402 | |
1403 Development versions of XEmacs (the 21.5 series) support Unicode | |
1404 internally, with (at least) the following coding systems implemented: | |
1405 | |
1406 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} | |
1407 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} | |
1408 | |
1409 @item Asian ideographic languages | |
1410 | |
1411 The following coding systems are based on ISO 2022, and are more or less | |
1412 suitable for encoding multilingual texts. They all can represent ASCII | |
1413 at least, and sometimes several other foreign character sets, without | |
1414 resort to arbitrary ISO 2022 designations. However, these subsets are | |
1415 not identified with the corresponding national standards in XEmacs Mule. | |
1416 | |
1417 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} | |
1418 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} | |
1419 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} | |
1420 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} | |
1421 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} | |
1422 | |
1423 The following coding systems cannot be used for general multilingual | |
1424 text and do not cooperate well with other coding systems. | |
1425 | |
1426 @samp{big5} @samp{shift_jis} | |
1427 | |
1428 @item Other languages | |
1429 | |
1430 The following coding systems are based on ISO 2022. Though none of them | |
1431 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up | |
1432 to 21.4 defaults to) use of ISO 2022 control sequences to designate | |
1433 other character sets for inclusion in the text. | |
1434 | |
1435 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8} | |
1436 @samp{ctext-hebrew} | |
1437 | |
1438 The following are coding systems that do not conform to ISO 2022 and | |
1439 thus cannot be safely used in a multilingual context. | |
1440 | |
1441 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr} | |
1442 @samp{viscii} @samp{vscii} | |
1443 | |
1444 @item Special coding systems | |
1445 | |
1446 Mule uses the following coding systems for special purposes. | |
1447 | |
1448 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted} | |
1449 | |
1450 @samp{escape-quoted} is especially important, as it is used internally | |
1451 as the coding system for autosaved data. | |
1452 | |
1453 The following coding systems are aliases for others, and are used for | |
1454 communication with the host operating system. | |
1455 | |
1456 @samp{file-name} @samp{keyboard} @samp{terminal} | |
1457 | |
1458 @end table | |
1459 | |
1460 Mule detection of coding systems is actually limited to detection of | |
1461 classes of coding systems called @dfn{coding categories}. These coding | |
1462 categories are identified by the ISO 2022 control sequences they use, if | |
1463 any, by their conformance to ISO 2022 restrictions on code points that | |
1464 may be used, and by characteristic patterns of use of 8-bit code points. | |
1465 | |
1466 @samp{no-conversion} | |
1467 @samp{utf-8} | |
1468 @samp{ucs-4} | |
1469 @samp{iso-7} | |
1470 @samp{iso-lock-shift} | |
1471 @samp{iso-8-1} | |
1472 @samp{iso-8-2} | |
1473 @samp{iso-8-designate} | |
1474 @samp{shift-jis} | |
1475 @samp{big5} | |
1476 | |
1477 |