comparison man/xemacs/mule.texi @ 1183:c1553814932e

[xemacs-hg @ 2003-01-03 12:12:30 by stephent] various docs <873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp> <87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Fri, 03 Jan 2003 12:12:40 +0000
parents 26f7cf2a4792
children 6b0000935adc
1182:7d696106ffe9 1183:c1553814932e
13 @cindex IPA 13 @cindex IPA
14 @cindex Japanese 14 @cindex Japanese
15 @cindex Korean 15 @cindex Korean
16 @cindex Cyrillic 16 @cindex Cyrillic
17 @cindex Russian 17 @cindex Russian
18 @c #### It's a lie that this file tells you about Unicode....
19 @cindex Unicode
18 If you build XEmacs using the @code{--with-mule} option, it supports a 20 If you build XEmacs using the @code{--with-mule} option, it supports a
19 wide variety of world scripts, including the Latin script, the Arabic 21 wide variety of world scripts, including the Latin script, the Arabic
20 script, Simplified Chinese (for mainland of China), Traditional Chinese 22 script, Simplified Chinese (for mainland of China), Traditional Chinese
21 (for Taiwan and Hong-Kong), the Greek script, the Hebrew script, IPA 23 (for Taiwan and Hong-Kong), the Greek script, the Hebrew script, IPA
22 symbols, Japanese scripts (Hiragana, Katakana and Kanji), Korean scripts 24 symbols, Japanese scripts (Hiragana, Katakana and Kanji), Korean scripts
31 * Input Methods:: Entering text characters not on your keyboard. 33 * Input Methods:: Entering text characters not on your keyboard.
32 * Select Input Method:: Specifying your choice of input methods. 34 * Select Input Method:: Specifying your choice of input methods.
33 * Coding Systems:: Character set conversion when you read and 35 * Coding Systems:: Character set conversion when you read and
34 write files, and so on. 36 write files, and so on.
35 * Recognize Coding:: How XEmacs figures out which conversion to use. 37 * Recognize Coding:: How XEmacs figures out which conversion to use.
38 * Unification:: Integrating overlapping character sets.
36 * Specify Coding:: Various ways to choose which conversion to use. 39 * Specify Coding:: Various ways to choose which conversion to use.
40 * Charsets and Coding Systems:: Tables and other reference material.
37 @end menu 41 @end menu
38 42
39 @node Mule Intro, Language Environments, Mule, Mule 43 @node Mule Intro, Language Environments, Mule, Mule
40 @section Introduction to world scripts 44 @section Introduction: The Wide Variety of Scripts and Codings in Use
41 45
42 The users of these scripts have established many more-or-less standard 46 There are hundreds of scripts in use world-wide. The users of these
43 coding systems for storing files. 47 scripts have established many more-or-less standard coding systems for
44 @c XEmacs internally uses a single multibyte character encoding, so that it 48 storing text written in them in files. XEmacs translates between its
45 @c can intermix characters from all these scripts in a single buffer or 49 internal character encoding and various other coding systems when
46 @c string. This encoding represents each non-ASCII character as a sequence 50 reading and writing files, when exchanging data with subprocesses, and
47 @c of bytes in the range 0200 through 0377. 51 (in some cases) in the @kbd{C-q} command (see below).
48 XEmacs translates between the internal character encoding and various 52 @footnote{Historically the internal encoding was a specially designed
49 other coding systems when reading and writing files, when exchanging 53 encoding, called @dfn{Mule encoding}, intended for easy conversion to
50 data with subprocesses, and (in some cases) in the @kbd{C-q} command 54 and from versions of ISO 2022. However, this encoding shares many
51 (see below). 55 properties with UTF-8, and conversion to UTF-8 as the internal code is
56 proposed.}
52 57
53 @kindex C-h h 58 @kindex C-h h
54 @findex view-hello-file 59 @findex view-hello-file
55 The command @kbd{C-h h} (@code{view-hello-file}) displays the file 60 The command @kbd{C-h h} (@code{view-hello-file}) displays the file
56 @file{etc/HELLO}, which shows how to say ``hello'' in many languages. 61 @file{etc/HELLO}, which shows how to say ``hello'' in many languages.
354 non-Latin-1 characters stored with the internal XEmacs encoding. It 359 non-Latin-1 characters stored with the internal XEmacs encoding. It
355 handles end-of-line conversion based on the data encountered, and has 360 handles end-of-line conversion based on the data encountered, and has
356 the usual three variants to specify the kind of end-of-line conversion. 361 the usual three variants to specify the kind of end-of-line conversion.
357 362
358 363
359 @node Recognize Coding, Specify Coding, Coding Systems, Mule 364 @node Recognize Coding, Unification, Coding Systems, Mule
360 @section Recognizing Coding Systems 365 @section Recognizing Coding Systems
361 366
362 Most of the time, XEmacs can recognize which coding system to use for 367 Most of the time, XEmacs can recognize which coding system to use for
363 any given file--once you have specified your preferences. 368 any given file--once you have specified your preferences.
364 369
425 a different coding system, you can specify a different coding system for 430 a different coding system, you can specify a different coding system for
426 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify 431 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify
427 Coding}). 432 Coding}).
428 433
429 434
430 @node Specify Coding, , Recognize Coding, Mule 435 @node Unification, Specify Coding, Recognize Coding, Mule
436 @section Character Set Unification
437
438 Mule suffers from a design defect that causes it to consider the ISO
439 Latin character sets to be disjoint. This results in oddities such as
440 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
441 2022 control sequences to switch between them, as well as more
442 plausible but often unnecessary combinations like ISO 8859/1 with ISO
443 8859/2. This can be very annoying when sending messages or even in
444 simple editing on a single host. XEmacs works around the problem by
445 converting as many characters as possible to use a single Latin coded
446 character set before saving the buffer.
447
448 Unification is planned for extension to other character set families,
449 in particular the Han family of character sets based on the Chinese
450 ideographic characters. At least for the Han sets, however, the
451 unification feature will be disabled by default.
452
453 This functionality is based on the @file{latin-unity} package by
454 Stephen Turnbull @email{stephen@@xemacs.org}, but is somewhat
455 divergent. This documentation is also based on the package
456 documentation, and is likely to be inaccurate because of the different
457 constraints we place on ``core'' and packaged functionality.
458
459 @menu
460 * Unification Overview:: History and general information.
461 * Unification Usage:: An overview of operation.
462 * Unification Configuration:: Configuring unification.
463 * Unification FAQs:: Questions and answers from the mailing list.
464 * Unification Theory:: How unification works.
465 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets.
466 @end menu
467
468 @node Unification Overview, Unification Usage, Unification, Unification
469 @subsection An Overview of Character Set Unification
470
471 Mule suffers from a design defect that causes it to consider the ISO
472 Latin character sets to be disjoint. This manifests itself when a user
473 enters characters using input methods associated with different coded
474 character sets into a single buffer.
475
476 A very important example involves email. Many sites, especially in the
477 U.S., default to use of the ISO 8859/1 coded character set (also called
478 ``Latin 1,'' though these are somewhat different concepts). However,
479 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the
480 Euro has become the official currency of most countries in Europe, this
481 is unsatisfactory (and in practice, useless). So Europeans generally
482 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
483 languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
484
485 Suppose a European user yanks text from a post encoded in ISO 8859/1
486 into a message composition buffer, and enters some text including the
487 Euro sign. Then Mule will consider the buffer to contain both ISO
488 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
489 programmed) send the message as a multipart mixed MIME body!
490
491 This is clearly stupid. What is not as obvious is that, just as any
492 European can include American English in their text because ASCII is a
493 subset of ISO 8859/15, most European languages which use Latin
494 characters (e.g., German and Polish) can typically be mixed while using
495 only one Latin coded character set (in this case, ISO 8859/2). However,
496 this often depends on exactly what text is to be encoded.
497
498 Unification works around the problem by converting as many characters as
499 possible to use a single Latin coded character set before saving the
500 buffer.
501
502
503 @node Unification Usage, Unification Configuration, Unification Overview, Unification
504 @subsection Operation of Unification
505
506 This is a description of the early hack to include unification in
507 XEmacs 21.5. This will almost surely change.
508
509 Normally, unification works in the background by installing
510 @code{unity-sanity-check} on @code{write-region-pre-hook}.
511 Unification is on by default for the ISO-8859 Latin sets. The user
512 activates this functionality for other character set families by
513 invoking @code{enable-unification}, either interactively or in her
514 init file. @xref{Init File, , , xemacs}. Unification can be
515 deactivated by invoking @code{disable-unification}.
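
As a minimal sketch, assuming an XEmacs 21.5 build in which these
functions are defined, an init file could enable unification for the
other character set families as follows.  The @code{fboundp} guard is an
addition for illustration, so that the same init file still loads in
builds without the feature.

@example
;; Enable unification beyond the default ISO 8859 Latin sets.
;; The guard is illustrative; it skips the call if the function is missing.
(if (fboundp 'enable-unification)
    (enable-unification))
@end example

To turn the feature off again for the current session, evaluate
@code{(disable-unification)}.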
516
517 Unification also provides a few functions for remapping or recoding the
518 buffer by hand. To @dfn{remap} a character means to change the buffer
519 representation of the character by using another coded character set.
520 Remapping never changes the identity of the character, but may involve
521 altering the code point of the character. To @dfn{recode} a character
522 means to simply change the coded character set. Recoding never alters
523 the code point of the character, but may change the identity of the
524 character. @xref{Unification Theory}.
525
526 There are a few variables which determine which coding systems are
527 always acceptable to unification: @code{unity-ucs-list},
528 @code{unity-preferred-coding-system-list}, and
529 @code{unity-preapproved-coding-system-list}. The last defaults to
530 @code{(buffer-default preferred)}, and you should probably avoid changing it
531 because it short-circuits the sanity check. If you find you need to
532 use it, consider reporting it as a bug or request for enhancement.
533
534 @menu
535 * Basic Functionality:: User interface and customization.
536 * Interactive Usage:: Treating text by hand.
537 Also documents the hook function(s).
538 @end menu
539
540
541 @node Basic Functionality, Interactive Usage, , Unification Usage
542 @subsubsection Basic Functionality
543
544 These functions and user options initialize and configure unification.
545 In normal use, they are not needed.
546
547 @strong{These interfaces will change. Also, the @samp{unity-} prefix
548 is likely to be changed for many of the variables and functions, as
549 they are of more general usefulness.}
550
551 @defun enable-unification
552 Set up hooks and initialize variables for unification.
553
554 There are no arguments.
555
556 This function is idempotent. It will reinitialize any hooks or variables
557 that are not in initial state.
558 @end defun
559
560 @defun disable-unification
561 There are no arguments.
562
563 Clean up hooks and void variables used by unification.
564 @end defun
565
566 @c #### several changes should go to latin-unity.texi
567 @defopt unity-ucs-list
568 List of universal coding systems recommended for character set unification.
569
570 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
571
572 Order matters; coding systems earlier in the list will be preferred when
573 recommending a coding system. These coding systems will not be used
574 without querying the user (unless they are also present in
575 @code{unity-preapproved-coding-system-list}), and follow the
576 @code{unity-preferred-coding-system-list} in the list of suggested
577 coding systems.
578
579 If none of the preferred coding systems are feasible, the first in
580 this list will be the default.
581
582 Notes on certain coding systems: @code{escape-quoted} is a special
583 coding system used for autosaves and compiled Lisp in Mule. You should
584 never delete this, although it is rare that a user would want to use it
585 directly. Unification does not try to be ``smart'' about other general
586 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized
587 as equivalent to @code{iso-2022-7}.) If your preferred coding system is
588 one of these, you may consider adding it to @code{unity-ucs-list}.
589 @end defopt
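
For example, a user whose preferred coding system is ISO-2022-JP might
put something like the following in the init file.  This is only a
sketch: it assumes that @code{iso-2022-jp} is a coding system defined in
your XEmacs, and the position chosen for it in the list is arbitrary.

@example
;; Put iso-2022-jp right after utf-8 and keep the defaults,
;; in particular escape-quoted, which must stay in the list.
(setq unity-ucs-list
      '(utf-8 iso-2022-jp iso-2022-7 ctext escape-quoted))
@end example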
590
591 Coding systems which are not Latin and not in
592 @code{unity-ucs-list} are handled by short circuiting checks of
593 coding system against the next two variables.
594
595 @defopt unity-preapproved-coding-system-list
596 List of coding systems used without querying the user if feasible.
597
598 The default value is @samp{(buffer-default preferred)}.
599
600 The first feasible coding system in this list is used. The special values
601 @samp{preferred} and @samp{buffer-default} may be present:
602
603 @table @code
604 @item buffer-default
605 Use the coding system used by @samp{write-region}, if feasible.
606
607 @item preferred
608 Use the coding system specified by @samp{prefer-coding-system} if feasible.
609 @end table
610
611 ``Feasible'' means that all characters in the buffer can be represented by
612 the coding system. Coding systems in @samp{unity-ucs-list} are
613 always considered feasible. Other feasible coding systems are computed
614 by @samp{unity-representations-feasible-region}.
615
616 Note that, by definition, the first universal coding system in this
617 list shadows all other coding systems. In particular, if your
618 preferred coding system is a universal coding system, and
619 @code{preferred} is a member of this list, unification will blithely
620 convert all your files to that coding system. This is considered a
621 feature, but it may surprise most users. Users who don't like this
622 behavior may put @code{preferred} in
623 @code{unity-preferred-coding-system-list}, but not in
624 @code{unity-preapproved-coding-system-list}.
625 @end defopt
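
A sketch of the arrangement described in the last paragraph, assuming
the default values documented here: @code{preferred} is left out of the
preapproved list and placed at the head of the suggested list, so that
the preferred coding system is offered but never applied silently.

@example
;; Only the buffer's own coding system is used without asking.
(setq unity-preapproved-coding-system-list '(buffer-default))
;; The preferred coding system is suggested first, then the defaults.
(setq unity-preferred-coding-system-list
      '(preferred iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
        iso-8859-4 iso-8859-9))
@end example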
626
627
628 @defopt unity-preferred-coding-system-list
629 List of coding systems suggested to the user if feasible.
630
631 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
632 iso-8859-4 iso-8859-9)}.
633
634 If none of the coding systems in
635 @samp{unity-preapproved-coding-system-list} are feasible, this list
636 will be recommended to the user, followed by the
637 @samp{unity-ucs-list} (so those coding systems should not be in
638 this list). The first coding system in this list is default. The
639 special values @samp{preferred} and @samp{buffer-default} may be
640 present:
641
642 @table @code
643 @item buffer-default
644 Use the coding system used by @samp{write-region}, if feasible.
645
646 @item preferred
647 Use the coding system specified by @samp{prefer-coding-system} if feasible.
648 @end table
649
650 ``Feasible'' means that all characters in the buffer can be represented by
651 the coding system. Coding systems in @samp{unity-ucs-list} are
652 always considered feasible. Other feasible coding systems are computed
653 by @samp{unity-representations-feasible-region}.
654 @end defopt
655
656
657 @defvar unity-iso-8859-1-aliases
658 List of coding systems to be treated as aliases of ISO 8859/1.
659
660 The default value is @code{'(iso-8859-1)}.
661
662 This is not a user variable; to customize input of coding systems or
663 charsets, use @samp{unity-coding-system-alias-alist} or
664 @samp{unity-charset-alias-alist}.
665 @end defvar
666
667
668 @node Interactive Usage, , Basic Functionality, Unification Usage
669 @subsubsection Interactive Usage
670
671 First, the hook function @code{unity-sanity-check} is documented.
672 (It is placed here because it is not an interactive function, and there
673 is not yet a programmer's section of the manual.)
674
675 These functions provide access to internal functionality (such as the
676 remapping function) and to extra functionality (the recoding functions
677 and the test function).
678
679 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system
680
681 Check if @var{coding-system} can represent all characters between
682 @var{begin} and @var{end}.
683
684 For compatibility with old broken versions of @code{write-region},
685 @var{coding-system} defaults to @code{buffer-file-coding-system}.
686 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are
687 ignored.
688
689 Return nil if buffer-file-coding-system is not (ISO-2022-compatible)
690 Latin. If @code{buffer-file-coding-system} is safe for the charsets
691 actually present in the buffer, return it. Otherwise, ask the user to
692 choose a coding system, and return that.
693
694 This function does @emph{not} do the safe thing when
695 @code{buffer-file-coding-system} is nil (aka no-conversion). It
696 considers that ``non-Latin,'' and passes it on to the Mule detection
697 mechanism.
698
699 This function is intended for use as a @code{write-region-pre-hook}. It
700 does nothing except return @var{coding-system} if @code{write-region}
701 handlers are inhibited.
702 @end defun
703
704 @defun unity-buffer-representations-feasible
705 There are no arguments.
706
707 Apply unity-region-representations-feasible to the current buffer.
708 @end defun
709
710 @defun unity-region-representations-feasible begin end &optional buf
711 Return character sets that can represent the text from @var{begin} to
712 @var{end} in @var{buf}.
713
714 @c #### Fix in latin-unity.texi.
715 @var{buf} defaults to the current buffer. Called interactively, will be
716 applied to the region. The function assumes @var{begin} <= @var{end}.
717
718 The return value is a cons. The car is the list of character sets
719 that can individually represent all of the non-ASCII portion of the
720 buffer, and the cdr is the list of character sets that can
721 individually represent all of the ASCII portion.
722
723 The following is taken from a comment in the source. Please refer to
724 the source to be sure of an accurate description.
725
726 The basic algorithm is to map over the region, compute the set of
727 charsets that can represent each character (the ``feasible charset''),
728 and take the intersection of those sets.
729
730 The current implementation takes advantage of the fact that ASCII
731 characters are common and cannot change asciisets. Then using
732 skip-chars-forward makes motion over ASCII subregions very fast.
733
734 This same strategy could be applied generally by precomputing classes
735 of characters equivalent according to their effect on latinsets, and
736 adding a whole class to the skip-chars-forward string once a member is
737 found.
738
739 Probably efficiency is a function of the number of characters matched,
740 or maybe the length of the match string? With @code{skip-category-forward}
741 over a precomputed category table it should be really fast. In practice
742 for Latin character sets there are only 29 classes.
743 @end defun
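
The intersection step of the basic algorithm can be illustrated
independently of the buffer scan.  The following is a sketch only, not
the implementation: @code{my-intersect-feasible} is a hypothetical
helper, and the per-character feasible sets are written out by hand
rather than computed from Mule's tables.

@example
;; Hypothetical helper, for illustration only.
(defun my-intersect-feasible (feasible-sets)
  "Return the charsets common to every list in FEASIBLE-SETS."
  (let ((result (car feasible-sets))
        (sets (cdr feasible-sets)))
    (while sets
      ;; Keep only the candidates that also occur in the next set.
      (let ((keep nil)
            (candidates result))
        (while candidates
          (if (memq (car candidates) (car sets))
              (setq keep (cons (car candidates) keep)))
          (setq candidates (cdr candidates)))
        (setq result (nreverse keep)))
      (setq sets (cdr sets)))
    result))

;; One feasible set per non-ASCII character in the text.
(my-intersect-feasible
 '((latin-iso8859-1 latin-iso8859-2 latin-iso8859-15) ; o with acute
   (latin-iso8859-15)))                               ; euro sign
@end example

With these hand-written sets the call evaluates to
@code{(latin-iso8859-15)}, the only charset that appears in both lists.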
744
745 @defun unity-remap-region begin end character-set &optional coding-system
746
747 Remap characters between @var{begin} and @var{end} to equivalents in
748 @var{character-set}. Optional argument @var{coding-system} may be a
749 coding system name (a symbol) or nil. Characters with no equivalent are
750 left as-is.
751
752 When called interactively, @var{begin} and @var{end} are set to the
753 beginning and end, respectively, of the active region, and the function
754 prompts for @var{character-set}. The function does completion, knows
755 how to guess a character set name from a coding system name, and also
756 provides some common aliases. See @code{unity-guess-charset}.
757 There is no way to specify @var{coding-system}, as it has no useful
758 function interactively.
759
760 Return @var{coding-system} if @var{coding-system} can encode all
761 characters in the region, t if @var{coding-system} is nil and the coding
762 system with G0 = 'ascii and G1 = @var{character-set} can encode all
763 characters, and otherwise nil. Note that a non-null return does
764 @emph{not} mean it is safe to write the file, only the specified region.
765 (This behavior is useful for multipart MIME encoding and the like.)
766
767 Note: by default this function is quite fascist about universal coding
768 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
769 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change
770 this.
771
772 This function remaps characters that are artificially distinguished by Mule
773 internal code. It may change the code point as well as the character set.
774 To recode characters that were decoded in the wrong coding system, use
775 @code{unity-recode-region}.
776 @end defun
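
As an illustration, the following sketch remaps the whole buffer to ISO
8859/15 and reports the result.  It assumes that the charset
@code{latin-iso8859-15} exists in your XEmacs; the messages are
arbitrary.

@example
;; With no coding system argument, a non-nil return means that ASCII
;; plus the given charset can encode every character in the region.
(if (unity-remap-region (point-min) (point-max) 'latin-iso8859-15)
    (message "All characters fit in ASCII plus ISO 8859/15")
  (message "Some characters have no ISO 8859/15 equivalent"))
@end example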
777
778 @defun unity-recode-region begin end wrong-cs right-cs
779
780 Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
781 to @var{right-cs}.
782
783 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain
784 the same code point but the character set is changed. Only characters
785 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
786 character may change. Note that this could be dangerous, if characters
787 whose identities you do not want changed are included in the region.
788 This function cannot guess which characters you want changed, and which
789 should be left alone.
790
791 When called interactively, @var{begin} and @var{end} are set to the
792 beginning and end, respectively, of the active region, and the function
793 prompts for @var{wrong-cs} and @var{right-cs}. The function does
794 completion, knows how to guess a character set name from a coding system
795 name, and also provides some common aliases. See
796 @code{unity-guess-charset}.
797
798 Another way to accomplish this, but using coding systems rather than
799 character sets to specify the desired recoding, is
800 @samp{unity-recode-coding-region}. That function may be faster
801 but is somewhat more dangerous, because it may recode more than one
802 character set.
803
804 To change from one Mule representation to another without changing identity
805 of any characters, use @samp{unity-remap-region}.
806 @end defun
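
For instance, if a region of Polish text was decoded as ISO 8859/1 when
it was really ISO 8859/2, a call along the following lines would repair
it.  This is a sketch which assumes that the region is active and
contains no genuine ISO 8859/1 characters, since those would be recoded
as well.

@example
;; Code points are kept; only the character set changes.
(unity-recode-region (region-beginning) (region-end)
                     'latin-iso8859-1 'latin-iso8859-2)
@end example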
807
808 @defun unity-recode-coding-region begin end wrong-cs right-cs
809
810 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
811 @var{right-cs}.
812
813 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
814 the same code point but the character set is changed. The identity of
815 characters may change. This is an inherently dangerous function;
816 multilingual text may be recoded in unexpected ways. #### It's also
817 dangerous because the coding systems are not sanity-checked in the
818 current implementation.
819
820 When called interactively, @var{begin} and @var{end} are set to the
821 beginning and end, respectively, of the active region, and the function
822 prompts for @var{wrong-cs} and @var{right-cs}. The function does
823 completion, knows how to guess a coding system name from a character set
824 name, and also provides some common aliases. See
825 @code{unity-guess-coding-system}.
826
827 Another, safer, way to accomplish this, using character sets rather
828 than coding systems to specify the desired recoding, is to use
829 @code{unity-recode-region}.
830
831 To change from one Mule representation to another without changing identity
832 of any characters, use @code{unity-remap-region}.
833 @end defun
834
835 Helper functions for input of coding system and character set names.
836
837 @defun unity-guess-charset candidate
838 Guess a charset based on the symbol @var{candidate}.
839
840 @var{candidate} itself is not tried as the value.
841
842 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
843 the values in @samp{unity-charset-alias-alist}.
844 @end defun
845
846 @defun unity-guess-coding-system candidate
847 Guess a coding system based on the symbol @var{candidate}.
848
849 @var{candidate} itself is not tried as the value.
850
851 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
852 the values in @samp{unity-coding-system-alias-alist}.
853 @end defun
854
855 @defun unity-example
856
857 A cheesy example for unification.
858
859 At present it just makes a multilingual buffer. To test, setq
860 buffer-file-coding-system to some value, make the buffer dirty (eg
861 with RET BackSpace), and save.
862 @end defun
863
864
865 @node Unification Configuration, Unification FAQs, Unification Usage, Unification
866 @subsection Configuring Unification for Use
867
868 If you want unification to be automatically initialized, invoke
869 @samp{enable-unification} with no arguments in your init file.
870 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
871 earlier than 21.1, you should also load @file{auto-autoloads} using the
872 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
873
874 You may wish to define aliases for commonly used character sets and
875 coding systems for convenience in input.
876
877 @defopt unity-charset-alias-alist
878 Alist mapping aliases to Mule charset names (symbols).
879
880 The default value is
881 @example
882 ((latin-1 . latin-iso8859-1)
883 (latin-2 . latin-iso8859-2)
884 (latin-3 . latin-iso8859-3)
885 (latin-4 . latin-iso8859-4)
886 (latin-5 . latin-iso8859-9)
887 (latin-9 . latin-iso8859-15)
888 (latin-10 . latin-iso8859-16))
889 @end example
890
891 If a charset does not exist on your system, it will not complete and you
892 will not be able to enter it in response to prompts. A real charset
893 with the same name as an alias in this list will shadow the alias.
894 @end defopt
895
896 @defopt unity-coding-system-alias-alist
897 Alist mapping aliases to Mule coding system names (symbols).
898
899 The default value is @samp{nil}.
900 @end defopt
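
A sketch of alias definitions for the init file follows.  The alias
names @code{l9} and @code{euro} are hypothetical, chosen only for
illustration.  The example assumes that both variables are already
defined when the init file runs, and that the charset
@code{latin-iso8859-15} and the coding system @code{iso-8859-15} exist
in your XEmacs.

@example
;; Add a charset alias in front of the defaults.
(setq unity-charset-alias-alist
      (cons '(l9 . latin-iso8859-15) unity-charset-alias-alist))
;; Define a coding system alias; the default list is empty.
(setq unity-coding-system-alias-alist '((euro . iso-8859-15)))
@end example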
901
902
903 @node Unification FAQs, Unification Theory, Unification Configuration, Unification
904 @subsection Frequently Asked Questions About Unification
905
906 @enumerate
907 @item
908 I'm smarter than XEmacs's unification feature! How can that be?
909
910 Don't be surprised. Trust yourself.
911
912 Unification is very young as yet. Teach it what you know by
913 Customizing its variables, and report your changes to the maintainer
914 (@kbd{M-x report-xemacs-bug RET}).
915
916 @item
917 What is a UCS?
918
919 According to ISO 10646, a Universal Coded Character Set. In
920 XEmacs, it's a Universal (Mule) Coding System.
921 @xref{Coding Systems, , , xemacs}.
922
923 @item
924 I know @code{utf-16-le-bom} is a UCS, but unification won't use it.
925 Why not?
926
927 There are an awful lot of UCSes in Mule, and you probably do not want to
928 ever use, and definitely not be asked about, most of them. So the
929 default set includes a few that the author thought plausible, but
930 they're surely not comprehensive or optimal.
931
932 Customize @code{unity-ucs-list} to include the ones you use often, and
933 report your favorites to the maintainer for consideration for
934 inclusion in the defaults using @kbd{M-x report-xemacs-bug RET}.
935 (Note that you @emph{must} include @code{escape-quoted} in this list,
936 because Mule uses it internally as the coding system for auto-save
937 files.)
938
939 Alternatively, if you just want to use it this one time, simply type
940 it in at the prompt. Unification will confirm that it is a real coding
941 system, and then assume that you know what you're doing.
942
943 @item
944 This is crazy: I can't quit XEmacs and get queried on autosaves! Why?
945
946 You probably removed @code{escape-quoted} from
947 @code{unity-ucs-list}. Put it back.
948
949 @item
950 Unification is really buggy and I can't get any work done.
951
952 First, use @kbd{M-x disable-unification RET}, then report your
953 problems as a bug (@kbd{M-x report-xemacs-bug RET}).
954 @end enumerate
955
956
957 @node Unification Theory, What Unification Cannot Do for You, Unification FAQs, Unification
958 @subsection Unification Theory
959
960 Standard encodings suffer from the design defect that they do not
961 provide a reliable way to recognize which coded character sets are in use.
962 @xref{What Unification Cannot Do for You}. There are scores of
963 character sets which can be represented by a single octet (8-bit
964 byte), whose union contains many hundreds of characters. Obviously
965 this results in great confusion, since you can't tell the players
966 without a scorecard, and there is no scorecard.
967
968 There are two ways to solve this problem. The first is to create a
969 universal coded character set. This is the concept behind Unicode.
970 However, there have been satisfactory (nearly) universal character
971 sets for several decades, but even today many Westerners resist using
972 Unicode because they consider its space requirements excessive. On
973 the other hand, many Asians dislike Unicode because they consider it
974 to be incomplete. (This is partly, but not entirely, political.)
975
976 In any case, Unicode only solves the internal representation problem.
977 Many data sets will contain files in ``legacy'' encodings, and Unicode
978 does not help distinguish among them.
979
980 The second approach is to embed information about the encodings used in
981 a document in its text. This approach is taken by the ISO 2022
982 standard. This would solve the problem completely from the users' point of
983 view, except that ISO 2022 is basically not implemented at all, in the
984 sense that few applications or systems implement more than a small
985 subset of ISO 2022 functionality. This is due to the fact that
986 mono-literate users object to the presence of escape sequences in their
987 texts (which they, with some justification, consider data corruption).
988 Programmers are more than willing to cater to these users, since
989 implementing ISO 2022 is a painstaking task.
990
991 In fact, Emacs/Mule adopts both of these approaches. Internally it uses
992 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
993 techniques both to save files in forms robust to encoding issues, and as
994 hints when attempting to ``guess'' an unknown encoding. However, Mule
995 suffers from a design defect, namely it embeds the character set
996 information that ISO 2022 attaches to runs of characters by introducing
997 them with a control sequence in each character. That causes Mule to
998 consider the ISO Latin character sets to be disjoint. This manifests
999 itself when a user enters characters using input methods associated with
1000 different coded character sets into a single buffer.
1001
1002 There are two problems stemming from this design. First, Mule
1003 represents the same character in different ways. Abstractly, '@'o'
1004 (LATIN SMALL LETTER O WITH ACUTE) can get represented as
1005 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
1006 '@'o@'o' in the display might actually be represented [latin-iso8859-1
1007 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
1008 #xF3 ESC - A] in the file. In some cases this treatment would be
1009 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
1010 (the CJK ideographic character meaning ``one'')), and although arguably
1011 incorrect it is convenient when mixing the CJK scripts. But in the case
1012 of the Latin scripts this is wrong.
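The following is a minimal sketch in Python, not Mule code: Python's
standard codecs stand in for the Mule charsets, and the variable names
are made up for illustration. It shows that LATIN SMALL LETTER O WITH
ACUTE occupies the same byte in ISO 8859/1 and ISO 8859/2, and that the
#x73 in the bracket notation above is that byte with the high bit
cleared.

```python
import unicodedata

o_acute = "\u00f3"  # LATIN SMALL LETTER O WITH ACUTE

# Python's codecs stand in for the Mule charsets here (an assumption
# made for illustration; Mule does not work this way internally).
latin1_bytes = o_acute.encode("iso8859_1")
latin2_bytes = o_acute.encode("iso8859_2")

# The #x73 in the bracket notation is the file byte with the high bit cleared.
position = latin1_bytes[0] & 0x7F

print(unicodedata.name(o_acute))
print(latin1_bytes.hex(), latin2_bytes.hex(), hex(position))
```

One abstract character, one file byte, yet two distinct buffer
representations: that is the redundancy unification is meant to remove.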
1013
1014 Worse yet, it is very likely to occur when mixing ``different'' encodings
1015 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
1016 points that are almost never used. A very important example involves
1017 email. Many sites, especially in the U.S., default to use of the ISO
1018 8859/1 coded character set (also called ``Latin 1,'' though these are
1019 somewhat different concepts). However, ISO 8859/1 provides a generic
1020 CURRENCY SIGN character. Now that the Euro has become the official
1021 currency of most countries in Europe, this is unsatisfactory (and in
1022 practice, useless). So Europeans generally use ISO 8859/15, which is
1023 nearly identical to ISO 8859/1 for most languages, except that it
1024 substitutes EURO SIGN for CURRENCY SIGN.
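To see how small the difference is, here is a rough Python sketch
(again using Python's codecs as stand-ins, with illustrative names)
that lists the upper-half code points whose meaning differs between
ISO 8859/1 and ISO 8859/15.

```python
import unicodedata

# Bytes in the upper half whose meaning differs between the two sets.
differing = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("iso8859_1") != bytes([b]).decode("iso8859_15")
]

for b in differing:
    latin1_name = unicodedata.name(bytes([b]).decode("iso8859_1"))
    latin9_name = unicodedata.name(bytes([b]).decode("iso8859_15"))
    print("0x%02X  %s  ->  %s" % (b, latin1_name, latin9_name))
```

The byte 0xA4 is among them: CURRENCY SIGN in ISO 8859/1, EURO SIGN in
ISO 8859/15.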
1025
1026 Suppose a European user yanks text from a post encoded in ISO 8859/1
1027 into a message composition buffer, and enters some text including the
1028 Euro sign. Then Mule will consider the buffer to contain both ISO
1029 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
1030 programmed) send the message as a multipart mixed MIME body!
1031
1032 This is clearly stupid. What is not as obvious is that, just as any
1033 European can include American English in their text because ASCII is a
1034 subset of ISO 8859/15, most European languages which use Latin
1035 characters (e.g., German and Polish) can typically be mixed while using
1036 only one Latin coded character set (in the case of German and Polish,
1037 ISO 8859/2). However, this often depends on exactly what text is to be
1038 encoded (even for the same pair of languages).
1039
1040 Unification works around the problem by converting as many characters as
1041 possible to use a single Latin coded character set before saving the
1042 buffer.
1043
1044 Because the problem is rarely noticeable in editing a buffer, but tends
1045 to manifest when that buffer is exported to a file or process,
1046 unification uses the strategy of examining the buffer prior to export.
1047 If use of multiple Latin coded character sets is detected, unification
1048 attempts to unify them by finding a single coded character set which
1049 contains all of the Latin characters in the buffer.
1050
1051 The primary purpose of unification is to fix the problem by giving the
1052 user the choice to change the representation of all characters to one
1053 character set, and to give sensible recommendations based on context. In
1054 the `@'o' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
1055 both will be suggested. In the EURO SIGN example, only ISO 8859/15
1056 makes sense, and that is what will be recommended. In both cases, the
1057 user will be reminded that there are universal encodings available.
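The search for a single suitable coded character set can be sketched
as follows. This is an illustration in Python under stated assumptions,
not the unification implementation: the function name and candidate
list are hypothetical, and Python's codecs stand in for Mule charsets.

```python
# Candidate 8-bit Latin codecs (an illustrative list, not Mule's).
CANDIDATES = [
    "iso8859_1", "iso8859_15", "iso8859_2", "iso8859_3",
    "iso8859_4", "iso8859_9", "iso8859_13", "iso8859_16",
]

def feasible_charsets(text):
    """Return the candidate codecs that can encode every character of text."""
    feasible = []
    for codec in CANDIDATES:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # some character is missing from this set
        feasible.append(codec)
    return feasible

print(feasible_charsets("\u00f3\u00f3"))        # two o-acute characters
print(feasible_charsets("caf\u00e9 5 \u20ac"))  # text containing EURO SIGN
```

An empty result would mean that no single 8-bit set suffices, which is
the case where a universal encoding is the only safe choice.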
1058
1059 I call this @dfn{remapping} (from the universal character set to a
1060 particular ISO 8859 coded character set). It is mere accident that this
1061 letter has the same code point in both character sets. (Not entirely,
1062 but there are many examples of Latin characters that have different code
1063 points in different Latin-X sets.)
1064
1065 Note that, in the `@'o' example, treating the buffer in this way will
1066 result in a representation such as [latin-iso8859-2
1067 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
1068 This is guaranteed to occasionally result in the second problem,
1069 to which we now turn.
1070
1071 This problem is that, although the file is intended to be an
1072 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
1073 compliant program---this is required by the standard, obvious if you
1074 think a bit, @pxref{What Unification Cannot Do for You}) will read that
1075 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this
1076 is no problem if all of the characters in the file are contained in ISO
1077 8859/1, but suppose there are some which are not, but are contained in
1078 the (intended) ISO 8859/2.
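A small Python sketch of the failure (Python codecs as stand-ins; the
Polish sample word is chosen only for illustration): bytes written as
ISO 8859/2 but read under an ISO 8859/1 assumption keep their codes and
change their meaning.

```python
polish = "\u0142\u00f3d\u017a"  # a Polish word: l-stroke, o-acute, d, z-acute

file_bytes = polish.encode("iso8859_2")   # what was intended on disk
misread = file_bytes.decode("iso8859_1")  # what a Latin-1 locale sees

print(file_bytes.hex())
print([hex(ord(ch)) for ch in misread])
```

Only the o-acute survives, because it happens to share a code point;
the other two non-ASCII letters turn into SUPERSCRIPT THREE and VULGAR
FRACTION ONE QUARTER.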
1079
1080 You now want to fix this, but not by finding the same character in
1081 another set. Instead, you want to simply change the character set
1082 that Mule associates with that buffer position without changing the
1083 code. (This is conceptually somewhat distinct from the first problem,
1084 and logically ought to be handled in the code that defines coding
1085 systems. However, unification is not an unreasonable place for it.)
1086 Unification provides two functions (one fast and dangerous, the other
1087 @c #### fix latin-unity.texi
1088 slower and careful) to handle this. I call this @dfn{recoding}, because
1089 the transformation actually involves @emph{encoding} the buffer to
1090 file representation, then @emph{decoding} it to buffer representation
1091 (in a different character set). This cannot be done automatically
1092 because Mule can have no idea what the correct encoding is---after
1093 all, it already gave you its best guess. @xref{What Unification
1094 Cannot Do for You}. So these functions must be invoked by the user.
1095 @xref{Interactive Usage}.
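Recoding as described above amounts to an encode followed by a decode
in a different character set. A minimal Python sketch, with a
hypothetical function name and Python codecs standing in for Mule
coding systems:

```python
def recode(text, assumed="iso8859_1", intended="iso8859_2"):
    """Reinterpret text that was decoded with the wrong 8-bit charset.

    Encoding with the wrongly assumed charset recovers the original
    file bytes; decoding those bytes with the intended charset yields
    the text the author meant. The codes never change, only the
    character set associated with them.
    """
    return text.encode(assumed).decode(intended)

misread = "\u00b3\u00f3d\u00bc"  # Latin-2 bytes that were read as Latin-1
print([hex(ord(ch)) for ch in recode(misread)])
```

Nothing in the sketch can tell which value of the intended charset is
right; as the text says, that decision has to come from the user.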
1096
1097
1098 @node What Unification Cannot Do for You, , Unification Theory, Unification
1099 @subsection What Unification Cannot Do for You
1100
1101 Unification @strong{cannot} save you if you insist on exporting data in
1102 8-bit encodings in a multilingual environment. @emph{You will
1103 eventually corrupt data if you do this.} It is not Mule's, or any
1104 application's, fault. You will have only yourself to blame; consider
1105 yourself warned. (It is true that Mule has bugs, which make Mule
1106 somewhat more dangerous and inconvenient than some naive applications.
1107 We're working to address those, but no application can remedy the
1108 inherent defect of 8-bit encodings.)
1109
1110 Use standard universal encodings, preferably Unicode (UTF-8) unless
1111 applicable standards indicate otherwise. The most important such case
1112 is Internet messages, where MIME should be used, whether or not the
1113 subordinate encoding is a universal encoding. (Note that since one of
1114 the important provisions of MIME is the @samp{Content-Type} header,
1115 which has the charset parameter, MIME is to be considered a universal
1116 encoding for the purposes of this manual. Of course, technically
1117 speaking it's neither a coded character set nor a coding extension
1118 technique compliant with ISO 2022.)
1119
1120 As mentioned earlier, the problem is that standard encodings suffer from
1121 the design defect that they do not provide a reliable way to recognize
1122 which coded character sets are in use. There are scores of character
1123 sets which can be represented by a single octet (8-bit byte), whose
1124 union contains many hundreds of characters. Thus any 8-bit coded
1125 character set must contain characters that share code points used for
1126 different characters in other coded character sets.
1127
1128 This means that a given file's intended encoding cannot be identified
1129 with 100% reliability unless it contains encoding markers such as those
1130 provided by MIME or ISO 2022.
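The ambiguity is easy to demonstrate. In this Python sketch (codecs as
stand-ins, sample bytes chosen for illustration) one unmarked byte
string decodes without error under three different 8-bit sets, giving
three different texts.

```python
data = b"\xb3\xf3d\xbc"  # no escape sequences, no MIME charset label

decodings = []
for codec in ("iso8859_1", "iso8859_2", "iso8859_15"):
    text = data.decode(codec)  # every one of these succeeds
    decodings.append(text)
    print(codec, [hex(ord(ch)) for ch in text])

# All three results are valid text, and all three differ.
print(len(set(decodings)))
```

Since none of the decodings fails, the bytes themselves give a program
no basis for preferring one reading over another.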
1131
1132 Unification actually makes it more likely that you will have problems of
1133 this kind. Traditionally Mule has been ``helpful'' by simply using an
1134 ISO 2022 universal coding system when the current buffer coding system
1135 cannot handle all the characters in the buffer. This has the effect
1136 that, because the file contains control sequences, it is not recognized
1137 as being in the locale's normal 8-bit encoding. It may be annoying if
1138 @c #### fix in latin-unity.texi
1139 you are not a Mule expert, but your data is guaranteed to be recoverable
1140 with a tool you already have: Mule.
1141
1142 However, with unification, Mule converts to a single 8-bit character set
1143 when possible. But typically this will @emph{not} be in your usual
1144 locale. I.e., the times that an ISO 8859/1 user will need unification are
1145 when there are ISO 8859/2 characters in the buffer. But then most
1146 likely the file will be saved in a pure 8-bit encoding that is not ISO
1147 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the
1148 most sophisticated yet available) cannot tell the difference between ISO
1149 8859/1 and ISO 8859/2, and in a Western European locale will choose the
1150 former even though the latter was intended. Even the extension
1151 @c #### fix in latin-unity.texi
1152 (``statistical recognition'') planned for XEmacs 22 is unlikely to be
1153 acceptably accurate in the case of mixed codes.
1154
1155 So now consider adding some additional ISO 8859/1 text to the buffer.
1156 If it includes any ISO 8859/1 codes that are used by different
1157 characters in ISO 8859/2, you now have a file that cannot be
1158 mechanically disentangled. You need a human being who can recognize
1159 that @emph{this is German and Swedish} and stays in Latin-1, while
1160 @emph{that is Polish} and needs to be recoded to Latin-2.
1161
1162 Moral: switch to a universal coded character set, preferably Unicode
1163 using the UTF-8 transformation format. If you really need the space,
1164 compress your files.
1165
1166
1167 @node Specify Coding, Charsets and Coding Systems, Unification, Mule
1168 @section Specifying a Coding System
1169
1170 In cases where XEmacs does not automatically choose the right coding
1171 system, you can use these commands to specify one:
1172
1284 using that coding system for all file operations. This makes it
1285 possible to use non-Latin-1 characters in file names---or, at least,
1286 those non-Latin-1 characters which the specified coding system can
1287 encode. By default, this variable is @code{nil}, which implies that you
1288 cannot use non-Latin-1 characters in file names.
1289
1290
1291 @node Charsets and Coding Systems, , Specify Coding, Mule
1292 @section Charsets and Coding Systems
1293
1294 This section provides reference lists of Mule charsets and coding
1295 systems. Mule charsets are typically named by character set and
1296 standard.
1297
1298 @table @strong
1299 @item ASCII variants
1300
1301 Identification of equivalent characters in these sets is not properly
1302 implemented. Unification does not distinguish the two charsets.
1303
1304 @samp{ascii} @samp{latin-jisx0201}
1305
1306 @item Extended Latin
1307
1308 Characters from the following ISO 2022 conformant charsets are
1309 identified with equivalents in other charsets in the group by
1310 unification.
1311
1312 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
1313 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
1314 @samp{latin-iso8859-13} @samp{latin-iso8859-16}
1315
1316 The following charsets are Latin variants which are not understood by
1317 unification. In addition, many of the Asian language standards provide
1318 ASCII, at least, and sometimes other Latin characters. None of these
1319 are identified with their ISO 8859 equivalents.
1320
1321 @samp{vietnamese-viscii-lower}
1322 @samp{vietnamese-viscii-upper}
1323
1324 @item Other character sets
1325
1326 @samp{arabic-1-column}
1327 @samp{arabic-2-column}
1328 @samp{arabic-digit}
1329 @samp{arabic-iso8859-6}
1330 @samp{chinese-big5-1}
1331 @samp{chinese-big5-2}
1332 @samp{chinese-cns11643-1}
1333 @samp{chinese-cns11643-2}
1334 @samp{chinese-cns11643-3}
1335 @samp{chinese-cns11643-4}
1336 @samp{chinese-cns11643-5}
1337 @samp{chinese-cns11643-6}
1338 @samp{chinese-cns11643-7}
1339 @samp{chinese-gb2312}
1340 @samp{chinese-isoir165}
1341 @samp{cyrillic-iso8859-5}
1342 @samp{ethiopic}
1343 @samp{greek-iso8859-7}
1344 @samp{hebrew-iso8859-8}
1345 @samp{ipa}
1346 @samp{japanese-jisx0208}
1347 @samp{japanese-jisx0208-1978}
1348 @samp{japanese-jisx0212}
1349 @samp{katakana-jisx0201}
1350 @samp{korean-ksc5601}
1351 @samp{sisheng}
1352 @samp{thai-tis620}
1353 @samp{thai-xtis}
1354
1355 @item Non-graphic charsets
1356
1357 @samp{control-1}
1358 @end table
1359
1360 @table @strong
1361 @item No conversion
1362
1363 Some of these coding systems may specify EOL conventions. Note that
1364 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
1365 coding system. Although unification attempts to compensate for this, it
1366 is possible that the @samp{iso-8859-1} coding system will behave
1367 differently from other ISO 8859 coding systems.
1368
1369 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
1370
1371 @item Latin coding systems
1372
1373 These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
1374 combining ASCII in the GL register (bytes with high-bit clear) and an
1375 extended Latin character set in the GR register (bytes with high-bit set).
1376
1377 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
1378 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
1379
1380 These coding systems are single-byte, 8-bit coding systems that do not
1381 conform to international standards. They should be avoided in all
1382 potentially multilingual contexts, including any text distributed over
1383 the Internet and World Wide Web.
1384
1385 @samp{windows-1251}
1386
1387 @item Multilingual coding systems
1388
1389 The following ISO-2022-based coding systems are useful for multilingual
1390 text.
1391
1392 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
1393 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
1394
1395 XEmacs also supports Unicode with the Mule-UCS package. These are the
1396 preferred coding systems for multilingual use. (There is a possible
1397 exception for texts that mix several Asian ideographic character sets.)
1398
1399 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
1400 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
1401 @samp{utf-8} @samp{utf-8-ws}
1402
1403 Development versions of XEmacs (the 21.5 series) support Unicode
1404 internally, with (at least) the following coding systems implemented:
1405
1406 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
1407 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
1408
1409 @item Asian ideographic languages
1410
1411 The following coding systems are based on ISO 2022, and are more or less
1412 suitable for encoding multilingual texts. They all can represent ASCII
1413 at least, and sometimes several other foreign character sets, without
1414 resort to arbitrary ISO 2022 designations. However, these subsets are
1415 not identified with the corresponding national standards in XEmacs Mule.
1416
1417 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
1418 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
1419 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
1420 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
1421 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
1422
1423 The following coding systems cannot be used for general multilingual
1424 text and do not cooperate well with other coding systems.
1425
1426 @samp{big5} @samp{shift_jis}
1427
1428 @item Other languages
1429
1430 The following coding systems are based on ISO 2022. Though none of them
1431 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
1432 to 21.4 defaults to) use of ISO 2022 control sequences to designate
1433 other character sets for inclusion in the text.
1434
1435 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
1436 @samp{ctext-hebrew}
1437
1438 The following are coding systems that do not conform to ISO 2022 and
1439 thus cannot be safely used in a multilingual context.
1440
1441 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
1442 @samp{viscii} @samp{vscii}
1443
1444 @item Special coding systems
1445
1446 Mule uses the following coding systems for special purposes.
1447
1448 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
1449
1450 @samp{escape-quoted} is especially important, as it is used internally
1451 as the coding system for autosaved data.
1452
1453 The following coding systems are aliases for others, and are used for
1454 communication with the host operating system.
1455
1456 @samp{file-name} @samp{keyboard} @samp{terminal}
1457
1458 @end table
1459
1460 Mule detection of coding systems is actually limited to detection of
1461 classes of coding systems called @dfn{coding categories}. These coding
1462 categories are identified by the ISO 2022 control sequences they use, if
1463 any, by their conformance to ISO 2022 restrictions on code points that
1464 may be used, and by characteristic patterns of use of 8-bit code points.
1465
1466 @samp{no-conversion}
1467 @samp{utf-8}
1468 @samp{ucs-4}
1469 @samp{iso-7}
1470 @samp{iso-lock-shift}
1471 @samp{iso-8-1}
1472 @samp{iso-8-2}
1473 @samp{iso-8-designate}
1474 @samp{shift-jis}
1475 @samp{big5}
1476
1477