diff man/xemacs/mule.texi @ 1183:c1553814932e
[xemacs-hg @ 2003-01-03 12:12:30 by stephent]
various docs
<873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp>
<87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author: stephent
date: Fri, 03 Jan 2003 12:12:40 +0000
parents: 26f7cf2a4792
children: 6b0000935adc
--- a/man/xemacs/mule.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs/mule.texi Fri Jan 03 12:12:40 2003 +0000 @@ -15,6 +15,8 @@ @cindex Korean @cindex Cyrillic @cindex Russian +@c #### It's a lie that this file tells you about Unicode.... +@cindex Unicode If you build XEmacs using the @code{--with-mule} option, it supports a wide variety of world scripts, including the Latin script, the Arabic script, Simplified Chinese (for mainland China), Traditional Chinese @@ -33,22 +35,25 @@ * Coding Systems:: Character set conversion when you read and write files, and so on. * Recognize Coding:: How XEmacs figures out which conversion to use. +* Unification:: Integrating overlapping character sets. * Specify Coding:: Various ways to choose which conversion to use. +* Charsets and Coding Systems:: Tables and other reference material. @end menu @node Mule Intro, Language Environments, Mule, Mule -@section Introduction to world scripts +@section Introduction: The Wide Variety of Scripts and Codings in Use - The users of these scripts have established many more-or-less standard -coding systems for storing files. -@c XEmacs internally uses a single multibyte character encoding, so that it -@c can intermix characters from all these scripts in a single buffer or -@c string. This encoding represents each non-ASCII character as a sequence -@c of bytes in the range 0200 through 0377. -XEmacs translates between the internal character encoding and various -other coding systems when reading and writing files, when exchanging -data with subprocesses, and (in some cases) in the @kbd{C-q} command -(see below). + There are hundreds of scripts in use world-wide. The users of these +scripts have established many more-or-less standard coding systems for +storing text written in them in files.
XEmacs translates between its +internal character encoding and various other coding systems when +reading and writing files, when exchanging data with subprocesses, and +(in some cases) in the @kbd{C-q} command (see below). +@footnote{Historically the internal encoding was a specially designed +encoding, called @dfn{Mule encoding}, intended for easy conversion to +and from versions of ISO 2022. However, this encoding shares many +properties with UTF-8, and conversion to UTF-8 as the internal code is +proposed.} @kindex C-h h @findex view-hello-file @@ -356,7 +361,7 @@ the usual three variants to specify the kind of end-of-line conversion. -@node Recognize Coding, Specify Coding, Coding Systems, Mule +@node Recognize Coding, Unification, Coding Systems, Mule @section Recognizing Coding Systems Most of the time, XEmacs can recognize which coding system to use for @@ -427,7 +432,739 @@ Coding}). -@node Specify Coding, , Recognize Coding, Mule +@node Unification, Specify Coding, Recognize Coding, Mule +@section Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This results in oddities such as +files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO +2022 control sequences to switch between them, as well as more +plausible but often unnecessary combinations like ISO 8859/1 with ISO +8859/2. This can be very annoying when sending messages or even in +simple editing on a single host. XEmacs works around the problem by +converting as many characters as possible to use a single Latin coded +character set before saving the buffer. + +Unification is planned for extension to other character set families, +in particular the Han family of character sets based on the Chinese +ideographic characters. At least for the Han sets, however, the +unification feature will be disabled by default. 
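To make the defect concrete, here is a sketch of how it can be observed directly. This assumes a Mule-enabled XEmacs; @code{make-char} and @code{charsets-in-region} are standard Mule primitives, but the scenario itself is illustrative, not part of the original text.

@example
;; Insert ``o with acute'' once from each Latin charset.  Position
;; #x73 in a 96-character set corresponds to octet #xF3 in a file.
(insert (make-char 'latin-iso8859-1 #x73))  ; as from a Latin-1 input method
(insert (make-char 'latin-iso8859-2 #x73))  ; as from a Latin-2 input method
;; Although both characters display identically, Mule reports two
;; distinct character sets for the buffer:
(charsets-in-region (point-min) (point-max))
@end example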
+ +This functionality is based on the @file{latin-unity} package by +Stephen Turnbull @email{stephen@@xemacs.org}, but is somewhat +divergent. This documentation is also based on the package +documentation, and is likely to be inaccurate because of the different +constraints we place on ``core'' and packaged functionality. + +@menu +* Unification Overview:: History and general information. +* Unification Usage:: An overview of operation. +* Unification Configuration:: Configuring unification. +* Unification FAQs:: Questions and answers from the mailing list. +* Unification Theory:: How unification works. +* What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. +@end menu + +@node Unification Overview, Unification Usage, Unification, Unification +@subsection An Overview of Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This manifests itself when a user +enters characters using input methods associated with different coded +character sets into a single buffer. + +A very important example involves email. Many sites, especially in the +U.S., default to use of the ISO 8859/1 coded character set (also called +``Latin 1,'' though these are somewhat different concepts). However, +ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the +Euro has become the official currency of most countries in Europe, this +is unsatisfactory (and in practice, useless). So Europeans generally +use ISO 8859/15, which is nearly identical to ISO 8859/1 for most +languages, except that it substitutes EURO SIGN for CURRENCY SIGN. + +Suppose a European user yanks text from a post encoded in ISO 8859/1 +into a message composition buffer, and enters some text including the +Euro sign. Then Mule will consider the buffer to contain both ISO +8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively +programmed) send the message as a multipart mixed MIME body! 
+ +This is clearly stupid. What is not as obvious is that, just as any +European can include American English in their text because ASCII is a +subset of ISO 8859/15, most European languages which use Latin +characters (eg, German and Polish) can typically be mixed while using +only one Latin coded character set (in this case, ISO 8859/2). However, +this often depends on exactly what text is to be encoded. + +Unification works around the problem by converting as many characters as +possible to use a single Latin coded character set before saving the +buffer. + + +@node Unification Usage, Unification Configuration, Unification Overview, Unification +@subsection Operation of Unification + +This is a description of the early hack to include unification in +XEmacs 21.5. This will almost surely change. + +Normally, unification works in the background by installing +@code{unity-sanity-check} on @code{write-region-pre-hook}. +Unification is on by default for the ISO-8859 Latin sets. The user +activates this functionality for other character set families by +invoking @code{enable-unification}, either interactively or in her +init file. @xref{Init File, , , xemacs}. Unification can be +deactivated by invoking @code{disable-unification}. + +Unification also provides a few functions for remapping or recoding the +buffer by hand. To @dfn{remap} a character means to change the buffer +representation of the character by using another coded character set. +Remapping never changes the identity of the character, but may involve +altering the code point of the character. To @dfn{recode} a character +means to simply change the coded character set. Recoding never alters +the code point of the character, but may change the identity of the +character. @xref{Unification Theory}.
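The distinction can be sketched as follows. The charset names and region bounds here are illustrative; the functions are documented in detail below (@pxref{Interactive Usage}).

@example
;; Remap: preserve each character's identity, possibly changing the
;; charset (and hence the code point) used to represent it.
(unity-remap-region (point-min) (point-max) 'latin-iso8859-2)

;; Recode: preserve each character's code point, changing only the
;; charset tag; character identities may change.  Use this to repair
;; text that was decoded with the wrong coding system.
(unity-recode-region (point-min) (point-max)
                     'latin-iso8859-1 'latin-iso8859-2)
@end example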
+ +There are a few variables which determine which coding systems are +always acceptable to unification: @code{unity-ucs-list}, +@code{unity-preferred-coding-system-list}, and +@code{unity-preapproved-coding-system-list}. The last defaults to +@code{(buffer-default preferred)}, and you should probably avoid changing it +because it short-circuits the sanity check. If you find you need to +use it, consider reporting it as a bug or request for enhancement. + +@menu +* Basic Functionality:: User interface and customization. +* Interactive Usage:: Treating text by hand. + Also documents the hook function(s). +@end menu + + +@node Basic Functionality, Interactive Usage, , Unification Usage +@subsubsection Basic Functionality + +These functions and user options initialize and configure unification. +In normal use, they are not needed. + +@strong{These interfaces will change. Also, the @samp{unity-} prefix +is likely to be changed for many of the variables and functions, as +they are of more general usefulness.} + +@defun enable-unification +Set up hooks and initialize variables for unification. + +There are no arguments. + +This function is idempotent. It will reinitialize any hooks or variables +that are not in initial state. +@end defun + +@defun disable-unification +Clean up hooks and void variables used by unification. + +There are no arguments. +@end defun + +@c #### several changes should go to latin-unity.texi +@defopt unity-ucs-list +List of universal coding systems recommended for character set unification. + +The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}. + +Order matters; coding systems earlier in the list will be preferred when +recommending a coding system. These coding systems will not be used +without querying the user (unless they are also present in +@code{unity-preapproved-coding-system-list}), and follow the +@code{unity-preferred-coding-system-list} in the list of suggested +coding systems.
+ +If none of the preferred coding systems are feasible, the first in +this list will be the default. + +Notes on certain coding systems: @code{escape-quoted} is a special +coding system used for autosaves and compiled Lisp in Mule. You should +never delete this, although it is rare that a user would want to use it +directly. Unification does not try to be ``smart'' about other general +ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized +as equivalent to @code{iso-2022-7}.) If your preferred coding system is +one of these, you may consider adding it to @code{unity-ucs-list}. +@end defopt + +Coding systems which are not Latin and not in +@code{unity-ucs-list} are handled by short-circuiting checks of the +coding system against the next two variables. + +@defopt unity-preapproved-coding-system-list +List of coding systems used without querying the user if feasible. + +The default value is @samp{(buffer-default preferred)}. + +The first feasible coding system in this list is used. The special values +@samp{preferred} and @samp{buffer-default} may be present: + +@table @code +@item buffer-default +Use the coding system used by @samp{write-region}, if feasible. + +@item preferred +Use the coding system specified by @samp{prefer-coding-system} if feasible. +@end table + +``Feasible'' means that all characters in the buffer can be represented by +the coding system. Coding systems in @samp{unity-ucs-list} are +always considered feasible. Other feasible coding systems are computed +by @samp{unity-representations-feasible-region}. + +Note that, by definition, the first universal coding system in this +list shadows all other coding systems. In particular, if your +preferred coding system is a universal coding system, and +@code{preferred} is a member of this list, unification will blithely +convert all your files to that coding system. This is considered a +feature, but it may surprise most users.
Users who don't like this +behavior may put @code{preferred} in +@code{unity-preferred-coding-system-list}, but not in +@code{unity-preapproved-coding-system-list}. +@end defopt + + +@defopt unity-preferred-coding-system-list +List of coding systems suggested to the user if feasible. + +The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 +iso-8859-4 iso-8859-9)}. + +If none of the coding systems in +@samp{unity-preapproved-coding-system-list} are feasible, this list +will be recommended to the user, followed by the +@samp{unity-ucs-list} (so those coding systems should not be in +this list). The first coding system in this list is the default. The +special values @samp{preferred} and @samp{buffer-default} may be +present: + +@table @code +@item buffer-default +Use the coding system used by @samp{write-region}, if feasible. + +@item preferred +Use the coding system specified by @samp{prefer-coding-system} if feasible. +@end table + +``Feasible'' means that all characters in the buffer can be represented by +the coding system. Coding systems in @samp{unity-ucs-list} are +always considered feasible. Other feasible coding systems are computed +by @samp{unity-representations-feasible-region}. +@end defopt + + +@defvar unity-iso-8859-1-aliases +List of coding systems to be treated as aliases of ISO 8859/1. + +The default value is @code{(iso-8859-1)}. + +This is not a user variable; to customize input of coding systems or +charsets, use @samp{unity-coding-system-alias-alist} or +@samp{unity-charset-alias-alist}. +@end defvar + + +@node Interactive Usage, , Basic Functionality, Unification Usage +@subsubsection Interactive Usage + +First, the hook function @code{unity-sanity-check} is documented. +(It is placed here because it is not an interactive function, and there +is not yet a programmer's section of the manual.)
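In outline, the hook arrangement described above amounts to something like the following. This is a simplification of what @code{enable-unification} actually does, shown only to make the mechanism concrete.

@example
;; enable-unification, in essence:
(add-hook 'write-region-pre-hook 'unity-sanity-check)

;; and disable-unification undoes it:
(remove-hook 'write-region-pre-hook 'unity-sanity-check)
@end example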
+ +These functions provide access to internal functionality (such as the +remapping function) and to extra functionality (the recoding functions +and the test function). + +@defun unity-sanity-check begin end filename append visit lockname &optional coding-system + +Check if @var{coding-system} can represent all characters between +@var{begin} and @var{end}. + +For compatibility with old broken versions of @code{write-region}, +@var{coding-system} defaults to @code{buffer-file-coding-system}. +@var{filename}, @var{append}, @var{visit}, and @var{lockname} are +ignored. + +Return nil if @code{buffer-file-coding-system} is not (ISO-2022-compatible) +Latin. If @code{buffer-file-coding-system} is safe for the charsets +actually present in the buffer, return it. Otherwise, ask the user to +choose a coding system, and return that. + +This function does @emph{not} do the safe thing when +@code{buffer-file-coding-system} is nil (aka no-conversion). It +considers that ``non-Latin,'' and passes it on to the Mule detection +mechanism. + +This function is intended for use as a @code{write-region-pre-hook}. It +does nothing except return @var{coding-system} if @code{write-region} +handlers are inhibited. +@end defun + +@defun unity-buffer-representations-feasible +There are no arguments. + +Apply @code{unity-region-representations-feasible} to the current buffer. +@end defun + +@defun unity-region-representations-feasible begin end &optional buf +Return character sets that can represent the text from @var{begin} to +@var{end} in @var{buf}. + +@c #### Fix in latin-unity.texi. +@var{buf} defaults to the current buffer. Called interactively, will be +applied to the region. The function assumes @var{begin} <= @var{end}. + +The return value is a cons. The car is the list of character sets +that can individually represent all of the non-ASCII portion of the +buffer, and the cdr is the list of character sets that can +individually represent all of the ASCII portion.
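For example, a Lisp program might use the return value like this. The exact contents of the lists depend on the buffer; this is only a sketch.

@example
(let ((feasible (unity-buffer-representations-feasible)))
  ;; The car lists the charsets covering the non-ASCII text.
  (if (car feasible)
      (message "Feasible Latin charsets: %S" (car feasible))
    (message "No single Latin charset suffices; use a universal coding system")))
@end example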
+ +The following is taken from a comment in the source. Please refer to +the source to be sure of an accurate description. + +The basic algorithm is to map over the region, compute the set of +charsets that can represent each character (the ``feasible charset''), +and take the intersection of those sets. + +The current implementation takes advantage of the fact that ASCII +characters are common and cannot change asciisets. Then using +skip-chars-forward makes motion over ASCII subregions very fast. + +This same strategy could be applied generally by precomputing classes +of characters equivalent according to their effect on latinsets, and +adding a whole class to the skip-chars-forward string once a member is +found. + +Probably efficiency is a function of the number of characters matched, +or maybe the length of the match string? With @code{skip-category-forward} +over a precomputed category table it should be really fast. In practice +for Latin character sets there are only 29 classes. +@end defun + +@defun unity-remap-region begin end character-set &optional coding-system + +Remap characters between @var{begin} and @var{end} to equivalents in +@var{character-set}. Optional argument @var{coding-system} may be a +coding system name (a symbol) or nil. Characters with no equivalent are +left as-is. + +When called interactively, @var{begin} and @var{end} are set to the +beginning and end, respectively, of the active region, and the function +prompts for @var{character-set}. The function does completion, knows +how to guess a character set name from a coding system name, and also +provides some common aliases. See @code{unity-guess-charset}. +There is no way to specify @var{coding-system}, as it has no useful +function interactively. 
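From Lisp, however, the optional argument can be supplied; for example (the charset and coding system names here are merely illustrative):

@example
;; Remap the region to Latin-2, and simultaneously check whether
;; iso-8859-2 can encode every character in it (see the return
;; value description).
(unity-remap-region (region-beginning) (region-end)
                    'latin-iso8859-2 'iso-8859-2)
@end example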
+ +Return @var{coding-system} if @var{coding-system} can encode all +characters in the region, t if @var{coding-system} is nil and the coding +system with G0 = 'ascii and G1 = @var{character-set} can encode all +characters, and otherwise nil. Note that a non-null return does +@emph{not} mean it is safe to write the file, only the specified region. +(This behavior is useful for multipart MIME encoding and the like.) + +Note: by default this function is quite fascist about universal coding +systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and +@samp{ctext}. Customize @code{unity-ucs-list} to change +this. + +This function remaps characters that are artificially distinguished by Mule +internal code. It may change the code point as well as the character set. +To recode characters that were decoded in the wrong coding system, use +@code{unity-recode-region}. +@end defun + +@defun unity-recode-region begin end wrong-cs right-cs + +Recode characters between @var{begin} and @var{end} from @var{wrong-cs} +to @var{right-cs}. + +@var{wrong-cs} and @var{right-cs} are character sets. Characters retain +the same code point but the character set is changed. Only characters +from @var{wrong-cs} are changed to @var{right-cs}. The identity of the +character may change. Note that this could be dangerous, if characters +whose identities you do not want changed are included in the region. +This function cannot guess which characters you want changed, and which +should be left alone. + +When called interactively, @var{begin} and @var{end} are set to the +beginning and end, respectively, of the active region, and the function +prompts for @var{wrong-cs} and @var{right-cs}. The function does +completion, knows how to guess a character set name from a coding system +name, and also provides some common aliases. See +@code{unity-guess-charset}.
+ +Another way to accomplish this, but using coding systems rather than +character sets to specify the desired recoding, is +@samp{unity-recode-coding-region}. That function may be faster +but is somewhat more dangerous, because it may recode more than one +character set. + +To change from one Mule representation to another without changing identity +of any characters, use @samp{unity-remap-region}. +@end defun + +@defun unity-recode-coding-region begin end wrong-cs right-cs + +Recode text between @var{begin} and @var{end} from @var{wrong-cs} to +@var{right-cs}. + +@var{wrong-cs} and @var{right-cs} are coding systems. Characters retain +the same code point but the character set is changed. The identity of +characters may change. This is an inherently dangerous function; +multilingual text may be recoded in unexpected ways. #### It's also +dangerous because the coding systems are not sanity-checked in the +current implementation. + +When called interactively, @var{begin} and @var{end} are set to the +beginning and end, respectively, of the active region, and the function +prompts for @var{wrong-cs} and @var{right-cs}. The function does +completion, knows how to guess a coding system name from a character set +name, and also provides some common aliases. See +@code{unity-guess-coding-system}. + +Another, safer, way to accomplish this, using character sets rather +than coding systems to specify the desired recoding, is to use +@code{unity-recode-region}. + +To change from one Mule representation to another without changing identity +of any characters, use @code{unity-remap-region}. +@end defun + +Helper functions for input of coding system and character set names. + +@defun unity-guess-charset candidate +Guess a charset based on the symbol @var{candidate}. + +@var{candidate} itself is not tried as the value. + +Uses the natural mapping in @samp{unity-cset-codesys-alist}, and +the values in @samp{unity-charset-alias-alist}.
+@end defun + +@defun unity-guess-coding-system candidate +Guess a coding system based on the symbol @var{candidate}. + +@var{candidate} itself is not tried as the value. + +Uses the natural mapping in @samp{unity-cset-codesys-alist}, and +the values in @samp{unity-coding-system-alias-alist}. +@end defun + +@defun unity-example + +A cheesy example for unification. + +At present it just makes a multilingual buffer. To test, setq +buffer-file-coding-system to some value, make the buffer dirty (eg +with RET BackSpace), and save. +@end defun + + +@node Unification Configuration, Unification FAQs, Unification Usage, Unification +@subsection Configuring Unification for Use + +If you want unification to be automatically initialized, invoke +@samp{enable-unification} with no arguments in your init file. +@xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs +earlier than 21.1, you should also load @file{auto-autoloads} using the +full path (@emph{never} @samp{require} @file{auto-autoloads} libraries). + +You may wish to define aliases for commonly used character sets and +coding systems for convenience in input. + +@defopt unity-charset-alias-alist +Alist mapping aliases to Mule charset names (symbols). + +The default value is +@example + ((latin-1 . latin-iso8859-1) + (latin-2 . latin-iso8859-2) + (latin-3 . latin-iso8859-3) + (latin-4 . latin-iso8859-4) + (latin-5 . latin-iso8859-9) + (latin-9 . latin-iso8859-15) + (latin-10 . latin-iso8859-16)) +@end example + +If a charset does not exist on your system, it will not complete and you +will not be able to enter it in response to prompts. A real charset +with the same name as an alias in this list will shadow the alias. +@end defopt + +@defopt unity-coding-system-alias-alist +Alist mapping aliases to Mule coding system names (symbols). + +The default value is @samp{nil}.
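For example, a user accustomed to typing @samp{utf8} could add an alias like this (a hypothetical entry, shown only as a sketch):

@example
(setq unity-coding-system-alias-alist
      '((utf8 . utf-8)))
@end example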
+@end defopt + + +@node Unification FAQs, Unification Theory, Unification Configuration, Unification +@subsection Frequently Asked Questions About Unification + +@enumerate +@item +I'm smarter than XEmacs's unification feature! How can that be? + +Don't be surprised. Trust yourself. + +Unification is very young as yet. Teach it what you know by +customizing its variables, and report your changes to the maintainer +(@kbd{M-x report-xemacs-bug RET}). + +@item +What is a UCS? + +According to ISO 10646, a Universal Coded character Set. In +XEmacs, it's a Universal (Mule) Coding System. +@ref{Coding Systems, , , xemacs}. + +@item +I know @code{utf-16-le-bom} is a UCS, but unification won't use it. +Why not? + +There are an awful lot of UCSes in Mule, and you probably do not want to +ever use, and definitely not be asked about, most of them. So the +default set includes a few that the author thought plausible, but +they're surely not comprehensive or optimal. + +Customize @code{unity-ucs-list} to include the ones you use often, and +report your favorites to the maintainer for consideration for +inclusion in the defaults using @kbd{M-x report-xemacs-bug RET}. +(Note that you @emph{must} include @code{escape-quoted} in this list, +because Mule uses it internally as the coding system for auto-save +files.) + +Alternatively, if you just want to use it this one time, simply type +it in at the prompt. Unification will confirm that it is a real coding +system, and then assume that you know what you're doing. + +@item +This is crazy: I get queried on autosaves and can't quit XEmacs! Why? + +You probably removed @code{escape-quoted} from +@code{unity-ucs-list}. Put it back. + +@item +Unification is really buggy and I can't get any work done. + +First, use @kbd{M-x disable-unification RET}, then report your +problems as a bug (@kbd{M-x report-xemacs-bug RET}).
+@end enumerate + + +@node Unification Theory, What Unification Cannot Do for You, Unification FAQs, Unification +@subsection Unification Theory + +Standard encodings suffer from the design defect that they do not +provide a reliable way to recognize which coded character sets are in use. +@xref{What Unification Cannot Do for You}. There are scores of +character sets which can be represented by a single octet (8-bit +byte), whose union contains many hundreds of characters. Obviously +this results in great confusion, since you can't tell the players +without a scorecard, and there is no scorecard. + +There are two ways to solve this problem. The first is to create a +universal coded character set. This is the concept behind Unicode. +However, there have been satisfactory (nearly) universal character +sets for several decades, but even today many Westerners resist using +Unicode because they consider its space requirements excessive. On +the other hand, many Asians dislike Unicode because they consider it +to be incomplete. (This is partly, but not entirely, political.) + +In any case, Unicode only solves the internal representation problem. +Many data sets will contain files in ``legacy'' encodings, and Unicode +does not help distinguish among them. + +The second approach is to embed information about the encodings used in +a document in its text. This approach is taken by the ISO 2022 +standard. This would solve the problem completely from the users' point of +view, except that ISO 2022 is basically not implemented at all, in the +sense that few applications or systems implement more than a small +subset of ISO 2022 functionality. This is due to the fact that +mono-literate users object to the presence of escape sequences in their +texts (which they, with some justification, consider data corruption). +Programmers are more than willing to cater to these users, since +implementing ISO 2022 is a painstaking task. + +In fact, Emacs/Mule adopts both of these approaches.
Internally it uses +a universal character set, @dfn{Mule code}. Externally it uses ISO 2022 +techniques both to save files in forms robust to encoding issues, and as +hints when attempting to ``guess'' an unknown encoding. However, Mule +suffers from a design defect: it embeds in each individual character +the character set information that ISO 2022 attaches to entire runs of +characters by introducing them with a control sequence. That causes Mule to +consider the ISO Latin character sets to be disjoint. This manifests +itself when a user enters characters using input methods associated with +different coded character sets into a single buffer. + +There are two problems stemming from this design. First, Mule +represents the same character in different ways. Abstractly, @samp{ó} +(LATIN SMALL LETTER O WITH ACUTE) can get represented as +[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like +@samp{óó} in the display might actually be represented [latin-iso8859-1 +#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B +#xF3 ESC - A] in the file. In some cases this treatment would be +appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 +(the CJK ideographic character meaning ``one'')), and although arguably +incorrect it is convenient when mixing the CJK scripts. But in the case +of the Latin scripts this is wrong. + +Worse yet, it is very likely to occur when mixing ``different'' encodings +(such as ISO 8859/1 and ISO 8859/15) that differ only in a few code +points that are almost never used. A very important example involves +email. Many sites, especially in the U.S., default to use of the ISO +8859/1 coded character set (also called ``Latin 1,'' though these are +somewhat different concepts). However, ISO 8859/1 provides a generic +CURRENCY SIGN character. Now that the Euro has become the official +currency of most countries in Europe, this is unsatisfactory (and in +practice, useless).
So Europeans generally use ISO 8859/15, which is +nearly identical to ISO 8859/1 for most languages, except that it +substitutes EURO SIGN for CURRENCY SIGN. + +Suppose a European user yanks text from a post encoded in ISO 8859/1 +into a message composition buffer, and enters some text including the +Euro sign. Then Mule will consider the buffer to contain both ISO +8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively +programmed) send the message as a multipart mixed MIME body! + +This is clearly stupid. What is not as obvious is that, just as any +European can include American English in their text because ASCII is a +subset of ISO 8859/15, most European languages which use Latin +characters (eg, German and Polish) can typically be mixed while using +only one Latin coded character set (in the case of German and Polish, +ISO 8859/2). However, this often depends on exactly what text is to be +encoded (even for the same pair of languages). + +Unification works around the problem by converting as many characters as +possible to use a single Latin coded character set before saving the +buffer. + +Because the problem is rarely noticeable in editing a buffer, but tends +to manifest when that buffer is exported to a file or process, +unification uses the strategy of examining the buffer prior to export. +If use of multiple Latin coded character sets is detected, unification +attempts to unify them by finding a single coded character set which +contains all of the Latin characters in the buffer. + +The primary purpose of unification is to fix the problem by giving the +user the choice to change the representation of all characters to one +character set and give sensible recommendations based on context. In +the @samp{ó} example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and +both will be suggested. In the EURO SIGN example, only ISO 8859/15 +makes sense, and that is what will be recommended.
In both cases, the +user will be reminded that there are universal encodings available. + +I call this @dfn{remapping} (from the universal character set to a +particular ISO 8859 coded character set). It is mere accident that this +letter has the same code point in both character sets. (Not entirely, +but there are many examples of Latin characters that have different code +points in different Latin-X sets.) + +Note that, in the @samp{ó} example, treating the buffer in this way will +result in a representation such as [latin-iso8859-2 +#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. +This is guaranteed to occasionally result in the second problem you +observed, to which we now turn. + +This problem is that, although the file is intended to be an +ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX +compliant program---this is required by the standard, obvious if you +think a bit, @pxref{What Unification Cannot Do for You}) will read that +file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this +is no problem if all of the characters in the file are contained in ISO +8859/1, but suppose there are some which are not, but are contained in +the (intended) ISO 8859/2. + +You now want to fix this, but not by finding the same character in +another set. Instead, you want to simply change the character set +that Mule associates with that buffer position without changing the +code. (This is conceptually somewhat distinct from the first problem, +and logically ought to be handled in the code that defines coding +systems. However, unification is not an unreasonable place for it.) +Unification provides two functions (one fast and dangerous, the other +@c #### fix latin-unity.texi +slower and careful) to handle this.
+I call this @dfn{recoding}, because
+the transformation actually involves @emph{encoding} the buffer to
+file representation, then @emph{decoding} it to buffer representation
+(in a different character set).  This cannot be done automatically
+because Mule can have no idea what the correct encoding is---after
+all, it already gave you its best guess.  @xref{What Unification
+Cannot Do for You}.  So these functions must be invoked by the user.
+@xref{Interactive Usage}.
+
+
+@node What Unification Cannot Do for You, , Unification Theory, Unification
+@subsection What Unification Cannot Do for You
+
+Unification @strong{cannot} save you if you insist on exporting data in
+8-bit encodings in a multilingual environment.  @emph{You will
+eventually corrupt data if you do this.}  It is not Mule's, or any
+application's, fault.  You will have only yourself to blame; consider
+yourself warned.  (It is true that Mule has bugs, which make Mule
+somewhat more dangerous and inconvenient than some naive applications.
+We're working to address those, but no application can remedy the
+inherent defect of 8-bit encodings.)
+
+Use standard universal encodings, preferably Unicode (UTF-8) unless
+applicable standards indicate otherwise.  The most important such case
+is Internet messages, where MIME should be used, whether or not the
+subordinate encoding is a universal encoding.  (Note that since one of
+the important provisions of MIME is the @samp{Content-Type} header,
+which has the @samp{charset} parameter, MIME is to be considered a
+universal encoding for the purposes of this manual.  Of course,
+technically speaking it's neither a coded character set nor a coding
+extension technique compliant with ISO 2022.)
+
+As mentioned earlier, the problem is that standard encodings suffer from
+the design defect that they do not provide a reliable way to recognize
+which coded character sets are in use.
+There are scores of character
+sets which can be represented by a single octet (8-bit byte), whose
+union contains many hundreds of characters.  Thus any 8-bit coded
+character set must contain characters that share code points used for
+different characters in other coded character sets.
+
+This means that a given file's intended encoding cannot be identified
+with 100% reliability unless it contains encoding markers such as those
+provided by MIME or ISO 2022.
+
+Unification actually makes it more likely that you will have problems of
+this kind.  Traditionally Mule has been ``helpful'' by simply using an
+ISO 2022 universal coding system when the current buffer coding system
+cannot handle all the characters in the buffer.  This has the effect
+that, because the file contains control sequences, it is not recognized
+as being in the locale's normal 8-bit encoding.  It may be annoying if
+@c #### fix in latin-unity.texi
+you are not a Mule expert, but your data is guaranteed to be recoverable
+with a tool you already have: Mule.
+
+However, with unification, Mule converts to a single 8-bit character set
+when possible.  But typically this will @emph{not} be in your usual
+locale.  I.e., the time an ISO 8859/1 user will need unification is
+when there are ISO 8859/2 characters in the buffer.  But then most
+likely the file will be saved in a pure 8-bit encoding that is not ISO
+8859/1, i.e., ISO 8859/2.  Mule's autorecognizer (which is probably the
+most sophisticated yet available) cannot tell the difference between ISO
+8859/1 and ISO 8859/2, and in a Western European locale will choose the
+former even though the latter was intended.  Even the extension
+@c #### fix in latin-unity.texi
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be
+acceptably accurate in the case of mixed codes.
+
+So now consider adding some additional ISO 8859/1 text to the buffer.
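The recognizer's dilemma is easy to reproduce. A Python sketch of reading an ISO 8859/2 file in a Latin-1 locale (Python codecs standing in for Mule's reader, which behaves the same way for these charsets):

```python
# A file saved as ISO 8859/2: every byte value is also a valid ISO
# 8859/1 code point, so reading it as Latin-1 raises no error at
# all -- it just silently produces the wrong characters.
data = "Gda\u0144sk le\u017cy nad morzem".encode("iso8859-2")

print(data.decode("iso8859-2"))   # intended: 'Gdańsk leży nad morzem'
print(data.decode("iso8859-1"))   # what a Latin-1 locale sees: mojibake
```

Both decodes succeed; no mechanical check distinguishes them, which is exactly why the autorecognizer must guess.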
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled.  You need a human being who can recognize
+that @emph{this is German and Swedish} and stays in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format.  If you really need the space,
+compress your files.
+
+
+@node Specify Coding, Charsets and Coding Systems, Unification, Mule
 @section Specifying a Coding System
 
   In cases where XEmacs does not automatically choose the right coding
@@ -549,3 +1286,192 @@
 those non-Latin-1 characters which the specified coding system can
 encode.  By default, this variable is @code{nil}, which implies that you
 cannot use non-Latin-1 characters in file names.
+
+
+@node Charsets and Coding Systems, , Specify Coding, Mule
+@section Charsets and Coding Systems
+
+This section provides reference lists of Mule charsets and coding
+systems.  Mule charsets are typically named by character set and
+standard.
+
+@table @strong
+@item ASCII variants
+
+Identification of equivalent characters in these sets is not properly
+implemented.  Unification does not distinguish the two charsets.
+
+@samp{ascii} @samp{latin-jisx0201}
+
+@item Extended Latin
+
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+unification.
+
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+
+The following charsets are Latin variants which are not understood by
+unification.  In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters.  None of these
+are identified with their ISO 8859 equivalents.
+
+@samp{vietnamese-viscii-lower}
+@samp{vietnamese-viscii-upper}
+
+@item Other character sets
+
+@samp{arabic-1-column}
+@samp{arabic-2-column}
+@samp{arabic-digit}
+@samp{arabic-iso8859-6}
+@samp{chinese-big5-1}
+@samp{chinese-big5-2}
+@samp{chinese-cns11643-1}
+@samp{chinese-cns11643-2}
+@samp{chinese-cns11643-3}
+@samp{chinese-cns11643-4}
+@samp{chinese-cns11643-5}
+@samp{chinese-cns11643-6}
+@samp{chinese-cns11643-7}
+@samp{chinese-gb2312}
+@samp{chinese-isoir165}
+@samp{cyrillic-iso8859-5}
+@samp{ethiopic}
+@samp{greek-iso8859-7}
+@samp{hebrew-iso8859-8}
+@samp{ipa}
+@samp{japanese-jisx0208}
+@samp{japanese-jisx0208-1978}
+@samp{japanese-jisx0212}
+@samp{katakana-jisx0201}
+@samp{korean-ksc5601}
+@samp{sisheng}
+@samp{thai-tis620}
+@samp{thai-xtis}
+
+@item Non-graphic charsets
+
+@samp{control-1}
+@end table
+
+@table @strong
+@item No conversion
+
+Some of these coding systems may specify EOL conventions.  Note that
+@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
+coding system.  Although unification attempts to compensate for this, it
+is possible that the @samp{iso-8859-1} coding system will behave
+differently from other ISO 8859 coding systems.
+
+@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
+
+@item Latin coding systems
+
+These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
+combining ASCII in the GL register (bytes with high-bit clear) and an
+extended Latin character set in the GR register (bytes with high-bit set).
+
+@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
+@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
+
+These coding systems are single-byte, 8-bit coding systems that do not
+conform to international standards.  They should be avoided in all
+potentially multilingual contexts, including any text distributed over
+the Internet and World Wide Web.
+
+@samp{windows-1251}
+
+@item Multilingual coding systems
+
+The following ISO-2022-based coding systems are useful for multilingual
+text.
+
+@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
+@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
+
+XEmacs also supports Unicode with the Mule-UCS package.  These are the
+preferred coding systems for multilingual use.  (There is a possible
+exception for texts that mix several Asian ideographic character sets.)
+
+@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
+@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
+@samp{utf-8} @samp{utf-8-ws}
+
+Development versions of XEmacs (the 21.5 series) support Unicode
+internally, with (at least) the following coding systems implemented:
+
+@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
+@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
+
+@item Asian ideographic languages
+
+The following coding systems are based on ISO 2022, and are more or less
+suitable for encoding multilingual texts.  They all can represent ASCII
+at least, and sometimes several other foreign character sets, without
+resort to arbitrary ISO 2022 designations.  However, these subsets are
+not identified with the corresponding national standards in XEmacs Mule.
+
+@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
+@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
+@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
+@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
+@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
+
+The following coding systems cannot be used for general multilingual
+text and do not cooperate well with other coding systems.
+
+@samp{big5} @samp{shift_jis}
+
+@item Other languages
+
+The following coding systems are based on ISO 2022.
+Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+
+The following are character sets that do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+
+@item Special coding systems
+
+Mule uses the following coding systems for special purposes.
+
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+
+@samp{file-name} @samp{keyboard} @samp{terminal}
+
+@end table
+
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}.  These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
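What these categories can and cannot see from the raw bytes can be sketched with ordinary codecs. A Python illustration (Python's codecs standing in for Mule's detector; the detector itself works differently, but the byte-level facts are the same):

```python
# ISO 2022 coding systems announce their charsets with escape
# sequences, so they fall into a detectable category; bare 8-bit
# Latin charsets carry no such markers.
jp = "\u65e5\u672c".encode("iso2022_jp")   # '日本'
assert jp.startswith(b"\x1b$B")            # ESC $ B designates JIS X 0208
assert jp.endswith(b"\x1b(B")              # ESC ( B returns to ASCII

# UTF-8 is also detectable, because most legacy 8-bit data is
# malformed as UTF-8:
latin2 = "\u0141\u00f3d\u017a".encode("iso8859-2")   # 'Łódź'
try:
    latin2.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8: detectable")

# But the same bytes decode as ISO 8859/1 without any error,
# which is why the iso-8-1/iso-8-2 categories must guess:
print(latin2.decode("iso8859-1"))   # wrong characters, yet "valid"
```

This is the byte-level reason the coding-category list above distinguishes, for example, @samp{iso-7} and @samp{utf-8} reliably, while @samp{iso-8-1} versus @samp{iso-8-2} remains a guess.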