xemacs-beta: man/lispref/mule.texi comparison

comparison man/lispref/mule.texi @ 1183:c1553814932e

[xemacs-hg @ 2003-01-03 12:12:30 by stephent] various docs <873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp> <87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>

author	stephent
date	Fri, 03 Jan 2003 12:12:40 +0000
parents	37e56e920ac5
children	11ff4edb6bb7


-:7d696106ffe9
+:c1553814932e
 * Composite Characters:: Making new characters by overstriking other ones.
 * Coding Systems::      Ways of representing a string of chars using integers.
 * CCL::                 A special language for writing fast converters.
 * Category Tables::     Subdividing charsets into groups.
 * Unicode Support::     The universal coded character set.
+* Charset Unification:: Handling overlapping character sets.
+* Charsets and Coding Systems:: Tables and reference information.
 @end menu
 @node Internationalization Terminology, Charsets, , MULE
 @section Internationalization Terminology
 Valid values are @code{nil} or a bit vector of size 95.
 @end defun
 @c Added 2002-03-13 sjt
-@node Unicode Support, , Category Tables, MULE
+@node Unicode Support, Charset Unification, Category Tables, MULE
 @section Unicode Support
 @cindex unicode
 @cindex utf-8
 @cindex utf-16
 @cindex ucs-2
 The charset codepoint is a Big Five codepoint; convert it to the
 proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'.
 @end table
 @end defun
+@node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE
+@section Character Set Unification
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This results in oddities such as
+files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
+2022 control sequences to switch between them, as well as more plausible
+but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
+This can be very annoying when sending messages or even in simple
+editing on a single host.  Unification works around the problem by
+converting as many characters as possible to use a single Latin coded
+character set before saving the buffer.
+This node and its children were ripp'd untimely from
+@file{latin-unity.texi}, and have been quickly converted for use here.
+However, as the APIs are likely to diverge, beware of inaccuracies.  Please
+report any you discover with @kbd{M-x report-xemacs-bug RET}, as well
+as any ambiguities or downright unintelligible passages.
+A lot of the stuff here doesn't belong here; it belongs in the
+@ref{Top, , , xemacs, XEmacs User's Manual}.  Report those as bugs,
+too, preferably with patches.
+@menu
+* Overview::                    Unification history and general information.
+* Usage::                       An overview of the operation of Unification.
+* Configuration::               Configuring Unification for use.
+* Theory of Operation::         How Unification works.
+* What Unification Cannot Do for You::  Inherent problems of 8-bit charsets.
+* Charsets and Coding Systems:: Reference lists with annotations.
+* Unification Internals::       Utilities and implementation details.
+@end menu
+@node Overview, Usage, Charset Unification, Charset Unification
+@subsection An Overview of Unification
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This manifests itself when a user
+enters characters using input methods associated with different coded
+character sets into a single buffer.
+A very important example involves email.  Many sites, especially in the
+U.S., default to use of the ISO 8859/1 coded character set (also called
+``Latin 1,'' though these are somewhat different concepts).  However,
+ISO 8859/1 provides a generic CURRENCY SIGN character.  Now that the
+Euro has become the official currency of most countries in Europe, this
+is unsatisfactory (and in practice, useless).  So Europeans generally
+use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
+languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in this case, ISO 8859/2).  However,
+this often depends on exactly what text is to be encoded.
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+@node Usage, Configuration, Overview, Charset Unification
+@subsection Operation of Unification
+Normally, Unification works in the background by installing
+@code{unity-sanity-check} on @code{write-region-pre-hook}.  This is
+done by default for the ISO 8859 Latin family of character sets.  The
+user activates this functionality for other character set families by
+invoking @code{enable-unification}, either interactively or in her
+init file.  @xref{Init File, , , xemacs}.  Unification can be
+deactivated by invoking @code{disable-unification}.
+Unification also provides a few functions for remapping or recoding the
+buffer by hand.  To @dfn{remap} a character means to change the buffer
+representation of the character by using another coded character set.
+Remapping never changes the identity of the character, but may involve
+altering the code point of the character.  To @dfn{recode} a character
+means to simply change the coded character set.  Recoding never alters
+the code point of the character, but may change the identity of the
+character.  @xref{Theory of Operation}.
+There are a few variables which determine which coding systems are
+always acceptable to Unification:  @code{unity-ucs-list},
+@code{unity-preferred-coding-system-list}, and
+@code{unity-preapproved-coding-system-list}.  The latter two default
+to @code{()}, and should probably be avoided because they short-circuit
+the sanity check.  If you find you need to use them, consider reporting
+it as a bug or request for enhancement.  Because they seem unsafe, the
+recommended interface is likely to change.
+@menu
+* Basic Functionality::            User interface and customization.
+* Interactive Usage::              Treating text by hand.
+Also documents the hook function(s).
+@end menu
+@node Basic Functionality, Interactive Usage, , Usage
+@subsubsection Basic Functionality
+These functions and user options initialize and configure Unification.
+In normal use, none of these should be needed.
+@strong{These APIs are certain to change.}
+@defun enable-unification
+Set up hooks and initialize variables for latin-unity.
+There are no arguments.
+This function is idempotent.  It will reinitialize any hooks or variables
+that are not in initial state.
+@end defun
+@defun disable-unification
+Clean up hooks and void variables used by latin-unity.
+There are no arguments.
+@end defun
+@defopt unity-ucs-list
+List of coding systems considered to be universal.
+The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
+Order matters; coding systems earlier in the list will be preferred when
+recommending a coding system.  These coding systems will not be used
+without querying the user (unless they are also present in
+@code{unity-preapproved-coding-system-list}), and follow the
+@code{unity-preferred-coding-system-list} in the list of suggested
+coding systems.
+If none of the preferred coding systems are feasible, the first in
+this list will be the default.
+Notes on certain coding systems:  @code{escape-quoted} is a special
+coding system used for autosaves and compiled Lisp in Mule.  You should
+@c #### fix in latin-unity.texi
+never delete this, although it is rare that a user would want to use it
+directly.  Unification does not try to be ``smart'' about other general
+ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
+as equivalent to @code{iso-2022-7}.)  If your preferred coding system is
+one of these, you may consider adding it to @code{unity-ucs-list}.
+However, this will typically have the side effect that (eg) ISO 8859/1
+files will be saved in 7-bit form with ISO 2022 escape sequences.
+@end defopt
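+As a tentative illustration of the note on general ISO 2022 coding
+systems in the description of @code{unity-ucs-list}, a user whose
+preferred coding system is ISO-2022-JP might extend the list in her init
+file as sketched below.  The choice of @code{iso-2022-jp} and its
+position at the end of the list are assumptions made for this example,
+not a recommendation; as noted in that description, such a setting
+typically causes ISO 8859/1 files to be saved in 7-bit form with ISO 2022
+escape sequences.
+@example
+;; Sketch only: keep the documented defaults first, since order matters,
+;; and append one general ISO 2022 coding system (an assumed preference).
+(setq unity-ucs-list
+      '(utf-8 iso-2022-7 ctext escape-quoted iso-2022-jp))
+@end example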
+Coding systems which are not Latin and not in
+@code{unity-ucs-list} are handled by short circuiting checks of
+coding system against the next two variables.
+@defopt unity-preapproved-coding-system-list
+List of coding systems used without querying the user if feasible.
+The default value is @samp{(buffer-default preferred)}.
+The first feasible coding system in this list is used.  The special values
+@samp{preferred} and @samp{buffer-default} may be present:
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+Note that the first universal coding system in this list shadows all
+other coding systems.  In particular, if your preferred coding system is
+a universal coding system, and @code{preferred} is a member of this
+list, unification will blithely convert all your files to that coding
+system.  This is considered a feature, but it may surprise most users.
+Users who don't like this behavior should put @code{preferred} in
+@code{unity-preferred-coding-system-list}.
+@end defopt
+@defopt unity-preferred-coding-system-list
+@c #### fix in latin-unity.texi
+List of coding systems suggested to the user if feasible.
+The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
+iso-8859-4 iso-8859-9)}.
+If none of the coding systems in
+@c #### fix in latin-unity.texi
+@code{unity-preapproved-coding-system-list} are feasible, this list
+will be recommended to the user, followed by the
+@code{unity-ucs-list}.  The first coding system in this list is default.  The
+special values @samp{preferred} and @samp{buffer-default} may be
+present:
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+@end defopt
+@defvar unity-iso-8859-1-aliases
+List of coding systems to be treated as aliases of ISO 8859/1.
+The default value is @code{'(iso-8859-1)}.
+This is not a user variable; to customize input of coding systems or
+charsets, use @samp{unity-coding-system-alias-alist} or
+@samp{unity-charset-alias-alist}.
+@end defvar
+@node Interactive Usage, , Basic Functionality, Usage
+@subsubsection Interactive Usage
+First, the hook function @code{unity-sanity-check} is documented.
+(It is placed here because it is not an interactive function, and there
+is not yet a programmer's section of the manual.)
+These functions provide access to internal functionality (such as the
+remapping function) and to extra functionality (the recoding functions
+and the test function).
+@defun unity-sanity-check begin end filename append visit lockname &optional coding-system
+Check if @var{coding-system} can represent all characters between
+@var{begin} and @var{end}.
+For compatibility with old broken versions of @code{write-region},
+@var{coding-system} defaults to @code{buffer-file-coding-system}.
+@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
+ignored.
+Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
+Latin.  If @code{buffer-file-coding-system} is safe for the charsets
+actually present in the buffer, return it.  Otherwise, ask the user to
+choose a coding system, and return that.
+This function does @emph{not} do the safe thing when
+@code{buffer-file-coding-system} is nil (aka no-conversion).  It
+considers that ``non-Latin,'' and passes it on to the Mule detection
+mechanism.
+This function is intended for use as a @code{write-region-pre-hook}.  It
+does nothing except return @var{coding-system} if @code{write-region}
+handlers are inhibited.
+@end defun
+@defun unity-buffer-representations-feasible
+There are no arguments.
+Apply @code{unity-region-representations-feasible} to the current buffer.
+@end defun
+@defun unity-region-representations-feasible begin end &optional buf
+Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}.
+@var{buf} defaults to the current buffer.  Called interactively, will be
+applied to the region.  Function assumes @var{begin} <= @var{end}.
+The return value is a cons.  The car is the list of character sets
+that can individually represent all of the non-ASCII portion of the
+buffer, and the cdr is the list of character sets that can
+individually represent all of the ASCII portion.
+The following is taken from a comment in the source.  Please refer to
+the source to be sure of an accurate description.
+The basic algorithm is to map over the region, compute the set of
+charsets that can represent each character (the ``feasible charset''),
+and take the intersection of those sets.
+The current implementation takes advantage of the fact that ASCII
+characters are common and cannot change @code{asciisets}.  Then using
+@code{skip-chars-forward} makes motion over ASCII subregions very fast.
+This same strategy could be applied generally by precomputing classes
+of characters equivalent according to their effect on @code{latinsets}, and
+adding a whole class to the @code{skip-chars-forward} string once a member is
+found.
+Probably efficiency is a function of the number of characters matched,
+or maybe the length of the match string?  With @code{skip-category-forward}
+over a precomputed category table it should be really fast.  In practice
+for Latin character sets there are only 29 classes.
+@end defun
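+The intersection step described for
+@code{unity-region-representations-feasible} can be sketched in a few
+lines of Lisp.  This is only an illustration under stated assumptions:
+the table @code{my-feasible-table}, its three entries, and both function
+names are hypothetical, and the real implementation works on buffer text
+and Mule's own tables rather than on symbols.
+@example
+;; Hypothetical table: character name -> charsets able to represent it.
+;; Only three Latin charsets are considered, to keep the sketch short.
+(defvar my-feasible-table
+  '((o-acute   latin-iso8859-1 latin-iso8859-2 latin-iso8859-15)
+    (l-stroke  latin-iso8859-2)
+    (euro-sign latin-iso8859-15)))
+
+(defun my-intersect (a b)
+  "Return the elements of list A that are also members of list B."
+  (let ((result nil))
+    (while a
+      (if (memq (car a) b)
+          (setq result (cons (car a) result)))
+      (setq a (cdr a)))
+    (nreverse result)))
+
+(defun my-feasible-charsets (chars)
+  "Intersect the feasible charsets of every symbol in CHARS.
+CHARS is assumed to be a non-empty list of keys of `my-feasible-table'."
+  (let ((feasible (cdr (assq (car chars) my-feasible-table))))
+    (setq chars (cdr chars))
+    (while chars
+      (setq feasible
+            (my-intersect feasible
+                          (cdr (assq (car chars) my-feasible-table))))
+      (setq chars (cdr chars)))
+    feasible))
+
+(my-feasible-charsets '(o-acute l-stroke))
+     @result{} (latin-iso8859-2)
+(my-feasible-charsets '(o-acute euro-sign))
+     @result{} (latin-iso8859-15)
+(my-feasible-charsets '(l-stroke euro-sign))
+     @result{} nil
+@end example
+An empty result, as in the last call, corresponds to text that no single
+charset in the table can represent.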
+@defun unity-remap-region begin end character-set &optional coding-system
+Remap characters between @var{begin} and @var{end} to equivalents in
+@var{character-set}.  Optional argument @var{coding-system} may be a
+coding system name (a symbol) or nil.  Characters with no equivalent are
+left as-is.
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{character-set}.  The function does completion, knows
+how to guess a character set name from a coding system name, and also
+provides some common aliases.  See @code{unity-guess-charset}.
+There is no way to specify @var{coding-system}, as it has no useful
+function interactively.
+Return @var{coding-system} if @var{coding-system} can encode all
+characters in the region, t if @var{coding-system} is nil and the coding
+system with G0 = 'ascii and G1 = @var{character-set} can encode all
+characters, and otherwise nil.  Note that a non-null return does
+@emph{not} mean it is safe to write the file, only the specified region.
+(This behavior is useful for multipart MIME encoding and the like.)
+Note:  by default this function is quite fascist about universal coding
+systems.  It only admits @samp{utf-8}, @samp{iso-2022-7}, and
+@samp{ctext}.  Customize @code{unity-approved-ucs-list} to change
+this.
+This function remaps characters that are artificially distinguished by Mule
+internal code.  It may change the code point as well as the character set.
+To recode characters that were decoded in the wrong coding system, use
+@code{unity-recode-region}.
+@end defun
+@defun unity-recode-region begin end wrong-cs right-cs
+Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
+to @var{right-cs}.
+@var{wrong-cs} and @var{right-cs} are character sets.  Characters retain
+the same code point but the character set is changed.  Only characters
+from @var{wrong-cs} are changed to @var{right-cs}.  The identity of the
+character may change.  Note that this could be dangerous, if characters
+whose identities you do not want changed are included in the region.
+This function cannot guess which characters you want changed, and which
+should be left alone.
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a character set name from a coding system
+name, and also provides some common aliases.  See
+@code{unity-guess-charset}.
+Another way to accomplish this, but using coding systems rather than
+character sets to specify the desired recoding, is
+@samp{unity-recode-coding-region}.  That function may be faster
+but is somewhat more dangerous, because it may recode more than one
+character set.
+To change from one Mule representation to another without changing identity
+of any characters, use @samp{unity-remap-region}.
+@end defun
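+To make the distinction concrete, the two calls below contrast remapping
+and recoding over a whole buffer.  They are a sketch that assumes the
+functions behave as documented in this section; the charset arguments
+are chosen purely for illustration.
+@example
+;; Remap: preserve each character's identity, moving it to its
+;; equivalent in Latin-2 where one exists (code points may change).
+(unity-remap-region (point-min) (point-max) 'latin-iso8859-2)
+
+;; Recode: preserve each code point, but treat characters that were
+;; decoded as Latin-1 as Latin-2 instead (identities may change).
+(unity-recode-region (point-min) (point-max)
+                     'latin-iso8859-1 'latin-iso8859-2)
+@end example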
+@defun unity-recode-coding-region begin end wrong-cs right-cs
+Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
+@var{right-cs}.
+@var{wrong-cs} and @var{right-cs} are coding systems.  Characters retain
+the same code point but the character set is changed.  The identity of
+characters may change.  This is an inherently dangerous function;
+multilingual text may be recoded in unexpected ways.  #### It's also
+dangerous because the coding systems are not sanity-checked in the
+current implementation.
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a coding system name from a character set
+name, and also provides some common aliases.  See
+@code{unity-guess-coding-system}.
+Another, safer, way to accomplish this, using character sets rather
+than coding systems to specify the desired recoding, is to use
+@c #### fixme in latin-unity.texi
+@code{unity-recode-region}.
+To change from one Mule representation to another without changing identity
+of any characters, use @code{unity-remap-region}.
+@end defun
+Helper functions for input of coding system and character set names.
+@defun unity-guess-charset candidate
+Guess a charset based on the symbol @var{candidate}.
+@var{candidate} itself is not tried as the value.
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-charset-alias-alist}.
+@end defun
+@defun unity-guess-coding-system candidate
+Guess a coding system based on the symbol @var{candidate}.
+@var{candidate} itself is not tried as the value.
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-coding-system-alias-alist}.
+@end defun
+@defun unity-example
+A cheesy example for Unification.
+At present it just makes a multilingual buffer.  To test, @code{setq}
+@code{buffer-file-coding-system} to some value, make the buffer dirty (eg
+with @kbd{RET BackSpace}), and save.
+@end defun
+@node Configuration, Theory of Operation, Usage, Charset Unification
+@subsection Configuring Unification for Use
+If you want Unification to be automatically initialized, invoke
+@samp{enable-unification} with no arguments in your init file.
+@xref{Init File, , , xemacs}.  If you are using GNU Emacs or an XEmacs
+earlier than 21.1, you should also load @file{auto-autoloads} using the
+full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
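+A minimal init file fragment along these lines might look as follows.
+The @code{fboundp} guard is an addition made for this sketch, so that
+the init file still loads where Unification is not available.
+@example
+;; Enable Unification only if the function is defined or autoloaded.
+(if (fboundp 'enable-unification)
+    (enable-unification))
+@end example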
+You may wish to define aliases for commonly used character sets and
+coding systems for convenience in input.
+@defopt unity-charset-alias-alist
+Alist mapping aliases to Mule charset names (symbols).
+The default value is
+@example
+((latin-1 . latin-iso8859-1)
+ (latin-2 . latin-iso8859-2)
+ (latin-3 . latin-iso8859-3)
+ (latin-4 . latin-iso8859-4)
+ (latin-5 . latin-iso8859-9)
+ (latin-9 . latin-iso8859-15)
+ (latin-10 . latin-iso8859-16))
+@end example
+If a charset does not exist on your system, it will not complete and you
+will not be able to enter it in response to prompts.  A real charset
+with the same name as an alias in this list will shadow the alias.
+@end defopt
+@defopt unity-coding-system-alias-alist
+Alist mapping aliases to Mule coding system names (symbols).
+The default value is @samp{nil}.
+@end defopt
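+For example, coding system aliases paralleling the default charset
+aliases could be defined as sketched below.  The alias names are
+hypothetical, and the sketch assumes that the named coding systems exist
+in your XEmacs.
+@example
+;; Each entry maps an alias (car) to a Mule coding system name (cdr).
+(setq unity-coding-system-alias-alist
+      '((latin-1 . iso-8859-1)
+        (latin-2 . iso-8859-2)
+        (latin-9 . iso-8859-15)))
+@end example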
+@node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification
+@subsection Theory of Operation
+Standard encodings suffer from the design defect that they do not
+provide a reliable way to recognize which coded character sets are in use.
+@xref{What Unification Cannot Do for You}.  There are scores of
+character sets which can be represented by a single octet (8-bit byte),
+whose union contains many hundreds of characters.  Obviously this
+results in great confusion, since you can't tell the players without a
+scorecard, and there is no scorecard.
+There are two ways to solve this problem.  The first is to create a
+universal coded character set.  This is the concept behind Unicode.
+Satisfactory (nearly) universal character sets have existed for several
+decades, yet even today many Westerners resist using Unicode
+because they consider its space requirements excessive.  On the other
+hand, Asians dislike Unicode because they consider it to be incomplete.
+(This is partly, but not entirely, political.)
+In any case, Unicode only solves the internal representation problem.
+Many data sets will contain files in ``legacy'' encodings, and Unicode
+does not help distinguish among them.
+The second approach is to embed information about the encodings used in
+a document in its text.  This approach is taken by the ISO 2022
+standard.  This would solve the problem completely from the users' point of
+view, except that ISO 2022 is basically not implemented at all, in the
+sense that few applications or systems implement more than a small
+subset of ISO 2022 functionality.  This is due to the fact that
+mono-literate users object to the presence of escape sequences in their
+texts (which they, with some justification, consider data corruption).
+Programmers are more than willing to cater to these users, since
+implementing ISO 2022 is a painstaking task.
+In fact, Emacs/Mule adopts both of these approaches.  Internally it uses
+a universal character set, @dfn{Mule code}.  Externally it uses ISO 2022
+techniques both to save files in forms robust to encoding issues, and as
+hints when attempting to ``guess'' an unknown encoding.  However, Mule
+suffers from a design defect: it embeds in each character the character
+set information that ISO 2022 attaches to a whole run of characters by
+introducing the run with a control sequence.  That causes Mule to
+consider the ISO Latin character sets to be disjoint.  This manifests
+itself when a user enters characters using input methods associated with
+different coded character sets into a single buffer.
+There are two problems stemming from this design.  First, Mule
+represents the same character in different ways.  Abstractly, @samp{@'o}
+(LATIN SMALL LETTER O WITH ACUTE) can get represented as
+[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73].  So what looks like
+@samp{@'o@'o} in the display might actually be represented [latin-iso8859-1
+#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
+#xF3 ESC - A] in the file.  In some cases this treatment would be
+appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
+(the CJK ideographic character meaning ``one'')), and although arguably
+incorrect it is convenient when mixing the CJK scripts.  But in the case
+of the Latin scripts this is wrong.
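+One way to observe this from Lisp is sketched below, using the Mule
+functions @code{make-char} and @code{char-charset}.  The sketch assumes
+a Mule build that exhibits the defect described here; on such a build
+the two characters report different charsets, and the final @code{eq}
+test should yield @code{nil} even though both display as the same letter.
+@example
+(let ((o1 (make-char 'latin-iso8859-1 #x73))
+      (o2 (make-char 'latin-iso8859-2 #x73)))
+  ;; Return both charsets and whether Mule treats the characters as identical.
+  (list (char-charset o1) (char-charset o2) (eq o1 o2)))
+@end example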
+Worse yet, it is very likely to occur when mixing ``different'' encodings
+(such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
+points that are almost never used.  A very important example involves
+email.  Many sites, especially in the U.S., default to use of the ISO
+8859/1 coded character set (also called ``Latin 1,'' though these are
+somewhat different concepts).  However, ISO 8859/1 provides a generic
+CURRENCY SIGN character.  Now that the Euro has become the official
+currency of most countries in Europe, this is unsatisfactory (and in
+practice, useless).  So Europeans generally use ISO 8859/15, which is
+nearly identical to ISO 8859/1 for most languages, except that it
+substitutes EURO SIGN for CURRENCY SIGN.
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in the case of German and Polish,
+ISO 8859/2).  However, this often depends on exactly what text is to be
+encoded (even for the same pair of languages).
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+Because the problem is rarely noticeable in editing a buffer, but tends
+to manifest when that buffer is exported to a file or process, the
+Unification package uses the strategy of examining the buffer prior to
+export.  If use of multiple Latin coded character sets is detected,
+Unification attempts to unify them by finding a single coded character
+set which contains all of the Latin characters in the buffer.
+The primary purpose of Unification is to fix the problem by giving the
+user the choice to change the representation of all characters to one
+character set and give sensible recommendations based on context.  In
+the @samp{@'o} example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
+both will be suggested.  In the EURO SIGN example, only ISO 8859/15
+makes sense, and that is what will be recommended.  In both cases, the
+user will be reminded that there are universal encodings available.
+I call this @dfn{remapping} (from the universal character set to a
+particular ISO 8859 coded character set).  It is mere accident that this
+letter has the same code point in both character sets.  (Not entirely,
+but there are many examples of Latin characters that have different code
+points in different Latin-X sets.)
+Note that, in the @samp{@'o} example, treating the buffer in this way will
+result in a representation such as [latin-iso8859-2
+#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
+This is guaranteed to occasionally result in the second problem,
+to which we now turn.
+This problem is that, although the file is intended to be an
+ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
+compliant program---this is required by the standard, obvious if you
+think a bit, @pxref{What Unification Cannot Do for You}) will read that
+file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73].  Of course this
+is no problem if all of the characters in the file are contained in ISO
+8859/1, but suppose there are some which are not, but are contained in
+the (intended) ISO 8859/2.
+You now want to fix this, but not by finding the same character in
+another set.  Instead, you want to simply change the character set that
+Mule associates with that buffer position without changing the code.
+(This is conceptually somewhat distinct from the first problem, and
+logically ought to be handled in the code that defines coding systems.
+However, unification is not an unreasonable place for it.)  Unification
+provides two functions (one fast and dangerous, the other slow and
+careful) to handle this.  I call this @dfn{recoding}, because the
+transformation actually involves @emph{encoding} the buffer to file
+representation, then @emph{decoding} it to buffer representation (in a
+different character set).  This cannot be done automatically because
+Mule can have no idea what the correct encoding is---after all, it
+already gave you its best guess.  @xref{What Unification Cannot Do for
+You}.  So these functions must be invoked by the user.  @xref{Interactive
+Usage}.
+@node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification
+@subsection What Unification Cannot Do for You
+Unification @strong{cannot} save you if you insist on exporting data in
+8-bit encodings in a multilingual environment.  @emph{You will
+eventually corrupt data if you do this.}  It is not Mule's, or any
+application's, fault.  You will have only yourself to blame; consider
+yourself warned.  (It is true that Mule has bugs, which make Mule
+somewhat more dangerous and inconvenient than some naive applications.
+We're working to address those, but no application can remedy the
+inherent defect of 8-bit encodings.)
+Use standard universal encodings, preferably Unicode (UTF-8) unless
+applicable standards indicate otherwise.  The most important such case
+is Internet messages, where MIME should be used, whether or not the
+subordinate encoding is a universal encoding.  (Note that since one of
+the important provisions of MIME is the @samp{Content-Type} header,
+which has the charset parameter, MIME is to be considered a universal
+encoding for the purposes of this manual.  Of course, technically
+speaking it's neither a coded character set nor a coding extension
+technique compliant with ISO 2022.)
+As mentioned earlier, the problem is that standard encodings suffer from
+the design defect that they do not provide a reliable way to recognize
+which coded character sets are in use.  There are scores of character
+sets which can be represented by a single octet (8-bit byte), whose
+union contains many hundreds of characters.  Thus any 8-bit coded
+character set must contain characters that share code points used for
+different characters in other coded character sets.
+This means that a given file's intended encoding cannot be identified
+with 100% reliability unless it contains encoding markers such as those
+provided by MIME or ISO 2022.
+Unification actually makes it more likely that you will have problems of
+this kind.  Traditionally Mule has been ``helpful'' by simply using an
+ISO 2022 universal coding system when the current buffer coding system
+cannot handle all the characters in the buffer.  This has the effect
+that, because the file contains control sequences, it is not recognized
+as being in the locale's normal 8-bit encoding.  It may be annoying if
+you are not a Mule expert, but your data is automatically recoverable
+with a tool you already have: Mule.
+However, with unification, Mule converts to a single 8-bit character set
+when possible.  But typically this will @emph{not} be in your usual
+locale.  That is, an ISO 8859/1 user will need Unification
+when there are ISO 8859/2 characters in the buffer.  But then most
+likely the file will be saved in a pure 8-bit encoding that is not ISO
+8859/1, ie, ISO 8859/2.  Mule's autorecognizer (which is probably the
+most sophisticated yet available) cannot tell the difference between ISO
+8859/1 and ISO 8859/2, and in a Western European locale will choose the
+former even though the latter was intended.  Even the extension
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be at
+all accurate in the case of mixed codes.
+So now consider adding some additional ISO 8859/1 text to the buffer.
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled.  You need a human being who can recognize
+that @emph{this is German and Swedish} and stays in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format.  If you really need the space,
+compress your files.
+@node Unification Internals, , What Unification Cannot Do for You, Charset Unification
+@subsection Internals
+No internals documentation yet.
+@file{unity-utils.el} provides one utility function.
+@defun unity-dump-tables
+Dump the temporary table created by loading @file{unity-utils.el}
+to @file{unity-tables.el}.  Loading the latter file initializes
+@samp{unity-equivalences}.
+@end defun
+@node Charsets and Coding Systems, , Charset Unification, MULE
+@section Charsets and Coding Systems
+This section provides reference lists of Mule charsets and coding
+systems.  Mule charsets are typically named by character set and
+standard.
+@table @strong
+@item ASCII variants
+Identification of equivalent characters in these sets is not properly
+implemented.  Unification does not distinguish the two charsets.
+@samp{ascii} @samp{latin-jisx0201}
+@item Extended Latin
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+Unification.
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+The following charsets are Latin variants which are not understood by
+Unification.  In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters.  None of these
+are identified with their ISO 8859 equivalents.
+@samp{vietnamese-viscii-lower}
+@samp{vietnamese-viscii-upper}
+@item Other character sets
+@samp{arabic-1-column}
+@samp{arabic-2-column}
+@samp{arabic-digit}
+@samp{arabic-iso8859-6}
+@samp{chinese-big5-1}
+@samp{chinese-big5-2}
+@samp{chinese-cns11643-1}
+@samp{chinese-cns11643-2}
+@samp{chinese-cns11643-3}
+@samp{chinese-cns11643-4}
+@samp{chinese-cns11643-5}
+@samp{chinese-cns11643-6}
+@samp{chinese-cns11643-7}
+@samp{chinese-gb2312}
+@samp{chinese-isoir165}
+@samp{cyrillic-iso8859-5}
+@samp{ethiopic}
+@samp{greek-iso8859-7}
+@samp{hebrew-iso8859-8}
+@samp{ipa}
+@samp{japanese-jisx0208}
+@samp{japanese-jisx0208-1978}
+@samp{japanese-jisx0212}
+@samp{katakana-jisx0201}
+@samp{korean-ksc5601}
+@samp{sisheng}
+@samp{thai-tis620}
+@samp{thai-xtis}
+@item Non-graphic charsets
+@samp{control-1}
+@end table
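+As a hedged illustration of why the Extended Latin group needs
+unification, the sketch below builds ``the same'' letter from two of the
+charsets listed above and compares the results.  It assumes a
+Mule-enabled XEmacs that does not use Unicode internally; position code
+@samp{#x69} corresponds to octet 0xE9 (e with acute accent), which is
+assigned identically in ISO 8859/1 and ISO 8859/2.
+@example
+(let ((e1 (make-char 'latin-iso8859-1 #x69))
+      (e2 (make-char 'latin-iso8859-2 #x69)))
+  ;; Mule characters carry their charset, so these are expected to be
+  ;; distinct objects even though they denote the same letter.
+  (list (eq e1 e2) (char-charset e1) (char-charset e2)))
+@end example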
+@table @strong
+@item No conversion
+Some of these coding systems may specify EOL conventions.  Note that
+@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
+coding system.  Although unification attempts to compensate for this, it
+is possible that the @samp{iso-8859-1} coding system will behave
+differently from other ISO 8859 coding systems.
+@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
+@item Latin coding systems
+These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
+combining ASCII in the GL register (bytes with high-bit clear) and an
+extended Latin character set in the GR register (bytes with high-bit set).
+@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
+@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
+The following coding systems are single-byte, 8-bit coding systems that do not
+conform to international standards.  They should be avoided in all
+potentially multilingual contexts, including any text distributed over
+the Internet and World Wide Web.
+@samp{windows-1251}
+@item Multilingual coding systems
+The following ISO-2022-based coding systems are useful for multilingual
+text.
+@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
+@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
+XEmacs also supports Unicode with the Mule-UCS package.  These are the
+preferred coding systems for multilingual use.  (There is a possible
+exception for texts that mix several Asian ideographic character sets.)
+@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
+@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
+@samp{utf-8} @samp{utf-8-ws}
+Development versions of XEmacs (the 21.5 series) support Unicode
+internally, with (at least) the following coding systems implemented:
+@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
+@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
+@item Asian ideographic languages
+The following coding systems are based on ISO 2022, and are more or less
+suitable for encoding multilingual texts.  They all can represent ASCII
+at least, and sometimes several other foreign character sets, without
+resort to arbitrary ISO 2022 designations.  However, these subsets are
+not identified with the corresponding national standards in XEmacs Mule.
+@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
+@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
+@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
+@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
+@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
+The following coding systems cannot be used for general multilingual
+text and do not cooperate well with other coding systems.
+@samp{big5} @samp{shift_jis}
+@item Other languages
+The following coding systems are based on ISO 2022.  Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+The following are coding systems that do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+@item Special coding systems
+Mule uses the following coding systems for special purposes.
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+@samp{file-name} @samp{keyboard} @samp{terminal}
+@end table
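+The classification above can be checked from Lisp.  A minimal sketch,
+assuming the named coding systems are defined in the running XEmacs:
+@example
+;; Report the implementation type of each coding system; per the note
+;; above, iso-8859-1 is expected to differ from its ISO 8859 siblings.
+(mapcar (lambda (name)
+          (cons name (coding-system-type (get-coding-system name))))
+        '(iso-8859-1 iso-8859-2 iso-2022-7bit))
+@end example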
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}.  These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
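+A minimal sketch of how these categories surface in Lisp, assuming the
+@code{coding-category-list} and @code{detect-coding-region} functions
+of XEmacs Mule:
+@example
+;; All categories the detector knows about.
+(coding-category-list)
+;; Candidate coding systems for the current buffer's contents.
+(detect-coding-region (point-min) (point-max))
+@end example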
+@c end of mule.texi
