Mercurial > hg > xemacs-beta
changeset 1183:c1553814932e
[xemacs-hg @ 2003-01-03 12:12:30 by stephent]
various docs
<873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp>
<87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author:   stephent
date:     Fri, 03 Jan 2003 12:12:40 +0000
parents:  7d696106ffe9
children: b3e062e7368f
files:    man/ChangeLog man/internals/internals.texi man/lispref/mule.texi man/widget.texi man/xemacs-faq.texi man/xemacs/mule.texi man/xemacs/startup.texi
diffstat: 7 files changed, 2009 insertions(+), 50 deletions(-)
--- a/man/ChangeLog	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/ChangeLog	Fri Jan 03 12:12:40 2003 +0000
@@ -1,3 +1,60 @@
+2003-01-03  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* xemacs/startup.texi (Startup Paths): Hierarchy, not package, layout.
+
+2003-01-03  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* xemacs-faq.texi: Debugging FAQ improvements from Ben Wing.
+	(Q2.0.6): Mention union type bugs.
+	(Q2.1.1): Debugging HOWTO improvements.
+	(Q2.1.15): Decoding Lisp objects in the debugger.
+
+	* widget.texi (Widget Internals): New node.
+	(Top): Add menu item for it.
+
+	* xemacs/xemacs.texi (Top): Better short description of Mule in
+	menu.  Mule submenu.
+
+	Charset unification docs.  What a concept---commit docs first!
+
+	* lispref/mule.texi (MULE): Add Unification and Tables menu entries.
+	(Unicode Support): Fixup next node.
+	(Charset Unification):
+	(Overview):
+	(Usage):
+	(Basic Functionality):
+	(Interactive Usage):
+	(Configuration):
+	(Theory of Operation):
+	(What Unification Cannot Do for You):
+	(Unification Internals):
+	(Charsets and Coding Systems):
+	New nodes.
+
+	* xemacs/mule.texi (Mule): Menu items for Unification and Tables.
+	(Recognize Coding):
+	(Specify Coding):
+	Fixup next and previous pointers.
+	(Unification):
+	(Unification Overview):
+	(Unification Usage):
+	(Unification Configuration):
+	(Unification FAQs):
+	(Unification Theory):
+	(What Unification Cannot Do for You):
+	(Charsets and Coding Systems):
+	New nodes.
+
+2002-12-17  Stephen Turnbull  <stephen@xemacs.org>
+
+	* widget.texi (Widget Wishlist): Typo.
+	(Defining New Widgets): s/widget-define/define-widget/g.
+
+2002-12-27  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* internals/internals.texi (Regression Testing XEmacs): Hints for
+	test design.
+
 2002-10-29  Ville Skyttä  <scop@xemacs.org>
 
 	* xemacs-faq.texi (Top):
--- a/man/internals/internals.texi	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/internals/internals.texi	Fri Jan 03 12:12:40 2003 +0000
@@ -3636,6 +3636,45 @@
 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply
 silently suppress the test if the feature is not available.
 
+Here are a few general hints for writing tests.
+
+@enumerate
+@item
+Include related successful cases.  Fixes often break something.
+
+@item
+Use the @code{Known-Bug-Expect-Failure} macro to mark the cases you know
+are going to fail.  We want to be able to distinguish between
+regressions and other unexpected failures, and cases that have
+been (partially) analyzed but not yet repaired.
+
+@item
+Mark the bug with the date of report.  An ``Unfixed since yyyy-mm-dd''
+gloss for @code{Known-Bug-Expect-Failure} is planned to further increase
+developer embarrassment (== incentive to fix the bug), but until then at
+least put a comment about the date so we can easily see when it was
+first reported.
+
+@item
+It's a matter of your judgement, but you should often use generic tests
+(@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for
+numbers), even though you know that arguments ``should'' be of the
+correct type, whenever the functions used can return generic objects
+(typically @code{nil}) as well as the more specific type returned on
+success.  We don't want failures of those assertions reported as
+``other failures'' (a wrong-type-arg signal, rather than a null
+return); we want them reported as ``assertion failures.''
+
+One example is a test that tests @code{(= (string-match this that) 0)},
+expecting a successful match.  Now suppose @code{string-match} is broken
+such that the match fails.  Then it will return @code{nil}, and @code{=}
+will signal ``wrong-type-argument, number-char-or-marker-p, nil'',
+generating an ``other failure'' in the report.  But this should be
+reported as an assertion failure (the test failed in a foreseeable way),
+rather than something else (we don't know what happened, because XEmacs
+is broken in a way that we weren't trying to test!).
+@end enumerate
+
 @node CVS Techniques, A Summary of the Various XEmacs Modules, Regression Testing XEmacs, Top
 @chapter CVS Techniques
--- a/man/lispref/mule.texi	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/lispref/mule.texi	Fri Jan 03 12:12:40 2003 +0000
@@ -24,6 +24,8 @@
 * CCL::                 A special language for writing fast converters.
 * Category Tables::     Subdividing charsets into groups.
 * Unicode Support::     The universal coded character set.
+* Charset Unification:: Handling overlapping character sets.
+* Charsets and Coding Systems:: Tables and reference information.
 @end menu
 
 @node Internationalization Terminology, Charsets, , MULE
@@ -2072,7 +2074,7 @@
 
 @c Added 2002-03-13 sjt
-@node Unicode Support, , Category Tables, MULE
+@node Unicode Support, Charset Unification, Category Tables, MULE
 @section Unicode Support
 @cindex unicode
 @cindex utf-8
@@ -2181,3 +2183,880 @@
 @end table
 @end defun
+
+@node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE
+@section Character Set Unification
+
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This results in oddities such as
+files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
+2022 control sequences to switch between them, as well as more plausible
+but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
+This can be very annoying when sending messages or even in simple
+editing on a single host.  Unification works around the problem by
+converting as many characters as possible to use a single Latin coded
+character set before saving the buffer.
+
+This node and its children were ripp'd untimely from
+@file{latin-unity.texi}, and have been quickly converted for use here.
+However, as APIs are likely to diverge, beware of inaccuracies.  Please
+report any you discover with @kbd{M-x report-xemacs-bug RET}, as well
+as any ambiguities or downright unintelligible passages.
+
+A lot of the stuff here doesn't belong here; it belongs in the
+@ref{Top, , , xemacs, XEmacs User's Manual}.  Report those as bugs,
+too, preferably with patches.
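+For concreteness, the mixed-charset situation described above can be
+observed directly with the standard Mule primitives @code{make-char}
+and @code{charsets-in-region}.  The following sketch (illustrative
+only; it assumes a Mule-enabled XEmacs) builds a two-character buffer
+that Mule considers to use two distinct Latin charsets:
+
+@example
+;; o-acute as decoded from Latin-1 text (code point #xF3, stored as
+;; the 7-bit position #x73), then EURO SIGN, which exists only in
+;; Latin-9 (ISO 8859/15).
+(with-temp-buffer
+  (insert (make-char 'latin-iso8859-1 #x73))
+  (insert (make-char 'latin-iso8859-15 #x24))
+  ;; Returns a list naming both latin-iso8859-1 and latin-iso8859-15,
+  ;; so no single Latin coding system can save this buffer as-is.
+  (charsets-in-region (point-min) (point-max)))
+@end example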
+
+@menu
+* Overview::                    Unification history and general information.
+* Usage::                       An overview of the operation of Unification.
+* Configuration::               Configuring Unification for use.
+* Theory of Operation::         How Unification works.
+* What Unification Cannot Do for You::  Inherent problems of 8-bit charsets.
+* Charsets and Coding Systems:: Reference lists with annotations.
+* Unification Internals::       Utilities and implementation details.
+@end menu
+
+@node Overview, Usage, Charset Unification, Charset Unification
+@subsection An Overview of Unification
+
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This manifests itself when a user
+enters characters using input methods associated with different coded
+character sets into a single buffer.
+
+A very important example involves email.  Many sites, especially in the
+U.S., default to use of the ISO 8859/1 coded character set (also called
+``Latin 1,'' though these are somewhat different concepts).  However,
+ISO 8859/1 provides a generic CURRENCY SIGN character.  Now that the
+Euro has become the official currency of most countries in Europe, this
+is unsatisfactory (and in practice, useless).  So Europeans generally
+use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
+languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid.
What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in this case, ISO 8859/2).  However,
+this often depends on exactly what text is to be encoded.
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+@node Usage, Configuration, Overview, Charset Unification
+@subsection Operation of Unification
+
+Normally, Unification works in the background by installing
+@code{unity-sanity-check} on @code{write-region-pre-hook}.  This is
+done by default for the ISO 8859 Latin family of character sets.  The
+user activates this functionality for other character set families by
+invoking @code{enable-unification}, either interactively or in her
+init file.  @xref{Init File, , , xemacs}.  Unification can be
+deactivated by invoking @code{disable-unification}.
+
+Unification also provides a few functions for remapping or recoding the
+buffer by hand.  To @dfn{remap} a character means to change the buffer
+representation of the character by using another coded character set.
+Remapping never changes the identity of the character, but may involve
+altering the code point of the character.  To @dfn{recode} a character
+means to simply change the coded character set.  Recoding never alters
+the code point of the character, but may change the identity of the
+character.  @xref{Theory of Operation}.
+
+There are a few variables which determine which coding systems are
+always acceptable to Unification: @code{unity-ucs-list},
+@code{unity-preferred-coding-system-list}, and
+@code{unity-preapproved-coding-system-list}.  The latter two default
+to @code{()}, and should probably be avoided because they short-circuit
+the sanity check.
If you find you need to use them, consider reporting
+it as a bug or request for enhancement.  Because they seem unsafe, the
+recommended interface is likely to change.
+
+@menu
+* Basic Functionality::         User interface and customization.
+* Interactive Usage::           Treating text by hand.
+                                Also documents the hook function(s).
+@end menu
+
+@node Basic Functionality, Interactive Usage, , Usage
+@subsubsection Basic Functionality
+
+These functions and user options initialize and configure Unification.
+In normal use, none of these should be needed.
+
+@strong{These APIs are certain to change.}
+
+@defun enable-unification
+Set up hooks and initialize variables for latin-unity.
+
+There are no arguments.
+
+This function is idempotent.  It will reinitialize any hooks or
+variables that are not in the initial state.
+@end defun
+
+@defun disable-unification
+There are no arguments.
+
+Clean up hooks and void variables used by latin-unity.
+@end defun
+
+@defopt unity-ucs-list
+List of coding systems considered to be universal.
+
+The default value is @code{(utf-8 iso-2022-7 ctext escape-quoted)}.
+
+Order matters; coding systems earlier in the list will be preferred when
+recommending a coding system.  These coding systems will not be used
+without querying the user (unless they are also present in
+@code{unity-preapproved-coding-system-list}), and follow the
+@code{unity-preferred-coding-system-list} in the list of suggested
+coding systems.
+
+If none of the preferred coding systems are feasible, the first in
+this list will be the default.
+
+Notes on certain coding systems: @code{escape-quoted} is a special
+coding system used for autosaves and compiled Lisp in Mule.  You should
+@c #### fix in latin-unity.texi
+never delete this, although it is rare that a user would want to use it
+directly.  Unification does not try to be ``smart'' about other general
+ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
+as equivalent to @code{iso-2022-7}.)
If your preferred coding system is
+one of these, you may consider adding it to @code{unity-ucs-list}.
+However, this will typically have the side effect that (eg) ISO 8859/1
+files will be saved in 7-bit form with ISO 2022 escape sequences.
+@end defopt
+
+Coding systems which are not Latin and not in
+@code{unity-ucs-list} are handled by short-circuiting checks of the
+coding system against the next two variables.
+
+@defopt unity-preapproved-coding-system-list
+List of coding systems used without querying the user if feasible.
+
+The default value is @samp{(buffer-default preferred)}.
+
+The first feasible coding system in this list is used.  The special
+values @samp{preferred} and @samp{buffer-default} may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if
+feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented
+by the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+
+Note that the first universal coding system in this list shadows all
+other coding systems.  In particular, if your preferred coding system is
+a universal coding system, and @code{preferred} is a member of this
+list, unification will blithely convert all your files to that coding
+system.  This is considered a feature, but it may surprise most users.
+Users who don't like this behavior should put @code{preferred} in
+@code{unity-preferred-coding-system-list}.
+@end defopt
+
+@defopt unity-preferred-coding-system-list
+@c #### fix in latin-unity.texi
+List of coding systems suggested to the user if feasible.
+
+The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
+iso-8859-4 iso-8859-9)}.
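+For instance, a user who wants Latin-9 recommended ahead of Latin-1
+whenever both are feasible could reorder the list (a hypothetical
+configuration sketch; the option is the one documented here):
+
+@example
+;; Suggest Latin-9 first when unification must pick a Latin charset.
+(setq unity-preferred-coding-system-list
+      '(iso-8859-15 iso-8859-1 iso-8859-2 iso-8859-3
+        iso-8859-4 iso-8859-9))
+@end example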
+
+If none of the coding systems in
+@c #### fix in latin-unity.texi
+@code{unity-preapproved-coding-system-list} are feasible, this list
+will be recommended to the user, followed by the
+@code{unity-ucs-list}.  The first coding system in this list is the
+default.  The special values @samp{preferred} and @samp{buffer-default}
+may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if
+feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented
+by the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+@end defopt
+
+@defvar unity-iso-8859-1-aliases
+List of coding systems to be treated as aliases of ISO 8859/1.
+
+The default value is @code{(iso-8859-1)}.
+
+This is not a user variable; to customize input of coding systems or
+charsets, use @samp{unity-coding-system-alias-alist} or
+@samp{unity-charset-alias-alist}.
+@end defvar
+
+@node Interactive Usage, , Basic Functionality, Usage
+@subsubsection Interactive Usage
+
+First, the hook function @code{unity-sanity-check} is documented.
+(It is placed here because it is not an interactive function, and there
+is not yet a programmer's section of the manual.)
+
+These functions provide access to internal functionality (such as the
+remapping function) and to extra functionality (the recoding functions
+and the test function).
+
+@defun unity-sanity-check begin end filename append visit lockname &optional coding-system
+
+Check if @var{coding-system} can represent all characters between
+@var{begin} and @var{end}.
+
+For compatibility with old broken versions of @code{write-region},
+@var{coding-system} defaults to @code{buffer-file-coding-system}.
+@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
+ignored.
+
+Return @code{nil} if @code{buffer-file-coding-system} is not
+(ISO-2022-compatible) Latin.  If @code{buffer-file-coding-system} is
+safe for the charsets actually present in the buffer, return it.
+Otherwise, ask the user to choose a coding system, and return that.
+
+This function does @emph{not} do the safe thing when
+@code{buffer-file-coding-system} is @code{nil} (aka
+@code{no-conversion}).  It considers that ``non-Latin,'' and passes it
+on to the Mule detection mechanism.
+
+This function is intended for use as a @code{write-region-pre-hook}.  It
+does nothing except return @var{coding-system} if @code{write-region}
+handlers are inhibited.
+@end defun
+
+@defun unity-buffer-representations-feasible
+
+There are no arguments.
+
+Apply @code{unity-region-representations-feasible} to the current
+buffer.
+@end defun
+
+@defun unity-region-representations-feasible begin end &optional buf
+
+Return character sets that can represent the text from @var{begin} to
+@var{end} in @var{buf}.
+
+@var{buf} defaults to the current buffer.  Called interactively, this
+will be applied to the region.  The function assumes @var{begin} <=
+@var{end}.
+
+The return value is a cons.  The car is the list of character sets
+that can individually represent all of the non-ASCII portion of the
+buffer, and the cdr is the list of character sets that can
+individually represent all of the ASCII portion.
+
+The following is taken from a comment in the source.  Please refer to
+the source to be sure of an accurate description.
+
+The basic algorithm is to map over the region, compute the set of
+charsets that can represent each character (the ``feasible charset''),
+and take the intersection of those sets.
+
+The current implementation takes advantage of the fact that ASCII
+characters are common and cannot change asciisets.  Using
+@code{skip-chars-forward} then makes motion over ASCII subregions very
+fast.
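+The algorithm just described might be sketched as follows.  This is an
+illustrative reimplementation, not the actual code:
+@code{char-feasible-charsets} is a hypothetical helper returning the
+list of Latin charsets able to represent a given character, and
+@code{intersection} comes from the @file{cl} package.
+
+@example
+(defun sketch-feasible-charsets (begin end)
+  "Intersect the feasible charsets of each non-ASCII char in BEGIN..END."
+  (let ((feasible nil) (first t))
+    (save-excursion
+      (goto-char begin)
+      (while (< (point) end)
+        ;; ASCII runs cannot change the result; skip them quickly.
+        (skip-chars-forward "\000-\177" end)
+        (when (< (point) end)
+          (let ((sets (char-feasible-charsets (char-after))))
+            (setq feasible (if first sets (intersection sets feasible))
+                  first nil))
+          (forward-char 1))))
+    feasible))
+@end example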
+
+This same strategy could be applied generally by precomputing classes
+of characters equivalent according to their effect on latinsets, and
+adding a whole class to the @code{skip-chars-forward} string once a
+member is found.
+
+Probably efficiency is a function of the number of characters matched,
+or maybe the length of the match string?  With
+@code{skip-category-forward} over a precomputed category table it
+should be really fast.  In practice for Latin character sets there are
+only 29 classes.
+@end defun
+
+@defun unity-remap-region begin end character-set &optional coding-system
+
+Remap characters between @var{begin} and @var{end} to equivalents in
+@var{character-set}.  Optional argument @var{coding-system} may be a
+coding system name (a symbol) or @code{nil}.  Characters with no
+equivalent are left as-is.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{character-set}.  The function does completion, knows
+how to guess a character set name from a coding system name, and also
+provides some common aliases.  See @code{unity-guess-charset}.
+There is no way to specify @var{coding-system}, as it has no useful
+function interactively.
+
+Return @var{coding-system} if @var{coding-system} can encode all
+characters in the region, @code{t} if @var{coding-system} is @code{nil}
+and the coding system with G0 = @code{ascii} and G1 =
+@var{character-set} can encode all characters, and otherwise @code{nil}.
+Note that a non-null return does @emph{not} mean it is safe to write the
+file, only the specified region.  (This behavior is useful for multipart
+MIME encoding and the like.)
+
+Note: by default this function is quite fascist about universal coding
+systems.  It only admits @samp{utf-8}, @samp{iso-2022-7}, and
+@samp{ctext}.  Customize @code{unity-ucs-list} to change this.
+
+This function remaps characters that are artificially distinguished by
+Mule internal code.
It may change the code point as well as the
+character set.  To recode characters that were decoded in the wrong
+coding system, use @code{unity-recode-region}.
+@end defun
+
+@defun unity-recode-region begin end wrong-cs right-cs
+
+Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
+to @var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are character sets.  Characters retain
+the same code point but the character set is changed.  Only characters
+from @var{wrong-cs} are changed to @var{right-cs}.  The identity of the
+character may change.  Note that this could be dangerous, if characters
+whose identities you do not want changed are included in the region.
+This function cannot guess which characters you want changed, and which
+should be left alone.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a character set name from a coding system
+name, and also provides some common aliases.  See
+@code{unity-guess-charset}.
+
+Another way to accomplish this, but using coding systems rather than
+character sets to specify the desired recoding, is
+@code{unity-recode-coding-region}.  That function may be faster
+but is somewhat more dangerous, because it may recode more than one
+character set.
+
+To change from one Mule representation to another without changing the
+identity of any characters, use @code{unity-remap-region}.
+@end defun
+
+@defun unity-recode-coding-region begin end wrong-cs right-cs
+
+Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
+@var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are coding systems.  Characters retain
+the same code point but the character set is changed.  The identity of
+characters may change.  This is an inherently dangerous function;
+multilingual text may be recoded in unexpected ways.
#### It's also
+dangerous because the coding systems are not sanity-checked in the
+current implementation.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a coding system name from a character set
+name, and also provides some common aliases.  See
+@code{unity-guess-coding-system}.
+
+Another, safer, way to accomplish this, using character sets rather
+than coding systems to specify the desired recoding, is to use
+@c #### fixme in latin-unity.texi
+@code{unity-recode-region}.
+
+To change from one Mule representation to another without changing the
+identity of any characters, use @code{unity-remap-region}.
+@end defun
+
+Helper functions for input of coding system and character set names.
+
+@defun unity-guess-charset candidate
+Guess a charset based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-charset-alias-alist}.
+@end defun
+
+@defun unity-guess-coding-system candidate
+Guess a coding system based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-coding-system-alias-alist}.
+@end defun
+
+@defun unity-example
+
+A cheesy example for Unification.
+
+At present it just makes a multilingual buffer.  To test, @code{setq}
+@code{buffer-file-coding-system} to some value, make the buffer dirty
+(eg with @kbd{RET} @kbd{BackSpace}), and save.
+@end defun
+
+@node Configuration, Theory of Operation, Usage, Charset Unification
+@subsection Configuring Unification for Use
+
+If you want Unification to be automatically initialized, invoke
+@code{enable-unification} with no arguments in your init file.
+@xref{Init File, , , xemacs}.
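+Concretely, the init-file fragment is just the following (this mirrors
+what the package itself does: @code{enable-unification} installs
+@code{unity-sanity-check} on @code{write-region-pre-hook}):
+
+@example
+;; In your init file:
+(enable-unification)
+;; ... and to turn it off again later:
+;; (disable-unification)
+@end example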
If you are using GNU Emacs or an XEmacs
+earlier than 21.1, you should also load @file{auto-autoloads} using the
+full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
+
+You may wish to define aliases for commonly used character sets and
+coding systems for convenience in input.
+
+@defopt unity-charset-alias-alist
+Alist mapping aliases to Mule charset names (symbols).
+
+The default value is
+@example
+  ((latin-1 . latin-iso8859-1)
+   (latin-2 . latin-iso8859-2)
+   (latin-3 . latin-iso8859-3)
+   (latin-4 . latin-iso8859-4)
+   (latin-5 . latin-iso8859-9)
+   (latin-9 . latin-iso8859-15)
+   (latin-10 . latin-iso8859-16))
+@end example
+
+If a charset does not exist on your system, it will not complete and you
+will not be able to enter it in response to prompts.  A real charset
+with the same name as an alias in this list will shadow the alias.
+@end defopt
+
+@defopt unity-coding-system-alias-alist
+Alist mapping aliases to Mule coding system names (symbols).
+
+The default value is @samp{nil}.
+@end defopt
+
+@node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification
+@subsection Theory of Operation
+
+Standard encodings suffer from the design defect that they do not
+provide a reliable way to recognize which coded character sets are in
+use.  @xref{What Unification Cannot Do for You}.  There are scores of
+character sets which can be represented by a single octet (8-bit byte),
+whose union contains many hundreds of characters.  Obviously this
+results in great confusion, since you can't tell the players without a
+scorecard, and there is no scorecard.
+
+There are two ways to solve this problem.  The first is to create a
+universal coded character set.  This is the concept behind Unicode.
+However, although there have been satisfactory (nearly) universal
+character sets for several decades, even today many Westerners resist
+using Unicode because they consider its space requirements excessive.
On the other
+hand, Asians dislike Unicode because they consider it to be incomplete.
+(This is partly, but not entirely, political.)
+
+In any case, Unicode only solves the internal representation problem.
+Many data sets will contain files in ``legacy'' encodings, and Unicode
+does not help distinguish among them.
+
+The second approach is to embed information about the encodings used in
+a document in its text.  This approach is taken by the ISO 2022
+standard.  This would solve the problem completely from the users'
+point of view, except that ISO 2022 is basically not implemented at
+all, in the sense that few applications or systems implement more than
+a small subset of ISO 2022 functionality.  This is due to the fact that
+mono-literate users object to the presence of escape sequences in their
+texts (which they, with some justification, consider data corruption).
+Programmers are more than willing to cater to these users, since
+implementing ISO 2022 is a painstaking task.
+
+In fact, Emacs/Mule adopts both of these approaches.  Internally it uses
+a universal character set, @dfn{Mule code}.  Externally it uses ISO 2022
+techniques both to save files in forms robust to encoding issues, and as
+hints when attempting to ``guess'' an unknown encoding.  However, Mule
+suffers from a design defect, namely it embeds the character set
+information that ISO 2022 attaches to runs of characters by introducing
+them with a control sequence in each character.  That causes Mule to
+consider the ISO Latin character sets to be disjoint.  This manifests
+itself when a user enters characters using input methods associated with
+different coded character sets into a single buffer.
+
+There are two problems stemming from this design.  First, Mule
+represents the same character in different ways.  Abstractly, 'ó'
+(LATIN SMALL LETTER O WITH ACUTE) can get represented as
+[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73].
So what looks like
+'óó' in the display might actually be represented [latin-iso8859-1
+#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
+#xF3 ESC - A] in the file.  In some cases this treatment would be
+appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
+(the CJK ideographic character meaning ``one'')), and although arguably
+incorrect it is convenient when mixing the CJK scripts.  But in the case
+of the Latin scripts this is wrong.
+
+Worse yet, it is very likely to occur when mixing ``different''
+encodings (such as ISO 8859/1 and ISO 8859/15) that differ only in a
+few code points that are almost never used.  A very important example
+involves email.  Many sites, especially in the U.S., default to use of
+the ISO 8859/1 coded character set (also called ``Latin 1,'' though
+these are somewhat different concepts).  However, ISO 8859/1 provides a
+generic CURRENCY SIGN character.  Now that the Euro has become the
+official currency of most countries in Europe, this is unsatisfactory
+(and in practice, useless).  So Europeans generally use ISO 8859/15,
+which is nearly identical to ISO 8859/1 for most languages, except that
+it substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in the case of German and Polish,
+ISO 8859/2).
However, this often depends on exactly what text is to be
+encoded (even for the same pair of languages).
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+Because the problem is rarely noticeable in editing a buffer, but tends
+to manifest when that buffer is exported to a file or process, the
+Unification package uses the strategy of examining the buffer prior to
+export.  If use of multiple Latin coded character sets is detected,
+Unification attempts to unify them by finding a single coded character
+set which contains all of the Latin characters in the buffer.
+
+The primary purpose of Unification is to fix the problem by giving the
+user the choice to change the representation of all characters to one
+character set and give sensible recommendations based on context.  In
+the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
+both will be suggested.  In the EURO SIGN example, only ISO 8859/15
+makes sense, and that is what will be recommended.  In both cases, the
+user will be reminded that there are universal encodings available.
+
+I call this @dfn{remapping} (from the universal character set to a
+particular ISO 8859 coded character set).  It is mere accident that this
+letter has the same code point in both character sets.  (Not entirely,
+but there are many examples of Latin characters that have different code
+points in different Latin-X sets.)
+
+Note that, in the 'ó' example, treating the buffer in this way will
+result in a representation such as [latin-iso8859-2
+#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
+This is guaranteed to occasionally result in the second problem you
+observed, to which we now turn.
+
+This problem is that, although the file is intended to be an
+ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every
+POSIX-compliant program---this is required by the standard, obvious if
+you think a bit, @pxref{What Unification Cannot Do for You}) will read
+that file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73].  Of course
+this is no problem if all of the characters in the file are contained
+in ISO 8859/1, but suppose there are some which are not, but are
+contained in the (intended) ISO 8859/2.
+
+You now want to fix this, but not by finding the same character in
+another set.  Instead, you want to simply change the character set that
+Mule associates with that buffer position without changing the code.
+(This is conceptually somewhat distinct from the first problem, and
+logically ought to be handled in the code that defines coding systems.
+However, unification is not an unreasonable place for it.)  Unification
+provides two functions (one fast and dangerous, the other slow and
+careful) to handle this.  I call this @dfn{recoding}, because the
+transformation actually involves @emph{encoding} the buffer to file
+representation, then @emph{decoding} it to buffer representation (in a
+different character set).  This cannot be done automatically because
+Mule can have no idea what the correct encoding is---after all, it
+already gave you its best guess.  @xref{What Unification Cannot Do for
+You}.  So these functions must be invoked by the user.
+@xref{Interactive Usage}.
+
+@node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification
+@subsection What Unification Cannot Do for You
+
+Unification @strong{cannot} save you if you insist on exporting data in
+8-bit encodings in a multilingual environment.  @emph{You will
+eventually corrupt data if you do this.}  It is not Mule's, or any
+application's, fault.  You will have only yourself to blame; consider
+yourself warned.
(It is true that Mule has bugs, which make Mule +somewhat more dangerous and inconvenient than some naive applications. +We're working to address those, but no application can remedy the +inherent defect of 8-bit encodings.) + +Use standard universal encodings, preferably Unicode (UTF-8) unless +applicable standards indicate otherwise. The most important such case +is Internet messages, where MIME should be used, whether or not the +subordinate encoding is a universal encoding. (Note that since one of +the important provisions of MIME is the @samp{Content-Type} header, +which has the charset parameter, MIME is to be considered a universal +encoding for the purposes of this manual. Of course, technically +speaking it's neither a coded character set nor a coding extension +technique compliant with ISO 2022.) + +As mentioned earlier, the problem is that standard encodings suffer from +the design defect that they do not provide a reliable way to recognize +which coded character sets are in use. There are scores of character +sets which can be represented by a single octet (8-bit byte), whose +union contains many hundreds of characters. Thus any 8-bit coded +character set must contain characters that share code points used for +different characters in other coded character sets. + +This means that a given file's intended encoding cannot be identified +with 100% reliability unless it contains encoding markers such as those +provided by MIME or ISO 2022. + +Unification actually makes it more likely that you will have problems of +this kind. Traditionally Mule has been ``helpful'' by simply using an +ISO 2022 universal coding system when the current buffer coding system +cannot handle all the characters in the buffer. This has the effect +that, because the file contains control sequences, it is not recognized +as being in the locale's normal 8-bit encoding. 
It may be annoying if
+you are not a Mule expert, but your data is automatically recoverable
+with a tool you already have: Mule.
+
+However, with unification, Mule converts to a single 8-bit character set
+when possible.  But typically this will @emph{not} be in your usual
+locale.  That is, the time an ISO 8859/1 user will need Unification is
+precisely when there are ISO 8859/2 characters in the buffer.  But then
+most likely the file will be saved in a pure 8-bit encoding that is not
+ISO 8859/1---here, ISO 8859/2.  Mule's autorecognizer (which is probably
+the most sophisticated yet available) cannot tell the difference between
+ISO 8859/1 and ISO 8859/2, and in a Western European locale will choose
+the former even though the latter was intended.  Even the extension
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be at
+all accurate in the case of mixed codes.
+
+So now consider adding some additional ISO 8859/1 text to the buffer.
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled.  You need a human being who can recognize
+that @emph{this is German and Swedish} and can stay in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format.  If you really need the space,
+compress your files.
+
+
+@node Unification Internals, , What Unification Cannot Do for You, Charset Unification
+@subsection Internals
+
+No internals documentation yet.
+
+@file{unity-utils.el} provides one utility function.
+
+@defun unity-dump-tables
+
+Dump the temporary table created by loading @file{unity-utils.el}
+to @file{unity-tables.el}.  Loading the latter file initializes
+@samp{unity-equivalences}.
+
+@end defun
+
+
+@node Charsets and Coding Systems, , Charset Unification, MULE
+@subsection Charsets and Coding Systems
+
+This section provides reference lists of Mule charsets and coding
+systems.  Mule charsets are typically named by character set and
+standard.
+
+@table @strong
+@item ASCII variants
+
+Identification of equivalent characters in these sets is not properly
+implemented.  Unification does not distinguish the two charsets.
+
+@samp{ascii} @samp{latin-jisx0201}
+
+@item Extended Latin
+
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+Unification.
+
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+
+The following charsets are Latin variants which are not understood by
+Unification.  In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters.  None of these
+are identified with their ISO 8859 equivalents.
+ +@samp{vietnamese-viscii-lower} +@samp{vietnamese-viscii-upper} + +@item Other character sets + +@samp{arabic-1-column} +@samp{arabic-2-column} +@samp{arabic-digit} +@samp{arabic-iso8859-6} +@samp{chinese-big5-1} +@samp{chinese-big5-2} +@samp{chinese-cns11643-1} +@samp{chinese-cns11643-2} +@samp{chinese-cns11643-3} +@samp{chinese-cns11643-4} +@samp{chinese-cns11643-5} +@samp{chinese-cns11643-6} +@samp{chinese-cns11643-7} +@samp{chinese-gb2312} +@samp{chinese-isoir165} +@samp{cyrillic-iso8859-5} +@samp{ethiopic} +@samp{greek-iso8859-7} +@samp{hebrew-iso8859-8} +@samp{ipa} +@samp{japanese-jisx0208} +@samp{japanese-jisx0208-1978} +@samp{japanese-jisx0212} +@samp{katakana-jisx0201} +@samp{korean-ksc5601} +@samp{sisheng} +@samp{thai-tis620} +@samp{thai-xtis} + +@item Non-graphic charsets + +@samp{control-1} +@end table + +@table @strong +@item No conversion + +Some of these coding systems may specify EOL conventions. Note that +@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 +coding system. Although unification attempts to compensate for this, it +is possible that the @samp{iso-8859-1} coding system will behave +differently from other ISO 8859 coding systems. + +@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} + +@item Latin coding systems + +These coding systems are all single-byte, 8-bit ISO 2022 coding systems, +combining ASCII in the GL register (bytes with high-bit clear) and an +extended Latin character set in the GR register (bytes with high-bit set). + +@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} +@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} + +These coding systems are single-byte, 8-bit coding systems that do not +conform to international standards. They should be avoided in all +potentially multilingual contexts, including any text distributed over +the Internet and World Wide Web. 
+ +@samp{windows-1251} + +@item Multilingual coding systems + +The following ISO-2022-based coding systems are useful for multilingual +text. + +@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} +@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} + +XEmacs also supports Unicode with the Mule-UCS package. These are the +preferred coding systems for multilingual use. (There is a possible +exception for texts that mix several Asian ideographic character sets.) + +@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} +@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} +@samp{utf-8} @samp{utf-8-ws} + +Development versions of XEmacs (the 21.5 series) support Unicode +internally, with (at least) the following coding systems implemented: + +@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} +@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} + +@item Asian ideographic languages + +The following coding systems are based on ISO 2022, and are more or less +suitable for encoding multilingual texts. They all can represent ASCII +at least, and sometimes several other foreign character sets, without +resort to arbitrary ISO 2022 designations. However, these subsets are +not identified with the corresponding national standards in XEmacs Mule. + +@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} +@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} +@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} +@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} +@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} + +The following coding systems cannot be used for general multilingual +text and do not cooperate well with other coding systems. + +@samp{big5} @samp{shift_jis} + +@item Other languages + +The following coding systems are based on ISO 2022. 
Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+
+The following are character sets that do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+
+@item Special coding systems
+
+Mule uses the following coding systems for special purposes.
+
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+
+@samp{file-name} @samp{keyboard} @samp{terminal}
+
+@end table
+
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}.  These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
+
+
+@c end of mule.texi
+
--- a/man/widget.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/widget.texi Fri Jan 03 12:12:40 2003 +0000 @@ -33,6 +33,7 @@ * Widget Minor Mode:: * Utilities:: * Widget Wishlist:: +* Widget Internals:: @end menu @node Introduction, User Interface, Top, Top @@ -120,7 +121,7 @@ @table @file @item widget.el This will declare the user variables, define the function -@code{widget-define}, and autoload the function @code{widget-create}. +@code{define-widget}, and autoload the function @code{widget-create}. @item wid-edit.el Everything else is here, there is no reason to load it explicitly, as it will be autoloaded when needed. @@ -1359,7 +1360,7 @@ specifying component widgets and new default values for the keyword arguments. -@defun widget-define name class doc &rest args +@defun define-widget name class doc &rest args Define a new widget type named @var{name} from @code{class}. @var{name} and class should both be symbols, @code{class} should be one @@ -1384,7 +1385,7 @@ @end defun -Using @code{widget-define} just stores the definition of the widget type +Using @code{define-widget} just stores the definition of the widget type in the @code{widget-type} property of @var{name}, which is what @code{widget-create} uses. @@ -1558,7 +1559,7 @@ This is only meaningful for radio buttons or checkboxes in a list. @end defun -@node Widget Wishlist, , Utilities, Top +@node Widget Wishlist, Widget Internals, Utilities, Top @comment node-name, next, previous, up @section Wishlist @@ -1620,7 +1621,7 @@ the field, not the end of the field itself. @item -Use and overlay instead of markers to delimit the widget. Create +Use an overlay instead of markers to delimit the widget. Create accessors for the end points. @item @@ -1631,5 +1632,35 @@ @end itemize +@node Widget Internals, , Widget Wishlist, Top +@section Internals + +This (very brief!) section provides a few notes on the internal +structure and implementation of Emacs widgets. Avoid relying on this +information. 
(We intend to improve it, but this will take some time.)
+To the extent that it actually describes APIs, the information will be
+moved to appropriate sections of the manual in due course.
+
+@subsection The @dfn{Widget} and @dfn{Type} Structures
+
+Widgets and types are currently both implemented as lists.
+
+A symbol may be defined as a @dfn{type name} using @code{define-widget}.
+@xref{Defining New Widgets}.  A @dfn{type} is a list whose car is a
+previously defined type name, @code{nil}, or (recursively) a type.  The
+car is the @dfn{class} or parent type of the type, and properties which
+are not specified in the new type will be inherited from ancestors.
+Probably the only type without a class should be the @code{default}
+type.  The cdr of a type is a plist whose keys are widget property
+keywords.
+
+A type or type name may also be referred to as an @dfn{unconverted
+widget}.
+
+A @dfn{converted widget} or @dfn{widget instance} is a list whose car is
+a type name or a type, and whose cdr is a property list.  Furthermore,
+all children of the converted widget must be converted.  Finally, in the
+process of conversion, appropriate parts of the list structure are
+copied to ensure that changes in the values of one instance do not
+affect another's.
+
 @contents
 @bye
--- a/man/xemacs-faq.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs-faq.texi Fri Jan 03 12:12:40 2003 +0000 @@ -7,7 +7,7 @@ @finalout @titlepage @title XEmacs FAQ -@subtitle Frequently asked questions about XEmacs @* Last Modified: $Date: 2002/12/04 14:06:04 $ +@subtitle Frequently asked questions about XEmacs @* Last Modified: $Date: 2003/01/03 12:12:30 $ @sp 1 @author Tony Rossini <rossini@@biostat.washington.edu> @author Ben Wing <ben@@xemacs.org> @@ -1500,6 +1500,11 @@ buggy optimizers. Please see the @file{PROBLEMS} file that comes with XEmacs to read what it says about your platform. +If you compiled XEmacs using @samp{--use-union-type} (or the option +@samp{USE_UNION_TYPE} in @file{config.inc} under Windows), recompile +again without this. This has been known to trigger compiler errors in a +number of cases. + @node Q2.0.7, Q2.0.8, Q2.0.6, Installation @unnumberedsubsec Q2.0.7: Libraries in non-standard locations @@ -1802,18 +1807,29 @@ particular sequences of actions, that cause it to crash. If you can come up with a reproducible way of doing this (or even if you have a pretty good memory of exactly what you were doing at the time), the -maintainers would be very interested in knowing about it. Post a -message to comp.emacs.xemacs or send mail to @email{crashes@@xemacs.org}. -Please note that the @samp{crashes} address is exclusively for crash +maintainers would be very interested in knowing about it. The best way +to report a bug is using @kbd{M-x report-emacs-bug} (or by selecting +@samp{Send Bug Report...} from the Help menu). If that won't work +(e.g. you can't get XEmacs working at all), send ordinary mail to +@email{crashes@@xemacs.org}. @emph{MAKE SURE} to include the output from +the crash, especially including the Lisp backtrace, as well as the +XEmacs configuration from @kbd{M-x describe-installation} (or +equivalently, the file @file{Installation} in the top of the build +tree). 
Please note that the @samp{crashes} address is exclusively for
+crash reports.  The best way to report bugs in general is through the
+@kbd{M-x report-emacs-bug} interface just mentioned, or if necessary by
+emailing @email{xemacs-beta@@xemacs.org}.  Note that the developers do
+@emph{not} usually follow @samp{comp.emacs.xemacs} on a regular basis;
+thus, the newsgroup is better suited to general questions about XEmacs
+than to bug reports.
 
-If at all possible, include a stack backtrace of the core dump that was
-produced.  This shows where exactly things went wrong, and makes it much
-easier to diagnose problems.  To do this, you need to locate the core
-file (it's called @file{core}, and is usually sitting in the directory
-that you started XEmacs from, or your home directory if that other
-directory was not writable).  Then, go to that directory and execute a
-command like:
+If at all possible, include a C stack backtrace of the core dump that
+was produced.  This shows where exactly things went wrong, and makes it
+much easier to diagnose problems.  To do this under Unix, you need to
+locate the core file (it's called @file{core}, and is usually sitting in
+the directory that you started XEmacs from, or your home directory if
+that other directory was not writable).  Then, go to that directory and
+execute a command like:
 
 @example
 gdb `which xemacs` core
@@ -1829,6 +1845,13 @@
 to disable core files by default.  Also see @ref{Q2.1.15}, for tips and
 techniques for dealing with a debugger.
 
+If you're under Microsoft Windows, you're out of luck unless you happen
+to have a debugging aid installed on your system, for example Visual
+C++.  In this case, the crash will result in a message giving you the
+option to enter a debugger (for example, by pressing @samp{Cancel}).  Do
+this and locate the stack-trace window.  (If your XEmacs was built
+without debugging information, the stack trace may not be very useful.)
+ When making a problem report make sure that: @enumerate @@ -1846,12 +1869,12 @@ What build options you are using. @item -If the problem is related to graphics, we will also need to know what -version of the X Window System you are running, and what window manager -you are using. - -@item -If the problem happened on a tty, please include the terminal type. +If the problem is related to graphics and you are running Unix, we will +also need to know what version of the X Window System you are running, +and what window manager you are using. + +@item +If the problem happened on a TTY, please include the terminal type. @end enumerate Much of the information above is automatically generated by @kbd{M-x @@ -2237,7 +2260,7 @@ decode them, do this: @example -call debug_print (OBJECT) +call dp (OBJECT) @end example where @var{OBJECT} is whatever you want to decode (it can be a variable, @@ -2249,14 +2272,14 @@ stack, do this: @example -call debug_backtrace () +call db () @end example @item -Using @code{debug_print} and @code{debug_backtrace} has two -disadvantages - it can only be used with a running xemacs process, and -it cannot display the internal C structure of a Lisp Object. Even if -all you've got is a core dump, all is not lost. +Using @code{dp} and @code{db} has two disadvantages - it can only be +used with a running xemacs process, and it cannot display the internal C +structure of a Lisp Object. Even if all you've got is a core dump, all +is not lost. If you're using GDB, there are some macros in the file @file{src/.gdbinit} in the XEmacs source distribution that should make @@ -2319,8 +2342,8 @@ running the XEmacs process under a debugger, the stack trace should be clean. -@email{1CMC3466@@ibm.mtsac.edu, Curtiss} suggests upgrading to ld.so version 1.8 -if dynamic linking and debugging is a problem on Linux. +@email{1CMC3466@@ibm.mtsac.edu, Curtiss} suggests upgrading to ld.so +version 1.8 if dynamic linking and debugging is a problem on Linux. 
@item
 If you're using a debugger to get a C stack backtrace and you're
@@ -2344,9 +2367,9 @@
 could simply mean that XEmacs attempted to execute code at that address,
 e.g. through jumping to a null function pointer.  Unfortunately, under
 those circumstances, GDB under Linux doesn't know how to get a stack
-trace. (Yes, this is the third Linux-related problem I've mentioned. I
+trace. (Yes, this is the fourth Linux-related problem I've mentioned. I
 have no idea why GDB under Linux is so bogus. Complain to the GDB
-authors, or to comp.os.linux.development.system). Again, you'll have to
+authors, or to comp.os.linux.development.system.) Again, you'll have to
 use the narrowing-down process described above.
 
 @item
@@ -2365,6 +2388,10 @@
 @file{src/gdbinit}.  This had the disadvantage of not being sourced
 automatically by gdb, so you had to set that up yourself.
 
+@item
+If you are running Microsoft Windows, see the file @file{nt/README} for
+further information about debugging XEmacs.
+
 @end itemize
 
 @node Q2.1.16, Q2.1.17, Q2.1.15, Installation
--- a/man/xemacs/mule.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs/mule.texi Fri Jan 03 12:12:40 2003 +0000 @@ -15,6 +15,8 @@ @cindex Korean @cindex Cyrillic @cindex Russian +@c #### It's a lie that this file tells you about Unicode.... +@cindex Unicode If you build XEmacs using the @code{--with-mule} option, it supports a wide variety of world scripts, including the Latin script, the Arabic script, Simplified Chinese (for mainland of China), Traditional Chinese @@ -33,22 +35,25 @@ * Coding Systems:: Character set conversion when you read and write files, and so on. * Recognize Coding:: How XEmacs figures out which conversion to use. +* Unification:: Integrating overlapping character sets. * Specify Coding:: Various ways to choose which conversion to use. +* Charsets and Coding Systems:: Tables and other reference material. @end menu @node Mule Intro, Language Environments, Mule, Mule -@section Introduction to world scripts +@section Introduction: The Wide Variety of Scripts and Codings in Use - The users of these scripts have established many more-or-less standard -coding systems for storing files. -@c XEmacs internally uses a single multibyte character encoding, so that it -@c can intermix characters from all these scripts in a single buffer or -@c string. This encoding represents each non-ASCII character as a sequence -@c of bytes in the range 0200 through 0377. -XEmacs translates between the internal character encoding and various -other coding systems when reading and writing files, when exchanging -data with subprocesses, and (in some cases) in the @kbd{C-q} command -(see below). + There are hundreds of scripts in use world-wide. The users of these +scripts have established many more-or-less standard coding systems for +storing text written in them in files. 
XEmacs translates between its +internal character encoding and various other coding systems when +reading and writing files, when exchanging data with subprocesses, and +(in some cases) in the @kbd{C-q} command (see below). +@footnote{Historically the internal encoding was a specially designed +encoding, called @dfn{Mule encoding}, intended for easy conversion to +and from versions of ISO 2022. However, this encoding shares many +properties with UTF-8, and conversion to UTF-8 as the internal code is +proposed.} @kindex C-h h @findex view-hello-file @@ -356,7 +361,7 @@ the usual three variants to specify the kind of end-of-line conversion. -@node Recognize Coding, Specify Coding, Coding Systems, Mule +@node Recognize Coding, Unification, Coding Systems, Mule @section Recognizing Coding Systems Most of the time, XEmacs can recognize which coding system to use for @@ -427,7 +432,739 @@ Coding}). -@node Specify Coding, , Recognize Coding, Mule +@node Unification, Specify Coding, Recognize Coding, Mule +@section Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This results in oddities such as +files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO +2022 control sequences to switch between them, as well as more +plausible but often unnecessary combinations like ISO 8859/1 with ISO +8859/2. This can be very annoying when sending messages or even in +simple editing on a single host. XEmacs works around the problem by +converting as many characters as possible to use a single Latin coded +character set before saving the buffer. + +Unification is planned for extension to other character set families, +in particular the Han family of character sets based on the Chinese +ideographic characters. At least for the Han sets, however, the +unification feature will be disabled by default. 
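+
+For example, you can control the feature explicitly from your init file
+by calling the commands described under Unification Usage below (a
+sketch; both commands take no arguments):
+
+@example
+(enable-unification)        ; install the check run before saving
+;; ... or, to turn the feature off entirely:
+;; (disable-unification)
+@end example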
+ +This functionality is based on the @file{latin-unity} package by +Stephen Turnbull @email{stephen@@xemacs.org}, but is somewhat +divergent. This documentation is also based on the package +documentation, and is likely to be inaccurate because of the different +constraints we place on ``core'' and packaged functionality. + +@menu +* Unification Overview:: History and general information. +* Unification Usage:: An overview of operation. +* Unification Configuration:: Configuring unification. +* Unification FAQs:: Questions and answers from the mailing list. +* Unification Theory:: How unification works. +* What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. +@end menu + +@node Unification Overview, Unification Usage, Unification, Unification +@subsection An Overview of Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This manifests itself when a user +enters characters using input methods associated with different coded +character sets into a single buffer. + +A very important example involves email. Many sites, especially in the +U.S., default to use of the ISO 8859/1 coded character set (also called +``Latin 1,'' though these are somewhat different concepts). However, +ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the +Euro has become the official currency of most countries in Europe, this +is unsatisfactory (and in practice, useless). So Europeans generally +use ISO 8859/15, which is nearly identical to ISO 8859/1 for most +languages, except that it substitutes EURO SIGN for CURRENCY SIGN. + +Suppose a European user yanks text from a post encoded in ISO 8859/1 +into a message composition buffer, and enters some text including the +Euro sign. Then Mule will consider the buffer to contain both ISO +8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively +programmed) send the message as a multipart mixed MIME body! 
+
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (e.g., German and Polish) can typically be mixed while using
+only one Latin coded character set (in this case, ISO 8859/2).  However,
+this often depends on exactly what text is to be encoded.
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+
+@node Unification Usage, Unification Configuration, Unification Overview, Unification
+@subsection Operation of Unification
+
+This is a description of the early hack to include unification in
+XEmacs 21.5.  This will almost surely change.
+
+Normally, unification works in the background by installing
+@code{unity-sanity-check} on @code{write-region-pre-hook}.
+Unification is on by default for the ISO-8859 Latin sets.  The user
+activates this functionality for other character set families by
+invoking @code{enable-unification}, either interactively or in her
+init file.  @xref{Init File, , , xemacs}.  Unification can be
+deactivated by invoking @code{disable-unification}.
+
+Unification also provides a few functions for remapping or recoding the
+buffer by hand.  To @dfn{remap} a character means to change the buffer
+representation of the character by using another coded character set.
+Remapping never changes the identity of the character, but may involve
+altering the code point of the character.  To @dfn{recode} a character
+means to simply change the coded character set.  Recoding never alters
+the code point of the character, but may change the identity of the
+character.  @xref{Unification Theory}.
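+
+For example, the remapping operation just described can be invoked by
+hand on the region, using the command documented under Interactive
+Usage below (a sketch):
+
+@example
+;; remap the region's Latin characters to ISO 8859/2 equivalents,
+;; preserving character identity wherever an equivalent exists
+(unity-remap-region (region-beginning) (region-end)
+                    'latin-iso8859-2)
+@end example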
+
+There are a few variables which determine which coding systems are
+always acceptable to unification: @code{unity-ucs-list},
+@code{unity-preferred-coding-system-list}, and
+@code{unity-preapproved-coding-system-list}.  The last defaults to
+@code{(buffer-default preferred)}, and you should probably avoid
+changing it because it short-circuits the sanity check.  If you find you
+need to use it, consider reporting it as a bug or request for
+enhancement.
+
+@menu
+* Basic Functionality::         User interface and customization.
+* Interactive Usage::           Treating text by hand.
+                                Also documents the hook function(s).
+@end menu
+
+
+@node Basic Functionality, Interactive Usage, , Unification Usage
+@subsubsection Basic Functionality
+
+These functions and user options initialize and configure unification.
+In normal use, they are not needed.
+
+@strong{These interfaces will change.  Also, the @samp{unity-} prefix
+is likely to be changed for many of the variables and functions, as
+they are of more general usefulness.}
+
+@defun enable-unification
+Set up hooks and initialize variables for unification.
+
+There are no arguments.
+
+This function is idempotent.  It will reinitialize any hooks or variables
+that are not in initial state.
+@end defun
+
+@defun disable-unification
+There are no arguments.
+
+Clean up hooks and void variables used by unification.
+@end defun
+
+@c #### several changes should go to latin-unity.texi
+@defopt unity-ucs-list
+List of universal coding systems recommended for character set unification.
+
+The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
+
+Order matters; coding systems earlier in the list will be preferred when
+recommending a coding system.  These coding systems will not be used
+without querying the user (unless they are also present in
+@code{unity-preapproved-coding-system-list}), and follow the
+@code{unity-preferred-coding-system-list} in the list of suggested
+coding systems.
+
+If none of the preferred coding systems are feasible, the first in
+this list will be the default.
+
+Notes on certain coding systems: @code{escape-quoted} is a special
+coding system used for autosaves and compiled Lisp in Mule.  You should
+never delete this, although it is rare that a user would want to use it
+directly.  Unification does not try to be ``smart'' about other general
+ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
+as equivalent to @code{iso-2022-7}.)  If your preferred coding system is
+one of these, you may consider adding it to @code{unity-ucs-list}.
+@end defopt
+
+Coding systems which are not Latin and not in
+@code{unity-ucs-list} are handled by short circuiting checks of
+coding system against the next two variables.
+
+@defopt unity-preapproved-coding-system-list
+List of coding systems used without querying the user if feasible.
+
+The default value is @samp{(buffer-default preferred)}.
+
+The first feasible coding system in this list is used.  The special values
+@samp{preferred} and @samp{buffer-default} may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+
+Note that, by definition, the first universal coding system in this
+list shadows all other coding systems.  In particular, if your
+preferred coding system is a universal coding system, and
+@code{preferred} is a member of this list, unification will blithely
+convert all your files to that coding system.  This is considered a
+feature, but it may surprise most users.
Users who don't like this
+behavior may put @code{preferred} in
+@code{unity-preferred-coding-system-list}, but not in
+@code{unity-preapproved-coding-system-list}.
+@end defopt
+
+
+@defopt unity-preferred-coding-system-list
+List of coding systems suggested to the user if feasible.
+
+The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
+iso-8859-4 iso-8859-9)}.
+
+If none of the coding systems in
+@samp{unity-preapproved-coding-system-list} are feasible, this list
+will be recommended to the user, followed by the
+@samp{unity-ucs-list} (so those coding systems should not be in
+this list).  The first coding system in this list is the default.  The
+special values @samp{preferred} and @samp{buffer-default} may be
+present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+@end defopt
+
+
+@defvar unity-iso-8859-1-aliases
+List of coding systems to be treated as aliases of ISO 8859/1.
+
+The default value is @code{'(iso-8859-1)}.
+
+This is not a user variable; to customize input of coding systems or
+charsets, use @samp{unity-coding-system-alias-alist} or
+@samp{unity-charset-alias-alist}.
+@end defvar
+
+
+@node Interactive Usage, , Basic Functionality, Unification Usage
+@subsubsection Interactive Usage
+
+First, the hook function @code{unity-sanity-check} is documented.
+(It is placed here because it is not an interactive function, and there
+is not yet a programmer's section of the manual.)
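+
+As described under Operation of Unification above, this hook function is
+normally installed by @code{enable-unification}; a sketch of the
+equivalent manual setup:
+
+@example
+(add-hook 'write-region-pre-hook 'unity-sanity-check)
+@end example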
+
+These functions provide access to internal functionality (such as the
+remapping function) and to extra functionality (the recoding functions
+and the test function).
+
+@defun unity-sanity-check begin end filename append visit lockname &optional coding-system
+
+Check if @var{coding-system} can represent all characters between
+@var{begin} and @var{end}.
+
+For compatibility with old broken versions of @code{write-region},
+@var{coding-system} defaults to @code{buffer-file-coding-system}.
+@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
+ignored.
+
+Return @code{nil} if @code{buffer-file-coding-system} is not
+(ISO-2022-compatible) Latin. If @code{buffer-file-coding-system} is
+safe for the charsets actually present in the buffer, return it.
+Otherwise, ask the user to choose a coding system, and return that.
+
+This function does @emph{not} do the safe thing when
+@code{buffer-file-coding-system} is @code{nil} (aka
+@code{no-conversion}). It considers that ``non-Latin,'' and passes it
+on to the Mule detection mechanism.
+
+This function is intended for use as a @code{write-region-pre-hook}. It
+does nothing except return @var{coding-system} if @code{write-region}
+handlers are inhibited.
+@end defun
+
+@defun unity-buffer-representations-feasible
+There are no arguments.
+
+Apply @code{unity-region-representations-feasible} to the current
+buffer.
+@end defun
+
+@defun unity-region-representations-feasible begin end &optional buf
+Return character sets that can represent the text from @var{begin} to
+@var{end} in @var{buf}.
+
+@c #### Fix in latin-unity.texi.
+@var{buf} defaults to the current buffer. When called interactively,
+the function is applied to the region. The function assumes
+@var{begin} <= @var{end}.
+
+The return value is a cons. The car is the list of character sets
+that can individually represent all of the non-ASCII portion of the
+buffer, and the cdr is the list of character sets that can
+individually represent all of the ASCII portion.
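+
+A hypothetical usage sketch (the destructuring shown is purely
+illustrative; which charsets appear depends entirely on the region's
+contents):
+
+@example
+;; Pick a single charset covering the non-ASCII part of the
+;; region, if one exists (nil otherwise).
+(let ((feasible (unity-region-representations-feasible
+                 (region-beginning) (region-end))))
+  (car (car feasible)))  ; first charset for the non-ASCII portion
+@end example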
+
+The following is taken from a comment in the source. Please refer to
+the source to be sure of an accurate description.
+
+The basic algorithm is to map over the region, compute the set of
+charsets that can represent each character (the ``feasible charset''),
+and take the intersection of those sets.
+
+The current implementation takes advantage of the fact that ASCII
+characters are common and cannot change asciisets. Using
+@code{skip-chars-forward} then makes motion over ASCII subregions very
+fast.
+
+This same strategy could be applied generally by precomputing classes
+of characters equivalent according to their effect on latinsets, and
+adding a whole class to the @code{skip-chars-forward} string once a
+member is found.
+
+Probably efficiency is a function of the number of characters matched,
+or maybe the length of the match string? With @code{skip-category-forward}
+over a precomputed category table it should be really fast. In practice
+for Latin character sets there are only 29 classes.
+@end defun
+
+@defun unity-remap-region begin end character-set &optional coding-system
+
+Remap characters between @var{begin} and @var{end} to equivalents in
+@var{character-set}. Optional argument @var{coding-system} may be a
+coding system name (a symbol) or @code{nil}. Characters with no
+equivalent are left as-is.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{character-set}. The function does completion, knows
+how to guess a character set name from a coding system name, and also
+provides some common aliases. See @code{unity-guess-charset}.
+There is no way to specify @var{coding-system}, as it has no useful
+function interactively.
+
+Return @var{coding-system} if @var{coding-system} can encode all
+characters in the region, @code{t} if @var{coding-system} is @code{nil}
+and the coding system with G0 = @code{ascii} and G1 =
+@var{character-set} can encode all characters, and otherwise @code{nil}.
+Note that a non-null return does @emph{not} mean it is safe to write the
+file, only the specified region.
+(This behavior is useful for multipart MIME encoding and the like.)
+
+Note: by default this function is quite fascist about universal coding
+systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
+@samp{ctext}. Customize @code{unity-approved-ucs-list} to change
+this.
+
+This function remaps characters that are artificially distinguished by Mule
+internal code. It may change the code point as well as the character set.
+To recode characters that were decoded in the wrong coding system, use
+@code{unity-recode-region}.
+@end defun
+
+@defun unity-recode-region begin end wrong-cs right-cs
+
+Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
+to @var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are character sets. Characters retain
+the same code point but the character set is changed. Only characters
+from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
+character may change. Note that this could be dangerous, if characters
+whose identities you do not want changed are included in the region.
+This function cannot guess which characters you want changed, and which
+should be left alone.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}. The function does
+completion, knows how to guess a character set name from a coding system
+name, and also provides some common aliases. See
+@code{unity-guess-charset}.
+
+Another way to accomplish this, but using coding systems rather than
+character sets to specify the desired recoding, is
+@samp{unity-recode-coding-region}. That function may be faster
+but is somewhat more dangerous, because it may recode more than one
+character set.
+
+To change from one Mule representation to another without changing identity
+of any characters, use @samp{unity-remap-region}.
+@end defun
+
+@defun unity-recode-coding-region begin end wrong-cs right-cs
+
+Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
+@var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
+the same code point but the character set is changed. The identity of
+characters may change. This is an inherently dangerous function;
+multilingual text may be recoded in unexpected ways. It is also
+dangerous because the coding systems are not sanity-checked in the
+current implementation.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}. The function does
+completion, knows how to guess a coding system name from a character set
+name, and also provides some common aliases. See
+@code{unity-guess-coding-system}.
+
+Another, safer, way to accomplish this, using character sets rather
+than coding systems to specify the desired recoding, is to use
+@code{unity-recode-region}.
+
+To change from one Mule representation to another without changing identity
+of any characters, use @code{unity-remap-region}.
+@end defun
+
+Helper functions for input of coding system and character set names.
+
+@defun unity-guess-charset candidate
+Guess a charset based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-charset-alias-alist}.
+@end defun
+
+@defun unity-guess-coding-system candidate
+Guess a coding system based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-coding-system-alias-alist}.
+@end defun
+
+@defun unity-example
+
+A cheesy example for unification.
+
+At present it just makes a multilingual buffer. To test, set
+@code{buffer-file-coding-system} to some value, make the buffer dirty
+(e.g., with @kbd{RET} @kbd{BackSpace}), and save.
+@end defun
+
+
+@node Unification Configuration, Unification FAQs, Unification Usage, Unification
+@subsection Configuring Unification for Use
+
+If you want unification to be automatically initialized, invoke
+@samp{enable-unification} with no arguments in your init file.
+@xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
+earlier than 21.1, you should also load @file{auto-autoloads} using the
+full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
+
+You may wish to define aliases for commonly used character sets and
+coding systems for convenience in input.
+
+@defopt unity-charset-alias-alist
+Alist mapping aliases to Mule charset names (symbols).
+
+The default value is
+@example
+ ((latin-1 . latin-iso8859-1)
+  (latin-2 . latin-iso8859-2)
+  (latin-3 . latin-iso8859-3)
+  (latin-4 . latin-iso8859-4)
+  (latin-5 . latin-iso8859-9)
+  (latin-9 . latin-iso8859-15)
+  (latin-10 . latin-iso8859-16))
+@end example
+
+If a charset does not exist on your system, its alias will not complete
+and you will not be able to enter it in response to prompts. A real
+charset with the same name as an alias in this list will shadow the
+alias.
+@end defopt
+
+@defopt unity-coding-system-alias-alist
+Alist mapping aliases to Mule coding system names (symbols).
+
+The default value is @samp{nil}.
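+
+For example, a user who habitually types @samp{latin-0} for ISO 8859/15
+might set (the alias name here is purely illustrative):
+
+@example
+;; Accept "latin-0" at coding system prompts as iso-8859-15.
+(setq unity-coding-system-alias-alist
+      '((latin-0 . iso-8859-15)))
+@end example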
+@end defopt
+
+
+@node Unification FAQs, Unification Theory, Unification Configuration, Unification
+@subsection Frequently Asked Questions About Unification
+
+@enumerate
+@item
+I'm smarter than XEmacs's unification feature! How can that be?
+
+Don't be surprised. Trust yourself.
+
+Unification is very young as yet. Teach it what you know by
+Customizing its variables, and report your changes to the maintainer
+(@kbd{M-x report-xemacs-bug RET}).
+
+@item
+What is a UCS?
+
+According to ISO 10646, a Universal Coded character Set. In
+XEmacs, it's a Universal (Mule) Coding System.
+@ref{Coding Systems, , , xemacs}.
+
+@item
+I know @code{utf-16-le-bom} is a UCS, but unification won't use it.
+Why not?
+
+There are an awful lot of UCSes in Mule, most of which you probably
+never want to use, and definitely do not want to be asked about. So the
+default set includes a few that the author thought plausible, but
+they're surely not comprehensive or optimal.
+
+Customize @code{unity-ucs-list} to include the ones you use often, and
+report your favorites to the maintainer for consideration for
+inclusion in the defaults using @kbd{M-x report-xemacs-bug RET}.
+(Note that you @emph{must} include @code{escape-quoted} in this list,
+because Mule uses it internally as the coding system for auto-save
+files.)
+
+Alternatively, if you just want to use it this one time, simply type
+it in at the prompt. Unification will confirm that it is a real coding
+system, and then assume that you know what you're doing.
+
+@item
+This is crazy: I can't quit XEmacs without being queried about
+autosaves! Why?
+
+You probably removed @code{escape-quoted} from
+@code{unity-ucs-list}. Put it back.
+
+@item
+Unification is really buggy and I can't get any work done.
+
+First, use @kbd{M-x disable-unification RET}, then report your
+problems as a bug (@kbd{M-x report-xemacs-bug RET}).
+@end enumerate
+
+
+@node Unification Theory, What Unification Cannot Do for You, Unification FAQs, Unification
+@subsection Unification Theory
+
+Standard encodings suffer from the design defect that they do not
+provide a reliable way to recognize which coded character sets are in
+use. @xref{What Unification Cannot Do for You}. There are scores of
+character sets which can be represented by a single octet (8-bit
+byte), whose union contains many hundreds of characters. Obviously
+this results in great confusion, since you can't tell the players
+without a scorecard, and there is no scorecard.
+
+There are two ways to solve this problem. The first is to create a
+universal coded character set. This is the concept behind Unicode.
+However, there have been satisfactory (nearly) universal character
+sets for several decades, but even today many Westerners resist using
+Unicode because they consider its space requirements excessive. On
+the other hand, many Asians dislike Unicode because they consider it
+to be incomplete. (This is partly, but not entirely, political.)
+
+In any case, Unicode only solves the internal representation problem.
+Many data sets will contain files in ``legacy'' encodings, and Unicode
+does not help distinguish among them.
+
+The second approach is to embed information about the encodings used in
+a document in its text. This approach is taken by the ISO 2022
+standard. This would solve the problem completely from the users'
+point of view, except that ISO 2022 is basically not implemented at all,
+in the sense that few applications or systems implement more than a small
+subset of ISO 2022 functionality. This is due to the fact that
+mono-literate users object to the presence of escape sequences in their
+texts (which they, with some justification, consider data corruption).
+Programmers are more than willing to cater to these users, since
+implementing ISO 2022 is a painstaking task.
+
+In fact, Emacs/Mule adopts both of these approaches.
Internally it uses
+a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
+techniques both to save files in forms robust to encoding issues, and as
+hints when attempting to ``guess'' an unknown encoding. However, Mule
+suffers from a design defect: the character set information that ISO
+2022 attaches to whole runs of characters is instead carried by each
+individual character. That causes Mule to consider the ISO Latin
+character sets to be disjoint. This manifests itself when a user enters
+characters using input methods associated with different coded character
+sets into a single buffer.
+
+There are two problems stemming from this design. First, Mule
+represents the same character in different ways. Abstractly, 'ó'
+(LATIN SMALL LETTER O WITH ACUTE) can get represented as
+[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
+'óó' in the display might actually be represented [latin-iso8859-1
+#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
+#xF3 ESC - A] in the file. In some cases this treatment would be
+appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
+(the CJK ideographic character meaning ``one'')), and although arguably
+incorrect it is convenient when mixing the CJK scripts. But in the case
+of the Latin scripts this is wrong.
+
+Worse yet, it is very likely to occur when mixing ``different'' encodings
+(such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
+points that are almost never used. A very important example involves
+email. Many sites, especially in the U.S., default to use of the ISO
+8859/1 coded character set (also called ``Latin 1,'' though these are
+somewhat different concepts). However, ISO 8859/1 provides a generic
+CURRENCY SIGN character. Now that the Euro has become the official
+currency of most countries in Europe, this is unsatisfactory (and in
+practice, useless).
So Europeans generally use ISO 8859/15, which is
+nearly identical to ISO 8859/1 for most languages, except that it
+substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign. Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid. What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (e.g., German and Polish) can typically be mixed while using
+only one Latin coded character set (in the case of German and Polish,
+ISO 8859/2). However, this often depends on exactly what text is to be
+encoded (even for the same pair of languages).
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+Because the problem is rarely noticeable in editing a buffer, but tends
+to manifest when that buffer is exported to a file or process,
+unification uses the strategy of examining the buffer prior to export.
+If use of multiple Latin coded character sets is detected, unification
+attempts to unify them by finding a single coded character set which
+contains all of the Latin characters in the buffer.
+
+The primary purpose of unification is to fix the problem by giving the
+user the choice to change the representation of all characters to one
+character set and give sensible recommendations based on context. In
+the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
+both will be suggested. In the EURO SIGN example, only ISO 8859/15
+makes sense, and that is what will be recommended.
In both cases, the
+user will be reminded that there are universal encodings available.
+
+I call this @dfn{remapping} (from the universal character set to a
+particular ISO 8859 coded character set). It is mere accident that this
+letter has the same code point in both character sets. (Not entirely an
+accident, of course; there are also many examples of Latin characters
+that have different code points in different Latin-X sets.)
+
+Note that, in the 'ó' example, treating the buffer in this way will
+result in a representation such as [latin-iso8859-2
+#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
+This is guaranteed to occasionally result in the second problem you
+observed, to which we now turn.
+
+This problem is that, although the file is intended to be an
+ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every
+POSIX-compliant program---this is required by the standard, obvious if
+you think a bit, @pxref{What Unification Cannot Do for You}) will read
+that file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course
+this is no problem if all of the characters in the file are contained in
+ISO 8859/1, but suppose there are some which are not, but are contained
+in the (intended) ISO 8859/2.
+
+You now want to fix this, but not by finding the same character in
+another set. Instead, you want to simply change the character set
+that Mule associates with that buffer position without changing the
+code. (This is conceptually somewhat distinct from the first problem,
+and logically ought to be handled in the code that defines coding
+systems. However, unification is not an unreasonable place for it.)
+Unification provides two functions (one fast and dangerous, the other
+@c #### fix latin-unity.texi
+slower and careful) to handle this.
I call this @dfn{recoding}, because +the transformation actually involves @emph{encoding} the buffer to +file representation, then @emph{decoding} it to buffer representation +(in a different character set). This cannot be done automatically +because Mule can have no idea what the correct encoding is---after +all, it already gave you its best guess. @xref{What Unification +Cannot Do for You}. So these functions must be invoked by the user. +@xref{Interactive Usage}. + + +@node What Unification Cannot Do for You, , Unification Theory, Unification +@subsection What Unification Cannot Do for You + +Unification @strong{cannot} save you if you insist on exporting data in +8-bit encodings in a multilingual environment. @emph{You will +eventually corrupt data if you do this.} It is not Mule's, or any +application's, fault. You will have only yourself to blame; consider +yourself warned. (It is true that Mule has bugs, which make Mule +somewhat more dangerous and inconvenient than some naive applications. +We're working to address those, but no application can remedy the +inherent defect of 8-bit encodings.) + +Use standard universal encodings, preferably Unicode (UTF-8) unless +applicable standards indicate otherwise. The most important such case +is Internet messages, where MIME should be used, whether or not the +subordinate encoding is a universal encoding. (Note that since one of +the important provisions of MIME is the @samp{Content-Type} header, +which has the charset parameter, MIME is to be considered a universal +encoding for the purposes of this manual. Of course, technically +speaking it's neither a coded character set nor a coding extension +technique compliant with ISO 2022.) + +As mentioned earlier, the problem is that standard encodings suffer from +the design defect that they do not provide a reliable way to recognize +which coded character sets are in use. 
There are scores of character
+sets which can be represented by a single octet (8-bit byte), whose
+union contains many hundreds of characters. Thus any 8-bit coded
+character set must contain characters that share code points used for
+different characters in other coded character sets.
+
+This means that a given file's intended encoding cannot be identified
+with 100% reliability unless it contains encoding markers such as those
+provided by MIME or ISO 2022.
+
+Unification actually makes it more likely that you will have problems of
+this kind. Traditionally Mule has been ``helpful'' by simply using an
+ISO 2022 universal coding system when the current buffer coding system
+cannot handle all the characters in the buffer. This has the effect
+that, because the file contains control sequences, it is not recognized
+as being in the locale's normal 8-bit encoding. It may be annoying if
+@c #### fix in latin-unity.texi
+you are not a Mule expert, but your data is guaranteed to be recoverable
+with a tool you already have: Mule.
+
+However, with unification, Mule converts to a single 8-bit character set
+when possible. But typically this will @emph{not} be in your usual
+locale. That is, the times that an ISO 8859/1 user will need
+unification are when there are ISO 8859/2 characters in the buffer. But
+then most likely the file will be saved in a pure 8-bit encoding that is
+not ISO 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is
+probably the most sophisticated yet available) cannot tell the
+difference between ISO 8859/1 and ISO 8859/2, and in a Western European
+locale will choose the former even though the latter was intended. Even
+the extension
+@c #### fix in latin-unity.texi
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be
+acceptably accurate in the case of mixed codes.
+
+So now consider adding some additional ISO 8859/1 text to the buffer.
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled. You need a human being who can recognize
+that @emph{this is German and Swedish} and stays in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format. If you really need the space,
+compress your files.
+
+
+@node Specify Coding, Charsets and Coding Systems, Unification, Mule
 @section Specifying a Coding System
 
 In cases where XEmacs does not automatically choose the right coding
@@ -549,3 +1286,192 @@
 those non-Latin-1 characters which the specified coding
 system can encode. By default, this variable is @code{nil}, which
 implies that you cannot use non-Latin-1 characters in file names.
+
+
+@node Charsets and Coding Systems, , Specify Coding, Mule
+@section Charsets and Coding Systems
+
+This section provides reference lists of Mule charsets and coding
+systems. Mule charsets are typically named by character set and
+standard.
+
+@table @strong
+@item ASCII variants
+
+Identification of equivalent characters in these sets is not properly
+implemented. Unification does not distinguish the two charsets.
+
+@samp{ascii} @samp{latin-jisx0201}
+
+@item Extended Latin
+
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+unification.
+
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+
+The following charsets are Latin variants which are not understood by
+unification. In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters. None of these
+are identified with their ISO 8859 equivalents.
+ +@samp{vietnamese-viscii-lower} +@samp{vietnamese-viscii-upper} + +@item Other character sets + +@samp{arabic-1-column} +@samp{arabic-2-column} +@samp{arabic-digit} +@samp{arabic-iso8859-6} +@samp{chinese-big5-1} +@samp{chinese-big5-2} +@samp{chinese-cns11643-1} +@samp{chinese-cns11643-2} +@samp{chinese-cns11643-3} +@samp{chinese-cns11643-4} +@samp{chinese-cns11643-5} +@samp{chinese-cns11643-6} +@samp{chinese-cns11643-7} +@samp{chinese-gb2312} +@samp{chinese-isoir165} +@samp{cyrillic-iso8859-5} +@samp{ethiopic} +@samp{greek-iso8859-7} +@samp{hebrew-iso8859-8} +@samp{ipa} +@samp{japanese-jisx0208} +@samp{japanese-jisx0208-1978} +@samp{japanese-jisx0212} +@samp{katakana-jisx0201} +@samp{korean-ksc5601} +@samp{sisheng} +@samp{thai-tis620} +@samp{thai-xtis} + +@item Non-graphic charsets + +@samp{control-1} +@end table + +@table @strong +@item No conversion + +Some of these coding systems may specify EOL conventions. Note that +@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 +coding system. Although unification attempts to compensate for this, it +is possible that the @samp{iso-8859-1} coding system will behave +differently from other ISO 8859 coding systems. + +@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} + +@item Latin coding systems + +These coding systems are all single-byte, 8-bit ISO 2022 coding systems, +combining ASCII in the GL register (bytes with high-bit clear) and an +extended Latin character set in the GR register (bytes with high-bit set). + +@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} +@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} + +These coding systems are single-byte, 8-bit coding systems that do not +conform to international standards. They should be avoided in all +potentially multilingual contexts, including any text distributed over +the Internet and World Wide Web. 
+ +@samp{windows-1251} + +@item Multilingual coding systems + +The following ISO-2022-based coding systems are useful for multilingual +text. + +@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} +@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} + +XEmacs also supports Unicode with the Mule-UCS package. These are the +preferred coding systems for multilingual use. (There is a possible +exception for texts that mix several Asian ideographic character sets.) + +@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} +@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} +@samp{utf-8} @samp{utf-8-ws} + +Development versions of XEmacs (the 21.5 series) support Unicode +internally, with (at least) the following coding systems implemented: + +@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} +@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} + +@item Asian ideographic languages + +The following coding systems are based on ISO 2022, and are more or less +suitable for encoding multilingual texts. They all can represent ASCII +at least, and sometimes several other foreign character sets, without +resort to arbitrary ISO 2022 designations. However, these subsets are +not identified with the corresponding national standards in XEmacs Mule. + +@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} +@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} +@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} +@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} +@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} + +The following coding systems cannot be used for general multilingual +text and do not cooperate well with other coding systems. + +@samp{big5} @samp{shift_jis} + +@item Other languages + +The following coding systems are based on ISO 2022. 
Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+
+The following coding systems do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+
+@item Special coding systems
+
+Mule uses the following coding systems for special purposes.
+
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+
+@samp{file-name} @samp{keyboard} @samp{terminal}
+
+@end table
+
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}. These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
+
+
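+
+The relative priority of these coding categories for autodetection can
+be adjusted from Lisp. A minimal sketch, assuming XEmacs's
+@code{set-coding-priority-list} interface (check your version's
+documentation for the exact name and the valid category symbols):
+
+@example
+;; Try UTF-8 detection before the 8-bit ISO categories.
+;; set-coding-priority-list is assumed here.
+(set-coding-priority-list '(utf-8 iso-7 iso-8-2 iso-8-1 no-conversion))
+@end example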
--- a/man/xemacs/startup.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs/startup.texi Fri Jan 03 12:12:40 2003 +0000 @@ -92,10 +92,10 @@ late hierarchy. At run time, the package path may also be specified via the @code{EMACSPACKAGEPATH} environment variable. -An XEmacs package is laid out just like a normal installed XEmacs lisp -directory. It may have @file{lisp}, @file{etc}, @file{info}, and -@file{lib-src} subdirectories. XEmacs adds these at appropriate places -within the various system-wide paths. +An XEmacs package hierarchy is laid out just like a normal installed +XEmacs lisp directory. It may have @file{lisp}, @file{etc}, +@file{info}, and @file{lib-src} subdirectories. XEmacs adds these at +appropriate places within the various system-wide paths. There may be any number of package hierarchy directories.