Mercurial > hg > xemacs-beta
changeset 1183:c1553814932e
[xemacs-hg @ 2003-01-03 12:12:30 by stephent]
various docs
<873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp>
<87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author:   stephent
date:     Fri, 03 Jan 2003 12:12:40 +0000
parents:  7d696106ffe9
children: b3e062e7368f
files:    man/ChangeLog man/internals/internals.texi man/lispref/mule.texi man/widget.texi man/xemacs-faq.texi man/xemacs/mule.texi man/xemacs/startup.texi
diffstat: 7 files changed, 2009 insertions(+), 50 deletions(-)
--- a/man/ChangeLog	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/ChangeLog	Fri Jan 03 12:12:40 2003 +0000
@@ -1,3 +1,60 @@
+2003-01-03  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* xemacs/startup.texi (Startup Paths): Hierarchy, not package, layout.
+
+2003-01-03  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* xemacs-faq.texi: Debugging FAQ improvements from Ben Wing.
+	(Q2.0.6): Mention union type bugs.
+	(Q2.1.1): Debugging HOWTO improvements.
+	(Q2.1.15): Decoding Lisp objects in the debugger.
+
+	* widget.texi (Widget Internals): New node.
+	(Top): Add menu item for it.
+
+	* xemacs/xemacs.texi (Top): Better short description of Mule in
+	menu.  Mule submenu.
+
+	Charset unification docs.  What a concept---commit docs first!
+
+	* lispref/mule.texi (MULE): Add Unification and Tables menu entries.
+	(Unicode Support): Fixup next node.
+	(Charset Unification):
+	(Overview):
+	(Usage):
+	(Basic Functionality):
+	(Interactive Usage):
+	(Configuration):
+	(Theory of Operation):
+	(What Unification Cannot Do for You):
+	(Unification Internals):
+	(Charsets and Coding Systems):
+	New nodes.
+
+	* xemacs/mule.texi (Mule): Menu items for Unification and Tables.
+	(Recognize Coding):
+	(Specify Coding):
+	Fixup next and previous pointers.
+	(Unification):
+	(Unification Overview):
+	(Unification Usage):
+	(Unification Configuration):
+	(Unification FAQs):
+	(Unification Theory):
+	(What Unification Cannot Do for You):
+	(Charsets and Coding Systems):
+	New nodes.
+
+2002-12-17  Stephen Turnbull  <stephen@xemacs.org>
+
+	* widget.texi (Widget Wishlist): Typo.
+	(Defining New Widgets): s/widget-define/define-widget/g.
+
+2002-12-27  Stephen J. Turnbull  <stephen@xemacs.org>
+
+	* internals/internals.texi (Regression Testing XEmacs): Hints for
+	test design.
+
 2002-10-29  Ville Skyttä  <scop@xemacs.org>
 
 	* xemacs-faq.texi (Top):
--- a/man/internals/internals.texi	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/internals/internals.texi	Fri Jan 03 12:12:40 2003 +0000
@@ -3636,6 +3636,45 @@
 GTK widgets, but not Athena, Motif, MS Windows, or Carbon), simply
 silently suppress the test if the feature is not available.
 
+Here are a few general hints for writing tests.
+
+@enumerate
+@item
+Include related successful cases.  Fixes often break something.
+
+@item
+Use the @code{Known-Bug-Expect-Failure} macro to mark the cases you know
+are going to fail.  We want to be able to distinguish between
+regressions and other unexpected failures, and cases that have
+been (partially) analyzed but not yet repaired.
+
+@item
+Mark the bug with the date of report.  An ``Unfixed since yyyy-mm-dd''
+gloss for @code{Known-Bug-Expect-Failure} is planned to further increase
+developer embarrassment (== incentive to fix the bug), but until then at
+least put a comment about the date so we can easily see when it was
+first reported.
+
+@item
+It's a matter of your judgement, but you should often use generic tests
+(@emph{e.g.}, @code{eq}) instead of more specific tests (@code{=} for
+numbers), even though you know that arguments ``should'' be of the
+correct type, whenever the functions used can return generic objects
+(typically @code{nil}) as well as the more specific type returned on
+success.  We don't want failures of those assertions reported as
+``other failures'' (a wrong-type-arg signal, rather than a null
+return); we want them reported as ``assertion failures.''
+
+One example is a test that tests @code{(= (string-match this that) 0)},
+expecting a successful match.  Now suppose @code{string-match} is broken
+such that the match fails.  Then it will return @code{nil}, and @code{=}
+will signal ``wrong-type-argument, number-char-or-marker-p, nil'',
+generating an ``other failure'' in the report.  But this should be
+reported as an assertion failure (the test failed in a foreseeable way),
+rather than something else (we don't know what happened, because XEmacs
+is broken in a way that we weren't trying to test!).
+@end enumerate
+
 @node CVS Techniques, A Summary of the Various XEmacs Modules, Regression Testing XEmacs, Top
 @chapter CVS Techniques
--- a/man/lispref/mule.texi	Thu Jan 02 22:52:44 2003 +0000
+++ b/man/lispref/mule.texi	Fri Jan 03 12:12:40 2003 +0000
@@ -24,6 +24,8 @@
 * CCL::                 A special language for writing fast converters.
 * Category Tables::     Subdividing charsets into groups.
 * Unicode Support::     The universal coded character set.
+* Charset Unification:: Handling overlapping character sets.
+* Charsets and Coding Systems:: Tables and reference information.
 @end menu
 
 @node Internationalization Terminology, Charsets, , MULE
@@ -2072,7 +2074,7 @@
 
 @c Added 2002-03-13 sjt
-@node Unicode Support, , Category Tables, MULE
+@node Unicode Support, Charset Unification, Category Tables, MULE
 @section Unicode Support
 @cindex unicode
 @cindex utf-8
@@ -2181,3 +2183,880 @@
 @end table
 @end defun
+
+@node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE
+@section Character Set Unification
+
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This results in oddities such as
+files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
+2022 control sequences to switch between them, as well as more plausible
+but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
+This can be very annoying when sending messages or even in simple
+editing on a single host.  Unification works around the problem by
+converting as many characters as possible to use a single Latin coded
+character set before saving the buffer.
+
+This node and its children were ripp'd untimely from
+@file{latin-unity.texi}, and have been quickly converted for use here.
+However, as APIs are likely to diverge, beware of inaccuracies.  Please
+report any you discover with @kbd{M-x report-xemacs-bug RET}, as well
+as any ambiguities or downright unintelligible passages.
+
+A lot of the stuff here doesn't belong here; it belongs in the
+@ref{Top, , , xemacs, XEmacs User's Manual}.  Report those as bugs,
+too, preferably with patches.
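+For concreteness, the mixed-charset situation described above can be
+observed directly with the standard Mule primitives @code{make-char}
+and @code{charsets-in-region}.  The following sketch (illustrative
+only; it assumes a Mule-enabled XEmacs) builds a two-character buffer
+that Mule considers to use two distinct Latin charsets:
+
+@example
+;; o-acute as decoded from Latin-1 text (code point #xF3, stored as
+;; the 7-bit position #x73), then EURO SIGN, which exists only in
+;; Latin-9 (ISO 8859/15).
+(with-temp-buffer
+  (insert (make-char 'latin-iso8859-1 #x73))
+  (insert (make-char 'latin-iso8859-15 #x24))
+  ;; Returns a list naming both latin-iso8859-1 and latin-iso8859-15,
+  ;; so no single Latin coding system can save this buffer as-is.
+  (charsets-in-region (point-min) (point-max)))
+@end example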
+
+@menu
+* Overview::                    Unification history and general information.
+* Usage::                       An overview of the operation of Unification.
+* Configuration::               Configuring Unification for use.
+* Theory of Operation::         How Unification works.
+* What Unification Cannot Do for You::  Inherent problems of 8-bit charsets.
+* Charsets and Coding Systems:: Reference lists with annotations.
+* Unification Internals::       Utilities and implementation details.
+@end menu
+
+@node Overview, Usage, Charset Unification, Charset Unification
+@subsection An Overview of Unification
+
+Mule suffers from a design defect that causes it to consider the ISO
+Latin character sets to be disjoint.  This manifests itself when a user
+enters characters using input methods associated with different coded
+character sets into a single buffer.
+
+A very important example involves email.  Many sites, especially in the
+U.S., default to use of the ISO 8859/1 coded character set (also called
+``Latin 1,'' though these are somewhat different concepts).  However,
+ISO 8859/1 provides a generic CURRENCY SIGN character.  Now that the
+Euro has become the official currency of most countries in Europe, this
+is unsatisfactory (and in practice, useless).  So Europeans generally
+use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
+languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid.
What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in this case, ISO 8859/2).  However,
+this often depends on exactly what text is to be encoded.
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+@node Usage, Configuration, Overview, Charset Unification
+@subsection Operation of Unification
+
+Normally, Unification works in the background by installing
+@code{unity-sanity-check} on @code{write-region-pre-hook}.  This is
+done by default for the ISO 8859 Latin family of character sets.  The
+user activates this functionality for other character set families by
+invoking @code{enable-unification}, either interactively or in her
+init file.  @xref{Init File, , , xemacs}.  Unification can be
+deactivated by invoking @code{disable-unification}.
+
+Unification also provides a few functions for remapping or recoding the
+buffer by hand.  To @dfn{remap} a character means to change the buffer
+representation of the character by using another coded character set.
+Remapping never changes the identity of the character, but may involve
+altering the code point of the character.  To @dfn{recode} a character
+means to simply change the coded character set.  Recoding never alters
+the code point of the character, but may change the identity of the
+character.  @xref{Theory of Operation}.
+
+There are a few variables which determine which coding systems are
+always acceptable to Unification: @code{unity-ucs-list},
+@code{unity-preferred-coding-system-list}, and
+@code{unity-preapproved-coding-system-list}.  The latter two default
+to @code{()}, and should probably be avoided because they short-circuit
+the sanity check.
If you find you need to use them, consider reporting
+it as a bug or request for enhancement.  Because they seem unsafe, the
+recommended interface is likely to change.
+
+@menu
+* Basic Functionality::         User interface and customization.
+* Interactive Usage::           Treating text by hand.
+                                Also documents the hook function(s).
+@end menu
+
+@node Basic Functionality, Interactive Usage, , Usage
+@subsubsection Basic Functionality
+
+These functions and user options initialize and configure Unification.
+In normal use, none of these should be needed.
+
+@strong{These APIs are certain to change.}
+
+@defun enable-unification
+Set up hooks and initialize variables for latin-unity.
+
+There are no arguments.
+
+This function is idempotent.  It will reinitialize any hooks or
+variables that are not in the initial state.
+@end defun
+
+@defun disable-unification
+There are no arguments.
+
+Clean up hooks and void variables used by latin-unity.
+@end defun
+
+@defopt unity-ucs-list
+List of coding systems considered to be universal.
+
+The default value is @code{(utf-8 iso-2022-7 ctext escape-quoted)}.
+
+Order matters; coding systems earlier in the list will be preferred when
+recommending a coding system.  These coding systems will not be used
+without querying the user (unless they are also present in
+@code{unity-preapproved-coding-system-list}), and follow the
+@code{unity-preferred-coding-system-list} in the list of suggested
+coding systems.
+
+If none of the preferred coding systems are feasible, the first in
+this list will be the default.
+
+Notes on certain coding systems: @code{escape-quoted} is a special
+coding system used for autosaves and compiled Lisp in Mule.  You should
+@c #### fix in latin-unity.texi
+never delete this, although it is rare that a user would want to use it
+directly.  Unification does not try to be ``smart'' about other general
+ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
+as equivalent to @code{iso-2022-7}.)
If your preferred coding system is
+one of these, you may consider adding it to @code{unity-ucs-list}.
+However, this will typically have the side effect that (eg) ISO 8859/1
+files will be saved in 7-bit form with ISO 2022 escape sequences.
+@end defopt
+
+Coding systems which are not Latin and not in
+@code{unity-ucs-list} are handled by short-circuiting checks of the
+coding system against the next two variables.
+
+@defopt unity-preapproved-coding-system-list
+List of coding systems used without querying the user if feasible.
+
+The default value is @samp{(buffer-default preferred)}.
+
+The first feasible coding system in this list is used.  The special
+values @samp{preferred} and @samp{buffer-default} may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if
+feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented
+by the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+
+Note that the first universal coding system in this list shadows all
+other coding systems.  In particular, if your preferred coding system is
+a universal coding system, and @code{preferred} is a member of this
+list, unification will blithely convert all your files to that coding
+system.  This is considered a feature, but it may surprise most users.
+Users who don't like this behavior should put @code{preferred} in
+@code{unity-preferred-coding-system-list}.
+@end defopt
+
+@defopt unity-preferred-coding-system-list
+@c #### fix in latin-unity.texi
+List of coding systems suggested to the user if feasible.
+
+The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
+iso-8859-4 iso-8859-9)}.
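+For instance, a user who wants Latin-9 recommended ahead of Latin-1
+whenever both are feasible could reorder the list (a hypothetical
+configuration sketch; the option is the one documented here):
+
+@example
+;; Suggest Latin-9 first when unification must pick a Latin charset.
+(setq unity-preferred-coding-system-list
+      '(iso-8859-15 iso-8859-1 iso-8859-2 iso-8859-3
+        iso-8859-4 iso-8859-9))
+@end example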
+
+If none of the coding systems in
+@c #### fix in latin-unity.texi
+@code{unity-preapproved-coding-system-list} are feasible, this list
+will be recommended to the user, followed by the
+@code{unity-ucs-list}.  The first coding system in this list is the
+default.  The special values @samp{preferred} and @samp{buffer-default}
+may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if
+feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented
+by the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+@end defopt
+
+@defvar unity-iso-8859-1-aliases
+List of coding systems to be treated as aliases of ISO 8859/1.
+
+The default value is @code{(iso-8859-1)}.
+
+This is not a user variable; to customize input of coding systems or
+charsets, use @samp{unity-coding-system-alias-alist} or
+@samp{unity-charset-alias-alist}.
+@end defvar
+
+@node Interactive Usage, , Basic Functionality, Usage
+@subsubsection Interactive Usage
+
+First, the hook function @code{unity-sanity-check} is documented.
+(It is placed here because it is not an interactive function, and there
+is not yet a programmer's section of the manual.)
+
+These functions provide access to internal functionality (such as the
+remapping function) and to extra functionality (the recoding functions
+and the test function).
+
+@defun unity-sanity-check begin end filename append visit lockname &optional coding-system
+
+Check if @var{coding-system} can represent all characters between
+@var{begin} and @var{end}.
+
+For compatibility with old broken versions of @code{write-region},
+@var{coding-system} defaults to @code{buffer-file-coding-system}.
+@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
+ignored.
+
+Return @code{nil} if @code{buffer-file-coding-system} is not
+(ISO-2022-compatible) Latin.  If @code{buffer-file-coding-system} is
+safe for the charsets actually present in the buffer, return it.
+Otherwise, ask the user to choose a coding system, and return that.
+
+This function does @emph{not} do the safe thing when
+@code{buffer-file-coding-system} is @code{nil} (aka
+@code{no-conversion}).  It considers that ``non-Latin,'' and passes it
+on to the Mule detection mechanism.
+
+This function is intended for use as a @code{write-region-pre-hook}.  It
+does nothing except return @var{coding-system} if @code{write-region}
+handlers are inhibited.
+@end defun
+
+@defun unity-buffer-representations-feasible
+
+There are no arguments.
+
+Apply @code{unity-region-representations-feasible} to the current
+buffer.
+@end defun
+
+@defun unity-region-representations-feasible begin end &optional buf
+
+Return character sets that can represent the text from @var{begin} to
+@var{end} in @var{buf}.
+
+@var{buf} defaults to the current buffer.  Called interactively, this
+will be applied to the region.  The function assumes @var{begin} <=
+@var{end}.
+
+The return value is a cons.  The car is the list of character sets
+that can individually represent all of the non-ASCII portion of the
+buffer, and the cdr is the list of character sets that can
+individually represent all of the ASCII portion.
+
+The following is taken from a comment in the source.  Please refer to
+the source to be sure of an accurate description.
+
+The basic algorithm is to map over the region, compute the set of
+charsets that can represent each character (the ``feasible charset''),
+and take the intersection of those sets.
+
+The current implementation takes advantage of the fact that ASCII
+characters are common and cannot change asciisets.  Using
+@code{skip-chars-forward} then makes motion over ASCII subregions very
+fast.
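+The algorithm just described might be sketched as follows.  This is an
+illustrative reimplementation, not the actual code:
+@code{char-feasible-charsets} is a hypothetical helper returning the
+list of Latin charsets able to represent a given character, and
+@code{intersection} comes from the @file{cl} package.
+
+@example
+(defun sketch-feasible-charsets (begin end)
+  "Intersect the feasible charsets of each non-ASCII char in BEGIN..END."
+  (let ((feasible nil) (first t))
+    (save-excursion
+      (goto-char begin)
+      (while (< (point) end)
+        ;; ASCII runs cannot change the result; skip them quickly.
+        (skip-chars-forward "\000-\177" end)
+        (when (< (point) end)
+          (let ((sets (char-feasible-charsets (char-after))))
+            (setq feasible (if first sets (intersection sets feasible))
+                  first nil))
+          (forward-char 1))))
+    feasible))
+@end example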
+
+This same strategy could be applied generally by precomputing classes
+of characters equivalent according to their effect on latinsets, and
+adding a whole class to the @code{skip-chars-forward} string once a
+member is found.
+
+Probably efficiency is a function of the number of characters matched,
+or maybe the length of the match string?  With
+@code{skip-category-forward} over a precomputed category table it
+should be really fast.  In practice for Latin character sets there are
+only 29 classes.
+@end defun
+
+@defun unity-remap-region begin end character-set &optional coding-system
+
+Remap characters between @var{begin} and @var{end} to equivalents in
+@var{character-set}.  Optional argument @var{coding-system} may be a
+coding system name (a symbol) or @code{nil}.  Characters with no
+equivalent are left as-is.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{character-set}.  The function does completion, knows
+how to guess a character set name from a coding system name, and also
+provides some common aliases.  See @code{unity-guess-charset}.
+There is no way to specify @var{coding-system}, as it has no useful
+function interactively.
+
+Return @var{coding-system} if @var{coding-system} can encode all
+characters in the region, @code{t} if @var{coding-system} is @code{nil}
+and the coding system with G0 = @code{ascii} and G1 =
+@var{character-set} can encode all characters, and otherwise @code{nil}.
+Note that a non-null return does @emph{not} mean it is safe to write the
+file, only the specified region.  (This behavior is useful for multipart
+MIME encoding and the like.)
+
+Note: by default this function is quite fascist about universal coding
+systems.  It only admits @samp{utf-8}, @samp{iso-2022-7}, and
+@samp{ctext}.  Customize @code{unity-ucs-list} to change this.
+
+This function remaps characters that are artificially distinguished by
+Mule internal code.
It may change the code point as well as the
+character set.  To recode characters that were decoded in the wrong
+coding system, use @code{unity-recode-region}.
+@end defun
+
+@defun unity-recode-region begin end wrong-cs right-cs
+
+Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
+to @var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are character sets.  Characters retain
+the same code point but the character set is changed.  Only characters
+from @var{wrong-cs} are changed to @var{right-cs}.  The identity of the
+character may change.  Note that this could be dangerous, if characters
+whose identities you do not want changed are included in the region.
+This function cannot guess which characters you want changed, and which
+should be left alone.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a character set name from a coding system
+name, and also provides some common aliases.  See
+@code{unity-guess-charset}.
+
+Another way to accomplish this, but using coding systems rather than
+character sets to specify the desired recoding, is
+@code{unity-recode-coding-region}.  That function may be faster
+but is somewhat more dangerous, because it may recode more than one
+character set.
+
+To change from one Mule representation to another without changing the
+identity of any characters, use @code{unity-remap-region}.
+@end defun
+
+@defun unity-recode-coding-region begin end wrong-cs right-cs
+
+Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
+@var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are coding systems.  Characters retain
+the same code point but the character set is changed.  The identity of
+characters may change.  This is an inherently dangerous function;
+multilingual text may be recoded in unexpected ways.
#### It's also
+dangerous because the coding systems are not sanity-checked in the
+current implementation.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}.  The function does
+completion, knows how to guess a coding system name from a character set
+name, and also provides some common aliases.  See
+@code{unity-guess-coding-system}.
+
+Another, safer, way to accomplish this, using character sets rather
+than coding systems to specify the desired recoding, is to use
+@c #### fixme in latin-unity.texi
+@code{unity-recode-region}.
+
+To change from one Mule representation to another without changing the
+identity of any characters, use @code{unity-remap-region}.
+@end defun
+
+Helper functions for input of coding system and character set names.
+
+@defun unity-guess-charset candidate
+Guess a charset based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-charset-alias-alist}.
+@end defun
+
+@defun unity-guess-coding-system candidate
+Guess a coding system based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-coding-system-alias-alist}.
+@end defun
+
+@defun unity-example
+
+A cheesy example for Unification.
+
+At present it just makes a multilingual buffer.  To test, @code{setq}
+@code{buffer-file-coding-system} to some value, make the buffer dirty
+(eg with @kbd{RET} @kbd{BackSpace}), and save.
+@end defun
+
+@node Configuration, Theory of Operation, Usage, Charset Unification
+@subsection Configuring Unification for Use
+
+If you want Unification to be automatically initialized, invoke
+@code{enable-unification} with no arguments in your init file.
+@xref{Init File, , , xemacs}.
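+Concretely, the init-file fragment is just the following (this mirrors
+what the package itself does: @code{enable-unification} installs
+@code{unity-sanity-check} on @code{write-region-pre-hook}):
+
+@example
+;; In your init file:
+(enable-unification)
+;; ... and to turn it off again later:
+;; (disable-unification)
+@end example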
If you are using GNU Emacs or an XEmacs
+earlier than 21.1, you should also load @file{auto-autoloads} using the
+full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
+
+You may wish to define aliases for commonly used character sets and
+coding systems for convenience in input.
+
+@defopt unity-charset-alias-alist
+Alist mapping aliases to Mule charset names (symbols).
+
+The default value is
+@example
+  ((latin-1 . latin-iso8859-1)
+   (latin-2 . latin-iso8859-2)
+   (latin-3 . latin-iso8859-3)
+   (latin-4 . latin-iso8859-4)
+   (latin-5 . latin-iso8859-9)
+   (latin-9 . latin-iso8859-15)
+   (latin-10 . latin-iso8859-16))
+@end example
+
+If a charset does not exist on your system, it will not complete and you
+will not be able to enter it in response to prompts.  A real charset
+with the same name as an alias in this list will shadow the alias.
+@end defopt
+
+@defopt unity-coding-system-alias-alist
+Alist mapping aliases to Mule coding system names (symbols).
+
+The default value is @samp{nil}.
+@end defopt
+
+@node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification
+@subsection Theory of Operation
+
+Standard encodings suffer from the design defect that they do not
+provide a reliable way to recognize which coded character sets are in
+use.  @xref{What Unification Cannot Do for You}.  There are scores of
+character sets which can be represented by a single octet (8-bit byte),
+whose union contains many hundreds of characters.  Obviously this
+results in great confusion, since you can't tell the players without a
+scorecard, and there is no scorecard.
+
+There are two ways to solve this problem.  The first is to create a
+universal coded character set.  This is the concept behind Unicode.
+However, although there have been satisfactory (nearly) universal
+character sets for several decades, even today many Westerners resist
+using Unicode because they consider its space requirements excessive.
On the other
+hand, Asians dislike Unicode because they consider it to be incomplete.
+(This is partly, but not entirely, political.)
+
+In any case, Unicode only solves the internal representation problem.
+Many data sets will contain files in ``legacy'' encodings, and Unicode
+does not help distinguish among them.
+
+The second approach is to embed information about the encodings used in
+a document in its text.  This approach is taken by the ISO 2022
+standard.  This would solve the problem completely from the users'
+point of view, except that ISO 2022 is basically not implemented at
+all, in the sense that few applications or systems implement more than
+a small subset of ISO 2022 functionality.  This is due to the fact that
+mono-literate users object to the presence of escape sequences in their
+texts (which they, with some justification, consider data corruption).
+Programmers are more than willing to cater to these users, since
+implementing ISO 2022 is a painstaking task.
+
+In fact, Emacs/Mule adopts both of these approaches.  Internally it uses
+a universal character set, @dfn{Mule code}.  Externally it uses ISO 2022
+techniques both to save files in forms robust to encoding issues, and as
+hints when attempting to ``guess'' an unknown encoding.  However, Mule
+suffers from a design defect, namely it embeds the character set
+information that ISO 2022 attaches to runs of characters by introducing
+them with a control sequence in each character.  That causes Mule to
+consider the ISO Latin character sets to be disjoint.  This manifests
+itself when a user enters characters using input methods associated with
+different coded character sets into a single buffer.
+
+There are two problems stemming from this design.  First, Mule
+represents the same character in different ways.  Abstractly, 'ó'
+(LATIN SMALL LETTER O WITH ACUTE) can get represented as
+[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73].
So what looks like
+'óó' in the display might actually be represented [latin-iso8859-1
+#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
+#xF3 ESC - A] in the file.  In some cases this treatment would be
+appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
+(the CJK ideographic character meaning ``one'')), and although arguably
+incorrect it is convenient when mixing the CJK scripts.  But in the case
+of the Latin scripts this is wrong.
+
+Worse yet, it is very likely to occur when mixing ``different''
+encodings (such as ISO 8859/1 and ISO 8859/15) that differ only in a
+few code points that are almost never used.  A very important example
+involves email.  Many sites, especially in the U.S., default to use of
+the ISO 8859/1 coded character set (also called ``Latin 1,'' though
+these are somewhat different concepts).  However, ISO 8859/1 provides a
+generic CURRENCY SIGN character.  Now that the Euro has become the
+official currency of most countries in Europe, this is unsatisfactory
+(and in practice, useless).  So Europeans generally use ISO 8859/15,
+which is nearly identical to ISO 8859/1 for most languages, except that
+it substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign.  Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (eg, German and Polish) can typically be mixed while using
+only one Latin coded character set (in the case of German and Polish,
+ISO 8859/2).
However, this often depends on exactly what text is to be
+encoded (even for the same pair of languages).
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+Because the problem is rarely noticeable in editing a buffer, but tends
+to manifest when that buffer is exported to a file or process, the
+Unification package uses the strategy of examining the buffer prior to
+export.  If use of multiple Latin coded character sets is detected,
+Unification attempts to unify them by finding a single coded character
+set which contains all of the Latin characters in the buffer.
+
+The primary purpose of Unification is to fix the problem by giving the
+user the choice to change the representation of all characters to one
+character set and give sensible recommendations based on context.  In
+the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
+both will be suggested.  In the EURO SIGN example, only ISO 8859/15
+makes sense, and that is what will be recommended.  In both cases, the
+user will be reminded that there are universal encodings available.
+
+I call this @dfn{remapping} (from the universal character set to a
+particular ISO 8859 coded character set).  It is mere accident that this
+letter has the same code point in both character sets.  (Not entirely,
+but there are many examples of Latin characters that have different code
+points in different Latin-X sets.)
+
+Note that, in the 'ó' example, treating the buffer in this way will
+result in a representation such as [latin-iso8859-2
+#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
+This is guaranteed to occasionally result in the second problem you
+observed, to which we now turn.
+
+This problem is that, although the file is intended to be an
+ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every
+POSIX-compliant program---this is required by the standard, obvious if
+you think a bit, @pxref{What Unification Cannot Do for You}) will read
+that file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73].  Of course
+this is no problem if all of the characters in the file are contained
+in ISO 8859/1, but suppose there are some which are not, but are
+contained in the (intended) ISO 8859/2.
+
+You now want to fix this, but not by finding the same character in
+another set.  Instead, you want to simply change the character set that
+Mule associates with that buffer position without changing the code.
+(This is conceptually somewhat distinct from the first problem, and
+logically ought to be handled in the code that defines coding systems.
+However, unification is not an unreasonable place for it.)  Unification
+provides two functions (one fast and dangerous, the other slow and
+careful) to handle this.  I call this @dfn{recoding}, because the
+transformation actually involves @emph{encoding} the buffer to file
+representation, then @emph{decoding} it to buffer representation (in a
+different character set).  This cannot be done automatically because
+Mule can have no idea what the correct encoding is---after all, it
+already gave you its best guess.  @xref{What Unification Cannot Do for
+You}.  So these functions must be invoked by the user.
+@xref{Interactive Usage}.
+
+@node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification
+@subsection What Unification Cannot Do for You
+
+Unification @strong{cannot} save you if you insist on exporting data in
+8-bit encodings in a multilingual environment.  @emph{You will
+eventually corrupt data if you do this.}  It is not Mule's, or any
+application's, fault.  You will have only yourself to blame; consider
+yourself warned.
(It is true that Mule has bugs, which make Mule +somewhat more dangerous and inconvenient than some naive applications. +We're working to address those, but no application can remedy the +inherent defect of 8-bit encodings.) + +Use standard universal encodings, preferably Unicode (UTF-8) unless +applicable standards indicate otherwise. The most important such case +is Internet messages, where MIME should be used, whether or not the +subordinate encoding is a universal encoding. (Note that since one of +the important provisions of MIME is the @samp{Content-Type} header, +which has the charset parameter, MIME is to be considered a universal +encoding for the purposes of this manual. Of course, technically +speaking it's neither a coded character set nor a coding extension +technique compliant with ISO 2022.) + +As mentioned earlier, the problem is that standard encodings suffer from +the design defect that they do not provide a reliable way to recognize +which coded character sets are in use. There are scores of character +sets which can be represented by a single octet (8-bit byte), whose +union contains many hundreds of characters. Thus any 8-bit coded +character set must contain characters that share code points used for +different characters in other coded character sets. + +This means that a given file's intended encoding cannot be identified +with 100% reliability unless it contains encoding markers such as those +provided by MIME or ISO 2022. + +Unification actually makes it more likely that you will have problems of +this kind. Traditionally Mule has been ``helpful'' by simply using an +ISO 2022 universal coding system when the current buffer coding system +cannot handle all the characters in the buffer. This has the effect +that, because the file contains control sequences, it is not recognized +as being in the locale's normal 8-bit encoding. 
It may be annoying if
+you are not a Mule expert, but your data is automatically recoverable
+with a tool you already have: Mule.
+
+However, with unification, Mule converts to a single 8-bit character set
+when possible.  But typically this will @emph{not} be in your usual
+locale.  That is, the time an ISO 8859/1 user will need Unification is
+precisely when there are ISO 8859/2 characters in the buffer.  But then
+most likely the file will be saved in a pure 8-bit encoding that is not
+ISO 8859/1---here, ISO 8859/2.  Mule's autorecognizer (which is probably
+the most sophisticated yet available) cannot tell the difference between
+ISO 8859/1 and ISO 8859/2, and in a Western European locale will choose
+the former even though the latter was intended.  Even the extension
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be at
+all accurate in the case of mixed codes.
+
+So now consider adding some additional ISO 8859/1 text to the buffer.
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled.  You need a human being who can recognize
+that @emph{this is German and Swedish} and can stay in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format.  If you really need the space,
+compress your files.
+
+
+@node Unification Internals, , What Unification Cannot Do for You, Charset Unification
+@subsection Internals
+
+No internals documentation yet.
+
+@file{unity-utils.el} provides one utility function.
+
+@defun unity-dump-tables
+
+Dump the temporary table created by loading @file{unity-utils.el}
+to @file{unity-tables.el}.  Loading the latter file initializes
+@samp{unity-equivalences}.
+
+@end defun
+
+
+@node Charsets and Coding Systems, , Charset Unification, MULE
+@subsection Charsets and Coding Systems
+
+This section provides reference lists of Mule charsets and coding
+systems.  Mule charsets are typically named by character set and
+standard.
+
+@table @strong
+@item ASCII variants
+
+Identification of equivalent characters in these sets is not properly
+implemented.  Unification does not distinguish the two charsets.
+
+@samp{ascii} @samp{latin-jisx0201}
+
+@item Extended Latin
+
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+Unification.
+
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+
+The following charsets are Latin variants which are not understood by
+Unification.  In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters.  None of these
+are identified with their ISO 8859 equivalents.
+ +@samp{vietnamese-viscii-lower} +@samp{vietnamese-viscii-upper} + +@item Other character sets + +@samp{arabic-1-column} +@samp{arabic-2-column} +@samp{arabic-digit} +@samp{arabic-iso8859-6} +@samp{chinese-big5-1} +@samp{chinese-big5-2} +@samp{chinese-cns11643-1} +@samp{chinese-cns11643-2} +@samp{chinese-cns11643-3} +@samp{chinese-cns11643-4} +@samp{chinese-cns11643-5} +@samp{chinese-cns11643-6} +@samp{chinese-cns11643-7} +@samp{chinese-gb2312} +@samp{chinese-isoir165} +@samp{cyrillic-iso8859-5} +@samp{ethiopic} +@samp{greek-iso8859-7} +@samp{hebrew-iso8859-8} +@samp{ipa} +@samp{japanese-jisx0208} +@samp{japanese-jisx0208-1978} +@samp{japanese-jisx0212} +@samp{katakana-jisx0201} +@samp{korean-ksc5601} +@samp{sisheng} +@samp{thai-tis620} +@samp{thai-xtis} + +@item Non-graphic charsets + +@samp{control-1} +@end table + +@table @strong +@item No conversion + +Some of these coding systems may specify EOL conventions. Note that +@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 +coding system. Although unification attempts to compensate for this, it +is possible that the @samp{iso-8859-1} coding system will behave +differently from other ISO 8859 coding systems. + +@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} + +@item Latin coding systems + +These coding systems are all single-byte, 8-bit ISO 2022 coding systems, +combining ASCII in the GL register (bytes with high-bit clear) and an +extended Latin character set in the GR register (bytes with high-bit set). + +@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} +@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} + +These coding systems are single-byte, 8-bit coding systems that do not +conform to international standards. They should be avoided in all +potentially multilingual contexts, including any text distributed over +the Internet and World Wide Web. 
+ +@samp{windows-1251} + +@item Multilingual coding systems + +The following ISO-2022-based coding systems are useful for multilingual +text. + +@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} +@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} + +XEmacs also supports Unicode with the Mule-UCS package. These are the +preferred coding systems for multilingual use. (There is a possible +exception for texts that mix several Asian ideographic character sets.) + +@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} +@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} +@samp{utf-8} @samp{utf-8-ws} + +Development versions of XEmacs (the 21.5 series) support Unicode +internally, with (at least) the following coding systems implemented: + +@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} +@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} + +@item Asian ideographic languages + +The following coding systems are based on ISO 2022, and are more or less +suitable for encoding multilingual texts. They all can represent ASCII +at least, and sometimes several other foreign character sets, without +resort to arbitrary ISO 2022 designations. However, these subsets are +not identified with the corresponding national standards in XEmacs Mule. + +@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} +@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} +@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} +@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} +@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} + +The following coding systems cannot be used for general multilingual +text and do not cooperate well with other coding systems. + +@samp{big5} @samp{shift_jis} + +@item Other languages + +The following coding systems are based on ISO 2022. 
Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+
+The following are character sets that do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+
+@item Special coding systems
+
+Mule uses the following coding systems for special purposes.
+
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+
+@samp{file-name} @samp{keyboard} @samp{terminal}
+
+@end table
+
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}.  These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
+
+
+@c end of mule.texi
+
--- a/man/widget.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/widget.texi Fri Jan 03 12:12:40 2003 +0000 @@ -33,6 +33,7 @@ * Widget Minor Mode:: * Utilities:: * Widget Wishlist:: +* Widget Internals:: @end menu @node Introduction, User Interface, Top, Top @@ -120,7 +121,7 @@ @table @file @item widget.el This will declare the user variables, define the function -@code{widget-define}, and autoload the function @code{widget-create}. +@code{define-widget}, and autoload the function @code{widget-create}. @item wid-edit.el Everything else is here, there is no reason to load it explicitly, as it will be autoloaded when needed. @@ -1359,7 +1360,7 @@ specifying component widgets and new default values for the keyword arguments. -@defun widget-define name class doc &rest args +@defun define-widget name class doc &rest args Define a new widget type named @var{name} from @code{class}. @var{name} and class should both be symbols, @code{class} should be one @@ -1384,7 +1385,7 @@ @end defun -Using @code{widget-define} just stores the definition of the widget type +Using @code{define-widget} just stores the definition of the widget type in the @code{widget-type} property of @var{name}, which is what @code{widget-create} uses. @@ -1558,7 +1559,7 @@ This is only meaningful for radio buttons or checkboxes in a list. @end defun -@node Widget Wishlist, , Utilities, Top +@node Widget Wishlist, Widget Internals, Utilities, Top @comment node-name, next, previous, up @section Wishlist @@ -1620,7 +1621,7 @@ the field, not the end of the field itself. @item -Use and overlay instead of markers to delimit the widget. Create +Use an overlay instead of markers to delimit the widget. Create accessors for the end points. @item @@ -1631,5 +1632,35 @@ @end itemize +@node Widget Internals, , Widget Wishlist, Top +@section Internals + +This (very brief!) section provides a few notes on the internal +structure and implementation of Emacs widgets. Avoid relying on this +information. 
(We intend to improve it, but this will take some time.)
+To the extent that it actually describes APIs, the information will be
+moved to appropriate sections of the manual in due course.
+
+@subsection The @dfn{Widget} and @dfn{Type} Structures
+
+Widgets and types are currently both implemented as lists.
+
+A symbol may be defined as a @dfn{type name} using @code{define-widget}.
+@xref{Defining New Widgets}.  A @dfn{type} is a list whose car is a
+previously defined type name, @code{nil}, or (recursively) a type.  The
+car is the @dfn{class} or parent type of the type, and properties which
+are not specified in the new type will be inherited from ancestors.
+Probably the only type without a class should be the @code{default}
+type.  The cdr of a type is a plist whose keys are widget property
+keywords.
+
+A type or type name may also be referred to as an @dfn{unconverted
+widget}.
+
+A @dfn{converted widget} or @dfn{widget instance} is a list whose car is
+a type name or a type, and whose cdr is a property list.  Furthermore,
+all children of the converted widget must be converted.  Finally, in the
+process of conversion, appropriate parts of the list structure are
+copied to ensure that changes in the values of one instance do not
+affect another's.
+
 @contents
 @bye
--- a/man/xemacs-faq.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs-faq.texi Fri Jan 03 12:12:40 2003 +0000 @@ -7,7 +7,7 @@ @finalout @titlepage @title XEmacs FAQ -@subtitle Frequently asked questions about XEmacs @* Last Modified: $Date: 2002/12/04 14:06:04 $ +@subtitle Frequently asked questions about XEmacs @* Last Modified: $Date: 2003/01/03 12:12:30 $ @sp 1 @author Tony Rossini <rossini@@biostat.washington.edu> @author Ben Wing <ben@@xemacs.org> @@ -1500,6 +1500,11 @@ buggy optimizers. Please see the @file{PROBLEMS} file that comes with XEmacs to read what it says about your platform. +If you compiled XEmacs using @samp{--use-union-type} (or the option +@samp{USE_UNION_TYPE} in @file{config.inc} under Windows), recompile +again without this. This has been known to trigger compiler errors in a +number of cases. + @node Q2.0.7, Q2.0.8, Q2.0.6, Installation @unnumberedsubsec Q2.0.7: Libraries in non-standard locations @@ -1802,18 +1807,29 @@ particular sequences of actions, that cause it to crash. If you can come up with a reproducible way of doing this (or even if you have a pretty good memory of exactly what you were doing at the time), the -maintainers would be very interested in knowing about it. Post a -message to comp.emacs.xemacs or send mail to @email{crashes@@xemacs.org}. -Please note that the @samp{crashes} address is exclusively for crash +maintainers would be very interested in knowing about it. The best way +to report a bug is using @kbd{M-x report-emacs-bug} (or by selecting +@samp{Send Bug Report...} from the Help menu). If that won't work +(e.g. you can't get XEmacs working at all), send ordinary mail to +@email{crashes@@xemacs.org}. @emph{MAKE SURE} to include the output from +the crash, especially including the Lisp backtrace, as well as the +XEmacs configuration from @kbd{M-x describe-installation} (or +equivalently, the file @file{Installation} in the top of the build +tree). 
Please note that the @samp{crashes} address is exclusively for
+crash reports.  The best way to report bugs in general is through the
+@kbd{M-x report-emacs-bug} interface just mentioned, or if necessary by
+emailing @email{xemacs-beta@@xemacs.org}.  Note that the developers do
+@emph{not} usually follow @samp{comp.emacs.xemacs} on a regular basis;
+thus, the newsgroup is better suited to general questions about XEmacs
+than to bug reports.
 
-If at all possible, include a stack backtrace of the core dump that was
-produced.  This shows where exactly things went wrong, and makes it much
-easier to diagnose problems.  To do this, you need to locate the core
-file (it's called @file{core}, and is usually sitting in the directory
-that you started XEmacs from, or your home directory if that other
-directory was not writable).  Then, go to that directory and execute a
-command like:
+If at all possible, include a C stack backtrace of the core dump that
+was produced.  This shows where exactly things went wrong, and makes it
+much easier to diagnose problems.  To do this under Unix, you need to
+locate the core file (it's called @file{core}, and is usually sitting in
+the directory that you started XEmacs from, or your home directory if
+that other directory was not writable).  Then, go to that directory and
+execute a command like:
 
 @example
 gdb `which xemacs` core
@@ -1829,6 +1845,13 @@
 to disable core files by default.  Also see @ref{Q2.1.15}, for tips and
 techniques for dealing with a debugger.
 
+If you're under Microsoft Windows, you're out of luck unless you happen
+to have a debugging aid installed on your system, for example Visual
+C++.  In this case, the crash will result in a message giving you the
+option to enter a debugger (for example, by pressing @samp{Cancel}).  Do
+this and locate the stack-trace window.  (If your XEmacs was built
+without debugging information, the stack trace may not be very useful.)
+ When making a problem report make sure that: @enumerate @@ -1846,12 +1869,12 @@ What build options you are using. @item -If the problem is related to graphics, we will also need to know what -version of the X Window System you are running, and what window manager -you are using. - -@item -If the problem happened on a tty, please include the terminal type. +If the problem is related to graphics and you are running Unix, we will +also need to know what version of the X Window System you are running, +and what window manager you are using. + +@item +If the problem happened on a TTY, please include the terminal type. @end enumerate Much of the information above is automatically generated by @kbd{M-x @@ -2237,7 +2260,7 @@ decode them, do this: @example -call debug_print (OBJECT) +call dp (OBJECT) @end example where @var{OBJECT} is whatever you want to decode (it can be a variable, @@ -2249,14 +2272,14 @@ stack, do this: @example -call debug_backtrace () +call db () @end example @item -Using @code{debug_print} and @code{debug_backtrace} has two -disadvantages - it can only be used with a running xemacs process, and -it cannot display the internal C structure of a Lisp Object. Even if -all you've got is a core dump, all is not lost. +Using @code{dp} and @code{db} has two disadvantages - it can only be +used with a running xemacs process, and it cannot display the internal C +structure of a Lisp Object. Even if all you've got is a core dump, all +is not lost. If you're using GDB, there are some macros in the file @file{src/.gdbinit} in the XEmacs source distribution that should make @@ -2319,8 +2342,8 @@ running the XEmacs process under a debugger, the stack trace should be clean. -@email{1CMC3466@@ibm.mtsac.edu, Curtiss} suggests upgrading to ld.so version 1.8 -if dynamic linking and debugging is a problem on Linux. +@email{1CMC3466@@ibm.mtsac.edu, Curtiss} suggests upgrading to ld.so +version 1.8 if dynamic linking and debugging is a problem on Linux. 
@item
 If you're using a debugger to get a C stack backtrace and you're
@@ -2344,9 +2367,9 @@
 could simply mean that XEmacs attempted to execute code at that address,
 e.g. through jumping to a null function pointer.  Unfortunately, under
 those circumstances, GDB under Linux doesn't know how to get a stack
-trace. (Yes, this is the third Linux-related problem I've mentioned. I
+trace. (Yes, this is the fourth Linux-related problem I've mentioned. I
 have no idea why GDB under Linux is so bogus. Complain to the GDB
-authors, or to comp.os.linux.development.system). Again, you'll have to
+authors, or to comp.os.linux.development.system.) Again, you'll have to
 use the narrowing-down process described above.
 
 @item
@@ -2365,6 +2388,10 @@
 @file{src/gdbinit}.  This had the disadvantage of not being sourced
 automatically by gdb, so you had to set that up yourself.
 
+@item
+If you are running Microsoft Windows, see the file @file{nt/README} for
+further information about debugging XEmacs.
+
 @end itemize
 
 @node Q2.1.16, Q2.1.17, Q2.1.15, Installation
--- a/man/xemacs/mule.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs/mule.texi Fri Jan 03 12:12:40 2003 +0000 @@ -15,6 +15,8 @@ @cindex Korean @cindex Cyrillic @cindex Russian +@c #### It's a lie that this file tells you about Unicode.... +@cindex Unicode If you build XEmacs using the @code{--with-mule} option, it supports a wide variety of world scripts, including the Latin script, the Arabic script, Simplified Chinese (for mainland of China), Traditional Chinese @@ -33,22 +35,25 @@ * Coding Systems:: Character set conversion when you read and write files, and so on. * Recognize Coding:: How XEmacs figures out which conversion to use. +* Unification:: Integrating overlapping character sets. * Specify Coding:: Various ways to choose which conversion to use. +* Charsets and Coding Systems:: Tables and other reference material. @end menu @node Mule Intro, Language Environments, Mule, Mule -@section Introduction to world scripts +@section Introduction: The Wide Variety of Scripts and Codings in Use - The users of these scripts have established many more-or-less standard -coding systems for storing files. -@c XEmacs internally uses a single multibyte character encoding, so that it -@c can intermix characters from all these scripts in a single buffer or -@c string. This encoding represents each non-ASCII character as a sequence -@c of bytes in the range 0200 through 0377. -XEmacs translates between the internal character encoding and various -other coding systems when reading and writing files, when exchanging -data with subprocesses, and (in some cases) in the @kbd{C-q} command -(see below). + There are hundreds of scripts in use world-wide. The users of these +scripts have established many more-or-less standard coding systems for +storing text written in them in files. 
XEmacs translates between its +internal character encoding and various other coding systems when +reading and writing files, when exchanging data with subprocesses, and +(in some cases) in the @kbd{C-q} command (see below). +@footnote{Historically the internal encoding was a specially designed +encoding, called @dfn{Mule encoding}, intended for easy conversion to +and from versions of ISO 2022. However, this encoding shares many +properties with UTF-8, and conversion to UTF-8 as the internal code is +proposed.} @kindex C-h h @findex view-hello-file @@ -356,7 +361,7 @@ the usual three variants to specify the kind of end-of-line conversion. -@node Recognize Coding, Specify Coding, Coding Systems, Mule +@node Recognize Coding, Unification, Coding Systems, Mule @section Recognizing Coding Systems Most of the time, XEmacs can recognize which coding system to use for @@ -427,7 +432,739 @@ Coding}). -@node Specify Coding, , Recognize Coding, Mule +@node Unification, Specify Coding, Recognize Coding, Mule +@section Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This results in oddities such as +files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO +2022 control sequences to switch between them, as well as more +plausible but often unnecessary combinations like ISO 8859/1 with ISO +8859/2. This can be very annoying when sending messages or even in +simple editing on a single host. XEmacs works around the problem by +converting as many characters as possible to use a single Latin coded +character set before saving the buffer. + +Unification is planned for extension to other character set families, +in particular the Han family of character sets based on the Chinese +ideographic characters. At least for the Han sets, however, the +unification feature will be disabled by default. 
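+
+For example, you can control the feature explicitly from your init file
+by calling the commands described under Unification Usage below (a
+sketch; both commands take no arguments):
+
+@example
+(enable-unification)        ; install the check run before saving
+;; ... or, to turn the feature off entirely:
+;; (disable-unification)
+@end example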
+ +This functionality is based on the @file{latin-unity} package by +Stephen Turnbull @email{stephen@@xemacs.org}, but is somewhat +divergent. This documentation is also based on the package +documentation, and is likely to be inaccurate because of the different +constraints we place on ``core'' and packaged functionality. + +@menu +* Unification Overview:: History and general information. +* Unification Usage:: An overview of operation. +* Unification Configuration:: Configuring unification. +* Unification FAQs:: Questions and answers from the mailing list. +* Unification Theory:: How unification works. +* What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. +@end menu + +@node Unification Overview, Unification Usage, Unification, Unification +@subsection An Overview of Character Set Unification + +Mule suffers from a design defect that causes it to consider the ISO +Latin character sets to be disjoint. This manifests itself when a user +enters characters using input methods associated with different coded +character sets into a single buffer. + +A very important example involves email. Many sites, especially in the +U.S., default to use of the ISO 8859/1 coded character set (also called +``Latin 1,'' though these are somewhat different concepts). However, +ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the +Euro has become the official currency of most countries in Europe, this +is unsatisfactory (and in practice, useless). So Europeans generally +use ISO 8859/15, which is nearly identical to ISO 8859/1 for most +languages, except that it substitutes EURO SIGN for CURRENCY SIGN. + +Suppose a European user yanks text from a post encoded in ISO 8859/1 +into a message composition buffer, and enters some text including the +Euro sign. Then Mule will consider the buffer to contain both ISO +8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively +programmed) send the message as a multipart mixed MIME body! 
+
+This is clearly stupid.  What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (e.g., German and Polish) can typically be mixed while using
+only one Latin coded character set (in this case, ISO 8859/2).  However,
+this often depends on exactly what text is to be encoded.
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+
+@node Unification Usage, Unification Configuration, Unification Overview, Unification
+@subsection Operation of Unification
+
+This is a description of the early hack to include unification in
+XEmacs 21.5.  This will almost surely change.
+
+Normally, unification works in the background by installing
+@code{unity-sanity-check} on @code{write-region-pre-hook}.
+Unification is on by default for the ISO-8859 Latin sets.  The user
+activates this functionality for other character set families by
+invoking @code{enable-unification}, either interactively or in her
+init file.  @xref{Init File, , , xemacs}.  Unification can be
+deactivated by invoking @code{disable-unification}.
+
+Unification also provides a few functions for remapping or recoding the
+buffer by hand.  To @dfn{remap} a character means to change the buffer
+representation of the character by using another coded character set.
+Remapping never changes the identity of the character, but may involve
+altering the code point of the character.  To @dfn{recode} a character
+means to simply change the coded character set.  Recoding never alters
+the code point of the character, but may change the identity of the
+character.  @xref{Unification Theory}.
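+
+For example, the remapping operation just described can be invoked by
+hand on the region, using the command documented under Interactive
+Usage below (a sketch):
+
+@example
+;; remap the region's Latin characters to ISO 8859/2 equivalents,
+;; preserving character identity wherever an equivalent exists
+(unity-remap-region (region-beginning) (region-end)
+                    'latin-iso8859-2)
+@end example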
+
+There are a few variables which determine which coding systems are
+always acceptable to unification: @code{unity-ucs-list},
+@code{unity-preferred-coding-system-list}, and
+@code{unity-preapproved-coding-system-list}.  The last defaults to
+@code{(buffer-default preferred)}, and you should probably avoid
+changing it because it short-circuits the sanity check.  If you find you
+need to use it, consider reporting it as a bug or request for
+enhancement.
+
+@menu
+* Basic Functionality::         User interface and customization.
+* Interactive Usage::           Treating text by hand.
+                                Also documents the hook function(s).
+@end menu
+
+
+@node Basic Functionality, Interactive Usage, , Unification Usage
+@subsubsection Basic Functionality
+
+These functions and user options initialize and configure unification.
+In normal use, they are not needed.
+
+@strong{These interfaces will change.  Also, the @samp{unity-} prefix
+is likely to be changed for many of the variables and functions, as
+they are of more general usefulness.}
+
+@defun enable-unification
+Set up hooks and initialize variables for unification.
+
+There are no arguments.
+
+This function is idempotent.  It will reinitialize any hooks or variables
+that are not in initial state.
+@end defun
+
+@defun disable-unification
+There are no arguments.
+
+Clean up hooks and void variables used by unification.
+@end defun
+
+@c #### several changes should go to latin-unity.texi
+@defopt unity-ucs-list
+List of universal coding systems recommended for character set unification.
+
+The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
+
+Order matters; coding systems earlier in the list will be preferred when
+recommending a coding system.  These coding systems will not be used
+without querying the user (unless they are also present in
+@code{unity-preapproved-coding-system-list}), and follow the
+@code{unity-preferred-coding-system-list} in the list of suggested
+coding systems.
+
+If none of the preferred coding systems are feasible, the first in
+this list will be the default.
+
+Notes on certain coding systems: @code{escape-quoted} is a special
+coding system used for autosaves and compiled Lisp in Mule.  You should
+never delete this, although it is rare that a user would want to use it
+directly.  Unification does not try to be ``smart'' about other general
+ISO 2022 coding systems, such as ISO-2022-JP.  (They are not recognized
+as equivalent to @code{iso-2022-7}.)  If your preferred coding system is
+one of these, you may consider adding it to @code{unity-ucs-list}.
+@end defopt
+
+Coding systems which are not Latin and not in
+@code{unity-ucs-list} are handled by short circuiting checks of
+coding system against the next two variables.
+
+@defopt unity-preapproved-coding-system-list
+List of coding systems used without querying the user if feasible.
+
+The default value is @samp{(buffer-default preferred)}.
+
+The first feasible coding system in this list is used.  The special values
+@samp{preferred} and @samp{buffer-default} may be present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+
+Note that, by definition, the first universal coding system in this
+list shadows all other coding systems.  In particular, if your
+preferred coding system is a universal coding system, and
+@code{preferred} is a member of this list, unification will blithely
+convert all your files to that coding system.  This is considered a
+feature, but it may surprise most users.
Users who don't like this
+behavior may put @code{preferred} in
+@code{unity-preferred-coding-system-list}, but not in
+@code{unity-preapproved-coding-system-list}.
+@end defopt
+
+
+@defopt unity-preferred-coding-system-list
+List of coding systems suggested to the user if feasible.
+
+The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
+iso-8859-4 iso-8859-9)}.
+
+If none of the coding systems in
+@samp{unity-preapproved-coding-system-list} are feasible, this list
+will be recommended to the user, followed by the
+@samp{unity-ucs-list} (so those coding systems should not be in
+this list).  The first coding system in this list is the default.  The
+special values @samp{preferred} and @samp{buffer-default} may be
+present:
+
+@table @code
+@item buffer-default
+Use the coding system used by @samp{write-region}, if feasible.
+
+@item preferred
+Use the coding system specified by @samp{prefer-coding-system} if feasible.
+@end table
+
+``Feasible'' means that all characters in the buffer can be represented by
+the coding system.  Coding systems in @samp{unity-ucs-list} are
+always considered feasible.  Other feasible coding systems are computed
+by @samp{unity-representations-feasible-region}.
+@end defopt
+
+
+@defvar unity-iso-8859-1-aliases
+List of coding systems to be treated as aliases of ISO 8859/1.
+
+The default value is @code{'(iso-8859-1)}.
+
+This is not a user variable; to customize input of coding systems or
+charsets, use @samp{unity-coding-system-alias-alist} or
+@samp{unity-charset-alias-alist}.
+@end defvar
+
+
+@node Interactive Usage, , Basic Functionality, Unification Usage
+@subsubsection Interactive Usage
+
+First, the hook function @code{unity-sanity-check} is documented.
+(It is placed here because it is not an interactive function, and there
+is not yet a programmer's section of the manual.)
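+
+As described under Operation of Unification above, this hook function is
+normally installed by @code{enable-unification}; a sketch of the
+equivalent manual setup:
+
+@example
+(add-hook 'write-region-pre-hook 'unity-sanity-check)
+@end example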
+
+These functions provide access to internal functionality (such as the
+remapping function) and to extra functionality (the recoding functions
+and the test function).
+
+@defun unity-sanity-check begin end filename append visit lockname &optional coding-system
+
+Check if @var{coding-system} can represent all characters between
+@var{begin} and @var{end}.
+
+For compatibility with old broken versions of @code{write-region},
+@var{coding-system} defaults to @code{buffer-file-coding-system}.
+@var{filename}, @var{append}, @var{visit}, and @var{lockname} are
+ignored.
+
+Return @code{nil} if @code{buffer-file-coding-system} is not
+(ISO-2022-compatible) Latin. If @code{buffer-file-coding-system} is
+safe for the charsets actually present in the buffer, return it.
+Otherwise, ask the user to choose a coding system, and return that.
+
+This function does @emph{not} do the safe thing when
+@code{buffer-file-coding-system} is @code{nil} (aka
+@code{no-conversion}). It considers that ``non-Latin,'' and passes it
+on to the Mule detection mechanism.
+
+This function is intended for use as a @code{write-region-pre-hook}. It
+does nothing except return @var{coding-system} if @code{write-region}
+handlers are inhibited.
+@end defun
+
+@defun unity-buffer-representations-feasible
+There are no arguments.
+
+Apply @code{unity-region-representations-feasible} to the current
+buffer.
+@end defun
+
+@defun unity-region-representations-feasible begin end &optional buf
+Return character sets that can represent the text from @var{begin} to
+@var{end} in @var{buf}.
+
+@c #### Fix in latin-unity.texi.
+@var{buf} defaults to the current buffer. When called interactively,
+the function is applied to the region. The function assumes
+@var{begin} <= @var{end}.
+
+The return value is a cons. The car is the list of character sets
+that can individually represent all of the non-ASCII portion of the
+buffer, and the cdr is the list of character sets that can
+individually represent all of the ASCII portion.
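+
+A hypothetical usage sketch (the destructuring shown is purely
+illustrative; which charsets appear depends entirely on the region's
+contents):
+
+@example
+;; Pick a single charset covering the non-ASCII part of the
+;; region, if one exists (nil otherwise).
+(let ((feasible (unity-region-representations-feasible
+                 (region-beginning) (region-end))))
+  (car (car feasible)))  ; first charset for the non-ASCII portion
+@end example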
+
+The following is taken from a comment in the source. Please refer to
+the source to be sure of an accurate description.
+
+The basic algorithm is to map over the region, compute the set of
+charsets that can represent each character (the ``feasible charset''),
+and take the intersection of those sets.
+
+The current implementation takes advantage of the fact that ASCII
+characters are common and cannot change asciisets. Using
+@code{skip-chars-forward} then makes motion over ASCII subregions very
+fast.
+
+This same strategy could be applied generally by precomputing classes
+of characters equivalent according to their effect on latinsets, and
+adding a whole class to the @code{skip-chars-forward} string once a
+member is found.
+
+Probably efficiency is a function of the number of characters matched,
+or maybe the length of the match string? With @code{skip-category-forward}
+over a precomputed category table it should be really fast. In practice
+for Latin character sets there are only 29 classes.
+@end defun
+
+@defun unity-remap-region begin end character-set &optional coding-system
+
+Remap characters between @var{begin} and @var{end} to equivalents in
+@var{character-set}. Optional argument @var{coding-system} may be a
+coding system name (a symbol) or @code{nil}. Characters with no
+equivalent are left as-is.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{character-set}. The function does completion, knows
+how to guess a character set name from a coding system name, and also
+provides some common aliases. See @code{unity-guess-charset}.
+There is no way to specify @var{coding-system}, as it has no useful
+function interactively.
+
+Return @var{coding-system} if @var{coding-system} can encode all
+characters in the region, @code{t} if @var{coding-system} is @code{nil}
+and the coding system with G0 = @code{ascii} and G1 =
+@var{character-set} can encode all characters, and otherwise @code{nil}.
+Note that a non-null return does @emph{not} mean it is safe to write the
+file, only the specified region.
+(This behavior is useful for multipart MIME encoding and the like.)
+
+Note: by default this function is quite fascist about universal coding
+systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
+@samp{ctext}. Customize @code{unity-approved-ucs-list} to change
+this.
+
+This function remaps characters that are artificially distinguished by Mule
+internal code. It may change the code point as well as the character set.
+To recode characters that were decoded in the wrong coding system, use
+@code{unity-recode-region}.
+@end defun
+
+@defun unity-recode-region begin end wrong-cs right-cs
+
+Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
+to @var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are character sets. Characters retain
+the same code point but the character set is changed. Only characters
+from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
+character may change. Note that this could be dangerous, if characters
+whose identities you do not want changed are included in the region.
+This function cannot guess which characters you want changed, and which
+should be left alone.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}. The function does
+completion, knows how to guess a character set name from a coding system
+name, and also provides some common aliases. See
+@code{unity-guess-charset}.
+
+Another way to accomplish this, but using coding systems rather than
+character sets to specify the desired recoding, is
+@samp{unity-recode-coding-region}. That function may be faster
+but is somewhat more dangerous, because it may recode more than one
+character set.
+
+To change from one Mule representation to another without changing identity
+of any characters, use @samp{unity-remap-region}.
+@end defun
+
+@defun unity-recode-coding-region begin end wrong-cs right-cs
+
+Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
+@var{right-cs}.
+
+@var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
+the same code point but the character set is changed. The identity of
+characters may change. This is an inherently dangerous function;
+multilingual text may be recoded in unexpected ways. It is also
+dangerous because the coding systems are not sanity-checked in the
+current implementation.
+
+When called interactively, @var{begin} and @var{end} are set to the
+beginning and end, respectively, of the active region, and the function
+prompts for @var{wrong-cs} and @var{right-cs}. The function does
+completion, knows how to guess a coding system name from a character set
+name, and also provides some common aliases. See
+@code{unity-guess-coding-system}.
+
+Another, safer, way to accomplish this, using character sets rather
+than coding systems to specify the desired recoding, is to use
+@code{unity-recode-region}.
+
+To change from one Mule representation to another without changing identity
+of any characters, use @code{unity-remap-region}.
+@end defun
+
+Helper functions for input of coding system and character set names.
+
+@defun unity-guess-charset candidate
+Guess a charset based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-charset-alias-alist}.
+@end defun
+
+@defun unity-guess-coding-system candidate
+Guess a coding system based on the symbol @var{candidate}.
+
+@var{candidate} itself is not tried as the value.
+
+Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
+the values in @samp{unity-coding-system-alias-alist}.
+@end defun
+
+@defun unity-example
+
+A cheesy example for unification.
+
+At present it just makes a multilingual buffer. To test, set
+@code{buffer-file-coding-system} to some value, make the buffer dirty
+(e.g., with @kbd{RET} @kbd{BackSpace}), and save.
+@end defun
+
+
+@node Unification Configuration, Unification FAQs, Unification Usage, Unification
+@subsection Configuring Unification for Use
+
+If you want unification to be automatically initialized, invoke
+@samp{enable-unification} with no arguments in your init file.
+@xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
+earlier than 21.1, you should also load @file{auto-autoloads} using the
+full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
+
+You may wish to define aliases for commonly used character sets and
+coding systems for convenience in input.
+
+@defopt unity-charset-alias-alist
+Alist mapping aliases to Mule charset names (symbols).
+
+The default value is
+@example
+ ((latin-1 . latin-iso8859-1)
+  (latin-2 . latin-iso8859-2)
+  (latin-3 . latin-iso8859-3)
+  (latin-4 . latin-iso8859-4)
+  (latin-5 . latin-iso8859-9)
+  (latin-9 . latin-iso8859-15)
+  (latin-10 . latin-iso8859-16))
+@end example
+
+If a charset does not exist on your system, its alias will not complete
+and you will not be able to enter it in response to prompts. A real
+charset with the same name as an alias in this list will shadow the
+alias.
+@end defopt
+
+@defopt unity-coding-system-alias-alist
+Alist mapping aliases to Mule coding system names (symbols).
+
+The default value is @samp{nil}.
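+
+For example, a user who habitually types @samp{latin-0} for ISO 8859/15
+might set (the alias name here is purely illustrative):
+
+@example
+;; Accept "latin-0" at coding system prompts as iso-8859-15.
+(setq unity-coding-system-alias-alist
+      '((latin-0 . iso-8859-15)))
+@end example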
+@end defopt
+
+
+@node Unification FAQs, Unification Theory, Unification Configuration, Unification
+@subsection Frequently Asked Questions About Unification
+
+@enumerate
+@item
+I'm smarter than XEmacs's unification feature! How can that be?
+
+Don't be surprised. Trust yourself.
+
+Unification is very young as yet. Teach it what you know by
+Customizing its variables, and report your changes to the maintainer
+(@kbd{M-x report-xemacs-bug RET}).
+
+@item
+What is a UCS?
+
+According to ISO 10646, a Universal Coded character Set. In
+XEmacs, it's a Universal (Mule) Coding System.
+@ref{Coding Systems, , , xemacs}.
+
+@item
+I know @code{utf-16-le-bom} is a UCS, but unification won't use it.
+Why not?
+
+There are an awful lot of UCSes in Mule, most of which you probably
+never want to use, and definitely do not want to be asked about. So the
+default set includes a few that the author thought plausible, but
+they're surely not comprehensive or optimal.
+
+Customize @code{unity-ucs-list} to include the ones you use often, and
+report your favorites to the maintainer for consideration for
+inclusion in the defaults using @kbd{M-x report-xemacs-bug RET}.
+(Note that you @emph{must} include @code{escape-quoted} in this list,
+because Mule uses it internally as the coding system for auto-save
+files.)
+
+Alternatively, if you just want to use it this one time, simply type
+it in at the prompt. Unification will confirm that it is a real coding
+system, and then assume that you know what you're doing.
+
+@item
+This is crazy: I can't quit XEmacs without being queried about
+autosaves! Why?
+
+You probably removed @code{escape-quoted} from
+@code{unity-ucs-list}. Put it back.
+
+@item
+Unification is really buggy and I can't get any work done.
+
+First, use @kbd{M-x disable-unification RET}, then report your
+problems as a bug (@kbd{M-x report-xemacs-bug RET}).
+@end enumerate
+
+
+@node Unification Theory, What Unification Cannot Do for You, Unification FAQs, Unification
+@subsection Unification Theory
+
+Standard encodings suffer from the design defect that they do not
+provide a reliable way to recognize which coded character sets are in
+use. @xref{What Unification Cannot Do for You}. There are scores of
+character sets which can be represented by a single octet (8-bit
+byte), whose union contains many hundreds of characters. Obviously
+this results in great confusion, since you can't tell the players
+without a scorecard, and there is no scorecard.
+
+There are two ways to solve this problem. The first is to create a
+universal coded character set. This is the concept behind Unicode.
+However, there have been satisfactory (nearly) universal character
+sets for several decades, but even today many Westerners resist using
+Unicode because they consider its space requirements excessive. On
+the other hand, many Asians dislike Unicode because they consider it
+to be incomplete. (This is partly, but not entirely, political.)
+
+In any case, Unicode only solves the internal representation problem.
+Many data sets will contain files in ``legacy'' encodings, and Unicode
+does not help distinguish among them.
+
+The second approach is to embed information about the encodings used in
+a document in its text. This approach is taken by the ISO 2022
+standard. This would solve the problem completely from the users'
+point of view, except that ISO 2022 is basically not implemented at all,
+in the sense that few applications or systems implement more than a small
+subset of ISO 2022 functionality. This is due to the fact that
+mono-literate users object to the presence of escape sequences in their
+texts (which they, with some justification, consider data corruption).
+Programmers are more than willing to cater to these users, since
+implementing ISO 2022 is a painstaking task.
+
+In fact, Emacs/Mule adopts both of these approaches.
Internally it uses
+a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
+techniques both to save files in forms robust to encoding issues, and as
+hints when attempting to ``guess'' an unknown encoding. However, Mule
+suffers from a design defect: the character set information that ISO
+2022 attaches to whole runs of characters is instead carried by each
+individual character. That causes Mule to consider the ISO Latin
+character sets to be disjoint. This manifests itself when a user enters
+characters using input methods associated with different coded character
+sets into a single buffer.
+
+There are two problems stemming from this design. First, Mule
+represents the same character in different ways. Abstractly, 'ó'
+(LATIN SMALL LETTER O WITH ACUTE) can get represented as
+[latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
+'óó' in the display might actually be represented [latin-iso8859-1
+#x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
+#xF3 ESC - A] in the file. In some cases this treatment would be
+appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
+(the CJK ideographic character meaning ``one'')), and although arguably
+incorrect it is convenient when mixing the CJK scripts. But in the case
+of the Latin scripts this is wrong.
+
+Worse yet, it is very likely to occur when mixing ``different'' encodings
+(such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
+points that are almost never used. A very important example involves
+email. Many sites, especially in the U.S., default to use of the ISO
+8859/1 coded character set (also called ``Latin 1,'' though these are
+somewhat different concepts). However, ISO 8859/1 provides a generic
+CURRENCY SIGN character. Now that the Euro has become the official
+currency of most countries in Europe, this is unsatisfactory (and in
+practice, useless).
So Europeans generally use ISO 8859/15, which is
+nearly identical to ISO 8859/1 for most languages, except that it
+substitutes EURO SIGN for CURRENCY SIGN.
+
+Suppose a European user yanks text from a post encoded in ISO 8859/1
+into a message composition buffer, and enters some text including the
+Euro sign. Then Mule will consider the buffer to contain both ISO
+8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
+programmed) send the message as a multipart mixed MIME body!
+
+This is clearly stupid. What is not as obvious is that, just as any
+European can include American English in their text because ASCII is a
+subset of ISO 8859/15, most European languages which use Latin
+characters (e.g., German and Polish) can typically be mixed while using
+only one Latin coded character set (in the case of German and Polish,
+ISO 8859/2). However, this often depends on exactly what text is to be
+encoded (even for the same pair of languages).
+
+Unification works around the problem by converting as many characters as
+possible to use a single Latin coded character set before saving the
+buffer.
+
+Because the problem is rarely noticeable in editing a buffer, but tends
+to manifest when that buffer is exported to a file or process,
+unification uses the strategy of examining the buffer prior to export.
+If use of multiple Latin coded character sets is detected, unification
+attempts to unify them by finding a single coded character set which
+contains all of the Latin characters in the buffer.
+
+The primary purpose of unification is to fix the problem by giving the
+user the choice to change the representation of all characters to one
+character set and give sensible recommendations based on context. In
+the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
+both will be suggested. In the EURO SIGN example, only ISO 8859/15
+makes sense, and that is what will be recommended.
In both cases, the
+user will be reminded that there are universal encodings available.
+
+I call this @dfn{remapping} (from the universal character set to a
+particular ISO 8859 coded character set). It is mere accident that this
+letter has the same code point in both character sets. (Not entirely an
+accident, of course; there are also many examples of Latin characters
+that have different code points in different Latin-X sets.)
+
+Note that, in the 'ó' example, treating the buffer in this way will
+result in a representation such as [latin-iso8859-2
+#x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
+This is guaranteed to occasionally result in the second problem you
+observed, to which we now turn.
+
+This problem is that, although the file is intended to be an
+ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every
+POSIX-compliant program---this is required by the standard, obvious if
+you think a bit, @pxref{What Unification Cannot Do for You}) will read
+that file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course
+this is no problem if all of the characters in the file are contained in
+ISO 8859/1, but suppose there are some which are not, but are contained
+in the (intended) ISO 8859/2.
+
+You now want to fix this, but not by finding the same character in
+another set. Instead, you want to simply change the character set
+that Mule associates with that buffer position without changing the
+code. (This is conceptually somewhat distinct from the first problem,
+and logically ought to be handled in the code that defines coding
+systems. However, unification is not an unreasonable place for it.)
+Unification provides two functions (one fast and dangerous, the other
+@c #### fix latin-unity.texi
+slower and careful) to handle this.
I call this @dfn{recoding}, because +the transformation actually involves @emph{encoding} the buffer to +file representation, then @emph{decoding} it to buffer representation +(in a different character set). This cannot be done automatically +because Mule can have no idea what the correct encoding is---after +all, it already gave you its best guess. @xref{What Unification +Cannot Do for You}. So these functions must be invoked by the user. +@xref{Interactive Usage}. + + +@node What Unification Cannot Do for You, , Unification Theory, Unification +@subsection What Unification Cannot Do for You + +Unification @strong{cannot} save you if you insist on exporting data in +8-bit encodings in a multilingual environment. @emph{You will +eventually corrupt data if you do this.} It is not Mule's, or any +application's, fault. You will have only yourself to blame; consider +yourself warned. (It is true that Mule has bugs, which make Mule +somewhat more dangerous and inconvenient than some naive applications. +We're working to address those, but no application can remedy the +inherent defect of 8-bit encodings.) + +Use standard universal encodings, preferably Unicode (UTF-8) unless +applicable standards indicate otherwise. The most important such case +is Internet messages, where MIME should be used, whether or not the +subordinate encoding is a universal encoding. (Note that since one of +the important provisions of MIME is the @samp{Content-Type} header, +which has the charset parameter, MIME is to be considered a universal +encoding for the purposes of this manual. Of course, technically +speaking it's neither a coded character set nor a coding extension +technique compliant with ISO 2022.) + +As mentioned earlier, the problem is that standard encodings suffer from +the design defect that they do not provide a reliable way to recognize +which coded character sets are in use. 
There are scores of character
+sets which can be represented by a single octet (8-bit byte), whose
+union contains many hundreds of characters. Thus any 8-bit coded
+character set must contain characters that share code points used for
+different characters in other coded character sets.
+
+This means that a given file's intended encoding cannot be identified
+with 100% reliability unless it contains encoding markers such as those
+provided by MIME or ISO 2022.
+
+Unification actually makes it more likely that you will have problems of
+this kind. Traditionally Mule has been ``helpful'' by simply using an
+ISO 2022 universal coding system when the current buffer coding system
+cannot handle all the characters in the buffer. This has the effect
+that, because the file contains control sequences, it is not recognized
+as being in the locale's normal 8-bit encoding. It may be annoying if
+@c #### fix in latin-unity.texi
+you are not a Mule expert, but your data is guaranteed to be recoverable
+with a tool you already have: Mule.
+
+However, with unification, Mule converts to a single 8-bit character set
+when possible. But typically this will @emph{not} be in your usual
+locale. That is, the times that an ISO 8859/1 user will need
+unification are when there are ISO 8859/2 characters in the buffer. But
+then most likely the file will be saved in a pure 8-bit encoding that is
+not ISO 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is
+probably the most sophisticated yet available) cannot tell the
+difference between ISO 8859/1 and ISO 8859/2, and in a Western European
+locale will choose the former even though the latter was intended. Even
+the extension
+@c #### fix in latin-unity.texi
+(``statistical recognition'') planned for XEmacs 22 is unlikely to be
+acceptably accurate in the case of mixed codes.
+
+So now consider adding some additional ISO 8859/1 text to the buffer.
+If it includes any ISO 8859/1 codes that are used by different
+characters in ISO 8859/2, you now have a file that cannot be
+mechanically disentangled. You need a human being who can recognize
+that @emph{this is German and Swedish} and stays in Latin-1, while
+@emph{that is Polish} and needs to be recoded to Latin-2.
+
+Moral: switch to a universal coded character set, preferably Unicode
+using the UTF-8 transformation format. If you really need the space,
+compress your files.
+
+
+@node Specify Coding, Charsets and Coding Systems, Unification, Mule
 @section Specifying a Coding System
 
 In cases where XEmacs does not automatically choose the right coding
@@ -549,3 +1286,192 @@
 those non-Latin-1 characters which the specified coding
 system can encode. By default, this variable is @code{nil}, which
 implies that you cannot use non-Latin-1 characters in file names.
+
+
+@node Charsets and Coding Systems, , Specify Coding, Mule
+@section Charsets and Coding Systems
+
+This section provides reference lists of Mule charsets and coding
+systems. Mule charsets are typically named by character set and
+standard.
+
+@table @strong
+@item ASCII variants
+
+Identification of equivalent characters in these sets is not properly
+implemented. Unification does not distinguish the two charsets.
+
+@samp{ascii} @samp{latin-jisx0201}
+
+@item Extended Latin
+
+Characters from the following ISO 2022 conformant charsets are
+identified with equivalents in other charsets in the group by
+unification.
+
+@samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
+@samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
+@samp{latin-iso8859-13} @samp{latin-iso8859-16}
+
+The following charsets are Latin variants which are not understood by
+unification. In addition, many of the Asian language standards provide
+ASCII, at least, and sometimes other Latin characters. None of these
+are identified with their ISO 8859 equivalents.
+ +@samp{vietnamese-viscii-lower} +@samp{vietnamese-viscii-upper} + +@item Other character sets + +@samp{arabic-1-column} +@samp{arabic-2-column} +@samp{arabic-digit} +@samp{arabic-iso8859-6} +@samp{chinese-big5-1} +@samp{chinese-big5-2} +@samp{chinese-cns11643-1} +@samp{chinese-cns11643-2} +@samp{chinese-cns11643-3} +@samp{chinese-cns11643-4} +@samp{chinese-cns11643-5} +@samp{chinese-cns11643-6} +@samp{chinese-cns11643-7} +@samp{chinese-gb2312} +@samp{chinese-isoir165} +@samp{cyrillic-iso8859-5} +@samp{ethiopic} +@samp{greek-iso8859-7} +@samp{hebrew-iso8859-8} +@samp{ipa} +@samp{japanese-jisx0208} +@samp{japanese-jisx0208-1978} +@samp{japanese-jisx0212} +@samp{katakana-jisx0201} +@samp{korean-ksc5601} +@samp{sisheng} +@samp{thai-tis620} +@samp{thai-xtis} + +@item Non-graphic charsets + +@samp{control-1} +@end table + +@table @strong +@item No conversion + +Some of these coding systems may specify EOL conventions. Note that +@samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 +coding system. Although unification attempts to compensate for this, it +is possible that the @samp{iso-8859-1} coding system will behave +differently from other ISO 8859 coding systems. + +@samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} + +@item Latin coding systems + +These coding systems are all single-byte, 8-bit ISO 2022 coding systems, +combining ASCII in the GL register (bytes with high-bit clear) and an +extended Latin character set in the GR register (bytes with high-bit set). + +@samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} +@samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} + +These coding systems are single-byte, 8-bit coding systems that do not +conform to international standards. They should be avoided in all +potentially multilingual contexts, including any text distributed over +the Internet and World Wide Web. 
+ +@samp{windows-1251} + +@item Multilingual coding systems + +The following ISO-2022-based coding systems are useful for multilingual +text. + +@samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} +@samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} + +XEmacs also supports Unicode with the Mule-UCS package. These are the +preferred coding systems for multilingual use. (There is a possible +exception for texts that mix several Asian ideographic character sets.) + +@samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} +@samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} +@samp{utf-8} @samp{utf-8-ws} + +Development versions of XEmacs (the 21.5 series) support Unicode +internally, with (at least) the following coding systems implemented: + +@samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} +@samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} + +@item Asian ideographic languages + +The following coding systems are based on ISO 2022, and are more or less +suitable for encoding multilingual texts. They all can represent ASCII +at least, and sometimes several other foreign character sets, without +resort to arbitrary ISO 2022 designations. However, these subsets are +not identified with the corresponding national standards in XEmacs Mule. + +@samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} +@samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} +@samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} +@samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} +@samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} + +The following coding systems cannot be used for general multilingual +text and do not cooperate well with other coding systems. + +@samp{big5} @samp{shift_jis} + +@item Other languages + +The following coding systems are based on ISO 2022. 
Though none of them
+provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
+to 21.4 defaults to) use of ISO 2022 control sequences to designate
+other character sets for inclusion in the text.
+
+@samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
+@samp{ctext-hebrew}
+
+The following coding systems do not conform to ISO 2022 and
+thus cannot be safely used in a multilingual context.
+
+@samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
+@samp{viscii} @samp{vscii}
+
+@item Special coding systems
+
+Mule uses the following coding systems for special purposes.
+
+@samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
+
+@samp{escape-quoted} is especially important, as it is used internally
+as the coding system for autosaved data.
+
+The following coding systems are aliases for others, and are used for
+communication with the host operating system.
+
+@samp{file-name} @samp{keyboard} @samp{terminal}
+
+@end table
+
+Mule detection of coding systems is actually limited to detection of
+classes of coding systems called @dfn{coding categories}. These coding
+categories are identified by the ISO 2022 control sequences they use, if
+any, by their conformance to ISO 2022 restrictions on code points that
+may be used, and by characteristic patterns of use of 8-bit code points.
+
+@samp{no-conversion}
+@samp{utf-8}
+@samp{ucs-4}
+@samp{iso-7}
+@samp{iso-lock-shift}
+@samp{iso-8-1}
+@samp{iso-8-2}
+@samp{iso-8-designate}
+@samp{shift-jis}
+@samp{big5}
+
+
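+
+The relative priority of these coding categories for autodetection can
+be adjusted from Lisp. A minimal sketch, assuming XEmacs's
+@code{set-coding-priority-list} interface (check your version's
+documentation for the exact name and the valid category symbols):
+
+@example
+;; Try UTF-8 detection before the 8-bit ISO categories.
+;; set-coding-priority-list is assumed here.
+(set-coding-priority-list '(utf-8 iso-7 iso-8-2 iso-8-1 no-conversion))
+@end example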
--- a/man/xemacs/startup.texi Thu Jan 02 22:52:44 2003 +0000 +++ b/man/xemacs/startup.texi Fri Jan 03 12:12:40 2003 +0000 @@ -92,10 +92,10 @@ late hierarchy. At run time, the package path may also be specified via the @code{EMACSPACKAGEPATH} environment variable. -An XEmacs package is laid out just like a normal installed XEmacs lisp -directory. It may have @file{lisp}, @file{etc}, @file{info}, and -@file{lib-src} subdirectories. XEmacs adds these at appropriate places -within the various system-wide paths. +An XEmacs package hierarchy is laid out just like a normal installed +XEmacs lisp directory. It may have @file{lisp}, @file{etc}, +@file{info}, and @file{lib-src} subdirectories. XEmacs adds these at +appropriate places within the various system-wide paths. There may be any number of package hierarchy directories.