comparison man/lispref/mule.texi @ 1183:c1553814932e

[xemacs-hg @ 2003-01-03 12:12:30 by stephent] various docs <873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp> <87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Fri, 03 Jan 2003 12:12:40 +0000
parents 37e56e920ac5
children 11ff4edb6bb7
1182:7d696106ffe9 1183:c1553814932e
22 * Composite Characters:: Making new characters by overstriking other ones. 22 * Composite Characters:: Making new characters by overstriking other ones.
23 * Coding Systems:: Ways of representing a string of chars using integers. 23 * Coding Systems:: Ways of representing a string of chars using integers.
24 * CCL:: A special language for writing fast converters. 24 * CCL:: A special language for writing fast converters.
25 * Category Tables:: Subdividing charsets into groups. 25 * Category Tables:: Subdividing charsets into groups.
26 * Unicode Support:: The universal coded character set. 26 * Unicode Support:: The universal coded character set.
27 * Charset Unification:: Handling overlapping character sets.
28 * Charsets and Coding Systems:: Tables and reference information.
27 @end menu 29 @end menu
28 30
29 @node Internationalization Terminology, Charsets, , MULE 31 @node Internationalization Terminology, Charsets, , MULE
30 @section Internationalization Terminology 32 @section Internationalization Terminology
31 33
2070 Valid values are @code{nil} or a bit vector of size 95. 2072 Valid values are @code{nil} or a bit vector of size 95.
2071 @end defun 2073 @end defun
2072 2074
2073 2075
2074 @c Added 2002-03-13 sjt 2076 @c Added 2002-03-13 sjt
2075 @node Unicode Support, , Category Tables, MULE 2077 @node Unicode Support, Charset Unification, Category Tables, MULE
2076 @section Unicode Support 2078 @section Unicode Support
2077 @cindex unicode 2079 @cindex unicode
2078 @cindex utf-8 2080 @cindex utf-8
2079 @cindex utf-16 2081 @cindex utf-16
2080 @cindex ucs-2 2082 @cindex ucs-2
2179 The charset codepoint is a Big Five codepoint; convert it to the 2181 The charset codepoint is a Big Five codepoint; convert it to the
2180 proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'. 2182 proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'.
2181 @end table 2183 @end table
2182 @end defun 2184 @end defun
2183 2185
2186
2187 @node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE
2188 @section Character Set Unification
2189
2190 Mule suffers from a design defect that causes it to consider the ISO
2191 Latin character sets to be disjoint. This results in oddities such as
2192 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO
2193 2022 control sequences to switch between them, as well as more plausible
2194 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2.
2195 This can be very annoying when sending messages or even in simple
2196 editing on a single host. Unification works around the problem by
2197 converting as many characters as possible to use a single Latin coded
2198 character set before saving the buffer.
2199
2200 This node and its children were ripp'd untimely from
2201 @file{latin-unity.texi}, and have been quickly converted for use here.
2202 However as APIs are likely to diverge, beware of inaccuracies. Please
2203 report any you discover with @kbd{M-x report-xemacs-bug RET}, as well
2204 as any ambiguities or downright unintelligible passages.
2205
2206 A lot of the stuff here doesn't belong here; it belongs in the
2207 @ref{Top, , , xemacs, XEmacs User's Manual}. Report those as bugs,
2208 too, preferably with patches.
2209
2210 @menu
2211 * Overview:: Unification history and general information.
2212 * Usage:: An overview of the operation of Unification.
2213 * Configuration:: Configuring Unification for use.
2214 * Theory of Operation:: How Unification works.
2215 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets.
2216 * Charsets and Coding Systems:: Reference lists with annotations.
2217 * Internals:: Utilities and implementation details.
2218 @end menu
2219
2220 @node Overview, Usage, Charset Unification, Charset Unification
2221 @subsection An Overview of Unification
2222
2223 Mule suffers from a design defect that causes it to consider the ISO
2224 Latin character sets to be disjoint. This manifests itself when a user
2225 enters characters using input methods associated with different coded
2226 character sets into a single buffer.
2227
2228 A very important example involves email. Many sites, especially in the
2229 U.S., default to use of the ISO 8859/1 coded character set (also called
2230 ``Latin 1,'' though these are somewhat different concepts). However,
2231 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the
2232 Euro has become the official currency of most countries in Europe, this
2233 is unsatisfactory (and in practice, useless). So Europeans generally
2234 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most
2235 languages, except that it substitutes EURO SIGN for CURRENCY SIGN.
2236
2237 Suppose a European user yanks text from a post encoded in ISO 8859/1
2238 into a message composition buffer, and enters some text including the
2239 Euro sign. Then Mule will consider the buffer to contain both ISO
2240 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
2241 programmed) send the message as a multipart mixed MIME body!
2242
2243 This is clearly stupid. What is not as obvious is that, just as any
2244 European can include American English in their text because ASCII is a
2245 subset of ISO 8859/15, most European languages which use Latin
2246 characters (eg, German and Polish) can typically be mixed while using
2247 only one Latin coded character set (in this case, ISO 8859/2). However,
2248 this often depends on exactly what text is to be encoded.
2249
2250 Unification works around the problem by converting as many characters as
2251 possible to use a single Latin coded character set before saving the
2252 buffer.
2253
2254 @node Usage, Configuration, Overview, Charset Unification
2255 @subsection Operation of Unification
2256
2257 Normally, Unification works in the background by installing
2258 @code{unity-sanity-check} on @code{write-region-pre-hook}. This is
2259 done by default for the ISO 8859 Latin family of character sets. The
2260 user activates this functionality for other character set families by
2261 invoking @code{enable-unification}, either interactively or in her
2262 init file. @xref{Init File, , , xemacs}. Unification can be
2263 deactivated by invoking @code{disable-unification}.
2264
2265 Unification also provides a few functions for remapping or recoding the
2266 buffer by hand. To @dfn{remap} a character means to change the buffer
2267 representation of the character by using another coded character set.
2268 Remapping never changes the identity of the character, but may involve
2269 altering the code point of the character. To @dfn{recode} a character
2270 means to simply change the coded character set. Recoding never alters
2271 the code point of the character, but may change the identity of the
2272 character. @xref{Theory of Operation}.
2273
2274 There are a few variables which determine which coding systems are
2275 always acceptable to Unification: @code{unity-ucs-list},
2276 @code{unity-preferred-coding-system-list}, and
2277 @code{unity-preapproved-coding-system-list}. The latter two default
2278 to @code{()}, and should probably be avoided because they short-circuit
2279 the sanity check. If you find you need to use them, consider reporting
2280 it as a bug or request for enhancement. Because they seem unsafe, the
2281 recommended interface is likely to change.
2282
2283 @menu
2284 * Basic Functionality:: User interface and customization.
2285 * Interactive Usage:: Treating text by hand.
2286 Also documents the hook function(s).
2287 @end menu
2288
2289
2290 @node Basic Functionality, Interactive Usage, , Usage
2291 @subsubsection Basic Functionality
2292
2293 These functions and user options initialize and configure Unification.
2294 In normal use, none of these should be needed.
2295
2296 @strong{These APIs are certain to change.}
2297
2298 @defun enable-unification
2299 Set up hooks and initialize variables for Unification.
2300
2301 There are no arguments.
2302
2303 This function is idempotent. It will reinitialize any hooks or variables
2304 that are not in initial state.
2305 @end defun
2306
2307 @defun disable-unification
2308 There are no arguments.
2309
2310 Clean up hooks and void variables used by Unification.
2311 @end defun
2312
2313 @defopt unity-ucs-list
2314 List of coding systems considered to be universal.
2315
2316 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}.
2317
2318 Order matters; coding systems earlier in the list will be preferred when
2319 recommending a coding system. These coding systems will not be used
2320 without querying the user (unless they are also present in
2321 @code{unity-preapproved-coding-system-list}), and follow the
2322 @code{unity-preferred-coding-system-list} in the list of suggested
2323 coding systems.
2324
2325 If none of the preferred coding systems are feasible, the first in
2326 this list will be the default.
2327
2328 Notes on certain coding systems: @code{escape-quoted} is a special
2329 coding system used for autosaves and compiled Lisp in Mule. You should
2330 @c #### fix in latin-unity.texi
2331 never delete this, although it is rare that a user would want to use it
2332 directly. Unification does not try to be ``smart'' about other general
2333 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized
2334 as equivalent to @code{iso-2022-7}.) If your preferred coding system is
2335 one of these, you may consider adding it to @code{unity-ucs-list}.
2336 However, this will typically have the side effect that (eg) ISO 8859/1
2337 files will be saved in 7-bit form with ISO 2022 escape sequences.
2338 @end defopt
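
The last point can be illustrated with a minimal init-file sketch. It
assumes that @code{iso-2022-jp} is your preferred coding system and that
@code{unity-ucs-list} is already bound when the form is evaluated; since
these APIs are likely to change, treat it as an illustration rather than
a recommended recipe.

@example
;; Assumption: iso-2022-jp stands in for "your preferred coding system".
;; Append rather than prepend: order matters, and earlier entries
;; are preferred when a coding system is recommended.
(setq unity-ucs-list
      (append unity-ucs-list '(iso-2022-jp)))
@end example

As noted above, a likely side effect is that (eg) ISO 8859/1 files are
then saved in 7-bit form with ISO 2022 escape sequences.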
2339
2340 Coding systems which are not Latin and not in
2341 @code{unity-ucs-list} are handled by short circuiting checks of
2342 coding system against the next two variables.
2343
2344 @defopt unity-preapproved-coding-system-list
2345 List of coding systems used without querying the user if feasible.
2346
2347 The default value is @samp{(buffer-default preferred)}.
2348
2349 The first feasible coding system in this list is used. The special values
2350 @samp{preferred} and @samp{buffer-default} may be present:
2351
2352 @table @code
2353 @item buffer-default
2354 Use the coding system used by @samp{write-region}, if feasible.
2355
2356 @item preferred
2357 Use the coding system specified by @samp{prefer-coding-system} if feasible.
2358 @end table
2359
2360 ``Feasible'' means that all characters in the buffer can be represented by
2361 the coding system. Coding systems in @samp{unity-ucs-list} are
2362 always considered feasible. Other feasible coding systems are computed
2363 by @samp{unity-representations-feasible-region}.
2364
2365 Note that the first universal coding system in this list shadows all
2366 other coding systems. In particular, if your preferred coding system is
2367 a universal coding system, and @code{preferred} is a member of this
2368 list, unification will blithely convert all your files to that coding
2369 system. This is considered a feature, but it may surprise most users.
2370 Users who don't like this behavior should put @code{preferred} in
2371 @code{unity-preferred-coding-system-list}.
2372 @end defopt
2373
2374 @defopt unity-preferred-coding-system-list
2375 @c #### fix in latin-unity.texi
2376 List of coding systems suggested to the user if feasible.
2377
2378 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3
2379 iso-8859-4 iso-8859-9)}.
2380
2381 If none of the coding systems in
2382 @c #### fix in latin-unity.texi
2383 @code{unity-preapproved-coding-system-list} are feasible, this list
2384 will be recommended to the user, followed by the
2385 @code{unity-ucs-list}. The first coding system in this list is default. The
2386 special values @samp{preferred} and @samp{buffer-default} may be
2387 present:
2388
2389 @table @code
2390 @item buffer-default
2391 Use the coding system used by @samp{write-region}, if feasible.
2392
2393 @item preferred
2394 Use the coding system specified by @samp{prefer-coding-system} if feasible.
2395 @end table
2396
2397 ``Feasible'' means that all characters in the buffer can be represented by
2398 the coding system. Coding systems in @samp{unity-ucs-list} are
2399 always considered feasible. Other feasible coding systems are computed
2400 by @samp{unity-representations-feasible-region}.
2401 @end defopt
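
The advice given under @code{unity-preapproved-coding-system-list} about
moving @code{preferred} might be followed with something like the sketch
below. The particular values are assumptions chosen for illustration (in
particular, placing @code{preferred} first, so that it becomes the
suggested default), not settings taken from the package.

@example
;; Do not silently convert files to the preferred coding system...
(setq unity-preapproved-coding-system-list '(buffer-default))

;; ...but suggest it first (the first entry is the default).
;; The remaining entries repeat the documented default value.
(setq unity-preferred-coding-system-list
      '(preferred iso-8859-1 iso-8859-15 iso-8859-2
        iso-8859-3 iso-8859-4 iso-8859-9))
@end example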
2402
2403
2404 @defvar unity-iso-8859-1-aliases
2405 List of coding systems to be treated as aliases of ISO 8859/1.
2406
2407 The default value is @code{'(iso-8859-1)}.
2408
2409 This is not a user variable; to customize input of coding systems or
2410 charsets, use @samp{unity-coding-system-alias-alist} or
2411 @samp{unity-charset-alias-alist}.
2412 @end defvar
2413
2414
2415 @node Interactive Usage, , Basic Functionality, Usage
2416 @subsubsection Interactive Usage
2417
2418 First, the hook function @code{unity-sanity-check} is documented.
2419 (It is placed here because it is not an interactive function, and there
2420 is not yet a programmer's section of the manual.)
2421
2422 These functions provide access to internal functionality (such as the
2423 remapping function) and to extra functionality (the recoding functions
2424 and the test function).
2425
2426
2427 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system
2428
2429 Check if @var{coding-system} can represent all characters between
2430 @var{begin} and @var{end}.
2431
2432 For compatibility with old broken versions of @code{write-region},
2433 @var{coding-system} defaults to @code{buffer-file-coding-system}.
2434 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are
2435 ignored.
2436
2437 Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible)
2438 Latin. If @code{buffer-file-coding-system} is safe for the charsets
2439 actually present in the buffer, return it. Otherwise, ask the user to
2440 choose a coding system, and return that.
2441
2442 This function does @emph{not} do the safe thing when
2443 @code{buffer-file-coding-system} is nil (aka no-conversion). It
2444 considers that ``non-Latin,'' and passes it on to the Mule detection
2445 mechanism.
2446
2447 This function is intended for use as a @code{write-region-pre-hook}. It
2448 does nothing except return @var{coding-system} if @code{write-region}
2449 handlers are inhibited.
2450 @end defun
2451
2452 @defun unity-buffer-representations-feasible
2453
2454 There are no arguments.
2455
2456 Apply @code{unity-region-representations-feasible} to the current buffer.
2457 @end defun
2458
2459 @defun unity-region-representations-feasible begin end &optional buf
2460
2461 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}.
2462
2463 @var{buf} defaults to the current buffer. Called interactively, will be
2464 applied to the region. Function assumes @var{begin} <= @var{end}.
2465
2466 The return value is a cons. The car is the list of character sets
2467 that can individually represent all of the non-ASCII portion of the
2468 buffer, and the cdr is the list of character sets that can
2469 individually represent all of the ASCII portion.
2470
2471 The following is taken from a comment in the source. Please refer to
2472 the source to be sure of an accurate description.
2473
2474 The basic algorithm is to map over the region, compute the set of
2475 charsets that can represent each character (the ``feasible charset''),
2476 and take the intersection of those sets.
2477
2478 The current implementation takes advantage of the fact that ASCII
2479 characters are common and cannot change asciisets. Then using
2480 skip-chars-forward makes motion over ASCII subregions very fast.
2481
2482 This same strategy could be applied generally by precomputing classes
2483 of characters equivalent according to their effect on latinsets, and
2484 adding a whole class to the skip-chars-forward string once a member is
2485 found.
2486
2487 Probably efficiency is a function of the number of characters matched,
2488 or maybe the length of the match string? With @code{skip-category-forward}
2489 over a precomputed category table it should be really fast. In practice
2490 for Latin character sets there are only 29 classes.
2491 @end defun
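
The basic algorithm described above (intersecting the per-character sets
of feasible charsets) can be sketched in a few lines of Lisp. This is not
the package's implementation: the table, the symbolic character names,
and the @code{sketch-} functions are all hypothetical, and the sketch
returns a plain list rather than the cons described above.

@example
;; Illustrative table only: each (hypothetical) character name is
;; paired with the charsets that contain that character.
(defvar sketch-feasible-table
  '((o-acute   latin-iso8859-1 latin-iso8859-2 latin-iso8859-15)
    (euro-sign latin-iso8859-15)
    (l-stroke  latin-iso8859-2)))

(defun sketch-intersection (a b)
  "Return the elements of list A that are also in list B, in A's order."
  (let (result)
    (dolist (x a)
      (if (memq x b)
          (setq result (cons x result))))
    (nreverse result)))

(defun sketch-feasible-charsets (chars)
  "Intersect the feasible charsets of every name in the list CHARS."
  (let ((feasible (cdr (assq (car chars) sketch-feasible-table))))
    (dolist (c (cdr chars))
      (setq feasible
            (sketch-intersection
             feasible (cdr (assq c sketch-feasible-table)))))
    feasible))

(sketch-feasible-charsets '(o-acute euro-sign))
     @result{} (latin-iso8859-15)
(sketch-feasible-charsets '(o-acute l-stroke))
     @result{} (latin-iso8859-2)
(sketch-feasible-charsets '(euro-sign l-stroke))
     @result{} nil
@end example

An empty result means that no single charset in the table can represent
all of the text, which is the case where a universal coding system would
have to be suggested instead.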
2492
2493 @defun unity-remap-region begin end character-set &optional coding-system
2494
2495 Remap characters between @var{begin} and @var{end} to equivalents in
2496 @var{character-set}. Optional argument @var{coding-system} may be a
2497 coding system name (a symbol) or nil. Characters with no equivalent are
2498 left as-is.
2499
2500 When called interactively, @var{begin} and @var{end} are set to the
2501 beginning and end, respectively, of the active region, and the function
2502 prompts for @var{character-set}. The function does completion, knows
2503 how to guess a character set name from a coding system name, and also
2504 provides some common aliases. See @code{unity-guess-charset}.
2505 There is no way to specify @var{coding-system}, as it has no useful
2506 function interactively.
2507
2508 Return @var{coding-system} if @var{coding-system} can encode all
2509 characters in the region, t if @var{coding-system} is nil and the coding
2510 system with G0 = 'ascii and G1 = @var{character-set} can encode all
2511 characters, and otherwise nil. Note that a non-null return does
2512 @emph{not} mean it is safe to write the file, only the specified region.
2513 (This behavior is useful for multipart MIME encoding and the like.)
2514
2515 Note: by default this function is quite fascist about universal coding
2516 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and
2517 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change
2518 this.
2519
2520 This function remaps characters that are artificially distinguished by Mule
2521 internal code. It may change the code point as well as the character set.
2522 To recode characters that were decoded in the wrong coding system, use
2523 @code{unity-recode-region}.
2524 @end defun
2525
2526 @defun unity-recode-region begin end wrong-cs right-cs
2527
2528 Recode characters between @var{begin} and @var{end} from @var{wrong-cs}
2529 to @var{right-cs}.
2530
2531 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain
2532 the same code point but the character set is changed. Only characters
2533 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the
2534 character may change. Note that this could be dangerous, if characters
2535 whose identities you do not want changed are included in the region.
2536 This function cannot guess which characters you want changed, and which
2537 should be left alone.
2538
2539 When called interactively, @var{begin} and @var{end} are set to the
2540 beginning and end, respectively, of the active region, and the function
2541 prompts for @var{wrong-cs} and @var{right-cs}. The function does
2542 completion, knows how to guess a character set name from a coding system
2543 name, and also provides some common aliases. See
2544 @code{unity-guess-charset}.
2545
2546 Another way to accomplish this, but using coding systems rather than
2547 character sets to specify the desired recoding, is
2548 @samp{unity-recode-coding-region}. That function may be faster
2549 but is somewhat more dangerous, because it may recode more than one
2550 character set.
2551
2552 To change from one Mule representation to another without changing identity
2553 of any characters, use @samp{unity-remap-region}.
2554 @end defun
2555
2556 @defun unity-recode-coding-region begin end wrong-cs right-cs
2557
2558 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to
2559 @var{right-cs}.
2560
2561 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain
2562 the same code point but the character set is changed. The identity of
2563 characters may change. This is an inherently dangerous function;
2564 multilingual text may be recoded in unexpected ways. #### It's also
2565 dangerous because the coding systems are not sanity-checked in the
2566 current implementation.
2567
2568 When called interactively, @var{begin} and @var{end} are set to the
2569 beginning and end, respectively, of the active region, and the function
2570 prompts for @var{wrong-cs} and @var{right-cs}. The function does
2571 completion, knows how to guess a coding system name from a character set
2572 name, and also provides some common aliases. See
2573 @code{unity-guess-coding-system}.
2574
2575 Another, safer, way to accomplish this, using character sets rather
2576 than coding systems to specify the desired recoding, is to use
2577 @c #### fixme in latin-unity.texi
2578 @code{unity-recode-region}.
2579
2580 To change from one Mule representation to another without changing identity
2581 of any characters, use @code{unity-remap-region}.
2582 @end defun
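
The difference between remapping and recoding may be easier to see in
calls from Lisp. The following sketch uses only the argument lists
documented above; operating on the whole buffer and the particular
character sets chosen are assumptions made for illustration.

@example
;; Remap: keep each character's identity, changing its charset
;; (and, where necessary, its code point).
(unity-remap-region (point-min) (point-max) 'latin-iso8859-15)

;; Recode: keep each code point, reinterpreting characters that were
;; decoded as ISO 8859/1 as ISO 8859/2 instead.  Identities may change.
(unity-recode-region (point-min) (point-max)
                     'latin-iso8859-1 'latin-iso8859-2)
@end example

Because recoding can change the identity of characters, it is usually
safer to restrict it to a narrower region than the whole buffer.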
2583
2584 Helper functions for input of coding system and character set names.
2585
2586 @defun unity-guess-charset candidate
2587 Guess a charset based on the symbol @var{candidate}.
2588
2589 @var{candidate} itself is not tried as the value.
2590
2591 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
2592 the values in @samp{unity-charset-alias-alist}.
2593 @end defun
2594
2595 @defun unity-guess-coding-system candidate
2596 Guess a coding system based on the symbol @var{candidate}.
2597
2598 @var{candidate} itself is not tried as the value.
2599
2600 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and
2601 the values in @samp{unity-coding-system-alias-alist}.
2602 @end defun
2603
2604 @defun unity-example
2605
2606 A cheesy example for Unification.
2607
2608 At present it just makes a multilingual buffer. To test, setq
2609 buffer-file-coding-system to some value, make the buffer dirty (eg
2610 with RET BackSpace), and save.
2611 @end defun
2612
2613
2614 @node Configuration, Theory of Operation, Usage, Charset Unification
2615 @subsection Configuring Unification for Use
2616
2617 If you want Unification to be automatically initialized, invoke
2618 @samp{enable-unification} with no arguments in your init file.
2619 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs
2620 earlier than 21.1, you should also load @file{auto-autoloads} using the
2621 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries).
2622
2623 You may wish to define aliases for commonly used character sets and
2624 coding systems for convenience in input.
2625
2626 @defopt unity-charset-alias-alist
2627 Alist mapping aliases to Mule charset names (symbols).
2628
2629 The default value is
2630 @example
2631 ((latin-1 . latin-iso8859-1)
2632 (latin-2 . latin-iso8859-2)
2633 (latin-3 . latin-iso8859-3)
2634 (latin-4 . latin-iso8859-4)
2635 (latin-5 . latin-iso8859-9)
2636 (latin-9 . latin-iso8859-15)
2637 (latin-10 . latin-iso8859-16))
2638 @end example
2639
2640 If a charset does not exist on your system, it will not complete and you
2641 will not be able to enter it in response to prompts. A real charset
2642 with the same name as an alias in this list will shadow the alias.
2643 @end defopt
2644
2645 @defopt unity-coding-system-alias-alist
2646 Alist mapping aliases to Mule coding system names (symbols).
2647
2648 The default value is @samp{nil}.
2649 @end defopt
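
Putting the pieces of this section together, an init file fragment might
look like the following minimal sketch. The alias entries are
hypothetical, and their @code{(@var{alias} . @var{name})} form is assumed
by analogy with the default value of @code{unity-charset-alias-alist}.

@example
;; Initialize Unification at startup.
(enable-unification)

;; Hypothetical coding system aliases, for convenience at prompts.
(setq unity-coding-system-alias-alist
      '((latin-1 . iso-8859-1)
        (latin-9 . iso-8859-15)))
@end example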
2650
2651
2652 @node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification
2653 @subsection Theory of Operation
2654
2655 Standard encodings suffer from the design defect that they do not
2656 provide a reliable way to recognize which coded character sets are in use.
2657 @xref{What Unification Cannot Do for You}. There are scores of
2658 character sets which can be represented by a single octet (8-bit byte),
2659 whose union contains many hundreds of characters. Obviously this
2660 results in great confusion, since you can't tell the players without a
2661 scorecard, and there is no scorecard.
2662
2663 There are two ways to solve this problem. The first is to create a
2664 universal coded character set. This is the concept behind Unicode.
2665 However, although there have been satisfactory (nearly) universal character sets
2666 for several decades, even today many Westerners resist using Unicode
2667 because they consider its space requirements excessive. On the other
2668 hand, Asians dislike Unicode because they consider it to be incomplete.
2669 (This is partly, but not entirely, political.)
2670
2671 In any case, Unicode only solves the internal representation problem.
2672 Many data sets will contain files in ``legacy'' encodings, and Unicode
2673 does not help distinguish among them.
2674
2675 The second approach is to embed information about the encodings used in
2676 a document in its text. This approach is taken by the ISO 2022
2677 standard. This would solve the problem completely from the users' point of
2678 view, except that ISO 2022 is basically not implemented at all, in the
2679 sense that few applications or systems implement more than a small
2680 subset of ISO 2022 functionality. This is due to the fact that
2681 mono-literate users object to the presence of escape sequences in their
2682 texts (which they, with some justification, consider data corruption).
2683 Programmers are more than willing to cater to these users, since
2684 implementing ISO 2022 is a painstaking task.
2685
2686 In fact, Emacs/Mule adopts both of these approaches. Internally it uses
2687 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022
2688 techniques both to save files in forms robust to encoding issues, and as
2689 hints when attempting to ``guess'' an unknown encoding. However, Mule
2690 suffers from a design defect, namely it embeds the character set
2691 information that ISO 2022 attaches to runs of characters by introducing
2692 them with a control sequence in each character. That causes Mule to
2693 consider the ISO Latin character sets to be disjoint. This manifests
2694 itself when a user enters characters using input methods associated with
2695 different coded character sets into a single buffer.
2696
2697 There are two problems stemming from this design. First, Mule
2698 represents the same character in different ways. Abstractly, 'ó'
2699 (LATIN SMALL LETTER O WITH ACUTE) can get represented as
2700 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like
2701 'óó' in the display might actually be represented [latin-iso8859-1
2702 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B
2703 #xF3 ESC - A] in the file. In some cases this treatment would be
2704 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00
2705 (the CJK ideographic character meaning ``one'')), and although arguably
2706 incorrect it is convenient when mixing the CJK scripts. But in the case
2707 of the Latin scripts this is wrong.
2708
2709 Worse yet, it is very likely to occur when mixing ``different'' encodings
2710 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code
2711 points that are almost never used. A very important example involves
2712 email. Many sites, especially in the U.S., default to use of the ISO
2713 8859/1 coded character set (also called ``Latin 1,'' though these are
2714 somewhat different concepts). However, ISO 8859/1 provides a generic
2715 CURRENCY SIGN character. Now that the Euro has become the official
2716 currency of most countries in Europe, this is unsatisfactory (and in
2717 practice, useless). So Europeans generally use ISO 8859/15, which is
2718 nearly identical to ISO 8859/1 for most languages, except that it
2719 substitutes EURO SIGN for CURRENCY SIGN.
2720
2721 Suppose a European user yanks text from a post encoded in ISO 8859/1
2722 into a message composition buffer, and enters some text including the
2723 Euro sign. Then Mule will consider the buffer to contain both ISO
2724 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively
2725 programmed) send the message as a multipart mixed MIME body!
2726
2727 This is clearly stupid. What is not as obvious is that, just as any
2728 European can include American English in their text because ASCII is a
2729 subset of ISO 8859/15, most European languages which use Latin
2730 characters (eg, German and Polish) can typically be mixed while using
2731 only one Latin coded character set (in the case of German and Polish,
2732 ISO 8859/2). However, this often depends on exactly what text is to be
2733 encoded (even for the same pair of languages).
2734
2735 Unification works around the problem by converting as many characters as
2736 possible to use a single Latin coded character set before saving the
2737 buffer.
2738
2739 Because the problem is rarely noticeable in editing a buffer, but tends
2740 to manifest when that buffer is exported to a file or process, the
2741 Unification package uses the strategy of examining the buffer prior to
2742 export. If use of multiple Latin coded character sets is detected,
2743 Unification attempts to unify them by finding a single coded character
2744 set which contains all of the Latin characters in the buffer.
2745
2746 The primary purpose of Unification is to fix the problem by giving the
2747 user the choice to change the representation of all characters to one
2748 character set and give sensible recommendations based on context. In
2749 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and
2750 both will be suggested. In the EURO SIGN example, only ISO 8859/15
2751 makes sense, and that is what will be recommended. In both cases, the
2752 user will be reminded that there are universal encodings available.
2753
2754 I call this @dfn{remapping} (from the universal character set to a
2755 particular ISO 8859 coded character set). It is mere accident that this
2756 letter has the same code point in both character sets. (Not entirely,
2757 but there are many examples of Latin characters that have different code
2758 points in different Latin-X sets.)
2759
2760 Note that, in the 'ó' example, treating the buffer in this way will
2761 result in a representation such as [latin-iso8859-2
2762 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3].
2763 This is guaranteed to occasionally result in the second problem,
2764 to which we now turn.
2765
2766 This problem is that, although the file is intended to be an
2767 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX
2768 compliant program---this is required by the standard, obvious if you
2769 think a bit, @pxref{What Unification Cannot Do for You}) will read that
2770 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this
2771 is no problem if all of the characters in the file are contained in ISO
2772 8859/1, but suppose there are some which are not, but are contained in
2773 the (intended) ISO 8859/2.
2774
2775 You now want to fix this, but not by finding the same character in
2776 another set. Instead, you want to simply change the character set that
2777 Mule associates with that buffer position without changing the code.
2778 (This is conceptually somewhat distinct from the first problem, and
2779 logically ought to be handled in the code that defines coding systems.
2780 However, unification is not an unreasonable place for it.) Unification
2781 provides two functions (one fast and dangerous, the other slow and
2782 careful) to handle this. I call this @dfn{recoding}, because the
2783 transformation actually involves @emph{encoding} the buffer to file
2784 representation, then @emph{decoding} it to buffer representation (in a
2785 different character set). This cannot be done automatically because
2786 Mule can have no idea what the correct encoding is---after all, it
2787 already gave you its best guess. @xref{What Unification Cannot Do for
2788 You}. So these functions must be invoked by the user. @xref{Interactive
2789 Usage}.
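
The difference between remapping and recoding can be illustrated outside
Mule. The following is a minimal sketch in Python, not Mule's
implementation: the helper names remap and recode are hypothetical, and
Python's iso8859_1 and iso8859_2 codecs are assumed to stand in for the
corresponding Mule coding systems.

```python
# Remapping keeps the character and looks up its code in the target set.
# Recoding keeps the octet and changes which set is used to interpret it.

def remap(text, target):
    """Encode the same characters in another coded character set."""
    return text.encode(target)

def recode(text, assumed, intended):
    """Re-interpret the octets behind `text` under a different character set."""
    return text.encode(assumed).decode(intended)

# o-acute (U+00F3) happens to be octet 0xF3 in both Latin-1 and Latin-2.
o_acute_latin1 = remap("\u00f3", "iso8859_1")
o_acute_latin2 = remap("\u00f3", "iso8859_2")

# Octet 0xB1 is PLUS-MINUS SIGN in Latin-1 but a-ogonek in Latin-2.
# A Latin-2 file read in a Latin-1 locale therefore shows the wrong character.
misread = b"\xb1".decode("iso8859_1")
fixed = recode(misread, "iso8859_1", "iso8859_2")
```

Here recode only succeeds because a human supplied the intended character
set; nothing in the octet itself says which reading is correct.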
2790
2791
2792 @node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification
2793 @subsection What Unification Cannot Do for You
2794
2795 Unification @strong{cannot} save you if you insist on exporting data in
2796 8-bit encodings in a multilingual environment. @emph{You will
2797 eventually corrupt data if you do this.} It is not Mule's, or any
2798 application's, fault. You will have only yourself to blame; consider
2799 yourself warned. (It is true that Mule has bugs, which make Mule
2800 somewhat more dangerous and inconvenient than some naive applications.
2801 We're working to address those, but no application can remedy the
2802 inherent defect of 8-bit encodings.)
2803
2804 Use standard universal encodings, preferably Unicode (UTF-8) unless
2805 applicable standards indicate otherwise. The most important such case
2806 is Internet messages, where MIME should be used, whether or not the
2807 subordinate encoding is a universal encoding. (Note that since one of
2808 the important provisions of MIME is the @samp{Content-Type} header,
2809 which has the charset parameter, MIME is to be considered a universal
2810 encoding for the purposes of this manual. Of course, technically
2811 speaking it's neither a coded character set nor a coding extension
2812 technique compliant with ISO 2022.)
2813
2814 As mentioned earlier, the problem is that standard encodings suffer from
2815 the design defect that they do not provide a reliable way to recognize
2816 which coded character sets are in use. There are scores of character
2817 sets which can be represented by a single octet (8-bit byte), whose
2818 union contains many hundreds of characters. Thus any 8-bit coded
2819 character set must assign some of its code points to characters that
2820 differ from those assigned to the same code points in other sets.
2821
2822 This means that a given file's intended encoding cannot be identified
2823 with 100% reliability unless it contains encoding markers such as those
2824 provided by MIME or ISO 2022.
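
As a rough illustration of this ambiguity, the sketch below (Python, with a
hypothetical helper conflicting_octets; it says nothing about how Mule's
detector works) lists the upper-half octets that Latin-1 and Latin-2 assign
to different characters, and shows an ISO 2022 encoding announcing its
character set in-band.

```python
def conflicting_octets(first, second):
    """Return the octets 0xA0..0xFF that the two codecs decode differently."""
    return [
        octet
        for octet in range(0xA0, 0x100)
        if bytes([octet]).decode(first) != bytes([octet]).decode(second)
    ]

conflicts = conflicting_octets("iso8859_1", "iso8859_2")
print(len(conflicts), "octets differ between Latin-1 and Latin-2")

# An ISO 2022 encoding marks its character sets in the data stream itself:
# ISO-2022-JP output begins with ESC $ B, which designates JIS X 0208.
marked = "\u65e5\u672c\u8a9e".encode("iso2022_jp")
has_marker = marked.startswith(b"\x1b$B")
```

A plain 8-bit file carries no such marker, so a reader cannot tell from the
octets alone which of the conflicting interpretations was intended.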
2825
2826 Unification actually makes it more likely that you will have problems of
2827 this kind. Traditionally Mule has been ``helpful'' by simply using an
2828 ISO 2022 universal coding system when the current buffer coding system
2829 cannot handle all the characters in the buffer. This has the effect
2830 that, because the file contains control sequences, it is not recognized
2831 as being in the locale's normal 8-bit encoding. It may be annoying if
2832 you are not a Mule expert, but your data is automatically recoverable
2833 with a tool you already have: Mule.
2834
2835 However, with unification, Mule converts to a single 8-bit character set
2836 when possible. But typically this will @emph{not} be in your usual
2837 locale. I.e., the times that an ISO 8859/1 user will need Unification are
2838 when there are ISO 8859/2 characters in the buffer. But then most
2839 likely the file will be saved in a pure 8-bit encoding that is not ISO
2840 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the
2841 most sophisticated yet available) cannot tell the difference between ISO
2842 8859/1 and ISO 8859/2, and in a Western European locale will choose the
2843 former even though the latter was intended. Even the extension
2844 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at
2845 all accurate in the case of mixed codes.
2846
2847 So now consider adding some additional ISO 8859/1 text to the buffer.
2848 If it includes any ISO 8859/1 codes that are used by different
2849 characters in ISO 8859/2, you now have a file that cannot be
2850 mechanically disentangled. You need a human being who can recognize
2851 that @emph{this is German and Swedish} and stays in Latin-1, while
2852 @emph{that is Polish} and needs to be recoded to Latin-2.
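
The situation can be reproduced in a few lines. This is a sketch in Python
under the assumption that its iso8859_1 and iso8859_2 codecs match the
Latin-1 and Latin-2 coding systems discussed here; the sample words are
arbitrary.

```python
swedish = "sm\u00f6rg\u00e5s"            # a-ring is octet 0xE5 in Latin-1
polish = "\u017c\u00f3\u0142ty"          # z-dot and l-stroke exist only in Latin-2
original = swedish + "\n" + polish

# Each part is written in "its own" 8-bit encoding into a single file.
mixed = swedish.encode("iso8859_1") + b"\n" + polish.encode("iso8859_2")

# Neither single 8-bit interpretation recovers the whole text.
as_latin1 = mixed.decode("iso8859_1")    # Polish line is damaged
as_latin2 = mixed.decode("iso8859_2")    # Swedish line is damaged

# A universal encoding represents both languages without ambiguity.
round_trip = original.encode("utf-8").decode("utf-8")
```

Both decodes succeed without error, which is exactly the problem: the
damage is silent, and only a reader who knows the languages can tell which
line was misinterpreted.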
2853
2854 Moral: switch to a universal coded character set, preferably Unicode
2855 using the UTF-8 transformation format. If you really need the space,
2856 compress your files.
2857
2858
2859 @node Unification Internals, , What Unification Cannot Do for You, Charset Unification
2860 @subsection Internals
2861
2862 No internals documentation yet.
2863
2864 @file{unity-utils.el} provides one utility function.
2865
2866 @defun unity-dump-tables
2867
2868 Dump the temporary table created by loading @file{unity-utils.el}
2869 to @file{unity-tables.el}. Loading the latter file initializes
2870 @samp{unity-equivalences}.
2871 @end defun
2872
2873
2874 @node Charsets and Coding Systems, , Charset Unification, MULE
2875 @section Charsets and Coding Systems
2876
2877 This section provides reference lists of Mule charsets and coding
2878 systems. Mule charsets are typically named by character set and
2879 standard.
2880
2881 @table @strong
2882 @item ASCII variants
2883
2884 Identification of equivalent characters in these sets is not properly
2885 implemented. Unification does not distinguish the two charsets.
2886
2887 @samp{ascii} @samp{latin-jisx0201}
2888
2889 @item Extended Latin
2890
2891 Characters from the following ISO 2022 conformant charsets are
2892 identified with equivalents in other charsets in the group by
2893 Unification.
2894
2895 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2}
2896 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9}
2897 @samp{latin-iso8859-13} @samp{latin-iso8859-16}
2898
2899 The following charsets are Latin variants which are not understood by
2900 Unification. In addition, many of the Asian language standards provide
2901 ASCII, at least, and sometimes other Latin characters. None of these
2902 are identified with their ISO 8859 equivalents.
2903
2904 @samp{vietnamese-viscii-lower}
2905 @samp{vietnamese-viscii-upper}
2906
2907 @item Other character sets
2908
2909 @samp{arabic-1-column}
2910 @samp{arabic-2-column}
2911 @samp{arabic-digit}
2912 @samp{arabic-iso8859-6}
2913 @samp{chinese-big5-1}
2914 @samp{chinese-big5-2}
2915 @samp{chinese-cns11643-1}
2916 @samp{chinese-cns11643-2}
2917 @samp{chinese-cns11643-3}
2918 @samp{chinese-cns11643-4}
2919 @samp{chinese-cns11643-5}
2920 @samp{chinese-cns11643-6}
2921 @samp{chinese-cns11643-7}
2922 @samp{chinese-gb2312}
2923 @samp{chinese-isoir165}
2924 @samp{cyrillic-iso8859-5}
2925 @samp{ethiopic}
2926 @samp{greek-iso8859-7}
2927 @samp{hebrew-iso8859-8}
2928 @samp{ipa}
2929 @samp{japanese-jisx0208}
2930 @samp{japanese-jisx0208-1978}
2931 @samp{japanese-jisx0212}
2932 @samp{katakana-jisx0201}
2933 @samp{korean-ksc5601}
2934 @samp{sisheng}
2935 @samp{thai-tis620}
2936 @samp{thai-xtis}
2937
2938 @item Non-graphic charsets
2939
2940 @samp{control-1}
2941 @end table
2942
2943 @table @strong
2944 @item No conversion
2945
2946 Some of these coding systems may specify EOL conventions. Note that
2947 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022
2948 coding system. Although unification attempts to compensate for this, it
2949 is possible that the @samp{iso-8859-1} coding system will behave
2950 differently from other ISO 8859 coding systems.
2951
2952 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1}
2953
2954 @item Latin coding systems
2955
2956 These coding systems are all single-byte, 8-bit ISO 2022 coding systems,
2957 combining ASCII in the GL register (bytes with high-bit clear) and an
2958 extended Latin character set in the GR register (bytes with high-bit set).
2959
2960 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4}
2961 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16}
2962
2963 The following coding systems are single-byte, 8-bit coding systems that do not
2964 conform to international standards. They should be avoided in all
2965 potentially multilingual contexts, including any text distributed over
2966 the Internet and World Wide Web.
2967
2968 @samp{windows-1251}
2969
2970 @item Multilingual coding systems
2971
2972 The following ISO-2022-based coding systems are useful for multilingual
2973 text.
2974
2975 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit}
2976 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2}
2977
2978 XEmacs also supports Unicode with the Mule-UCS package. Its coding
2979 systems are the preferred ones for multilingual use. (There is a possible
2980 exception for texts that mix several Asian ideographic character sets.)
2981
2982 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le}
2983 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe}
2984 @samp{utf-8} @samp{utf-8-ws}
2985
2986 Development versions of XEmacs (the 21.5 series) support Unicode
2987 internally, with (at least) the following coding systems implemented:
2988
2989 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le}
2990 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom}
2991
2992 @item Asian ideographic languages
2993
2994 The following coding systems are based on ISO 2022, and are more or less
2995 suitable for encoding multilingual texts. They all can represent ASCII
2996 at least, and sometimes several other foreign character sets, without
2997 resort to arbitrary ISO 2022 designations. However, these subsets are
2998 not identified with the corresponding national standards in XEmacs Mule.
2999
3000 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312}
3001 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc}
3002 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp}
3003 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr}
3004 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1}
3005
3006 The following coding systems cannot be used for general multilingual
3007 text and do not cooperate well with other coding systems.
3008
3009 @samp{big5} @samp{shift_jis}
3010
3011 @item Other languages
3012
3013 The following coding systems are based on ISO 2022. Though none of them
3014 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up
3015 to 21.4 defaults to) use of ISO 2022 control sequences to designate
3016 other character sets for inclusion in the text.
3017
3018 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8}
3019 @samp{ctext-hebrew}
3020
3021 The following coding systems do not conform to ISO 2022 and
3022 thus cannot be safely used in a multilingual context.
3023
3024 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr}
3025 @samp{viscii} @samp{vscii}
3026
3027 @item Special coding systems
3028
3029 Mule uses the following coding systems for special purposes.
3030
3031 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted}
3032
3033 @samp{escape-quoted} is especially important, as it is used internally
3034 as the coding system for autosaved data.
3035
3036 The following coding systems are aliases for others, and are used for
3037 communication with the host operating system.
3038
3039 @samp{file-name} @samp{keyboard} @samp{terminal}
3040
3041 @end table
3042
3043 Mule detection of coding systems is actually limited to detection of
3044 classes of coding systems called @dfn{coding categories}. These coding
3045 categories are identified by the ISO 2022 control sequences they use, if
3046 any, by their conformance to ISO 2022 restrictions on code points that
3047 may be used, and by characteristic patterns of use of 8-bit code points.
3048
3049 @samp{no-conversion}
3050 @samp{utf-8}
3051 @samp{ucs-4}
3052 @samp{iso-7}
3053 @samp{iso-lock-shift}
3054 @samp{iso-8-1}
3055 @samp{iso-8-2}
3056 @samp{iso-8-designate}
3057 @samp{shift-jis}
3058 @samp{big5}
3059
3060
3061 @c end of mule.texi
3062