comparison man/lispref/mule.texi @ 1183:c1553814932e
[xemacs-hg @ 2003-01-03 12:12:30 by stephent]
various docs
<873coa5unb.fsf@tleepslib.sk.tsukuba.ac.jp>
<87r8bu4emz.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Fri, 03 Jan 2003 12:12:40 +0000 |
parents | 37e56e920ac5 |
children | 11ff4edb6bb7 |
1182:7d696106ffe9 | 1183:c1553814932e |
---|---|
22 * Composite Characters:: Making new characters by overstriking other ones. | 22 * Composite Characters:: Making new characters by overstriking other ones. |
23 * Coding Systems:: Ways of representing a string of chars using integers. | 23 * Coding Systems:: Ways of representing a string of chars using integers. |
24 * CCL:: A special language for writing fast converters. | 24 * CCL:: A special language for writing fast converters. |
25 * Category Tables:: Subdividing charsets into groups. | 25 * Category Tables:: Subdividing charsets into groups. |
26 * Unicode Support:: The universal coded character set. | 26 * Unicode Support:: The universal coded character set. |
27 * Charset Unification:: Handling overlapping character sets. | |
28 * Charsets and Coding Systems:: Tables and reference information. | |
27 @end menu | 29 @end menu |
28 | 30 |
29 @node Internationalization Terminology, Charsets, , MULE | 31 @node Internationalization Terminology, Charsets, , MULE |
30 @section Internationalization Terminology | 32 @section Internationalization Terminology |
31 | 33 |
2070 Valid values are @code{nil} or a bit vector of size 95. | 2072 Valid values are @code{nil} or a bit vector of size 95. |
2071 @end defun | 2073 @end defun |
2072 | 2074 |
2073 | 2075 |
2074 @c Added 2002-03-13 sjt | 2076 @c Added 2002-03-13 sjt |
2075 @node Unicode Support, , Category Tables, MULE | 2077 @node Unicode Support, Charset Unification, Category Tables, MULE |
2076 @section Unicode Support | 2078 @section Unicode Support |
2077 @cindex unicode | 2079 @cindex unicode |
2078 @cindex utf-8 | 2080 @cindex utf-8 |
2079 @cindex utf-16 | 2081 @cindex utf-16 |
2080 @cindex ucs-2 | 2082 @cindex ucs-2 |
2179 The charset codepoint is a Big Five codepoint; convert it to the | 2181 The charset codepoint is a Big Five codepoint; convert it to the |
2180 proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'. | 2182 proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'. |
2181 @end table | 2183 @end table |
2182 @end defun | 2184 @end defun |
2183 | 2185 |
2186 | |
2187 @node Charset Unification, Charsets and Coding Systems, Unicode Support, MULE | |
2188 @section Character Set Unification | |
2189 | |
2190 Mule suffers from a design defect that causes it to consider the ISO | |
2191 Latin character sets to be disjoint. This results in oddities such as | |
2192 files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO | |
2193 2022 control sequences to switch between them, as well as more plausible | |
2194 but often unnecessary combinations like ISO 8859/1 with ISO 8859/2. | |
2195 This can be very annoying when sending messages or even in simple | |
2196 editing on a single host. Unification works around the problem by | |
2197 converting as many characters as possible to use a single Latin coded | |
2198 character set before saving the buffer. | |
2199 | |
2200 This node and its children were ripp'd untimely from | |
2201 @file{latin-unity.texi}, and have been quickly converted for use here. | |
2202 However as APIs are likely to diverge, beware of inaccuracies. Please | |
2203 report any you discover with @kbd{M-x report-xemacs-bug RET}, as well | |
2204 as any ambiguities or downright unintelligible passages. | |
2205 | |
2206 A lot of the stuff here doesn't belong here; it belongs in the | |
2207 @ref{Top, , , xemacs, XEmacs User's Manual}. Report those as bugs, | |
2208 too, preferably with patches. | |
2209 | |
2210 @menu | |
2211 * Overview:: Unification history and general information. | |
2212 * Usage:: An overview of the operation of Unification. | |
2213 * Configuration:: Configuring Unification for use. | |
2214 * Theory of Operation:: How Unification works. | |
2215 * What Unification Cannot Do for You:: Inherent problems of 8-bit charsets. | |
2216 * Charsets and Coding Systems:: Reference lists with annotations. | |
2217 * Internals:: Utilities and implementation details. | |
2218 @end menu | |
2219 | |
2220 @node Overview, Usage, Charset Unification, Charset Unification | |
2221 @subsection An Overview of Unification | |
2222 | |
2223 Mule suffers from a design defect that causes it to consider the ISO | |
2224 Latin character sets to be disjoint. This manifests itself when a user | |
2225 enters characters using input methods associated with different coded | |
2226 character sets into a single buffer. | |
2227 | |
2228 A very important example involves email. Many sites, especially in the | |
2229 U.S., default to use of the ISO 8859/1 coded character set (also called | |
2230 ``Latin 1,'' though these are somewhat different concepts). However, | |
2231 ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the | |
2232 Euro has become the official currency of most countries in Europe, this | |
2233 is unsatisfactory (and in practice, useless). So Europeans generally | |
2234 use ISO 8859/15, which is nearly identical to ISO 8859/1 for most | |
2235 languages, except that it substitutes EURO SIGN for CURRENCY SIGN. | |
2236 | |
2237 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
2238 into a message composition buffer, and enters some text including the | |
2239 Euro sign. Then Mule will consider the buffer to contain both ISO | |
2240 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
2241 programmed) send the message as a multipart mixed MIME body! | |
2242 | |
2243 This is clearly stupid. What is not as obvious is that, just as any | |
2244 European can include American English in their text because ASCII is a | |
2245 subset of ISO 8859/15, most European languages which use Latin | |
2246 characters (eg, German and Polish) can typically be mixed while using | |
2247 only one Latin coded character set (in this case, ISO 8859/2). However, | |
2248 this often depends on exactly what text is to be encoded. | |
2249 | |
2250 Unification works around the problem by converting as many characters as | |
2251 possible to use a single Latin coded character set before saving the | |
2252 buffer. | |
2253 | |
2254 @node Usage, Configuration, Overview, Charset Unification | |
2255 @subsection Operation of Unification | |
2256 | |
2257 Normally, Unification works in the background by installing | |
2258 @code{unity-sanity-check} on @code{write-region-pre-hook}. This is | |
2259 done by default for the ISO 8859 Latin family of character sets. The | |
2260 user activates this functionality for other character set families by | |
2261 invoking @code{enable-unification}, either interactively or in her | |
2262 init file. @xref{Init File, , , xemacs}. Unification can be | |
2263 deactivated by invoking @code{disable-unification}. | |
2264 | |
2265 Unification also provides a few functions for remapping or recoding the | |
2266 buffer by hand. To @dfn{remap} a character means to change the buffer | |
2267 representation of the character by using another coded character set. | |
2268 Remapping never changes the identity of the character, but may involve | |
2269 altering the code point of the character. To @dfn{recode} a character | |
2270 means to simply change the coded character set. Recoding never alters | |
2271 the code point of the character, but may change the identity of the | |
2272 character. @xref{Theory of Operation}. | |
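The difference between remapping and recoding can be sketched outside XEmacs. The following Python fragment is an illustration under stated assumptions, not the Unification implementation: it models a buffer character as a hypothetical (charset, byte) pair, uses Python's standard ISO 8859 codecs as stand-ins for Mule charsets, and the names `identity`, `remap`, and `recode` are invented for the example.

```python
# Hypothetical model: a character is a (charset, byte) pair.  The charset is a
# Python ISO 8859 codec name standing in for a Mule charset, and the byte is
# the 8-bit code (Mule itself stores the 7-bit position, e.g. #x73 for #xF3).

def identity(char):
    """Return the abstract (Unicode) character that the pair denotes."""
    charset, byte = char
    return bytes([byte]).decode(charset)

def remap(char, target):
    """Keep the identity; the code point may change.
    A character with no equivalent in TARGET is left as-is."""
    try:
        return (target, identity(char).encode(target)[0])
    except UnicodeEncodeError:
        return char

def recode(char, wrong, right):
    """Keep the code point; the identity may change.
    Only characters from WRONG are touched."""
    charset, byte = char
    return (right, byte) if charset == wrong else char

o_acute = ("iso8859-1", 0xF3)   # LATIN SMALL LETTER O WITH ACUTE
s_caron = ("iso8859-2", 0xB9)   # LATIN SMALL LETTER S WITH CARON

# Remapping: same letter, different charset, and here a different code point.
remapped = remap(s_caron, "iso8859-15")

# Recoding: same code point, but o-tilde becomes o-double-acute.
recoded = recode(("iso8859-1", 0xF5), "iso8859-1", "iso8859-2")
```

In this model `remap` can never change what the reader sees, while `recode` is the operation that repairs text decoded with the wrong charset, at the cost of changing any character of the source charset that happens to be in the region.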
2273 | |
2274 There are a few variables which determine which coding systems are | |
2275 always acceptable to Unification: @code{unity-ucs-list}, | |
2276 @code{unity-preferred-coding-system-list}, and | |
2277 @code{unity-preapproved-coding-system-list}. The latter two default | |
2278 to @code{()}, and should probably be avoided because they short-circuit | |
2279 the sanity check. If you find you need to use them, consider reporting | |
2280 it as a bug or request for enhancement. Because they seem unsafe, the | |
2281 recommended interface is likely to change. | |
2282 | |
2283 @menu | |
2284 * Basic Functionality:: User interface and customization. | |
2285 * Interactive Usage:: Treating text by hand. | |
2286 Also documents the hook function(s). | |
2287 @end menu | |
2288 | |
2289 | |
2290 @node Basic Functionality, Interactive Usage, , Usage | |
2291 @subsubsection Basic Functionality | |
2292 | |
2293 These functions and user options initialize and configure Unification. | |
2294 In normal use, none of these should be needed. | |
2295 | |
2296 @strong{These APIs are certain to change.} | |
2297 | |
2298 @defun enable-unification | |
2299 Set up hooks and initialize variables for latin-unity. | |
2300 | |
2301 There are no arguments. | |
2302 | |
2303 This function is idempotent. It will reinitialize any hooks or variables | |
2304 that are not in initial state. | |
2305 @end defun | |
2306 | |
2307 @defun disable-unification | |
2308 There are no arguments. | |
2309 | |
2310 Clean up hooks and void variables used by latin-unity. | |
2311 @end defun | |
2312 | |
2313 @defopt unity-ucs-list | |
2314 List of coding systems considered to be universal. | |
2315 | |
2316 The default value is @code{'(utf-8 iso-2022-7 ctext escape-quoted)}. | |
2317 | |
2318 Order matters; coding systems earlier in the list will be preferred when | |
2319 recommending a coding system. These coding systems will not be used | |
2320 without querying the user (unless they are also present in | |
2321 @code{unity-preapproved-coding-system-list}), and follow the | |
2322 @code{unity-preferred-coding-system-list} in the list of suggested | |
2323 coding systems. | |
2324 | |
2325 If none of the preferred coding systems are feasible, the first in | |
2326 this list will be the default. | |
2327 | |
2328 Notes on certain coding systems: @code{escape-quoted} is a special | |
2329 coding system used for autosaves and compiled Lisp in Mule. You should | |
2330 @c #### fix in latin-unity.texi | |
2331 never delete this, although it is rare that a user would want to use it | |
2332 directly. Unification does not try to be ``smart'' about other general | |
2333 ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized | |
2334 as equivalent to @code{iso-2022-7}.) If your preferred coding system is | |
2335 one of these, you may consider adding it to @code{unity-ucs-list}. | |
2336 However, this will typically have the side effect that (eg) ISO 8859/1 | |
2337 files will be saved in 7-bit form with ISO 2022 escape sequences. | |
2338 @end defopt | |
2339 | |
2340 Coding systems which are not Latin and not in | |
2341 @code{unity-ucs-list} are handled by short circuiting checks of | |
2342 coding system against the next two variables. | |
2343 | |
2344 @defopt unity-preapproved-coding-system-list | |
2345 List of coding systems used without querying the user if feasible. | |
2346 | |
2347 The default value is @samp{(buffer-default preferred)}. | |
2348 | |
2349 The first feasible coding system in this list is used. The special values | |
2350 @samp{preferred} and @samp{buffer-default} may be present: | |
2351 | |
2352 @table @code | |
2353 @item buffer-default | |
2354 Use the coding system used by @samp{write-region}, if feasible. | |
2355 | |
2356 @item preferred | |
2357 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
2358 @end table | |
2359 | |
2360 ``Feasible'' means that all characters in the buffer can be represented by | |
2361 the coding system. Coding systems in @samp{unity-ucs-list} are | |
2362 always considered feasible. Other feasible coding systems are computed | |
2363 by @samp{unity-representations-feasible-region}. | |
2364 | |
2365 Note that the first universal coding system in this list shadows all | |
2366 other coding systems. In particular, if your preferred coding system is | |
2367 a universal coding system, and @code{preferred} is a member of this | |
2368 list, unification will blithely convert all your files to that coding | |
2369 system. This is considered a feature, but it may surprise most users. | |
2370 Users who don't like this behavior should put @code{preferred} in | |
2371 @code{unity-preferred-coding-system-list}. | |
2372 @end defopt | |
2373 | |
2374 @defopt unity-preferred-coding-system-list | |
2375 @c #### fix in latin-unity.texi | |
2376 List of coding systems suggested to the user if feasible. | |
2377 | |
2378 The default value is @samp{(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 | |
2379 iso-8859-4 iso-8859-9)}. | |
2380 | |
2381 If none of the coding systems in | |
2382 @c #### fix in latin-unity.texi | |
2383 @code{unity-preapproved-coding-system-list} are feasible, this list | |
2384 will be recommended to the user, followed by the | |
2385 @code{unity-ucs-list}. The first coding system in this list is default. The | |
2386 special values @samp{preferred} and @samp{buffer-default} may be | |
2387 present: | |
2388 | |
2389 @table @code | |
2390 @item buffer-default | |
2391 Use the coding system used by @samp{write-region}, if feasible. | |
2392 | |
2393 @item preferred | |
2394 Use the coding system specified by @samp{prefer-coding-system} if feasible. | |
2395 @end table | |
2396 | |
2397 ``Feasible'' means that all characters in the buffer can be represented by | |
2398 the coding system. Coding systems in @samp{unity-ucs-list} are | |
2399 always considered feasible. Other feasible coding systems are computed | |
2400 by @samp{unity-representations-feasible-region}. | |
2401 @end defopt | |
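The way the three lists interact can be sketched as follows. This is a hedged reading of the descriptions above, not the package's code: `is_feasible` and `choose_coding_system` are hypothetical names, Python codec names stand in for Mule coding systems, the special values `buffer-default` and `preferred` are not modeled, and the universal list is assumed to be non-empty.

```python
def is_feasible(text, coding_system, ucs_list):
    """Universal coding systems are always considered feasible;
    any other coding system must be able to encode every character."""
    if coding_system in ucs_list:
        return True
    try:
        text.encode(coding_system)
        return True
    except UnicodeEncodeError:
        return False

def choose_coding_system(text, preapproved, preferred, ucs_list):
    """Return (coding_system, None) when a preapproved system is feasible,
    otherwise (default, suggestions) for a query to the user."""
    for cs in preapproved:
        if is_feasible(text, cs, ucs_list):
            return cs, None  # used without querying the user
    # Feasible preferred systems come first, followed by the universal ones.
    suggestions = [cs for cs in preferred if is_feasible(text, cs, ucs_list)]
    suggestions += [cs for cs in ucs_list if cs not in suggestions]
    return suggestions[0], suggestions  # first suggestion is the default

preferred = ["iso8859-1", "iso8859-15", "iso8859-2"]
ucs = ["utf-8"]

# EURO SIGN rules out ISO 8859/1 and ISO 8859/2, so ISO 8859/15 is the default.
default, offered = choose_coding_system("10 \u20ac", [], preferred, ucs)
```

The sketch also shows the shadowing effect noted above: a universal coding system placed in the preapproved list is always feasible, so nothing after it in that list is ever considered.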
2402 | |
2403 | |
2404 @defvar unity-iso-8859-1-aliases | |
2405 List of coding systems to be treated as aliases of ISO 8859/1. | |
2406 | |
2407 The default value is @code{'(iso-8859-1)}. | |
2408 | |
2409 This is not a user variable; to customize input of coding systems or | |
2410 charsets, use @samp{unity-coding-system-alias-alist} or | |
2411 @samp{unity-charset-alias-alist}. | |
2412 @end defvar | |
2413 | |
2414 | |
2415 @node Interactive Usage, , Basic Functionality, Usage | |
2416 @subsubsection Interactive Usage | |
2417 | |
2418 First, the hook function @code{unity-sanity-check} is documented. | |
2419 (It is placed here because it is not an interactive function, and there | |
2420 is not yet a programmer's section of the manual.) | |
2421 | |
2422 These functions provide access to internal functionality (such as the | |
2423 remapping function) and to extra functionality (the recoding functions | |
2424 and the test function). | |
2425 | |
2426 | |
2427 @defun unity-sanity-check begin end filename append visit lockname &optional coding-system | |
2428 | |
2429 Check if @var{coding-system} can represent all characters between | |
2430 @var{begin} and @var{end}. | |
2431 | |
2432 For compatibility with old broken versions of @code{write-region}, | |
2433 @var{coding-system} defaults to @code{buffer-file-coding-system}. | |
2434 @var{filename}, @var{append}, @var{visit}, and @var{lockname} are | |
2435 ignored. | |
2436 | |
2437 Return @code{nil} if @code{buffer-file-coding-system} is not (ISO-2022-compatible) | |
2438 Latin. If @code{buffer-file-coding-system} is safe for the charsets | |
2439 actually present in the buffer, return it. Otherwise, ask the user to | |
2440 choose a coding system, and return that. | |
2441 | |
2442 This function does @emph{not} do the safe thing when | |
2443 @code{buffer-file-coding-system} is nil (aka no-conversion). It | |
2444 considers that ``non-Latin,'' and passes it on to the Mule detection | |
2445 mechanism. | |
2446 | |
2447 This function is intended for use as a @code{write-region-pre-hook}. It | |
2448 does nothing except return @var{coding-system} if @code{write-region} | |
2449 handlers are inhibited. | |
2450 @end defun | |
2451 | |
2452 @defun unity-buffer-representations-feasible | |
2453 | |
2454 There are no arguments. | |
2455 | |
2456 Apply @code{unity-region-representations-feasible} to the current buffer. | |
2457 @end defun | |
2458 | |
2459 @defun unity-region-representations-feasible begin end &optional buf | |
2460 | |
2461 Return character sets that can represent the text from @var{begin} to @var{end} in @var{buf}. | |
2462 | |
2463 @var{buf} defaults to the current buffer. Called interactively, will be | |
2464 applied to the region. Function assumes @var{begin} <= @var{end}. | |
2465 | |
2466 The return value is a cons. The car is the list of character sets | |
2467 that can individually represent all of the non-ASCII portion of the | |
2468 buffer, and the cdr is the list of character sets that can | |
2469 individually represent all of the ASCII portion. | |
2470 | |
2471 The following is taken from a comment in the source. Please refer to | |
2472 the source to be sure of an accurate description. | |
2473 | |
2474 The basic algorithm is to map over the region, compute the set of | |
2475 charsets that can represent each character (the ``feasible charset''), | |
2476 and take the intersection of those sets. | |
2477 | |
2478 The current implementation takes advantage of the fact that ASCII | |
2479 characters are common and cannot change asciisets. Then using | |
2480 skip-chars-forward makes motion over ASCII subregions very fast. | |
2481 | |
2482 This same strategy could be applied generally by precomputing classes | |
2483 of characters equivalent according to their effect on latinsets, and | |
2484 adding a whole class to the skip-chars-forward string once a member is | |
2485 found. | |
2486 | |
2487 Probably efficiency is a function of the number of characters matched, | |
2488 or maybe the length of the match string? With @code{skip-category-forward} | |
2489 over a precomputed category table it should be really fast. In practice | |
2490 for Latin character sets there are only 29 classes. | |
2491 @end defun | |
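The basic algorithm described for the feasibility computation (find the charsets that can represent each character, then intersect those sets) can be sketched in Python. This is a minimal illustration, assuming Python's ISO 8859 codecs as stand-ins for Mule charsets; `LATIN_CHARSETS`, `charsets_for`, and `feasible_charsets` are hypothetical names, and only the non-ASCII half of the returned cons is modeled.

```python
# Stand-ins for the Mule Latin charsets (an assumption of this sketch).
LATIN_CHARSETS = ["iso8859-1", "iso8859-2", "iso8859-3", "iso8859-4",
                  "iso8859-9", "iso8859-15"]

def charsets_for(ch):
    """The set of charsets that can represent a single character."""
    feasible = set()
    for cs in LATIN_CHARSETS:
        try:
            ch.encode(cs)
            feasible.add(cs)
        except UnicodeEncodeError:
            pass
    return feasible

def feasible_charsets(text):
    """Intersect the per-character sets over the whole text.
    ASCII characters are skipped: every charset here contains them,
    so they cannot narrow the result."""
    result = set(LATIN_CHARSETS)
    for ch in text:
        if ord(ch) < 0x80:
            continue
        result &= charsets_for(ch)
    return result

# German and Polish words together: only ISO 8859/2 covers all of them.
mixed = feasible_charsets("Gr\u00fc\u00dfe, \u0142\u00f3d\u017a")
```

An empty result means that no single Latin charset in the list can represent the text, which is the case where only a universal coding system remains.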
2492 | |
2493 @defun unity-remap-region begin end character-set &optional coding-system | |
2494 | |
2495 Remap characters between @var{begin} and @var{end} to equivalents in | |
2496 @var{character-set}. Optional argument @var{coding-system} may be a | |
2497 coding system name (a symbol) or nil. Characters with no equivalent are | |
2498 left as-is. | |
2499 | |
2500 When called interactively, @var{begin} and @var{end} are set to the | |
2501 beginning and end, respectively, of the active region, and the function | |
2502 prompts for @var{character-set}. The function does completion, knows | |
2503 how to guess a character set name from a coding system name, and also | |
2504 provides some common aliases. See @code{unity-guess-charset}. | |
2505 There is no way to specify @var{coding-system}, as it has no useful | |
2506 function interactively. | |
2507 | |
2508 Return @var{coding-system} if @var{coding-system} can encode all | |
2509 characters in the region, t if @var{coding-system} is nil and the coding | |
2510 system with G0 = 'ascii and G1 = @var{character-set} can encode all | |
2511 characters, and otherwise nil. Note that a non-null return does | |
2512 @emph{not} mean it is safe to write the file, only the specified region. | |
2513 (This behavior is useful for multipart MIME encoding and the like.) | |
2514 | |
2515 Note: by default this function is quite fascist about universal coding | |
2516 systems. It only admits @samp{utf-8}, @samp{iso-2022-7}, and | |
2517 @samp{ctext}. Customize @code{unity-approved-ucs-list} to change | |
2518 this. | |
2519 | |
2520 This function remaps characters that are artificially distinguished by Mule | |
2521 internal code. It may change the code point as well as the character set. | |
2522 To recode characters that were decoded in the wrong coding system, use | |
2523 @code{unity-recode-region}. | |
2524 @end defun | |
2525 | |
2526 @defun unity-recode-region begin end wrong-cs right-cs | |
2527 | |
2528 Recode characters between @var{begin} and @var{end} from @var{wrong-cs} | |
2529 to @var{right-cs}. | |
2530 | |
2531 @var{wrong-cs} and @var{right-cs} are character sets. Characters retain | |
2532 the same code point but the character set is changed. Only characters | |
2533 from @var{wrong-cs} are changed to @var{right-cs}. The identity of the | |
2534 character may change. Note that this could be dangerous, if characters | |
2535 whose identities you do not want changed are included in the region. | |
2536 This function cannot guess which characters you want changed, and which | |
2537 should be left alone. | |
2538 | |
2539 When called interactively, @var{begin} and @var{end} are set to the | |
2540 beginning and end, respectively, of the active region, and the function | |
2541 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
2542 completion, knows how to guess a character set name from a coding system | |
2543 name, and also provides some common aliases. See | |
2544 @code{unity-guess-charset}. | |
2545 | |
2546 Another way to accomplish this, but using coding systems rather than | |
2547 character sets to specify the desired recoding, is | |
2548 @samp{unity-recode-coding-region}. That function may be faster | |
2549 but is somewhat more dangerous, because it may recode more than one | |
2550 character set. | |
2551 | |
2552 To change from one Mule representation to another without changing identity | |
2553 of any characters, use @samp{unity-remap-region}. | |
2554 @end defun | |
2555 | |
2556 @defun unity-recode-coding-region begin end wrong-cs right-cs | |
2557 | |
2558 Recode text between @var{begin} and @var{end} from @var{wrong-cs} to | |
2559 @var{right-cs}. | |
2560 | |
2561 @var{wrong-cs} and @var{right-cs} are coding systems. Characters retain | |
2562 the same code point but the character set is changed. The identity of | |
2563 characters may change. This is an inherently dangerous function; | |
2564 multilingual text may be recoded in unexpected ways. #### It's also | |
2565 dangerous because the coding systems are not sanity-checked in the | |
2566 current implementation. | |
2567 | |
2568 When called interactively, @var{begin} and @var{end} are set to the | |
2569 beginning and end, respectively, of the active region, and the function | |
2570 prompts for @var{wrong-cs} and @var{right-cs}. The function does | |
2571 completion, knows how to guess a coding system name from a character set | |
2572 name, and also provides some common aliases. See | |
2573 @code{unity-guess-coding-system}. | |
2574 | |
2575 Another, safer, way to accomplish this, using character sets rather | |
2576 than coding systems to specify the desired recoding, is to use | |
2577 @c #### fixme in latin-unity.texi | |
2578 @code{unity-recode-region}. | |
2579 | |
2580 To change from one Mule representation to another without changing identity | |
2581 of any characters, use @code{unity-remap-region}. | |
2582 @end defun | |
2583 | |
2584 Helper functions for input of coding system and character set names. | |
2585 | |
2586 @defun unity-guess-charset candidate | |
2587 Guess a charset based on the symbol @var{candidate}. | |
2588 | |
2589 @var{candidate} itself is not tried as the value. | |
2590 | |
2591 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
2592 the values in @samp{unity-charset-alias-alist}. | |
2593 @end defun | |
2594 | |
2595 @defun unity-guess-coding-system candidate | |
2596 Guess a coding system based on the symbol @var{candidate}. | |
2597 | |
2598 @var{candidate} itself is not tried as the value. | |
2599 | |
2600 Uses the natural mapping in @samp{unity-cset-codesys-alist}, and | |
2601 the values in @samp{unity-coding-system-alias-alist}. | |
2602 @end defun | |
2603 | |
2604 @defun unity-example | |
2605 | |
2606 A cheesy example for Unification. | |
2607 | |
2608 At present it just makes a multilingual buffer. To test, setq | |
2609 buffer-file-coding-system to some value, make the buffer dirty (eg | |
2610 with RET BackSpace), and save. | |
2611 @end defun | |
2612 | |
2613 | |
2614 @node Configuration, Theory of Operation, Usage, Charset Unification | |
2615 @subsection Configuring Unification for Use | |
2616 | |
2617 If you want Unification to be automatically initialized, invoke | |
2618 @samp{enable-unification} with no arguments in your init file. | |
2619 @xref{Init File, , , xemacs}. If you are using GNU Emacs or an XEmacs | |
2620 earlier than 21.1, you should also load @file{auto-autoloads} using the | |
2621 full path (@emph{never} @samp{require} @file{auto-autoloads} libraries). | |
2622 | |
2623 You may wish to define aliases for commonly used character sets and | |
2624 coding systems for convenience in input. | |
2625 | |
2626 @defopt unity-charset-alias-alist | |
2627 Alist mapping aliases to Mule charset names (symbols). | |
2628 | |
2629 The default value is | |
2630 @example | |
2631 ((latin-1 . latin-iso8859-1) | |
2632 (latin-2 . latin-iso8859-2) | |
2633 (latin-3 . latin-iso8859-3) | |
2634 (latin-4 . latin-iso8859-4) | |
2635 (latin-5 . latin-iso8859-9) | |
2636 (latin-9 . latin-iso8859-15) | |
2637 (latin-10 . latin-iso8859-16)) | |
2638 @end example | |
2639 | |
2640 If a charset does not exist on your system, it will not complete and you | |
2641 will not be able to enter it in response to prompts. A real charset | |
2642 with the same name as an alias in this list will shadow the alias. | |
2643 @end defopt | |
2644 | |
2645 @defopt unity-coding-system-alias-alist | |
2646 Alist mapping aliases to Mule coding system names (symbols). | |
2647 | |
2648 The default value is @samp{nil}. | |
2649 @end defopt | |
2650 | |
2651 | |
2652 @node Theory of Operation, What Unification Cannot Do for You, Configuration, Charset Unification | |
2653 @subsection Theory of Operation | |
2654 | |
2655 Standard encodings suffer from the design defect that they do not | |
2656 provide a reliable way to recognize which coded character sets are in use. | |
2657 @xref{What Unification Cannot Do for You}. There are scores of | |
2658 character sets which can be represented by a single octet (8-bit byte), | |
2659 whose union contains many hundreds of characters. Obviously this | |
2660 results in great confusion, since you can't tell the players without a | |
2661 scorecard, and there is no scorecard. | |
2662 | |
2663 There are two ways to solve this problem. The first is to create a | |
2664 universal coded character set. This is the concept behind Unicode. | |
2665 However, there have been satisfactory (nearly) universal character sets | |
2666 for several decades, but even today many Westerners resist using Unicode | |
2667 because they consider its space requirements excessive. On the other | |
2668 hand, Asians dislike Unicode because they consider it to be incomplete. | |
2669 (This is partly, but not entirely, political.) | |
2670 | |
2671 In any case, Unicode only solves the internal representation problem. | |
2672 Many data sets will contain files in ``legacy'' encodings, and Unicode | |
2673 does not help distinguish among them. | |
2674 | |
2675 The second approach is to embed information about the encodings used in | |
2676 a document in its text. This approach is taken by the ISO 2022 | |
2677 standard. This would solve the problem completely from the users' point | |
2678 of view, except that ISO 2022 is basically not implemented at all, in the | |
2679 sense that few applications or systems implement more than a small | |
2680 subset of ISO 2022 functionality. This is due to the fact that | |
2681 mono-literate users object to the presence of escape sequences in their | |
2682 texts (which they, with some justification, consider data corruption). | |
2683 Programmers are more than willing to cater to these users, since | |
2684 implementing ISO 2022 is a painstaking task. | |
2685 | |
2686 In fact, Emacs/Mule adopts both of these approaches. Internally it uses | |
2687 a universal character set, @dfn{Mule code}. Externally it uses ISO 2022 | |
2688 techniques both to save files in forms robust to encoding issues, and as | |
2689 hints when attempting to ``guess'' an unknown encoding. However, Mule | |
2690 suffers from a design defect: where ISO 2022 attaches character set | |
2691 information to a run of characters by introducing the run with a control | |
2692 sequence, Mule embeds that information in each character. That causes Mule to | |
2693 consider the ISO Latin character sets to be disjoint. This manifests | |
2694 itself when a user enters characters using input methods associated with | |
2695 different coded character sets into a single buffer. | |
2696 | |
2697 There are two problems stemming from this design. First, Mule | |
2698 represents the same character in different ways. Abstractly, 'ó' | |
2699 (LATIN SMALL LETTER O WITH ACUTE) can get represented as | |
2700 [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like | |
2701 'óó' in the display might actually be represented [latin-iso8859-1 | |
2702 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B | |
2703 #xF3 ESC - A] in the file. In some cases this treatment would be | |
2704 appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 | |
2705 (the CJK ideographic character meaning ``one'')), and although arguably | |
2706 incorrect it is convenient when mixing the CJK scripts. But in the case | |
2707 of the Latin scripts this is wrong. | |
2708 | |
2709 Worse yet, it is very likely to occur when mixing ``different'' encodings | |
2710 (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code | |
2711 points that are almost never used. A very important example involves | |
2712 email. Many sites, especially in the U.S., default to use of the ISO | |
2713 8859/1 coded character set (also called ``Latin 1,'' though these are | |
2714 somewhat different concepts). However, ISO 8859/1 provides a generic | |
2715 CURRENCY SIGN character. Now that the Euro has become the official | |
2716 currency of most countries in Europe, this is unsatisfactory (and in | |
2717 practice, useless). So Europeans generally use ISO 8859/15, which is | |
2718 nearly identical to ISO 8859/1 for most languages, except that it | |
2719 substitutes EURO SIGN for CURRENCY SIGN. | |
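How few code points separate the two encodings can be checked directly. The sketch below assumes Python's `iso8859-1` and `iso8859-15` codecs as stand-ins for the two coded character sets; `differing_code_points` is a hypothetical helper written for this illustration.

```python
def differing_code_points(cs_a, cs_b):
    """Return (byte, char_a, char_b) for each byte in #xA0..#xFF that the
    two codecs decode to different characters."""
    diffs = []
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        a, b = raw.decode(cs_a), raw.decode(cs_b)
        if a != b:
            diffs.append((byte, a, b))
    return diffs

diffs = differing_code_points("iso8859-1", "iso8859-15")
for byte, a, b in diffs:
    print(f"#x{byte:02X}: {a!r} in ISO 8859/1, {b!r} in ISO 8859/15")
```

The first entry reported is #xA4, which is CURRENCY SIGN in ISO 8859/1 and EURO SIGN in ISO 8859/15; every other position in the upper half except the few listed decodes identically in both.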
2720 | |
2721 Suppose a European user yanks text from a post encoded in ISO 8859/1 | |
2722 into a message composition buffer, and enters some text including the | |
2723 Euro sign. Then Mule will consider the buffer to contain both ISO | |
2724 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively | |
2725 programmed) send the message as a multipart mixed MIME body! | |
2726 | |
2727 This is clearly stupid. What is not as obvious is that, just as any | |
2728 European can include American English in their text because ASCII is a | |
2729 subset of ISO 8859/15, most European languages which use Latin | |
2730 characters (eg, German and Polish) can typically be mixed while using | |
2731 only one Latin coded character set (in the case of German and Polish, | |
2732 ISO 8859/2). However, this often depends on exactly what text is to be | |
2733 encoded (even for the same pair of languages). | |
2734 | |
2735 Unification works around the problem by converting as many characters as | |
2736 possible to use a single Latin coded character set before saving the | |
2737 buffer. | |
2738 | |
2739 Because the problem is rarely noticeable in editing a buffer, but tends | |
2740 to manifest when that buffer is exported to a file or process, the | |
2741 Unification package uses the strategy of examining the buffer prior to | |
2742 export. If use of multiple Latin coded character sets is detected, | |
2743 Unification attempts to unify them by finding a single coded character | |
2744 set which contains all of the Latin characters in the buffer. | |
2745 | |
2746 The primary purpose of Unification is to fix the problem by giving the | |
2747 user the choice to change the representation of all characters to one | |
2748 character set and give sensible recommendations based on context. In | |
2749 the 'ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and | |
2750 both will be suggested. In the EURO SIGN example, only ISO 8859/15 | |
2751 makes sense, and that is what will be recommended. In both cases, the | |
2752 user will be reminded that there are universal encodings available. | |
2753 | |
2754 I call this @dfn{remapping} (from the universal character set to a | |
2755 particular ISO 8859 coded character set). It is mere accident that this | |
2756 letter has the same code point in both character sets. (Not entirely, | |
2757 but there are many examples of Latin characters that have different code | |
2758 points in different Latin-X sets.) | |
2759 | |
2760 Note that, in the 'ó' example, treating the buffer in this way will | |
2761 result in a representation such as [latin-iso8859-2 | |
2762 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. | |
2763 This is guaranteed to occasionally result in the second problem you | |
2764 observed, to which we now turn. | |
2765 | |
2766 This problem is that, although the file is intended to be an | |
2767 ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX | |
2768 compliant program---this is required by the standard, obvious if you | |
2769 think a bit, @pxref{What Unification Cannot Do for You}) will read that | |
2770 file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this | |
2771 is no problem if all of the characters in the file are contained in ISO | |
2772 8859/1, but suppose there are some which are not, but are contained in | |
2773 the (intended) ISO 8859/2. | |
2774 | |
2775 You now want to fix this, but not by finding the same character in | |
2776 another set. Instead, you want to simply change the character set that | |
2777 Mule associates with that buffer position without changing the code. | |
2778 (This is conceptually somewhat distinct from the first problem, and | |
2779 logically ought to be handled in the code that defines coding systems. | |
2780 However, unification is not an unreasonable place for it.) Unification | |
2781 provides two functions (one fast and dangerous, the other slow and | |
2782 careful) to handle this. I call this @dfn{recoding}, because the | |
2783 transformation actually involves @emph{encoding} the buffer to file | |
2784 representation, then @emph{decoding} it to buffer representation (in a | |
2785 different character set). This cannot be done automatically because | |
2786 Mule can have no idea what the correct encoding is---after all, it | |
2787 already gave you its best guess. @xref{What Unification Cannot Do for | |
2788 You}. So these functions must be invoked by the user. @xref{Interactive | |
2789 Usage}. | |
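
The encode-then-decode round trip behind recoding can be illustrated with a short sketch. It is written in Python rather than Emacs Lisp, the function name is hypothetical, and it does not show the two functions that Unification itself provides.

```python
# Illustrative sketch only; not the Unification implementation.
def recode(text, read_as, intended):
    """Reinterpret text that was decoded with the wrong 8-bit charset."""
    raw = text.encode(read_as)    # recover the original file bytes
    return raw.decode(intended)   # same bytes, different character set


# A Polish word saved as ISO 8859/2, then misread in an ISO 8859/1 locale.
file_bytes = "\u0142\u00f3d\u017a".encode("iso8859_2")
misread = file_bytes.decode("iso8859_1")
repaired = recode(misread, "iso8859_1", "iso8859_2")
```

The byte values never change. Only the character set used to interpret them does, which is why the caller must supply the intended charset.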
2790 | |
2791 | |
2792 @node What Unification Cannot Do for You, Unification Internals, Theory of Operation, Charset Unification | |
2793 @subsection What Unification Cannot Do for You | |
2794 | |
2795 Unification @strong{cannot} save you if you insist on exporting data in | |
2796 8-bit encodings in a multilingual environment. @emph{You will | |
2797 eventually corrupt data if you do this.} It is not Mule's, or any | |
2798 application's, fault. You will have only yourself to blame; consider | |
2799 yourself warned. (It is true that Mule has bugs, which make Mule | |
2800 somewhat more dangerous and inconvenient than some naive applications. | |
2801 We're working to address those, but no application can remedy the | |
2802 inherent defect of 8-bit encodings.) | |
2803 | |
2804 Use standard universal encodings, preferably Unicode (UTF-8) unless | |
2805 applicable standards indicate otherwise. The most important such case | |
2806 is Internet messages, where MIME should be used, whether or not the | |
2807 subordinate encoding is a universal encoding. (Note that since one of | |
2808 the important provisions of MIME is the @samp{Content-Type} header, | |
2809 which has the charset parameter, MIME is to be considered a universal | |
2810 encoding for the purposes of this manual. Of course, technically | |
2811 speaking it's neither a coded character set nor a coding extension | |
2812 technique compliant with ISO 2022.) | |
2813 | |
2814 As mentioned earlier, the problem is that standard encodings suffer from | |
2815 the design defect that they do not provide a reliable way to recognize | |
2816 which coded character sets are in use. There are scores of character | |
2817 sets which can be represented by a single octet (8-bit byte), whose | |
2818 union contains many hundreds of characters. Thus any 8-bit coded | |
2819 character set must contain characters that share code points used for | |
2820 different characters in other coded character sets. | |
2821 | |
2822 This means that a given file's intended encoding cannot be identified | |
2823 with 100% reliability unless it contains encoding markers such as those | |
2824 provided by MIME or ISO 2022. | |
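
A small sketch makes the ambiguity concrete. It uses Python's standard codecs as stand-ins for the coded character sets, and the helper name is hypothetical. The same three octets decode without error under each of three ISO 8859 parts, yet yield three different strings.

```python
# Illustrative sketch only: one byte sequence, several valid readings.
def decodings(data, charsets=("iso8859_1", "iso8859_2", "iso8859_15")):
    """Map each charset name to the text obtained by decoding data with it."""
    return {name: data.decode(name) for name in charsets}


readings = decodings(bytes([0xA4, 0xB3, 0xF3]))
for name, text in readings.items():
    print(name, text)
```

Nothing in the bytes themselves says which reading was intended, so a recognizer can only guess from context.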
2825 | |
2826 Unification actually makes it more likely that you will have problems of | |
2827 this kind. Traditionally Mule has been ``helpful'' by simply using an | |
2828 ISO 2022 universal coding system when the current buffer coding system | |
2829 cannot handle all the characters in the buffer. This has the effect | |
2830 that, because the file contains control sequences, it is not recognized | |
2831 as being in the locale's normal 8-bit encoding. It may be annoying if | |
2832 you are not a Mule expert, but your data is automatically recoverable | |
2833 with a tool you already have: Mule. | |
2834 | |
2835 However, with unification, Mule converts to a single 8-bit character set | |
2836 when possible. But typically this will @emph{not} be in your usual | |
2837 locale. I.e., the times that an ISO 8859/1 user will need Unification are | |
2838 when there are ISO 8859/2 characters in the buffer. But then most | |
2839 likely the file will be saved in a pure 8-bit encoding that is not ISO | |
2840 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the | |
2841 most sophisticated yet available) cannot tell the difference between ISO | |
2842 8859/1 and ISO 8859/2, and in a Western European locale will choose the | |
2843 former even though the latter was intended. Even the extension | |
2844 (``statistical recognition'') planned for XEmacs 22 is unlikely to be at | |
2845 all accurate in the case of mixed codes. | |
2846 | |
2847 So now consider adding some additional ISO 8859/1 text to the buffer. | |
2848 If it includes any ISO 8859/1 codes that are used by different | |
2849 characters in ISO 8859/2, you now have a file that cannot be | |
2850 mechanically disentangled. You need a human being who can recognize | |
2851 that @emph{this is German and Swedish} and stays in Latin-1, while | |
2852 @emph{that is Polish} and needs to be recoded to Latin-2. | |
2853 | |
2854 Moral: switch to a universal coded character set, preferably Unicode | |
2855 using the UTF-8 transformation format. If you really need the space, | |
2856 compress your files. | |
2857 | |
2858 | |
2859 @node Unification Internals, , What Unification Cannot Do for You, Charset Unification | |
2860 @subsection Internals | |
2861 | |
2862 No internals documentation yet. | |
2863 | |
2864 @file{unity-utils.el} provides one utility function. | |
2865 | |
2866 @defun unity-dump-tables | |
2867 | |
2868 Dump the temporary table created by loading @file{unity-utils.el} | |
2869 to @file{unity-tables.el}. Loading the latter file initializes | |
2870 @samp{unity-equivalences}. | |
2871 @end defun | |
2872 | |
2873 | |
2874 @node Charsets and Coding Systems, , Charset Unification, MULE | |
2875 @section Charsets and Coding Systems | |
2876 | |
2877 This section provides reference lists of Mule charsets and coding | |
2878 systems. Mule charsets are typically named by character set and | |
2879 standard. | |
2880 | |
2881 @table @strong | |
2882 @item ASCII variants | |
2883 | |
2884 Identification of equivalent characters in these sets is not properly | |
2885 implemented. Unification does not distinguish the two charsets. | |
2886 | |
2887 @samp{ascii} @samp{latin-jisx0201} | |
2888 | |
2889 @item Extended Latin | |
2890 | |
2891 Characters from the following ISO 2022 conformant charsets are | |
2892 identified with equivalents in other charsets in the group by | |
2893 Unification. | |
2894 | |
2895 @samp{latin-iso8859-1} @samp{latin-iso8859-15} @samp{latin-iso8859-2} | |
2896 @samp{latin-iso8859-3} @samp{latin-iso8859-4} @samp{latin-iso8859-9} | |
2897 @samp{latin-iso8859-13} @samp{latin-iso8859-16} | |
2898 | |
2899 The following charsets are Latin variants which are not understood by | |
2900 Unification. In addition, many of the Asian language standards provide | |
2901 ASCII, at least, and sometimes other Latin characters. None of these | |
2902 are identified with their ISO 8859 equivalents. | |
2903 | |
2904 @samp{vietnamese-viscii-lower} | |
2905 @samp{vietnamese-viscii-upper} | |
2906 | |
2907 @item Other character sets | |
2908 | |
2909 @samp{arabic-1-column} | |
2910 @samp{arabic-2-column} | |
2911 @samp{arabic-digit} | |
2912 @samp{arabic-iso8859-6} | |
2913 @samp{chinese-big5-1} | |
2914 @samp{chinese-big5-2} | |
2915 @samp{chinese-cns11643-1} | |
2916 @samp{chinese-cns11643-2} | |
2917 @samp{chinese-cns11643-3} | |
2918 @samp{chinese-cns11643-4} | |
2919 @samp{chinese-cns11643-5} | |
2920 @samp{chinese-cns11643-6} | |
2921 @samp{chinese-cns11643-7} | |
2922 @samp{chinese-gb2312} | |
2923 @samp{chinese-isoir165} | |
2924 @samp{cyrillic-iso8859-5} | |
2925 @samp{ethiopic} | |
2926 @samp{greek-iso8859-7} | |
2927 @samp{hebrew-iso8859-8} | |
2928 @samp{ipa} | |
2929 @samp{japanese-jisx0208} | |
2930 @samp{japanese-jisx0208-1978} | |
2931 @samp{japanese-jisx0212} | |
2932 @samp{katakana-jisx0201} | |
2933 @samp{korean-ksc5601} | |
2934 @samp{sisheng} | |
2935 @samp{thai-tis620} | |
2936 @samp{thai-xtis} | |
2937 | |
2938 @item Non-graphic charsets | |
2939 | |
2940 @samp{control-1} | |
2941 @end table | |
2942 | |
2943 @table @strong | |
2944 @item No conversion | |
2945 | |
2946 Some of these coding systems may specify EOL conventions. Note that | |
2947 @samp{iso-8859-1} is a no-conversion coding system, not an ISO 2022 | |
2948 coding system. Although unification attempts to compensate for this, it | |
2949 is possible that the @samp{iso-8859-1} coding system will behave | |
2950 differently from other ISO 8859 coding systems. | |
2951 | |
2952 @samp{binary} @samp{no-conversion} @samp{raw-text} @samp{iso-8859-1} | |
2953 | |
2954 @item Latin coding systems | |
2955 | |
2956 These coding systems are all single-byte, 8-bit ISO 2022 coding systems, | |
2957 combining ASCII in the GL register (bytes with high-bit clear) and an | |
2958 extended Latin character set in the GR register (bytes with high-bit set). | |
2959 | |
2960 @samp{iso-8859-15} @samp{iso-8859-2} @samp{iso-8859-3} @samp{iso-8859-4} | |
2961 @samp{iso-8859-9} @samp{iso-8859-13} @samp{iso-8859-14} @samp{iso-8859-16} | |
2962 | |
2963 These coding systems are single-byte, 8-bit coding systems that do not | |
2964 conform to international standards. They should be avoided in all | |
2965 potentially multilingual contexts, including any text distributed over | |
2966 the Internet and World Wide Web. | |
2967 | |
2968 @samp{windows-1251} | |
2969 | |
2970 @item Multilingual coding systems | |
2971 | |
2972 The following ISO-2022-based coding systems are useful for multilingual | |
2973 text. | |
2974 | |
2975 @samp{ctext} @samp{iso-2022-lock} @samp{iso-2022-7} @samp{iso-2022-7bit} | |
2976 @samp{iso-2022-7bit-ss2} @samp{iso-2022-8} @samp{iso-2022-8bit-ss2} | |
2977 | |
2978 XEmacs also supports Unicode with the Mule-UCS package. These are the | |
2979 preferred coding systems for multilingual use. (There is a possible | |
2980 exception for texts that mix several Asian ideographic character sets.) | |
2981 | |
2982 @samp{utf-16-be} @samp{utf-16-be-no-signature} @samp{utf-16-le} | |
2983 @samp{utf-16-le-no-signature} @samp{utf-7} @samp{utf-7-safe} | |
2984 @samp{utf-8} @samp{utf-8-ws} | |
2985 | |
2986 Development versions of XEmacs (the 21.5 series) support Unicode | |
2987 internally, with (at least) the following coding systems implemented: | |
2988 | |
2989 @samp{utf-16-be} @samp{utf-16-be-bom} @samp{utf-16-le} | |
2990 @samp{utf-16-le-bom} @samp{utf-8} @samp{utf-8-bom} | |
2991 | |
2992 @item Asian ideographic languages | |
2993 | |
2994 The following coding systems are based on ISO 2022, and are more or less | |
2995 suitable for encoding multilingual texts. They all can represent ASCII | |
2996 at least, and sometimes several other foreign character sets, without | |
2997 resort to arbitrary ISO 2022 designations. However, these subsets are | |
2998 not identified with the corresponding national standards in XEmacs Mule. | |
2999 | |
3000 @samp{chinese-euc} @samp{cn-big5} @samp{cn-gb-2312} @samp{gb2312} | |
3001 @samp{hz} @samp{hz-gb-2312} @samp{old-jis} @samp{japanese-euc} | |
3002 @samp{junet} @samp{euc-japan} @samp{euc-jp} @samp{iso-2022-jp} | |
3003 @samp{iso-2022-jp-1978-irv} @samp{iso-2022-jp-2} @samp{euc-kr} | |
3004 @samp{korean-euc} @samp{iso-2022-kr} @samp{iso-2022-int-1} | |
3005 | |
3006 The following coding systems cannot be used for general multilingual | |
3007 text and do not cooperate well with other coding systems. | |
3008 | |
3009 @samp{big5} @samp{shift_jis} | |
3010 | |
3011 @item Other languages | |
3012 | |
3013 The following coding systems are based on ISO 2022. Though none of them | |
3014 provides any Latin characters beyond ASCII, XEmacs Mule allows (and up | |
3015 to 21.4 defaults to) use of ISO 2022 control sequences to designate | |
3016 other character sets for inclusion in the text. | |
3017 | |
3018 @samp{iso-8859-5} @samp{iso-8859-7} @samp{iso-8859-8} | |
3019 @samp{ctext-hebrew} | |
3020 | |
3021 The following are coding systems that do not conform to ISO 2022 and | |
3022 thus cannot be safely used in a multilingual context. | |
3023 | |
3024 @samp{alternativnyj} @samp{koi8-r} @samp{tis-620} @samp{viqr} | |
3025 @samp{viscii} @samp{vscii} | |
3026 | |
3027 @item Special coding systems | |
3028 | |
3029 Mule uses the following coding systems for special purposes. | |
3030 | |
3031 @samp{automatic-conversion} @samp{undecided} @samp{escape-quoted} | |
3032 | |
3033 @samp{escape-quoted} is especially important, as it is used internally | |
3034 as the coding system for autosaved data. | |
3035 | |
3036 The following coding systems are aliases for others, and are used for | |
3037 communication with the host operating system. | |
3038 | |
3039 @samp{file-name} @samp{keyboard} @samp{terminal} | |
3040 | |
3041 @end table | |
3042 | |
3043 Mule detection of coding systems is actually limited to detection of | |
3044 classes of coding systems called @dfn{coding categories}. These coding | |
3045 categories are identified by the ISO 2022 control sequences they use, if | |
3046 any, by their conformance to ISO 2022 restrictions on code points that | |
3047 may be used, and by characteristic patterns of use of 8-bit code points. | |
3048 | |
3049 @samp{no-conversion} | |
3050 @samp{utf-8} | |
3051 @samp{ucs-4} | |
3052 @samp{iso-7} | |
3053 @samp{iso-lock-shift} | |
3054 @samp{iso-8-1} | |
3055 @samp{iso-8-2} | |
3056 @samp{iso-8-designate} | |
3057 @samp{shift-jis} | |
3058 @samp{big5} | |
3059 | |
3060 | |
3061 @c end of mule.texi | |
3062 |