comparison man/lispref/mule.texi @ 440:8de8e3f6228a r21-2-28

Import from CVS: tag r21-2-28
author cvs
date Mon, 13 Aug 2007 11:33:38 +0200
parents 3ecd8885ac67
children abe6d1db359e
439:357dd071b03c 440:8de8e3f6228a
41 ways, although the basic shape will be the same. 41 ways, although the basic shape will be the same.
42 42
43 In some cases, the differences will be significant enough that it is 43 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both 44 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters 45 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the 46 @samp{a} and @samp{g} each have two distinct possible shapes---the
47 @samp{a} can optionally have a curved tail projecting off the top, and 47 @samp{a} can optionally have a curved tail projecting off the top, and
48 the @samp{g} can be formed either of two loops, or of one loop and a 48 the @samp{g} can be formed either of two loops, or of one loop and a
49 tail hanging off the bottom. Such distinct possible shapes of a 49 tail hanging off the bottom. Such distinct possible shapes of a
50 character are called @dfn{glyphs}. The important characteristic of two 50 character are called @dfn{glyphs}. The important characteristic of two
51 glyphs making up the same character is that the choice between one or 51 glyphs making up the same character is that the choice between one or
52 the other is purely stylistic and has no linguistic effect on a word 52 the other is purely stylistic and has no linguistic effect on a word
53 (this is the reason why a capital @samp{A} and lowercase @samp{a} 53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
54 are different characters rather than different glyphs -- e.g. 54 are different characters rather than different glyphs---e.g.
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). 55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
56 56
57 Note that @dfn{character} and @dfn{glyph} are used differently 57 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs. 58 here than elsewhere in XEmacs.
59 59
72 particular ordering. ASCII, for example, places letters in their 72 particular ordering. ASCII, for example, places letters in their
73 ``natural'' order, puts uppercase letters before lowercase letters, 73 ``natural'' order, puts uppercase letters before lowercase letters,
74 numbers before letters, etc. Note that for many of the Asian character 74 numbers before letters, etc. Note that for many of the Asian character
75 sets, there is no natural ordering of the characters. The actual 75 sets, there is no natural ordering of the characters. The actual
76 orderings are based on one or more salient characteristic, of which 76 orderings are based on one or more salient characteristic, of which
77 there are many to choose from -- e.g. number of strokes, common 77 there are many to choose from---e.g. number of strokes, common
78 radicals, phonetic ordering, etc. 78 radicals, phonetic ordering, etc.
79 79
80 The set of numbers assigned to any particular character are called 80 The set of numbers assigned to any particular character are called
81 the character's @dfn{position codes}. The number of position codes 81 the character's @dfn{position codes}. The number of position codes
82 required to index a particular character in a character set is called 82 required to index a particular character in a character set is called
103 position codes for the characters in that character set could be used 103 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do 104 directly. (This is the case with ASCII, and as a result, most people do
105 not understand the difference between a character set and an encoding.) 105 not understand the difference between a character set and an encoding.)
106 This is not possible, however, if more than one character set is to be 106 This is not possible, however, if more than one character set is to be
107 used in the encoding. For example, printed Japanese text typically 107 used in the encoding. For example, printed Japanese text typically
108 requires characters from multiple character sets -- ASCII, JISX0208, and 108 requires characters from multiple character sets---ASCII, JISX0208, and
109 JISX0212, to be specific. Each of these is indexed using one or more 109 JISX0212, to be specific. Each of these is indexed using one or more
110 position codes in the range 33 through 126, so the position codes could 110 position codes in the range 33 through 126, so the position codes could
111 not be used directly or there would be no way to tell which character 111 not be used directly or there would be no way to tell which character
112 was meant. Different Japanese encodings handle this differently -- JIS 112 was meant. Different Japanese encodings handle this differently---JIS
113 uses special escape characters to denote different character sets; EUC 113 uses special escape characters to denote different character sets; EUC
114 sets the high bit of the position codes for JISX0208 and JISX0212, and 114 sets the high bit of the position codes for JISX0208 and JISX0212, and
115 puts a special extra byte before each JISX0212 character; etc. (JIS, 115 puts a special extra byte before each JISX0212 character; etc. (JIS,
116 EUC, and most of the other encodings you will encounter are 7-bit or 116 EUC, and most of the other encodings you will encounter are 7-bit or
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; 117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
364 This function returns the number of display columns per character (in 364 This function returns the number of display columns per character (in
365 TTY mode) of @var{charset}. 365 TTY mode) of @var{charset}.
366 @end defun 366 @end defun
367 367
368 @defun charset-direction charset 368 @defun charset-direction charset
369 This function returns the display direction of @var{charset} -- either 369 This function returns the display direction of @var{charset}---either
370 @code{l2r} or @code{r2l}. 370 @code{l2r} or @code{r2l}.
371 @end defun 371 @end defun
372 372
373 @defun charset-final charset 373 @defun charset-final charset
374 This function returns the final byte of the ISO 2022 escape sequence 374 This function returns the final byte of the ISO 2022 escape sequence
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a 553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
554 register of charset can be invoked into. 554 register of charset can be invoked into.
555 555
556 @example 556 @example
557 @group 557 @group
558 C0: 0x00 - 0x1F 558 C0: 0x00 - 0x1F
559 GL: 0x20 - 0x7F 559 GL: 0x20 - 0x7F
560 C1: 0x80 - 0x9F 560 C1: 0x80 - 0x9F
561 GR: 0xA0 - 0xFF 561 GR: 0xA0 - 0xFF
562 @end group 562 @end group
563 @end example 563 @end example
564 564
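As an illustration only (not part of mule.texi or of XEmacs), the four code areas listed in the table can be expressed as a small Python sketch. The function name `iso2022_area` is hypothetical; only the byte ranges come from the text.

```python
def iso2022_area(byte: int) -> str:
    """Return the ISO 2022 code area (C0, GL, C1, or GR) for a byte value.

    The ranges are taken from the table in the manual text:
    C0 0x00-0x1F, GL 0x20-0x7F, C1 0x80-0x9F, GR 0xA0-0xFF.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0x00-0xFF")
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"
```

For instance, `iso2022_area(0x1B)` classifies the ESC control character as C0, and any byte with the high bit set falls in C1 or GR, which is why a 7-bit environment uses only C0 and GL.
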
565 Usually, in the initial state, G0 is invoked into GL, and G1 565 Usually, in the initial state, G0 is invoked into GL, and G1
566 is invoked into GR. 566 is invoked into GR.
569 7-bit environments, only C0 and GL are used. 569 7-bit environments, only C0 and GL are used.
570 570
571 Charset designation is done by escape sequences of the form: 571 Charset designation is done by escape sequences of the form:
572 572
573 @example 573 @example
574 ESC [@var{I}] @var{I} @var{F} 574 ESC [@var{I}] @var{I} @var{F}
575 @end example 575 @end example
576 576
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and 577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset. 578 @var{F} is the final character identifying this charset.
579 579
580 The meaning of intermediate characters are: 580 The meaning of intermediate characters are:
581 581
582 @example 582 @example
583 @group 583 @group
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). 584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. 585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. 586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. 587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. 588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. 589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. 590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. 591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
592 @end group 592 @end group
593 @end example 593 @end example
594 594
595 The following rule is not allowed in ISO 2022 but can be used in Mule. 595 The following rule is not allowed in ISO 2022 but can be used in Mule.
596 596
597 @example 597 @example
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. 598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
599 @end example 599 @end example
600 600
601 Here are examples of designations: 601 Here are examples of designations:
602 602
603 @example 603 @example
604 @group 604 @group
605 ESC ( B : designate to G0 ASCII 605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1 606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312 607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208 608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601 609 ESC $ ) C : designate to G1 KSC5601
610 @end group 610 @end group
611 @end example 611 @end example
612 612
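The designation rules can be sketched in Python as a rough illustration; this is not XEmacs code. The helper `designate` and the table `INTERMEDIATE` are hypothetical names, and the sketch always emits the long form (for example `ESC $ ( B`), never the short `ESC $ B` variant that the examples also list.

```python
ESC = b"\x1b"

# Intermediate byte for each (register, charset size) pair, copied from the
# table in the manual text. The ("G0", 96) entry (0x2C) is the Mule extension
# that the text says ISO 2022 itself does not allow.
INTERMEDIATE = {
    ("G0", 94): b"(",
    ("G1", 94): b")",
    ("G2", 94): b"*",
    ("G3", 94): b"+",
    ("G0", 96): b",",
    ("G1", 96): b"-",
    ("G2", 96): b".",
    ("G3", 96): b"/",
}


def designate(register: str, size: int, final: bytes, dimension: int = 1) -> bytes:
    """Build the escape sequence ESC [I] I F that designates a charset.

    register  -- "G0", "G1", "G2", or "G3"
    size      -- 94 or 96 (characters per dimension)
    final     -- the final byte F identifying the charset, e.g. b"B"
    dimension -- 1, or 2 for 94x94 / 96x96 charsets (adds the "$" byte)
    """
    if dimension not in (1, 2):
        raise ValueError("dimension must be 1 or 2")
    sequence = ESC
    if dimension == 2:
        sequence += b"$"
    return sequence + INTERMEDIATE[(register, size)] + final
```

With these assumptions, `designate("G0", 94, b"B")` yields the bytes for `ESC ( B` (ASCII to G0), and `designate("G1", 94, b"C", dimension=2)` yields `ESC $ ) C` (KSC5601 to G1), matching the examples in the text.
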
613 To use a charset designated to G2 or G3, and to use a charset designated 613 To use a charset designated to G2 or G3, and to use a charset designated
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
616 Single Shift (one character only). 616 Single Shift (one character only).
617 617
618 Locking Shift is done as follows: 618 Locking Shift is done as follows:
619 619
620 @example 620 @example
621 LS0 or SI (0x0F): invoke G0 into GL 621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL 622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL 623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL 624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR 625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR 626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR 627 LS3R: invoke G3 into GR
628 @end example 628 @end example
629 629
630 Single Shift is done as follows: 630 Single Shift is done as follows:
631 631
632 @example 632 @example
633 @group 633 @group
634 SS2 or ESC N: invoke G2 into GL 634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL 635 SS3 or ESC O: invoke G3 into GL
636 @end group 636 @end group
637 @end example 637 @end example
638 638
639 (#### Ben says: I think the above is slightly incorrect. It appears that 639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and 640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
676 Here are several examples: 676 Here are several examples:
677 677
678 @example 678 @example
679 @group 679 @group
680 junet -- Coding system used in JUNET. 680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used 681 1. G0 <- ASCII, G1..3 <- never used
682 2. Yes. 682 2. Yes.
683 3. Yes. 683 3. Yes.
684 4. Yes. 684 4. Yes.
685 5. 7-bit environment 685 5. 7-bit environment
686 6. No. 686 6. No.
687 7. Use ASCII 687 7. Use ASCII
688 8. Use JISX0208-1983 688 8. Use JISX0208-1983
689 @end group 689 @end group
690 690
691 @group 691 @group
692 ctext -- Compound Text 692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used 693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
694 2. No. 694 2. No.
695 3. No. 695 3. No.
696 4. Yes. 696 4. Yes.
697 5. 8-bit environment 697 5. 8-bit environment
698 6. No. 698 6. No.
699 7. Use ASCII 699 7. Use ASCII
700 8. Use JISX0208-1983 700 8. Use JISX0208-1983
701 @end group 701 @end group
702 702
703 @group 703 @group
704 euc-china -- Chinese EUC. Although many people call this 704 euc-china -- Chinese EUC. Although many people call this
705 as "GB encoding", the name may cause misunderstanding. 705 as "GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used 706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
707 2. No. 707 2. No.
708 3. Yes. 708 3. Yes.
709 4. Yes. 709 4. Yes.
710 5. 8-bit environment 710 5. 8-bit environment
711 6. No. 711 6. No.
712 7. Use ASCII 712 7. Use ASCII
713 8. Use JISX0208-1983 713 8. Use JISX0208-1983
714 @end group 714 @end group
715 715
716 @group 716 @group
717 korean-mail -- Coding system used in Korean network. 717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used 718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
719 2. No. 719 2. No.
720 3. Yes. 720 3. Yes.
721 4. Yes. 721 4. Yes.
722 5. 7-bit environment 722 5. 7-bit environment
723 6. Yes. 723 6. Yes.
724 7. No. 724 7. No.
725 8. No. 725 8. No.
726 @end group 726 @end group
727 @end example 727 @end example
728 728
729 Mule creates all these coding systems by default. 729 Mule creates all these coding systems by default.
730 730
738 or process, and is used to encode the text back into the same format 738 or process, and is used to encode the text back into the same format
739 when it is written out to a file or process. 739 when it is written out to a file or process.
740 740
741 For example, many ISO-2022-compliant coding systems (such as Compound 741 For example, many ISO-2022-compliant coding systems (such as Compound
742 Text, which is used for inter-client data under the X Window System) use 742 Text, which is used for inter-client data under the X Window System) use
743 escape sequences to switch between different charsets -- Japanese Kanji, 743 escape sequences to switch between different charsets---Japanese Kanji,
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
746 @code{make-coding-system} for more information. 746 @code{make-coding-system} for more information.
747 747
748 Coding systems are normally identified using a symbol, and the symbol is 748 Coding systems are normally identified using a symbol, and the symbol is
1445 @node Category Tables, , CCL, MULE 1445 @node Category Tables, , CCL, MULE
1446 @section Category Tables 1446 @section Category Tables
1447 1447
1448 A category table is a type of char table used for keeping track of 1448 A category table is a type of char table used for keeping track of
1449 categories. Categories are used for classifying characters for use in 1449 categories. Categories are used for classifying characters for use in
1450 regexps -- you can refer to a category rather than having to use a 1450 regexps---you can refer to a category rather than having to use a
1451 complicated [] expression (and category lookups are significantly 1451 complicated [] expression (and category lookups are significantly
1452 faster). 1452 faster).
1453 1453
1454 There are 95 different categories available, one for each printable 1454 There are 95 different categories available, one for each printable
1455 character (including space) in the ASCII charset. Each category is 1455 character (including space) in the ASCII charset. Each category is