comparison man/lispref/mule.texi @ 440:8de8e3f6228a r21-2-28
Import from CVS: tag r21-2-28
author | cvs |
---|---|
date | Mon, 13 Aug 2007 11:33:38 +0200 |
parents | 3ecd8885ac67 |
children | abe6d1db359e |
comparison
439:357dd071b03c | 440:8de8e3f6228a |
---|---|
41 ways, although the basic shape will be the same. | 41 ways, although the basic shape will be the same. |
42 | 42 |
43 In some cases, the differences will be significant enough that it is | 43 In some cases, the differences will be significant enough that it is |
44 actually possible to identify two or more distinct shapes that both | 44 actually possible to identify two or more distinct shapes that both |
45 represent the same character. For example, the lowercase letters | 45 represent the same character. For example, the lowercase letters |
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the | 46 @samp{a} and @samp{g} each have two distinct possible shapes---the |
47 @samp{a} can optionally have a curved tail projecting off the top, and | 47 @samp{a} can optionally have a curved tail projecting off the top, and |
48 the @samp{g} can be formed either of two loops, or of one loop and a | 48 the @samp{g} can be formed either of two loops, or of one loop and a |
49 tail hanging off the bottom. Such distinct possible shapes of a | 49 tail hanging off the bottom. Such distinct possible shapes of a |
50 character are called @dfn{glyphs}. The important characteristic of two | 50 character are called @dfn{glyphs}. The important characteristic of two |
51 glyphs making up the same character is that the choice between one or | 51 glyphs making up the same character is that the choice between one or |
52 the other is purely stylistic and has no linguistic effect on a word | 52 the other is purely stylistic and has no linguistic effect on a word |
53 (this is the reason why a capital @samp{A} and lowercase @samp{a} | 53 (this is the reason why a capital @samp{A} and lowercase @samp{a} |
54 are different characters rather than different glyphs -- e.g. | 54 are different characters rather than different glyphs---e.g. |
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). | 55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree). |
56 | 56 |
57 Note that @dfn{character} and @dfn{glyph} are used differently | 57 Note that @dfn{character} and @dfn{glyph} are used differently |
58 here than elsewhere in XEmacs. | 58 here than elsewhere in XEmacs. |
59 | 59 |
72 particular ordering. ASCII, for example, places letters in their | 72 particular ordering. ASCII, for example, places letters in their |
73 ``natural'' order, puts uppercase letters before lowercase letters, | 73 ``natural'' order, puts uppercase letters before lowercase letters, |
74 numbers before letters, etc. Note that for many of the Asian character | 74 numbers before letters, etc. Note that for many of the Asian character |
75 sets, there is no natural ordering of the characters. The actual | 75 sets, there is no natural ordering of the characters. The actual |
76 orderings are based on one or more salient characteristics, of which | 76 orderings are based on one or more salient characteristics, of which |
77 there are many to choose from -- e.g. number of strokes, common | 77 there are many to choose from---e.g. number of strokes, common |
78 radicals, phonetic ordering, etc. | 78 radicals, phonetic ordering, etc. |
79 | 79 |
80 The numbers assigned to any particular character are called | 80 The numbers assigned to any particular character are called |
81 the character's @dfn{position codes}. The number of position codes | 81 the character's @dfn{position codes}. The number of position codes |
82 required to index a particular character in a character set is called | 82 required to index a particular character in a character set is called |
103 position codes for the characters in that character set could be used | 103 position codes for the characters in that character set could be used |
104 directly. (This is the case with ASCII, and as a result, most people do | 104 directly. (This is the case with ASCII, and as a result, most people do |
105 not understand the difference between a character set and an encoding.) | 105 not understand the difference between a character set and an encoding.) |
106 This is not possible, however, if more than one character set is to be | 106 This is not possible, however, if more than one character set is to be |
107 used in the encoding. For example, printed Japanese text typically | 107 used in the encoding. For example, printed Japanese text typically |
108 requires characters from multiple character sets -- ASCII, JISX0208, and | 108 requires characters from multiple character sets---ASCII, JISX0208, and |
109 JISX0212, to be specific. Each of these is indexed using one or more | 109 JISX0212, to be specific. Each of these is indexed using one or more |
110 position codes in the range 33 through 126, so the position codes could | 110 position codes in the range 33 through 126, so the position codes could |
111 not be used directly or there would be no way to tell which character | 111 not be used directly or there would be no way to tell which character |
112 was meant. Different Japanese encodings handle this differently -- JIS | 112 was meant. Different Japanese encodings handle this differently---JIS |
113 uses special escape characters to denote different character sets; EUC | 113 uses special escape characters to denote different character sets; EUC |
114 sets the high bit of the position codes for JISX0208 and JISX0212, and | 114 sets the high bit of the position codes for JISX0208 and JISX0212, and |
115 puts a special extra byte before each JISX0212 character; etc. (JIS, | 115 puts a special extra byte before each JISX0212 character; etc. (JIS, |
116 EUC, and most of the other encodings you will encounter are 7-bit or | 116 EUC, and most of the other encodings you will encounter are 7-bit or |
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; | 117 8-bit encodings. There is one common 16-bit encoding, which is Unicode; |
364 This function returns the number of display columns per character (in | 364 This function returns the number of display columns per character (in |
365 TTY mode) of @var{charset}. | 365 TTY mode) of @var{charset}. |
366 @end defun | 366 @end defun |
367 | 367 |
368 @defun charset-direction charset | 368 @defun charset-direction charset |
369 This function returns the display direction of @var{charset} -- either | 369 This function returns the display direction of @var{charset}---either |
370 @code{l2r} or @code{r2l}. | 370 @code{l2r} or @code{r2l}. |
371 @end defun | 371 @end defun |
372 | 372 |
373 @defun charset-final charset | 373 @defun charset-final charset |
374 This function returns the final byte of the ISO 2022 escape sequence | 374 This function returns the final byte of the ISO 2022 escape sequence |
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a | 553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a |
554 charset register can be invoked. | 554 charset register can be invoked. |
555 | 555 |
556 @example | 556 @example |
557 @group | 557 @group |
558 C0: 0x00 - 0x1F | 558 C0: 0x00 - 0x1F |
559 GL: 0x20 - 0x7F | 559 GL: 0x20 - 0x7F |
560 C1: 0x80 - 0x9F | 560 C1: 0x80 - 0x9F |
561 GR: 0xA0 - 0xFF | 561 GR: 0xA0 - 0xFF |
562 @end group | 562 @end group |
563 @end example | 563 @end example |
564 | 564 |
565 Usually, in the initial state, G0 is invoked into GL, and G1 | 565 Usually, in the initial state, G0 is invoked into GL, and G1 |
566 is invoked into GR. | 566 is invoked into GR. |
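The four code areas listed above can be expressed as a simple range check. The following is a minimal Python sketch, not part of the manual or of XEmacs; the function name `iso2022_area` is hypothetical and chosen only for illustration.

```python
def iso2022_area(byte: int) -> str:
    """Return the ISO 2022 code area (C0, GL, C1 or GR) for an 8-bit value.

    The ranges follow the table above:
    C0 = 0x00-0x1F, GL = 0x20-0x7F, C1 = 0x80-0x9F, GR = 0xA0-0xFF.
    """
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0x00-0xFF")
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"
```

For instance, `iso2022_area(0x1B)` places ESC in C0, while any byte with the high bit set falls in C1 or GR, which is why those two areas are unavailable in a 7-bit environment.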
569 7-bit environments, only C0 and GL are used. | 569 7-bit environments, only C0 and GL are used. |
570 | 570 |
571 Charset designation is done by escape sequences of the form: | 571 Charset designation is done by escape sequences of the form: |
572 | 572 |
573 @example | 573 @example |
574 ESC [@var{I}] @var{I} @var{F} | 574 ESC [@var{I}] @var{I} @var{F} |
575 @end example | 575 @end example |
576 | 576 |
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and | 577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and |
578 @var{F} is the final character identifying this charset. | 578 @var{F} is the final character identifying this charset. |
579 | 579 |
580 The meanings of the intermediate characters are: | 580 The meanings of the intermediate characters are: |
581 | 581 |
582 @example | 582 @example |
583 @group | 583 @group |
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). | 584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). |
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. | 585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}. |
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. | 586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}. |
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. | 587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}. |
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. | 588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}. |
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. | 589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}. |
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. | 590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}. |
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. | 591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}. |
592 @end group | 592 @end group |
593 @end example | 593 @end example |
594 | 594 |
595 The following rule is not allowed in ISO 2022 but can be used in Mule. | 595 The following rule is not allowed in ISO 2022 but can be used in Mule. |
596 | 596 |
597 @example | 597 @example |
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. | 598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}. |
599 @end example | 599 @end example |
600 | 600 |
601 Here are examples of designations: | 601 Here are examples of designations: |
602 | 602 |
603 @example | 603 @example |
604 @group | 604 @group |
605 ESC ( B : designate to G0 ASCII | 605 ESC ( B : designate to G0 ASCII |
606 ESC - A : designate to G1 Latin-1 | 606 ESC - A : designate to G1 Latin-1 |
607 ESC $ ( A or ESC $ A : designate to G0 GB2312 | 607 ESC $ ( A or ESC $ A : designate to G0 GB2312 |
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208 | 608 ESC $ ( B or ESC $ B : designate to G0 JISX0208 |
609 ESC $ ) C : designate to G1 KSC5601 | 609 ESC $ ) C : designate to G1 KSC5601 |
610 @end group | 610 @end group |
611 @end example | 611 @end example |
612 | 612 |
613 To use a charset designated to G2 or G3, and to use a charset designated | 613 To use a charset designated to G2 or G3, and to use a charset designated |
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 | 614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 |
616 Single Shift (one character only). | 616 Single Shift (one character only). |
617 | 617 |
618 Locking Shift is done as follows: | 618 Locking Shift is done as follows: |
619 | 619 |
620 @example | 620 @example |
621 LS0 or SI (0x0F): invoke G0 into GL | 621 LS0 or SI (0x0F): invoke G0 into GL |
622 LS1 or SO (0x0E): invoke G1 into GL | 622 LS1 or SO (0x0E): invoke G1 into GL |
623 LS2: invoke G2 into GL | 623 LS2: invoke G2 into GL |
624 LS3: invoke G3 into GL | 624 LS3: invoke G3 into GL |
625 LS1R: invoke G1 into GR | 625 LS1R: invoke G1 into GR |
626 LS2R: invoke G2 into GR | 626 LS2R: invoke G2 into GR |
627 LS3R: invoke G3 into GR | 627 LS3R: invoke G3 into GR |
628 @end example | 628 @end example |
629 | 629 |
630 Single Shift is done as follows: | 630 Single Shift is done as follows: |
631 | 631 |
632 @example | 632 @example |
633 @group | 633 @group |
634 SS2 or ESC N: invoke G2 into GL | 634 SS2 or ESC N: invoke G2 into GL |
635 SS3 or ESC O: invoke G3 into GL | 635 SS3 or ESC O: invoke G3 into GL |
636 @end group | 636 @end group |
637 @end example | 637 @end example |
638 | 638 |
639 (#### Ben says: I think the above is slightly incorrect. It appears that | 639 (#### Ben says: I think the above is slightly incorrect. It appears that |
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and | 640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and |
676 Here are several examples: | 676 Here are several examples: |
677 | 677 |
678 @example | 678 @example |
679 @group | 679 @group |
680 junet -- Coding system used in JUNET. | 680 junet -- Coding system used in JUNET. |
681 1. G0 <- ASCII, G1..3 <- never used | 681 1. G0 <- ASCII, G1..3 <- never used |
682 2. Yes. | 682 2. Yes. |
683 3. Yes. | 683 3. Yes. |
684 4. Yes. | 684 4. Yes. |
685 5. 7-bit environment | 685 5. 7-bit environment |
686 6. No. | 686 6. No. |
687 7. Use ASCII | 687 7. Use ASCII |
688 8. Use JISX0208-1983 | 688 8. Use JISX0208-1983 |
689 @end group | 689 @end group |
690 | 690 |
691 @group | 691 @group |
692 ctext -- Compound Text | 692 ctext -- Compound Text |
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used | 693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used |
694 2. No. | 694 2. No. |
695 3. No. | 695 3. No. |
696 4. Yes. | 696 4. Yes. |
697 5. 8-bit environment | 697 5. 8-bit environment |
698 6. No. | 698 6. No. |
699 7. Use ASCII | 699 7. Use ASCII |
700 8. Use JISX0208-1983 | 700 8. Use JISX0208-1983 |
701 @end group | 701 @end group |
702 | 702 |
703 @group | 703 @group |
704 euc-china -- Chinese EUC. Although many people call this | 704 euc-china -- Chinese EUC. Although many people call this |
705 "GB encoding", the name may cause misunderstanding. | 705 "GB encoding", the name may cause misunderstanding. |
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used | 706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used |
707 2. No. | 707 2. No. |
708 3. Yes. | 708 3. Yes. |
709 4. Yes. | 709 4. Yes. |
710 5. 8-bit environment | 710 5. 8-bit environment |
711 6. No. | 711 6. No. |
712 7. Use ASCII | 712 7. Use ASCII |
713 8. Use JISX0208-1983 | 713 8. Use JISX0208-1983 |
714 @end group | 714 @end group |
715 | 715 |
716 @group | 716 @group |
717 korean-mail -- Coding system used in Korean network. | 717 korean-mail -- Coding system used in Korean network. |
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used | 718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used |
719 2. No. | 719 2. No. |
720 3. Yes. | 720 3. Yes. |
721 4. Yes. | 721 4. Yes. |
722 5. 7-bit environment | 722 5. 7-bit environment |
723 6. Yes. | 723 6. Yes. |
724 7. No. | 724 7. No. |
725 8. No. | 725 8. No. |
726 @end group | 726 @end group |
727 @end example | 727 @end example |
728 | 728 |
729 Mule creates all these coding systems by default. | 729 Mule creates all these coding systems by default. |
730 | 730 |
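The initial designations listed for these coding systems (for example, G0 <- ASCII and G1 <- Latin-1 for ctext) correspond to escape sequences of the form ESC [I] I F described earlier. As a rough illustration, the Python sketch below assembles such a sequence from the intermediate-character table. It is not XEmacs code: the names `designation` and `_REGISTER_BYTE` are hypothetical, and the sketch always emits the full form (it never produces the short forms `ESC $ A` or `ESC $ B`).

```python
ESC = 0x1B

# Intermediate byte that selects the target register, keyed by
# (register number, charset size).  Values come from the table of
# intermediate characters; (0, 96) is the Mule-only extension that
# ISO 2022 itself does not allow.
_REGISTER_BYTE = {
    (0, 94): 0x28, (1, 94): 0x29, (2, 94): 0x2A, (3, 94): 0x2B,
    (0, 96): 0x2C, (1, 96): 0x2D, (2, 96): 0x2E, (3, 96): 0x2F,
}


def designation(register: int, size: int, final: str, dimension: int = 1) -> bytes:
    """Build the escape sequence designating a charset to G0..G3.

    register  -- 0..3, the G register to designate to
    size      -- 94 or 96, the number of characters per dimension
    final     -- the single final character F identifying the charset
    dimension -- 1, or 2 for a 94x94 / 96x96 charset (adds '$', 0x24)
    """
    if len(final) != 1:
        raise ValueError("final must be a single character")
    if dimension not in (1, 2):
        raise ValueError("dimension must be 1 or 2")
    seq = bytearray([ESC])
    if dimension == 2:
        seq.append(0x24)  # '$' marks a charset of dimension 2
    seq.append(_REGISTER_BYTE[(register, size)])
    seq += final.encode("ascii")
    return bytes(seq)
```

With these assumptions, `designation(0, 94, "B")` yields the bytes for `ESC ( B` (ASCII to G0), and `designation(1, 94, "C", dimension=2)` yields `ESC $ ) C` (KSC5601 to G1), matching the designation examples given in the text.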
738 or process, and is used to encode the text back into the same format | 738 or process, and is used to encode the text back into the same format |
739 when it is written out to a file or process. | 739 when it is written out to a file or process. |
740 | 740 |
741 For example, many ISO-2022-compliant coding systems (such as Compound | 741 For example, many ISO-2022-compliant coding systems (such as Compound |
742 Text, which is used for inter-client data under the X Window System) use | 742 Text, which is used for inter-client data under the X Window System) use |
743 escape sequences to switch between different charsets -- Japanese Kanji, | 743 escape sequences to switch between different charsets---Japanese Kanji, |
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with | 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with |
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See | 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See |
746 @code{make-coding-system} for more information. | 746 @code{make-coding-system} for more information. |
747 | 747 |
748 Coding systems are normally identified using a symbol, and the symbol is | 748 Coding systems are normally identified using a symbol, and the symbol is |
1445 @node Category Tables, , CCL, MULE | 1445 @node Category Tables, , CCL, MULE |
1446 @section Category Tables | 1446 @section Category Tables |
1447 | 1447 |
1448 A category table is a type of char table used for keeping track of | 1448 A category table is a type of char table used for keeping track of |
1449 categories. Categories are used for classifying characters for use in | 1449 categories. Categories are used for classifying characters for use in |
1450 regexps -- you can refer to a category rather than having to use a | 1450 regexps---you can refer to a category rather than having to use a |
1451 complicated [] expression (and category lookups are significantly | 1451 complicated [] expression (and category lookups are significantly |
1452 faster). | 1452 faster). |
1453 | 1453 |
1454 There are 95 different categories available, one for each printable | 1454 There are 95 different categories available, one for each printable |
1455 character (including space) in the ASCII charset. Each category is | 1455 character (including space) in the ASCII charset. Each category is |