comparison man/lispref/mule.texi @ 2818:9fa10603c898
[xemacs-hg @ 2005-06-19 20:49:43 by aidan]
Pure storage is long gone.
author | aidan |
---|---|
date | Sun, 19 Jun 2005 20:49:47 +0000 |
parents | d5bfa26d5c3f |
children | d1754e7f0cea |
2817:9244a70250d8 | 2818:9fa10603c898 |
---|---|
139 directly match ASCII printing characters. | 139 directly match ASCII printing characters. |
140 | 140 |
141 An @dfn{encoding} is a way of numerically representing characters from | 141 An @dfn{encoding} is a way of numerically representing characters from |
142 one or more character sets into a stream of like-sized numerical values | 142 one or more character sets into a stream of like-sized numerical values |
143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or | 143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or |
144 32-bit quantities. It's very important to clearly distinguish between | 144 32-bit quantities. In a context where dealing with Japanese motivates |
145 charsets and encodings. For a simple charset like ASCII, there is only | 145 much of XEmacs' design in this area, it's important to clearly |
146 one encoding normally used -- each character is represented by a single | 146 distinguish between charsets and encodings. For a simple charset like |
147 byte, with the same value as its code point. For more complicated | 147 ASCII, there is only one encoding normally used -- each character is |
148 charsets, however, or when a single encoding needs to represent more | 148 represented by a single byte, with the same value as its code point. |
149 than charset, things are not so obvious. Unicode version 2, for | 149 For more complicated charsets, however, or when a single encoding needs |
150 example, is a large charset with thousands of characters, each indexed | 150 to represent more than charset, things are not so obvious. Unicode |
151 by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew | 151 version 2, for example, is a large charset with thousands of characters, |
152 letter "aleph". One obvious encoding (actually two encodings, depending | 152 each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 |
153 on which of the two possible byte orderings is chosen) simply uses two | 153 for the Hebrew letter "aleph". One obvious encoding (actually two |
154 bytes per character. This encoding is convenient for internal | 154 encodings, depending on which of the two possible byte orderings is |
155 processing of Unicode text; however, it's incompatible with ASCII, and | 155 chosen) simply uses two bytes per character. This encoding is |
156 thus external text (files, e-mail, etc.) that is encoded this way is | 156 convenient for internal processing of Unicode text; however, it's |
157 completely uninterpretable by programs lacking Unicode support. For | 157 incompatible with ASCII, and thus external text (files, e-mail, etc.) |
158 this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is | 158 that is encoded this way is completely uninterpretable by programs |
159 usually used for external text. UTF-8 represents Unicode characters | 159 lacking Unicode support. For this reason, a different, ASCII-compatible |
160 with one to three bytes (often extended to six bytes to handle | 160 encoding, e.g. UTF-8, is usually used for external text. UTF-8 |
161 characters with up to 31-bit indices). Unicode characters 00 to 7F | 161 represents Unicode characters with one to three bytes (often extended to |
162 (identical with ASCII) are directly represented with one byte, and other | 162 six bytes to handle characters with up to 31-bit indices). Unicode |
163 characters with two or more bytes, each in the range 80 to FF. | 163 characters 00 to 7F (identical with ASCII) are directly represented with |
164 Applications that don't understand Unicode will still be able to process | 164 one byte, and other characters with two or more bytes, each in the range |
165 ASCII characters represented in UTF-8-encoded text, and will typically | 165 80 to FF. Applications that don't understand Unicode will still be able |
166 ignore (and hopefully preserve) the high-bit characters. | 166 to process ASCII characters represented in UTF-8-encoded text, and will |
167 typically ignore (and hopefully preserve) the high-bit characters. | |
168 | |
169 Similarly, Shift-JIS and EUC-JP are different encodings normally used to | |
170 encode the same character set(s), these character sets being subsets of | |
171 Unicode. However, the obvious approach of unifying XEmacs' internal | |
172 encoding across character sets, as was part of the motivation behind | |
173 Unicode, wasn't taken. This means that characters in these character | |
174 sets that are identical to characters in other character sets---for | |
175 example, the Greek alphabet is in the large Japanese character sets and | |
176 at least one European character set--are unfortunately disjoint. | |
167 | 177 |
168 Naive use of code points is also not possible if more than one | 178 Naive use of code points is also not possible if more than one |
169 character set is to be used in the encoding. For example, printed | 179 character set is to be used in the encoding. For example, printed |
170 Japanese text typically requires characters from multiple character sets | 180 Japanese text typically requires characters from multiple character sets |
171 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | 181 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is |