comparison man/lispref/mule.texi @ 2818:9fa10603c898

[xemacs-hg @ 2005-06-19 20:49:43 by aidan] Pure storage is long gone.
author aidan
date Sun, 19 Jun 2005 20:49:47 +0000
parents d5bfa26d5c3f
children d1754e7f0cea
comparing 2817:9244a70250d8 with 2818:9fa10603c898
@@ -139,33 +139,43 @@
 directly match ASCII printing characters.
 
 An @dfn{encoding} is a way of numerically representing characters from
 one or more character sets into a stream of like-sized numerical values
 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or
-32-bit quantities. It's very important to clearly distinguish between
-charsets and encodings. For a simple charset like ASCII, there is only
-one encoding normally used -- each character is represented by a single
-byte, with the same value as its code point. For more complicated
-charsets, however, or when a single encoding needs to represent more
-than one charset, things are not so obvious. Unicode version 2, for
-example, is a large charset with thousands of characters, each indexed
-by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew
-letter "aleph". One obvious encoding (actually two encodings, depending
-on which of the two possible byte orderings is chosen) simply uses two
-bytes per character. This encoding is convenient for internal
-processing of Unicode text; however, it's incompatible with ASCII, and
-thus external text (files, e-mail, etc.) that is encoded this way is
-completely uninterpretable by programs lacking Unicode support. For
-this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is
-usually used for external text. UTF-8 represents Unicode characters
-with one to three bytes (often extended to six bytes to handle
-characters with up to 31-bit indices). Unicode characters 00 to 7F
-(identical with ASCII) are directly represented with one byte, and other
-characters with two or more bytes, each in the range 80 to FF.
-Applications that don't understand Unicode will still be able to process
-ASCII characters represented in UTF-8-encoded text, and will typically
-ignore (and hopefully preserve) the high-bit characters.
+32-bit quantities. In a context where dealing with Japanese motivates
+much of XEmacs' design in this area, it's important to clearly
+distinguish between charsets and encodings. For a simple charset like
+ASCII, there is only one encoding normally used -- each character is
+represented by a single byte, with the same value as its code point.
+For more complicated charsets, however, or when a single encoding needs
+to represent more than one charset, things are not so obvious. Unicode
+version 2, for example, is a large charset with thousands of characters,
+each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0
+for the Hebrew letter "aleph". One obvious encoding (actually two
+encodings, depending on which of the two possible byte orderings is
+chosen) simply uses two bytes per character. This encoding is
+convenient for internal processing of Unicode text; however, it's
+incompatible with ASCII, and thus external text (files, e-mail, etc.)
+that is encoded this way is completely uninterpretable by programs
+lacking Unicode support. For this reason, a different, ASCII-compatible
+encoding, e.g. UTF-8, is usually used for external text. UTF-8
+represents Unicode characters with one to three bytes (often extended to
+six bytes to handle characters with up to 31-bit indices). Unicode
+characters 00 to 7F (identical with ASCII) are directly represented with
+one byte, and other characters with two or more bytes, each in the range
+80 to FF. Applications that don't understand Unicode will still be able
+to process ASCII characters represented in UTF-8-encoded text, and will
+typically ignore (and hopefully preserve) the high-bit characters.
+
+Similarly, Shift-JIS and EUC-JP are different encodings normally used to
+encode the same character set(s), these character sets being subsets of
+Unicode. However, the obvious approach of unifying XEmacs' internal
+encoding across character sets, as was part of the motivation behind
+Unicode, wasn't taken. This means that characters in these character
+sets that are identical to characters in other character sets---for
+example, the Greek alphabet is in the large Japanese character sets and
+at least one European character set---are unfortunately disjoint.
 
 Naive use of code points is also not possible if more than one
 character set is to be used in the encoding. For example, printed
 Japanese text typically requires characters from multiple character sets
 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
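The encoding distinctions drawn in the changed text above can be checked concretely. The sketch below is not part of the texinfo source; it uses Python's built-in codecs for illustration. It shows the two byte orderings of the "obvious" two-byte Unicode encoding, UTF-8's ASCII compatibility (with continuation bytes in the range 80 to FF), and Shift-JIS versus EUC-JP as two different encodings of the same JIS X 0208 character; the Hiragana example character is my own choice, not from the manual.

```python
# Sketch: charset vs. encoding, using Python's standard codecs.

aleph = "\u05D0"  # Hebrew letter aleph, Unicode code point 0x05D0

# The "obvious" two-byte encoding exists in two byte orderings:
print(aleph.encode("utf-16-be").hex())  # 05d0 (big-endian)
print(aleph.encode("utf-16-le").hex())  # d005 (little-endian)

# UTF-8 is ASCII-compatible: code points 00 to 7F are one byte,
# identical to ASCII; other characters use multiple bytes, each
# in the range 80 to FF.
print("A".encode("utf-8").hex())  # 41 -- same as the ASCII byte
utf8_aleph = aleph.encode("utf-8")
print(utf8_aleph.hex())           # d790 -- two bytes, both >= 0x80
assert all(b >= 0x80 for b in utf8_aleph)

# Shift-JIS and EUC-JP: two different encodings of one and the
# same character set member (Hiragana A, from JIS X 0208):
hiragana_a = "\u3042"
print(hiragana_a.encode("shift_jis").hex())  # 82a0
print(hiragana_a.encode("euc_jp").hex())     # a4a2
```

A naive byte-level comparison of the Shift-JIS and EUC-JP outputs shows why external text cannot be interpreted without knowing which encoding was used, even when the underlying character set is identical.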