xemacs-beta: comparison of man/lispref/mule.texi @ 1261:465bd3c7d932
[xemacs-hg @ 2003-02-06 06:35:47 by ben]
various bug fixes
mule/cyril-util.el: Fix compile warning.
loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we
get better debug results.
update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure.
simple.el: Omit M-S-home/end from motion keys.
update-elc.el: Overhaul:
-- allow list of "early-compile" files to be specified, not hardcoded
-- fix autoload checking to include all .el files, not just dumped ones
-- be smarter about regenerating autoloads, so we don't need to use
loadup-el if not necessary
-- use standard methods for loading/not loading auto-autoloads.el
(maybe fixes "Already loaded" error?)
-- rename misleading NOBYTECOMPILE flag file.
window-xemacs.el: Fix bug in default param.
window-xemacs.el: Fix compile warnings.
lwlib-Xm.c: Fix compile warning.
lispref/mule.texi: Lots of Mule rewriting.
internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and
lots of Mule rewriting.
config.inc.samp: Various fixups.
Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE.
esd.c: Warning fixes.
fns.c: Eliminate bogus require-prints-loading-message; use already
existent load-always-display-messages instead. Make sure `load'
knows we are coming from `require'.
lread.c: Turn on `load-warn-when-source-newer' by default. Change loading
message to indicate when we are `require'ing. Eliminate
purify_flag hacks to display more messages; instead, loadup and
friends specify this explicitly with
`load-always-display-messages'. Add spaces when batch to clearly
indicate recursive loading. Fassoc() does not GC so no need to
gcpro.
gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing
of callbacks in lwlib structures.
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or
backtraces get all screwed up as some values are printed through
the stream console and some aren't. Export
canonicalize_printcharfun() and use in Fbacktrace().
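For reference, the debugging setup mentioned at the top of the log amounts to a few variable settings. This is a minimal sketch assuming an XEmacs that defines these variables; `load-always-display-messages' and `load-warn-when-source-newer' are the lread.c variables this commit itself touches, and the exact messages produced will vary by build.

  ;; Debug-friendly settings along the lines used by loadup.el and the
  ;; update-elc files in this commit (sketch; availability of these
  ;; variables depends on the XEmacs build).
  (setq stack-trace-on-error t)          ; show a backtrace whenever an error is signaled
  (setq load-always-display-messages t)  ; announce every file as it is loaded, including via `require'
  (setq load-warn-when-source-newer t)   ; warn when a .elc file is older than its .el source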
author   | ben
date     | Thu, 06 Feb 2003 06:36:17 +0000
parents  | 11ff4edb6bb7
children | d6d41d23b6ec
1260:278c9cd3435e | 1261:465bd3c7d932 |
---|---|
37 number @samp{2}, a Katakana character, a Hangul character, a Kanji | 37 number @samp{2}, a Katakana character, a Hangul character, a Kanji |
38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is | 38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is |
39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there | 39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there |
40 are thousands of such ideographs in each language), etc. The basic | 40 are thousands of such ideographs in each language), etc. The basic |
41 property of a character is that it is the smallest unit of text with | 41 property of a character is that it is the smallest unit of text with |
42 semantic significance in text processing. | 42 semantic significance in text processing---i.e., characters are abstract |
43 units defined by their meaning, not by their exact appearance. | |
43 | 44 |
44 Human beings normally process text visually, so to a first approximation | 45 Human beings normally process text visually, so to a first approximation |
45 a character may be identified with its shape. Note that the same | 46 a character may be identified with its shape. Note that the same |
46 character may be drawn by two different people (or in two different | 47 character may be drawn by two different people (or in two different |
47 fonts) in slightly different ways, although the "basic shape" will be the | 48 fonts) in slightly different ways, although the "basic shape" will be the |
96 different orderings of the same characters are different character sets. | 97 different orderings of the same characters are different character sets. |
97 Identifying characters is simple enough for alphabetic character sets, | 98 Identifying characters is simple enough for alphabetic character sets, |
98 but the difference in ordering can cause great headaches when the same | 99 but the difference in ordering can cause great headaches when the same |
99 thousands of characters are used by different cultures as in the Hanzi.) | 100 thousands of characters are used by different cultures as in the Hanzi.) |
100 | 101 |
101 A code point may be broken into a number of @dfn{position codes}. The | 102 It's important to understand that a character is defined not by any |
102 number of position codes required to index a particular character in a | 103 number attached to it, but by its meaning. For example, ASCII and |
103 character set is called the @dfn{dimension} of the character set. For | 104 EBCDIC are two charsets containing exactly the same characters |
104 practical purposes, a position code may be thought of as a byte-sized | 105 (lowercase and uppercase letters, numbers 0 through 9, particular |
105 index. The printing characters of ASCII, being a relatively small | 106 punctuation marks) but with different numberings. The @samp{comma} |
106 character set, is of dimension one, and each character in the set is | 107 character in ASCII and EBCDIC, for instance, is the same character |
107 indexed using a single position code, in the range 1 through 94. Use of | 108 despite having a different numbering. Conversely, when comparing ASCII |
108 this unusual range, rather than the familiar 33 through 126, is an | 109 and JIS-Roman, which look the same except that the latter has a yen sign |
109 intentional abstraction; to understand the programming issues you must | 110 substituted for the backslash, we would say that the backslash and yen |
110 break the equation between character sets and encodings. | 111 sign are @emph{not} the same characters, despite having the same number |
111 | 112 (92) and despite the fact that all other characters are present in both |
112 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is | 113 charsets, with the same numbering. ASCII and JIS-Roman, then, do |
113 of dimension two -- every character is indexed by two position codes, | 114 @emph{not} have exactly the same characters in them (ASCII has a |
114 each in the range 1 through 94. (This number ``94'' is not a | 115 backslash character but no yen-sign character, and vice-versa for |
115 coincidence; we shall see that the JIS position codes were chosen so | 116 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII |
116 that JIS kanji could be encoded without using codes that in ASCII are | 117 and JIS-Roman are closer. |
117 associated with device control functions.) Note that the choice of the | 118 |
118 range here is somewhat arbitrary. You could just as easily index the | 119 Sometimes, a code point is not a single number, but instead a group of |
119 printing characters in ASCII using numbers in the range 0 through 93, 2 | 120 numbers, called @dfn{position codes}. In such cases, the number of |
120 through 95, 3 through 96, etc. In fact, the standardized | 121 position codes required to index a particular character in a character |
121 @emph{encoding} for the ASCII @emph{character set} uses the range 33 | 122 set is called the @dfn{dimension} of the character set. Character sets |
122 through 126. | 123 indexed by more than one position code typically use byte-sized position |
124 codes. Small character sets, e.g. ASCII, invariably use a single | |
125 position code, but for larger character sets, the choice of whether to | |
126 use multiple position codes or a single large (16-bit or 32-bit) number | |
127 is arbitrary. Unicode typically uses a single large number, but | |
128 language-specific or "national" character sets often use multiple | |
129 (usually two) position codes. For example, JIS X 0208, i.e. Japanese | |
130 Kanji, has thousands of characters, and is of dimension two -- every | |
131 character is indexed by two position codes, each in the range 1 through | |
132 94. (This number ``94'' is not a coincidence; it is the same as the | |
133 number of printable characters in ASCII, and was chosen so that JIS | |
134 characters could be directly encoded using two printable ASCII | |
135 characters.) Note that the choice of the range here is somewhat | |
136 arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. | |
137 In fact, the range for JIS position codes (and for other character sets | |
138 modeled after it) is often given as range 33 through 126, so as to | |
139 directly match ASCII printing characters. | |
123 | 140 |
124 An @dfn{encoding} is a way of numerically representing characters from | 141 An @dfn{encoding} is a way of numerically representing characters from |
125 one or more character sets into a stream of like-sized numerical values | 142 one or more character sets into a stream of like-sized numerical values |
126 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit | 143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or |
127 quantities. If an encoding encompasses only one character set, then the | 144 32-bit quantities. It's very important to clearly distinguish between |
128 position codes for the characters in that character set could be used | 145 charsets and encodings. For a simple charset like ASCII, there is only |
129 directly. (This is the case with the trivial cipher used by children, | 146 one encoding normally used -- each character is represented by a single |
130 assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII, | 147 byte, with the same value as its code point. For more complicated |
131 other considerations intrude. For example, why are the upper- and | 148 charsets, however, or when a single encoding needs to represent more |
132 lowercase alphabets separated by 8 characters? Why do the digits start | 149 than one charset, things are not so obvious. Unicode version 2, for |
133 with `0' being assigned the code 48? In both cases because semantically | 150 example, is a large charset with thousands of characters, each indexed |
134 interesting operations (case conversion and numerical value extraction) | 151 by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew |
135 become convenient masking operations. Other artificial aspects (the | 152 letter "aleph". One obvious encoding (actually two encodings, depending |
136 control characters being assigned to codes 0--31 and 127) are historical | 153 on which of the two possible byte orderings is chosen) simply uses two |
137 accidents. (The use of 127 for @samp{DEL} is an artifact of the "punch | 154 bytes per character. This encoding is convenient for internal |
138 once" nature of paper tape, for example.) | 155 processing of Unicode text; however, it's incompatible with ASCII, and |
139 | 156 thus external text (files, e-mail, etc.) that is encoded this way is |
140 Naive use of the position code is not possible, however, if more than | 157 completely uninterpretable by programs lacking Unicode support. For |
141 one character set is to be used in the encoding. For example, printed | 158 this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is |
159 usually used for external text. UTF-8 represents Unicode characters | |
160 with one to three bytes (often extended to six bytes to handle | |
161 characters with up to 31-bit indices). Unicode characters 00 to 7F | |
162 (identical with ASCII) are directly represented with one byte, and other | |
163 characters with two or more bytes, each in the range 80 to FF. | |
164 Applications that don't understand Unicode will still be able to process | |
165 ASCII characters represented in UTF-8-encoded text, and will typically | |
166 ignore (and hopefully preserve) the high-bit characters. | |
167 | |
168 Naive use of code points is also not possible if more than one | |
169 character set is to be used in the encoding. For example, printed | |
142 Japanese text typically requires characters from multiple character sets | 170 Japanese text typically requires characters from multiple character sets |
143 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | 171 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is |
144 indexed using one or more position codes in the range 1 through 94, so | 172 indexed using one or more position codes in the range 1 through 94 (or |
145 the position codes could not be used directly or there would be no way | 173 33 through 126), so the position codes could not be used directly or |
146 to tell which character was meant. Different Japanese encodings handle | 174 there would be no way to tell which character was meant. Different |
147 this differently -- JIS uses special escape characters to denote | 175 Japanese encodings handle this differently -- JIS uses special escape |
148 different character sets; EUC sets the high bit of the position codes | 176 characters to denote different character sets; EUC sets the high bit of |
149 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each | 177 the position codes for JIS X 0208 and JIS X 0212, and puts a special |
150 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings | 178 extra byte before each JIS X 0212 character; etc. |
151 you will encounter in files are 7-bit or 8-bit encodings. There is one | 179 |
152 common 16-bit encoding, which is Unicode; this strives to represent all | 180 The encodings described above are all 7-bit or 8-bit encodings. The |
153 the world's characters in a single large character set. 32-bit | 181 fixed-width Unicode encoding previously described, however, is sometimes |
154 encodings are often used internally in programs, such as XEmacs with | 182 considered to be a 16-bit encoding, in which case the issue of byte |
155 MULE support, to simplify the code that manipulates them; however, they | 183 ordering does not come up. (Imagine, for example, that the text is |
156 are not used externally because they are not very space-efficient.) | 184 represented as an array of shorts.) Similarly, Unicode version 3 (which |
185 has characters with indices above 0xFFFF), and other very large | |
186 character sets, may be represented internally as 32-bit encodings, | |
187 i.e. arrays of ints. However, it does not make too much sense to talk | |
188 about 16-bit or 32-bit encodings for external data, since nowadays 8-bit | |
189 data is a universal standard -- the closest you can get is fixed-width | |
190 encodings using two or four bytes to encode 16-bit or 32-bit values. (A | |
191 "7-bit" encoding is used when it cannot be guaranteed that the high bit | |
192 of 8-bit data will be correctly preserved. Some e-mail gateways, for | |
193 example, strip the high bit of text passing through them. These same | |
194 gateways often handle non-printable characters incorrectly, and so 7-bit | |
195 encodings usually avoid using bytes with such values.) | |
157 | 196 |
158 A general method of handling text using multiple character sets | 197 A general method of handling text using multiple character sets |
159 (whether for multilingual text, or simply text in an extremely | 198 (whether for multilingual text, or simply text in an extremely |
160 complicated single language like Japanese) is defined in the | 199 complicated single language like Japanese) is defined in the |
161 international standard ISO 2022. ISO 2022 will be discussed in more | 200 international standard ISO 2022. ISO 2022 will be discussed in more |
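To make the revised discussion of characters, position codes, and charset dimension concrete, here is a minimal XEmacs Lisp sketch. It assumes a Mule-enabled build providing `charset-dimension', `char-charset', `make-char', and `split-char'; the values in the comments are illustrative, since the printed form of charset objects and the exact octet convention can differ between builds.

  ;; ASCII is a dimension-1 charset: one position code per character.
  (charset-dimension 'ascii)               ; => 1
  ;; JIS X 0208 is dimension 2: two position codes, each nominally in
  ;; the range 33 through 126 (equivalently, 1 through 94).
  (charset-dimension 'japanese-jisx0208)   ; => 2
  ;; A character is identified by its charset plus its position codes.
  (char-charset ?a)                        ; => the ascii charset
  ;; Build a JIS X 0208 character from two position codes, then take
  ;; it apart again.
  (split-char (make-char 'japanese-jisx0208 36 34))
  ;; => (japanese-jisx0208 36 34)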
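The contrast drawn above between a fixed-width 16-bit Unicode encoding and UTF-8 can be checked by hand for the aleph example (code point 0x05D0). The arithmetic below follows from the UTF-8 rules quoted in the new text; the `encode-coding-string' line is left as a comment because it assumes a build whose Mule support includes a `utf-8' coding system, which not every XEmacs of this era provides.

  ;; Hebrew aleph, Unicode code point #x05D0.
  ;;
  ;; Fixed-width 16-bit encoding: the two bytes 05 D0 (or D0 05,
  ;; depending on the byte order chosen); compact, but not
  ;; ASCII-compatible.
  ;;
  ;; UTF-8: #x05D0 needs 11 significant bits, so it becomes two bytes
  ;; of the form 110xxxxx 10xxxxxx:
  ;;
  ;;   00000101 11010000  ->  11010111 10010000  =  #xD7 #x90
  ;;
  ;; ASCII characters (00 through 7F) pass through as single bytes,
  ;; which is why UTF-8 text degrades gracefully in non-Unicode programs:
  ;;
  ;;   (encode-coding-string "abc" 'utf-8)   ; => "abc"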
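The paragraph about encodings that combine several charsets (EUC setting the high bit of the position codes, JIS using escape sequences) can be illustrated the same way. This sketch assumes the standard `euc-jp' and `iso-2022-jp' coding systems and the usual `decode-coding-string'/`encode-coding-string' interface; the escape sequence shown is the conventional ISO 2022 designation for JIS X 0208.

  ;; The EUC-JP bytes A4 A2 are the JIS X 0208 position codes 24 22
  ;; with their high bits set; decoding them yields a single character
  ;; (a Hiragana letter).  Re-encoding it with ISO-2022-JP instead
  ;; switches charsets via an escape sequence and uses 7-bit position
  ;; codes.
  (let ((c (decode-coding-string "\244\242" 'euc-jp)))
    (encode-coding-string c 'iso-2022-jp))
  ;; => "\e$B$\"\e(B"
  ;;    that is: ESC $ B (designate JIS X 0208), the codes 24 22,
  ;;    then ESC ( B to return to ASCII.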