comparison man/lispref/mule.texi @ 1261:465bd3c7d932

[xemacs-hg @ 2003-02-06 06:35:47 by ben] various bug fixes

mule/cyril-util.el: Fix compile warning.
loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set
  stack-trace-on-error, load-always-display-messages so we get better
  debug results.
update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure.
simple.el: Omit M-S-home/end from motion keys.
update-elc.el: Overhaul:
  -- allow list of "early-compile" files to be specified, not hardcoded
  -- fix autoload checking to include all .el files, not just dumped ones
  -- be smarter about regenerating autoloads, so we don't need to use
     loadup-el if not necessary
  -- use standard methods for loading/not loading auto-autoloads.el
     (maybe fixes "Already loaded" error?)
  -- rename misleading NOBYTECOMPILE flag file.
window-xemacs.el: Fix bug in default param.
window-xemacs.el: Fix compile warnings.
lwlib-Xm.c: Fix compile warning.
lispref/mule.texi: Lots of Mule rewriting.
internals/internals.texi: Major fixup. Correct for new names of Bytebpos,
  Ichar, etc. and lots of Mule rewriting.
config.inc.samp: Various fixups.
Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE.
esd.c: Warning fixes.
fns.c: Eliminate bogus require-prints-loading-message; use already
  existent load-always-display-messages instead. Make sure `load' knows
  we are coming from `require'.
lread.c: Turn on `load-warn-when-source-newer' by default. Change
  loading message to indicate when we are `require'ing. Eliminate
  purify_flag hacks to display more messages; instead, loadup and
  friends specify this explicitly with `load-always-display-messages'.
  Add spaces when batch to clearly indicate recursive loading. Fassoc()
  does not GC so no need to gcpro.
gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar
  items due to lack of GCPROing of callbacks in lwlib structures.
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when
  noninteractive, or backtraces get all screwed up as some values are
  printed through the stream console and some aren't. Export
  canonicalize_printcharfun() and use in Fbacktrace().
author ben
date Thu, 06 Feb 2003 06:36:17 +0000
parents 11ff4edb6bb7
children d6d41d23b6ec
1260:278c9cd3435e 1261:465bd3c7d932
 number @samp{2}, a Katakana character, a Hangul character, a Kanji
 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is
 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
 are thousands of such ideographs in each language), etc. The basic
 property of a character is that it is the smallest unit of text with
-semantic significance in text processing.
+semantic significance in text processing---i.e., characters are abstract
+units defined by their meaning, not by their exact appearance.
 
 Human beings normally process text visually, so to a first approximation
 a character may be identified with its shape. Note that the same
 character may be drawn by two different people (or in two different
 fonts) in slightly different ways, although the "basic shape" will be the
 different orderings of the same characters are different character sets.
 Identifying characters is simple enough for alphabetic character sets,
 but the difference in ordering can cause great headaches when the same
 thousands of characters are used by different cultures as in the Hanzi.)
 
-A code point may be broken into a number of @dfn{position codes}. The
-number of position codes required to index a particular character in a
-character set is called the @dfn{dimension} of the character set. For
-practical purposes, a position code may be thought of as a byte-sized
-index. The printing characters of ASCII, being a relatively small
-character set, is of dimension one, and each character in the set is
-indexed using a single position code, in the range 1 through 94. Use of
-this unusual range, rather than the familiar 33 through 126, is an
-intentional abstraction; to understand the programming issues you must
-break the equation between character sets and encodings.
-
-JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
-of dimension two -- every character is indexed by two position codes,
-each in the range 1 through 94. (This number ``94'' is not a
-coincidence; we shall see that the JIS position codes were chosen so
-that JIS kanji could be encoded without using codes that in ASCII are
-associated with device control functions.) Note that the choice of the
-range here is somewhat arbitrary. You could just as easily index the
-printing characters in ASCII using numbers in the range 0 through 93, 2
-through 95, 3 through 96, etc. In fact, the standardized
-@emph{encoding} for the ASCII @emph{character set} uses the range 33
-through 126.
+It's important to understand that a character is defined not by any
+number attached to it, but by its meaning. For example, ASCII and
+EBCDIC are two charsets containing exactly the same characters
+(lowercase and uppercase letters, numbers 0 through 9, particular
+punctuation marks) but with different numberings. The @samp{comma}
+character in ASCII and EBCDIC, for instance, is the same character
+despite having a different numbering. Conversely, when comparing ASCII
+and JIS-Roman, which look the same except that the latter has a yen sign
+substituted for the backslash, we would say that the backslash and yen
+sign are @emph{not} the same characters, despite having the same number
+(95) and despite the fact that all other characters are present in both
+charsets, with the same numbering. ASCII and JIS-Roman, then, do
+@emph{not} have exactly the same characters in them (ASCII has a
+backslash character but no yen-sign character, and vice-versa for
+JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII
+and JIS-Roman are closer.
+
+Sometimes, a code point is not a single number, but instead a group of
+numbers, called @dfn{position codes}. In such cases, the number of
+position codes required to index a particular character in a character
+set is called the @dfn{dimension} of the character set. Character sets
+indexed by more than one position code typically use byte-sized position
+codes. Small character sets, e.g. ASCII, invariably use a single
+position code, but for larger character sets, the choice of whether to
+use multiple position codes or a single large (16-bit or 32-bit) number
+is arbitrary. Unicode typically uses a single large number, but
+language-specific or "national" character sets often use multiple
+(usually two) position codes. For example, JIS X 0208, i.e. Japanese
+Kanji, has thousands of characters, and is of dimension two -- every
+character is indexed by two position codes, each in the range 1 through
+94. (This number ``94'' is not a coincidence; it is the same as the
+number of printable characters in ASCII, and was chosen so that JIS
+characters could be directly encoded using two printable ASCII
+characters.) Note that the choice of the range here is somewhat
+arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc.
+In fact, the range for JIS position codes (and for other character sets
+modeled after it) is often given as the range 33 through 126, so as to
+directly match ASCII printing characters.
 
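The two points made above -- that a character is defined by its meaning
rather than its number, and that a multi-dimensional charset indexes each
character with several position codes -- can be illustrated with a short
sketch. This is not XEmacs/Mule code; it uses Python's standard codecs,
with @samp{cp037} standing in for one common EBCDIC code page:

```python
# The comma is the *same character* in ASCII and EBCDIC, even though
# the two charsets number it differently.
ascii_comma = ','.encode('ascii')       # code point 44 (0x2C)
ebcdic_comma = ','.encode('cp037')      # cp037 is one EBCDIC variant
print(ascii_comma[0], ebcdic_comma[0])  # 44 107 -- different numbers
assert ascii_comma.decode('ascii') == ebcdic_comma.decode('cp037')

# JIS X 0208 is a dimension-two charset: each character is indexed by
# two position codes.  Encoding Hiragana A (U+3042) as ISO-2022-JP
# wraps the two JIS bytes in escape sequences; subtracting 32 (0x20)
# from each byte recovers position codes in the 1 through 94 range.
jis = '\u3042'.encode('iso2022_jp')     # b'\x1b$B$"\x1b(B'
hi, lo = jis[3], jis[4]                 # the two JIS bytes: 0x24, 0x22
print(hi - 0x20, lo - 0x20)             # 4 2: row 4, cell 2
```

Note how the JIS bytes 0x24 and 0x22 fall in the 33 through 126 range,
i.e. they are themselves printable ASCII characters, as described above.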
 An @dfn{encoding} is a way of numerically representing characters from
 one or more character sets into a stream of like-sized numerical values
-called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
-quantities. If an encoding encompasses only one character set, then the
-position codes for the characters in that character set could be used
-directly. (This is the case with the trivial cipher used by children,
-assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
-other considerations intrude. For example, why are the upper- and
-lowercase alphabets separated by 8 characters? Why do the digits start
-with `0' being assigned the code 48? In both cases because semantically
-interesting operations (case conversion and numerical value extraction)
-become convenient masking operations. Other artificial aspects (the
-control characters being assigned to codes 0--31 and 127) are historical
-accidents. (The use of 127 for @samp{DEL} is an artifact of the "punch
-once" nature of paper tape, for example.)
+called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or
+32-bit quantities. It's very important to clearly distinguish between
+charsets and encodings. For a simple charset like ASCII, there is only
+one encoding normally used -- each character is represented by a single
+byte, with the same value as its code point. For more complicated
+charsets, however, or when a single encoding needs to represent more
+than one charset, things are not so obvious. Unicode version 2, for
+example, is a large charset with thousands of characters, each indexed
+by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew
+letter "aleph". One obvious encoding (actually two encodings, depending
+on which of the two possible byte orderings is chosen) simply uses two
+bytes per character. This encoding is convenient for internal
+processing of Unicode text; however, it's incompatible with ASCII, and
+thus external text (files, e-mail, etc.) that is encoded this way is
+completely uninterpretable by programs lacking Unicode support. For
+this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is
+usually used for external text. UTF-8 represents Unicode characters
+with one to three bytes (often extended to six bytes to handle
+characters with up to 31-bit indices). Unicode characters 00 to 7F
+(identical with ASCII) are directly represented with one byte, and other
+characters with two or more bytes, each in the range 80 to FF.
+Applications that don't understand Unicode will still be able to process
+ASCII characters represented in UTF-8-encoded text, and will typically
+ignore (and hopefully preserve) the high-bit characters.
 
-Naive use of the position code is not possible, however, if more than
-one character set is to be used in the encoding. For example, printed
+Naive use of code points is also not possible if more than one
+character set is to be used in the encoding. For example, printed
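The Unicode behavior described above -- the two byte orderings of the
fixed-width encoding, and UTF-8's ASCII compatibility -- can be checked
with another short sketch using Python's standard codecs (illustrative
only, not Mule's own representation):

```python
aleph = '\u05d0'  # Hebrew letter aleph, code point 0x05D0

# The "obvious" fixed-width encoding comes in two flavors, one per
# byte ordering.
print(aleph.encode('utf-16-be'))  # b'\x05\xd0' (big-endian)
print(aleph.encode('utf-16-le'))  # b'\xd0\x05' (little-endian)

# UTF-8 leaves ASCII bytes alone and uses only bytes in the range
# 0x80 to 0xFF for everything else, so non-Unicode-aware tools can
# still recognize the ASCII portion of the text.
assert 'A'.encode('utf-8') == b'A'           # one byte, same as ASCII
assert aleph.encode('utf-8') == b'\xd7\x90'  # two bytes, both >= 0x80
assert all(b >= 0x80 for b in aleph.encode('utf-8'))
```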
 Japanese text typically requires characters from multiple character sets
 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
-indexed using one or more position codes in the range 1 through 94, so
-the position codes could not be used directly or there would be no way
-to tell which character was meant. Different Japanese encodings handle
-this differently -- JIS uses special escape characters to denote
-different character sets; EUC sets the high bit of the position codes
-for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
-JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
-you will encounter in files are 7-bit or 8-bit encodings. There is one
-common 16-bit encoding, which is Unicode; this strives to represent all
-the world's characters in a single large character set. 32-bit
-encodings are often used internally in programs, such as XEmacs with
-MULE support, to simplify the code that manipulates them; however, they
-are not used externally because they are not very space-efficient.)
+indexed using one or more position codes in the range 1 through 94 (or
+33 through 126), so the position codes could not be used directly or
+there would be no way to tell which character was meant. Different
+Japanese encodings handle this differently -- JIS uses special escape
+characters to denote different character sets; EUC sets the high bit of
+the position codes for JIS X 0208 and JIS X 0212, and puts a special
+extra byte before each JIS X 0212 character; etc.
+
+The encodings described above are all 7-bit or 8-bit encodings. The
+fixed-width Unicode encoding previously described, however, is sometimes
+considered to be a 16-bit encoding, in which case the issue of byte
+ordering does not come up. (Imagine, for example, that the text is
+represented as an array of shorts.) Similarly, Unicode version 3 (which
+has characters with indices above 0xFFFF), and other very large
+character sets, may be represented internally as 32-bit encodings,
+i.e. arrays of ints. However, it does not make too much sense to talk
+about 16-bit or 32-bit encodings for external data, since nowadays 8-bit
+data is a universal standard -- the closest you can get is fixed-width
+encodings using two or four bytes to encode 16-bit or 32-bit values. (A
+"7-bit" encoding is used when it cannot be guaranteed that the high bit
+of 8-bit data will be correctly preserved. Some e-mail gateways, for
+example, strip the high bit of text passing through them. These same
+gateways often handle non-printable characters incorrectly, and so 7-bit
+encodings usually avoid using bytes with such values.)
 
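The two strategies for mixing charsets described above -- JIS-style
escape sequences versus EUC-style high-bit marking -- can be observed
by encoding the same Japanese character both ways. Again a sketch with
Python's standard codecs, not Mule code:

```python
ch = '\u3042'  # Hiragana A, JIS X 0208 position codes (4, 2)

# ISO-2022-JP (a JIS encoding): escape sequences switch charsets, and
# every byte stays below 0x80, making it a 7-bit encoding.
jis = ch.encode('iso2022_jp')
print(jis)                          # b'\x1b$B$"\x1b(B'
assert all(b < 0x80 for b in jis)

# EUC-JP: no escapes; instead the high bit of each JIS X 0208 byte is
# set, so JIS X 0208 bytes never collide with plain ASCII bytes.
euc = ch.encode('euc_jp')
print(euc)                          # b'\xa4\xa2'
assert euc == bytes(b | 0x80 for b in jis[3:5])
```

The final assertion shows the relationship directly: the EUC-JP bytes
are exactly the two JIS bytes with the high bit set.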
 A general method of handling text using multiple character sets
 (whether for multilingual text, or simply text in an extremely
 complicated single language like Japanese) is defined in the
 international standard ISO 2022. ISO 2022 will be discussed in more