xemacs-beta: comparison of man/lispref/mule.texi @ 1261:465bd3c7d932
[xemacs-hg @ 2003-02-06 06:35:47 by ben]
various bug fixes
mule/cyril-util.el: Fix compile warning.
loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we
get better debug results.
update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure.
simple.el: Omit M-S-home/end from motion keys.
update-elc.el: Overhaul:
-- allow list of "early-compile" files to be specified, not hardcoded
-- fix autoload checking to include all .el files, not just dumped ones
-- be smarter about regenerating autoloads, so we don't need to use
loadup-el if not necessary
-- use standard methods for loading/not loading auto-autoloads.el
(maybe fixes "Already loaded" error?)
-- rename misleading NOBYTECOMPILE flag file.
window-xemacs.el: Fix bug in default param.
window-xemacs.el: Fix compile warnings.
lwlib-Xm.c: Fix compile warning.
lispref/mule.texi: Lots of Mule rewriting.
internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and
lots of Mule rewriting.
config.inc.samp: Various fixups.
Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE.
esd.c: Warning fixes.
fns.c: Eliminate bogus require-prints-loading-message; use already
existent load-always-display-messages instead. Make sure `load'
knows we are coming from `require'.
lread.c: Turn on `load-warn-when-source-newer' by default. Change loading
message to indicate when we are `require'ing. Eliminate
purify_flag hacks to display more messages; instead, loadup and
friends specify this explicitly with
`load-always-display-messages'. Add spaces when batch to clearly
indicate recursive loading. Fassoc() does not GC so no need to
gcpro.
gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing
of callbacks in lwlib structures.
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or
backtraces get all screwed up as some values are printed through
the stream console and some aren't. Export
canonicalize_printcharfun() and use in Fbacktrace().
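For reference, the debugging setup mentioned at the top of the log amounts to a few variable settings. This is a minimal sketch assuming an XEmacs that defines these variables; `load-always-display-messages' and `load-warn-when-source-newer' are the lread.c variables this commit itself touches, and the exact messages produced will vary by build.

  ;; Debug-friendly settings along the lines used by loadup.el and the
  ;; update-elc files in this commit (sketch; availability of these
  ;; variables depends on the XEmacs build).
  (setq stack-trace-on-error t)          ; show a backtrace whenever an error is signaled
  (setq load-always-display-messages t)  ; announce every file as it is loaded, including via `require'
  (setq load-warn-when-source-newer t)   ; warn when a .elc file is older than its .el source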
author   | ben
date     | Thu, 06 Feb 2003 06:36:17 +0000
parents  | 11ff4edb6bb7
children | d6d41d23b6ec
1260:278c9cd3435e | 1261:465bd3c7d932 |
---|---|
37 number @samp{2}, a Katakana character, a Hangul character, a Kanji | 37 number @samp{2}, a Katakana character, a Hangul character, a Kanji |
38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is | 38 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is |
39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there | 39 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there |
40 are thousands of such ideographs in each language), etc. The basic | 40 are thousands of such ideographs in each language), etc. The basic |
41 property of a character is that it is the smallest unit of text with | 41 property of a character is that it is the smallest unit of text with |
42 semantic significance in text processing. | 42 semantic significance in text processing---i.e., characters are abstract |
43 units defined by their meaning, not by their exact appearance. | |
43 | 44 |
44 Human beings normally process text visually, so to a first approximation | 45 Human beings normally process text visually, so to a first approximation |
45 a character may be identified with its shape. Note that the same | 46 a character may be identified with its shape. Note that the same |
46 character may be drawn by two different people (or in two different | 47 character may be drawn by two different people (or in two different |
47 fonts) in slightly different ways, although the "basic shape" will be the | 48 fonts) in slightly different ways, although the "basic shape" will be the |
96 different orderings of the same characters are different character sets. | 97 different orderings of the same characters are different character sets. |
97 Identifying characters is simple enough for alphabetic character sets, | 98 Identifying characters is simple enough for alphabetic character sets, |
98 but the difference in ordering can cause great headaches when the same | 99 but the difference in ordering can cause great headaches when the same |
99 thousands of characters are used by different cultures as in the Hanzi.) | 100 thousands of characters are used by different cultures as in the Hanzi.) |
100 | 101 |
101 A code point may be broken into a number of @dfn{position codes}. The | 102 It's important to understand that a character is defined not by any |
102 number of position codes required to index a particular character in a | 103 number attached to it, but by its meaning. For example, ASCII and |
103 character set is called the @dfn{dimension} of the character set. For | 104 EBCDIC are two charsets containing exactly the same characters |
104 practical purposes, a position code may be thought of as a byte-sized | 105 (lowercase and uppercase letters, numbers 0 through 9, particular |
105 index. The printing characters of ASCII, being a relatively small | 106 punctuation marks) but with different numberings. The @samp{comma} |
106 character set, is of dimension one, and each character in the set is | 107 character in ASCII and EBCDIC, for instance, is the same character |
107 indexed using a single position code, in the range 1 through 94. Use of | 108 despite having a different numbering. Conversely, when comparing ASCII |
108 this unusual range, rather than the familiar 33 through 126, is an | 109 and JIS-Roman, which look the same except that the latter has a yen sign |
109 intentional abstraction; to understand the programming issues you must | 110 substituted for the backslash, we would say that the backslash and yen |
110 break the equation between character sets and encodings. | 111 sign are @emph{not} the same characters, despite having the same number |
111 | 112 (92) and despite the fact that all other characters are present in both |
112 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is | 113 charsets, with the same numbering. ASCII and JIS-Roman, then, do |
113 of dimension two -- every character is indexed by two position codes, | 114 @emph{not} have exactly the same characters in them (ASCII has a |
114 each in the range 1 through 94. (This number ``94'' is not a | 115 backslash character but no yen-sign character, and vice-versa for |
115 coincidence; we shall see that the JIS position codes were chosen so | 116 JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII |
116 that JIS kanji could be encoded without using codes that in ASCII are | 117 and JIS-Roman are closer. |
117 associated with device control functions.) Note that the choice of the | 118 |
118 range here is somewhat arbitrary. You could just as easily index the | 119 Sometimes, a code point is not a single number, but instead a group of |
119 printing characters in ASCII using numbers in the range 0 through 93, 2 | 120 numbers, called @dfn{position codes}. In such cases, the number of |
120 through 95, 3 through 96, etc. In fact, the standardized | 121 position codes required to index a particular character in a character |
121 @emph{encoding} for the ASCII @emph{character set} uses the range 33 | 122 set is called the @dfn{dimension} of the character set. Character sets |
122 through 126. | 123 indexed by more than one position code typically use byte-sized position |
124 codes. Small character sets, e.g. ASCII, invariably use a single | |
125 position code, but for larger character sets, the choice of whether to | |
126 use multiple position codes or a single large (16-bit or 32-bit) number | |
127 is arbitrary. Unicode typically uses a single large number, but | |
128 language-specific or "national" character sets often use multiple | |
129 (usually two) position codes. For example, JIS X 0208, i.e. Japanese | |
130 Kanji, has thousands of characters, and is of dimension two -- every | |
131 character is indexed by two position codes, each in the range 1 through | |
132 94. (This number ``94'' is not a coincidence; it is the same as the | |
133 number of printable characters in ASCII, and was chosen so that JIS | |
134 characters could be directly encoded using two printable ASCII | |
135 characters.) Note that the choice of the range here is somewhat | |
136 arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. | |
137 In fact, the range for JIS position codes (and for other character sets | |
138 modeled after it) is often given as range 33 through 126, so as to | |
139 directly match ASCII printing characters. | |
123 | 140 |
124 An @dfn{encoding} is a way of numerically representing characters from | 141 An @dfn{encoding} is a way of numerically representing characters from |
125 one or more character sets into a stream of like-sized numerical values | 142 one or more character sets into a stream of like-sized numerical values |
126 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit | 143 called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or |
127 quantities. If an encoding encompasses only one character set, then the | 144 32-bit quantities. It's very important to clearly distinguish between |
128 position codes for the characters in that character set could be used | 145 charsets and encodings. For a simple charset like ASCII, there is only |
129 directly. (This is the case with the trivial cipher used by children, | 146 one encoding normally used -- each character is represented by a single |
130 assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII, | 147 byte, with the same value as its code point. For more complicated |
131 other considerations intrude. For example, why are the upper- and | 148 charsets, however, or when a single encoding needs to represent more |
132 lowercase alphabets separated by 8 characters? Why do the digits start | 149 than one charset, things are not so obvious. Unicode version 2, for |
133 with `0' being assigned the code 48? In both cases because semantically | 150 example, is a large charset with thousands of characters, each indexed |
134 interesting operations (case conversion and numerical value extraction) | 151 by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew |
135 become convenient masking operations. Other artificial aspects (the | 152 letter "aleph". One obvious encoding (actually two encodings, depending |
136 control characters being assigned to codes 0--31 and 127) are historical | 153 on which of the two possible byte orderings is chosen) simply uses two |
137 accidents. (The use of 127 for @samp{DEL} is an artifact of the "punch | 154 bytes per character. This encoding is convenient for internal |
138 once" nature of paper tape, for example.) | 155 processing of Unicode text; however, it's incompatible with ASCII, and |
139 | 156 thus external text (files, e-mail, etc.) that is encoded this way is |
140 Naive use of the position code is not possible, however, if more than | 157 completely uninterpretable by programs lacking Unicode support. For |
141 one character set is to be used in the encoding. For example, printed | 158 this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is |
159 usually used for external text. UTF-8 represents Unicode characters | |
160 with one to three bytes (often extended to six bytes to handle | |
161 characters with up to 31-bit indices). Unicode characters 00 to 7F | |
162 (identical with ASCII) are directly represented with one byte, and other | |
163 characters with two or more bytes, each in the range 80 to FF. | |
164 Applications that don't understand Unicode will still be able to process | |
165 ASCII characters represented in UTF-8-encoded text, and will typically | |
166 ignore (and hopefully preserve) the high-bit characters. | |
167 | |
168 Naive use of code points is also not possible if more than one | |
169 character set is to be used in the encoding. For example, printed | |
142 Japanese text typically requires characters from multiple character sets | 170 Japanese text typically requires characters from multiple character sets |
143 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is | 171 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is |
144 indexed using one or more position codes in the range 1 through 94, so | 172 indexed using one or more position codes in the range 1 through 94 (or |
145 the position codes could not be used directly or there would be no way | 173 33 through 126), so the position codes could not be used directly or |
146 to tell which character was meant. Different Japanese encodings handle | 174 there would be no way to tell which character was meant. Different |
147 this differently -- JIS uses special escape characters to denote | 175 Japanese encodings handle this differently -- JIS uses special escape |
148 different character sets; EUC sets the high bit of the position codes | 176 characters to denote different character sets; EUC sets the high bit of |
149 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each | 177 the position codes for JIS X 0208 and JIS X 0212, and puts a special |
150 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings | 178 extra byte before each JIS X 0212 character; etc. |
151 you will encounter in files are 7-bit or 8-bit encodings. There is one | 179 |
152 common 16-bit encoding, which is Unicode; this strives to represent all | 180 The encodings described above are all 7-bit or 8-bit encodings. The |
153 the world's characters in a single large character set. 32-bit | 181 fixed-width Unicode encoding previously described, however, is sometimes |
154 encodings are often used internally in programs, such as XEmacs with | 182 considered to be a 16-bit encoding, in which case the issue of byte |
155 MULE support, to simplify the code that manipulates them; however, they | 183 ordering does not come up. (Imagine, for example, that the text is |
156 are not used externally because they are not very space-efficient.) | 184 represented as an array of shorts.) Similarly, Unicode version 3 (which |
185 has characters with indices above 0xFFFF), and other very large | |
186 character sets, may be represented internally as 32-bit encodings, | |
187 i.e. arrays of ints. However, it does not make too much sense to talk | |
188 about 16-bit or 32-bit encodings for external data, since nowadays 8-bit | |
189 data is a universal standard -- the closest you can get is fixed-width | |
190 encodings using two or four bytes to encode 16-bit or 32-bit values. (A | |
191 "7-bit" encoding is used when it cannot be guaranteed that the high bit | |
192 of 8-bit data will be correctly preserved. Some e-mail gateways, for | |
193 example, strip the high bit of text passing through them. These same | |
194 gateways often handle non-printable characters incorrectly, and so 7-bit | |
195 encodings usually avoid using bytes with such values.) | |
157 | 196 |
158 A general method of handling text using multiple character sets | 197 A general method of handling text using multiple character sets |
159 (whether for multilingual text, or simply text in an extremely | 198 (whether for multilingual text, or simply text in an extremely |
160 complicated single language like Japanese) is defined in the | 199 complicated single language like Japanese) is defined in the |
161 international standard ISO 2022. ISO 2022 will be discussed in more | 200 international standard ISO 2022. ISO 2022 will be discussed in more |
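To make the revised discussion of characters, position codes, and charset dimension concrete, here is a minimal XEmacs Lisp sketch. It assumes a Mule-enabled build providing `charset-dimension', `char-charset', `make-char', and `split-char'; the values in the comments are illustrative, since the printed form of charset objects and the exact octet convention can differ between builds.

  ;; ASCII is a dimension-1 charset: one position code per character.
  (charset-dimension 'ascii)               ; => 1
  ;; JIS X 0208 is dimension 2: two position codes, each nominally in
  ;; the range 33 through 126 (equivalently, 1 through 94).
  (charset-dimension 'japanese-jisx0208)   ; => 2
  ;; A character is identified by its charset plus its position codes.
  (char-charset ?a)                        ; => the ascii charset
  ;; Build a JIS X 0208 character from two position codes, then take
  ;; it apart again.
  (split-char (make-char 'japanese-jisx0208 36 34))
  ;; => (japanese-jisx0208 36 34)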
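The contrast drawn above between a fixed-width 16-bit Unicode encoding and UTF-8 can be checked by hand for the aleph example (code point 0x05D0). The arithmetic below follows from the UTF-8 rules quoted in the new text; the `encode-coding-string' line is left as a comment because it assumes a build whose Mule support includes a `utf-8' coding system, which not every XEmacs of this era provides.

  ;; Hebrew aleph, Unicode code point #x05D0.
  ;;
  ;; Fixed-width 16-bit encoding: the two bytes 05 D0 (or D0 05,
  ;; depending on the byte order chosen); compact, but not
  ;; ASCII-compatible.
  ;;
  ;; UTF-8: #x05D0 needs 11 significant bits, so it becomes two bytes
  ;; of the form 110xxxxx 10xxxxxx:
  ;;
  ;;   00000101 11010000  ->  11010111 10010000  =  #xD7 #x90
  ;;
  ;; ASCII characters (00 through 7F) pass through as single bytes,
  ;; which is why UTF-8 text degrades gracefully in non-Unicode programs:
  ;;
  ;;   (encode-coding-string "abc" 'utf-8)   ; => "abc"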
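The paragraph about encodings that combine several charsets (EUC setting the high bit of the position codes, JIS using escape sequences) can be illustrated the same way. This sketch assumes the standard `euc-jp' and `iso-2022-jp' coding systems and the usual `decode-coding-string'/`encode-coding-string' interface; the escape sequence shown is the conventional ISO 2022 designation for JIS X 0208.

  ;; The EUC-JP bytes A4 A2 are the JIS X 0208 position codes 24 22
  ;; with their high bits set; decoding them yields a single character
  ;; (a Hiragana letter).  Re-encoding it with ISO-2022-JP instead
  ;; switches charsets via an escape sequence and uses 7-bit position
  ;; codes.
  (let ((c (decode-coding-string "\244\242" 'euc-jp)))
    (encode-coding-string c 'iso-2022-jp))
  ;; => "\e$B$\"\e(B"
  ;;    that is: ESC $ B (designate JIS X 0208), the codes 24 22,
  ;;    then ESC ( B to return to ASCII.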