diff man/lispref/mule.texi @ 1261:465bd3c7d932

[xemacs-hg @ 2003-02-06 06:35:47 by ben] various bug fixes mule/cyril-util.el: Fix compile warning. loadup.el, make-docfile.el, update-elc-2.el, update-elc.el: Set stack-trace-on-error, load-always-display-messages so we get better debug results. update-elc-2.el: Fix typo in name of lisp/mule, leading to compile failure. simple.el: Omit M-S-home/end from motion keys. update-elc.el: Overhaul: -- allow list of "early-compile" files to be specified, not hardcoded -- fix autoload checking to include all .el files, not just dumped ones -- be smarter about regenerating autoloads, so we don't need to use loadup-el if not necessary -- use standard methods for loading/not loading auto-autoloads.el (maybe fixes "Already loaded" error?) -- rename misleading NOBYTECOMPILE flag file. window-xemacs.el: Fix bug in default param. window-xemacs.el: Fix compile warnings. lwlib-Xm.c: Fix compile warning. lispref/mule.texi: Lots of Mule rewriting. internals/internals.texi: Major fixup. Correct for new names of Bytebpos, Ichar, etc. and lots of Mule rewriting. config.inc.samp: Various fixups. Makefile.in.in: NOBYTECOMPILE -> BYTECOMPILE_CHANGE. esd.c: Warning fixes. fns.c: Eliminate bogus require-prints-loading-message; use already existent load-always-display-messages instead. Make sure `load' knows we are coming from `require'. lread.c: Turn on `load-warn-when-source-newer' by default. Change loading message to indicate when we are `require'ing. Eliminate purify_flag hacks to display more messages; instead, loadup and friends specify this explicitly with `load-always-display-messages'. Add spaces when batch to clearly indicate recursive loading. Fassoc() does not GC so no need to gcpro. gui-x.c, gui-x.h, menubar-x.c: Fix up crashes when selecting menubar items due to lack of GCPROing of callbacks in lwlib structures. 
eval.c, lisp.h, print.c: Don't canonicalize to selected-frame when noninteractive, or backtraces get all screwed up as some values are printed through the stream console and some aren't. Export canonicalize_printcharfun() and use in Fbacktrace().
author ben
date Thu, 06 Feb 2003 06:36:17 +0000
parents 11ff4edb6bb7
children d6d41d23b6ec
--- a/man/lispref/mule.texi	Wed Feb 05 22:53:04 2003 +0000
+++ b/man/lispref/mule.texi	Thu Feb 06 06:36:17 2003 +0000
@@ -39,7 +39,8 @@
 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
 are thousands of such ideographs in each language), etc.  The basic
 property of a character is that it is the smallest unit of text with
-semantic significance in text processing.
+semantic significance in text processing---i.e., characters are abstract
+units defined by their meaning, not by their exact appearance.
 
   Human beings normally process text visually, so to a first approximation
 a character may be identified with its shape.  Note that the same
@@ -98,62 +99,100 @@
 but the difference in ordering can cause great headaches when the same
 thousands of characters are used by different cultures as in the Hanzi.)
 
-  A code point may be broken into a number of @dfn{position codes}.  The
-number of position codes required to index a particular character in a
-character set is called the @dfn{dimension} of the character set.  For
-practical purposes, a position code may be thought of as a byte-sized
-index.  The printing characters of ASCII, being a relatively small
-character set, is of dimension one, and each character in the set is
-indexed using a single position code, in the range 1 through 94.  Use of
-this unusual range, rather than the familiar 33 through 126, is an
-intentional abstraction; to understand the programming issues you must
-break the equation between character sets and encodings.
-
-  JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
-of dimension two -- every character is indexed by two position codes,
-each in the range 1 through 94.  (This number ``94'' is not a
-coincidence; we shall see that the JIS position codes were chosen so
-that JIS kanji could be encoded without using codes that in ASCII are
-associated with device control functions.)  Note that the choice of the
-range here is somewhat arbitrary.  You could just as easily index the
-printing characters in ASCII using numbers in the range 0 through 93, 2
-through 95, 3 through 96, etc.  In fact, the standardized
-@emph{encoding} for the ASCII @emph{character set} uses the range 33
-through 126.
+  It's important to understand that a character is defined not by any
+number attached to it, but by its meaning.  For example, ASCII and
+EBCDIC are two charsets containing exactly the same characters
+(lowercase and uppercase letters, numbers 0 through 9, particular
+punctuation marks) but with different numberings.  The @samp{comma}
+character in ASCII and EBCDIC, for instance, is the same character
+despite having a different numbering.  Conversely, when comparing ASCII
+and JIS-Roman, which look the same except that the latter has a yen sign
+substituted for the backslash, we would say that the backslash and yen
+sign are @emph{not} the same characters, despite having the same number
+(95) and despite the fact that all other characters are present in both
+charsets, with the same numbering.  ASCII and JIS-Roman, then, do
+@emph{not} have exactly the same characters in them (ASCII has a
+backslash character but no yen-sign character, and vice-versa for
+JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII
+and JIS-Roman are closer.
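The ASCII/EBCDIC comparison above can be checked directly. A minimal sketch using Python's codec library, with code page 500 standing in as a representative EBCDIC variant (an assumption; other EBCDIC code pages exist with the same comma assignment):

```python
# The same abstract character can carry different code points in
# different charsets.  The comma is 0x2C in ASCII but 0x6B in EBCDIC
# (here, code page 500), yet it is the same character in both.
comma_ascii = ",".encode("ascii")[0]
comma_ebcdic = ",".encode("cp500")[0]
print(hex(comma_ascii))   # 0x2c
print(hex(comma_ebcdic))  # 0x6b
```

The numbering differs, but any round-trip through either charset recovers the same comma, which is exactly the sense in which the character is defined by its meaning rather than its number.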
+
+  Sometimes, a code point is not a single number, but instead a group of
+numbers, called @dfn{position codes}.  In such cases, the number of
+position codes required to index a particular character in a character
+set is called the @dfn{dimension} of the character set.  Character sets
+indexed by more than one position code typically use byte-sized position
+codes.  Small character sets, e.g. ASCII, invariably use a single
+position code, but for larger character sets, the choice of whether to
+use multiple position codes or a single large (16-bit or 32-bit) number
+is arbitrary.  Unicode typically uses a single large number, but
+language-specific or ``national'' character sets often use multiple
+(usually two) position codes.  For example, JIS X 0208, i.e. Japanese
+Kanji, has thousands of characters, and is of dimension two -- every
+character is indexed by two position codes, each in the range 1 through
+94.  (This number ``94'' is not a coincidence; it is the same as the
+number of printable characters in ASCII, and was chosen so that JIS
+characters could be directly encoded using two printable ASCII
+characters.)  Note that the choice of the range here is somewhat
+arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc.
+In fact, the range for JIS position codes (and for other character sets
+modeled after it) is often given as range 33 through 126, so as to
+directly match ASCII printing characters.
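The two equivalent ranges for JIS X 0208 position codes (1 through 94, or 33 through 126) differ only by an offset of 32. A small illustrative sketch (the function name is ours, not part of any standard API):

```python
def kuten_to_jis_bytes(ku, ten):
    # Position codes ("kuten") run 1..94 in each dimension; adding 32
    # (0x20) shifts them into the ASCII printing range 33..126, giving
    # the form in which JIS characters map onto printable ASCII bytes.
    assert 1 <= ku <= 94 and 1 <= ten <= 94
    return bytes([ku + 32, ten + 32])

# Position codes 16-1 index the first kanji in JIS X 0208; in the
# 33..126 form this becomes the byte pair 0x30 0x21 -- both printable
# ASCII characters ('0' and '!').
print(kuten_to_jis_bytes(16, 1))  # b'0!'
```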
 
   An @dfn{encoding} is a way of numerically representing characters from
 one or more character sets into a stream of like-sized numerical values
-called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
-quantities.  If an encoding encompasses only one character set, then the
-position codes for the characters in that character set could be used
-directly.  (This is the case with the trivial cipher used by children,
-assigning 1 to `A', 2 to `B', and so on.)  However, even with ASCII,
-other considerations intrude.  For example, why are the upper- and
-lowercase alphabets separated by 8 characters?  Why do the digits start
-with `0' being assigned the code 48?  In both cases because semantically
-interesting operations (case conversion and numerical value extraction)
-become convenient masking operations.  Other artificial aspects (the
-control characters being assigned to codes 0--31 and 127) are historical
-accidents.  (The use of 127 for @samp{DEL} is an artifact of the "punch
-once" nature of paper tape, for example.)
-
-  Naive use of the position code is not possible, however, if more than
-one character set is to be used in the encoding.  For example, printed
+called @dfn{words} -- typically 8-bit bytes, but sometimes 16-bit or
+32-bit quantities.  It's very important to clearly distinguish between
+charsets and encodings.  For a simple charset like ASCII, there is only
+one encoding normally used -- each character is represented by a single
+byte, with the same value as its code point.  For more complicated
+charsets, however, or when a single encoding needs to represent more
+than one charset, things are not so obvious.  Unicode version 2, for
+example, is a large charset with thousands of characters, each indexed
+by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew
+letter ``aleph''.  One obvious encoding (actually two encodings, depending
+on which of the two possible byte orderings is chosen) simply uses two
+bytes per character.  This encoding is convenient for internal
+processing of Unicode text; however, it's incompatible with ASCII, and
+thus external text (files, e-mail, etc.) that is encoded this way is
+completely uninterpretable by programs lacking Unicode support.  For
+this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is
+usually used for external text.  UTF-8 represents Unicode characters
+with one to three bytes (often extended to six bytes to handle
+characters with up to 31-bit indices).  Unicode characters 00 to 7F
+(identical with ASCII) are directly represented with one byte, and other
+characters with two or more bytes, each in the range 80 to FF.
+Applications that don't understand Unicode will still be able to process
+ASCII characters represented in UTF-8-encoded text, and will typically
+ignore (and hopefully preserve) the high-bit characters.
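The UTF-8 behavior described above is easy to observe. A minimal sketch, using the Hebrew aleph (0x05D0) from the text:

```python
# ASCII characters are represented in UTF-8 as themselves, one byte
# each; other characters become sequences of bytes in the range
# 0x80..0xFF, so a non-Unicode-aware program never mistakes them for
# ASCII text.
print("A".encode("utf-8"))              # b'A' (one byte, same as ASCII)
aleph = "\u05d0"
utf8 = aleph.encode("utf-8")
print(utf8.hex())                       # d790 (two bytes)
print(all(b >= 0x80 for b in utf8))     # True -- no ASCII bytes used
```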
+
+  Naive use of code points is also not possible if more than one
+character set is to be used in the encoding.  For example, printed
 Japanese text typically requires characters from multiple character sets
 -- ASCII, JIS X 0208, and JIS X 0212, to be specific.  Each of these is
-indexed using one or more position codes in the range 1 through 94, so
-the position codes could not be used directly or there would be no way
-to tell which character was meant.  Different Japanese encodings handle
-this differently -- JIS uses special escape characters to denote
-different character sets; EUC sets the high bit of the position codes
-for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
-JIS X 0212 character; etc.  (JIS, EUC, and most of the other encodings
-you will encounter in files are 7-bit or 8-bit encodings.  There is one
-common 16-bit encoding, which is Unicode; this strives to represent all
-the world's characters in a single large character set.  32-bit
-encodings are often used internally in programs, such as XEmacs with
-MULE support, to simplify the code that manipulates them; however, they
-are not used externally because they are not very space-efficient.)
+indexed using one or more position codes in the range 1 through 94 (or
+33 through 126), so the position codes could not be used directly or
+there would be no way to tell which character was meant.  Different
+Japanese encodings handle this differently -- JIS uses special escape
+characters to denote different character sets; EUC sets the high bit of
+the position codes for JIS X 0208 and JIS X 0212, and puts a special
+extra byte before each JIS X 0212 character; etc.
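The contrast between the JIS and EUC strategies can be seen with a single character. A sketch using hiragana ``a'' (JIS X 0208 position codes 4-2, i.e. bytes 0x24 0x22 in the 33..126 form):

```python
ch = "\u3042"  # hiragana "a"

# EUC-JP sets the high bit of each JIS X 0208 position-code byte:
# 0x24 0x22 becomes 0xA4 0xA2.
euc = ch.encode("euc-jp")
print(euc.hex())                    # a4a2
print(all(b & 0x80 for b in euc))   # True

# ISO-2022-JP (the "JIS" encoding) instead stays within 7 bits and
# uses escape sequences to switch charsets: ESC $ B enters JIS X 0208,
# ESC ( B returns to ASCII, and the position-code bytes appear as-is.
jis = ch.encode("iso-2022-jp")
print(jis.hex())                    # 1b2442 2422 1b2842 (concatenated)
print(all(b < 0x80 for b in jis))   # True
```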
+
+  The encodings described above are all 7-bit or 8-bit encodings.  The
+fixed-width Unicode encoding previously described, however, is sometimes
+considered to be a 16-bit encoding, in which case the issue of byte
+ordering does not come up. (Imagine, for example, that the text is
+represented as an array of shorts.) Similarly, Unicode version 3 (which
+has characters with indices above 0xFFFF), and other very large
+character sets, may be represented internally as 32-bit encodings,
+i.e. arrays of ints.  However, it does not make too much sense to talk
+about 16-bit or 32-bit encodings for external data, since nowadays 8-bit
+data is a universal standard -- the closest you can get is fixed-width
+encodings using two or four bytes to encode 16-bit or 32-bit values. (A
+``7-bit'' encoding is used when it cannot be guaranteed that the high bit
+of 8-bit data will be correctly preserved.  Some e-mail gateways, for
+example, strip the high bit of text passing through them.  These same
+gateways often handle non-printable characters incorrectly, and so 7-bit
+encodings usually avoid using bytes with such values.)
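The byte-ordering issue mentioned above for the fixed-width 16-bit encoding is just the choice of which byte of each 16-bit value comes first. A sketch with the same aleph character:

```python
# The two possible serializations of the 16-bit value 0x05D0 as a pair
# of bytes: big-endian puts the high byte first, little-endian the low
# byte.  As an in-memory array of shorts the question never arises.
ch = "\u05d0"
print(ch.encode("utf-16-be").hex())  # 05d0
print(ch.encode("utf-16-le").hex())  # d005
```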
 
   A general method of handling text using multiple character sets
 (whether for multilingual text, or simply text in an extremely