comparison man/internals/internals.texi @ 3496:d08f0a2c8722

[xemacs-hg @ 2006-07-07 23:01:01 by aidan] Adjust the Mule charsets to support 500,000 unknown Unicode charsets.
author aidan
date Fri, 07 Jul 2006 23:01:11 +0000
parents 15fb91e3a115
children 382b11fa8866
3495:61954f295412 3496:d08f0a2c8722
11333 Textual searches can simply treat encoded strings as if they 11333 Textual searches can simply treat encoded strings as if they
11334 were encoded in a one-byte-per-character fashion rather than 11334 were encoded in a one-byte-per-character fashion rather than
11335 the actual multi-byte encoding. 11335 the actual multi-byte encoding.
11336 @end enumerate 11336 @end enumerate
11337 11337
11338 None of the standard non-modal encodings meet all of these 11338 None of the pre-Unicode standard non-modal encodings meet all of these
11339 conditions. For example, EUC satisfies only (2) and (3), while 11339 conditions. For example, EUC satisfies only (2) and (3), while
11340 Shift-JIS and Big5 (not yet described) satisfy only (2). (All 11340 Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal
11341 non-modal encodings must satisfy (2), in order to be unambiguous.) 11341 encodings must satisfy (2), in order to be unambiguous.) UTF-8,
11342 however, meets all three, and we are considering moving to it as an
11343 internal encoding.
11342 11344
11343 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings 11345 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings
11344 @subsection Internal Character Encoding 11346 @subsection Internal Character Encoding
11345 @cindex internal character encoding 11347 @cindex internal character encoding
11346 @cindex character encoding, internal 11348 @cindex character encoding, internal
11347 @cindex encoding, internal character 11349 @cindex encoding, internal character
11348 11350
11349 One 19-bit word represents a single character. The word is 11351 One 21-bit word represents a single character. The word is
11350 separated into three fields: 11352 separated into three fields:
11351 11353
11352 @example 11354 @example
11353 Bit number: 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 11355 Bit number: 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
11354 <------------> <------------------> <------------------> 11356 <------------------> <------------------> <------------------>
11355 Field: 1 2 3 11357 Field: 1 2 3
11356 @end example 11358 @end example
11357 11359
11358 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits. 11360 Note that each field holds 7 bits.
11359 11361
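The three-field layout on the new (21-bit) side of this change can be sketched in C roughly as follows. This is only an illustration of the bit arithmetic: the names `FIELD_BITS`, `FIELD_MASK`, `field1`, `field2`, `field3` and `make_code` are hypothetical and are not the macros XEmacs itself uses.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not the actual XEmacs macros.
   Assumes the 21-bit layout described above: three 7-bit fields,
   with field 1 in bits 20-14, field 2 in bits 13-7, field 3 in bits 6-0. */
#define FIELD_BITS 7
#define FIELD_MASK 0x7Fu

/* Extract each 7-bit field from a 21-bit character code. */
static unsigned field1(unsigned code) { return (code >> (2 * FIELD_BITS)) & FIELD_MASK; }
static unsigned field2(unsigned code) { return (code >> FIELD_BITS) & FIELD_MASK; }
static unsigned field3(unsigned code) { return code & FIELD_MASK; }

/* Pack three 7-bit field values into one 21-bit character code. */
static unsigned make_code(unsigned f1, unsigned f2, unsigned f3)
{
  assert(f1 <= FIELD_MASK && f2 <= FIELD_MASK && f3 <= FIELD_MASK);
  return (f1 << (2 * FIELD_BITS)) | (f2 << FIELD_BITS) | f3;
}
```

Under these assumptions an ASCII character such as 0x41 has fields 1 and 2 equal to zero and field 3 equal to 0x41, which matches the ASCII row of the table that follows.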
11360 @example 11362 @example
11361 Character set Field 1 Field 2 Field 3 11363 Character set Field 1 Field 2 Field 3
11362 ------------- ------- ------- ------- 11364 ------------- ------- ------- -------
11363 ASCII 0 0 PC1 11365 ASCII 0 0 PC1
11368 range: (01 - 0D) (20 - 7F) 11370 range: (01 - 0D) (20 - 7F)
11369 Dimension-1 private 0 LB - 0x80 PC1 11371 Dimension-1 private 0 LB - 0x80 PC1
11370 range: (20 - 6F) (20 - 7F) 11372 range: (20 - 6F) (20 - 7F)
11371 Dimension-2 official LB - 0x8F PC1 PC2 11373 Dimension-2 official LB - 0x8F PC1 PC2
11372 range: (01 - 0A) (20 - 7F) (20 - 7F) 11374 range: (01 - 0A) (20 - 7F) (20 - 7F)
11373 Dimension-2 private LB - 0xE1 PC1 PC2 11375 Dimension-2 private LB - 0x80 PC1 PC2
11374 range: (0F - 1E) (20 - 7F) (20 - 7F) 11376 range: (0F - 1E) (20 - 7F) (20 - 7F)
11375 Composite 0x1F ? ? 11377 Composite 0x1F ? ?
11376 @end example 11378 @end example
11377 11379
11378 Note that character codes 0 - 255 are the same as the ``binary 11380 Note also that character codes 0 - 255 are the same as the ``binary
11379 encoding'' described above. 11381 encoding'' described above.
11380 11382
11381 Most of the code in XEmacs knows nothing of the representation of a 11383 Most of the code in XEmacs knows nothing of the representation of a
11382 character other than that values 0 - 255 represent ASCII, Control 1, 11384 character other than that values 0 - 255 represent ASCII, Control 1,
11383 and Latin 1. 11385 and Latin 1.
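As a small illustration of the one fact most code relies on, the sketch below classifies a character code in the range 0 - 255. The enum and function names are hypothetical; the sub-ranges assumed are the usual ones (ASCII 0x00 - 0x7F, Control 1 0x80 - 0x9F, Latin 1 0xA0 - 0xFF).

```c
/* Hypothetical names for illustration; not part of XEmacs. */
enum low_range { LOW_ASCII, LOW_CONTROL_1, LOW_LATIN_1, LOW_NONE };

/* Classify a character code, assuming 0x00-0x7F is ASCII,
   0x80-0x9F is Control 1, and 0xA0-0xFF is Latin 1.
   Anything outside 0-255 is reported as LOW_NONE. */
static enum low_range classify_low(int code)
{
  if (code < 0 || code > 0xFF)
    return LOW_NONE;
  if (code <= 0x7F)
    return LOW_ASCII;
  if (code <= 0x9F)
    return LOW_CONTROL_1;
  return LOW_LATIN_1;
}
```

Code that needs to know more than this about a character above 255 would have to go through whatever accessors the implementation provides for that purpose.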
11605 Kanji. Note that the representation of a character as an Ichar is @strong{not} 11607 Kanji. Note that the representation of a character as an Ichar is @strong{not}
11606 the same as the representation of that same character in a string; thus, 11608 the same as the representation of that same character in a string; thus,
11607 you cannot do the standard C trick of passing a pointer to a character 11609 you cannot do the standard C trick of passing a pointer to a character
11608 to a function that expects a string. 11610 to a function that expects a string.
11609 11611
11610 An Ichar takes up 19 bits of representation and (for code compatibility 11612 An Ichar takes up 21 bits of representation and (for code compatibility
11611 and such) is compatible with an int. This representation is visible on 11613 and such) is compatible with an int. This representation is visible on
11612 the Lisp level. The important characteristics of the Ichar 11614 the Lisp level. The important characteristics of the Ichar
11613 representation are 11615 representation are
11614 11616
11615 @itemize @minus 11617 @itemize @minus