comparison man/internals/internals.texi @ 3496:d08f0a2c8722
[xemacs-hg @ 2006-07-07 23:01:01 by aidan]
Adjust the Mule charsets to support 500,000 unknown Unicode charsets.
author | aidan |
---|---|
date | Fri, 07 Jul 2006 23:01:11 +0000 |
parents | 15fb91e3a115 |
children | 382b11fa8866 |
3495:61954f295412 | 3496:d08f0a2c8722 |
---|---|
11333 Textual searches can simply treat encoded strings as if they | 11333 Textual searches can simply treat encoded strings as if they |
11334 were encoded in a one-byte-per-character fashion rather than | 11334 were encoded in a one-byte-per-character fashion rather than |
11335 the actual multi-byte encoding. | 11335 the actual multi-byte encoding. |
11336 @end enumerate | 11336 @end enumerate |
11337 | 11337 |
11338 None of the standard non-modal encodings meet all of these | 11338 None of the pre-Unicode standard non-modal encodings meet all of these |
11339 conditions. For example, EUC satisfies only (2) and (3), while | 11339 conditions. For example, EUC satisfies only (2) and (3), while |
11340 Shift-JIS and Big5 (not yet described) satisfy only (2). (All | 11340 Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal |
11341 non-modal encodings must satisfy (2), in order to be unambiguous.) | 11341 encodings must satisfy (2), in order to be unambiguous.) UTF-8, |
| | 11342 however, meets all three, and we are considering moving to it as an |
| | 11343 internal encoding. |
11342 | 11344 |
11343 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings | 11345 @node Internal Character Encoding, , Internal String Encoding, Internal Mule Encodings |
11344 @subsection Internal Character Encoding | 11346 @subsection Internal Character Encoding |
11345 @cindex internal character encoding | 11347 @cindex internal character encoding |
11346 @cindex character encoding, internal | 11348 @cindex character encoding, internal |
11347 @cindex encoding, internal character | 11349 @cindex encoding, internal character |
11348 | 11350 |
11349 One 19-bit word represents a single character. The word is | 11351 One 21-bit word represents a single character. The word is |
11350 separated into three fields: | 11352 separated into three fields: |
11351 | 11353 |
11352 @example | 11354 @example |
11353 Bit number: 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 | 11355 Bit number: 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
11354 <------------> <------------------> <------------------> | 11356 <------------------> <------------------> <------------------> |
11355 Field: 1 2 3 | 11357 Field: 1 2 3 |
11356 @end example | 11358 @end example |
11357 | 11359 |
11358 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits. | 11360 Note that each field holds 7 bits. |
11359 | 11361 |
11360 @example | 11362 @example |
11361 Character set Field 1 Field 2 Field 3 | 11363 Character set Field 1 Field 2 Field 3 |
11362 ------------- ------- ------- ------- | 11364 ------------- ------- ------- ------- |
11363 ASCII 0 0 PC1 | 11365 ASCII 0 0 PC1 |
11368 range: (01 - 0D) (20 - 7F) | 11370 range: (01 - 0D) (20 - 7F) |
11369 Dimension-1 private 0 LB - 0x80 PC1 | 11371 Dimension-1 private 0 LB - 0x80 PC1 |
11370 range: (20 - 6F) (20 - 7F) | 11372 range: (20 - 6F) (20 - 7F) |
11371 Dimension-2 official LB - 0x8F PC1 PC2 | 11373 Dimension-2 official LB - 0x8F PC1 PC2 |
11372 range: (01 - 0A) (20 - 7F) (20 - 7F) | 11374 range: (01 - 0A) (20 - 7F) (20 - 7F) |
11373 Dimension-2 private LB - 0xE1 PC1 PC2 | 11375 Dimension-2 private LB - 0x80 PC1 PC2 |
11374 range: (0F - 1E) (20 - 7F) (20 - 7F) | 11376 range: (0F - 1E) (20 - 7F) (20 - 7F) |
11375 Composite 0x1F ? ? | 11377 Composite 0x1F ? ? |
11376 @end example | 11378 @end example |
11377 | 11379 |
11378 Note that character codes 0 - 255 are the same as the ``binary | 11380 Note also that character codes 0 - 255 are the same as the ``binary |
11379 encoding'' described above. | 11381 encoding'' described above. |
11380 | 11382 |
11381 Most of the code in XEmacs knows nothing of the representation of a | 11383 Most of the code in XEmacs knows nothing of the representation of a |
11382 character other than that values 0 - 255 represent ASCII, Control 1, | 11384 character other than that values 0 - 255 represent ASCII, Control 1, |
11383 and Latin 1. | 11385 and Latin 1. |
11605 Kanji. Note that the representation of a character as an Ichar is @strong{not} | 11607 Kanji. Note that the representation of a character as an Ichar is @strong{not} |
11606 the same as the representation of that same character in a string; thus, | 11608 the same as the representation of that same character in a string; thus, |
11607 you cannot do the standard C trick of passing a pointer to a character | 11609 you cannot do the standard C trick of passing a pointer to a character |
11608 to a function that expects a string. | 11610 to a function that expects a string. |
11609 | 11611 |
11610 An Ichar takes up 19 bits of representation and (for code compatibility | 11612 An Ichar takes up 21 bits of representation and (for code compatibility |
11611 and such) is compatible with an int. This representation is visible on | 11613 and such) is compatible with an int. This representation is visible on |
11612 the Lisp level. The important characteristics of the Ichar | 11614 the Lisp level. The important characteristics of the Ichar |
11613 representation are | 11615 representation are |
11614 | 11616 |
11615 @itemize @minus | 11617 @itemize @minus |