view lisp/unicode.el @ 3767:6b2ef948e140
[xemacs-hg @ 2006-12-29 18:09:38 by aidan]
etc/ChangeLog addition:
2006-12-21 Aidan Kehoe <kehoea@parhasard.net>
* unicode/unicode-consortium/8859-7.TXT:
Update the mapping to the 2003 version of ISO 8859-7.
lisp/ChangeLog addition:
2006-12-21 Aidan Kehoe <kehoea@parhasard.net>
* mule/cyrillic.el:
* mule/cyrillic.el (iso-8859-5):
* mule/cyrillic.el (cyrillic-koi8-r-encode-table):
Add syntax, case support for Cyrillic; make some parentheses more
Lispy.
* mule/european.el:
Content moved to latin.el, file deleted.
* mule/general-late.el:
If Unicode tables are to be loaded at dump time, do it here, not
in loadup.el.
* mule/greek.el:
Add syntax, case support for Greek.
* mule/latin.el:
Move the content of european.el here. Change the case table
mappings to use hexadecimal codes, to make cross reference to the
standards easier. In all cases, take character syntax from similar
characters in Latin-1, rather than deciding separately what
syntax they should take. Add (incomplete) support for case with
Turkish. Remove description of the character sets used from the
language environments' doc strings, since now that we create
variant language environments on the fly, such descriptions will
often be inaccurate. Set the native-coding-system language info
property while setting the other coding-system properties of the
language.
* mule/misc-lang.el (ipa):
Remove the language environment. The International Phonetic
_Alphabet_ is not a language, it's inane to have a corresponding
language environment in XEmacs.
* mule/mule-cmds.el (create-variant-language-environment):
Also modify the coding-priority when creating a new language
environment; document that.
* mule/mule-cmds.el (get-language-environment-from-locale):
Recognise that the 'native-coding-system language-info property
can be a list, interpret it correctly when it is one.
2006-12-21 Aidan Kehoe <kehoea@parhasard.net>
* coding.el (coding-system-category):
Use the new 'unicode-type property for finding what sort of
Unicode coding system subtype a coding system is, instead of the
overshadowed 'type property.
* dumped-lisp.el (preloaded-file-list):
mule/european.el has been removed.
* loadup.el (really-early-error-handler):
Unicode tables loaded at dump time are now in
mule/general-late.el.
* simple.el (count-lines):
Add some backslashes to parentheses in docstrings to help
fontification along.
* simple.el (what-cursor-position):
Wrap a line to fit in 80 characters.
* unicode.el:
Use the 'unicode-type property, not 'type, for setting the Unicode
coding-system subtype.
src/ChangeLog addition:
2006-12-21 Aidan Kehoe <kehoea@parhasard.net>
* file-coding.c:
Update the make-coding-system docstring to reflect unicode-type.
* general-slots.h:
New symbol, unicode-type, since 'type was being overridden when
accessing a coding system's Unicode subtype.
* intl-win32.c:
Backslash a few parentheses, to help fontification along.
* intl-win32.c (complex_vars_of_intl_win32):
Use the 'unicode-type symbol, not 'type, when creating the
Microsoft Unicode coding system.
* unicode.c (unicode_putprop):
* unicode.c (unicode_getprop):
* unicode.c (unicode_print):
Using 'type as the property name when working out what Unicode
subtype a given coding system is was broken, since there's a
general coding system property called 'type. Change the former to
use 'unicode-type instead.
author | aidan |
---|---|
date | Fri, 29 Dec 2006 18:09:51 +0000 |
parents | 4c8ad140bcec |
children | aa28d959af41 |
;;; unicode.el --- Unicode support -*- coding: iso-2022-7bit; -*-

;; Copyright (C) 2001, 2002 Ben Wing.

;; Keywords: multilingual, Unicode

;; This file is part of XEmacs.

;; XEmacs is free software; you can redistribute it and/or modify it
;; under the terms of the GNU General Public License as published by
;; the Free Software Foundation; either version 2, or (at your option)
;; any later version.

;; XEmacs is distributed in the hope that it will be useful, but
;; WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
;; General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with XEmacs; see the file COPYING. If not, write to the Free
;; Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
;; 02111-1307, USA.

;;; Synched up with: Not in FSF.

;;; Commentary:

;; Lisp support for Unicode, e.g. initialize the translation tables.

;;; Code:

;; GNU Emacs has the charsets:

;;     mule-unicode-2500-33ff
;;     mule-unicode-e000-ffff
;;     mule-unicode-0100-24ff

;; built-in. This is hack--and an incomplete hack at that--against the
;; spirit and the letter of standard ISO 2022 character sets. Instead of
;; this, we have the jit-ucs-charset-N Mule character sets, created in
;; unicode.c on encountering a Unicode code point that we don't recognise,
;; and saved in ISO 2022 coding systems using the UTF-8 escape described in
;; ISO-IR 196.

;; accessed in loadup.el, mule-cmds.el; see discussion in unicode.c
(defvar load-unicode-tables-at-dump-time (eq system-type 'windows-nt)
  "[INTERNAL] Whether to load the Unicode tables at dump time.
Setting this at run-time does nothing.")

;; NOTE: This takes only a fraction of a second on my Pentium III
;; 700Mhz even with a totally optimization-disabled XEmacs.
(defun load-unicode-tables ()
  "Initialize the Unicode translation tables for all standard charsets."
  (let ((parse-args
         '(("unicode/unicode-consortium"
            ;; Due to the braindamaged way Mule treats the ASCII and Control-1
            ;; charsets' types, trying to load them results in out-of-range
            ;; warnings at unicode.c:1439. They're no-ops anyway, they're
            ;; hardwired in unicode.c (unicode_to_ichar, ichar_to_unicode).
            ;; ("8859-1.TXT" ascii #x00 #x7F #x0)
            ;; ("8859-1.TXT" control-1 #x80 #x9F #x-80)
            ;; The 8859-1.TXT G1 assignments are half no-ops, hardwired in
            ;; unicode.c ichar_to_unicode, but not in unicode_to_ichar.
            ("8859-1.TXT" latin-iso8859-1 #xA0 #xFF #x-80)
            ;; "8859-10.TXT"
            ;; "8859-13.TXT"
            ("8859-14.TXT" latin-iso8859-14 #xA0 #xFF #x-80)
            ("8859-15.TXT" latin-iso8859-15 #xA0 #xFF #x-80)
            ("8859-16.TXT" latin-iso8859-16 #xA0 #xFF #x-80)
            ("8859-2.TXT" latin-iso8859-2 #xA0 #xFF #x-80)
            ("8859-3.TXT" latin-iso8859-3 #xA0 #xFF #x-80)
            ("8859-4.TXT" latin-iso8859-4 #xA0 #xFF #x-80)
            ("8859-5.TXT" cyrillic-iso8859-5 #xA0 #xFF #x-80)
            ("8859-6.TXT" arabic-iso8859-6 #xA0 #xFF #x-80)
            ("8859-7.TXT" greek-iso8859-7 #xA0 #xFF #x-80)
            ("8859-8.TXT" hebrew-iso8859-8 #xA0 #xFF #x-80)
            ("8859-9.TXT" latin-iso8859-9 #xA0 #xFF #x-80)
            ;; charset for Big5 does not matter; specifying `big5' will
            ;; automatically make the right thing happen
            ("BIG5.TXT" chinese-big5-1 nil nil nil big5)
            ("CNS11643.TXT" chinese-cns11643-1 #x10000 #x1FFFF #x-10000)
            ("CNS11643.TXT" chinese-cns11643-2 #x20000 #x2FFFF #x-20000)
            ;; "CP1250.TXT"
            ;; "CP1251.TXT"
            ;; "CP1252.TXT"
            ;; "CP1253.TXT"
            ;; "CP1254.TXT"
            ;; "CP1255.TXT"
            ;; "CP1256.TXT"
            ;; "CP1257.TXT"
            ;; "CP1258.TXT"
            ;; "CP874.TXT"
            ;; "CP932.TXT"
            ;; "CP936.TXT"
            ;; "CP949.TXT"
            ;; "CP950.TXT"
            ;; "GB12345.TXT"
            ("GB2312.TXT" chinese-gb2312)
            ;; "HANGUL.TXT"
            ;; #### shouldn't JIS X 0201's upper limit be 7f?
            ("JIS0201.TXT" latin-jisx0201 #x21 #x80)
            ("JIS0201.TXT" katakana-jisx0201 #xA0 #xFF #x-80)
            ("JIS0208.TXT" japanese-jisx0208 nil nil nil ignore-first-column)
            ("JIS0212.TXT" japanese-jisx0212)
            ;; "JOHAB.TXT"
            ;; "KOI8-R.TXT"
            ;; "KSC5601.TXT"
            ;; note that KSC5601.TXT as currently distributed is NOT what
            ;; it claims to be! see comments in KSX1001.TXT.
            ("KSX1001.TXT" korean-ksc5601)
            ;; "OLD5601.TXT"
            ;; "SHIFTJIS.TXT"
            )
           ("unicode/mule-ucs"
            ;; #### we don't support surrogates?!??
            ;; use these instead of the above ones once we support surrogates
            ;;("chinese-cns11643-1.txt" chinese-cns11643-1)
            ;;("chinese-cns11643-2.txt" chinese-cns11643-2)
            ;;("chinese-cns11643-3.txt" chinese-cns11643-3)
            ;;("chinese-cns11643-4.txt" chinese-cns11643-4)
            ;;("chinese-cns11643-5.txt" chinese-cns11643-5)
            ;;("chinese-cns11643-6.txt" chinese-cns11643-6)
            ;;("chinese-cns11643-7.txt" chinese-cns11643-7)
            ("chinese-sisheng.txt" chinese-sisheng)
            ("ethiopic.txt" ethiopic)
            ("indian-is13194.txt" indian-is13194)
            ("ipa.txt" ipa)
            ("thai-tis620.txt" thai-tis620)
            ("tibetan.txt" tibetan)
            ("vietnamese-viscii-lower.txt" vietnamese-viscii-lower)
            ("vietnamese-viscii-upper.txt" vietnamese-viscii-upper)
            )
           ("unicode/other"
            ("lao.txt" lao)
            )
           )))
    (mapcar #'(lambda (tables)
                (let ((undir
                       (expand-file-name (car tables) data-directory)))
                  (mapcar #'(lambda (args)
                              (apply 'load-unicode-mapping-table
                                     (expand-file-name (car args) undir)
                                     (cdr args)))
                          (cdr tables))))
            parse-args)))

(make-coding-system
 'utf-16 'unicode
 "UTF-16"
 '(mnemonic "UTF-16"
   documentation
   "UTF-16 Unicode encoding -- the standard (almost-) fixed-width
two-byte encoding, with surrogates. It will be fixed-width if all
characters are in the BMP (Basic Multilingual Plane -- first 65536
codepoints). Cannot represent characters with codepoints above
0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee
never to encode any characters outside this range -- all the rest are
for private, corporate or internal use."
   unicode-type utf-16))

(define-coding-system-alias 'utf-16-be 'utf-16)

(make-coding-system
 'utf-16-bom 'unicode
 "UTF-16 w/BOM"
 '(mnemonic "UTF16-BOM"
   documentation
   "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
0xFE and 0xFF, respectively, or reversed in a little-endian
representation. It has been sanctioned by the Unicode Consortium for
use at the beginning of a Unicode stream as a marker of the byte order
of the stream, and commonly appears in Unicode files under Microsoft
Windows, where it also functions as a magic cookie identifying a
Unicode file. The character is called \"ZERO WIDTH NO-BREAK SPACE\"
and is suitable as a byte-order marker because:

  -- it has no displayable representation
  -- due to its semantics it never normally appears at the beginning
     of a stream
  -- its reverse U+FFFE is not a legal Unicode character
  -- neither byte sequence is at all likely in any other standard
     encoding, particularly at the beginning of a stream

This coding system will insert a BOM at the beginning of a stream when
writing and strip it off when reading."
   unicode-type utf-16
   need-bom t))

(make-coding-system
 'utf-16-little-endian 'unicode
 "UTF-16 Little Endian"
 '(mnemonic "UTF16-LE"
   documentation
   "Little-endian version of UTF-16 Unicode encoding.
See `utf-16' coding system."
   unicode-type utf-16
   little-endian t))

(define-coding-system-alias 'utf-16-le 'utf-16-little-endian)

(make-coding-system
 'utf-16-little-endian-bom 'unicode
 "UTF-16 Little Endian w/BOM"
 '(mnemonic "MSW-Unicode"
   documentation
   "Little-endian version of UTF-16 Unicode encoding, with byte order mark.
Standard encoding for representing Unicode under MS Windows. See
`utf-16-bom' coding system."
   unicode-type utf-16
   little-endian t
   need-bom t))

(make-coding-system
 'ucs-4 'unicode
 "UCS-4"
 '(mnemonic "UCS4"
   documentation
   "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
   unicode-type ucs-4))

(make-coding-system
 'ucs-4-little-endian 'unicode
 "UCS-4 Little Endian"
 '(mnemonic "UCS4-LE"
   documentation
   ;; #### I don't think this is permitted by ISO 10646, only Unicode.
   ;; Call it UTF-32 instead?
   "Little-endian version of UCS-4 Unicode encoding.
See `ucs-4' coding system."
   unicode-type ucs-4
   little-endian t))

(make-coding-system
 'utf-8 'unicode
 "UTF-8"
 '(mnemonic "UTF8"
   documentation "
UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
sharing the following principles with the Mule-internal encoding:

  -- All ASCII characters (codepoints 0 through 127) are represented
     by themselves (i.e. using one byte, with the same value as the
     ASCII codepoint), and these bytes are disjoint from bytes
     representing non-ASCII characters.

     This means that any 8-bit clean application can safely process
     UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
     '/' byte is always a slash character, never the second byte of
     some other character, as with Big5, so a pathname encoded in
     UTF-8 can safely be split up into components and reassembled
     again using standard ASCII processes).

  -- Leading bytes and non-leading bytes in the encoding of a
     character are disjoint, so moving backwards is easy.

  -- Given only the leading byte, you know how many following bytes
     are present.
"
   unicode-type utf-8))

(make-coding-system
 'utf-8-bom 'unicode
 "UTF-8 w/BOM"
 '(mnemonic "MSW-UTF8"
   documentation
   "UTF-8 Unicode encoding, with byte order mark.
Standard encoding for representing UTF-8 under MS Windows."
   unicode-type utf-8
   little-endian t
   need-bom t))

(defun decode-char (quote-ucs code &optional restriction)
  "FSF compatibility--return Mule character with Unicode codepoint CODE.
The second argument must be 'ucs, the third argument is ignored.
"
  (assert (eq quote-ucs 'ucs) t
          "Sorry, decode-char doesn't yet support anything but the UCS.
")
  (unicode-to-char code))

(defun encode-char (char quote-ucs &optional restriction)
  "FSF compatibility--return the Unicode code point of CHAR.
The second argument must be 'ucs, the third argument is ignored.
"
  (assert (eq quote-ucs 'ucs) t
          "Sorry, encode-char doesn't yet support anything but the UCS.
")
  (char-to-unicode char))

(when (featurep 'mule)
  ;; This CCL program is used for displaying the fallback UCS character set,
  ;; and can be repurposed to lao and the IPA, all going well.
  ;;
  ;; define-ccl-program is available after mule-ccl is loaded, much later
  ;; than this file in the build process. The below is the result of
  ;;
  ;;   (macroexpand
  ;;    '(define-ccl-program ccl-encode-to-ucs-2
  ;;      `(1
  ;;        ((r1 = (r1 << 8))
  ;;         (r1 = (r1 | r2))
  ;;         (mule-to-unicode r0 r1)
  ;;         (r1 = (r0 >> 8))
  ;;         (r2 = (r0 & 255))))
  ;;      "CCL program to transform Mule characters to UCS-2."))
  ;;
  ;; and it should occasionally be confirmed that the correspondence still
  ;; holds.
  (let ((prog [1 10 131127 8 98872 65823 147513 8 82009 255 22]))
    (defconst ccl-encode-to-ucs-2 prog
      "CCL program to transform Mule characters to UCS-2.")
    (put (quote ccl-encode-to-ucs-2) (quote ccl-program-idx)
         (register-ccl-program (quote ccl-encode-to-ucs-2) prog))
    nil))

;; #### UTF-7 is not yet implemented, and it's tricky to do. There's
;; an implementation in appendix A.1 of the Unicode Standard, Version
;; 2.0, but I don't know its licensing characteristics.

; (make-coding-system
;  'utf-7 'unicode
;  "UTF-7"
;  '(mnemonic "UTF7"
;    documentation
;    "UTF-7 Unicode encoding -- 7-bit-ASCII modal Internet-mail-compatible
; encoding especially designed for headers, with the following
; properties:
;
;   -- Only characters that are considered safe for passing through any mail
;      gateway without damage are used.
;
;   -- This is a modal encoding, with two states. The first, default
;      state encodes the most common Unicode characters (upper and
;      lowercase letters, digits, and 9 common punctuation marks) as
;      themselves, and the second state, entered using '+' and
;      terminated with '-' or any character disallowed in state 2,
;      encodes any Unicode characters by first converting to UTF-16,
;      most significant byte first, and then to a slightly modified
;      Base64 encoding. (Thus, UTF-7 has the same limitations on the
;      characters it can encode as UTF-16.)
;
;   -- The modified Base64 encoding deviates from standard Base64 in
;      that it omits the `=' pad character. This is eliminated so as to
;      avoid conflicts with the use of `=' as an escape in the
;      Quoted-Printable encoding and the related Q encoding for headers:
;      With this modification, non-whitespace chars in UTF-7 will be
;      represented in Quoted-Printable and in Q as-is, with no further
;      encoding.
;
; For more information, see Appendix A.1 of The Unicode Standard 2.0, or
; wherever it is in v3.0."
;    unicode-type utf-7))
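
;; Illustrative sketch only; not part of the original unicode.el. The
;; `utf-8' docstring above states that, given only the leading byte, you
;; know how many following bytes are present. The helper below is
;; hypothetical (its name and its nil-for-invalid convention are
;; assumptions made for this example) and shows that property using plain
;; integer comparisons. It classifies by bit pattern only, covering the
;; one- to four-byte forms that reach codepoint 0x10FFFF; it does not
;; reject overlong or otherwise ill-formed sequences.
(defun unicode-example-utf-8-trailing-bytes (lead)
  "Return the number of bytes that follow LEAD in a UTF-8 sequence.
LEAD is an integer between 0 and 255. Return nil if LEAD cannot begin
a sequence. Illustration only."
  (cond ((< lead #x80) 0)    ; 0xxxxxxx: ASCII, represented by itself
        ((< lead #xC0) nil)  ; 10xxxxxx: non-leading byte
        ((< lead #xE0) 1)    ; 110xxxxx: one byte follows
        ((< lead #xF0) 2)    ; 1110xxxx: two bytes follow
        ((< lead #xF8) 3)    ; 11110xxx: three bytes follow
        (t nil)))            ; #xF8 and above never lead a sequence

;; Possible use: (unicode-example-utf-8-trailing-bytes #xE2) ; => 2
;; Because leading and non-leading bytes are disjoint, a nil result for a
;; byte in the range #x80 to #xBF also tells a caller scanning backwards
;; that it has not yet reached the start of a character.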