annotate lisp/unicode.el @ 771:943eaba38521

[xemacs-hg @ 2002-03-13 08:51:24 by ben] The big ben-mule-21-5 check-in! Various files were added and deleted. See CHANGES-ben-mule. There are still some test suite failures. No crashes, though. Many of the failures have to do with problems in the test suite itself rather than in the actual code. I'll be addressing these in the next day or so -- none of the test suite failures are at all critical. Meanwhile I'll be trying to address the biggest issues -- i.e. build or run failures, which will almost certainly happen on various platforms. All comments should be sent to ben@xemacs.org -- use a Cc: if necessary when sending to mailing lists. There will be pre- and post- tags, something like pre-ben-mule-21-5-merge-in, and post-ben-mule-21-5-merge-in.
author ben
date Wed, 13 Mar 2002 08:54:06 +0000
parents
children 2923009caf47
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
771
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
1 ;;; unicode.el --- Unicode support -*- coding: iso-2022-7bit; -*-
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
2
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
3 ;; Copyright (C) 2001 Ben Wing.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
4
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
5 ;; Keywords: multilingual, Unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
6
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
7 ;; This file is part of XEmacs.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
8
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
9 ;; XEmacs is free software; you can redistribute it and/or modify it
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
10 ;; under the terms of the GNU General Public License as published by
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
11 ;; the Free Software Foundation; either version 2, or (at your option)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
12 ;; any later version.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
13
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
14 ;; XEmacs is distributed in the hope that it will be useful, but
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
15 ;; WITHOUT ANY WARRANTY; without even the implied warranty of
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
16 ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
17 ;; General Public License for more details.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
18
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
19 ;; You should have received a copy of the GNU General Public License
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
20 ;; along with XEmacs; see the file COPYING. If not, write to the Free
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
21 ;; Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
22 ;; 02111-1307, USA.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
23
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
24 ;;; Synched up with: Not in FSF.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
25
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
26 ;;; Commentary:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
27
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
28 ;; Lisp support for Unicode, e.g. initialize the translation tables.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
29
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
30 ;;; Code:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
31
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
32 ;; NOTE: This takes only a fraction of a second on my Pentium III
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
33 ;; 700Mhz even with a totally optimization-disabled XEmacs.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
34 (defun load-unicode-tables ()
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
35 "Initialize the Unicode translation tables for all standard charsets."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
36 (let ((undir (expand-file-name "unicode/unicode-consortium" data-directory))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
37 (parse-args
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
38 '(("8859-1.TXT" latin-iso8859-1 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
39 ;; "8859-10.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
40 ;; "8859-13.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
41 ;; "8859-14.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
42 ;; "8859-15.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
43 ("8859-2.TXT" latin-iso8859-2 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
44 ("8859-3.TXT" latin-iso8859-3 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
45 ("8859-4.TXT" latin-iso8859-4 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
46 ("8859-5.TXT" cyrillic-iso8859-5 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
47 ("8859-6.TXT" arabic-iso8859-6 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
48 ("8859-7.TXT" greek-iso8859-7 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
49 ("8859-8.TXT" hebrew-iso8859-8 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
50 ("8859-9.TXT" latin-iso8859-9 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
51 ;; charset for Big5 does not matter; specifying `big5' will
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
52 ;; automatically make the right thing happen
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
53 ("BIG5.TXT" chinese-big5-1 nil nil nil big5)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
54 ("CNS11643.TXT" chinese-cns11643-1 #x10000 #x1FFFF #x-10000)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
55 ("CNS11643.TXT" chinese-cns11643-2 #x20000 #x2FFFF #x-20000)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
56 ;; "CP1250.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
57 ;; "CP1251.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
58 ;; "CP1252.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
59 ;; "CP1253.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
60 ;; "CP1254.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
61 ;; "CP1255.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
62 ;; "CP1256.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
63 ;; "CP1257.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
64 ;; "CP1258.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
65 ;; "CP874.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
66 ;; "CP932.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
67 ;; "CP936.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
68 ;; "CP949.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
69 ;; "CP950.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
70 ;; "GB12345.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
71 ("GB2312.TXT" chinese-gb2312)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
72 ;; "HANGUL.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
73 ("JIS0201.TXT" katakana-jisx0201 #xA0 #xFF #x-80)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
74 ("JIS0208.TXT" japanese-jisx0208 nil nil nil ignore-first-column)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
75 ("JIS0212.TXT" japanese-jisx0212)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
76 ;; "JOHAB.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
77 ;; "KOI8-R.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
78 ;; "KSC5601.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
79 ;; note that KSC5601.TXT as currently distributed is NOT what
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
80 ;; it claims to be! see comments in KSX1001.TXT.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
81 ("KSX1001.TXT" korean-ksc5601)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
82 ;; "OLD5601.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
83 ;; "SHIFTJIS.TXT"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
84 )))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
85 (mapcar #'(lambda (args)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
86 (apply 'parse-unicode-translation-table
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
87 (expand-file-name (car args) undir)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
88 (cdr args)))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
89 parse-args)))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
90
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
91 (defun init-unicode-at-startup ()
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
92 (load-unicode-tables))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
93
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
94 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
95 'utf-16 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
96 "UTF-16"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
97 '(mnemonic "UTF-16"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
98 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
99 "UTF-16 Unicode encoding -- the standard (almost-) fixed-width
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
100 two-byte encoding, with surrogates. It will be fixed-width if all
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
101 characters are in the BMP (Basic Multilingual Plane -- first 65536
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
102 codepoints). Cannot represent characters with codepoints above
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
103 0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
104 never to encode any characters outside this range -- all the rest are
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
105 for private, corporate or internal use."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
106 type utf-16))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
107
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
108 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
109 'utf-16-bom 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
110 "UTF-16 w/BOM"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
111 '(mnemonic "UTF16-BOM"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
112 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
113 "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
114 The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
115 0xFE and 0xFF, respectively, or reversed in a little-endian
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
116 representation. It has been sanctioned by the Unicode Consortium for
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
117 use at the beginning of a Unicode stream as a marker of the byte order
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
118 of the stream, and commonly appears in Unicode files under Microsoft
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
119 Windows, where it also functions as a magic cookie identifying a
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
120 Unicode file. The character is called \"ZERO WIDTH NO-BREAK SPACE\"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
121 and is suitable as a byte-order marker because:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
122
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
123 -- it has no displayable representation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
124 -- due to its semantics it never normally appears at the beginning
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
125 of a stream
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
126 -- its reverse U+FFFE is not a legal Unicode character
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
127 -- neither byte sequence is at all likely in any other standard
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
128 encoding, particularly at the beginning of a stream
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
129
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
130 This coding system will insert a BOM at the beginning of a stream when
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
131 writing and strip it off when reading."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
132 type utf-16
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
133 need-bom t))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
134
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
135 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
136 'utf-16-little-endian 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
137 "UTF-16 Little Endian"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
138 '(mnemonic "UTF16-LE"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
139 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
140 "Little-endian version of UTF-16 Unicode encoding.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
141 See `utf-16' coding system."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
142 type utf-16
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
143 little-endian t))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
144
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
145 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
146 'utf-16-little-endian-bom 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
147 "UTF-16 Little Endian w/BOM"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
148 '(mnemonic "MSW-Unicode"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
149 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
150 "Little-endian version of UTF-16 Unicode encoding, with byte order mark.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
151 Standard encoding for representing Unicode under MS Windows. See
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
152 `utf-16-bom' coding system."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
153 type utf-16
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
154 little-endian t
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
155 need-bom t))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
156
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
157 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
158 'ucs-4 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
159 "UCS-4"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
160 '(mnemonic "UCS4"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
161 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
162 "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
163 type ucs-4))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
164
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
165 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
166 'ucs-4-little-endian 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
167 "UCS-4 Little Endian"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
168 '(mnemonic "UCS4-LE"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
169 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
170 "Little-endian version of UCS-4 Unicode encoding. See `ucs-4' coding system."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
171 type ucs-4
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
172 little-endian t))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
173
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
174 (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
175 'utf-8 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
176 "UTF-8"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
177 '(mnemonic "UTF8"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
178 documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
179 "UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
180 with the same principles as the Mule-internal encoding:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
181
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
182 -- All ASCII characters (codepoints 0 through 127) are represented
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
183 by themselves (i.e. using one byte, with the same value as the
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
184 ASCII codepoint), and these bytes are disjoint from bytes
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
185 representing non-ASCII characters.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
186
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
187 This means that any 8-bit clean application can safely process
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
188 UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
189 '/' byte is always a slash character, never the second byte of
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
190 some other character, as with Big5, so a pathname encoded in
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
191 UTF-8 can safely be split up into components and reassembled
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
192 again using standard ASCII processes).
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
193
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
194 -- Leading bytes and non-leading bytes in the encoding of a
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
195 character are disjoint, so moving backwards is easy.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
196
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
197 -- Given only the leading byte, you know how many following bytes
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
198 are present.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
199 "
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
200 type utf-8))
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
201
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
202 ;; #### UTF-7 is not yet implemented, and it's tricky to do. There's
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
203 ;; an implementation in appendix A.1 of the Unicode Standard, Version
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
204 ;; 2.0, but I don't know its licensing characteristics.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
205
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
206 ; (make-coding-system
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
207 ; 'utf-7 'unicode
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
208 ; "UTF-7"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
209 ; '(mnemonic "UTF7"
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
210 ; documentation
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
211 ; "UTF-7 Unicode encoding -- 7-bit-ASCII modal Internet-mail-compatible
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
212 ; encoding especially designed for headers, with the following
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
213 ; properties:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
214
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
215 ; -- Only characters that are considered safe for passing through any mail
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
216 ; gateway without damage are used.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
217
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
218 ; -- This is a modal encoding, with two states. The first, default
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
219 ; state encodes the most common Unicode characters (upper and
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
220 ; lowercase letters, digits, and 9 common punctuation marks) as
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
221 ; themselves, and the second state, entered using '+' and
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
222 ; terminated with '-' or any character disallowed in state 2,
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
223 ; encodes any Unicode characters by first converting to UTF-16,
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
224 ; most significant byte first, and then to a slightly modified
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
225 ; Base64 encoding. (Thus, UTF-7 has the same limitations on the
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
226 ; characters it can encode as UTF-16.)
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
227
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
228 ; -- The modified Base64 encoding deviates from standard Base64 in
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
229 ; that it omits the `=' pad character. This is eliminated so as to
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
230 ; avoid conflicts with the use of `=' as an escape in the
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
231 ; Quoted-Printable encoding and the related Q encoding for headers:
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
232 ; With this modification, non-whitespace chars in UTF-7 will be
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
233 ; represented in Quoted-Printable and in Q as-is, with no further
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
234 ; encoding.
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
235
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
236 ; For more information, see Appendix A.1 of The Unicode Standard 2.0, or
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
237 ; wherever it is in v3.0."
943eaba38521 [xemacs-hg @ 2002-03-13 08:51:24 by ben]
ben
parents:
diff changeset
238 ; type utf-7))