view lisp/unicode.el @ 793:e38acbeb1cae

[xemacs-hg @ 2002-03-29 04:46:17 by ben] lots o' fixes etc/ChangeLog: New file. Separated out all entries for etc/ into their own ChangeLog. Includes entries for the following files: etc/BABYL, etc/BETA, etc/CHARSETS, etc/DISTRIB, etc/Emacs.ad, etc/FTP, etc/GNUS-NEWS, etc/GOATS, etc/HELLO, etc/INSTALL, etc/MACHINES, etc/MAILINGLISTS, etc/MSDOS, etc/MYTHOLOGY, etc/NEWS, etc/OXYMORONS, etc/PACKAGES, etc/README, etc/TUTORIAL, etc/TUTORIAL.de, etc/TUTORIAL.ja, etc/TUTORIAL.ko, etc/TUTORIAL.se, etc/aliases.ksh, etc/altrasoft-logo.xpm, etc/check_cygwin_setup.sh, etc/custom/example-themes/europe-theme.el, etc/custom/example-themes/ex-custom-file, etc/custom/example-themes/example-theme.el, etc/e/eterm.ti, etc/edt-user.doc, etc/enriched.doc, etc/etags.1, etc/gnuserv.1, etc/gnuserv.README, etc/package-index.LATEST.gpg, etc/package-index.LATEST.pgp, etc/photos/jan.png, etc/recycle.xpm, etc/refcard.tex, etc/sample.Xdefaults, etc/sample.emacs, etc/sgml/CATALOG, etc/sgml/HTML32.dtd, etc/skk/SKK.tut.E, etc/smilies/Face_ase.xbm, etc/smilies/Face_ase2.xbm, etc/smilies/Face_ase3.xbm, etc/smilies/Face_smile.xbm, etc/smilies/Face_weep.xbm, etc/sounds, etc/toolbar, etc/toolbar/workshop-cap-up.xpm, etc/xemacs-ja.1, etc/xemacs.1, etc/yow.lines, etc\BETA, etc\NEWS, etc\README, etc\TUTORIAL, etc\TUTORIAL.de, etc\check_cygwin_setup.sh, etc\sample.init.el, etc\unicode\README, etc\unicode\mule-ucs\*, etc\unicode\other\* unicode/unicode-consortium/8859-16.TXT: New file. mule/english.el: Define this charset now, since a bug was fixed that formerly prevented it. mule/ethio-util.el: Fix compile errors involving Unicode `characters', which should be integers. Makefile.in.in: Always include gui.c, to fix compile error when TTY-only. EmacsFrame.c, abbrev.c, alloc.c, buffer.c, buffer.h, bytecode.c, bytecode.h, callint.c, callproc.c, casetab.c, casetab.h, charset.h, chartab.c, chartab.h, cmds.c, console-msw.c, console-msw.h, console-tty.c, console-x.c, console-x.h, console.c, console.h, data.c, database.c, device-gtk.c, device-msw.c, device-x.c, device.c, device.h, dialog-msw.c, doc.c, doprnt.c, dumper.c, dynarr.c, editfns.c, eldap.c, eldap.h, elhash.c, elhash.h, emacs.c, eval.c, event-Xt.c, event-gtk.c, event-msw.c, event-stream.c, event-tty.c, event-unixoid.c, events.c, events.h, extents.c, extents.h, faces.c, faces.h, file-coding.c, file-coding.h, fileio.c, filelock.c, fns.c, frame-gtk.c, frame-msw.c, frame-tty.c, frame-x.c, frame.c, frame.h, free-hook.c, general-slots.h, glyphs-eimage.c, glyphs-gtk.c, glyphs-msw.c, glyphs-widget.c, glyphs-x.c, glyphs.c, glyphs.h, gpmevent.c, gtk-xemacs.c, gui-msw.c, gui-x.c, gui-x.h, gui.c, gui.h, gutter.c, gutter.h, indent.c, input-method-xlib.c, insdel.c, keymap.c, keymap.h, lisp-disunion.h, lisp-union.h, lisp.h, lread.c, lrecord.h, lstream.c, lstream.h, marker.c, menubar-gtk.c, menubar-msw.c, menubar-x.c, menubar.c, minibuf.c, mule-canna.c, mule-ccl.c, mule-charset.c, mule-wnnfns.c, native-gtk-toolbar.c, objects-msw.c, objects-tty.c, objects-x.c, objects.c, objects.h, opaque.c, opaque.h, postgresql.c, postgresql.h, print.c, process-unix.c, process.c, process.h, rangetab.c, rangetab.h, redisplay-gtk.c, redisplay-msw.c, redisplay-output.c, redisplay-tty.c, redisplay-x.c, redisplay.c, scrollbar-gtk.c, scrollbar-msw.c, scrollbar-x.c, scrollbar.c, scrollbar.h, search.c, select-gtk.c, select-x.c, sound.c, specifier.c, specifier.h, strftime.c, symbols.c, symeval.h, syntax.h, text.c, text.h, toolbar-common.c, toolbar-msw.c, toolbar.c, toolbar.h, tooltalk.c, tooltalk.h, ui-gtk.c, ui-gtk.h, undo.c, vm-limit.c, window.c, window.h: Eliminate XSETFOO. Replace all usages with wrap_foo(). Make symbol->name a Lisp_Object, not Lisp_String *. Eliminate nearly all uses of Lisp_String * in favor of Lisp_Object, and correct macros so most of them favor Lisp_Object. Create new error-behavior ERROR_ME_DEBUG_WARN -- output warnings, but at level `debug' (usually ignored). Use it when instantiating specifiers, so problems can be debugged. Move log-warning-minimum-level into C so that we can optimize ERROR_ME_DEBUG_WARN. Fix warning levels consistent with new definitions. Add default_ and parent fields to char table; not yet implemented. New fun Dynarr_verify(); use for further error checking on Dynarrs. Rearrange code at top of lisp.h in conjunction with dynarr changes. Fix eifree(). Use Eistrings in various places (format_event_object(), where_is_to_char(), and callers thereof) to avoid fixed-size strings buffers. New fun write_eistring(). Reindent and fix GPM code to follow standards. Set default MS Windows font to Lucida Console (same size as Courier New but less interline spacing, so more lines fit). Increase default frame size on Windows to 50 lines. (If that's too big for the workspace, the frame will be shrunk as necessary.) Fix problem with text files with no newlines (). (Change `convert-eol' coding system to use `nil' for autodetect, consistent with make-coding-system.) Correct compile warnings in vm-limit.c. Fix handling of reverse-direction charsets to avoid errors when opening (e.g.) mule-ucs/lisp/reldata/uiso8859-6.el. Recode some object printing methods to use write_fmt_string() instead of a fixed buffer and sprintf. Turn on display of png comments as warnings (level `info'), now that they're unobtrusive. Revamped the sound documentation. Fixed bug in redisplay w.r.t. hscroll/truncation/continuation glyphs causing jumping up and down of the lines, since they're bigger than the line size. (It was seen most obviously when there's a horizontal scroll bar, e.g. do C-h a glyph or something like that.) The problem was that the glyph-contrib-p setting on glyphs was ignored even if it was set properly, which it wasn't until now.
author ben
date Fri, 29 Mar 2002 04:49:13 +0000
parents 578cb2932d72
children e54d47b2d736
line wrap: on
line source

;;; unicode.el --- Unicode support -*- coding: iso-2022-7bit; -*-

;; Copyright (C) 2001, 2002 Ben Wing.

;; Keywords: multilingual, Unicode

;; This file is part of XEmacs.

;; XEmacs is free software; you can redistribute it and/or modify it
;; under the terms of the GNU General Public License as published by
;; the Free Software Foundation; either version 2, or (at your option)
;; any later version.

;; XEmacs is distributed in the hope that it will be useful, but
;; WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
;; General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with XEmacs; see the file COPYING.  If not, write to the Free
;; Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
;; 02111-1307, USA.

;;; Synched up with: Not in FSF.

;;; Commentary:

;; Lisp support for Unicode, e.g. initialize the translation tables.

;;; Code:

; ;; Subsets of Unicode.

; (make-charset 'mule-unicode-2500-33ff 
; 	      "Unicode characters of the range U+2500..U+33FF."
; 	      '(dimension
; 		2
; 		registry "ISO10646-1"
; 		chars 96
; 		columns 1
; 		direction l2r
; 		final ?2
; 		graphic 0
; 		short-name "Unicode subset 2"
; 		long-name "Unicode subset (U+2500..U+33FF)"
; 		))


; (make-charset 'mule-unicode-e000-ffff 
; 	      "Unicode characters of the range U+E000..U+FFFF."
; 	      '(dimension
; 		2
; 		registry "ISO10646-1"
; 		chars 96
; 		columns 1
; 		direction l2r
; 		final ?3
; 		graphic 0
; 		short-name "Unicode subset 3"
; 		long-name "Unicode subset (U+E000+FFFF)"
; 		))


; (make-charset 'mule-unicode-0100-24ff 
; 	      "Unicode characters of the range U+0100..U+24FF."
; 	      '(dimension
; 		2
; 		registry "ISO10646-1"
; 		chars 96
; 		columns 1
; 		direction l2r
; 		final ?1
; 		graphic 0
; 		short-name "Unicode subset"
; 		long-name "Unicode subset (U+0100..U+24FF)"
; 		))


;; NOTE: This takes only a fraction of a second on my Pentium III
;; 700Mhz even with a totally optimization-disabled XEmacs.
(defun load-unicode-tables ()
  "Initialize the Unicode translation tables for all standard charsets."
  (let ((parse-args
	 '(("unicode/unicode-consortium"
	    ("8859-1.TXT" latin-iso8859-1 #xA0 #xFF #x-80)
	    ;; "8859-10.TXT"
	    ;; "8859-13.TXT"
	    ;; "8859-14.TXT"
	    ("8859-14.TXT" latin-iso8859-14 #xA0 #xFF #x-80)
	    ("8859-15.TXT" latin-iso8859-15 #xA0 #xFF #x-80)
	    ("8859-2.TXT" latin-iso8859-2 #xA0 #xFF #x-80)
	    ("8859-3.TXT" latin-iso8859-3 #xA0 #xFF #x-80)
	    ("8859-4.TXT" latin-iso8859-4 #xA0 #xFF #x-80)
	    ("8859-5.TXT" cyrillic-iso8859-5 #xA0 #xFF #x-80)
	    ("8859-6.TXT" arabic-iso8859-6 #xA0 #xFF #x-80)
	    ("8859-7.TXT" greek-iso8859-7 #xA0 #xFF #x-80)
	    ("8859-8.TXT" hebrew-iso8859-8 #xA0 #xFF #x-80)
	    ("8859-9.TXT" latin-iso8859-9 #xA0 #xFF #x-80)
	    ;; charset for Big5 does not matter; specifying `big5' will
	    ;; automatically make the right thing happen
	    ("BIG5.TXT" chinese-big5-1 nil nil nil big5)
	    ("CNS11643.TXT" chinese-cns11643-1 #x10000 #x1FFFF #x-10000)
	    ("CNS11643.TXT" chinese-cns11643-2 #x20000 #x2FFFF #x-20000)
	    ;; "CP1250.TXT" 
	    ;; "CP1251.TXT" 
	    ;; "CP1252.TXT" 
	    ;; "CP1253.TXT" 
	    ;; "CP1254.TXT" 
	    ;; "CP1255.TXT" 
	    ;; "CP1256.TXT" 
	    ;; "CP1257.TXT" 
	    ;; "CP1258.TXT" 
	    ;; "CP874.TXT" 
	    ;; "CP932.TXT" 
	    ;; "CP936.TXT" 
	    ;; "CP949.TXT" 
	    ;; "CP950.TXT" 
	    ;; "GB12345.TXT" 
	    ("GB2312.TXT" chinese-gb2312)
	    ;; "HANGUL.TXT" 
	    ("JIS0201.TXT" latin-jisx0201 #x21 #x80)
	    ("JIS0201.TXT" katakana-jisx0201 #xA0 #xFF #x-80)
	    ("JIS0208.TXT" japanese-jisx0208 nil nil nil ignore-first-column)
	    ("JIS0212.TXT" japanese-jisx0212)
	    ;; "JOHAB.TXT" 
	    ;; "KOI8-R.TXT" 
	    ;; "KSC5601.TXT" 
	    ;; note that KSC5601.TXT as currently distributed is NOT what
	    ;; it claims to be!  see comments in KSX1001.TXT.
	    ("KSX1001.TXT" korean-ksc5601)
	    ;; "OLD5601.TXT" 
	    ;; "SHIFTJIS.TXT"
	    )
	   ("unicode/mule-ucs"
	    ;; use these instead of the above ones once we support surrogates
	    ;;("chinese-cns11643-1.txt" chinese-cns11643-1)
	    ;;("chinese-cns11643-2.txt" chinese-cns11643-2)
	    ;;("chinese-cns11643-3.txt" chinese-cns11643-3)
	    ;;("chinese-cns11643-4.txt" chinese-cns11643-4)
	    ;;("chinese-cns11643-5.txt" chinese-cns11643-5)
	    ;;("chinese-cns11643-6.txt" chinese-cns11643-6)
	    ;;("chinese-cns11643-7.txt" chinese-cns11643-7)
	    ("chinese-sisheng.txt" chinese-sisheng)
	    ("ethiopic.txt" ethiopic)
	    ("indian-is13194.txt" indian-is13194)
	    ("ipa.txt" ipa)
	    ("thai-tis620.txt" thai-tis620)
	    ("tibetan.txt" tibetan)
	    ("vietnamese-viscii-lower.txt" vietnamese-viscii-lower)
	    ("vietnamese-viscii-upper.txt" vietnamese-viscii-upper)
	    )
	   ("unicode/other"
	    ("lao.txt" lao)
	    )
	   )))
    (mapcar #'(lambda (tables)
		(let ((undir
		       (expand-file-name (car tables) data-directory)))
		  (mapcar #'(lambda (args)
			      (apply 'parse-unicode-translation-table
				     (expand-file-name (car args) undir)
				     (cdr args)))
			  (cdr tables))))
	    parse-args)))

(defun init-unicode-at-startup ()
  (load-unicode-tables))

(make-coding-system
 'utf-16 'unicode
 "UTF-16"
 '(mnemonic "UTF-16"
   documentation
   "UTF-16 Unicode encoding -- the standard (almost-) fixed-width
two-byte encoding, with surrogates.  It will be fixed-width if all
characters are in the BMP (Basic Multilingual Plane -- first 65536
codepoints).  Cannot represent characters with codepoints above
0x10FFFF (a little more than 1,000,000).  Unicode and ISO guarantee
never to encode any characters outside this range -- all the rest are
for private, corporate or internal use."
   type utf-16))

(make-coding-system
 'utf-16-bom 'unicode
 "UTF-16 w/BOM"
 '(mnemonic "UTF16-BOM"
   documentation
   "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
0xFE and 0xFF, respectively, or reversed in a little-endian
representation.  It has been sanctioned by the Unicode Consortium for
use at the beginning of a Unicode stream as a marker of the byte order
of the stream, and commonly appears in Unicode files under Microsoft
Windows, where it also functions as a magic cookie identifying a
Unicode file.  The character is called \"ZERO WIDTH NO-BREAK SPACE\"
and is suitable as a byte-order marker because:

  -- it has no displayable representation
  -- due to its semantics it never normally appears at the beginning
     of a stream
  -- its reverse U+FFFE is not a legal Unicode character
  -- neither byte sequence is at all likely in any other standard
     encoding, particularly at the beginning of a stream

This coding system will insert a BOM at the beginning of a stream when
writing and strip it off when reading."
   type utf-16
   need-bom t))

(make-coding-system
 'utf-16-little-endian 'unicode
 "UTF-16 Little Endian"
 '(mnemonic "UTF16-LE"
   documentation
   "Little-endian version of UTF-16 Unicode encoding.
See `utf-16' coding system."
   type utf-16
   little-endian t))

(make-coding-system
 'utf-16-little-endian-bom 'unicode
 "UTF-16 Little Endian w/BOM"
 '(mnemonic "MSW-Unicode"
   documentation
   "Little-endian version of UTF-16 Unicode encoding, with byte order mark.
Standard encoding for representing Unicode under MS Windows.  See
`utf-16-bom' coding system."
   type utf-16
   little-endian t
   need-bom t))

(make-coding-system
 'ucs-4 'unicode
 "UCS-4"
 '(mnemonic "UCS4"
   documentation
   "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
   type ucs-4))

(make-coding-system
 'ucs-4-little-endian 'unicode
 "UCS-4 Little Endian"
 '(mnemonic "UCS4-LE"
   documentation
   "Little-endian version of UCS-4 Unicode encoding.  See `ucs-4' coding system."
   type ucs-4
   little-endian t))

(make-coding-system
 'utf-8 'unicode
 "UTF-8"
 '(mnemonic "UTF8"
   documentation
   "UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
with the same principles as the Mule-internal encoding:

  -- All ASCII characters (codepoints 0 through 127) are represented
     by themselves (i.e. using one byte, with the same value as the
     ASCII codepoint), and these bytes are disjoint from bytes
     representing non-ASCII characters.

     This means that any 8-bit clean application can safely process
     UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
     '/' byte is always a slash character, never the second byte of
     some other character, as with Big5, so a pathname encoded in
     UTF-8 can safely be split up into components and reassembled
     again using standard ASCII processes).

  -- Leading bytes and non-leading bytes in the encoding of a
     character are disjoint, so moving backwards is easy.

  -- Given only the leading byte, you know how many following bytes
     are present.
"
   type utf-8))

;; #### UTF-7 is not yet implemented, and it's tricky to do.  There's
;; an implementation in appendix A.1 of the Unicode Standard, Version
;; 2.0, but I don't know its licensing characteristics.

; (make-coding-system
;  'utf-7 'unicode
;  "UTF-7"
;  '(mnemonic "UTF7"
;    documentation
;    "UTF-7 Unicode encoding -- 7-bit-ASCII modal Internet-mail-compatible
; encoding especially designed for headers, with the following
; properties:

;   -- Only characters that are considered safe for passing through any mail
;      gateway without damage are used.

;   -- This is a modal encoding, with two states.  The first, default
;      state encodes the most common Unicode characters (upper and
;      lowercase letters, digits, and 9 common punctuation marks) as
;      themselves, and the second state, entered using '+' and
;      terminated with '-' or any character disallowed in state 2,
;      encodes any Unicode characters by first converting to UTF-16,
;      most significant byte first, and then to a slightly modified
;      Base64 encoding. (Thus, UTF-7 has the same limitations on the
;      characters it can encode as UTF-16.)

;   -- The modified Base64 encoding deviates from standard Base64 in
;      that it omits the `=' pad character.  This is eliminated so as to
;      avoid conflicts with the use of `=' as an escape in the
;      Quoted-Printable encoding and the related Q encoding for headers:
;      With this modification, non-whitespace chars in UTF-7 will be
;      represented in Quoted-Printable and in Q as-is, with no further
;      encoding.

; For more information, see Appendix A.1 of The Unicode Standard 2.0, or
; wherever it is in v3.0."
;    type utf-7))