Mercurial > hg > xemacs-beta

diff lisp/mule/make-coding-system.el @ 4690:257b468bf2ca
Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author: Aidan Kehoe <kehoea@parhasard.net>
date: Sat, 19 Sep 2009 22:53:13 +0100
children: dc3c2f298857
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/lisp/mule/make-coding-system.el	Sat Sep 19 22:53:13 2009 +0100
@@ -0,0 +1,984 @@
+;;; make-coding-system.el; Provides the #'make-coding-system function and
+;;; much of the implementation of the fixed-width coding system type.
+
+;; Copyright (C) 2009 Free Software Foundation
+
+;; Author: Aidan Kehoe
+
+;; This file is part of XEmacs.
+
+;; XEmacs is free software; you can redistribute it and/or modify it
+;; under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 2, or (at your option)
+;; any later version.
+
+;; XEmacs is distributed in the hope that it will be useful, but
+;; WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;; General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with XEmacs; see the file COPYING.  If not, write to the
+;; Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
+;; Boston, MA 02110-1301, USA.
+
+;;; Commentary:
+
+;;; Code:
+
+(defvar fixed-width-private-use-start (decode-char 'ucs #xE000)
+  "Start of a 256 code private use area for fixed-width coding systems.
+
+This is used to ensure that distinct octets on disk for a given coding
+system map to distinct XEmacs characters, preventing spurious changes when
+a file is read, not changed, and then written.  ")
+
+(defun fixed-width-generate-helper (decode-table encode-table
+				   encode-failure-octet)
+  "Helper func, `fixed-width-generate-encode-program-and-skip-chars-strings',
+which see.
+
+Deals with the case where ASCII and another character set can both be
+encoded unambiguously and completely into the coding-system; if this is so,
+returns multiple values comprisig of such a ccl-program and the character
+set in question.  If not, it returns nil."
+  (let ((tentative-encode-program-parts
+	 (eval-when-compile 
+	   (let* ((vec-len 128)
+		  (compiled 
+		   (append
+                    (ccl-compile
+                     `(1
+                       (loop
+                         (read-multibyte-character r0 r1)
+                         (if (r0 == ,(charset-id 'ascii))
+                             (write r1)
+                           ((if (r0 == #xABAB)
+                                ;; #xBFFE is a sentinel in the compiled
+                                ;; program.
+				((r0 = r1 & #x7F)
+				 (write r0 ,(make-vector vec-len #xBFFE)))
+                              ((mule-to-unicode r0 r1)
+                               (if (r0 == #xFFFD)
+                                   (write #xBEEF)
+                                 ((lookup-integer encode-table-sym r0 r3)
+                                  (if r7
+                                      (write-multibyte-character r0 r3)
+                                    (write #xBEEF))))))))
+                         (repeat)))) nil))
+		  (first-part compiled)
+		  (last-part
+		   (member-if-not (lambda (entr) (eq #xBFFE entr))
+				  (member-if
+                                   (lambda (entr) (eq #xBFFE entr))
+                                   first-part))))
+	     (while compiled
+	       (when (eq #xBFFE (cadr compiled))
+		 (assert (= vec-len (search '(#xBFFE) (cdr compiled)
+					    :test #'/=)) nil
+					    "Strange ccl vector length")
+		 (setcdr compiled nil))
+	       (setq compiled (cdr compiled)))
+             ;; Is the generated code as we expect it to be?
+	     (assert (and (memq #xABAB first-part)
+			  (memq #xBEEF14 last-part))
+	    nil
+	    "This code assumes that the constant #xBEEF is #xBEEF14 in \
+compiled CCL code,\nand that the constant #xABAB is #xABAB. If that is
+not the case, and it appears not to be--that's why you're getting this
+message--it will not work.  ")
+	     (list first-part last-part vec-len))))
+	(charset-lower -1)
+	(charset-upper -1)
+	worth-trying known-charsets encode-program
+	other-charset-vector ucs)
+
+    (loop for char across decode-table
+      do (pushnew (char-charset char) known-charsets))
+    (setq known-charsets (delq 'ascii known-charsets))
+
+    (loop for known-charset in known-charsets 
+      do
+      ;; This is not possible for two dimensional charsets. 
+      (when (eq 1 (charset-dimension known-charset))
+        (if (eq 'control-1 known-charset)
+            (setq charset-lower 0
+                  charset-upper 31)
+	  ;; There should be a nicer way to get the limits here.
+          (condition-case args-out-of-range
+              (make-char known-charset #x100)
+            (args-out-of-range 
+             (setq charset-lower (third args-out-of-range)
+                   charset-upper (fourth args-out-of-range)))))
+	(loop
+	  for i from charset-lower to charset-upper
+	  always (and (setq ucs 
+			    (encode-char (make-char known-charset i) 'ucs))
+		      (gethash ucs encode-table))
+	  finally (setq worth-trying known-charset))
+
+	;; Only trying this for one charset at a time, the first find.
+	(when worth-trying (return))
+
+	;; Okay, this charset is not worth trying, Try the next.
+	(setq charset-lower -1
+	      charset-upper -1
+	      worth-trying nil)))
+
+    (when worth-trying
+      (setq other-charset-vector
+	    (make-vector (third tentative-encode-program-parts)
+			 encode-failure-octet))
+      (loop for i from charset-lower to charset-upper
+        do (aset other-charset-vector i
+		 (gethash (encode-char (make-char worth-trying i)
+				       'ucs) encode-table)))
+      (setq encode-program
+            (nsublis
+             (list (cons #xABAB (charset-id worth-trying)))
+             (nconc
+              (copy-list (first 
+                          tentative-encode-program-parts))
+              (append other-charset-vector nil)
+              (copy-tree (second 
+                          tentative-encode-program-parts))))))
+    (and encode-program (values encode-program worth-trying))))
+
+(defun fixed-width-generate-encode-program-and-skip-chars-strings
+  (decode-table encode-table encode-failure-octet)
+  "Generate a CCL program to encode a 8-bit fixed-width charset.
+
+DECODE-TABLE must have 256 non-cons entries, and will be regarded as
+describing a map from the octet corresponding to an offset in the
+table to the that entry in the table.  ENCODE-TABLE is a hash table
+map from unicode values to characters in the range [0,255].
+ENCODE-FAILURE-OCTET describes an integer between 0 and 255
+\(inclusive) to write in the event that a character cannot be encoded."
+  (check-argument-type #'vectorp decode-table)
+  (check-argument-range (length decode-table) #x100 #x100)
+  (check-argument-type #'hash-table-p encode-table)
+  (check-argument-type #'integerp encode-failure-octet)
+  (check-argument-range encode-failure-octet #x00 #xFF)
+  (let ((encode-program nil)
+	(general-encode-program
+	 (eval-when-compile
+	   (let ((prog (append 
+			(ccl-compile
+			 `(1
+			   (loop
+			     (read-multibyte-character r0 r1)
+			     (mule-to-unicode r0 r1)
+			     (if (r0 == #xFFFD)
+				 (write #xBEEF)
+			       ((lookup-integer encode-table-sym r0 r3)
+				(if r7
+				    (write-multibyte-character r0 r3)
+				  (write #xBEEF))))
+			     (repeat)))) nil)))
+	     (assert (memq #xBEEF14 prog)
+		     nil
+		     "This code assumes that the constant #xBEEF is #xBEEF14 \
+in compiled CCL code.\nIf that is not the case, and it appears not to
+be--that's why you're getting this message--it will not work.  ")
+	     prog)))
+	(encode-program-with-ascii-optimisation
+	 (eval-when-compile 
+	   (let ((prog (append
+			(ccl-compile
+			 `(1
+			   (loop
+			     (read-multibyte-character r0 r1)
+			     (if (r0 == ,(charset-id 'ascii))
+				 (write r1)
+			       ((mule-to-unicode r0 r1)
+				(if (r0 == #xFFFD)
+				    (write #xBEEF)
+				  ((lookup-integer encode-table-sym r0 r3)
+				   (if r7
+				       (write-multibyte-character r0 r3)
+				     (write #xBEEF))))))
+			     (repeat)))) nil)))
+	     (assert (memq #xBEEF14 prog)
+		     nil
+		     "This code assumes that the constant #xBEEF is #xBEEF14 \
+in compiled CCL code.\nIf that is not the case, and it appears not to
+be--that's why you're getting this message--it will not work.  ")
+	     prog)))
+         (ascii-encodes-as-itself nil)
+         (control-1-encodes-as-itself t)
+         (invalid-sequence-code-point-start
+          (eval-when-compile
+            (char-to-unicode
+             (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
+         further-char-set skip-chars invalid-sequences-skip-chars)
+
+    ;; Is this coding system ASCII-compatible? If so, we can avoid the hash
+    ;; table lookup for those characters. 
+    (loop
+      for i from #x00 to #x7f
+      always (eq (int-to-char i) (gethash i encode-table))
+      finally (setq ascii-encodes-as-itself t))
+
+    ;; Note that this logic handles EBCDIC badly. For example, CP037,
+    ;; MIME name ebcdic-na, has the entire repertoire of ASCII and
+    ;; Latin 1, and thus a more optimal ccl encode program would check
+    ;; for those character sets and use tables. But for now, we do a
+    ;; hash table lookup for every character.
+    (if (null ascii-encodes-as-itself)
+	;; General encode program. Pros; general and correct. Cons;
+	;; slow, a hash table lookup + mule-unicode conversion is done
+	;; for every character encoding. 
+	(setq encode-program general-encode-program)
+      (multiple-value-setq
+          (encode-program further-char-set)
+        ;; Encode program with ascii-ascii mapping (based on a
+        ;; character's mule character set), and one other mule
+        ;; character set using table-based encoding, other
+        ;; character sets using hash table lookups.
+        ;; fixed-width-non-ascii-completely-coveredp only returns
+        ;; such a mapping if some non-ASCII charset with
+        ;; characters in decode-table is entirely covered by
+        ;; encode-table.
+        (fixed-width-generate-helper decode-table encode-table
+                                    encode-failure-octet))
+      (unless encode-program
+	;; If fixed-width-non-ascii-completely-coveredp returned nil,
+	;; but ASCII still encodes as itself, do one-to-one mapping
+	;; for ASCII, and a hash table lookup for everything else.
+	(setq encode-program encode-program-with-ascii-optimisation)))
+
+    (setq encode-program
+          (nsublis
+           (list (cons #xBEEF14 
+                       (logior (lsh encode-failure-octet 8)
+                               #x14)))
+           (copy-tree encode-program)))
+    (loop
+      for i from #x80 to #x9f
+      do (unless (= i (aref decode-table i))
+           (setq control-1-encodes-as-itself nil)
+           (return)))
+    (loop
+      for i from #x00 to #xFF
+      initially (setq skip-chars
+                      (cond
+                       ((and ascii-encodes-as-itself
+                             control-1-encodes-as-itself further-char-set)
+                        (concat "\x00-\x9f" (charset-skip-chars-string
+                                             further-char-set)))
+                       ((and ascii-encodes-as-itself
+                             control-1-encodes-as-itself)
+                        "\x00-\x9f")
+                       ((null ascii-encodes-as-itself)
+                        (skip-chars-quote (apply #'string
+                                                 (append decode-table nil))))
+                       (further-char-set
+                        (concat (charset-skip-chars-string 'ascii)
+                                (charset-skip-chars-string further-char-set)))
+                       (t 
+                        (charset-skip-chars-string 'ascii)))
+                      invalid-sequences-skip-chars "")
+      with decoded-ucs = nil
+      with decoded = nil
+      with no-ascii-transparency-skip-chars-list = 
+           (unless ascii-encodes-as-itself (append decode-table nil))
+      ;; Can't use #'match-string here, see:
+      ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net
+      with skip-chars-test = 
+           #'(lambda (skip-chars-string testing)
+               (with-temp-buffer
+                 (insert testing)
+                 (goto-char (point-min))
+                 (skip-chars-forward skip-chars-string)
+                 (= (point) (point-max))))
+      do
+      (setq decoded (aref decode-table i)
+            decoded-ucs (char-to-unicode decoded))
+      (cond
+       ((<= invalid-sequence-code-point-start decoded-ucs
+            (+ invalid-sequence-code-point-start #xFF))
+        (setq invalid-sequences-skip-chars
+              (concat (string decoded)
+                      invalid-sequences-skip-chars))
+        (assert (not (funcall skip-chars-test skip-chars decoded))
+                "This char should only be skipped with \
+`invalid-sequences-skip-chars', not by `skip-chars'"))
+       ((not (funcall skip-chars-test skip-chars decoded))
+        (if ascii-encodes-as-itself
+            (setq skip-chars (concat skip-chars (string decoded)))
+          (push decoded no-ascii-transparency-skip-chars-list))))
+      finally (unless ascii-encodes-as-itself
+                (setq skip-chars
+                      (skip-chars-quote
+                       (apply #'string
+                              no-ascii-transparency-skip-chars-list)))))
+    (values encode-program skip-chars invalid-sequences-skip-chars)))
+
+(defun fixed-width-create-decode-encode-tables (unicode-map)
+  "Return multiple values \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. 
+UNICODE-MAP should be an alist mapping from integer octet values to
+characters with UCS code points; DECODE-TABLE will be a 256-element
+vector, and ENCODE-TABLE will be a hash table mapping from 256 numbers
+to 256 distinct characters."
+  (check-argument-type #'listp unicode-map)
+  (let ((decode-table (make-vector 256 nil))
+        (encode-table (make-hash-table :size 256))
+	(private-use-start (encode-char fixed-width-private-use-start 'ucs))
+        (invalid-sequence-code-point-start
+         (eval-when-compile
+           (char-to-unicode
+            (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
+	desired-ucs decode-table-entry)
+
+    (loop for (external internal)
+      in unicode-map
+      do
+      (aset decode-table external internal)
+      (assert (not (eq (encode-char internal 'ucs) -1))
+	      nil
+	      "Looks like you're creating a fixed-width coding system \
+in a dumped file, \nand you're either not providing a literal unicode map
+or PROPS. Don't do that; fixed-width coding systems rely on sensible
+Unicode mappings being available, which they are at compile time for
+dumped files (but this requires the mentioned literals), but not, for
+most of them, at run time.  ")
+
+      (puthash (encode-char internal 'ucs)
+	       ;; This is semantically an integer, but Dave Love's design
+	       ;; for lookup-integer in CCL means we need to store it as a
+	       ;; character.
+	       (int-to-char external)
+	       encode-table))
+
+    ;; Now, go through the decode table. For octet values above #x7f, if the
+    ;; decode table entry is nil, this means that they have an undefined
+    ;; mapping (= they map to XEmacs characters with keys in
+    ;; unicode-error-default-translation-table); for octet values below or
+    ;; equal to #x7f, it means that they map to ASCII.
+
+    ;; If any entry (whether below or above #x7f) in the decode-table
+    ;; already maps to some character with a key in
+    ;; unicode-error-default-translation-table, it is treated as an
+    ;; undefined octet by `query-coding-region'. That is, it is not
+    ;; necessary for an octet value to be above #x7f for this to happen.
+
+    (dotimes (i 256)
+      (setq decode-table-entry (aref decode-table i))
+      (if decode-table-entry
+          (when (get-char-table
+                 decode-table-entry
+                 unicode-error-default-translation-table)
+            ;; The caller is explicitly specifying that this octet
+            ;; corresponds to an invalid sequence on disk:
+            (assert (= (get-char-table
+                        decode-table-entry
+                        unicode-error-default-translation-table) i)
+                    "Bad argument for a fixed-width coding system.
+If you're going to designate an octet with value below #x80 as invalid
+for this coding system, make sure to map it to the invalid sequence
+character corresponding to its octet value on disk. "))
+
+        ;; decode-table-entry is nil; either the octet is to be treated as
+        ;; contributing to an error sequence (when (> #x7f i)), or it should
+        ;; be attempted to treat it as ASCII-equivalent. 
+        (setq desired-ucs (or (and (< i #x80) i)
+                              (+ invalid-sequence-code-point-start i)))
+        (while (gethash desired-ucs encode-table)
+          (assert (not (< i #x80))
+                  "UCS code point should not already be in encode-table!"
+                  ;; There is one invalid sequence char per octet value;
+                  ;; with fixed-width coding systems, it makes no sense
+                  ;; for us to be multiply allocating them.
+                  (gethash desired-ucs encode-table))
+          (setq desired-ucs (+ private-use-start desired-ucs)
+                private-use-start (+ private-use-start 1)))
+        (puthash desired-ucs (int-to-char i) encode-table)
+        (setq desired-ucs (if (> desired-ucs #xFF)
+                              (unicode-to-char desired-ucs)
+                            ;; So we get Latin-1 when run at dump time,
+                            ;; instead of JIT-allocated characters.
+                            (int-to-char desired-ucs)))
+        (aset decode-table i desired-ucs)))
+    (values decode-table encode-table)))
+
+(defun fixed-width-generate-decode-program (decode-table)
+  "Given DECODE-TABLE, generate a CCL program to decode an 8-bit charset.
+DECODE-TABLE must have 256 non-cons entries, and will be regarded as
+describing a map from the octet corresponding to an offset in the
+table to the that entry in the table.  "
+  (check-argument-type #'vectorp decode-table)
+  (check-argument-range (length decode-table) #x100 #x100)
+  (let ((decode-program-parts
+	 (eval-when-compile
+	   (let* ((compiled
+		   (append 
+		    (ccl-compile 
+		     `(3
+		       ((read r0)
+                        (loop
+			 (write-read-repeat r0 ,(make-vector
+						 256 'sentinel)))))) nil))
+		  (first-part compiled)
+		  (last-part
+		   (member-if-not #'symbolp
+				  (member-if-not #'integerp first-part))))
+	     ;; Chop off the sentinel sentinel sentinel [..] part. 
+	     (while compiled
+	       (if (symbolp (cadr compiled))
+		   (setcdr compiled nil))
+	       (setq compiled (cdr compiled)))
+	     (list first-part last-part)))))
+   (nconc
+    ;; copy-list needed, because the structure of the literal provided
+    ;; by our eval-when-compile hangs around.
+    (copy-list (first decode-program-parts))
+    (append decode-table nil)
+    (second decode-program-parts))))
+
+(defun fixed-width-choose-category (decode-table)
+  "Given DECODE-TABLE, return an appropriate coding category.
+DECODE-TABLE is a 256-entry vector describing the mapping from octets on
+disk to XEmacs characters for some fixed-width 8-bit coding system."
+  (check-argument-type #'vectorp decode-table)
+  (check-argument-range (length decode-table) #x100 #x100)
+  (loop
+    named category
+    for i from #x80 to #x9F
+    do (unless (= i (aref decode-table i))
+	 (return-from category 'no-conversion))
+    finally return 'iso-8-1))
+
+(defun fixed-width-rework-props-runtime (name props)
+  "Rework PROPS to a form understood by `make-coding-system-internal'. 
+
+NAME must be a symbol, describing a fixed-width coding system that is
+about to be created.  Much of the implementation of the fixed-width
+coding system is in Lisp, and this function allows us to rework the
+arguments that `make-coding-system-internal' sees accordingly.
+
+If you are calling this function from anywhere but
+`make-coding-system', you're probably doing something wrong."
+  (check-argument-type #'symbolp name)
+  (check-valid-plist props)
+  (let  ((encode-failure-octet (or (plist-get props 'encode-failure-octet)
+				   (char-to-int ?~)))
+         (unicode-map (plist-get props 'unicode-map))
+	 (hash-table-sym (gensym (format "%s-encode-table" name)))
+	 encode-program decode-program decode-table encode-table skip-chars
+         invalid-sequences-skip-chars category)
+
+    (check-argument-range encode-failure-octet 0 #xFF)
+    ;; unicode-map must be a true list, and must be non-nil.
+    (check-argument-type #'true-list-p unicode-map)
+    (check-argument-type #'consp unicode-map)
+
+    ;; Don't pass on our extra data to make-coding-system-internal.
+    (setq props (plist-remprop props 'encode-failure-octet)
+	  props (plist-remprop props 'unicode-map))
+
+    (multiple-value-setq
+	(decode-table encode-table)
+      (fixed-width-create-decode-encode-tables unicode-map))
+
+    ;; Register the decode-table. 
+    (define-translation-hash-table hash-table-sym encode-table)
+
+    ;; Generate the programs and skip-chars strings. 
+    (setq decode-program (fixed-width-generate-decode-program decode-table))
+    (multiple-value-setq
+        (encode-program skip-chars invalid-sequences-skip-chars)
+      (fixed-width-generate-encode-program-and-skip-chars-strings
+       decode-table encode-table encode-failure-octet))
+
+    (setq category (fixed-width-choose-category decode-table))
+
+    (unless (vectorp encode-program) 
+      (setq encode-program 
+	    (apply #'vector
+		   (nsublis (list (cons 'encode-table-sym hash-table-sym))
+			    (copy-tree encode-program)))))
+    (unless (vectorp decode-program)
+      (setq decode-program
+	    (apply #'vector decode-program)))
+
+    (loop for (symbol . value)
+      in `((decode . ,decode-program)
+           (encode . ,encode-program)
+           (from-unicode . ,encode-table)
+           (query-skip-chars . ,skip-chars)
+           (invalid-sequences-skip-chars . ,invalid-sequences-skip-chars)
+           (category . ,category))
+      with default = (gensym)
+      do
+      (unless (eq default (plist-get props symbol default))
+        (error
+         'invalid-argument
+         "Explicit property not allowed for fixed-width coding systems"
+         symbol))
+      (setq props (nconc (list symbol value) props)))
+    props))
+
+;;;###autoload
+(defun make-coding-system (name type description props)
+  "Register symbol NAME as a coding system.
+
+TYPE describes the conversion method used and should be one of
+
+nil or `undecided'
+     Automatic conversion.  XEmacs attempts to detect the coding system
+     used in the file.
+`chain'
+     Chain two or more coding systems together to make a combination coding
+     system.
+`no-conversion'
+     No conversion.  Use this for binary files and such.  On output,
+     graphic characters that are not in ASCII or Latin-1 will be
+     replaced by a ?. (For a no-conversion-encoded buffer, these
+     characters will only be present if you explicitly insert them.)
+`convert-eol'
+     Convert CRLF sequences or CR to LF.
+`shift-jis'
+     Shift-JIS (a Japanese encoding commonly used in PC operating systems).
+`unicode'
+     Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.).
+`mswindows-unicode-to-multibyte'
+     (MS Windows only) Converts from Windows Unicode to Windows Multibyte
+     (any code page encoding) upon encoding, and the other way upon decoding.
+`mswindows-multibyte'
+     Converts to or from Windows Multibyte (any code page encoding).
+     This is resolved into a chain of `mswindows-unicode' and
+     `mswindows-unicode-to-multibyte'.
+`iso2022'
+     Any ISO2022-compliant encoding.  Among other things, this includes
+     JIS (the Japanese encoding commonly used for e-mail), EUC (the
+     standard Unix encoding for Japanese and other languages), and
+     Compound Text (the encoding used in X11).  You can specify more
+     specific information about the conversion with the PROPS argument.
+`fixed-width'
+     A fixed-width eight bit encoding that is not necessarily compliant with
+     ISO 2022.  This coding system assumes Unicode equivalency, that is, if
+     two given XEmacs characters have the same Unicode mapping, they will
+     always map to the same octet on disk.
+`big5'
+     Big5 (the encoding commonly used for Mandarin Chinese in Taiwan).
+`ccl'
+     The conversion is performed using a user-written pseudo-code
+     program.  CCL (Code Conversion Language) is the name of this
+     pseudo-code.
+`gzip'
+     GZIP compression format.
+`internal'
+     Write out or read in the raw contents of the memory representing
+     the buffer's text.  This is primarily useful for debugging
+     purposes, and is only enabled when XEmacs has been compiled with
+     DEBUG_XEMACS defined (via the --debug configure option).
+     WARNING: Reading in a file using `internal' conversion can result
+     in an internal inconsistency in the memory representing a
+     buffer's text, which will produce unpredictable results and may
+     cause XEmacs to crash.  Under normal circumstances you should
+     never use `internal' conversion.
+
+DESCRIPTION is a short English phrase describing the coding system,
+suitable for use as a menu item. (See also the `documentation' property
+below.)
+
+PROPS is a property list, describing the specific nature of the
+character set.  Recognized properties are:
+
+`mnemonic'
+     String to be displayed in the modeline when this coding system is
+     active.
+
+`documentation'
+     Detailed documentation on the coding system.
+
+`aliases'
+     A list of aliases for the coding system.  See
+     `define-coding-system-alias'.
+
+`eol-type'
+     End-of-line conversion to be used.  It should be one of
+
+	nil
+		Automatically detect the end-of-line type (LF, CRLF,
+		or CR).  Also generate subsidiary coding systems named
+		`NAME-unix', `NAME-dos', and `NAME-mac', that are
+		identical to this coding system but have an EOL-TYPE
+		value of `lf', `crlf', and `cr', respectively.
+	`lf'
+		The end of a line is marked externally using ASCII LF.
+		Since this is also the way that XEmacs represents an
+		end-of-line internally, specifying this option results
+		in no end-of-line conversion.  This is the standard
+		format for Unix text files.
+	`crlf'
+		The end of a line is marked externally using ASCII
+		CRLF.  This is the standard format for MS-DOS text
+		files.
+	`cr'
+		The end of a line is marked externally using ASCII CR.
+		This is the standard format for Macintosh text files.
+	t
+		Automatically detect the end-of-line type but do not
+		generate subsidiary coding systems.  (This value is
+		converted to nil when stored internally, and
+		`coding-system-property' will return nil.)
+
+`post-read-conversion'
+     The value is a function to call after some text is inserted and
+     decoded by the coding system itself and before any functions in
+     `after-change-functions' are called. (#### Not actually true in
+     XEmacs. `after-change-functions' will be called twice if
+     `post-read-conversion' changes something.) The argument of this
+     function is the same as for a function in
+     `after-insert-file-functions', i.e. LENGTH of the text inserted,
+     with point at the head of the text to be decoded.
+
+`pre-write-conversion'
+     The value is a function to call after all functions in
+     `write-region-annotate-functions' and `buffer-file-format' are
+     called, and before the text is encoded by the coding system itself.
+     The arguments to this function are the same as those of a function
+     in `write-region-annotate-functions', i.e. FROM and TO, specifying
+     a region of text.
+
+The following properties are used by `default-query-coding-region',
+the default implementation of `query-coding-region'. This
+implementation and these properties are not used by the Unicode coding
+systems, nor by fixed-width coding systems. 
+
+`safe-chars'
+     The value is a char table.  If a character has non-nil value in it,
+     the character is safely supported by the coding system.  
+     This overrides the `safe-charsets' property.
+   
+`safe-charsets'
+     The value is a list of charsets safely supported by the coding
+     system.  For coding systems based on ISO 2022, XEmacs may try to
+     encode characters outside these character sets, but outside of
+     East Asia and East Asian coding systems, it is unlikely that
+     consumers of the data will understand XEmacs' encoding.
+     The value t means that all XEmacs character sets handles are supported.  
+
+The following properties are allowed for FSF compatibility but currently
+ignored:
+
+`translation-table-for-decode'
+     The value is a translation table to be applied on decoding.  See
+     the function `make-translation-table' for the format of translation
+     table.  This is not applicable to CCL-based coding systems.
+    
+`translation-table-for-encode'
+     The value is a translation table to be applied on encoding.  This is
+     not applicable to CCL-based coding systems.
+     
+`mime-charset'
+     The value is a symbol of which name is `MIME-charset' parameter of
+     the coding system.
+    
+`valid-codes' (meaningful only for a coding system based on CCL)
+     The value is a list to indicate valid byte ranges of the encoded
+     file.  Each element of the list is an integer or a cons of integer.
+     In the former case, the integer value is a valid byte code.  In the
+     latter case, the integers specifies the range of valid byte codes.
+
+The following additional property is recognized if TYPE is `convert-eol':
+
+`subtype'
+     One of `lf', `crlf', `cr' or nil (for autodetection).  When decoding,
+     the corresponding sequence will be converted to LF.  When encoding,
+     the opposite happens.  This coding system converts characters to
+     characters.
+
+
+
+The following additional properties are recognized if TYPE is `iso2022':
+
+`charset-g0'
+`charset-g1'
+`charset-g2'
+`charset-g3'
+     The character set initially designated to the G0 - G3 registers.
+     The value should be one of
+
+          -- A charset object (designate that character set)
+	  -- nil (do not ever use this register)
+	  -- t (no character set is initially designated to
+		the register, but may be later on; this automatically
+		sets the corresponding `force-g*-on-output' property)
+
+`force-g0-on-output'
+`force-g1-on-output'
+`force-g2-on-output'
+`force-g2-on-output'
+     If non-nil, send an explicit designation sequence on output before
+     using the specified register.
+
+`short'
+     If non-nil, use the short forms \"ESC $ @\", \"ESC $ A\", and
+     \"ESC $ B\" on output in place of the full designation sequences
+     \"ESC $ ( @\", \"ESC $ ( A\", and \"ESC $ ( B\".
+
+`no-ascii-eol'
+     If non-nil, don't designate ASCII to G0 at each end of line on output.
+     Setting this to non-nil also suppresses other state-resetting that
+     normally happens at the end of a line.
+
+`no-ascii-cntl'
+     If non-nil, don't designate ASCII to G0 before control chars on output.
+
+`seven'
+     If non-nil, use 7-bit environment on output.  Otherwise, use 8-bit
+     environment.
+
+`lock-shift'
+     If non-nil, use locking-shift (SO/SI) instead of single-shift
+     or designation by escape sequence.
+
+`no-iso6429'
+     If non-nil, don't use ISO6429's direction specification.
+
+`escape-quoted'
+     If non-nil, literal control characters that are the same as
+     the beginning of a recognized ISO2022 or ISO6429 escape sequence
+     (in particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E),
+     SS3 (0x8F), and CSI (0x9B)) are \"quoted\" with an escape character
+     so that they can be properly distinguished from an escape sequence.
+     (Note that doing this results in a non-portable encoding.) This
+     encoding flag is used for byte-compiled files.  Note that ESC
+     is a good choice for a quoting character because there are no
+     escape sequences whose second byte is a character from the Control-0
+     or Control-1 character sets; this is explicitly disallowed by the
+     ISO2022 standard.
+
+`input-charset-conversion'
+     A list of conversion specifications, specifying conversion of
+     characters in one charset to another when decoding is performed.
+     Each specification is a list of two elements: the source charset,
+     and the destination charset.
+
+`output-charset-conversion'
+     A list of conversion specifications, specifying conversion of
+     characters in one charset to another when encoding is performed.
+     The form of each specification is the same as for
+     `input-charset-conversion'.
+
+The following additional properties are recognized if TYPE is
+`fixed-width':
+
+`unicode-map'
+     Required.  A plist describing a map from octets in the coding system
+     NAME (as integers) to XEmacs characters.  Those XEmacs characters will
+     be used explicitly on decoding, but for encoding (most relevantly, on
+     writing to disk) XEmacs characters that map to the same Unicode code
+     point will be unified.  This means that the ISO-8859-? characters that
+     map to the same Unicode code point will not be distinct when written to
+     disk, which is normally what is intended; it also means that East Asian
+     Han characters from different XEmacs character sets will not be
+     distinct when written to disk, which is less often what is intended.
+
+     Any octets not mapped, and with values above #x7f, will be decoded into
+     XEmacs characters that reflect that their values are undefined.  These
+     characters will be displayed in a language-environment-specific
+     way. See `unicode-error-default-translation-table' and the
+     `invalid-sequence-coding-system' argument to `set-language-info'.
+
+     These characters will normally be treated as invalid when checking
+     whether text can be encoded with `query-coding-region'--see the
+     IGNORE-INVALID-SEQUENCESP argument to that function to avoid this.  It
+     is possible to specify that octets with values less than #x80 (or
+     indeed greater than it) be treated in this way, by specifying
+     explicitly that they correspond to the character mapping to that octet
+     in `unicode-error-default-translation-table'.  Far fewer coding systems
+     override the ASCII mapping, though, so this is not the default.
+
+`encode-failure-octet'
+     An integer between 0 and 255 to write in place of XEmacs characters
+     that cannot be encoded, defaulting to the code for tilde `~'.
+
+The following additional properties are recognized (and required)
+if TYPE is `ccl':
+
+`decode'
+     CCL program used for decoding (converting to internal format).
+
+`encode'
+     CCL program used for encoding (converting to external format).
+
+
+The following additional properties are recognized if TYPE is `chain':
+
+`chain'
+     List of coding systems to be chained together, in decoding order.
+
+`canonicalize-after-coding'
+     Coding system to be returned by the detector routines in place of
+     this coding system.
+
+
+
+The following additional properties are recognized if TYPE is `unicode':
+
+`unicode-type'
+     One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not
+     yet implemented).  `utf-16' is the basic two-byte encoding;
+     `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible
+     variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using
+     only characters that will safely pass through all mail gateways.
+     [[ This should be \"transformation format\".  There should also be
+     `ucs-2' (or `bmp' -- no surrogates) and `utf-32' (range checked). ]]
+
+`little-endian'
+     If non-nil, `utf-16' and `ucs-4' will write out the groups of two
+     or four bytes little-endian instead of big-endian.  This is required,
+     for example, under Windows.
+
+`need-bom'
+     If non-nil, a byte order mark (BOM, or Unicode FFFE) should be
+     written out at the beginning of the data.  This serves both to
+     identify the endianness of the following data and to mark the
+     data as Unicode (at least, this is how Windows uses it).
+     [[ The correct term is \"signature\", since this technique may also
+     be used with UTF-8.  That is the term used in the standard. ]]
+
+
+The following additional properties are recognized if TYPE is
+`mswindows-multibyte':
+
+`code-page'
+     Either a number (specifying a particular code page) or one of the
+     symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI,
+     OEM, Macintosh, or EBCDIC code page associated with a particular
+     locale (given by the `locale' property).  NOTE: EBCDIC code pages
+     only exist in Windows 2000 and later.
+
+`locale'
+     If `code-page' is a symbol, this specifies the locale whose code
+     page of the corresponding type should be used.  This should be
+     one of the following: A cons of two strings, (LANGUAGE
+     . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a
+     language; SUBLANG_DEFAULT, i.e. the default sublanguage, is
+     used); or one of the symbols `current', `user-default', or
+     `system-default', corresponding to the values of
+     `mswindows-current-locale', `mswindows-user-default-locale', or
+     `mswindows-system-default-locale', respectively.
+
+
+The following additional properties are recognized if TYPE is `undecided':
+\[[ Doesn't GNU use \"detect-*\" for the following two? ]]
+
+`do-eol'
+     Do EOL detection.
+
+`do-coding'
+     Do encoding detection.
+
+`coding-system'
+     If encoding detection is not done, use the specified coding system
+     to do decoding.  This is used internally when implementing coding
+     systems with an EOL type that specifies autodetection (the default),
+     so that the detector routines return the proper subsidiary.
+
+
+
+The following additional property is recognized if TYPE is `gzip':
+
+`level'
+     Compression level: 0 through 9, or `default' (currently 6)."
+  (when (eq 'fixed-width type)
+    (setq props (fixed-width-rework-props-runtime name props)))
+  (make-coding-system-internal name type description props))
+
+(define-compiler-macro make-coding-system (&whole form name type
+                                           &optional description props)
+  (if (equal '(quote fixed-width) type)
+      (if (memq (car-safe props) '(quote eval-when-compile))
+          (let* ((props (if (eq 'eval-when-compile (car props))
+                            (eval (cadr props))
+                          (cadr props)))
+                 (encode-failure-octet
+                  (or (plist-get props 'encode-failure-octet)
+                      (char-to-int ?~)))
+                 (unicode-map (plist-get props 'unicode-map))
+                 (default-plist-entry (gensym))
+                 (encode-table-sym (gensym 
+                                    (if (eq 'quote (car name))
+                                        (format "%s-enc-" (second name)))))
+                 encode-program decode-program
+                 decode-table encode-table
+                 skip-chars invalid-sequences-skip-chars category)
+
+            (check-argument-range encode-failure-octet 0 #xFF)
+            ;; unicode-map must be a true list, and must be non-nil.
+            (check-argument-type #'true-list-p unicode-map)
+            (check-argument-type #'consp unicode-map)
+
+            ;; Don't pass on our extra data to make-coding-system-internal.
+            (setq props (plist-remprop props 'encode-failure-octet)
+                  props (plist-remprop props 'unicode-map))
+
+            (multiple-value-setq
+                (decode-table encode-table)
+              (fixed-width-create-decode-encode-tables unicode-map))
+    
+            ;; Generate the decode and encode programs, and the skip-chars
+            ;; arguments.
+            (setq decode-program
+                  (fixed-width-generate-decode-program decode-table)
+                  category (fixed-width-choose-category decode-table))
+
+            (multiple-value-setq
+                (encode-program skip-chars invalid-sequences-skip-chars)
+              (fixed-width-generate-encode-program-and-skip-chars-strings
+               decode-table encode-table encode-failure-octet))
+
+            (unless (vectorp decode-program)
+              (setq decode-program
+                    (apply #'vector decode-program)))
+
+            (unless (eq default-plist-entry (plist-get props 'encode
+                                                       default-plist-entry))
+              (error
+               'invalid-argument
+               "Explicit property not allowed for fixed-width coding system"
+               'encode))
+            (loop for (symbol . value)
+              in `((decode . ,decode-program)
+                   (from-unicode . ,encode-table)
+                   (query-skip-chars . ,skip-chars)
+                   (invalid-sequences-skip-chars .
+                                                 ,invalid-sequences-skip-chars)
+                   (category . ,category))
+              do
+              (unless (eq default-plist-entry (plist-get props symbol
+                                                         default-plist-entry))
+                (error
+                 'invalid-argument
+                 "Explicit property not allowed for \
+fixed-width coding systems"
+                 symbol))
+              (setq props (nconc (list symbol value) props)))
+            `(progn
+              (define-translation-hash-table ',encode-table-sym ,encode-table)
+              (make-coding-system-internal
+               ,name ,type ,description
+               ',(nconc (list 'encode
+                              (apply #'vector
+                                     (nsublis
+                                      (list (cons 'encode-table-sym
+                                                  encode-table-sym))
+                                      encode-program)))
+                        props))))
+        ;; The form does not use literals; call make-coding-system at
+        ;; run time.
+        form)
+    (if (byte-compile-constp type)
+        ;; This is not a fixed-width call; compile it to a form that 21.4
+        ;; can also understand.
+        `(funcall (or (and (fboundp 'make-coding-system-internal)
+                           'make-coding-system-internal)
+                      'make-coding-system)
+          ,@(cdr form))
+      ;; TYPE is not literal; work things out at runtime.
+      form)))
+
author	Aidan Kehoe <kehoea@parhasard.net>
date	Sat, 19 Sep 2009 22:53:13 +0100
parents
children	dc3c2f298857