Mercurial > hg > xemacs-beta
diff lisp/mule/make-coding-system.el @ 4690:257b468bf2ca
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we need such
functionality if we're going to have a reliable and portable
#'query-coding-region implementation. However, this change doesn't yet
provide #'query-coding-region for the mswindow-multibyte coding systems,
there should be no functional differences between an XEmacs with this change
and one without it.
src/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we
need such functionality if we're going to have a reliable and
portable #'query-coding-region implementation. However, this
change doesn't yet provide #'query-coding-region for the
mswindow-multibyte coding systems, there should be no functional
differences between an XEmacs with this change and one without it.
* mule-coding.c (struct fixed_width_coding_system):
Add a new coding system type, fixed_width, and implement it. It
uses the CCL infrastructure but has a much simpler creation API,
and its own query_method, formerly in lisp/mule/mule-coding.el.
* unicode.c:
Move the Unicode query method implementation here from
unicode.el.
* lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table
here.
* intl-win32.c (complex_vars_of_intl_win32):
Use Fmake_coding_system_internal, not Fmake_coding_system.
* general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence
here.
* file-coding.h (enum coding_system_variant):
Add fixed_width_coding_system here.
(struct coding_system_methods):
Add query_method and query_lstream_method to the coding system
methods.
Provide flags for the query methods.
Declare the default query method; initialise it correctly in
INITIALIZE_CODING_SYSTEM_TYPE.
* file-coding.c (default_query_method):
New function, the default query method for coding systems that do
not set it. Moved from coding.el.
(make_coding_system_1):
Accept new elements in PROPS in #'make-coding-system; aliases, a
list of aliases; safe-chars and safe-charsets (these were
previously accepted but not saved); and category.
(Fmake_coding_system_internal):
New function, what used to be #'make-coding-system--on Mule
builds, we've now moved some of the functionality of this to
Lisp.
(Fcoding_system_canonical_name_p):
Move this earlier in the file, since it's now called from within
make_coding_system_1.
(Fquery_coding_region):
Move the implementation of this here, from coding.el.
(complex_vars_of_file_coding):
Call Fmake_coding_system_internal, not Fmake_coding_system;
specify safe-charsets properties when we're a mule build.
* extents.h (mouse_highlight_priority, Fset_extent_priority,
Fset_extent_face, Fmap_extents):
Make these available to other C files.
lisp/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
* coding.el:
Consolidate code that depends on the presence or absence of Mule
at the end of this file.
(default-query-coding-region, query-coding-region):
Move these functions to C.
(default-query-coding-region-safe-charset-skip-chars-map):
Remove this variable, the corresponding C variable is
Vdefault_query_coding_region_chartab_cache in file-coding.c.
(query-coding-string): Update docstring to reflect actual multiple
values, be more careful about not modifying a range table that
we're currently mapping over.
(encode-coding-char): Make the implementation of this simpler.
(featurep 'mule): Autoload #'make-coding-system from
mule/make-coding-system.el if we're a mule build; provide an
appropriate compiler macro.
Do various non-mule compatibility things if we're not a mule
build.
* update-elc.el (additional-dump-dependencies):
Add mule/make-coding-system as a dump time dependency if we're a
mule build.
* unicode.el (ccl-encode-to-ucs-2):
(decode-char):
(encode-char):
Move these earlier in the file, for the sake of some byte compile
warnings.
(unicode-query-coding-region):
Move this to unicode.c
* mule/make-coding-system.el:
New file, not dumped. Contains the functionality to rework the
arguments necessary for fixed-width coding systems, and contains
the implementation of #'make-coding-system, which now calls
#'make-coding-system-internal.
* mule/vietnamese.el (viscii):
* mule/latin.el (iso-8859-2):
(windows-1250):
(iso-8859-3):
(iso-8859-4):
(iso-8859-14):
(iso-8859-15):
(iso-8859-16):
(iso-8859-9):
(macintosh):
(windows-1252):
* mule/hebrew.el (iso-8859-8):
* mule/greek.el (iso-8859-7):
(windows-1253):
* mule/cyrillic.el (iso-8859-5):
(koi8-r):
(koi8-u):
(windows-1251):
(alternativnyj):
(koi8-ru):
(koi8-t):
(koi8-c):
(koi8-o):
* mule/arabic.el (iso-8859-6):
(windows-1256):
Move all these coding systems to being of type fixed-width, not of
type CCL. This allows the distinct query-coding-region for them to
be in C, something which will eventually allow us to implement
query-coding-region for the mswindows-multibyte coding systems.
* mule/general-late.el (posix-charset-to-coding-system-hash):
Document why we're pre-emptively persuading the byte compiler that
the ELC for this file needs to be written using escape-quoted.
Call #'set-unicode-query-skip-chars-args, now the Unicode
query-coding-region implementation is in C.
* mule/thai-xtis.el (tis-620):
Don't bother checking whether we're XEmacs or not here.
* mule/mule-coding.el:
Move the eight bit fixed-width functionality from this file to
make-coding-system.el.
tests/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
* automated/mule-tests.el:
Check a coding system's type, not an 8-bit-fixed property, for
whether that coding system should be treated as a fixed-width
coding system.
* automated/query-coding-tests.el:
Don't test the query coding functionality for mswindows-multibyte
coding systems, it's not yet implemented.
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 19 Sep 2009 22:53:13 +0100 |
parents | |
children | dc3c2f298857 |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/lisp/mule/make-coding-system.el Sat Sep 19 22:53:13 2009 +0100 @@ -0,0 +1,984 @@ +;;; make-coding-system.el; Provides the #'make-coding-system function and +;;; much of the implementation of the fixed-width coding system type. + +;; Copyright (C) 2009 Free Software Foundation + +;; Author: Aidan Kehoe + +;; This file is part of XEmacs. + +;; XEmacs is free software; you can redistribute it and/or modify it +;; under the terms of the GNU General Public License as published by +;; the Free Software Foundation; either version 2, or (at your option) +;; any later version. + +;; XEmacs is distributed in the hope that it will be useful, but +;; WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +;; General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with XEmacs; see the file COPYING. If not, write to the +;; Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, +;; Boston, MA 02110-1301, USA. + +;;; Commentary: + +;;; Code: + +(defvar fixed-width-private-use-start (decode-char 'ucs #xE000) + "Start of a 256 code private use area for fixed-width coding systems. + +This is used to ensure that distinct octets on disk for a given coding +system map to distinct XEmacs characters, preventing spurious changes when +a file is read, not changed, and then written. ") + +(defun fixed-width-generate-helper (decode-table encode-table + encode-failure-octet) + "Helper func, `fixed-width-generate-encode-program-and-skip-chars-strings', +which see. + +Deals with the case where ASCII and another character set can both be +encoded unambiguously and completely into the coding-system; if this is so, +returns multiple values comprisig of such a ccl-program and the character +set in question. If not, it returns nil." + (let ((tentative-encode-program-parts + (eval-when-compile + (let* ((vec-len 128) + (compiled + (append + (ccl-compile + `(1 + (loop + (read-multibyte-character r0 r1) + (if (r0 == ,(charset-id 'ascii)) + (write r1) + ((if (r0 == #xABAB) + ;; #xBFFE is a sentinel in the compiled + ;; program. + ((r0 = r1 & #x7F) + (write r0 ,(make-vector vec-len #xBFFE))) + ((mule-to-unicode r0 r1) + (if (r0 == #xFFFD) + (write #xBEEF) + ((lookup-integer encode-table-sym r0 r3) + (if r7 + (write-multibyte-character r0 r3) + (write #xBEEF)))))))) + (repeat)))) nil)) + (first-part compiled) + (last-part + (member-if-not (lambda (entr) (eq #xBFFE entr)) + (member-if + (lambda (entr) (eq #xBFFE entr)) + first-part)))) + (while compiled + (when (eq #xBFFE (cadr compiled)) + (assert (= vec-len (search '(#xBFFE) (cdr compiled) + :test #'/=)) nil + "Strange ccl vector length") + (setcdr compiled nil)) + (setq compiled (cdr compiled))) + ;; Is the generated code as we expect it to be? + (assert (and (memq #xABAB first-part) + (memq #xBEEF14 last-part)) + nil + "This code assumes that the constant #xBEEF is #xBEEF14 in \ +compiled CCL code,\nand that the constant #xABAB is #xABAB. If that is +not the case, and it appears not to be--that's why you're getting this +message--it will not work. ") + (list first-part last-part vec-len)))) + (charset-lower -1) + (charset-upper -1) + worth-trying known-charsets encode-program + other-charset-vector ucs) + + (loop for char across decode-table + do (pushnew (char-charset char) known-charsets)) + (setq known-charsets (delq 'ascii known-charsets)) + + (loop for known-charset in known-charsets + do + ;; This is not possible for two dimensional charsets. + (when (eq 1 (charset-dimension known-charset)) + (if (eq 'control-1 known-charset) + (setq charset-lower 0 + charset-upper 31) + ;; There should be a nicer way to get the limits here. + (condition-case args-out-of-range + (make-char known-charset #x100) + (args-out-of-range + (setq charset-lower (third args-out-of-range) + charset-upper (fourth args-out-of-range))))) + (loop + for i from charset-lower to charset-upper + always (and (setq ucs + (encode-char (make-char known-charset i) 'ucs)) + (gethash ucs encode-table)) + finally (setq worth-trying known-charset)) + + ;; Only trying this for one charset at a time, the first find. + (when worth-trying (return)) + + ;; Okay, this charset is not worth trying, Try the next. + (setq charset-lower -1 + charset-upper -1 + worth-trying nil))) + + (when worth-trying + (setq other-charset-vector + (make-vector (third tentative-encode-program-parts) + encode-failure-octet)) + (loop for i from charset-lower to charset-upper + do (aset other-charset-vector i + (gethash (encode-char (make-char worth-trying i) + 'ucs) encode-table))) + (setq encode-program + (nsublis + (list (cons #xABAB (charset-id worth-trying))) + (nconc + (copy-list (first + tentative-encode-program-parts)) + (append other-charset-vector nil) + (copy-tree (second + tentative-encode-program-parts)))))) + (and encode-program (values encode-program worth-trying)))) + +(defun fixed-width-generate-encode-program-and-skip-chars-strings + (decode-table encode-table encode-failure-octet) + "Generate a CCL program to encode a 8-bit fixed-width charset. + +DECODE-TABLE must have 256 non-cons entries, and will be regarded as +describing a map from the octet corresponding to an offset in the +table to the that entry in the table. ENCODE-TABLE is a hash table +map from unicode values to characters in the range [0,255]. +ENCODE-FAILURE-OCTET describes an integer between 0 and 255 +\(inclusive) to write in the event that a character cannot be encoded." + (check-argument-type #'vectorp decode-table) + (check-argument-range (length decode-table) #x100 #x100) + (check-argument-type #'hash-table-p encode-table) + (check-argument-type #'integerp encode-failure-octet) + (check-argument-range encode-failure-octet #x00 #xFF) + (let ((encode-program nil) + (general-encode-program + (eval-when-compile + (let ((prog (append + (ccl-compile + `(1 + (loop + (read-multibyte-character r0 r1) + (mule-to-unicode r0 r1) + (if (r0 == #xFFFD) + (write #xBEEF) + ((lookup-integer encode-table-sym r0 r3) + (if r7 + (write-multibyte-character r0 r3) + (write #xBEEF)))) + (repeat)))) nil))) + (assert (memq #xBEEF14 prog) + nil + "This code assumes that the constant #xBEEF is #xBEEF14 \ +in compiled CCL code.\nIf that is not the case, and it appears not to +be--that's why you're getting this message--it will not work. ") + prog))) + (encode-program-with-ascii-optimisation + (eval-when-compile + (let ((prog (append + (ccl-compile + `(1 + (loop + (read-multibyte-character r0 r1) + (if (r0 == ,(charset-id 'ascii)) + (write r1) + ((mule-to-unicode r0 r1) + (if (r0 == #xFFFD) + (write #xBEEF) + ((lookup-integer encode-table-sym r0 r3) + (if r7 + (write-multibyte-character r0 r3) + (write #xBEEF)))))) + (repeat)))) nil))) + (assert (memq #xBEEF14 prog) + nil + "This code assumes that the constant #xBEEF is #xBEEF14 \ +in compiled CCL code.\nIf that is not the case, and it appears not to +be--that's why you're getting this message--it will not work. ") + prog))) + (ascii-encodes-as-itself nil) + (control-1-encodes-as-itself t) + (invalid-sequence-code-point-start + (eval-when-compile + (char-to-unicode + (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) + further-char-set skip-chars invalid-sequences-skip-chars) + + ;; Is this coding system ASCII-compatible? If so, we can avoid the hash + ;; table lookup for those characters. + (loop + for i from #x00 to #x7f + always (eq (int-to-char i) (gethash i encode-table)) + finally (setq ascii-encodes-as-itself t)) + + ;; Note that this logic handles EBCDIC badly. For example, CP037, + ;; MIME name ebcdic-na, has the entire repertoire of ASCII and + ;; Latin 1, and thus a more optimal ccl encode program would check + ;; for those character sets and use tables. But for now, we do a + ;; hash table lookup for every character. + (if (null ascii-encodes-as-itself) + ;; General encode program. Pros; general and correct. Cons; + ;; slow, a hash table lookup + mule-unicode conversion is done + ;; for every character encoding. + (setq encode-program general-encode-program) + (multiple-value-setq + (encode-program further-char-set) + ;; Encode program with ascii-ascii mapping (based on a + ;; character's mule character set), and one other mule + ;; character set using table-based encoding, other + ;; character sets using hash table lookups. + ;; fixed-width-non-ascii-completely-coveredp only returns + ;; such a mapping if some non-ASCII charset with + ;; characters in decode-table is entirely covered by + ;; encode-table. + (fixed-width-generate-helper decode-table encode-table + encode-failure-octet)) + (unless encode-program + ;; If fixed-width-non-ascii-completely-coveredp returned nil, + ;; but ASCII still encodes as itself, do one-to-one mapping + ;; for ASCII, and a hash table lookup for everything else. + (setq encode-program encode-program-with-ascii-optimisation))) + + (setq encode-program + (nsublis + (list (cons #xBEEF14 + (logior (lsh encode-failure-octet 8) + #x14))) + (copy-tree encode-program))) + (loop + for i from #x80 to #x9f + do (unless (= i (aref decode-table i)) + (setq control-1-encodes-as-itself nil) + (return))) + (loop + for i from #x00 to #xFF + initially (setq skip-chars + (cond + ((and ascii-encodes-as-itself + control-1-encodes-as-itself further-char-set) + (concat "\x00-\x9f" (charset-skip-chars-string + further-char-set))) + ((and ascii-encodes-as-itself + control-1-encodes-as-itself) + "\x00-\x9f") + ((null ascii-encodes-as-itself) + (skip-chars-quote (apply #'string + (append decode-table nil)))) + (further-char-set + (concat (charset-skip-chars-string 'ascii) + (charset-skip-chars-string further-char-set))) + (t + (charset-skip-chars-string 'ascii))) + invalid-sequences-skip-chars "") + with decoded-ucs = nil + with decoded = nil + with no-ascii-transparency-skip-chars-list = + (unless ascii-encodes-as-itself (append decode-table nil)) + ;; Can't use #'match-string here, see: + ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net + with skip-chars-test = + #'(lambda (skip-chars-string testing) + (with-temp-buffer + (insert testing) + (goto-char (point-min)) + (skip-chars-forward skip-chars-string) + (= (point) (point-max)))) + do + (setq decoded (aref decode-table i) + decoded-ucs (char-to-unicode decoded)) + (cond + ((<= invalid-sequence-code-point-start decoded-ucs + (+ invalid-sequence-code-point-start #xFF)) + (setq invalid-sequences-skip-chars + (concat (string decoded) + invalid-sequences-skip-chars)) + (assert (not (funcall skip-chars-test skip-chars decoded)) + "This char should only be skipped with \ +`invalid-sequences-skip-chars', not by `skip-chars'")) + ((not (funcall skip-chars-test skip-chars decoded)) + (if ascii-encodes-as-itself + (setq skip-chars (concat skip-chars (string decoded))) + (push decoded no-ascii-transparency-skip-chars-list)))) + finally (unless ascii-encodes-as-itself + (setq skip-chars + (skip-chars-quote + (apply #'string + no-ascii-transparency-skip-chars-list))))) + (values encode-program skip-chars invalid-sequences-skip-chars))) + +(defun fixed-width-create-decode-encode-tables (unicode-map) + "Return multiple values \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. +UNICODE-MAP should be an alist mapping from integer octet values to +characters with UCS code points; DECODE-TABLE will be a 256-element +vector, and ENCODE-TABLE will be a hash table mapping from 256 numbers +to 256 distinct characters." + (check-argument-type #'listp unicode-map) + (let ((decode-table (make-vector 256 nil)) + (encode-table (make-hash-table :size 256)) + (private-use-start (encode-char fixed-width-private-use-start 'ucs)) + (invalid-sequence-code-point-start + (eval-when-compile + (char-to-unicode + (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) + desired-ucs decode-table-entry) + + (loop for (external internal) + in unicode-map + do + (aset decode-table external internal) + (assert (not (eq (encode-char internal 'ucs) -1)) + nil + "Looks like you're creating a fixed-width coding system \ +in a dumped file, \nand you're either not providing a literal unicode map +or PROPS. Don't do that; fixed-width coding systems rely on sensible +Unicode mappings being available, which they are at compile time for +dumped files (but this requires the mentioned literals), but not, for +most of them, at run time. ") + + (puthash (encode-char internal 'ucs) + ;; This is semantically an integer, but Dave Love's design + ;; for lookup-integer in CCL means we need to store it as a + ;; character. + (int-to-char external) + encode-table)) + + ;; Now, go through the decode table. For octet values above #x7f, if the + ;; decode table entry is nil, this means that they have an undefined + ;; mapping (= they map to XEmacs characters with keys in + ;; unicode-error-default-translation-table); for octet values below or + ;; equal to #x7f, it means that they map to ASCII. + + ;; If any entry (whether below or above #x7f) in the decode-table + ;; already maps to some character with a key in + ;; unicode-error-default-translation-table, it is treated as an + ;; undefined octet by `query-coding-region'. That is, it is not + ;; necessary for an octet value to be above #x7f for this to happen. + + (dotimes (i 256) + (setq decode-table-entry (aref decode-table i)) + (if decode-table-entry + (when (get-char-table + decode-table-entry + unicode-error-default-translation-table) + ;; The caller is explicitly specifying that this octet + ;; corresponds to an invalid sequence on disk: + (assert (= (get-char-table + decode-table-entry + unicode-error-default-translation-table) i) + "Bad argument for a fixed-width coding system. +If you're going to designate an octet with value below #x80 as invalid +for this coding system, make sure to map it to the invalid sequence +character corresponding to its octet value on disk. ")) + + ;; decode-table-entry is nil; either the octet is to be treated as + ;; contributing to an error sequence (when (> #x7f i)), or it should + ;; be attempted to treat it as ASCII-equivalent. + (setq desired-ucs (or (and (< i #x80) i) + (+ invalid-sequence-code-point-start i))) + (while (gethash desired-ucs encode-table) + (assert (not (< i #x80)) + "UCS code point should not already be in encode-table!" + ;; There is one invalid sequence char per octet value; + ;; with fixed-width coding systems, it makes no sense + ;; for us to be multiply allocating them. + (gethash desired-ucs encode-table)) + (setq desired-ucs (+ private-use-start desired-ucs) + private-use-start (+ private-use-start 1))) + (puthash desired-ucs (int-to-char i) encode-table) + (setq desired-ucs (if (> desired-ucs #xFF) + (unicode-to-char desired-ucs) + ;; So we get Latin-1 when run at dump time, + ;; instead of JIT-allocated characters. + (int-to-char desired-ucs))) + (aset decode-table i desired-ucs))) + (values decode-table encode-table))) + +(defun fixed-width-generate-decode-program (decode-table) + "Given DECODE-TABLE, generate a CCL program to decode an 8-bit charset. +DECODE-TABLE must have 256 non-cons entries, and will be regarded as +describing a map from the octet corresponding to an offset in the +table to the that entry in the table. " + (check-argument-type #'vectorp decode-table) + (check-argument-range (length decode-table) #x100 #x100) + (let ((decode-program-parts + (eval-when-compile + (let* ((compiled + (append + (ccl-compile + `(3 + ((read r0) + (loop + (write-read-repeat r0 ,(make-vector + 256 'sentinel)))))) nil)) + (first-part compiled) + (last-part + (member-if-not #'symbolp + (member-if-not #'integerp first-part)))) + ;; Chop off the sentinel sentinel sentinel [..] part. + (while compiled + (if (symbolp (cadr compiled)) + (setcdr compiled nil)) + (setq compiled (cdr compiled))) + (list first-part last-part))))) + (nconc + ;; copy-list needed, because the structure of the literal provided + ;; by our eval-when-compile hangs around. + (copy-list (first decode-program-parts)) + (append decode-table nil) + (second decode-program-parts)))) + +(defun fixed-width-choose-category (decode-table) + "Given DECODE-TABLE, return an appropriate coding category. +DECODE-TABLE is a 256-entry vector describing the mapping from octets on +disk to XEmacs characters for some fixed-width 8-bit coding system." + (check-argument-type #'vectorp decode-table) + (check-argument-range (length decode-table) #x100 #x100) + (loop + named category + for i from #x80 to #x9F + do (unless (= i (aref decode-table i)) + (return-from category 'no-conversion)) + finally return 'iso-8-1)) + +(defun fixed-width-rework-props-runtime (name props) + "Rework PROPS to a form understood by `make-coding-system-internal'. + +NAME must be a symbol, describing a fixed-width coding system that is +about to be created. Much of the implementation of the fixed-width +coding system is in Lisp, and this function allows us to rework the +arguments that `make-coding-system-internal' sees accordingly. + +If you are calling this function from anywhere but +`make-coding-system', you're probably doing something wrong." + (check-argument-type #'symbolp name) + (check-valid-plist props) + (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet) + (char-to-int ?~))) + (unicode-map (plist-get props 'unicode-map)) + (hash-table-sym (gensym (format "%s-encode-table" name))) + encode-program decode-program decode-table encode-table skip-chars + invalid-sequences-skip-chars category) + + (check-argument-range encode-failure-octet 0 #xFF) + ;; unicode-map must be a true list, and must be non-nil. + (check-argument-type #'true-list-p unicode-map) + (check-argument-type #'consp unicode-map) + + ;; Don't pass on our extra data to make-coding-system-internal. + (setq props (plist-remprop props 'encode-failure-octet) + props (plist-remprop props 'unicode-map)) + + (multiple-value-setq + (decode-table encode-table) + (fixed-width-create-decode-encode-tables unicode-map)) + + ;; Register the decode-table. + (define-translation-hash-table hash-table-sym encode-table) + + ;; Generate the programs and skip-chars strings. + (setq decode-program (fixed-width-generate-decode-program decode-table)) + (multiple-value-setq + (encode-program skip-chars invalid-sequences-skip-chars) + (fixed-width-generate-encode-program-and-skip-chars-strings + decode-table encode-table encode-failure-octet)) + + (setq category (fixed-width-choose-category decode-table)) + + (unless (vectorp encode-program) + (setq encode-program + (apply #'vector + (nsublis (list (cons 'encode-table-sym hash-table-sym)) + (copy-tree encode-program))))) + (unless (vectorp decode-program) + (setq decode-program + (apply #'vector decode-program))) + + (loop for (symbol . value) + in `((decode . ,decode-program) + (encode . ,encode-program) + (from-unicode . ,encode-table) + (query-skip-chars . ,skip-chars) + (invalid-sequences-skip-chars . ,invalid-sequences-skip-chars) + (category . ,category)) + with default = (gensym) + do + (unless (eq default (plist-get props symbol default)) + (error + 'invalid-argument + "Explicit property not allowed for fixed-width coding systems" + symbol)) + (setq props (nconc (list symbol value) props))) + props)) + +;;;###autoload +(defun make-coding-system (name type description props) + "Register symbol NAME as a coding system. + +TYPE describes the conversion method used and should be one of + +nil or `undecided' + Automatic conversion. XEmacs attempts to detect the coding system + used in the file. +`chain' + Chain two or more coding systems together to make a combination coding + system. +`no-conversion' + No conversion. Use this for binary files and such. On output, + graphic characters that are not in ASCII or Latin-1 will be + replaced by a ?. (For a no-conversion-encoded buffer, these + characters will only be present if you explicitly insert them.) +`convert-eol' + Convert CRLF sequences or CR to LF. +`shift-jis' + Shift-JIS (a Japanese encoding commonly used in PC operating systems). +`unicode' + Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.). +`mswindows-unicode-to-multibyte' + (MS Windows only) Converts from Windows Unicode to Windows Multibyte + (any code page encoding) upon encoding, and the other way upon decoding. +`mswindows-multibyte' + Converts to or from Windows Multibyte (any code page encoding). + This is resolved into a chain of `mswindows-unicode' and + `mswindows-unicode-to-multibyte'. +`iso2022' + Any ISO2022-compliant encoding. Among other things, this includes + JIS (the Japanese encoding commonly used for e-mail), EUC (the + standard Unix encoding for Japanese and other languages), and + Compound Text (the encoding used in X11). You can specify more + specific information about the conversion with the PROPS argument. +`fixed-width' + A fixed-width eight bit encoding that is not necessarily compliant with + ISO 2022. This coding system assumes Unicode equivalency, that is, if + two given XEmacs characters have the same Unicode mapping, they will + always map to the same octet on disk. +`big5' + Big5 (the encoding commonly used for Mandarin Chinese in Taiwan). +`ccl' + The conversion is performed using a user-written pseudo-code + program. CCL (Code Conversion Language) is the name of this + pseudo-code. +`gzip' + GZIP compression format. +`internal' + Write out or read in the raw contents of the memory representing + the buffer's text. This is primarily useful for debugging + purposes, and is only enabled when XEmacs has been compiled with + DEBUG_XEMACS defined (via the --debug configure option). + WARNING: Reading in a file using `internal' conversion can result + in an internal inconsistency in the memory representing a + buffer's text, which will produce unpredictable results and may + cause XEmacs to crash. Under normal circumstances you should + never use `internal' conversion. + +DESCRIPTION is a short English phrase describing the coding system, +suitable for use as a menu item. (See also the `documentation' property +below.) + +PROPS is a property list, describing the specific nature of the +character set. Recognized properties are: + +`mnemonic' + String to be displayed in the modeline when this coding system is + active. + +`documentation' + Detailed documentation on the coding system. + +`aliases' + A list of aliases for the coding system. See + `define-coding-system-alias'. + +`eol-type' + End-of-line conversion to be used. It should be one of + + nil + Automatically detect the end-of-line type (LF, CRLF, + or CR). Also generate subsidiary coding systems named + `NAME-unix', `NAME-dos', and `NAME-mac', that are + identical to this coding system but have an EOL-TYPE + value of `lf', `crlf', and `cr', respectively. + `lf' + The end of a line is marked externally using ASCII LF. + Since this is also the way that XEmacs represents an + end-of-line internally, specifying this option results + in no end-of-line conversion. This is the standard + format for Unix text files. + `crlf' + The end of a line is marked externally using ASCII + CRLF. This is the standard format for MS-DOS text + files. + `cr' + The end of a line is marked externally using ASCII CR. + This is the standard format for Macintosh text files. + t + Automatically detect the end-of-line type but do not + generate subsidiary coding systems. (This value is + converted to nil when stored internally, and + `coding-system-property' will return nil.) + +`post-read-conversion' + The value is a function to call after some text is inserted and + decoded by the coding system itself and before any functions in + `after-change-functions' are called. (#### Not actually true in + XEmacs. `after-change-functions' will be called twice if + `post-read-conversion' changes something.) The argument of this + function is the same as for a function in + `after-insert-file-functions', i.e. LENGTH of the text inserted, + with point at the head of the text to be decoded. + +`pre-write-conversion' + The value is a function to call after all functions in + `write-region-annotate-functions' and `buffer-file-format' are + called, and before the text is encoded by the coding system itself. + The arguments to this function are the same as those of a function + in `write-region-annotate-functions', i.e. FROM and TO, specifying + a region of text. + +The following properties are used by `default-query-coding-region', +the default implementation of `query-coding-region'. This +implementation and these properties are not used by the Unicode coding +systems, nor by fixed-width coding systems. + +`safe-chars' + The value is a char table. If a character has non-nil value in it, + the character is safely supported by the coding system. + This overrides the `safe-charsets' property. + +`safe-charsets' + The value is a list of charsets safely supported by the coding + system. For coding systems based on ISO 2022, XEmacs may try to + encode characters outside these character sets, but outside of + East Asia and East Asian coding systems, it is unlikely that + consumers of the data will understand XEmacs' encoding. + The value t means that all XEmacs character sets handles are supported. + +The following properties are allowed for FSF compatibility but currently +ignored: + +`translation-table-for-decode' + The value is a translation table to be applied on decoding. See + the function `make-translation-table' for the format of translation + table. This is not applicable to CCL-based coding systems. + +`translation-table-for-encode' + The value is a translation table to be applied on encoding. This is + not applicable to CCL-based coding systems. + +`mime-charset' + The value is a symbol of which name is `MIME-charset' parameter of + the coding system. + +`valid-codes' (meaningful only for a coding system based on CCL) + The value is a list to indicate valid byte ranges of the encoded + file. Each element of the list is an integer or a cons of integer. + In the former case, the integer value is a valid byte code. In the + latter case, the integers specifies the range of valid byte codes. + +The following additional property is recognized if TYPE is `convert-eol': + +`subtype' + One of `lf', `crlf', `cr' or nil (for autodetection). When decoding, + the corresponding sequence will be converted to LF. When encoding, + the opposite happens. This coding system converts characters to + characters. + + + +The following additional properties are recognized if TYPE is `iso2022': + +`charset-g0' +`charset-g1' +`charset-g2' +`charset-g3' + The character set initially designated to the G0 - G3 registers. + The value should be one of + + -- A charset object (designate that character set) + -- nil (do not ever use this register) + -- t (no character set is initially designated to + the register, but may be later on; this automatically + sets the corresponding `force-g*-on-output' property) + +`force-g0-on-output' +`force-g1-on-output' +`force-g2-on-output' +`force-g2-on-output' + If non-nil, send an explicit designation sequence on output before + using the specified register. + +`short' + If non-nil, use the short forms \"ESC $ @\", \"ESC $ A\", and + \"ESC $ B\" on output in place of the full designation sequences + \"ESC $ ( @\", \"ESC $ ( A\", and \"ESC $ ( B\". + +`no-ascii-eol' + If non-nil, don't designate ASCII to G0 at each end of line on output. + Setting this to non-nil also suppresses other state-resetting that + normally happens at the end of a line. + +`no-ascii-cntl' + If non-nil, don't designate ASCII to G0 before control chars on output. + +`seven' + If non-nil, use 7-bit environment on output. Otherwise, use 8-bit + environment. + +`lock-shift' + If non-nil, use locking-shift (SO/SI) instead of single-shift + or designation by escape sequence. + +`no-iso6429' + If non-nil, don't use ISO6429's direction specification. + +`escape-quoted' + If non-nil, literal control characters that are the same as + the beginning of a recognized ISO2022 or ISO6429 escape sequence + (in particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), + SS3 (0x8F), and CSI (0x9B)) are \"quoted\" with an escape character + so that they can be properly distinguished from an escape sequence. + (Note that doing this results in a non-portable encoding.) This + encoding flag is used for byte-compiled files. Note that ESC + is a good choice for a quoting character because there are no + escape sequences whose second byte is a character from the Control-0 + or Control-1 character sets; this is explicitly disallowed by the + ISO2022 standard. + +`input-charset-conversion' + A list of conversion specifications, specifying conversion of + characters in one charset to another when decoding is performed. + Each specification is a list of two elements: the source charset, + and the destination charset. + +`output-charset-conversion' + A list of conversion specifications, specifying conversion of + characters in one charset to another when encoding is performed. + The form of each specification is the same as for + `input-charset-conversion'. + +The following additional properties are recognized if TYPE is +`fixed-width': + +`unicode-map' + Required. A plist describing a map from octets in the coding system + NAME (as integers) to XEmacs characters. Those XEmacs characters will + be used explicitly on decoding, but for encoding (most relevantly, on + writing to disk) XEmacs characters that map to the same Unicode code + point will be unified. This means that the ISO-8859-? characters that + map to the same Unicode code point will not be distinct when written to + disk, which is normally what is intended; it also means that East Asian + Han characters from different XEmacs character sets will not be + distinct when written to disk, which is less often what is intended. + + Any octets not mapped, and with values above #x7f, will be decoded into + XEmacs characters that reflect that their values are undefined. These + characters will be displayed in a language-environment-specific + way. See `unicode-error-default-translation-table' and the + `invalid-sequence-coding-system' argument to `set-language-info'. + + These characters will normally be treated as invalid when checking + whether text can be encoded with `query-coding-region'--see the + IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It + is possible to specify that octets with values less than #x80 (or + indeed greater than it) be treated in this way, by specifying + explicitly that they correspond to the character mapping to that octet + in `unicode-error-default-translation-table'. Far fewer coding systems + override the ASCII mapping, though, so this is not the default. + +`encode-failure-octet' + An integer between 0 and 255 to write in place of XEmacs characters + that cannot be encoded, defaulting to the code for tilde `~'. + +The following additional properties are recognized (and required) +if TYPE is `ccl': + +`decode' + CCL program used for decoding (converting to internal format). + +`encode' + CCL program used for encoding (converting to external format). + + +The following additional properties are recognized if TYPE is `chain': + +`chain' + List of coding systems to be chained together, in decoding order. + +`canonicalize-after-coding' + Coding system to be returned by the detector routines in place of + this coding system. + + + +The following additional properties are recognized if TYPE is `unicode': + +`unicode-type' + One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not + yet implemented). `utf-16' is the basic two-byte encoding; + `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible + variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using + only characters that will safely pass through all mail gateways. + [[ This should be \"transformation format\". There should also be + `ucs-2' (or `bmp' -- no surrogates) and `utf-32' (range checked). ]] + +`little-endian' + If non-nil, `utf-16' and `ucs-4' will write out the groups of two + or four bytes little-endian instead of big-endian. This is required, + for example, under Windows. + +`need-bom' + If non-nil, a byte order mark (BOM, or Unicode FFFE) should be + written out at the beginning of the data. This serves both to + identify the endianness of the following data and to mark the + data as Unicode (at least, this is how Windows uses it). + [[ The correct term is \"signature\", since this technique may also + be used with UTF-8. That is the term used in the standard. ]] + + +The following additional properties are recognized if TYPE is +`mswindows-multibyte': + +`code-page' + Either a number (specifying a particular code page) or one of the + symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI, + OEM, Macintosh, or EBCDIC code page associated with a particular + locale (given by the `locale' property). NOTE: EBCDIC code pages + only exist in Windows 2000 and later. + +`locale' + If `code-page' is a symbol, this specifies the locale whose code + page of the corresponding type should be used. This should be + one of the following: A cons of two strings, (LANGUAGE + . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a + language; SUBLANG_DEFAULT, i.e. the default sublanguage, is + used); or one of the symbols `current', `user-default', or + `system-default', corresponding to the values of + `mswindows-current-locale', `mswindows-user-default-locale', or + `mswindows-system-default-locale', respectively. + + +The following additional properties are recognized if TYPE is `undecided': +\[[ Doesn't GNU use \"detect-*\" for the following two? ]] + +`do-eol' + Do EOL detection. + +`do-coding' + Do encoding detection. + +`coding-system' + If encoding detection is not done, use the specified coding system + to do decoding. This is used internally when implementing coding + systems with an EOL type that specifies autodetection (the default), + so that the detector routines return the proper subsidiary. + + + +The following additional property is recognized if TYPE is `gzip': + +`level' + Compression level: 0 through 9, or `default' (currently 6)." + (when (eq 'fixed-width type) + (setq props (fixed-width-rework-props-runtime name props))) + (make-coding-system-internal name type description props)) + +(define-compiler-macro make-coding-system (&whole form name type + &optional description props) + (if (equal '(quote fixed-width) type) + (if (memq (car-safe props) '(quote eval-when-compile)) + (let* ((props (if (eq 'eval-when-compile (car props)) + (eval (cadr props)) + (cadr props))) + (encode-failure-octet + (or (plist-get props 'encode-failure-octet) + (char-to-int ?~))) + (unicode-map (plist-get props 'unicode-map)) + (default-plist-entry (gensym)) + (encode-table-sym (gensym + (if (eq 'quote (car name)) + (format "%s-enc-" (second name))))) + encode-program decode-program + decode-table encode-table + skip-chars invalid-sequences-skip-chars category) + + (check-argument-range encode-failure-octet 0 #xFF) + ;; unicode-map must be a true list, and must be non-nil. + (check-argument-type #'true-list-p unicode-map) + (check-argument-type #'consp unicode-map) + + ;; Don't pass on our extra data to make-coding-system-internal. + (setq props (plist-remprop props 'encode-failure-octet) + props (plist-remprop props 'unicode-map)) + + (multiple-value-setq + (decode-table encode-table) + (fixed-width-create-decode-encode-tables unicode-map)) + + ;; Generate the decode and encode programs, and the skip-chars + ;; arguments. + (setq decode-program + (fixed-width-generate-decode-program decode-table) + category (fixed-width-choose-category decode-table)) + + (multiple-value-setq + (encode-program skip-chars invalid-sequences-skip-chars) + (fixed-width-generate-encode-program-and-skip-chars-strings + decode-table encode-table encode-failure-octet)) + + (unless (vectorp decode-program) + (setq decode-program + (apply #'vector decode-program))) + + (unless (eq default-plist-entry (plist-get props 'encode + default-plist-entry)) + (error + 'invalid-argument + "Explicit property not allowed for fixed-width coding system" + 'encode)) + (loop for (symbol . value) + in `((decode . ,decode-program) + (from-unicode . ,encode-table) + (query-skip-chars . ,skip-chars) + (invalid-sequences-skip-chars . + ,invalid-sequences-skip-chars) + (category . ,category)) + do + (unless (eq default-plist-entry (plist-get props symbol + default-plist-entry)) + (error + 'invalid-argument + "Explicit property not allowed for \ +fixed-width coding systems" + symbol)) + (setq props (nconc (list symbol value) props))) + `(progn + (define-translation-hash-table ',encode-table-sym ,encode-table) + (make-coding-system-internal + ,name ,type ,description + ',(nconc (list 'encode + (apply #'vector + (nsublis + (list (cons 'encode-table-sym + encode-table-sym)) + encode-program))) + props)))) + ;; The form does not use literals; call make-coding-system at + ;; run time. + form) + (if (byte-compile-constp type) + ;; This is not a fixed-width call; compile it to a form that 21.4 + ;; can also understand. + `(funcall (or (and (fboundp 'make-coding-system-internal) + 'make-coding-system-internal) + 'make-coding-system) + ,@(cdr form)) + ;; TYPE is not literal; work things out at runtime. + form))) +