comparison lisp/mule/make-coding-system.el @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 19 Sep 2009 22:53:13 +0100
parents
children dc3c2f298857
comparison
equal deleted inserted replaced
4689:0636c6ccb430 4690:257b468bf2ca
1 ;;; make-coding-system.el; Provides the #'make-coding-system function and
2 ;;; much of the implementation of the fixed-width coding system type.
3
4 ;; Copyright (C) 2009 Free Software Foundation
5
6 ;; Author: Aidan Kehoe
7
8 ;; This file is part of XEmacs.
9
10 ;; XEmacs is free software; you can redistribute it and/or modify it
11 ;; under the terms of the GNU General Public License as published by
12 ;; the Free Software Foundation; either version 2, or (at your option)
13 ;; any later version.
14
15 ;; XEmacs is distributed in the hope that it will be useful, but
16 ;; WITHOUT ANY WARRANTY; without even the implied warranty of
17 ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
18 ;; General Public License for more details.
19
20 ;; You should have received a copy of the GNU General Public License
21 ;; along with XEmacs; see the file COPYING. If not, write to the
22 ;; Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
23 ;; Boston, MA 02110-1301, USA.
24
25 ;;; Commentary:
26
27 ;;; Code:
28
29 (defvar fixed-width-private-use-start (decode-char 'ucs #xE000)
30 "Start of a 256 code private use area for fixed-width coding systems.
31
32 This is used to ensure that distinct octets on disk for a given coding
33 system map to distinct XEmacs characters, preventing spurious changes when
34 a file is read, not changed, and then written. ")
35
36 (defun fixed-width-generate-helper (decode-table encode-table
37 encode-failure-octet)
38 "Helper func, `fixed-width-generate-encode-program-and-skip-chars-strings',
39 which see.
40
41 Deals with the case where ASCII and another character set can both be
42 encoded unambiguously and completely into the coding-system; if this is so,
43 returns multiple values comprisig of such a ccl-program and the character
44 set in question. If not, it returns nil."
45 (let ((tentative-encode-program-parts
46 (eval-when-compile
47 (let* ((vec-len 128)
48 (compiled
49 (append
50 (ccl-compile
51 `(1
52 (loop
53 (read-multibyte-character r0 r1)
54 (if (r0 == ,(charset-id 'ascii))
55 (write r1)
56 ((if (r0 == #xABAB)
57 ;; #xBFFE is a sentinel in the compiled
58 ;; program.
59 ((r0 = r1 & #x7F)
60 (write r0 ,(make-vector vec-len #xBFFE)))
61 ((mule-to-unicode r0 r1)
62 (if (r0 == #xFFFD)
63 (write #xBEEF)
64 ((lookup-integer encode-table-sym r0 r3)
65 (if r7
66 (write-multibyte-character r0 r3)
67 (write #xBEEF))))))))
68 (repeat)))) nil))
69 (first-part compiled)
70 (last-part
71 (member-if-not (lambda (entr) (eq #xBFFE entr))
72 (member-if
73 (lambda (entr) (eq #xBFFE entr))
74 first-part))))
75 (while compiled
76 (when (eq #xBFFE (cadr compiled))
77 (assert (= vec-len (search '(#xBFFE) (cdr compiled)
78 :test #'/=)) nil
79 "Strange ccl vector length")
80 (setcdr compiled nil))
81 (setq compiled (cdr compiled)))
82 ;; Is the generated code as we expect it to be?
83 (assert (and (memq #xABAB first-part)
84 (memq #xBEEF14 last-part))
85 nil
86 "This code assumes that the constant #xBEEF is #xBEEF14 in \
87 compiled CCL code,\nand that the constant #xABAB is #xABAB. If that is
88 not the case, and it appears not to be--that's why you're getting this
89 message--it will not work. ")
90 (list first-part last-part vec-len))))
91 (charset-lower -1)
92 (charset-upper -1)
93 worth-trying known-charsets encode-program
94 other-charset-vector ucs)
95
96 (loop for char across decode-table
97 do (pushnew (char-charset char) known-charsets))
98 (setq known-charsets (delq 'ascii known-charsets))
99
100 (loop for known-charset in known-charsets
101 do
102 ;; This is not possible for two dimensional charsets.
103 (when (eq 1 (charset-dimension known-charset))
104 (if (eq 'control-1 known-charset)
105 (setq charset-lower 0
106 charset-upper 31)
107 ;; There should be a nicer way to get the limits here.
108 (condition-case args-out-of-range
109 (make-char known-charset #x100)
110 (args-out-of-range
111 (setq charset-lower (third args-out-of-range)
112 charset-upper (fourth args-out-of-range)))))
113 (loop
114 for i from charset-lower to charset-upper
115 always (and (setq ucs
116 (encode-char (make-char known-charset i) 'ucs))
117 (gethash ucs encode-table))
118 finally (setq worth-trying known-charset))
119
120 ;; Only trying this for one charset at a time, the first find.
121 (when worth-trying (return))
122
123 ;; Okay, this charset is not worth trying, Try the next.
124 (setq charset-lower -1
125 charset-upper -1
126 worth-trying nil)))
127
128 (when worth-trying
129 (setq other-charset-vector
130 (make-vector (third tentative-encode-program-parts)
131 encode-failure-octet))
132 (loop for i from charset-lower to charset-upper
133 do (aset other-charset-vector i
134 (gethash (encode-char (make-char worth-trying i)
135 'ucs) encode-table)))
136 (setq encode-program
137 (nsublis
138 (list (cons #xABAB (charset-id worth-trying)))
139 (nconc
140 (copy-list (first
141 tentative-encode-program-parts))
142 (append other-charset-vector nil)
143 (copy-tree (second
144 tentative-encode-program-parts))))))
145 (and encode-program (values encode-program worth-trying))))
146
147 (defun fixed-width-generate-encode-program-and-skip-chars-strings
148 (decode-table encode-table encode-failure-octet)
149 "Generate a CCL program to encode a 8-bit fixed-width charset.
150
151 DECODE-TABLE must have 256 non-cons entries, and will be regarded as
152 describing a map from the octet corresponding to an offset in the
153 table to the that entry in the table. ENCODE-TABLE is a hash table
154 map from unicode values to characters in the range [0,255].
155 ENCODE-FAILURE-OCTET describes an integer between 0 and 255
156 \(inclusive) to write in the event that a character cannot be encoded."
157 (check-argument-type #'vectorp decode-table)
158 (check-argument-range (length decode-table) #x100 #x100)
159 (check-argument-type #'hash-table-p encode-table)
160 (check-argument-type #'integerp encode-failure-octet)
161 (check-argument-range encode-failure-octet #x00 #xFF)
162 (let ((encode-program nil)
163 (general-encode-program
164 (eval-when-compile
165 (let ((prog (append
166 (ccl-compile
167 `(1
168 (loop
169 (read-multibyte-character r0 r1)
170 (mule-to-unicode r0 r1)
171 (if (r0 == #xFFFD)
172 (write #xBEEF)
173 ((lookup-integer encode-table-sym r0 r3)
174 (if r7
175 (write-multibyte-character r0 r3)
176 (write #xBEEF))))
177 (repeat)))) nil)))
178 (assert (memq #xBEEF14 prog)
179 nil
180 "This code assumes that the constant #xBEEF is #xBEEF14 \
181 in compiled CCL code.\nIf that is not the case, and it appears not to
182 be--that's why you're getting this message--it will not work. ")
183 prog)))
184 (encode-program-with-ascii-optimisation
185 (eval-when-compile
186 (let ((prog (append
187 (ccl-compile
188 `(1
189 (loop
190 (read-multibyte-character r0 r1)
191 (if (r0 == ,(charset-id 'ascii))
192 (write r1)
193 ((mule-to-unicode r0 r1)
194 (if (r0 == #xFFFD)
195 (write #xBEEF)
196 ((lookup-integer encode-table-sym r0 r3)
197 (if r7
198 (write-multibyte-character r0 r3)
199 (write #xBEEF))))))
200 (repeat)))) nil)))
201 (assert (memq #xBEEF14 prog)
202 nil
203 "This code assumes that the constant #xBEEF is #xBEEF14 \
204 in compiled CCL code.\nIf that is not the case, and it appears not to
205 be--that's why you're getting this message--it will not work. ")
206 prog)))
207 (ascii-encodes-as-itself nil)
208 (control-1-encodes-as-itself t)
209 (invalid-sequence-code-point-start
210 (eval-when-compile
211 (char-to-unicode
212 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
213 further-char-set skip-chars invalid-sequences-skip-chars)
214
215 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash
216 ;; table lookup for those characters.
217 (loop
218 for i from #x00 to #x7f
219 always (eq (int-to-char i) (gethash i encode-table))
220 finally (setq ascii-encodes-as-itself t))
221
222 ;; Note that this logic handles EBCDIC badly. For example, CP037,
223 ;; MIME name ebcdic-na, has the entire repertoire of ASCII and
224 ;; Latin 1, and thus a more optimal ccl encode program would check
225 ;; for those character sets and use tables. But for now, we do a
226 ;; hash table lookup for every character.
227 (if (null ascii-encodes-as-itself)
228 ;; General encode program. Pros; general and correct. Cons;
229 ;; slow, a hash table lookup + mule-unicode conversion is done
230 ;; for every character encoding.
231 (setq encode-program general-encode-program)
232 (multiple-value-setq
233 (encode-program further-char-set)
234 ;; Encode program with ascii-ascii mapping (based on a
235 ;; character's mule character set), and one other mule
236 ;; character set using table-based encoding, other
237 ;; character sets using hash table lookups.
238 ;; fixed-width-non-ascii-completely-coveredp only returns
239 ;; such a mapping if some non-ASCII charset with
240 ;; characters in decode-table is entirely covered by
241 ;; encode-table.
242 (fixed-width-generate-helper decode-table encode-table
243 encode-failure-octet))
244 (unless encode-program
245 ;; If fixed-width-non-ascii-completely-coveredp returned nil,
246 ;; but ASCII still encodes as itself, do one-to-one mapping
247 ;; for ASCII, and a hash table lookup for everything else.
248 (setq encode-program encode-program-with-ascii-optimisation)))
249
250 (setq encode-program
251 (nsublis
252 (list (cons #xBEEF14
253 (logior (lsh encode-failure-octet 8)
254 #x14)))
255 (copy-tree encode-program)))
256 (loop
257 for i from #x80 to #x9f
258 do (unless (= i (aref decode-table i))
259 (setq control-1-encodes-as-itself nil)
260 (return)))
261 (loop
262 for i from #x00 to #xFF
263 initially (setq skip-chars
264 (cond
265 ((and ascii-encodes-as-itself
266 control-1-encodes-as-itself further-char-set)
267 (concat "\x00-\x9f" (charset-skip-chars-string
268 further-char-set)))
269 ((and ascii-encodes-as-itself
270 control-1-encodes-as-itself)
271 "\x00-\x9f")
272 ((null ascii-encodes-as-itself)
273 (skip-chars-quote (apply #'string
274 (append decode-table nil))))
275 (further-char-set
276 (concat (charset-skip-chars-string 'ascii)
277 (charset-skip-chars-string further-char-set)))
278 (t
279 (charset-skip-chars-string 'ascii)))
280 invalid-sequences-skip-chars "")
281 with decoded-ucs = nil
282 with decoded = nil
283 with no-ascii-transparency-skip-chars-list =
284 (unless ascii-encodes-as-itself (append decode-table nil))
285 ;; Can't use #'match-string here, see:
286 ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net
287 with skip-chars-test =
288 #'(lambda (skip-chars-string testing)
289 (with-temp-buffer
290 (insert testing)
291 (goto-char (point-min))
292 (skip-chars-forward skip-chars-string)
293 (= (point) (point-max))))
294 do
295 (setq decoded (aref decode-table i)
296 decoded-ucs (char-to-unicode decoded))
297 (cond
298 ((<= invalid-sequence-code-point-start decoded-ucs
299 (+ invalid-sequence-code-point-start #xFF))
300 (setq invalid-sequences-skip-chars
301 (concat (string decoded)
302 invalid-sequences-skip-chars))
303 (assert (not (funcall skip-chars-test skip-chars decoded))
304 "This char should only be skipped with \
305 `invalid-sequences-skip-chars', not by `skip-chars'"))
306 ((not (funcall skip-chars-test skip-chars decoded))
307 (if ascii-encodes-as-itself
308 (setq skip-chars (concat skip-chars (string decoded)))
309 (push decoded no-ascii-transparency-skip-chars-list))))
310 finally (unless ascii-encodes-as-itself
311 (setq skip-chars
312 (skip-chars-quote
313 (apply #'string
314 no-ascii-transparency-skip-chars-list)))))
315 (values encode-program skip-chars invalid-sequences-skip-chars)))
316
317 (defun fixed-width-create-decode-encode-tables (unicode-map)
318 "Return multiple values \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP.
319 UNICODE-MAP should be an alist mapping from integer octet values to
320 characters with UCS code points; DECODE-TABLE will be a 256-element
321 vector, and ENCODE-TABLE will be a hash table mapping from 256 numbers
322 to 256 distinct characters."
323 (check-argument-type #'listp unicode-map)
324 (let ((decode-table (make-vector 256 nil))
325 (encode-table (make-hash-table :size 256))
326 (private-use-start (encode-char fixed-width-private-use-start 'ucs))
327 (invalid-sequence-code-point-start
328 (eval-when-compile
329 (char-to-unicode
330 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
331 desired-ucs decode-table-entry)
332
333 (loop for (external internal)
334 in unicode-map
335 do
336 (aset decode-table external internal)
337 (assert (not (eq (encode-char internal 'ucs) -1))
338 nil
339 "Looks like you're creating a fixed-width coding system \
340 in a dumped file, \nand you're either not providing a literal unicode map
341 or PROPS. Don't do that; fixed-width coding systems rely on sensible
342 Unicode mappings being available, which they are at compile time for
343 dumped files (but this requires the mentioned literals), but not, for
344 most of them, at run time. ")
345
346 (puthash (encode-char internal 'ucs)
347 ;; This is semantically an integer, but Dave Love's design
348 ;; for lookup-integer in CCL means we need to store it as a
349 ;; character.
350 (int-to-char external)
351 encode-table))
352
353 ;; Now, go through the decode table. For octet values above #x7f, if the
354 ;; decode table entry is nil, this means that they have an undefined
355 ;; mapping (= they map to XEmacs characters with keys in
356 ;; unicode-error-default-translation-table); for octet values below or
357 ;; equal to #x7f, it means that they map to ASCII.
358
359 ;; If any entry (whether below or above #x7f) in the decode-table
360 ;; already maps to some character with a key in
361 ;; unicode-error-default-translation-table, it is treated as an
362 ;; undefined octet by `query-coding-region'. That is, it is not
363 ;; necessary for an octet value to be above #x7f for this to happen.
364
365 (dotimes (i 256)
366 (setq decode-table-entry (aref decode-table i))
367 (if decode-table-entry
368 (when (get-char-table
369 decode-table-entry
370 unicode-error-default-translation-table)
371 ;; The caller is explicitly specifying that this octet
372 ;; corresponds to an invalid sequence on disk:
373 (assert (= (get-char-table
374 decode-table-entry
375 unicode-error-default-translation-table) i)
376 "Bad argument for a fixed-width coding system.
377 If you're going to designate an octet with value below #x80 as invalid
378 for this coding system, make sure to map it to the invalid sequence
379 character corresponding to its octet value on disk. "))
380
381 ;; decode-table-entry is nil; either the octet is to be treated as
382 ;; contributing to an error sequence (when (> #x7f i)), or it should
383 ;; be attempted to treat it as ASCII-equivalent.
384 (setq desired-ucs (or (and (< i #x80) i)
385 (+ invalid-sequence-code-point-start i)))
386 (while (gethash desired-ucs encode-table)
387 (assert (not (< i #x80))
388 "UCS code point should not already be in encode-table!"
389 ;; There is one invalid sequence char per octet value;
390 ;; with fixed-width coding systems, it makes no sense
391 ;; for us to be multiply allocating them.
392 (gethash desired-ucs encode-table))
393 (setq desired-ucs (+ private-use-start desired-ucs)
394 private-use-start (+ private-use-start 1)))
395 (puthash desired-ucs (int-to-char i) encode-table)
396 (setq desired-ucs (if (> desired-ucs #xFF)
397 (unicode-to-char desired-ucs)
398 ;; So we get Latin-1 when run at dump time,
399 ;; instead of JIT-allocated characters.
400 (int-to-char desired-ucs)))
401 (aset decode-table i desired-ucs)))
402 (values decode-table encode-table)))
403
404 (defun fixed-width-generate-decode-program (decode-table)
405 "Given DECODE-TABLE, generate a CCL program to decode an 8-bit charset.
406 DECODE-TABLE must have 256 non-cons entries, and will be regarded as
407 describing a map from the octet corresponding to an offset in the
408 table to the that entry in the table. "
409 (check-argument-type #'vectorp decode-table)
410 (check-argument-range (length decode-table) #x100 #x100)
411 (let ((decode-program-parts
412 (eval-when-compile
413 (let* ((compiled
414 (append
415 (ccl-compile
416 `(3
417 ((read r0)
418 (loop
419 (write-read-repeat r0 ,(make-vector
420 256 'sentinel)))))) nil))
421 (first-part compiled)
422 (last-part
423 (member-if-not #'symbolp
424 (member-if-not #'integerp first-part))))
425 ;; Chop off the sentinel sentinel sentinel [..] part.
426 (while compiled
427 (if (symbolp (cadr compiled))
428 (setcdr compiled nil))
429 (setq compiled (cdr compiled)))
430 (list first-part last-part)))))
431 (nconc
432 ;; copy-list needed, because the structure of the literal provided
433 ;; by our eval-when-compile hangs around.
434 (copy-list (first decode-program-parts))
435 (append decode-table nil)
436 (second decode-program-parts))))
437
438 (defun fixed-width-choose-category (decode-table)
439 "Given DECODE-TABLE, return an appropriate coding category.
440 DECODE-TABLE is a 256-entry vector describing the mapping from octets on
441 disk to XEmacs characters for some fixed-width 8-bit coding system."
442 (check-argument-type #'vectorp decode-table)
443 (check-argument-range (length decode-table) #x100 #x100)
444 (loop
445 named category
446 for i from #x80 to #x9F
447 do (unless (= i (aref decode-table i))
448 (return-from category 'no-conversion))
449 finally return 'iso-8-1))
450
451 (defun fixed-width-rework-props-runtime (name props)
452 "Rework PROPS to a form understood by `make-coding-system-internal'.
453
454 NAME must be a symbol, describing a fixed-width coding system that is
455 about to be created. Much of the implementation of the fixed-width
456 coding system is in Lisp, and this function allows us to rework the
457 arguments that `make-coding-system-internal' sees accordingly.
458
459 If you are calling this function from anywhere but
460 `make-coding-system', you're probably doing something wrong."
461 (check-argument-type #'symbolp name)
462 (check-valid-plist props)
463 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet)
464 (char-to-int ?~)))
465 (unicode-map (plist-get props 'unicode-map))
466 (hash-table-sym (gensym (format "%s-encode-table" name)))
467 encode-program decode-program decode-table encode-table skip-chars
468 invalid-sequences-skip-chars category)
469
470 (check-argument-range encode-failure-octet 0 #xFF)
471 ;; unicode-map must be a true list, and must be non-nil.
472 (check-argument-type #'true-list-p unicode-map)
473 (check-argument-type #'consp unicode-map)
474
475 ;; Don't pass on our extra data to make-coding-system-internal.
476 (setq props (plist-remprop props 'encode-failure-octet)
477 props (plist-remprop props 'unicode-map))
478
479 (multiple-value-setq
480 (decode-table encode-table)
481 (fixed-width-create-decode-encode-tables unicode-map))
482
483 ;; Register the decode-table.
484 (define-translation-hash-table hash-table-sym encode-table)
485
486 ;; Generate the programs and skip-chars strings.
487 (setq decode-program (fixed-width-generate-decode-program decode-table))
488 (multiple-value-setq
489 (encode-program skip-chars invalid-sequences-skip-chars)
490 (fixed-width-generate-encode-program-and-skip-chars-strings
491 decode-table encode-table encode-failure-octet))
492
493 (setq category (fixed-width-choose-category decode-table))
494
495 (unless (vectorp encode-program)
496 (setq encode-program
497 (apply #'vector
498 (nsublis (list (cons 'encode-table-sym hash-table-sym))
499 (copy-tree encode-program)))))
500 (unless (vectorp decode-program)
501 (setq decode-program
502 (apply #'vector decode-program)))
503
504 (loop for (symbol . value)
505 in `((decode . ,decode-program)
506 (encode . ,encode-program)
507 (from-unicode . ,encode-table)
508 (query-skip-chars . ,skip-chars)
509 (invalid-sequences-skip-chars . ,invalid-sequences-skip-chars)
510 (category . ,category))
511 with default = (gensym)
512 do
513 (unless (eq default (plist-get props symbol default))
514 (error
515 'invalid-argument
516 "Explicit property not allowed for fixed-width coding systems"
517 symbol))
518 (setq props (nconc (list symbol value) props)))
519 props))
520
521 ;;;###autoload
522 (defun make-coding-system (name type description props)
523 "Register symbol NAME as a coding system.
524
525 TYPE describes the conversion method used and should be one of
526
527 nil or `undecided'
528 Automatic conversion. XEmacs attempts to detect the coding system
529 used in the file.
530 `chain'
531 Chain two or more coding systems together to make a combination coding
532 system.
533 `no-conversion'
534 No conversion. Use this for binary files and such. On output,
535 graphic characters that are not in ASCII or Latin-1 will be
536 replaced by a ?. (For a no-conversion-encoded buffer, these
537 characters will only be present if you explicitly insert them.)
538 `convert-eol'
539 Convert CRLF sequences or CR to LF.
540 `shift-jis'
541 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
542 `unicode'
543 Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.).
544 `mswindows-unicode-to-multibyte'
545 (MS Windows only) Converts from Windows Unicode to Windows Multibyte
546 (any code page encoding) upon encoding, and the other way upon decoding.
547 `mswindows-multibyte'
548 Converts to or from Windows Multibyte (any code page encoding).
549 This is resolved into a chain of `mswindows-unicode' and
550 `mswindows-unicode-to-multibyte'.
551 `iso2022'
552 Any ISO2022-compliant encoding. Among other things, this includes
553 JIS (the Japanese encoding commonly used for e-mail), EUC (the
554 standard Unix encoding for Japanese and other languages), and
555 Compound Text (the encoding used in X11). You can specify more
556 specific information about the conversion with the PROPS argument.
557 `fixed-width'
558 A fixed-width eight bit encoding that is not necessarily compliant with
559 ISO 2022. This coding system assumes Unicode equivalency, that is, if
560 two given XEmacs characters have the same Unicode mapping, they will
561 always map to the same octet on disk.
562 `big5'
563 Big5 (the encoding commonly used for Mandarin Chinese in Taiwan).
564 `ccl'
565 The conversion is performed using a user-written pseudo-code
566 program. CCL (Code Conversion Language) is the name of this
567 pseudo-code.
568 `gzip'
569 GZIP compression format.
570 `internal'
571 Write out or read in the raw contents of the memory representing
572 the buffer's text. This is primarily useful for debugging
573 purposes, and is only enabled when XEmacs has been compiled with
574 DEBUG_XEMACS defined (via the --debug configure option).
575 WARNING: Reading in a file using `internal' conversion can result
576 in an internal inconsistency in the memory representing a
577 buffer's text, which will produce unpredictable results and may
578 cause XEmacs to crash. Under normal circumstances you should
579 never use `internal' conversion.
580
581 DESCRIPTION is a short English phrase describing the coding system,
582 suitable for use as a menu item. (See also the `documentation' property
583 below.)
584
585 PROPS is a property list, describing the specific nature of the
586 character set. Recognized properties are:
587
588 `mnemonic'
589 String to be displayed in the modeline when this coding system is
590 active.
591
592 `documentation'
593 Detailed documentation on the coding system.
594
595 `aliases'
596 A list of aliases for the coding system. See
597 `define-coding-system-alias'.
598
599 `eol-type'
600 End-of-line conversion to be used. It should be one of
601
602 nil
603 Automatically detect the end-of-line type (LF, CRLF,
604 or CR). Also generate subsidiary coding systems named
605 `NAME-unix', `NAME-dos', and `NAME-mac', that are
606 identical to this coding system but have an EOL-TYPE
607 value of `lf', `crlf', and `cr', respectively.
608 `lf'
609 The end of a line is marked externally using ASCII LF.
610 Since this is also the way that XEmacs represents an
611 end-of-line internally, specifying this option results
612 in no end-of-line conversion. This is the standard
613 format for Unix text files.
614 `crlf'
615 The end of a line is marked externally using ASCII
616 CRLF. This is the standard format for MS-DOS text
617 files.
618 `cr'
619 The end of a line is marked externally using ASCII CR.
620 This is the standard format for Macintosh text files.
621 t
622 Automatically detect the end-of-line type but do not
623 generate subsidiary coding systems. (This value is
624 converted to nil when stored internally, and
625 `coding-system-property' will return nil.)
626
627 `post-read-conversion'
628 The value is a function to call after some text is inserted and
629 decoded by the coding system itself and before any functions in
630 `after-change-functions' are called. (#### Not actually true in
631 XEmacs. `after-change-functions' will be called twice if
632 `post-read-conversion' changes something.) The argument of this
633 function is the same as for a function in
634 `after-insert-file-functions', i.e. LENGTH of the text inserted,
635 with point at the head of the text to be decoded.
636
637 `pre-write-conversion'
638 The value is a function to call after all functions in
639 `write-region-annotate-functions' and `buffer-file-format' are
640 called, and before the text is encoded by the coding system itself.
641 The arguments to this function are the same as those of a function
642 in `write-region-annotate-functions', i.e. FROM and TO, specifying
643 a region of text.
644
645 The following properties are used by `default-query-coding-region',
646 the default implementation of `query-coding-region'. This
647 implementation and these properties are not used by the Unicode coding
648 systems, nor by fixed-width coding systems.
649
650 `safe-chars'
651 The value is a char table. If a character has non-nil value in it,
652 the character is safely supported by the coding system.
653 This overrides the `safe-charsets' property.
654
655 `safe-charsets'
656 The value is a list of charsets safely supported by the coding
657 system. For coding systems based on ISO 2022, XEmacs may try to
658 encode characters outside these character sets, but outside of
659 East Asia and East Asian coding systems, it is unlikely that
660 consumers of the data will understand XEmacs' encoding.
661 The value t means that all XEmacs character sets handles are supported.
662
663 The following properties are allowed for FSF compatibility but currently
664 ignored:
665
666 `translation-table-for-decode'
667 The value is a translation table to be applied on decoding. See
668 the function `make-translation-table' for the format of translation
669 table. This is not applicable to CCL-based coding systems.
670
671 `translation-table-for-encode'
672 The value is a translation table to be applied on encoding. This is
673 not applicable to CCL-based coding systems.
674
675 `mime-charset'
676 The value is a symbol of which name is `MIME-charset' parameter of
677 the coding system.
678
679 `valid-codes' (meaningful only for a coding system based on CCL)
680 The value is a list to indicate valid byte ranges of the encoded
681 file. Each element of the list is an integer or a cons of integer.
682 In the former case, the integer value is a valid byte code. In the
683 latter case, the integers specifies the range of valid byte codes.
684
685 The following additional property is recognized if TYPE is `convert-eol':
686
687 `subtype'
688 One of `lf', `crlf', `cr' or nil (for autodetection). When decoding,
689 the corresponding sequence will be converted to LF. When encoding,
690 the opposite happens. This coding system converts characters to
691 characters.
692
693
694
695 The following additional properties are recognized if TYPE is `iso2022':
696
697 `charset-g0'
698 `charset-g1'
699 `charset-g2'
700 `charset-g3'
701 The character set initially designated to the G0 - G3 registers.
702 The value should be one of
703
704 -- A charset object (designate that character set)
705 -- nil (do not ever use this register)
706 -- t (no character set is initially designated to
707 the register, but may be later on; this automatically
708 sets the corresponding `force-g*-on-output' property)
709
710 `force-g0-on-output'
711 `force-g1-on-output'
712 `force-g2-on-output'
713 `force-g2-on-output'
714 If non-nil, send an explicit designation sequence on output before
715 using the specified register.
716
717 `short'
718 If non-nil, use the short forms \"ESC $ @\", \"ESC $ A\", and
719 \"ESC $ B\" on output in place of the full designation sequences
720 \"ESC $ ( @\", \"ESC $ ( A\", and \"ESC $ ( B\".
721
722 `no-ascii-eol'
723 If non-nil, don't designate ASCII to G0 at each end of line on output.
724 Setting this to non-nil also suppresses other state-resetting that
725 normally happens at the end of a line.
726
727 `no-ascii-cntl'
728 If non-nil, don't designate ASCII to G0 before control chars on output.
729
730 `seven'
731 If non-nil, use 7-bit environment on output. Otherwise, use 8-bit
732 environment.
733
734 `lock-shift'
735 If non-nil, use locking-shift (SO/SI) instead of single-shift
736 or designation by escape sequence.
737
738 `no-iso6429'
739 If non-nil, don't use ISO6429's direction specification.
740
741 `escape-quoted'
742 If non-nil, literal control characters that are the same as
743 the beginning of a recognized ISO2022 or ISO6429 escape sequence
744 (in particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E),
745 SS3 (0x8F), and CSI (0x9B)) are \"quoted\" with an escape character
746 so that they can be properly distinguished from an escape sequence.
747 (Note that doing this results in a non-portable encoding.) This
748 encoding flag is used for byte-compiled files. Note that ESC
749 is a good choice for a quoting character because there are no
750 escape sequences whose second byte is a character from the Control-0
751 or Control-1 character sets; this is explicitly disallowed by the
752 ISO2022 standard.
753
754 `input-charset-conversion'
755 A list of conversion specifications, specifying conversion of
756 characters in one charset to another when decoding is performed.
757 Each specification is a list of two elements: the source charset,
758 and the destination charset.
759
760 `output-charset-conversion'
761 A list of conversion specifications, specifying conversion of
762 characters in one charset to another when encoding is performed.
763 The form of each specification is the same as for
764 `input-charset-conversion'.
765
766 The following additional properties are recognized if TYPE is
767 `fixed-width':
768
769 `unicode-map'
770 Required. A plist describing a map from octets in the coding system
771 NAME (as integers) to XEmacs characters. Those XEmacs characters will
772 be used explicitly on decoding, but for encoding (most relevantly, on
773 writing to disk) XEmacs characters that map to the same Unicode code
774 point will be unified. This means that the ISO-8859-? characters that
775 map to the same Unicode code point will not be distinct when written to
776 disk, which is normally what is intended; it also means that East Asian
777 Han characters from different XEmacs character sets will not be
778 distinct when written to disk, which is less often what is intended.
779
780 Any octets not mapped, and with values above #x7f, will be decoded into
781 XEmacs characters that reflect that their values are undefined. These
782 characters will be displayed in a language-environment-specific
783 way. See `unicode-error-default-translation-table' and the
784 `invalid-sequence-coding-system' argument to `set-language-info'.
785
786 These characters will normally be treated as invalid when checking
787 whether text can be encoded with `query-coding-region'--see the
788 IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It
789 is possible to specify that octets with values less than #x80 (or
790 indeed greater than it) be treated in this way, by specifying
791 explicitly that they correspond to the character mapping to that octet
792 in `unicode-error-default-translation-table'. Far fewer coding systems
793 override the ASCII mapping, though, so this is not the default.
794
795 `encode-failure-octet'
796 An integer between 0 and 255 to write in place of XEmacs characters
797 that cannot be encoded, defaulting to the code for tilde `~'.
798
799 The following additional properties are recognized (and required)
800 if TYPE is `ccl':
801
802 `decode'
803 CCL program used for decoding (converting to internal format).
804
805 `encode'
806 CCL program used for encoding (converting to external format).
807
808
809 The following additional properties are recognized if TYPE is `chain':
810
811 `chain'
812 List of coding systems to be chained together, in decoding order.
813
814 `canonicalize-after-coding'
815 Coding system to be returned by the detector routines in place of
816 this coding system.
817
818
819
820 The following additional properties are recognized if TYPE is `unicode':
821
822 `unicode-type'
823 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not
824 yet implemented). `utf-16' is the basic two-byte encoding;
825 `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible
826 variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using
827 only characters that will safely pass through all mail gateways.
828 [[ This should be \"transformation format\". There should also be
829 `ucs-2' (or `bmp' -- no surrogates) and `utf-32' (range checked). ]]
830
831 `little-endian'
832 If non-nil, `utf-16' and `ucs-4' will write out the groups of two
833 or four bytes little-endian instead of big-endian. This is required,
834 for example, under Windows.
835
836 `need-bom'
837 If non-nil, a byte order mark (BOM, or Unicode FFFE) should be
838 written out at the beginning of the data. This serves both to
839 identify the endianness of the following data and to mark the
840 data as Unicode (at least, this is how Windows uses it).
841 [[ The correct term is \"signature\", since this technique may also
842 be used with UTF-8. That is the term used in the standard. ]]
843
844
845 The following additional properties are recognized if TYPE is
846 `mswindows-multibyte':
847
848 `code-page'
849 Either a number (specifying a particular code page) or one of the
850 symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI,
851 OEM, Macintosh, or EBCDIC code page associated with a particular
852 locale (given by the `locale' property). NOTE: EBCDIC code pages
853 only exist in Windows 2000 and later.
854
855 `locale'
856 If `code-page' is a symbol, this specifies the locale whose code
857 page of the corresponding type should be used. This should be
858 one of the following: A cons of two strings, (LANGUAGE
859 . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a
860 language; SUBLANG_DEFAULT, i.e. the default sublanguage, is
861 used); or one of the symbols `current', `user-default', or
862 `system-default', corresponding to the values of
863 `mswindows-current-locale', `mswindows-user-default-locale', or
864 `mswindows-system-default-locale', respectively.
865
866
867 The following additional properties are recognized if TYPE is `undecided':
868 \[[ Doesn't GNU use \"detect-*\" for the following two? ]]
869
870 `do-eol'
871 Do EOL detection.
872
873 `do-coding'
874 Do encoding detection.
875
876 `coding-system'
877 If encoding detection is not done, use the specified coding system
878 to do decoding. This is used internally when implementing coding
879 systems with an EOL type that specifies autodetection (the default),
880 so that the detector routines return the proper subsidiary.
881
882
883
884 The following additional property is recognized if TYPE is `gzip':
885
886 `level'
887 Compression level: 0 through 9, or `default' (currently 6)."
888 (when (eq 'fixed-width type)
889 (setq props (fixed-width-rework-props-runtime name props)))
890 (make-coding-system-internal name type description props))
891
892 (define-compiler-macro make-coding-system (&whole form name type
893 &optional description props)
894 (if (equal '(quote fixed-width) type)
895 (if (memq (car-safe props) '(quote eval-when-compile))
896 (let* ((props (if (eq 'eval-when-compile (car props))
897 (eval (cadr props))
898 (cadr props)))
899 (encode-failure-octet
900 (or (plist-get props 'encode-failure-octet)
901 (char-to-int ?~)))
902 (unicode-map (plist-get props 'unicode-map))
903 (default-plist-entry (gensym))
904 (encode-table-sym (gensym
905 (if (eq 'quote (car name))
906 (format "%s-enc-" (second name)))))
907 encode-program decode-program
908 decode-table encode-table
909 skip-chars invalid-sequences-skip-chars category)
910
911 (check-argument-range encode-failure-octet 0 #xFF)
912 ;; unicode-map must be a true list, and must be non-nil.
913 (check-argument-type #'true-list-p unicode-map)
914 (check-argument-type #'consp unicode-map)
915
916 ;; Don't pass on our extra data to make-coding-system-internal.
917 (setq props (plist-remprop props 'encode-failure-octet)
918 props (plist-remprop props 'unicode-map))
919
920 (multiple-value-setq
921 (decode-table encode-table)
922 (fixed-width-create-decode-encode-tables unicode-map))
923
924 ;; Generate the decode and encode programs, and the skip-chars
925 ;; arguments.
926 (setq decode-program
927 (fixed-width-generate-decode-program decode-table)
928 category (fixed-width-choose-category decode-table))
929
930 (multiple-value-setq
931 (encode-program skip-chars invalid-sequences-skip-chars)
932 (fixed-width-generate-encode-program-and-skip-chars-strings
933 decode-table encode-table encode-failure-octet))
934
935 (unless (vectorp decode-program)
936 (setq decode-program
937 (apply #'vector decode-program)))
938
939 (unless (eq default-plist-entry (plist-get props 'encode
940 default-plist-entry))
941 (error
942 'invalid-argument
943 "Explicit property not allowed for fixed-width coding system"
944 'encode))
945 (loop for (symbol . value)
946 in `((decode . ,decode-program)
947 (from-unicode . ,encode-table)
948 (query-skip-chars . ,skip-chars)
949 (invalid-sequences-skip-chars .
950 ,invalid-sequences-skip-chars)
951 (category . ,category))
952 do
953 (unless (eq default-plist-entry (plist-get props symbol
954 default-plist-entry))
955 (error
956 'invalid-argument
957 "Explicit property not allowed for \
958 fixed-width coding systems"
959 symbol))
960 (setq props (nconc (list symbol value) props)))
961 `(progn
962 (define-translation-hash-table ',encode-table-sym ,encode-table)
963 (make-coding-system-internal
964 ,name ,type ,description
965 ',(nconc (list 'encode
966 (apply #'vector
967 (nsublis
968 (list (cons 'encode-table-sym
969 encode-table-sym))
970 encode-program)))
971 props))))
972 ;; The form does not use literals; call make-coding-system at
973 ;; run time.
974 form)
975 (if (byte-compile-constp type)
976 ;; This is not a fixed-width call; compile it to a form that 21.4
977 ;; can also understand.
978 `(funcall (or (and (fboundp 'make-coding-system-internal)
979 'make-coding-system-internal)
980 'make-coding-system)
981 ,@(cdr form))
982 ;; TYPE is not literal; work things out at runtime.
983 form)))
984