comparison lisp/mule/mule-coding.el @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 19 Sep 2009 22:53:13 +0100
parents c786c3fd0740
children 308d34e9f07d
comparison
equal deleted inserted replaced
4689:0636c6ccb430 4690:257b468bf2ca
25 ;; Boston, MA 02111-1307, USA. 25 ;; Boston, MA 02111-1307, USA.
26 26
27 ;;; Commentary: 27 ;;; Commentary:
28 28
29 ;;; split off of mule.el and mostly moved to coding.el 29 ;;; split off of mule.el and mostly moved to coding.el
30
31 ;; Needed for make-8-bit-coding-system.
32 (eval-when-compile (require 'ccl))
33 30
34 ;;; Code: 31 ;;; Code:
35 32
36 (defun coding-system-force-on-output (coding-system register) 33 (defun coding-system-force-on-output (coding-system register)
37 "Return the 'force-on-output property of CODING-SYSTEM for the specified REGISTER." 34 "Return the 'force-on-output property of CODING-SYSTEM for the specified REGISTER."
220 (setq done t)) 217 (setq done t))
221 (setq id (1+ id))))) 218 (setq id (1+ id)))))
222 (put symbol 'translation-hash-table-id id) 219 (put symbol 'translation-hash-table-id id)
223 id)) 220 id))
224 221
225 (defvar make-8-bit-private-use-start (decode-char 'ucs #xE000)
226 "Start of a 256 code private use area for make-8-bit-coding-system.
227
228 This is used to ensure that distinct octets on disk for a given coding
229 system map to distinct XEmacs characters, preventing a spurious changes when
230 a file is read, not changed, and then written. ")
231
232 (defun make-8-bit-generate-helper (decode-table encode-table
233 encode-failure-octet)
234 "Helper function, `make-8-bit-generate-encode-program-and-skip-chars-strings',
235 which see.
236
237 Deals with the case where ASCII and another character set can both be
238 encoded unambiguously and completely into the coding-system; if this is so,
239 returns a list comprised of such a ccl-program and the character set in
240 question. If not, it returns a list with both entries nil."
241 (let ((tentative-encode-program-parts
242 (eval-when-compile
243 (let* ((vec-len 128)
244 (compiled
245 (append
246 (ccl-compile
247 `(1
248 (loop
249 (read-multibyte-character r0 r1)
250 (if (r0 == ,(charset-id 'ascii))
251 (write r1)
252 ((if (r0 == #xABAB)
253 ;; #xBFFE is a sentinel in the compiled
254 ;; program.
255 ((r0 = r1 & #x7F)
256 (write r0 ,(make-vector vec-len #xBFFE)))
257 ((mule-to-unicode r0 r1)
258 (if (r0 == #xFFFD)
259 (write #xBEEF)
260 ((lookup-integer encode-table-sym r0 r3)
261 (if r7
262 (write-multibyte-character r0 r3)
263 (write #xBEEF))))))))
264 (repeat)))) nil))
265 (first-part compiled)
266 (last-part
267 (member-if-not (lambda (entr) (eq #xBFFE entr))
268 (member-if
269 (lambda (entr) (eq #xBFFE entr))
270 first-part))))
271 (while compiled
272 (when (eq #xBFFE (cadr compiled))
273 (assert (= vec-len (search '(#xBFFE) (cdr compiled)
274 :test #'/=)) nil
275 "Strange ccl vector length")
276 (setcdr compiled nil))
277 (setq compiled (cdr compiled)))
278 ;; Is the generated code as we expect it to be?
279 (assert (and (memq #xABAB first-part)
280 (memq #xBEEF14 last-part))
281 nil
282 "This code assumes that the constant #xBEEF is #xBEEF14 in \
283 compiled CCL code,\nand that the constant #xABAB is #xABAB. If that is
284 not the case, and it appears not to be--that's why you're getting this
285 message--it will not work. ")
286 (list first-part last-part vec-len))))
287 (charset-lower -1)
288 (charset-upper -1)
289 worth-trying known-charsets encode-program
290 other-charset-vector ucs)
291
292 (loop for char across decode-table
293 do (pushnew (char-charset char) known-charsets))
294 (setq known-charsets (delq 'ascii known-charsets))
295
296 (loop for known-charset in known-charsets
297 do
298 ;; This is not possible for two dimensional charsets.
299 (when (eq 1 (charset-dimension known-charset))
300 (if (eq 'control-1 known-charset)
301 (setq charset-lower 0
302 charset-upper 31)
303 ;; There should be a nicer way to get the limits here.
304 (condition-case args-out-of-range
305 (make-char known-charset #x100)
306 (args-out-of-range
307 (setq charset-lower (third args-out-of-range)
308 charset-upper (fourth args-out-of-range)))))
309 (loop
310 for i from charset-lower to charset-upper
311 always (and (setq ucs
312 (encode-char (make-char known-charset i) 'ucs))
313 (gethash ucs encode-table))
314 finally (setq worth-trying known-charset))
315
316 ;; Only trying this for one charset at a time, the first find.
317 (when worth-trying (return))
318
319 ;; Okay, this charset is not worth trying, Try the next.
320 (setq charset-lower -1
321 charset-upper -1
322 worth-trying nil)))
323
324 (when worth-trying
325 (setq other-charset-vector
326 (make-vector (third tentative-encode-program-parts)
327 encode-failure-octet))
328 (loop for i from charset-lower to charset-upper
329 do (aset other-charset-vector i
330 (gethash (encode-char (make-char worth-trying i)
331 'ucs) encode-table)))
332 (setq encode-program
333 (nsublis
334 (list (cons #xABAB (charset-id worth-trying)))
335 (nconc
336 (copy-list (first
337 tentative-encode-program-parts))
338 (append other-charset-vector nil)
339 (copy-tree (second
340 tentative-encode-program-parts))))))
341 (values encode-program worth-trying)))
342
343 (defun make-8-bit-generate-encode-program-and-skip-chars-strings
344 (decode-table encode-table encode-failure-octet)
345 "Generate a CCL program to encode a 8-bit fixed-width charset.
346
347 DECODE-TABLE must have 256 non-cons entries, and will be regarded as
348 describing a map from the octet corresponding to an offset in the
349 table to the that entry in the table. ENCODE-TABLE is a hash table
350 map from unicode values to characters in the range [0,255].
351 ENCODE-FAILURE-OCTET describes an integer between 0 and 255
352 \(inclusive) to write in the event that a character cannot be encoded. "
353 (check-argument-type #'vectorp decode-table)
354 (check-argument-range (length decode-table) #x100 #x100)
355 (check-argument-type #'hash-table-p encode-table)
356 (check-argument-type #'integerp encode-failure-octet)
357 (check-argument-range encode-failure-octet #x00 #xFF)
358 (let ((encode-program nil)
359 (general-encode-program
360 (eval-when-compile
361 (let ((prog (append
362 (ccl-compile
363 `(1
364 (loop
365 (read-multibyte-character r0 r1)
366 (mule-to-unicode r0 r1)
367 (if (r0 == #xFFFD)
368 (write #xBEEF)
369 ((lookup-integer encode-table-sym r0 r3)
370 (if r7
371 (write-multibyte-character r0 r3)
372 (write #xBEEF))))
373 (repeat)))) nil)))
374 (assert (memq #xBEEF14 prog)
375 nil
376 "This code assumes that the constant #xBEEF is #xBEEF14 \
377 in compiled CCL code.\nIf that is not the case, and it appears not to
378 be--that's why you're getting this message--it will not work. ")
379 prog)))
380 (encode-program-with-ascii-optimisation
381 (eval-when-compile
382 (let ((prog (append
383 (ccl-compile
384 `(1
385 (loop
386 (read-multibyte-character r0 r1)
387 (if (r0 == ,(charset-id 'ascii))
388 (write r1)
389 ((mule-to-unicode r0 r1)
390 (if (r0 == #xFFFD)
391 (write #xBEEF)
392 ((lookup-integer encode-table-sym r0 r3)
393 (if r7
394 (write-multibyte-character r0 r3)
395 (write #xBEEF))))))
396 (repeat)))) nil)))
397 (assert (memq #xBEEF14 prog)
398 nil
399 "This code assumes that the constant #xBEEF is #xBEEF14 \
400 in compiled CCL code.\nIf that is not the case, and it appears not to
401 be--that's why you're getting this message--it will not work. ")
402 prog)))
403 (ascii-encodes-as-itself nil)
404 (control-1-encodes-as-itself t)
405 (invalid-sequence-code-point-start
406 (eval-when-compile
407 (char-to-unicode
408 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
409 further-char-set skip-chars invalid-sequences-skip-chars)
410
411 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash
412 ;; table lookup for those characters.
413 (loop
414 for i from #x00 to #x7f
415 always (eq (int-to-char i) (gethash i encode-table))
416 finally (setq ascii-encodes-as-itself t))
417
418 ;; Note that this logic handles EBCDIC badly. For example, CP037,
419 ;; MIME name ebcdic-na, has the entire repertoire of ASCII and
420 ;; Latin 1, and thus a more optimal ccl encode program would check
421 ;; for those character sets and use tables. But for now, we do a
422 ;; hash table lookup for every character.
423 (if (null ascii-encodes-as-itself)
424 ;; General encode program. Pros; general and correct. Cons;
425 ;; slow, a hash table lookup + mule-unicode conversion is done
426 ;; for every character encoding.
427 (setq encode-program general-encode-program)
428 (multiple-value-setq
429 (encode-program further-char-set)
430 ;; Encode program with ascii-ascii mapping (based on a
431 ;; character's mule character set), and one other mule
432 ;; character set using table-based encoding, other
433 ;; character sets using hash table lookups.
434 ;; make-8-bit-non-ascii-completely-coveredp only returns
435 ;; such a mapping if some non-ASCII charset with
436 ;; characters in decode-table is entirely covered by
437 ;; encode-table.
438 (make-8-bit-generate-helper decode-table encode-table
439 encode-failure-octet))
440 (unless encode-program
441 ;; If make-8-bit-non-ascii-completely-coveredp returned nil,
442 ;; but ASCII still encodes as itself, do one-to-one mapping
443 ;; for ASCII, and a hash table lookup for everything else.
444 (setq encode-program encode-program-with-ascii-optimisation)))
445
446 (setq encode-program
447 (nsublis
448 (list (cons #xBEEF14
449 (logior (lsh encode-failure-octet 8)
450 #x14)))
451 (copy-tree encode-program)))
452 (loop
453 for i from #x80 to #x9f
454 do (unless (= i (aref decode-table i))
455 (setq control-1-encodes-as-itself nil)
456 (return)))
457 (loop
458 for i from #x00 to #xFF
459 initially (setq skip-chars
460 (cond
461 ((and ascii-encodes-as-itself
462 control-1-encodes-as-itself further-char-set)
463 (concat "\x00-\x9f" (charset-skip-chars-string
464 further-char-set)))
465 ((and ascii-encodes-as-itself
466 control-1-encodes-as-itself)
467 "\x00-\x9f")
468 ((null ascii-encodes-as-itself)
469 (skip-chars-quote (apply #'string
470 (append decode-table nil))))
471 (further-char-set
472 (concat (charset-skip-chars-string 'ascii)
473 (charset-skip-chars-string further-char-set)))
474 (t
475 (charset-skip-chars-string 'ascii)))
476 invalid-sequences-skip-chars "")
477 with decoded-ucs = nil
478 with decoded = nil
479 with no-ascii-transparency-skip-chars-list =
480 (unless ascii-encodes-as-itself (append decode-table nil))
481 ;; Can't use #'match-string here, see:
482 ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net
483 with skip-chars-test =
484 #'(lambda (skip-chars-string testing)
485 (with-temp-buffer
486 (insert testing)
487 (goto-char (point-min))
488 (skip-chars-forward skip-chars-string)
489 (= (point) (point-max))))
490 do
491 (setq decoded (aref decode-table i)
492 decoded-ucs (char-to-unicode decoded))
493 (cond
494 ((<= invalid-sequence-code-point-start decoded-ucs
495 (+ invalid-sequence-code-point-start #xFF))
496 (setq invalid-sequences-skip-chars
497 (concat (string decoded)
498 invalid-sequences-skip-chars))
499 (assert (not (funcall skip-chars-test skip-chars decoded))
500 "This char should only be skipped with \
501 `invalid-sequences-skip-chars', not by `skip-chars'"))
502 ((not (funcall skip-chars-test skip-chars decoded))
503 (if ascii-encodes-as-itself
504 (setq skip-chars (concat skip-chars (string decoded)))
505 (push decoded no-ascii-transparency-skip-chars-list))))
506 finally (unless ascii-encodes-as-itself
507 (setq skip-chars
508 (skip-chars-quote
509 (apply #'string
510 no-ascii-transparency-skip-chars-list)))))
511 (values encode-program skip-chars invalid-sequences-skip-chars)))
512
513 (defun make-8-bit-create-decode-encode-tables (unicode-map)
514 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP.
515 UNICODE-MAP should be an alist mapping from integer octet values to
516 characters with UCS code points; DECODE-TABLE will be a 256-element
517 vector, and ENCODE-TABLE will be a hash table mapping from 256 numbers
518 to 256 distinct characters. "
519 (check-argument-type #'listp unicode-map)
520 (let ((decode-table (make-vector 256 nil))
521 (encode-table (make-hash-table :size 256))
522 (private-use-start (encode-char make-8-bit-private-use-start 'ucs))
523 (invalid-sequence-code-point-start
524 (eval-when-compile
525 (char-to-unicode
526 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
527 desired-ucs decode-table-entry)
528
529 (loop for (external internal)
530 in unicode-map
531 do
532 (aset decode-table external internal)
533 (assert (not (eq (encode-char internal 'ucs) -1))
534 nil
535 "Looks like you're calling make-8-bit-coding-system in a \
536 dumped file, \nand you're either not providing a literal UNICODE-MAP
537 or PROPS. Don't do that; make-8-bit-coding-system relies on sensible
538 Unicode mappings being available, which they are at compile time for
539 dumped files (but this requires the mentioned literals), but not, for
540 most of them, at run time. ")
541
542 (puthash (encode-char internal 'ucs)
543 ;; This is semantically an integer, but Dave Love's design
544 ;; for lookup-integer in CCL means we need to store it as a
545 ;; character.
546 (int-to-char external)
547 encode-table))
548
549 ;; Now, go through the decode table. For octet values above #x7f, if the
550 ;; decode table entry is nil, this means that they have an undefined
551 ;; mapping (= they map to XEmacs characters with keys in
552 ;; unicode-error-default-translation-table); for octet values below or
553 ;; equal to #x7f, it means that they map to ASCII.
554
555 ;; If any entry (whether below or above #x7f) in the decode-table
556 ;; already maps to some character with a key in
557 ;; unicode-error-default-translation-table, it is treated as an
558 ;; undefined octet by `query-coding-region'. That is, it is not
559 ;; necessary for an octet value to be above #x7f for this to happen.
560
561 (dotimes (i 256)
562 (setq decode-table-entry (aref decode-table i))
563 (if decode-table-entry
564 (when (get-char-table
565 decode-table-entry
566 unicode-error-default-translation-table)
567 ;; The caller is explicitly specifying that this octet
568 ;; corresponds to an invalid sequence on disk:
569 (assert (= (get-char-table
570 decode-table-entry
571 unicode-error-default-translation-table) i)
572 "Bad argument to `make-8-bit-coding-system'.
573 If you're going to designate an octet with value below #x80 as invalid
574 for this coding system, make sure to map it to the invalid sequence
575 character corresponding to its octet value on disk. "))
576
577 ;; decode-table-entry is nil; either the octet is to be treated as
578 ;; contributing to an error sequence (when (> #x7f i)), or it should
579 ;; be attempted to treat it as ASCII-equivalent.
580 (setq desired-ucs (or (and (< i #x80) i)
581 (+ invalid-sequence-code-point-start i)))
582 (while (gethash desired-ucs encode-table)
583 (assert (not (< i #x80))
584 "UCS code point should not already be in encode-table!"
585 ;; There is one invalid sequence char per octet value;
586 ;; with eight-bit-fixed coding systems, it makes no sense
587 ;; for us to be multiply allocating them.
588 (gethash desired-ucs encode-table))
589 (setq desired-ucs (+ private-use-start desired-ucs)
590 private-use-start (+ private-use-start 1)))
591 (puthash desired-ucs (int-to-char i) encode-table)
592 (setq desired-ucs (if (> desired-ucs #xFF)
593 (unicode-to-char desired-ucs)
594 ;; So we get Latin-1 when run at dump time,
595 ;; instead of JIT-allocated characters.
596 (int-to-char desired-ucs)))
597 (aset decode-table i desired-ucs)))
598 (values decode-table encode-table)))
599
600 (defun make-8-bit-generate-decode-program (decode-table)
601 "Given DECODE-TABLE, generate a CCL program to decode an 8-bit charset.
602 DECODE-TABLE must have 256 non-cons entries, and will be regarded as
603 describing a map from the octet corresponding to an offset in the
604 table to the that entry in the table. "
605 (check-argument-type #'vectorp decode-table)
606 (check-argument-range (length decode-table) #x100 #x100)
607 (let ((decode-program-parts
608 (eval-when-compile
609 (let* ((compiled
610 (append
611 (ccl-compile
612 `(3
613 ((read r0)
614 (loop
615 (write-read-repeat r0 ,(make-vector
616 256 'sentinel)))))) nil))
617 (first-part compiled)
618 (last-part
619 (member-if-not #'symbolp
620 (member-if-not #'integerp first-part))))
621 ;; Chop off the sentinel sentinel sentinel [..] part.
622 (while compiled
623 (if (symbolp (cadr compiled))
624 (setcdr compiled nil))
625 (setq compiled (cdr compiled)))
626 (list first-part last-part)))))
627 (nconc
628 ;; copy-list needed, because the structure of the literal provided
629 ;; by our eval-when-compile hangs around.
630 (copy-list (first decode-program-parts))
631 (append decode-table nil)
632 (second decode-program-parts))))
633
634 (defun make-8-bit-choose-category (decode-table)
635 "Given DECODE-TABLE, return an appropriate coding category.
636 DECODE-TABLE is a 256-entry vector describing the mapping from octets on
637 disk to XEmacs characters for some fixed-width 8-bit coding system. "
638 (check-argument-type #'vectorp decode-table)
639 (check-argument-range (length decode-table) #x100 #x100)
640 (loop
641 named category
642 for i from #x80 to #x9F
643 do (unless (= i (aref decode-table i))
644 (return-from category 'no-conversion))
645 finally return 'iso-8-1))
646
647 (defun 8-bit-fixed-query-coding-region (begin end coding-system &optional
648 buffer ignore-invalid-sequencesp
649 errorp highlightp)
650 "The `query-coding-region' implementation for 8-bit-fixed coding systems.
651
652 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars'
653 coding system properties. The former is a hash table mapping from valid
654 Unicode code points to on-disk octets in the coding system; the latter a set
655 of characters as used by `skip-chars-forward'. Both of these properties are
656 generated automatically by `make-8-bit-coding-system'.
657
658 See that the documentation of `query-coding-region'; see also
659 `make-8-bit-coding-system'. "
660 (check-argument-type #'coding-system-p
661 (setq coding-system (find-coding-system coding-system)))
662 (check-argument-type #'integer-or-marker-p begin)
663 (check-argument-type #'integer-or-marker-p end)
664 (let ((from-unicode
665 (or (coding-system-get coding-system '8-bit-fixed-query-from-unicode)
666 (coding-system-get (coding-system-base coding-system)
667 '8-bit-fixed-query-from-unicode)))
668 (skip-chars-arg
669 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars)
670 (coding-system-get (coding-system-base coding-system)
671 '8-bit-fixed-query-skip-chars)))
672 (invalid-sequences-skip-chars
673 (or (coding-system-get coding-system
674 '8-bit-fixed-invalid-sequences-skip-chars)
675 (coding-system-get (coding-system-base coding-system)
676 '8-bit-fixed-invalid-sequences-skip-chars)))
677 (ranges (make-range-table))
678 (case-fold-search nil)
679 char-after fail-range-start fail-range-end extent
680 failed invalid-sequences-looking-at failed-reason
681 previous-failed-reason)
682 (check-type from-unicode hash-table)
683 (check-type skip-chars-arg string)
684 (check-type invalid-sequences-skip-chars string)
685 (setq invalid-sequences-looking-at
686 (if (equal "" invalid-sequences-skip-chars)
687 ;; Regexp that will never match.
688 #r".\{0,0\}"
689 (concat "[" invalid-sequences-skip-chars "]")))
690 (when ignore-invalid-sequencesp
691 (setq skip-chars-arg
692 (concat skip-chars-arg invalid-sequences-skip-chars)))
693 (save-excursion
694 (when highlightp
695 (query-coding-clear-highlights begin end buffer))
696 (goto-char begin buffer)
697 (skip-chars-forward skip-chars-arg end buffer)
698 (while (< (point buffer) end)
699 (setq char-after (char-after (point buffer) buffer)
700 fail-range-start (point buffer))
701 (while (and
702 (< (point buffer) end)
703 (or (and
704 (not (gethash (encode-char char-after 'ucs) from-unicode))
705 (setq failed-reason 'unencodable))
706 (and (not ignore-invalid-sequencesp)
707 (looking-at invalid-sequences-looking-at buffer)
708 (setq failed-reason 'invalid-sequence)))
709 (or (null previous-failed-reason)
710 (eq previous-failed-reason failed-reason)))
711 (forward-char 1 buffer)
712 (setq char-after (char-after (point buffer) buffer)
713 failed t
714 previous-failed-reason failed-reason))
715 (if (= fail-range-start (point buffer))
716 ;; The character can actually be encoded by the coding
717 ;; system; check the characters past it.
718 (forward-char 1 buffer)
719 ;; The character actually failed.
720 (when errorp
721 (error 'text-conversion-error
722 (format "Cannot encode %s using coding system"
723 (buffer-substring fail-range-start (point buffer)
724 buffer))
725 (coding-system-name coding-system)))
726 (assert (not (null previous-failed-reason)) t
727 "previous-failed-reason should always be non-nil here")
728 (put-range-table fail-range-start
729 ;; If char-after is non-nil, we're not at
730 ;; the end of the buffer.
731 (setq fail-range-end (if char-after
732 (point buffer)
733 (point-max buffer)))
734 previous-failed-reason ranges)
735 (setq previous-failed-reason nil)
736 (when highlightp
737 (setq extent (make-extent fail-range-start fail-range-end buffer))
738 (set-extent-priority extent (+ mouse-highlight-priority 2))
739 (set-extent-face extent 'query-coding-warning-face))
740 (skip-chars-forward skip-chars-arg end buffer)))
741 (if failed
742 (values nil ranges)
743 (values t nil)))))
744
745 (defun make-8-bit-coding-system (name unicode-map &optional description props)
746 "Make and return a fixed-width 8-bit CCL coding system named NAME.
747 NAME must be a symbol, and UNICODE-MAP a list.
748
749 UNICODE-MAP is a plist describing a map from octets in the coding
750 system NAME (as integers) to XEmacs characters. Those XEmacs
751 characters will be used explicitly on decoding, but for encoding (most
752 relevantly, on writing to disk) XEmacs characters that map to the same
753 Unicode code point will be unified. This means that the ISO-8859-?
754 characters that map to the same Unicode code point will not be
755 distinct when written to disk, which is normally what is intended; it
756 also means that East Asian Han characters from different XEmacs
757 character sets will not be distinct when written to disk, which is
758 less often what is intended.
759
760 Any octets not mapped, and with values above #x7f, will be decoded into
761 XEmacs characters that reflect that their values are undefined. These
762 characters will be displayed in a language-environment-specific way. See
763 `unicode-error-default-translation-table' and the
764 `invalid-sequence-coding-system' argument to `set-language-info'.
765
766 These characters will normally be treated as invalid when checking whether
767 text can be encoded with `query-coding-region'--see the
768 IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It is
769 possible to specify that octets with values less than #x80 (or indeed
770 greater than it) be treated in this way, by specifying explicitly that they
771 correspond to the character mapping to that octet in
772 `unicode-error-default-translation-table'. Far fewer coding systems
773 override the ASCII mapping, though, so this is not the default.
774
775 DESCRIPTION and PROPS are as in `make-coding-system', which see. This
776 function also accepts two additional (optional) properties in PROPS;
777 `aliases', giving a list of aliases to be initialized for this
778 coding-system, and `encode-failure-octet', an integer between 0 and 256 to
779 write in place of XEmacs characters that cannot be encoded, defaulting to
780 the code for tilde `~'. "
781 (check-argument-type #'symbolp name)
782 (check-argument-type #'listp unicode-map)
783 (check-argument-type #'stringp
784 (or description
785 (setq description
786 (format "Coding system used for %s." name))))
787 (check-valid-plist props)
788 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet)
789 (char-to-int ?~)))
790 (aliases (plist-get props 'aliases))
791 (hash-table-sym (gentemp (format "%s-encode-table" name)))
792 encode-program decode-program result decode-table encode-table
793 skip-chars invalid-sequences-skip-chars)
794
795 ;; Some more sanity checking.
796 (check-argument-range encode-failure-octet 0 #xFF)
797 (check-argument-type #'listp aliases)
798
799 ;; Don't pass on our extra data to make-coding-system.
800 (setq props (plist-remprop props 'encode-failure-octet)
801 props (plist-remprop props 'aliases))
802
803 (multiple-value-setq
804 (decode-table encode-table)
805 (make-8-bit-create-decode-encode-tables unicode-map))
806
807 ;; Register the decode-table.
808 (define-translation-hash-table hash-table-sym encode-table)
809
810 ;; Generate the programs and skip-chars strings.
811 (setq decode-program (make-8-bit-generate-decode-program decode-table))
812 (multiple-value-setq
813 (encode-program skip-chars invalid-sequences-skip-chars)
814 (make-8-bit-generate-encode-program-and-skip-chars-strings
815 decode-table encode-table encode-failure-octet))
816
817 (unless (vectorp encode-program)
818 (setq encode-program
819 (apply #'vector
820 (nsublis (list (cons 'encode-table-sym hash-table-sym))
821 (copy-tree encode-program)))))
822 (unless (vectorp decode-program)
823 (setq decode-program
824 (apply #'vector decode-program)))
825
826 ;; And now generate the actual coding system.
827 (setq result
828 (make-coding-system
829 name 'ccl
830 description
831 (plist-put (plist-put props 'decode decode-program)
832 'encode encode-program)))
833 (coding-system-put name '8-bit-fixed t)
834 (coding-system-put name 'category
835 (make-8-bit-choose-category decode-table))
836 (coding-system-put name '8-bit-fixed-query-skip-chars
837 skip-chars)
838 (coding-system-put name '8-bit-fixed-invalid-sequences-skip-chars
839 invalid-sequences-skip-chars)
840 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table)
841 (coding-system-put name 'query-coding-function
842 #'8-bit-fixed-query-coding-region)
843 (coding-system-put (intern (format "%s-unix" name))
844 'query-coding-function
845 #'8-bit-fixed-query-coding-region)
846 (coding-system-put (intern (format "%s-dos" name))
847 'query-coding-function
848 #'8-bit-fixed-query-coding-region)
849 (coding-system-put (intern (format "%s-mac" name))
850 'query-coding-function
851 #'8-bit-fixed-query-coding-region)
852 (loop for alias in aliases
853 do (define-coding-system-alias alias name))
854 result))
855
856 (define-compiler-macro make-8-bit-coding-system (&whole form name unicode-map
857 &optional description props)
858 ;; We provide the compiler macro (= macro that is expanded only on
859 ;; compilation, and that can punt to a runtime version of the
860 ;; associate function if necessary) not for reasons of speed, though
861 ;; it does speed up things at runtime a little, but because the
862 ;; Unicode mappings are available at compile time in the dumped
863 ;; files, but they are not available at run time for the vast
864 ;; majority of them.
865
866 (if (not (and (and (consp name) (eq (car name) 'quote))
867 (and (consp unicode-map) (eq (car unicode-map) 'quote))
868 (and (or (and (consp props) (eq (car props) 'quote))
869 (null props)))))
870 ;; The call does not use literals; do it at runtime.
871 form
872 (setq name (cadr name)
873 unicode-map (cadr unicode-map)
874 props (if props (cadr props)))
875 (let ((encode-failure-octet
876 (or (plist-get props 'encode-failure-octet) (char-to-int ?~)))
877 (aliases (plist-get props 'aliases))
878 encode-program decode-program
879 decode-table encode-table
880 skip-chars invalid-sequences-skip-chars)
881
882 ;; Some sanity checking.
883 (check-argument-range encode-failure-octet 0 #xFF)
884 (check-argument-type #'listp aliases)
885
886 ;; Don't pass on our extra data to make-coding-system.
887 (setq props (plist-remprop props 'encode-failure-octet)
888 props (plist-remprop props 'aliases))
889
890 ;; Work out encode-table and decode-table
891 (multiple-value-setq
892 (decode-table encode-table)
893 (make-8-bit-create-decode-encode-tables unicode-map))
894
895 ;; Generate the decode and encode programs, and the skip-chars
896 ;; arguments.
897 (setq decode-program (make-8-bit-generate-decode-program decode-table))
898 (multiple-value-setq
899 (encode-program skip-chars invalid-sequences-skip-chars)
900 (make-8-bit-generate-encode-program-and-skip-chars-strings
901 decode-table encode-table encode-failure-octet))
902
903 ;; And return the generated code.
904 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name)))
905 (encode-table ,encode-table))
906 (define-translation-hash-table encode-table-sym encode-table)
907 (make-coding-system
908 ',name 'ccl ,description
909 (plist-put (plist-put ',props 'decode
910 ,(apply #'vector decode-program))
911 'encode
912 (apply #'vector
913 (nsublis
914 (list (cons
915 'encode-table-sym
916 (symbol-value 'encode-table-sym)))
917 ',encode-program))))
918 (coding-system-put ',name '8-bit-fixed t)
919 (coding-system-put ',name 'category
920 ',(make-8-bit-choose-category decode-table))
921 (coding-system-put ',name '8-bit-fixed-query-skip-chars
922 ,skip-chars)
923 (coding-system-put ',name '8-bit-fixed-invalid-sequences-skip-chars
924 ,invalid-sequences-skip-chars)
925 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table)
926 (coding-system-put ',name 'query-coding-function
927 #'8-bit-fixed-query-coding-region)
928 (coding-system-put ',(intern (format "%s-unix" name))
929 'query-coding-function
930 #'8-bit-fixed-query-coding-region)
931 (coding-system-put ',(intern (format "%s-dos" name))
932 'query-coding-function
933 #'8-bit-fixed-query-coding-region)
934 (coding-system-put ',(intern (format "%s-mac" name))
935 'query-coding-function
936 #'8-bit-fixed-query-coding-region)
937 ,(macroexpand `(loop for alias in ',aliases
938 do (define-coding-system-alias alias
939 ',name)))
940 (find-coding-system ',name)))))
941
942 ;; Ideally this would be in latin.el, but code-init.el uses it. 222 ;; Ideally this would be in latin.el, but code-init.el uses it.
943 (make-8-bit-coding-system 223 (make-coding-system
944 'iso-8859-1 224 'iso-8859-1
945 (loop 225 'fixed-width
946 for i from #x80 to #xff
947 collect (list i (int-char i))) ;; Identical to Latin-1.
948 "ISO-8859-1 (Latin-1)" 226 "ISO-8859-1 (Latin-1)"
949 '(mnemonic "Latin 1" 227 (eval-when-compile
950 documentation "The most used encoding of Western Europe and the Americas." 228 `(unicode-map
951 aliases (iso-latin-1 latin-1))) 229 ,(loop
230 for i from #x80 to #xff
231 collect (list i (int-char i))) ;; Identical to Latin-1.
232 mnemonic "Latin 1"
233 documentation "The most used encoding of Western Europe and the Americas."
234 aliases (iso-latin-1 latin-1))))