comparison lisp/mule/mule-coding.el @ 4604:e0a8715fdb1f

Support new IGNORE-INVALID-SEQUENCESP argument, #'query-coding-region. lisp/ChangeLog addition: 2009-02-07 Aidan Kehoe <kehoea@parhasard.net> * coding.el (query-coding-clear-highlights): Rename the BUFFER argument to BUFFER-OR-STRING, describe it as possibly being a string in its documentation. (default-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document that this function does not support it. Bind case-fold-search to nil, we don't want this to influence what the function thinks is encodable or not. (query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does; reflect this new argument in the associated compiler macro. (query-coding-string): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does. Support the HIGHLIGHT argument correctly. * unicode.el (unicode-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does, implement this. Document a potential problem. Use #'query-coding-clear-highlights instead of reimplementing it ourselves. Remove some debugging messages. * mule/arabic.el (iso-8859-6): * mule/cyrillic.el (iso-8859-5): * mule/greek.el (iso-8859-7): * mule/hebrew.el (iso-8859-8): * mule/latin.el (iso-8859-2): * mule/latin.el (iso-8859-3): * mule/latin.el (iso-8859-4): * mule/latin.el (iso-8859-14): * mule/latin.el (iso-8859-15): * mule/latin.el (iso-8859-16): * mule/latin.el (iso-8859-9): * mule/latin.el (windows-1252): * mule/mule-coding.el (iso-8859-1): Avoid the assumption that characters not given an explicit mapping in these coding systems map to the ISO 8859-1 characters corresponding to the octets on disk; this makes it much more reasonable to implement the IGNORE-INVALID-SEQUENCESP argument to query-coding-region. * mule/mule-cmds.el (set-language-info): Correct the docstring. * mule/mule-cmds.el (finish-set-language-environment): Treat invalid Unicode sequences produced from invalid-sequence-coding-system and corresponding to control characters the same as control characters in redisplay. * mule/mule-cmds.el: Document that encode-coding-char is available in coding.el * mule/mule-coding.el (make-8-bit-generate-helper): Change to return the both the encode-program generated and the relevant non-ASCII charset; update the docstring to reflect this. * mule/mule-coding.el (make-8-bit-generate-encode-program-and-skip-chars-strings): Rename this function; have it return skip-chars-strings as well as the encode program. Have these skip-chars-strings use ranges for charsets, where possible. * mule/mule-coding.el (make-8-bit-create-decode-encode-tables): Revise this to allow people to specify explicitly characters that should be undefined (= corresponding to keys in unicode-error-default-translation-table), and treating unspecified octets above #x7f as undefined by default. * mule/mule-coding.el (8-bit-fixed-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, implement support for it using the 8-bit-fixed-invalid-sequences-skip-chars coding system property; remove some debugging messages. * mule/mule-coding.el (make-8-bit-coding-system): This function is dumped, autoloading it makes no sense. Document what happens when characters above #x7f are not specified, implement this. * mule/vietnamese.el: Correct spelling. tests/ChangeLog addition: 2009-02-07 Aidan Kehoe <kehoea@parhasard.net> * automated/query-coding-tests.el: Add FAILING-CASE arguments to the Assert calls, making #'q-c-debug mostly unnecessary. Remove #'q-c-debug. Add new tests that use the IGNORE-INVALID-SEQUENCESP argument to #'query-coding-region; rework the existing ones to respect it.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 07 Feb 2009 17:13:37 +0000
parents 1d74a1d115ee
children c786c3fd0740
comparison
equal deleted inserted replaced
4603:202cb69c4d87 4604:e0a8715fdb1f
229 system map to distinct XEmacs characters, preventing a spurious changes when 229 system map to distinct XEmacs characters, preventing a spurious changes when
230 a file is read, not changed, and then written. ") 230 a file is read, not changed, and then written. ")
231 231
232 (defun make-8-bit-generate-helper (decode-table encode-table 232 (defun make-8-bit-generate-helper (decode-table encode-table
233 encode-failure-octet) 233 encode-failure-octet)
234 "Helper function for `make-8-bit-generate-encode-program', which see. 234 "Helper function, `make-8-bit-generate-encode-program-and-skip-chars-strings',
235 which see.
235 236
236 Deals with the case where ASCII and another character set can both be 237 Deals with the case where ASCII and another character set can both be
237 encoded unambiguously and completely into the coding-system; if this is so, 238 encoded unambiguously and completely into the coding-system; if this is so,
238 returns a list corresponding to such a ccl-program. If not, it returns nil. " 239 returns a list comprised of such a ccl-program and the character set in
240 question. If not, it returns a list with both entries nil."
239 (let ((tentative-encode-program-parts 241 (let ((tentative-encode-program-parts
240 (eval-when-compile 242 (eval-when-compile
241 (let* ((vec-len 128) 243 (let* ((vec-len 128)
242 (compiled 244 (compiled
243 (append 245 (append
335 (copy-list (first 337 (copy-list (first
336 tentative-encode-program-parts)) 338 tentative-encode-program-parts))
337 (append other-charset-vector nil) 339 (append other-charset-vector nil)
338 (copy-tree (second 340 (copy-tree (second
339 tentative-encode-program-parts)))))) 341 tentative-encode-program-parts))))))
340 encode-program)) 342 (values encode-program worth-trying)))
341 343
342 (defun make-8-bit-generate-encode-program (decode-table encode-table 344 (defun make-8-bit-generate-encode-program-and-skip-chars-strings
343 encode-failure-octet) 345 (decode-table encode-table encode-failure-octet)
344 "Generate a CCL program to decode a 8-bit fixed-width charset. 346 "Generate a CCL program to encode a 8-bit fixed-width charset.
345 347
346 DECODE-TABLE must have 256 non-cons entries, and will be regarded as 348 DECODE-TABLE must have 256 non-cons entries, and will be regarded as
347 describing a map from the octet corresponding to an offset in the 349 describing a map from the octet corresponding to an offset in the
348 table to the that entry in the table. ENCODE-TABLE is a hash table 350 table to the that entry in the table. ENCODE-TABLE is a hash table
349 map from unicode values to characters in the range [0,255]. 351 map from unicode values to characters in the range [0,255].
397 nil 399 nil
398 "This code assumes that the constant #xBEEF is #xBEEF14 \ 400 "This code assumes that the constant #xBEEF is #xBEEF14 \
399 in compiled CCL code.\nIf that is not the case, and it appears not to 401 in compiled CCL code.\nIf that is not the case, and it appears not to
400 be--that's why you're getting this message--it will not work. ") 402 be--that's why you're getting this message--it will not work. ")
401 prog))) 403 prog)))
402 (ascii-encodes-as-itself nil)) 404 (ascii-encodes-as-itself nil)
405 (control-1-encodes-as-itself t)
406 (invalid-sequence-code-point-start
407 (eval-when-compile
408 (char-to-unicode
409 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
410 further-char-set skip-chars invalid-sequences-skip-chars)
403 411
404 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash 412 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash
405 ;; table lookup for those characters. 413 ;; table lookup for those characters.
406 (loop 414 (loop
407 for i from #x00 to #x7f 415 for i from #x00 to #x7f
416 (if (null ascii-encodes-as-itself) 424 (if (null ascii-encodes-as-itself)
417 ;; General encode program. Pros; general and correct. Cons; 425 ;; General encode program. Pros; general and correct. Cons;
418 ;; slow, a hash table lookup + mule-unicode conversion is done 426 ;; slow, a hash table lookup + mule-unicode conversion is done
419 ;; for every character encoding. 427 ;; for every character encoding.
420 (setq encode-program general-encode-program) 428 (setq encode-program general-encode-program)
421 (setq encode-program 429 (multiple-value-setq
422 ;; Encode program with ascii-ascii mapping (based on a 430 (encode-program further-char-set)
423 ;; character's mule character set), and one other mule 431 ;; Encode program with ascii-ascii mapping (based on a
424 ;; character set using table-based encoding, other 432 ;; character's mule character set), and one other mule
425 ;; character sets using hash table lookups. 433 ;; character set using table-based encoding, other
426 ;; make-8-bit-non-ascii-completely-coveredp only returns 434 ;; character sets using hash table lookups.
427 ;; such a mapping if some non-ASCII charset with 435 ;; make-8-bit-non-ascii-completely-coveredp only returns
428 ;; characters in decode-table is entirely covered by 436 ;; such a mapping if some non-ASCII charset with
429 ;; encode-table. 437 ;; characters in decode-table is entirely covered by
430 (make-8-bit-generate-helper decode-table encode-table 438 ;; encode-table.
431 encode-failure-octet)) 439 (make-8-bit-generate-helper decode-table encode-table
440 encode-failure-octet))
432 (unless encode-program 441 (unless encode-program
433 ;; If make-8-bit-non-ascii-completely-coveredp returned nil, 442 ;; If make-8-bit-non-ascii-completely-coveredp returned nil,
434 ;; but ASCII still encodes as itself, do one-to-one mapping 443 ;; but ASCII still encodes as itself, do one-to-one mapping
435 ;; for ASCII, and a hash table lookup for everything else. 444 ;; for ASCII, and a hash table lookup for everything else.
436 (setq encode-program encode-program-with-ascii-optimisation))) 445 (setq encode-program encode-program-with-ascii-optimisation)))
439 (nsublis 448 (nsublis
440 (list (cons #xBEEF14 449 (list (cons #xBEEF14
441 (logior (lsh encode-failure-octet 8) 450 (logior (lsh encode-failure-octet 8)
442 #x14))) 451 #x14)))
443 (copy-tree encode-program))) 452 (copy-tree encode-program)))
444 encode-program)) 453 (loop
454 for i from #x80 to #x9f
455 do (unless (= i (aref decode-table i))
456 (setq control-1-encodes-as-itself nil)
457 (return)))
458 (loop
459 for i from #x00 to #xFF
460 initially (setq skip-chars
461 (cond
462 ((and ascii-encodes-as-itself
463 control-1-encodes-as-itself further-char-set)
464 (concat "\x00-\x9f" (charset-skip-chars-string
465 further-char-set)))
466 ((and ascii-encodes-as-itself
467 control-1-encodes-as-itself)
468 "\x00-\x9f")
469 ((null ascii-encodes-as-itself)
470 (skip-chars-quote (apply #'string
471 (append decode-table nil))))
472 (further-char-set
473 (concat (charset-skip-chars-string 'ascii)
474 (charset-skip-chars-string further-char-set)))
475 (t
476 (charset-skip-chars-string 'ascii)))
477 invalid-sequences-skip-chars "")
478 with decoded-ucs = nil
479 with decoded = nil
480 with no-ascii-transparency-skip-chars-list =
481 (unless ascii-encodes-as-itself (append decode-table nil))
482 ;; Can't use #'match-string here, see:
483 ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net
484 with skip-chars-test =
485 #'(lambda (skip-chars-string testing)
486 (with-temp-buffer
487 (insert testing)
488 (goto-char (point-min))
489 (skip-chars-forward skip-chars-string)
490 (= (point) (point-max))))
491 do
492 (setq decoded (aref decode-table i)
493 decoded-ucs (char-to-unicode decoded))
494 (cond
495 ((<= invalid-sequence-code-point-start decoded-ucs
496 (+ invalid-sequence-code-point-start #xFF))
497 (setq invalid-sequences-skip-chars
498 (concat (string decoded)
499 invalid-sequences-skip-chars))
500 (assert (not (funcall skip-chars-test skip-chars decoded))
501 "This char should only be skipped with \
502 `invalid-sequences-skip-chars', not by `skip-chars'"))
503 ((not (funcall skip-chars-test skip-chars decoded))
504 (if ascii-encodes-as-itself
505 (setq skip-chars (concat skip-chars (string decoded)))
506 (push decoded no-ascii-transparency-skip-chars-list))))
507 finally (unless ascii-encodes-as-itself
508 (setq skip-chars
509 (skip-chars-quote
510 (apply #'string
511 no-ascii-transparency-skip-chars-list)))))
512 (values encode-program skip-chars invalid-sequences-skip-chars)))
445 513
446 (defun make-8-bit-create-decode-encode-tables (unicode-map) 514 (defun make-8-bit-create-decode-encode-tables (unicode-map)
447 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. 515 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP.
448 UNICODE-MAP should be an alist mapping from integer octet values to 516 UNICODE-MAP should be an alist mapping from integer octet values to
449 characters with UCS code points; DECODE-TABLE will be a 256-element 517 characters with UCS code points; DECODE-TABLE will be a 256-element
451 to 256 distinct characters. " 519 to 256 distinct characters. "
452 (check-argument-type #'listp unicode-map) 520 (check-argument-type #'listp unicode-map)
453 (let ((decode-table (make-vector 256 nil)) 521 (let ((decode-table (make-vector 256 nil))
454 (encode-table (make-hash-table :size 256)) 522 (encode-table (make-hash-table :size 256))
455 (private-use-start (encode-char make-8-bit-private-use-start 'ucs)) 523 (private-use-start (encode-char make-8-bit-private-use-start 'ucs))
456 desired-ucs) 524 (invalid-sequence-code-point-start
525 (eval-when-compile
526 (char-to-unicode
527 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3))))
528 desired-ucs decode-table-entry)
457 529
458 (loop for (external internal) 530 (loop for (external internal)
459 in unicode-map 531 in unicode-map
460 do 532 do
461 (aset decode-table external internal) 533 (aset decode-table external internal)
473 ;; for lookup-integer in CCL means we need to store it as a 545 ;; for lookup-integer in CCL means we need to store it as a
474 ;; character. 546 ;; character.
475 (int-to-char external) 547 (int-to-char external)
476 encode-table)) 548 encode-table))
477 549
478 ;; Now, go through the decode table looking at the characters that 550 ;; Now, go through the decode table. For octet values above #x7f, if the
479 ;; remain nil. If the XEmacs character with that integer is already in 551 ;; decode table entry is nil, this means that they have an undefined
480 ;; the encode table, map the on-disk octet to a Unicode private use 552 ;; mapping (= they map to XEmacs characters with keys in
481 ;; character. Otherwise map the on-disk octet to the XEmacs character 553 ;; unicode-error-default-translation-table); for octet values below or
482 ;; with that numeric value, to make it clearer what it is. 554 ;; equal to #x7f, it means that they map to ASCII.
555
556 ;; If any entry (whether below or above #x7f) in the decode-table
557 ;; already maps to some character with a key in
558 ;; unicode-error-default-translation-table, it is treated as an
559 ;; undefined octet by `query-coding-region'. That is, it is not
560 ;; necessary for an octet value to be above #x7f for this to happen.
561
483 (dotimes (i 256) 562 (dotimes (i 256)
484 (when (null (aref decode-table i)) 563 (setq decode-table-entry (aref decode-table i))
485 ;; Find a free code point. 564 (if decode-table-entry
486 (setq desired-ucs i) 565 (when (get-char-table
487 (while (gethash desired-ucs encode-table) 566 decode-table-entry
488 ;; In the normal case, the code point chosen will be U+E0XY, where 567 unicode-error-default-translation-table)
489 ;; XY is the hexadecimal octet on disk. In pathological cases 568 ;; The caller is explicitly specifying that this octet
490 ;; it'll be something else. 569 ;; corresponds to an invalid sequence on disk:
491 (setq desired-ucs (+ private-use-start desired-ucs) 570 (assert (= (get-char-table
492 private-use-start (+ private-use-start 1))) 571 decode-table-entry
493 (puthash desired-ucs (int-to-char i) encode-table) 572 unicode-error-default-translation-table) i)
573 "Bad argument to `make-8-bit-coding-system'.
574 If you're going to designate an octet with value below #x80 as invalid
575 for this coding system, make sure to map it to the invalid sequence
576 character corresponding to its octet value on disk. "))
577
578 ;; decode-table-entry is nil; either the octet is to be treated as
579 ;; contributing to an error sequence (when (> #x7f i)), or it should
580 ;; be attempted to treat it as ASCII-equivalent.
581 (setq desired-ucs (or (and (< i #x80) i)
582 (+ invalid-sequence-code-point-start i)))
583 (while (gethash desired-ucs encode-table)
584 (assert (not (< i #x80))
585 "UCS code point should not already be in encode-table!"
586 ;; There is one invalid sequence char per octet value;
587 ;; with eight-bit-fixed coding systems, it makes no sense
588 ;; for us to be multiply allocating them.
589 (gethash desired-ucs encode-table))
590 (setq desired-ucs (+ private-use-start desired-ucs)
591 private-use-start (+ private-use-start 1)))
592 (puthash desired-ucs (int-to-char i) encode-table)
494 (setq desired-ucs (if (> desired-ucs #xFF) 593 (setq desired-ucs (if (> desired-ucs #xFF)
495 (decode-char 'ucs desired-ucs) 594 (unicode-to-char desired-ucs)
496 ;; So we get Latin-1 when run at dump time, 595 ;; So we get Latin-1 when run at dump time,
497 ;; instead of JIT-allocated characters. 596 ;; instead of JIT-allocated characters.
498 (int-to-char desired-ucs))) 597 (int-to-char desired-ucs)))
499 (aset decode-table i desired-ucs))) 598 (aset decode-table i desired-ucs)))
500 (values decode-table encode-table))) 599 (values decode-table encode-table)))
544 for i from #x80 to #x9F 643 for i from #x80 to #x9F
545 do (unless (= i (aref decode-table i)) 644 do (unless (= i (aref decode-table i))
546 (return-from category 'no-conversion)) 645 (return-from category 'no-conversion))
547 finally return 'iso-8-1)) 646 finally return 'iso-8-1))
548 647
549 (defun 8-bit-fixed-query-coding-region (begin end coding-system 648 (defun 8-bit-fixed-query-coding-region (begin end coding-system &optional
550 &optional buffer errorp highlightp) 649 buffer ignore-invalid-sequencesp
650 errorp highlightp)
551 "The `query-coding-region' implementation for 8-bit-fixed coding systems. 651 "The `query-coding-region' implementation for 8-bit-fixed coding systems.
552 652
553 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars' 653 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars'
554 coding system properties. The former is a hash table mapping from valid 654 coding system properties. The former is a hash table mapping from valid
555 Unicode code points to on-disk octets in the coding system; the latter a set 655 Unicode code points to on-disk octets in the coding system; the latter a set
568 '8-bit-fixed-query-from-unicode))) 668 '8-bit-fixed-query-from-unicode)))
569 (skip-chars-arg 669 (skip-chars-arg
570 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars) 670 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars)
571 (coding-system-get (coding-system-base coding-system) 671 (coding-system-get (coding-system-base coding-system)
572 '8-bit-fixed-query-skip-chars))) 672 '8-bit-fixed-query-skip-chars)))
673 (invalid-sequences-skip-chars
674 (or (coding-system-get coding-system
675 '8-bit-fixed-invalid-sequences-skip-chars)
676 (coding-system-get (coding-system-base coding-system)
677 '8-bit-fixed-invalid-sequences-skip-chars)))
573 (ranges (make-range-table)) 678 (ranges (make-range-table))
679 (case-fold-search nil)
574 char-after fail-range-start fail-range-end previous-fail extent 680 char-after fail-range-start fail-range-end previous-fail extent
575 failed) 681 failed invalid-sequences-looking-at failed-reason
682 previous-failed-reason)
576 (check-type from-unicode hash-table) 683 (check-type from-unicode hash-table)
577 (check-type skip-chars-arg string) 684 (check-type skip-chars-arg string)
685 (check-type invalid-sequences-skip-chars string)
686 (setq invalid-sequences-looking-at
687 (if (equal "" invalid-sequences-skip-chars)
688 ;; Regexp that will never match.
689 #r".\{0,0\}"
690 (concat "[" invalid-sequences-skip-chars "]")))
691 (when ignore-invalid-sequencesp
692 (setq skip-chars-arg
693 (concat skip-chars-arg invalid-sequences-skip-chars)))
578 (save-excursion 694 (save-excursion
579 (when highlightp 695 (when highlightp
580 (map-extents #'(lambda (extent ignored-arg) 696 (query-coding-clear-highlights begin end buffer))
581 (when (eq 'query-coding-warning-face
582 (extent-face extent))
583 (delete-extent extent))) buffer begin end))
584 (goto-char begin buffer) 697 (goto-char begin buffer)
585 (skip-chars-forward skip-chars-arg end buffer) 698 (skip-chars-forward skip-chars-arg end buffer)
586 (while (< (point buffer) end) 699 (while (< (point buffer) end)
587 ; (message
588 ; "fail-range-start is %S, previous-fail %S, point is %S, end is %S"
589 ; fail-range-start previous-fail (point buffer) end)
590 (setq char-after (char-after (point buffer) buffer) 700 (setq char-after (char-after (point buffer) buffer)
591 fail-range-start (point buffer)) 701 fail-range-start (point buffer))
592 ; (message "arguments are %S %S"
593 ; (< (point buffer) end)
594 ; (not (gethash (encode-char char-after 'ucs) from-unicode)))
595 (while (and 702 (while (and
596 (< (point buffer) end) 703 (< (point buffer) end)
597 (not (gethash (encode-char char-after 'ucs) from-unicode))) 704 (or (and
705 (not (gethash (encode-char char-after 'ucs) from-unicode))
706 (setq failed-reason 'unencodable))
707 (and (not ignore-invalid-sequencesp)
708 (looking-at invalid-sequences-looking-at buffer)
709 (setq failed-reason 'invalid-sequence)))
710 (or (null previous-failed-reason)
711 (eq previous-failed-reason failed-reason)))
598 (forward-char 1 buffer) 712 (forward-char 1 buffer)
599 (setq char-after (char-after (point buffer) buffer) 713 (setq char-after (char-after (point buffer) buffer)
600 failed t)) 714 failed t
715 previous-failed-reason failed-reason))
601 (if (= fail-range-start (point buffer)) 716 (if (= fail-range-start (point buffer))
602 ;; The character can actually be encoded by the coding 717 ;; The character can actually be encoded by the coding
603 ;; system; check the characters past it. 718 ;; system; check the characters past it.
604 (forward-char 1 buffer) 719 (forward-char 1 buffer)
605 ;; The character actually failed. 720 ;; The character actually failed.
606 ; (message "past the move through, point now %S" (point buffer))
607 (when errorp 721 (when errorp
608 (error 'text-conversion-error 722 (error 'text-conversion-error
609 (format "Cannot encode %s using coding system" 723 (format "Cannot encode %s using coding system"
610 (buffer-substring fail-range-start (point buffer) 724 (buffer-substring fail-range-start (point buffer)
611 buffer)) 725 buffer))
612 (coding-system-name coding-system))) 726 (coding-system-name coding-system)))
727 (assert (not (null previous-failed-reason)) t
728 "previous-failed-reason should always be non-nil here")
613 (put-range-table fail-range-start 729 (put-range-table fail-range-start
614 ;; If char-after is non-nil, we're not at 730 ;; If char-after is non-nil, we're not at
615 ;; the end of the buffer. 731 ;; the end of the buffer.
616 (setq fail-range-end (if char-after 732 (setq fail-range-end (if char-after
617 (point buffer) 733 (point buffer)
618 (point-max buffer))) 734 (point-max buffer)))
619 t ranges) 735 previous-failed-reason ranges)
736 (setq previous-failed-reason nil)
620 (when highlightp 737 (when highlightp
621 ; (message "highlighting")
622 (setq extent (make-extent fail-range-start fail-range-end buffer)) 738 (setq extent (make-extent fail-range-start fail-range-end buffer))
623 (set-extent-priority extent (+ mouse-highlight-priority 2)) 739 (set-extent-priority extent (+ mouse-highlight-priority 2))
624 (set-extent-face extent 'query-coding-warning-face)) 740 (set-extent-face extent 'query-coding-warning-face))
625 (skip-chars-forward skip-chars-arg end buffer))) 741 (skip-chars-forward skip-chars-arg end buffer)))
626 ; (message "about to give the result, ranges %S" ranges)
627 (if failed 742 (if failed
628 (values nil ranges) 743 (values nil ranges)
629 (values t nil))))) 744 (values t nil)))))
630 745
631 ;;;###autoload
632 (defun make-8-bit-coding-system (name unicode-map &optional description props) 746 (defun make-8-bit-coding-system (name unicode-map &optional description props)
633 "Make and return a fixed-width 8-bit CCL coding system named NAME. 747 "Make and return a fixed-width 8-bit CCL coding system named NAME.
634 NAME must be a symbol, and UNICODE-MAP a list. 748 NAME must be a symbol, and UNICODE-MAP a list.
635 749
636 UNICODE-MAP is a plist describing a map from octets in the coding 750 UNICODE-MAP is a plist describing a map from octets in the coding
642 distinct when written to disk, which is normally what is intended; it 756 distinct when written to disk, which is normally what is intended; it
643 also means that East Asian Han characters from different XEmacs 757 also means that East Asian Han characters from different XEmacs
644 character sets will not be distinct when written to disk, which is 758 character sets will not be distinct when written to disk, which is
645 less often what is intended. 759 less often what is intended.
646 760
647 Any octets not mapped will be decoded into the ISO 8859-1 characters with 761 Any octets not mapped, and with values above #x7f, will be decoded into
648 the corresponding numeric value; unless another octet maps to that 762 XEmacs characters that reflect that their values are undefined. These
649 character, in which case the Unicode private use area will be used. This 763 characters will be displayed in a language-environment-specific way. See
650 avoids spurious changes to files on disk when they contain octets that would 764 `unicode-error-default-translation-table' and the
651 be otherwise remapped to the canonical values for the corresponding 765 `invalid-sequence-coding-system' argument to `set-language-info'.
652 characters in the coding system. 766
767 These characters will normally be treated as invalid when checking whether
768 text can be encoded with `query-coding-region'--see the
769 IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It is
770 possible to specify that octets with values less than #x80 (or indeed
771 greater than it) be treated in this way, by specifying explicitly that they
772 correspond to the character mapping to that octet in
773 `unicode-error-default-translation-table'. Far fewer coding systems
774 override the ASCII mapping, though, so this is not the default.
653 775
654 DESCRIPTION and PROPS are as in `make-coding-system', which see. This 776 DESCRIPTION and PROPS are as in `make-coding-system', which see. This
655 function also accepts two additional (optional) properties in PROPS; 777 function also accepts two additional (optional) properties in PROPS;
656 `aliases', giving a list of aliases to be initialized for this 778 `aliases', giving a list of aliases to be initialized for this
657 coding-system, and `encode-failure-octet', an integer between 0 and 256 to 779 coding-system, and `encode-failure-octet', an integer between 0 and 256 to
666 (check-valid-plist props) 788 (check-valid-plist props)
667 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet) 789 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet)
668 (char-to-int ?~))) 790 (char-to-int ?~)))
669 (aliases (plist-get props 'aliases)) 791 (aliases (plist-get props 'aliases))
670 (hash-table-sym (gentemp (format "%s-encode-table" name))) 792 (hash-table-sym (gentemp (format "%s-encode-table" name)))
671 encode-program decode-program result decode-table encode-table) 793 encode-program decode-program result decode-table encode-table
794 skip-chars invalid-sequences-skip-chars)
672 795
673 ;; Some more sanity checking. 796 ;; Some more sanity checking.
674 (check-argument-range encode-failure-octet 0 #xFF) 797 (check-argument-range encode-failure-octet 0 #xFF)
675 (check-argument-type #'listp aliases) 798 (check-argument-type #'listp aliases)
676 799
683 (make-8-bit-create-decode-encode-tables unicode-map)) 806 (make-8-bit-create-decode-encode-tables unicode-map))
684 807
685 ;; Register the decode-table. 808 ;; Register the decode-table.
686 (define-translation-hash-table hash-table-sym encode-table) 809 (define-translation-hash-table hash-table-sym encode-table)
687 810
688 ;; Generate the programs. 811 ;; Generate the programs and skip-chars strings.
689 (setq decode-program (make-8-bit-generate-decode-program decode-table) 812 (setq decode-program (make-8-bit-generate-decode-program decode-table))
690 encode-program (make-8-bit-generate-encode-program 813 (multiple-value-setq
691 decode-table encode-table encode-failure-octet)) 814 (encode-program skip-chars invalid-sequences-skip-chars)
815 (make-8-bit-generate-encode-program-and-skip-chars-strings
816 decode-table encode-table encode-failure-octet))
817
692 (unless (vectorp encode-program) 818 (unless (vectorp encode-program)
693 (setq encode-program 819 (setq encode-program
694 (apply #'vector 820 (apply #'vector
695 (nsublis (list (cons 'encode-table-sym hash-table-sym)) 821 (nsublis (list (cons 'encode-table-sym hash-table-sym))
696 (copy-tree encode-program))))) 822 (copy-tree encode-program)))))
707 'encode encode-program))) 833 'encode encode-program)))
708 (coding-system-put name '8-bit-fixed t) 834 (coding-system-put name '8-bit-fixed t)
709 (coding-system-put name 'category 835 (coding-system-put name 'category
710 (make-8-bit-choose-category decode-table)) 836 (make-8-bit-choose-category decode-table))
711 (coding-system-put name '8-bit-fixed-query-skip-chars 837 (coding-system-put name '8-bit-fixed-query-skip-chars
712 (skip-chars-quote 838 skip-chars)
713 (apply #'string (append decode-table nil)))) 839 (coding-system-put name '8-bit-fixed-invalid-sequences-skip-chars
840 invalid-sequences-skip-chars)
714 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table) 841 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table)
715
716 (coding-system-put name 'query-coding-function 842 (coding-system-put name 'query-coding-function
717 #'8-bit-fixed-query-coding-region) 843 #'8-bit-fixed-query-coding-region)
718 (coding-system-put (intern (format "%s-unix" name)) 844 (coding-system-put (intern (format "%s-unix" name))
719 'query-coding-function 845 'query-coding-function
720 #'8-bit-fixed-query-coding-region) 846 #'8-bit-fixed-query-coding-region)
749 props (if props (cadr props))) 875 props (if props (cadr props)))
750 (let ((encode-failure-octet 876 (let ((encode-failure-octet
751 (or (plist-get props 'encode-failure-octet) (char-to-int ?~))) 877 (or (plist-get props 'encode-failure-octet) (char-to-int ?~)))
752 (aliases (plist-get props 'aliases)) 878 (aliases (plist-get props 'aliases))
753 encode-program decode-program 879 encode-program decode-program
754 decode-table encode-table) 880 decode-table encode-table
881 skip-chars invalid-sequences-skip-chars)
755 882
756 ;; Some sanity checking. 883 ;; Some sanity checking.
757 (check-argument-range encode-failure-octet 0 #xFF) 884 (check-argument-range encode-failure-octet 0 #xFF)
758 (check-argument-type #'listp aliases) 885 (check-argument-type #'listp aliases)
759 886
760 ;; Don't pass on our extra data to make-coding-system. 887 ;; Don't pass on our extra data to make-coding-system.
761 (setq props (plist-remprop props 'encode-failure-octet) 888 (setq props (plist-remprop props 'encode-failure-octet)
762 props (plist-remprop props 'aliases)) 889 props (plist-remprop props 'aliases))
763 890
764 ;; Work out encode-table and decode-table. 891 ;; Work out encode-table and decode-table
765 (multiple-value-setq 892 (multiple-value-setq
766 (decode-table encode-table) 893 (decode-table encode-table)
767 (make-8-bit-create-decode-encode-tables unicode-map)) 894 (make-8-bit-create-decode-encode-tables unicode-map))
768 895
769 ;; Generate the decode and encode programs. 896 ;; Generate the decode and encode programs, and the skip-chars
770 (setq decode-program (make-8-bit-generate-decode-program decode-table) 897 ;; arguments.
771 encode-program (make-8-bit-generate-encode-program 898 (setq decode-program (make-8-bit-generate-decode-program decode-table))
772 decode-table encode-table encode-failure-octet)) 899 (multiple-value-setq
900 (encode-program skip-chars invalid-sequences-skip-chars)
901 (make-8-bit-generate-encode-program-and-skip-chars-strings
902 decode-table encode-table encode-failure-octet))
773 903
774 ;; And return the generated code. 904 ;; And return the generated code.
775 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name))) 905 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name)))
776 ;; The case-fold-search bind shouldn't be necessary. If I take
777 ;; it, out, though, I get:
778 ;;
779 ;; (invalid-read-syntax "Multiply defined symbol label" 1)
780 ;;
781 ;; when the file is byte compiled.
782 (case-fold-search t)
783 (encode-table ,encode-table)) 906 (encode-table ,encode-table))
784 (define-translation-hash-table encode-table-sym encode-table) 907 (define-translation-hash-table encode-table-sym encode-table)
785 (make-coding-system 908 (make-coding-system
786 ',name 'ccl ,description 909 ',name 'ccl ,description
787 (plist-put (plist-put ',props 'decode 910 (plist-put (plist-put ',props 'decode
795 ',encode-program)))) 918 ',encode-program))))
796 (coding-system-put ',name '8-bit-fixed t) 919 (coding-system-put ',name '8-bit-fixed t)
797 (coding-system-put ',name 'category 920 (coding-system-put ',name 'category
798 ',(make-8-bit-choose-category decode-table)) 921 ',(make-8-bit-choose-category decode-table))
799 (coding-system-put ',name '8-bit-fixed-query-skip-chars 922 (coding-system-put ',name '8-bit-fixed-query-skip-chars
800 ',(skip-chars-quote 923 ,skip-chars)
801 (apply #'string (append decode-table nil)))) 924 (coding-system-put ',name '8-bit-fixed-invalid-sequences-skip-chars
925 ,invalid-sequences-skip-chars)
802 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table) 926 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table)
803 (coding-system-put ',name 'query-coding-function 927 (coding-system-put ',name 'query-coding-function
804 #'8-bit-fixed-query-coding-region) 928 #'8-bit-fixed-query-coding-region)
805 (coding-system-put ',(intern (format "%s-unix" name)) 929 (coding-system-put ',(intern (format "%s-unix" name))
806 'query-coding-function 930 'query-coding-function
817 (find-coding-system ',name))))) 941 (find-coding-system ',name)))))
818 942
819 ;; Ideally this would be in latin.el, but code-init.el uses it. 943 ;; Ideally this would be in latin.el, but code-init.el uses it.
820 (make-8-bit-coding-system 944 (make-8-bit-coding-system
821 'iso-8859-1 945 'iso-8859-1
822 '() ;; No differences from Latin 1. 946 (loop
947 for i from #x80 to #xff
948 collect (list i (int-char i))) ;; Identical to Latin-1.
823 "ISO-8859-1 (Latin-1)" 949 "ISO-8859-1 (Latin-1)"
824 '(mnemonic "Latin 1" 950 '(mnemonic "Latin 1"
825 documentation "The most used encoding of Western Europe and the Americas." 951 documentation "The most used encoding of Western Europe and the Americas."
826 aliases (iso-latin-1 latin-1))) 952 aliases (iso-latin-1 latin-1)))