Mercurial > hg > xemacs-beta
comparison lisp/mule/mule-coding.el @ 4604:e0a8715fdb1f
Support new IGNORE-INVALID-SEQUENCESP argument, #'query-coding-region.
lisp/ChangeLog addition:
2009-02-07 Aidan Kehoe <kehoea@parhasard.net>
* coding.el (query-coding-clear-highlights):
Rename the BUFFER argument to BUFFER-OR-STRING, describe it as
possibly being a string in its documentation.
(default-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document that this
function does not support it.
Bind case-fold-search to nil, we don't want this to influence what the
function thinks is encodable or not.
(query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does; reflect this new argument in the associated compiler macro.
(query-coding-string):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does. Support the HIGHLIGHT argument correctly.
* unicode.el (unicode-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does, implement this. Document a potential problem.
Use #'query-coding-clear-highlights instead of reimplementing it
ourselves.
Remove some debugging messages.
* mule/arabic.el (iso-8859-6):
* mule/cyrillic.el (iso-8859-5):
* mule/greek.el (iso-8859-7):
* mule/hebrew.el (iso-8859-8):
* mule/latin.el (iso-8859-2):
* mule/latin.el (iso-8859-3):
* mule/latin.el (iso-8859-4):
* mule/latin.el (iso-8859-14):
* mule/latin.el (iso-8859-15):
* mule/latin.el (iso-8859-16):
* mule/latin.el (iso-8859-9):
* mule/latin.el (windows-1252):
* mule/mule-coding.el (iso-8859-1):
Avoid the assumption that characters not given an explicit mapping
in these coding systems map to the ISO 8859-1 characters
corresponding to the octets on disk; this makes it much more
reasonable to implement the IGNORE-INVALID-SEQUENCESP argument to
query-coding-region.
* mule/mule-cmds.el (set-language-info):
Correct the docstring.
* mule/mule-cmds.el (finish-set-language-environment):
Treat invalid Unicode sequences produced from
invalid-sequence-coding-system and corresponding to control
characters the same as control characters in redisplay.
* mule/mule-cmds.el:
Document that encode-coding-char is available in coding.el
* mule/mule-coding.el (make-8-bit-generate-helper):
Change to return the both the encode-program generated and the
relevant non-ASCII charset; update the docstring to reflect this.
* mule/mule-coding.el
(make-8-bit-generate-encode-program-and-skip-chars-strings):
Rename this function; have it return skip-chars-strings as well as
the encode program. Have these skip-chars-strings use ranges for
charsets, where possible.
* mule/mule-coding.el (make-8-bit-create-decode-encode-tables):
Revise this to allow people to specify explicitly characters that
should be undefined (= corresponding to keys in
unicode-error-default-translation-table), and treating unspecified
octets above #x7f as undefined by default.
* mule/mule-coding.el (8-bit-fixed-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, implement support
for it using the 8-bit-fixed-invalid-sequences-skip-chars coding
system property; remove some debugging messages.
* mule/mule-coding.el (make-8-bit-coding-system):
This function is dumped, autoloading it makes no sense.
Document what happens when characters above #x7f are not
specified, implement this.
* mule/vietnamese.el:
Correct spelling.
tests/ChangeLog addition:
2009-02-07 Aidan Kehoe <kehoea@parhasard.net>
* automated/query-coding-tests.el:
Add FAILING-CASE arguments to the Assert calls, making #'q-c-debug
mostly unnecessary. Remove #'q-c-debug.
Add new tests that use the IGNORE-INVALID-SEQUENCESP argument to
#'query-coding-region; rework the existing ones to respect it.
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 07 Feb 2009 17:13:37 +0000 |
parents | 1d74a1d115ee |
children | c786c3fd0740 |
comparison
equal
deleted
inserted
replaced
4603:202cb69c4d87 | 4604:e0a8715fdb1f |
---|---|
229 system map to distinct XEmacs characters, preventing a spurious changes when | 229 system map to distinct XEmacs characters, preventing a spurious changes when |
230 a file is read, not changed, and then written. ") | 230 a file is read, not changed, and then written. ") |
231 | 231 |
232 (defun make-8-bit-generate-helper (decode-table encode-table | 232 (defun make-8-bit-generate-helper (decode-table encode-table |
233 encode-failure-octet) | 233 encode-failure-octet) |
234 "Helper function for `make-8-bit-generate-encode-program', which see. | 234 "Helper function, `make-8-bit-generate-encode-program-and-skip-chars-strings', |
235 which see. | |
235 | 236 |
236 Deals with the case where ASCII and another character set can both be | 237 Deals with the case where ASCII and another character set can both be |
237 encoded unambiguously and completely into the coding-system; if this is so, | 238 encoded unambiguously and completely into the coding-system; if this is so, |
238 returns a list corresponding to such a ccl-program. If not, it returns nil. " | 239 returns a list comprised of such a ccl-program and the character set in |
240 question. If not, it returns a list with both entries nil." | |
239 (let ((tentative-encode-program-parts | 241 (let ((tentative-encode-program-parts |
240 (eval-when-compile | 242 (eval-when-compile |
241 (let* ((vec-len 128) | 243 (let* ((vec-len 128) |
242 (compiled | 244 (compiled |
243 (append | 245 (append |
335 (copy-list (first | 337 (copy-list (first |
336 tentative-encode-program-parts)) | 338 tentative-encode-program-parts)) |
337 (append other-charset-vector nil) | 339 (append other-charset-vector nil) |
338 (copy-tree (second | 340 (copy-tree (second |
339 tentative-encode-program-parts)))))) | 341 tentative-encode-program-parts)))))) |
340 encode-program)) | 342 (values encode-program worth-trying))) |
341 | 343 |
342 (defun make-8-bit-generate-encode-program (decode-table encode-table | 344 (defun make-8-bit-generate-encode-program-and-skip-chars-strings |
343 encode-failure-octet) | 345 (decode-table encode-table encode-failure-octet) |
344 "Generate a CCL program to decode a 8-bit fixed-width charset. | 346 "Generate a CCL program to encode a 8-bit fixed-width charset. |
345 | 347 |
346 DECODE-TABLE must have 256 non-cons entries, and will be regarded as | 348 DECODE-TABLE must have 256 non-cons entries, and will be regarded as |
347 describing a map from the octet corresponding to an offset in the | 349 describing a map from the octet corresponding to an offset in the |
348 table to the that entry in the table. ENCODE-TABLE is a hash table | 350 table to the that entry in the table. ENCODE-TABLE is a hash table |
349 map from unicode values to characters in the range [0,255]. | 351 map from unicode values to characters in the range [0,255]. |
397 nil | 399 nil |
398 "This code assumes that the constant #xBEEF is #xBEEF14 \ | 400 "This code assumes that the constant #xBEEF is #xBEEF14 \ |
399 in compiled CCL code.\nIf that is not the case, and it appears not to | 401 in compiled CCL code.\nIf that is not the case, and it appears not to |
400 be--that's why you're getting this message--it will not work. ") | 402 be--that's why you're getting this message--it will not work. ") |
401 prog))) | 403 prog))) |
402 (ascii-encodes-as-itself nil)) | 404 (ascii-encodes-as-itself nil) |
405 (control-1-encodes-as-itself t) | |
406 (invalid-sequence-code-point-start | |
407 (eval-when-compile | |
408 (char-to-unicode | |
409 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) | |
410 further-char-set skip-chars invalid-sequences-skip-chars) | |
403 | 411 |
404 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash | 412 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash |
405 ;; table lookup for those characters. | 413 ;; table lookup for those characters. |
406 (loop | 414 (loop |
407 for i from #x00 to #x7f | 415 for i from #x00 to #x7f |
416 (if (null ascii-encodes-as-itself) | 424 (if (null ascii-encodes-as-itself) |
417 ;; General encode program. Pros; general and correct. Cons; | 425 ;; General encode program. Pros; general and correct. Cons; |
418 ;; slow, a hash table lookup + mule-unicode conversion is done | 426 ;; slow, a hash table lookup + mule-unicode conversion is done |
419 ;; for every character encoding. | 427 ;; for every character encoding. |
420 (setq encode-program general-encode-program) | 428 (setq encode-program general-encode-program) |
421 (setq encode-program | 429 (multiple-value-setq |
422 ;; Encode program with ascii-ascii mapping (based on a | 430 (encode-program further-char-set) |
423 ;; character's mule character set), and one other mule | 431 ;; Encode program with ascii-ascii mapping (based on a |
424 ;; character set using table-based encoding, other | 432 ;; character's mule character set), and one other mule |
425 ;; character sets using hash table lookups. | 433 ;; character set using table-based encoding, other |
426 ;; make-8-bit-non-ascii-completely-coveredp only returns | 434 ;; character sets using hash table lookups. |
427 ;; such a mapping if some non-ASCII charset with | 435 ;; make-8-bit-non-ascii-completely-coveredp only returns |
428 ;; characters in decode-table is entirely covered by | 436 ;; such a mapping if some non-ASCII charset with |
429 ;; encode-table. | 437 ;; characters in decode-table is entirely covered by |
430 (make-8-bit-generate-helper decode-table encode-table | 438 ;; encode-table. |
431 encode-failure-octet)) | 439 (make-8-bit-generate-helper decode-table encode-table |
440 encode-failure-octet)) | |
432 (unless encode-program | 441 (unless encode-program |
433 ;; If make-8-bit-non-ascii-completely-coveredp returned nil, | 442 ;; If make-8-bit-non-ascii-completely-coveredp returned nil, |
434 ;; but ASCII still encodes as itself, do one-to-one mapping | 443 ;; but ASCII still encodes as itself, do one-to-one mapping |
435 ;; for ASCII, and a hash table lookup for everything else. | 444 ;; for ASCII, and a hash table lookup for everything else. |
436 (setq encode-program encode-program-with-ascii-optimisation))) | 445 (setq encode-program encode-program-with-ascii-optimisation))) |
439 (nsublis | 448 (nsublis |
440 (list (cons #xBEEF14 | 449 (list (cons #xBEEF14 |
441 (logior (lsh encode-failure-octet 8) | 450 (logior (lsh encode-failure-octet 8) |
442 #x14))) | 451 #x14))) |
443 (copy-tree encode-program))) | 452 (copy-tree encode-program))) |
444 encode-program)) | 453 (loop |
454 for i from #x80 to #x9f | |
455 do (unless (= i (aref decode-table i)) | |
456 (setq control-1-encodes-as-itself nil) | |
457 (return))) | |
458 (loop | |
459 for i from #x00 to #xFF | |
460 initially (setq skip-chars | |
461 (cond | |
462 ((and ascii-encodes-as-itself | |
463 control-1-encodes-as-itself further-char-set) | |
464 (concat "\x00-\x9f" (charset-skip-chars-string | |
465 further-char-set))) | |
466 ((and ascii-encodes-as-itself | |
467 control-1-encodes-as-itself) | |
468 "\x00-\x9f") | |
469 ((null ascii-encodes-as-itself) | |
470 (skip-chars-quote (apply #'string | |
471 (append decode-table nil)))) | |
472 (further-char-set | |
473 (concat (charset-skip-chars-string 'ascii) | |
474 (charset-skip-chars-string further-char-set))) | |
475 (t | |
476 (charset-skip-chars-string 'ascii))) | |
477 invalid-sequences-skip-chars "") | |
478 with decoded-ucs = nil | |
479 with decoded = nil | |
480 with no-ascii-transparency-skip-chars-list = | |
481 (unless ascii-encodes-as-itself (append decode-table nil)) | |
482 ;; Can't use #'match-string here, see: | |
483 ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net | |
484 with skip-chars-test = | |
485 #'(lambda (skip-chars-string testing) | |
486 (with-temp-buffer | |
487 (insert testing) | |
488 (goto-char (point-min)) | |
489 (skip-chars-forward skip-chars-string) | |
490 (= (point) (point-max)))) | |
491 do | |
492 (setq decoded (aref decode-table i) | |
493 decoded-ucs (char-to-unicode decoded)) | |
494 (cond | |
495 ((<= invalid-sequence-code-point-start decoded-ucs | |
496 (+ invalid-sequence-code-point-start #xFF)) | |
497 (setq invalid-sequences-skip-chars | |
498 (concat (string decoded) | |
499 invalid-sequences-skip-chars)) | |
500 (assert (not (funcall skip-chars-test skip-chars decoded)) | |
501 "This char should only be skipped with \ | |
502 `invalid-sequences-skip-chars', not by `skip-chars'")) | |
503 ((not (funcall skip-chars-test skip-chars decoded)) | |
504 (if ascii-encodes-as-itself | |
505 (setq skip-chars (concat skip-chars (string decoded))) | |
506 (push decoded no-ascii-transparency-skip-chars-list)))) | |
507 finally (unless ascii-encodes-as-itself | |
508 (setq skip-chars | |
509 (skip-chars-quote | |
510 (apply #'string | |
511 no-ascii-transparency-skip-chars-list))))) | |
512 (values encode-program skip-chars invalid-sequences-skip-chars))) | |
445 | 513 |
446 (defun make-8-bit-create-decode-encode-tables (unicode-map) | 514 (defun make-8-bit-create-decode-encode-tables (unicode-map) |
447 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. | 515 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. |
448 UNICODE-MAP should be an alist mapping from integer octet values to | 516 UNICODE-MAP should be an alist mapping from integer octet values to |
449 characters with UCS code points; DECODE-TABLE will be a 256-element | 517 characters with UCS code points; DECODE-TABLE will be a 256-element |
451 to 256 distinct characters. " | 519 to 256 distinct characters. " |
452 (check-argument-type #'listp unicode-map) | 520 (check-argument-type #'listp unicode-map) |
453 (let ((decode-table (make-vector 256 nil)) | 521 (let ((decode-table (make-vector 256 nil)) |
454 (encode-table (make-hash-table :size 256)) | 522 (encode-table (make-hash-table :size 256)) |
455 (private-use-start (encode-char make-8-bit-private-use-start 'ucs)) | 523 (private-use-start (encode-char make-8-bit-private-use-start 'ucs)) |
456 desired-ucs) | 524 (invalid-sequence-code-point-start |
525 (eval-when-compile | |
526 (char-to-unicode | |
527 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) | |
528 desired-ucs decode-table-entry) | |
457 | 529 |
458 (loop for (external internal) | 530 (loop for (external internal) |
459 in unicode-map | 531 in unicode-map |
460 do | 532 do |
461 (aset decode-table external internal) | 533 (aset decode-table external internal) |
473 ;; for lookup-integer in CCL means we need to store it as a | 545 ;; for lookup-integer in CCL means we need to store it as a |
474 ;; character. | 546 ;; character. |
475 (int-to-char external) | 547 (int-to-char external) |
476 encode-table)) | 548 encode-table)) |
477 | 549 |
478 ;; Now, go through the decode table looking at the characters that | 550 ;; Now, go through the decode table. For octet values above #x7f, if the |
479 ;; remain nil. If the XEmacs character with that integer is already in | 551 ;; decode table entry is nil, this means that they have an undefined |
480 ;; the encode table, map the on-disk octet to a Unicode private use | 552 ;; mapping (= they map to XEmacs characters with keys in |
481 ;; character. Otherwise map the on-disk octet to the XEmacs character | 553 ;; unicode-error-default-translation-table); for octet values below or |
482 ;; with that numeric value, to make it clearer what it is. | 554 ;; equal to #x7f, it means that they map to ASCII. |
555 | |
556 ;; If any entry (whether below or above #x7f) in the decode-table | |
557 ;; already maps to some character with a key in | |
558 ;; unicode-error-default-translation-table, it is treated as an | |
559 ;; undefined octet by `query-coding-region'. That is, it is not | |
560 ;; necessary for an octet value to be above #x7f for this to happen. | |
561 | |
483 (dotimes (i 256) | 562 (dotimes (i 256) |
484 (when (null (aref decode-table i)) | 563 (setq decode-table-entry (aref decode-table i)) |
485 ;; Find a free code point. | 564 (if decode-table-entry |
486 (setq desired-ucs i) | 565 (when (get-char-table |
487 (while (gethash desired-ucs encode-table) | 566 decode-table-entry |
488 ;; In the normal case, the code point chosen will be U+E0XY, where | 567 unicode-error-default-translation-table) |
489 ;; XY is the hexadecimal octet on disk. In pathological cases | 568 ;; The caller is explicitly specifying that this octet |
490 ;; it'll be something else. | 569 ;; corresponds to an invalid sequence on disk: |
491 (setq desired-ucs (+ private-use-start desired-ucs) | 570 (assert (= (get-char-table |
492 private-use-start (+ private-use-start 1))) | 571 decode-table-entry |
493 (puthash desired-ucs (int-to-char i) encode-table) | 572 unicode-error-default-translation-table) i) |
573 "Bad argument to `make-8-bit-coding-system'. | |
574 If you're going to designate an octet with value below #x80 as invalid | |
575 for this coding system, make sure to map it to the invalid sequence | |
576 character corresponding to its octet value on disk. ")) | |
577 | |
578 ;; decode-table-entry is nil; either the octet is to be treated as | |
579 ;; contributing to an error sequence (when (> #x7f i)), or it should | |
580 ;; be attempted to treat it as ASCII-equivalent. | |
581 (setq desired-ucs (or (and (< i #x80) i) | |
582 (+ invalid-sequence-code-point-start i))) | |
583 (while (gethash desired-ucs encode-table) | |
584 (assert (not (< i #x80)) | |
585 "UCS code point should not already be in encode-table!" | |
586 ;; There is one invalid sequence char per octet value; | |
587 ;; with eight-bit-fixed coding systems, it makes no sense | |
588 ;; for us to be multiply allocating them. | |
589 (gethash desired-ucs encode-table)) | |
590 (setq desired-ucs (+ private-use-start desired-ucs) | |
591 private-use-start (+ private-use-start 1))) | |
592 (puthash desired-ucs (int-to-char i) encode-table) | |
494 (setq desired-ucs (if (> desired-ucs #xFF) | 593 (setq desired-ucs (if (> desired-ucs #xFF) |
495 (decode-char 'ucs desired-ucs) | 594 (unicode-to-char desired-ucs) |
496 ;; So we get Latin-1 when run at dump time, | 595 ;; So we get Latin-1 when run at dump time, |
497 ;; instead of JIT-allocated characters. | 596 ;; instead of JIT-allocated characters. |
498 (int-to-char desired-ucs))) | 597 (int-to-char desired-ucs))) |
499 (aset decode-table i desired-ucs))) | 598 (aset decode-table i desired-ucs))) |
500 (values decode-table encode-table))) | 599 (values decode-table encode-table))) |
544 for i from #x80 to #x9F | 643 for i from #x80 to #x9F |
545 do (unless (= i (aref decode-table i)) | 644 do (unless (= i (aref decode-table i)) |
546 (return-from category 'no-conversion)) | 645 (return-from category 'no-conversion)) |
547 finally return 'iso-8-1)) | 646 finally return 'iso-8-1)) |
548 | 647 |
549 (defun 8-bit-fixed-query-coding-region (begin end coding-system | 648 (defun 8-bit-fixed-query-coding-region (begin end coding-system &optional |
550 &optional buffer errorp highlightp) | 649 buffer ignore-invalid-sequencesp |
650 errorp highlightp) | |
551 "The `query-coding-region' implementation for 8-bit-fixed coding systems. | 651 "The `query-coding-region' implementation for 8-bit-fixed coding systems. |
552 | 652 |
553 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars' | 653 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars' |
554 coding system properties. The former is a hash table mapping from valid | 654 coding system properties. The former is a hash table mapping from valid |
555 Unicode code points to on-disk octets in the coding system; the latter a set | 655 Unicode code points to on-disk octets in the coding system; the latter a set |
568 '8-bit-fixed-query-from-unicode))) | 668 '8-bit-fixed-query-from-unicode))) |
569 (skip-chars-arg | 669 (skip-chars-arg |
570 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars) | 670 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars) |
571 (coding-system-get (coding-system-base coding-system) | 671 (coding-system-get (coding-system-base coding-system) |
572 '8-bit-fixed-query-skip-chars))) | 672 '8-bit-fixed-query-skip-chars))) |
673 (invalid-sequences-skip-chars | |
674 (or (coding-system-get coding-system | |
675 '8-bit-fixed-invalid-sequences-skip-chars) | |
676 (coding-system-get (coding-system-base coding-system) | |
677 '8-bit-fixed-invalid-sequences-skip-chars))) | |
573 (ranges (make-range-table)) | 678 (ranges (make-range-table)) |
679 (case-fold-search nil) | |
574 char-after fail-range-start fail-range-end previous-fail extent | 680 char-after fail-range-start fail-range-end previous-fail extent |
575 failed) | 681 failed invalid-sequences-looking-at failed-reason |
682 previous-failed-reason) | |
576 (check-type from-unicode hash-table) | 683 (check-type from-unicode hash-table) |
577 (check-type skip-chars-arg string) | 684 (check-type skip-chars-arg string) |
685 (check-type invalid-sequences-skip-chars string) | |
686 (setq invalid-sequences-looking-at | |
687 (if (equal "" invalid-sequences-skip-chars) | |
688 ;; Regexp that will never match. | |
689 #r".\{0,0\}" | |
690 (concat "[" invalid-sequences-skip-chars "]"))) | |
691 (when ignore-invalid-sequencesp | |
692 (setq skip-chars-arg | |
693 (concat skip-chars-arg invalid-sequences-skip-chars))) | |
578 (save-excursion | 694 (save-excursion |
579 (when highlightp | 695 (when highlightp |
580 (map-extents #'(lambda (extent ignored-arg) | 696 (query-coding-clear-highlights begin end buffer)) |
581 (when (eq 'query-coding-warning-face | |
582 (extent-face extent)) | |
583 (delete-extent extent))) buffer begin end)) | |
584 (goto-char begin buffer) | 697 (goto-char begin buffer) |
585 (skip-chars-forward skip-chars-arg end buffer) | 698 (skip-chars-forward skip-chars-arg end buffer) |
586 (while (< (point buffer) end) | 699 (while (< (point buffer) end) |
587 ; (message | |
588 ; "fail-range-start is %S, previous-fail %S, point is %S, end is %S" | |
589 ; fail-range-start previous-fail (point buffer) end) | |
590 (setq char-after (char-after (point buffer) buffer) | 700 (setq char-after (char-after (point buffer) buffer) |
591 fail-range-start (point buffer)) | 701 fail-range-start (point buffer)) |
592 ; (message "arguments are %S %S" | |
593 ; (< (point buffer) end) | |
594 ; (not (gethash (encode-char char-after 'ucs) from-unicode))) | |
595 (while (and | 702 (while (and |
596 (< (point buffer) end) | 703 (< (point buffer) end) |
597 (not (gethash (encode-char char-after 'ucs) from-unicode))) | 704 (or (and |
705 (not (gethash (encode-char char-after 'ucs) from-unicode)) | |
706 (setq failed-reason 'unencodable)) | |
707 (and (not ignore-invalid-sequencesp) | |
708 (looking-at invalid-sequences-looking-at buffer) | |
709 (setq failed-reason 'invalid-sequence))) | |
710 (or (null previous-failed-reason) | |
711 (eq previous-failed-reason failed-reason))) | |
598 (forward-char 1 buffer) | 712 (forward-char 1 buffer) |
599 (setq char-after (char-after (point buffer) buffer) | 713 (setq char-after (char-after (point buffer) buffer) |
600 failed t)) | 714 failed t |
715 previous-failed-reason failed-reason)) | |
601 (if (= fail-range-start (point buffer)) | 716 (if (= fail-range-start (point buffer)) |
602 ;; The character can actually be encoded by the coding | 717 ;; The character can actually be encoded by the coding |
603 ;; system; check the characters past it. | 718 ;; system; check the characters past it. |
604 (forward-char 1 buffer) | 719 (forward-char 1 buffer) |
605 ;; The character actually failed. | 720 ;; The character actually failed. |
606 ; (message "past the move through, point now %S" (point buffer)) | |
607 (when errorp | 721 (when errorp |
608 (error 'text-conversion-error | 722 (error 'text-conversion-error |
609 (format "Cannot encode %s using coding system" | 723 (format "Cannot encode %s using coding system" |
610 (buffer-substring fail-range-start (point buffer) | 724 (buffer-substring fail-range-start (point buffer) |
611 buffer)) | 725 buffer)) |
612 (coding-system-name coding-system))) | 726 (coding-system-name coding-system))) |
727 (assert (not (null previous-failed-reason)) t | |
728 "previous-failed-reason should always be non-nil here") | |
613 (put-range-table fail-range-start | 729 (put-range-table fail-range-start |
614 ;; If char-after is non-nil, we're not at | 730 ;; If char-after is non-nil, we're not at |
615 ;; the end of the buffer. | 731 ;; the end of the buffer. |
616 (setq fail-range-end (if char-after | 732 (setq fail-range-end (if char-after |
617 (point buffer) | 733 (point buffer) |
618 (point-max buffer))) | 734 (point-max buffer))) |
619 t ranges) | 735 previous-failed-reason ranges) |
736 (setq previous-failed-reason nil) | |
620 (when highlightp | 737 (when highlightp |
621 ; (message "highlighting") | |
622 (setq extent (make-extent fail-range-start fail-range-end buffer)) | 738 (setq extent (make-extent fail-range-start fail-range-end buffer)) |
623 (set-extent-priority extent (+ mouse-highlight-priority 2)) | 739 (set-extent-priority extent (+ mouse-highlight-priority 2)) |
624 (set-extent-face extent 'query-coding-warning-face)) | 740 (set-extent-face extent 'query-coding-warning-face)) |
625 (skip-chars-forward skip-chars-arg end buffer))) | 741 (skip-chars-forward skip-chars-arg end buffer))) |
626 ; (message "about to give the result, ranges %S" ranges) | |
627 (if failed | 742 (if failed |
628 (values nil ranges) | 743 (values nil ranges) |
629 (values t nil))))) | 744 (values t nil))))) |
630 | 745 |
631 ;;;###autoload | |
632 (defun make-8-bit-coding-system (name unicode-map &optional description props) | 746 (defun make-8-bit-coding-system (name unicode-map &optional description props) |
633 "Make and return a fixed-width 8-bit CCL coding system named NAME. | 747 "Make and return a fixed-width 8-bit CCL coding system named NAME. |
634 NAME must be a symbol, and UNICODE-MAP a list. | 748 NAME must be a symbol, and UNICODE-MAP a list. |
635 | 749 |
636 UNICODE-MAP is a plist describing a map from octets in the coding | 750 UNICODE-MAP is a plist describing a map from octets in the coding |
642 distinct when written to disk, which is normally what is intended; it | 756 distinct when written to disk, which is normally what is intended; it |
643 also means that East Asian Han characters from different XEmacs | 757 also means that East Asian Han characters from different XEmacs |
644 character sets will not be distinct when written to disk, which is | 758 character sets will not be distinct when written to disk, which is |
645 less often what is intended. | 759 less often what is intended. |
646 | 760 |
647 Any octets not mapped will be decoded into the ISO 8859-1 characters with | 761 Any octets not mapped, and with values above #x7f, will be decoded into |
648 the corresponding numeric value; unless another octet maps to that | 762 XEmacs characters that reflect that their values are undefined. These |
649 character, in which case the Unicode private use area will be used. This | 763 characters will be displayed in a language-environment-specific way. See |
650 avoids spurious changes to files on disk when they contain octets that would | 764 `unicode-error-default-translation-table' and the |
651 be otherwise remapped to the canonical values for the corresponding | 765 `invalid-sequence-coding-system' argument to `set-language-info'. |
652 characters in the coding system. | 766 |
767 These characters will normally be treated as invalid when checking whether | |
768 text can be encoded with `query-coding-region'--see the | |
769 IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It is | |
770 possible to specify that octets with values less than #x80 (or indeed | |
771 greater than it) be treated in this way, by specifying explicitly that they | |
772 correspond to the character mapping to that octet in | |
773 `unicode-error-default-translation-table'. Far fewer coding systems | |
774 override the ASCII mapping, though, so this is not the default. | |
653 | 775 |
654 DESCRIPTION and PROPS are as in `make-coding-system', which see. This | 776 DESCRIPTION and PROPS are as in `make-coding-system', which see. This |
655 function also accepts two additional (optional) properties in PROPS; | 777 function also accepts two additional (optional) properties in PROPS; |
656 `aliases', giving a list of aliases to be initialized for this | 778 `aliases', giving a list of aliases to be initialized for this |
657 coding-system, and `encode-failure-octet', an integer between 0 and 256 to | 779 coding-system, and `encode-failure-octet', an integer between 0 and 256 to |
666 (check-valid-plist props) | 788 (check-valid-plist props) |
667 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet) | 789 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet) |
668 (char-to-int ?~))) | 790 (char-to-int ?~))) |
669 (aliases (plist-get props 'aliases)) | 791 (aliases (plist-get props 'aliases)) |
670 (hash-table-sym (gentemp (format "%s-encode-table" name))) | 792 (hash-table-sym (gentemp (format "%s-encode-table" name))) |
671 encode-program decode-program result decode-table encode-table) | 793 encode-program decode-program result decode-table encode-table |
794 skip-chars invalid-sequences-skip-chars) | |
672 | 795 |
673 ;; Some more sanity checking. | 796 ;; Some more sanity checking. |
674 (check-argument-range encode-failure-octet 0 #xFF) | 797 (check-argument-range encode-failure-octet 0 #xFF) |
675 (check-argument-type #'listp aliases) | 798 (check-argument-type #'listp aliases) |
676 | 799 |
683 (make-8-bit-create-decode-encode-tables unicode-map)) | 806 (make-8-bit-create-decode-encode-tables unicode-map)) |
684 | 807 |
685 ;; Register the decode-table. | 808 ;; Register the decode-table. |
686 (define-translation-hash-table hash-table-sym encode-table) | 809 (define-translation-hash-table hash-table-sym encode-table) |
687 | 810 |
688 ;; Generate the programs. | 811 ;; Generate the programs and skip-chars strings. |
689 (setq decode-program (make-8-bit-generate-decode-program decode-table) | 812 (setq decode-program (make-8-bit-generate-decode-program decode-table)) |
690 encode-program (make-8-bit-generate-encode-program | 813 (multiple-value-setq |
691 decode-table encode-table encode-failure-octet)) | 814 (encode-program skip-chars invalid-sequences-skip-chars) |
815 (make-8-bit-generate-encode-program-and-skip-chars-strings | |
816 decode-table encode-table encode-failure-octet)) | |
817 | |
692 (unless (vectorp encode-program) | 818 (unless (vectorp encode-program) |
693 (setq encode-program | 819 (setq encode-program |
694 (apply #'vector | 820 (apply #'vector |
695 (nsublis (list (cons 'encode-table-sym hash-table-sym)) | 821 (nsublis (list (cons 'encode-table-sym hash-table-sym)) |
696 (copy-tree encode-program))))) | 822 (copy-tree encode-program))))) |
707 'encode encode-program))) | 833 'encode encode-program))) |
708 (coding-system-put name '8-bit-fixed t) | 834 (coding-system-put name '8-bit-fixed t) |
709 (coding-system-put name 'category | 835 (coding-system-put name 'category |
710 (make-8-bit-choose-category decode-table)) | 836 (make-8-bit-choose-category decode-table)) |
711 (coding-system-put name '8-bit-fixed-query-skip-chars | 837 (coding-system-put name '8-bit-fixed-query-skip-chars |
712 (skip-chars-quote | 838 skip-chars) |
713 (apply #'string (append decode-table nil)))) | 839 (coding-system-put name '8-bit-fixed-invalid-sequences-skip-chars |
840 invalid-sequences-skip-chars) | |
714 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table) | 841 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table) |
715 | |
716 (coding-system-put name 'query-coding-function | 842 (coding-system-put name 'query-coding-function |
717 #'8-bit-fixed-query-coding-region) | 843 #'8-bit-fixed-query-coding-region) |
718 (coding-system-put (intern (format "%s-unix" name)) | 844 (coding-system-put (intern (format "%s-unix" name)) |
719 'query-coding-function | 845 'query-coding-function |
720 #'8-bit-fixed-query-coding-region) | 846 #'8-bit-fixed-query-coding-region) |
749 props (if props (cadr props))) | 875 props (if props (cadr props))) |
750 (let ((encode-failure-octet | 876 (let ((encode-failure-octet |
751 (or (plist-get props 'encode-failure-octet) (char-to-int ?~))) | 877 (or (plist-get props 'encode-failure-octet) (char-to-int ?~))) |
752 (aliases (plist-get props 'aliases)) | 878 (aliases (plist-get props 'aliases)) |
753 encode-program decode-program | 879 encode-program decode-program |
754 decode-table encode-table) | 880 decode-table encode-table |
881 skip-chars invalid-sequences-skip-chars) | |
755 | 882 |
756 ;; Some sanity checking. | 883 ;; Some sanity checking. |
757 (check-argument-range encode-failure-octet 0 #xFF) | 884 (check-argument-range encode-failure-octet 0 #xFF) |
758 (check-argument-type #'listp aliases) | 885 (check-argument-type #'listp aliases) |
759 | 886 |
760 ;; Don't pass on our extra data to make-coding-system. | 887 ;; Don't pass on our extra data to make-coding-system. |
761 (setq props (plist-remprop props 'encode-failure-octet) | 888 (setq props (plist-remprop props 'encode-failure-octet) |
762 props (plist-remprop props 'aliases)) | 889 props (plist-remprop props 'aliases)) |
763 | 890 |
764 ;; Work out encode-table and decode-table. | 891 ;; Work out encode-table and decode-table |
765 (multiple-value-setq | 892 (multiple-value-setq |
766 (decode-table encode-table) | 893 (decode-table encode-table) |
767 (make-8-bit-create-decode-encode-tables unicode-map)) | 894 (make-8-bit-create-decode-encode-tables unicode-map)) |
768 | 895 |
769 ;; Generate the decode and encode programs. | 896 ;; Generate the decode and encode programs, and the skip-chars |
770 (setq decode-program (make-8-bit-generate-decode-program decode-table) | 897 ;; arguments. |
771 encode-program (make-8-bit-generate-encode-program | 898 (setq decode-program (make-8-bit-generate-decode-program decode-table)) |
772 decode-table encode-table encode-failure-octet)) | 899 (multiple-value-setq |
900 (encode-program skip-chars invalid-sequences-skip-chars) | |
901 (make-8-bit-generate-encode-program-and-skip-chars-strings | |
902 decode-table encode-table encode-failure-octet)) | |
773 | 903 |
774 ;; And return the generated code. | 904 ;; And return the generated code. |
775 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name))) | 905 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name))) |
776 ;; The case-fold-search bind shouldn't be necessary. If I take | |
777 ;; it, out, though, I get: | |
778 ;; | |
779 ;; (invalid-read-syntax "Multiply defined symbol label" 1) | |
780 ;; | |
781 ;; when the file is byte compiled. | |
782 (case-fold-search t) | |
783 (encode-table ,encode-table)) | 906 (encode-table ,encode-table)) |
784 (define-translation-hash-table encode-table-sym encode-table) | 907 (define-translation-hash-table encode-table-sym encode-table) |
785 (make-coding-system | 908 (make-coding-system |
786 ',name 'ccl ,description | 909 ',name 'ccl ,description |
787 (plist-put (plist-put ',props 'decode | 910 (plist-put (plist-put ',props 'decode |
795 ',encode-program)))) | 918 ',encode-program)))) |
796 (coding-system-put ',name '8-bit-fixed t) | 919 (coding-system-put ',name '8-bit-fixed t) |
797 (coding-system-put ',name 'category | 920 (coding-system-put ',name 'category |
798 ',(make-8-bit-choose-category decode-table)) | 921 ',(make-8-bit-choose-category decode-table)) |
799 (coding-system-put ',name '8-bit-fixed-query-skip-chars | 922 (coding-system-put ',name '8-bit-fixed-query-skip-chars |
800 ',(skip-chars-quote | 923 ,skip-chars) |
801 (apply #'string (append decode-table nil)))) | 924 (coding-system-put ',name '8-bit-fixed-invalid-sequences-skip-chars |
925 ,invalid-sequences-skip-chars) | |
802 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table) | 926 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table) |
803 (coding-system-put ',name 'query-coding-function | 927 (coding-system-put ',name 'query-coding-function |
804 #'8-bit-fixed-query-coding-region) | 928 #'8-bit-fixed-query-coding-region) |
805 (coding-system-put ',(intern (format "%s-unix" name)) | 929 (coding-system-put ',(intern (format "%s-unix" name)) |
806 'query-coding-function | 930 'query-coding-function |
817 (find-coding-system ',name))))) | 941 (find-coding-system ',name))))) |
818 | 942 |
819 ;; Ideally this would be in latin.el, but code-init.el uses it. | 943 ;; Ideally this would be in latin.el, but code-init.el uses it. |
820 (make-8-bit-coding-system | 944 (make-8-bit-coding-system |
821 'iso-8859-1 | 945 'iso-8859-1 |
822 '() ;; No differences from Latin 1. | 946 (loop |
947 for i from #x80 to #xff | |
948 collect (list i (int-char i))) ;; Identical to Latin-1. | |
823 "ISO-8859-1 (Latin-1)" | 949 "ISO-8859-1 (Latin-1)" |
824 '(mnemonic "Latin 1" | 950 '(mnemonic "Latin 1" |
825 documentation "The most used encoding of Western Europe and the Americas." | 951 documentation "The most used encoding of Western Europe and the Americas." |
826 aliases (iso-latin-1 latin-1))) | 952 aliases (iso-latin-1 latin-1))) |