diff lisp/coding.el @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 19 Sep 2009 22:53:13 +0100
parents 8cbca852bcd4
children c673987f5f3d
line wrap: on
line diff
--- a/lisp/coding.el	Sat Sep 19 17:56:23 2009 +0200
+++ b/lisp/coding.el	Sat Sep 19 22:53:13 2009 +0100
@@ -270,25 +270,12 @@
    (terminal terminal-coding-system)
    (keyboard keyboard-coding-system)))
 
-(when (not (featurep 'mule))
-  (define-coding-system-alias 'escape-quoted 'binary)
-  ;; these are so that gnus and friends work when not mule
-  (define-coding-system-alias 'iso-8859-1 'raw-text)
-  ;; We're misrepresenting ourselves to the gnus code by saying we support
-  ;; both.
-  ; (define-coding-system-alias 'iso-8859-2 'raw-text)
-  (define-coding-system-alias 'ctext 'raw-text))
-
 (make-compatible-variable 'enable-multibyte-characters "Unimplemented")
 
 ;; Sure would be nice to be able to use defface here. 
 (copy-face 'highlight 'query-coding-warning-face)
 
-(defvar default-query-coding-region-safe-charset-skip-chars-map
-  #s(hash-table test equal data ())
-  "A map from list of charsets to `skip-chars-forward' arguments for them.")
-
-(defsubst query-coding-clear-highlights (begin end &optional buffer-or-string)
+(defun query-coding-clear-highlights (begin end &optional buffer-or-string)
   "Remove extent faces added by `query-coding-region' between BEGIN and END.
 
 Optional argument BUFFER-OR-STRING is the buffer or string to use, and
@@ -302,170 +289,6 @@
                              (extent-face extent))
                      (delete-extent extent))) buffer-or-string begin end))
 
-(defun* default-query-coding-region (begin end coding-system
-				     &optional buffer ignore-invalid-sequencesp
-                                     errorp highlightp)
-  "The default `query-coding-region' implementation.
-
-Uses the `safe-charsets' and `safe-chars' coding system properties.
-The former is a list of XEmacs character sets that can be safely
-encoded by CODING-SYSTEM; the latter a char table describing, in
-addition, characters that can be safely encoded by CODING-SYSTEM.
-
-Does not support IGNORE-INVALID-SEQUENCESP."
-  (check-argument-type #'coding-system-p
-                       (setq coding-system (find-coding-system coding-system)))
-  (check-argument-type #'integer-or-marker-p begin)
-  (check-argument-type #'integer-or-marker-p end)
-  (let* ((safe-charsets
-          (or (coding-system-get coding-system 'safe-charsets)
-	      (coding-system-get (coding-system-base coding-system)
-				 'safe-charsets)))
-         (safe-chars
-	  (or (coding-system-get coding-system 'safe-chars)
-	      (coding-system-get (coding-system-base coding-system)
-				 'safe-chars)))
-         (skip-chars-arg
-          (gethash safe-charsets
-                   default-query-coding-region-safe-charset-skip-chars-map))
-         (ranges (make-range-table))
-         (case-fold-search nil)
-         fail-range-start fail-range-end char-after
-	 looking-at-arg failed extent)
-    ;; Coding systems with a value of t for safe-charsets support everything.
-    (when (eq t safe-charsets)
-      (return-from default-query-coding-region (values t nil)))
-    (unless skip-chars-arg
-      (setq skip-chars-arg
-	    (puthash safe-charsets
-		     (mapconcat #'charset-skip-chars-string
-				safe-charsets "")
-		     default-query-coding-region-safe-charset-skip-chars-map)))
-    (when highlightp
-      (query-coding-clear-highlights begin end buffer))
-    (if (and (zerop (length skip-chars-arg)) (null safe-chars))
-	(progn
-	    ;; Uh-oh, nothing known about this coding system. Fail. 
-	    (when errorp 
-	      (error 'text-conversion-error
-		     "Coding system doesn't say what it can encode"
-		     (coding-system-name coding-system)))
-	    (put-range-table begin end t ranges)
-	    (when highlightp
-	      (setq extent (make-extent begin end buffer))
-	      (set-extent-priority extent (+ mouse-highlight-priority 2))
-	      (set-extent-face extent 'query-coding-warning-face))
-	    (values nil ranges))
-      (setq looking-at-arg (if (equal "" skip-chars-arg)
-			       ;; Regexp that will never match.
-			       #r".\{0,0\}" 
-                             (concat "[" skip-chars-arg "]")))
-      (save-excursion
-	(goto-char begin buffer)
-	(skip-chars-forward skip-chars-arg end buffer)
-	(while (< (point buffer) end)
-	  ; (message
-	  ; "fail-range-start is %S, point is %S, end is %S"
-	  ;  fail-range-start (point buffer) end)
-	  (setq char-after (char-after (point buffer) buffer)
-		fail-range-start (point buffer))
-	  (while (and
-		  (< (point buffer) end)
-		  (not (looking-at looking-at-arg))
-		  (or (not safe-chars)
-		      (not (get-char-table char-after safe-chars))))
-	    (forward-char 1 buffer)
-	    (setq char-after (char-after (point buffer) buffer)
-		  failed t))
-	  (if (= fail-range-start (point buffer))
-	      ;; The character can actually be encoded by the coding
-	      ;; system; check the characters past it.
-	      (forward-char 1 buffer)
-            ;; Can't be encoded; note this.
-	    (when errorp 
-	      (error 'text-conversion-error
-		     (format "Cannot encode %s using coding system"
-			     (buffer-substring fail-range-start (point buffer)
-					       buffer))
-		     (coding-system-name coding-system)))
-	    (put-range-table fail-range-start
-			     ;; If char-after is non-nil, we're not at
-			     ;; the end of the buffer.
-			     (setq fail-range-end (if char-after
-						      (point buffer)
-						    (point-max buffer)))
-			     t ranges)
-	    (when highlightp
-	      (setq extent (make-extent fail-range-start fail-range-end buffer))
-	      (set-extent-priority extent (+ mouse-highlight-priority 2))
-	      (set-extent-face extent 'query-coding-warning-face)))
-	  (skip-chars-forward skip-chars-arg end buffer))
-	(if failed
-	    (values nil ranges)
-	  (values t nil))))))
-
-(defun query-coding-region (start end coding-system &optional buffer
-                            ignore-invalid-sequencesp errorp highlight)
-  "Work out whether CODING-SYSTEM can losslessly encode a region.
-
-START and END are the beginning and end of the region to check.
-CODING-SYSTEM is the coding system to try.
-
-Optional argument BUFFER is the buffer to check, and defaults to the current
-buffer.
-
-IGNORE-INVALID-SEQUENCESP, also an optional argument, says to treat XEmacs
-characters which have an unambiguous encoded representation, despite being
-undefined in what they represent, as encodable.  These chiefly arise with
-variable-length encodings like UTF-8 and UTF-16, where an invalid sequence
-is passed through to XEmacs as a sequence of characters with a defined
-correspondence to the octets on disk, but no non-error semantics; see the
-`invalid-sequence-coding-system' argument to `set-language-info'.
-
-They can also arise with fixed-length encodings like ISO 8859-7, where
-certain octets on disk have undefined values, and treating them as
-corresponding to the ISO 8859-1 characters with the same numerical values
-may lead to data that is not understood by other applications.
-
-Optional argument ERRORP says to signal a `text-conversion-error' if some
-character in the region cannot be encoded, and defaults to nil.
-
-Optional argument HIGHLIGHT says to display unencodable characters in the
-region using `query-coding-warning-face'. It defaults to nil.
-
-This function returns a list; the intention is that callers use
-`multiple-value-bind' or the related CL multiple value functions to deal
-with it.  The first element is `t' if the region can be encoded using
-CODING-SYSTEM, or `nil' if not.  The second element is `nil' if the region
-can be encoded using CODING-SYSTEM; otherwise, it is a range table
-describing the positions of the unencodable characters.  Ranges that
-describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP
-non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol
-`unencodable'.  If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map
-to the symbol `unencodable'.  See `make-range-table' for more details of
-range tables."
-  (funcall (or (coding-system-get coding-system 'query-coding-function)
-               #'default-query-coding-region)
-           start end coding-system buffer ignore-invalid-sequencesp errorp
-           highlight))
-
-(define-compiler-macro query-coding-region (start end coding-system
-                                            &optional buffer 
-                                            ignore-invalid-sequencesp
-                                            errorp highlight)
-  `(funcall (or (coding-system-get ,coding-system 'query-coding-function)
-                #'default-query-coding-region)
-    ,start ,end ,coding-system ,@(append (when (or buffer
-                                                   ignore-invalid-sequencesp
-						   errorp highlight)
-					   (list buffer))
-                                         (when (or ignore-invalid-sequencesp
-						   errorp highlight)
-					   (list ignore-invalid-sequencesp))
-                                         (when (or errorp highlight)
-					   (list errorp))
-                                         (when highlight (list highlight)))))
-
 (defun query-coding-string (string coding-system &optional
                             ignore-invalid-sequencesp errorp highlight)
   "Work out whether CODING-SYSTEM can losslessly encode STRING.
@@ -482,7 +305,7 @@
 They can also arise with fixed-length encodings like ISO 8859-7, where
 certain octets on disk have undefined values, and treating them as
 corresponding to the ISO 8859-1 characters with the same numerical values
-may lead to data that is not understood by other applications.
+may lead to data that are not understood by other applications.
 
 Optional argument ERRORP says to signal a `text-conversion-error' if some
 character in the region cannot be encoded, and defaults to nil.
@@ -490,39 +313,42 @@
 Optional argument HIGHLIGHT says to display unencodable characters in the
 region using `query-coding-warning-face'. It defaults to nil.
 
-This function returns a list; the intention is that callers use
+This function can return multiple values; the intention is that callers use
 `multiple-value-bind' or the related CL multiple value functions to deal
-with it.  The first element is `t' if the region can be encoded using
-CODING-SYSTEM, or `nil' if not.  The second element is `nil' if the region
-can be encoded using CODING-SYSTEM; otherwise, it is a range table
-describing the positions of the unencodable characters.  Ranges that
-describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP
-non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol
-`unencodable'.  If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map
-to the symbol `unencodable'.  See `make-range-table' for more details of
-range tables."
+with it.  The first result is `t' if the region can be encoded using
+CODING-SYSTEM, or `nil' if not.  If the region cannot be encoded using
+CODING-SYSTEM, the second result is a range table describing the positions
+of the unencodable characters.
+
+Ranges that describe characters that would be ignored were
+IGNORE-INVALID-SEQUENCESP non-nil map to the symbol `invalid-sequence';
+other ranges map to the symbol `unencodable'.  If IGNORE-INVALID-SEQUENCESP
+is non-nil, all ranges will map to the symbol `unencodable'.  See
+`make-range-table' for more details of range tables."
   (with-temp-buffer 
     (when highlight
       (query-coding-clear-highlights 0 (length string) string))
     (insert string)
-    (multiple-value-bind (result ranges extent)
+    (multiple-value-bind (result ranges)
         (query-coding-region (point-min) (point-max) coding-system
                              (current-buffer) ignore-invalid-sequencesp
 			     errorp)
-      (unless result
-        (map-range-table
-         #'(lambda (begin end value)
-	     ;; Sigh, string indices are zero-based, buffer offsets are
-	     ;; one-based.
-             (remove-range-table begin end ranges)
-             (put-range-table (decf begin) (decf end) value ranges)
-	     (when highlight
-	       (setq extent (make-extent begin end string))
-	       (set-extent-priority extent (+ mouse-highlight-priority 2))
-	       (set-extent-property extent 'duplicable t)
-	       (set-extent-face extent 'query-coding-warning-face)))
-         ranges))
-      (values result ranges))))
+        (unless result
+          (let ((original-ranges ranges)
+                extent)
+            (setq ranges (make-range-table))
+            (map-range-table
+             #'(lambda (begin end value)
+                 ;; Sigh, string indices are zero-based, buffer offsets are
+                 ;; one-based.
+                 (put-range-table (decf begin) (decf end) value ranges)
+                 (when highlight
+                   (setq extent (make-extent begin end string))
+                   (set-extent-priority extent (+ mouse-highlight-priority 2))
+                   (set-extent-property extent 'duplicable t)
+                   (set-extent-face extent 'query-coding-warning-face)))
+             original-ranges)))
+        (if result result (values result ranges)))))
 
 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2. 
 (defun unencodable-char-position  (start end coding-system
@@ -615,7 +441,8 @@
                (multiple-value-bind (encoded ranges)
 		   (query-coding-region begin end coding-system)
                  (unless encoded
-                   (setq intermediate (list (coding-system-name coding-system)))
+                   (setq intermediate
+                         (list (coding-system-name coding-system)))
                    (map-range-table range-lambda ranges)
                    (push (nreverse intermediate) result)))
                finally return result))))
@@ -634,18 +461,36 @@
 If CODING-SYSTEM can't safely encode CHAR, return nil.
 The optional third argument CHARSET is, for the moment, ignored."
   (check-argument-type #'characterp char)
-  (multiple-value-bind (succeededp)
-      (query-coding-string char coding-system)
-    (when succeededp
-      (encode-coding-string char coding-system))))
+  (and (query-coding-string char coding-system)
+       (encode-coding-string char coding-system)))
+
+(if (featurep 'mule)
+    (progn
+      ;; Under Mule, we do much of the complicated coding system creation in
+      ;; Lisp and especially at compile time. We need some function
+      ;; definition for this function to be created in this file, but we can
+      ;; leave assigning the docstring to the autoload cookie
+      ;; handling later. Thankfully; that docstring is big.
+      (autoload 'make-coding-system "mule/make-coding-system")
 
-(unless (featurep 'mule)
-  ;; If we're under non-Mule, every XEmacs character can be encoded
-  ;; with every XEmacs coding system.
-  (fset #'default-query-coding-region
-	#'(lambda (&rest ignored)
-	    "Stub `query-coding-region' implementation. Always succeeds."
-	    (values t nil)))
-  (unintern 'default-query-coding-region-safe-charset-skip-chars-map))
+      ;; (During byte-compile before dumping, make-coding-system may already
+      ;; have been loaded, make sure not to overwrite the correct compiler
+      ;; macro:)
+      (when (eq 'autoload (car (symbol-function 'make-coding-system)))
+        ;; Make sure to pick up the correct compiler macro when compiling
+        ;; files:
+        (define-compiler-macro make-coding-system (&whole form name type
+                                                   &optional description props)
+          (load (second (symbol-function 'make-coding-system)))
+          (funcall (get 'make-coding-system 'cl-compiler-macro)
+                   form name type description props))))
+
+  ;; Mule's not available; 
+  (fset 'make-coding-system (symbol-function 'make-coding-system-internal))
+  (define-coding-system-alias 'escape-quoted 'binary)
+
+  ;; These are so that gnus and friends work when not mule:
+  (define-coding-system-alias 'iso-8859-1 'raw-text)
+  (define-coding-system-alias 'ctext 'raw-text))
 
 ;;; coding.el ends here