comparison lisp/coding.el @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 19 Sep 2009 22:53:13 +0100
parents 8cbca852bcd4
children c673987f5f3d
comparison
equal deleted inserted replaced
4689:0636c6ccb430 4690:257b468bf2ca
268 (force-coding-system-equivalency 268 (force-coding-system-equivalency
269 (file-name file-name-coding-system) 269 (file-name file-name-coding-system)
270 (terminal terminal-coding-system) 270 (terminal terminal-coding-system)
271 (keyboard keyboard-coding-system))) 271 (keyboard keyboard-coding-system)))
272 272
273 (when (not (featurep 'mule))
274 (define-coding-system-alias 'escape-quoted 'binary)
275 ;; these are so that gnus and friends work when not mule
276 (define-coding-system-alias 'iso-8859-1 'raw-text)
277 ;; We're misrepresenting ourselves to the gnus code by saying we support
278 ;; both.
279 ; (define-coding-system-alias 'iso-8859-2 'raw-text)
280 (define-coding-system-alias 'ctext 'raw-text))
281
282 (make-compatible-variable 'enable-multibyte-characters "Unimplemented") 273 (make-compatible-variable 'enable-multibyte-characters "Unimplemented")
283 274
284 ;; Sure would be nice to be able to use defface here. 275 ;; Sure would be nice to be able to use defface here.
285 (copy-face 'highlight 'query-coding-warning-face) 276 (copy-face 'highlight 'query-coding-warning-face)
286 277
287 (defvar default-query-coding-region-safe-charset-skip-chars-map 278 (defun query-coding-clear-highlights (begin end &optional buffer-or-string)
288 #s(hash-table test equal data ())
289 "A map from list of charsets to `skip-chars-forward' arguments for them.")
290
291 (defsubst query-coding-clear-highlights (begin end &optional buffer-or-string)
292 "Remove extent faces added by `query-coding-region' between BEGIN and END. 279 "Remove extent faces added by `query-coding-region' between BEGIN and END.
293 280
294 Optional argument BUFFER-OR-STRING is the buffer or string to use, and 281 Optional argument BUFFER-OR-STRING is the buffer or string to use, and
295 defaults to the current buffer. 282 defaults to the current buffer.
296 283
300 (map-extents #'(lambda (extent ignored-arg) 287 (map-extents #'(lambda (extent ignored-arg)
301 (when (eq 'query-coding-warning-face 288 (when (eq 'query-coding-warning-face
302 (extent-face extent)) 289 (extent-face extent))
303 (delete-extent extent))) buffer-or-string begin end)) 290 (delete-extent extent))) buffer-or-string begin end))
304 291
305 (defun* default-query-coding-region (begin end coding-system
306 &optional buffer ignore-invalid-sequencesp
307 errorp highlightp)
308 "The default `query-coding-region' implementation.
309
310 Uses the `safe-charsets' and `safe-chars' coding system properties.
311 The former is a list of XEmacs character sets that can be safely
312 encoded by CODING-SYSTEM; the latter a char table describing, in
313 addition, characters that can be safely encoded by CODING-SYSTEM.
314
315 Does not support IGNORE-INVALID-SEQUENCESP."
316 (check-argument-type #'coding-system-p
317 (setq coding-system (find-coding-system coding-system)))
318 (check-argument-type #'integer-or-marker-p begin)
319 (check-argument-type #'integer-or-marker-p end)
320 (let* ((safe-charsets
321 (or (coding-system-get coding-system 'safe-charsets)
322 (coding-system-get (coding-system-base coding-system)
323 'safe-charsets)))
324 (safe-chars
325 (or (coding-system-get coding-system 'safe-chars)
326 (coding-system-get (coding-system-base coding-system)
327 'safe-chars)))
328 (skip-chars-arg
329 (gethash safe-charsets
330 default-query-coding-region-safe-charset-skip-chars-map))
331 (ranges (make-range-table))
332 (case-fold-search nil)
333 fail-range-start fail-range-end char-after
334 looking-at-arg failed extent)
335 ;; Coding systems with a value of t for safe-charsets support everything.
336 (when (eq t safe-charsets)
337 (return-from default-query-coding-region (values t nil)))
338 (unless skip-chars-arg
339 (setq skip-chars-arg
340 (puthash safe-charsets
341 (mapconcat #'charset-skip-chars-string
342 safe-charsets "")
343 default-query-coding-region-safe-charset-skip-chars-map)))
344 (when highlightp
345 (query-coding-clear-highlights begin end buffer))
346 (if (and (zerop (length skip-chars-arg)) (null safe-chars))
347 (progn
348 ;; Uh-oh, nothing known about this coding system. Fail.
349 (when errorp
350 (error 'text-conversion-error
351 "Coding system doesn't say what it can encode"
352 (coding-system-name coding-system)))
353 (put-range-table begin end t ranges)
354 (when highlightp
355 (setq extent (make-extent begin end buffer))
356 (set-extent-priority extent (+ mouse-highlight-priority 2))
357 (set-extent-face extent 'query-coding-warning-face))
358 (values nil ranges))
359 (setq looking-at-arg (if (equal "" skip-chars-arg)
360 ;; Regexp that will never match.
361 #r".\{0,0\}"
362 (concat "[" skip-chars-arg "]")))
363 (save-excursion
364 (goto-char begin buffer)
365 (skip-chars-forward skip-chars-arg end buffer)
366 (while (< (point buffer) end)
367 ; (message
368 ; "fail-range-start is %S, point is %S, end is %S"
369 ; fail-range-start (point buffer) end)
370 (setq char-after (char-after (point buffer) buffer)
371 fail-range-start (point buffer))
372 (while (and
373 (< (point buffer) end)
374 (not (looking-at looking-at-arg))
375 (or (not safe-chars)
376 (not (get-char-table char-after safe-chars))))
377 (forward-char 1 buffer)
378 (setq char-after (char-after (point buffer) buffer)
379 failed t))
380 (if (= fail-range-start (point buffer))
381 ;; The character can actually be encoded by the coding
382 ;; system; check the characters past it.
383 (forward-char 1 buffer)
384 ;; Can't be encoded; note this.
385 (when errorp
386 (error 'text-conversion-error
387 (format "Cannot encode %s using coding system"
388 (buffer-substring fail-range-start (point buffer)
389 buffer))
390 (coding-system-name coding-system)))
391 (put-range-table fail-range-start
392 ;; If char-after is non-nil, we're not at
393 ;; the end of the buffer.
394 (setq fail-range-end (if char-after
395 (point buffer)
396 (point-max buffer)))
397 t ranges)
398 (when highlightp
399 (setq extent (make-extent fail-range-start fail-range-end buffer))
400 (set-extent-priority extent (+ mouse-highlight-priority 2))
401 (set-extent-face extent 'query-coding-warning-face)))
402 (skip-chars-forward skip-chars-arg end buffer))
403 (if failed
404 (values nil ranges)
405 (values t nil))))))
406
407 (defun query-coding-region (start end coding-system &optional buffer
408 ignore-invalid-sequencesp errorp highlight)
409 "Work out whether CODING-SYSTEM can losslessly encode a region.
410
411 START and END are the beginning and end of the region to check.
412 CODING-SYSTEM is the coding system to try.
413
414 Optional argument BUFFER is the buffer to check, and defaults to the current
415 buffer.
416
417 IGNORE-INVALID-SEQUENCESP, also an optional argument, says to treat XEmacs
418 characters which have an unambiguous encoded representation, despite being
419 undefined in what they represent, as encodable. These chiefly arise with
420 variable-length encodings like UTF-8 and UTF-16, where an invalid sequence
421 is passed through to XEmacs as a sequence of characters with a defined
422 correspondence to the octets on disk, but no non-error semantics; see the
423 `invalid-sequence-coding-system' argument to `set-language-info'.
424
425 They can also arise with fixed-length encodings like ISO 8859-7, where
426 certain octets on disk have undefined values, and treating them as
427 corresponding to the ISO 8859-1 characters with the same numerical values
428 may lead to data that is not understood by other applications.
429
430 Optional argument ERRORP says to signal a `text-conversion-error' if some
431 character in the region cannot be encoded, and defaults to nil.
432
433 Optional argument HIGHLIGHT says to display unencodable characters in the
434 region using `query-coding-warning-face'. It defaults to nil.
435
436 This function returns a list; the intention is that callers use
437 `multiple-value-bind' or the related CL multiple value functions to deal
438 with it. The first element is `t' if the region can be encoded using
439 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region
440 can be encoded using CODING-SYSTEM; otherwise, it is a range table
441 describing the positions of the unencodable characters. Ranges that
442 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP
443 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol
444 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map
445 to the symbol `unencodable'. See `make-range-table' for more details of
446 range tables."
447 (funcall (or (coding-system-get coding-system 'query-coding-function)
448 #'default-query-coding-region)
449 start end coding-system buffer ignore-invalid-sequencesp errorp
450 highlight))
451
452 (define-compiler-macro query-coding-region (start end coding-system
453 &optional buffer
454 ignore-invalid-sequencesp
455 errorp highlight)
456 `(funcall (or (coding-system-get ,coding-system 'query-coding-function)
457 #'default-query-coding-region)
458 ,start ,end ,coding-system ,@(append (when (or buffer
459 ignore-invalid-sequencesp
460 errorp highlight)
461 (list buffer))
462 (when (or ignore-invalid-sequencesp
463 errorp highlight)
464 (list ignore-invalid-sequencesp))
465 (when (or errorp highlight)
466 (list errorp))
467 (when highlight (list highlight)))))
468
469 (defun query-coding-string (string coding-system &optional 292 (defun query-coding-string (string coding-system &optional
470 ignore-invalid-sequencesp errorp highlight) 293 ignore-invalid-sequencesp errorp highlight)
471 "Work out whether CODING-SYSTEM can losslessly encode STRING. 294 "Work out whether CODING-SYSTEM can losslessly encode STRING.
472 CODING-SYSTEM is the coding system to check. 295 CODING-SYSTEM is the coding system to check.
473 296
480 `invalid-sequence-coding-system' argument to `set-language-info'. 303 `invalid-sequence-coding-system' argument to `set-language-info'.
481 304
482 They can also arise with fixed-length encodings like ISO 8859-7, where 305 They can also arise with fixed-length encodings like ISO 8859-7, where
483 certain octets on disk have undefined values, and treating them as 306 certain octets on disk have undefined values, and treating them as
484 corresponding to the ISO 8859-1 characters with the same numerical values 307 corresponding to the ISO 8859-1 characters with the same numerical values
485 may lead to data that is not understood by other applications. 308 may lead to data that are not understood by other applications.
486 309
487 Optional argument ERRORP says to signal a `text-conversion-error' if some 310 Optional argument ERRORP says to signal a `text-conversion-error' if some
488 character in the region cannot be encoded, and defaults to nil. 311 character in the region cannot be encoded, and defaults to nil.
489 312
490 Optional argument HIGHLIGHT says to display unencodable characters in the 313 Optional argument HIGHLIGHT says to display unencodable characters in the
491 region using `query-coding-warning-face'. It defaults to nil. 314 region using `query-coding-warning-face'. It defaults to nil.
492 315
493 This function returns a list; the intention is that callers use 316 This function can return multiple values; the intention is that callers use
494 `multiple-value-bind' or the related CL multiple value functions to deal 317 `multiple-value-bind' or the related CL multiple value functions to deal
495 with it. The first element is `t' if the region can be encoded using 318 with it. The first result is `t' if the region can be encoded using
496 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region 319 CODING-SYSTEM, or `nil' if not. If the region cannot be encoded using
497 can be encoded using CODING-SYSTEM; otherwise, it is a range table 320 CODING-SYSTEM, the second result is a range table describing the positions
498 describing the positions of the unencodable characters. Ranges that 321 of the unencodable characters.
499 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP 322
500 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol 323 Ranges that describe characters that would be ignored were
501 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map 324 IGNORE-INVALID-SEQUENCESP non-nil map to the symbol `invalid-sequence';
502 to the symbol `unencodable'. See `make-range-table' for more details of 325 other ranges map to the symbol `unencodable'. If IGNORE-INVALID-SEQUENCESP
503 range tables." 326 is non-nil, all ranges will map to the symbol `unencodable'. See
327 `make-range-table' for more details of range tables."
504 (with-temp-buffer 328 (with-temp-buffer
505 (when highlight 329 (when highlight
506 (query-coding-clear-highlights 0 (length string) string)) 330 (query-coding-clear-highlights 0 (length string) string))
507 (insert string) 331 (insert string)
508 (multiple-value-bind (result ranges extent) 332 (multiple-value-bind (result ranges)
509 (query-coding-region (point-min) (point-max) coding-system 333 (query-coding-region (point-min) (point-max) coding-system
510 (current-buffer) ignore-invalid-sequencesp 334 (current-buffer) ignore-invalid-sequencesp
511 errorp) 335 errorp)
512 (unless result 336 (unless result
513 (map-range-table 337 (let ((original-ranges ranges)
514 #'(lambda (begin end value) 338 extent)
515 ;; Sigh, string indices are zero-based, buffer offsets are 339 (setq ranges (make-range-table))
516 ;; one-based. 340 (map-range-table
517 (remove-range-table begin end ranges) 341 #'(lambda (begin end value)
518 (put-range-table (decf begin) (decf end) value ranges) 342 ;; Sigh, string indices are zero-based, buffer offsets are
519 (when highlight 343 ;; one-based.
520 (setq extent (make-extent begin end string)) 344 (put-range-table (decf begin) (decf end) value ranges)
521 (set-extent-priority extent (+ mouse-highlight-priority 2)) 345 (when highlight
522 (set-extent-property extent 'duplicable t) 346 (setq extent (make-extent begin end string))
523 (set-extent-face extent 'query-coding-warning-face))) 347 (set-extent-priority extent (+ mouse-highlight-priority 2))
524 ranges)) 348 (set-extent-property extent 'duplicable t)
525 (values result ranges)))) 349 (set-extent-face extent 'query-coding-warning-face)))
350 original-ranges)))
351 (if result result (values result ranges)))))
526 352
527 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2. 353 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2.
528 (defun unencodable-char-position (start end coding-system 354 (defun unencodable-char-position (start end coding-system
529 &optional count string) 355 &optional count string)
530 "Return position of first un-encodable character in a region. 356 "Return position of first un-encodable character in a region.
613 (incf begin)))) 439 (incf begin))))
614 do (setq coding-system (check-coding-system coding-system)) 440 do (setq coding-system (check-coding-system coding-system))
615 (multiple-value-bind (encoded ranges) 441 (multiple-value-bind (encoded ranges)
616 (query-coding-region begin end coding-system) 442 (query-coding-region begin end coding-system)
617 (unless encoded 443 (unless encoded
618 (setq intermediate (list (coding-system-name coding-system))) 444 (setq intermediate
445 (list (coding-system-name coding-system)))
619 (map-range-table range-lambda ranges) 446 (map-range-table range-lambda ranges)
620 (push (nreverse intermediate) result))) 447 (push (nreverse intermediate) result)))
621 finally return result)))) 448 finally return result))))
622 (if (stringp begin) 449 (if (stringp begin)
623 (with-temp-buffer 450 (with-temp-buffer
632 (defun encode-coding-char (char coding-system &optional charset) 459 (defun encode-coding-char (char coding-system &optional charset)
633 "Encode CHAR by CODING-SYSTEM and return the resulting string. 460 "Encode CHAR by CODING-SYSTEM and return the resulting string.
634 If CODING-SYSTEM can't safely encode CHAR, return nil. 461 If CODING-SYSTEM can't safely encode CHAR, return nil.
635 The optional third argument CHARSET is, for the moment, ignored." 462 The optional third argument CHARSET is, for the moment, ignored."
636 (check-argument-type #'characterp char) 463 (check-argument-type #'characterp char)
637 (multiple-value-bind (succeededp) 464 (and (query-coding-string char coding-system)
638 (query-coding-string char coding-system) 465 (encode-coding-string char coding-system)))
639 (when succeededp 466
640 (encode-coding-string char coding-system)))) 467 (if (featurep 'mule)
641 468 (progn
642 (unless (featurep 'mule) 469 ;; Under Mule, we do much of the complicated coding system creation in
643 ;; If we're under non-Mule, every XEmacs character can be encoded 470 ;; Lisp and especially at compile time. We need some function
644 ;; with every XEmacs coding system. 471 ;; definition for this function to be created in this file, but we can
645 (fset #'default-query-coding-region 472 ;; leave assigning the docstring to the autoload cookie
646 #'(lambda (&rest ignored) 473 ;; handling later. Thankfully; that docstring is big.
647 "Stub `query-coding-region' implementation. Always succeeds." 474 (autoload 'make-coding-system "mule/make-coding-system")
648 (values t nil))) 475
649 (unintern 'default-query-coding-region-safe-charset-skip-chars-map)) 476 ;; (During byte-compile before dumping, make-coding-system may already
477 ;; have been loaded, make sure not to overwrite the correct compiler
478 ;; macro:)
479 (when (eq 'autoload (car (symbol-function 'make-coding-system)))
480 ;; Make sure to pick up the correct compiler macro when compiling
481 ;; files:
482 (define-compiler-macro make-coding-system (&whole form name type
483 &optional description props)
484 (load (second (symbol-function 'make-coding-system)))
485 (funcall (get 'make-coding-system 'cl-compiler-macro)
486 form name type description props))))
487
488 ;; Mule's not available;
489 (fset 'make-coding-system (symbol-function 'make-coding-system-internal))
490 (define-coding-system-alias 'escape-quoted 'binary)
491
492 ;; These are so that gnus and friends work when not mule:
493 (define-coding-system-alias 'iso-8859-1 'raw-text)
494 (define-coding-system-alias 'ctext 'raw-text))
650 495
651 ;;; coding.el ends here 496 ;;; coding.el ends here