comparison lisp/coding.el @ 4604:e0a8715fdb1f

Support new IGNORE-INVALID-SEQUENCESP argument, #'query-coding-region. lisp/ChangeLog addition: 2009-02-07 Aidan Kehoe <kehoea@parhasard.net> * coding.el (query-coding-clear-highlights): Rename the BUFFER argument to BUFFER-OR-STRING, describe it as possibly being a string in its documentation. (default-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document that this function does not support it. Bind case-fold-search to nil, we don't want this to influence what the function thinks is encodable or not. (query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does; reflect this new argument in the associated compiler macro. (query-coding-string): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does. Support the HIGHLIGHT argument correctly. * unicode.el (unicode-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, document what it does, implement this. Document a potential problem. Use #'query-coding-clear-highlights instead of reimplementing it ourselves. Remove some debugging messages. * mule/arabic.el (iso-8859-6): * mule/cyrillic.el (iso-8859-5): * mule/greek.el (iso-8859-7): * mule/hebrew.el (iso-8859-8): * mule/latin.el (iso-8859-2): * mule/latin.el (iso-8859-3): * mule/latin.el (iso-8859-4): * mule/latin.el (iso-8859-14): * mule/latin.el (iso-8859-15): * mule/latin.el (iso-8859-16): * mule/latin.el (iso-8859-9): * mule/latin.el (windows-1252): * mule/mule-coding.el (iso-8859-1): Avoid the assumption that characters not given an explicit mapping in these coding systems map to the ISO 8859-1 characters corresponding to the octets on disk; this makes it much more reasonable to implement the IGNORE-INVALID-SEQUENCESP argument to query-coding-region. * mule/mule-cmds.el (set-language-info): Correct the docstring. * mule/mule-cmds.el (finish-set-language-environment): Treat invalid Unicode sequences produced from invalid-sequence-coding-system and corresponding to control characters the same as control characters in redisplay. * mule/mule-cmds.el: Document that encode-coding-char is available in coding.el * mule/mule-coding.el (make-8-bit-generate-helper): Change to return the both the encode-program generated and the relevant non-ASCII charset; update the docstring to reflect this. * mule/mule-coding.el (make-8-bit-generate-encode-program-and-skip-chars-strings): Rename this function; have it return skip-chars-strings as well as the encode program. Have these skip-chars-strings use ranges for charsets, where possible. * mule/mule-coding.el (make-8-bit-create-decode-encode-tables): Revise this to allow people to specify explicitly characters that should be undefined (= corresponding to keys in unicode-error-default-translation-table), and treating unspecified octets above #x7f as undefined by default. * mule/mule-coding.el (8-bit-fixed-query-coding-region): Add a new IGNORE-INVALID-SEQUENCESP argument, implement support for it using the 8-bit-fixed-invalid-sequences-skip-chars coding system property; remove some debugging messages. * mule/mule-coding.el (make-8-bit-coding-system): This function is dumped, autoloading it makes no sense. Document what happens when characters above #x7f are not specified, implement this. * mule/vietnamese.el: Correct spelling. tests/ChangeLog addition: 2009-02-07 Aidan Kehoe <kehoea@parhasard.net> * automated/query-coding-tests.el: Add FAILING-CASE arguments to the Assert calls, making #'q-c-debug mostly unnecessary. Remove #'q-c-debug. Add new tests that use the IGNORE-INVALID-SEQUENCESP argument to #'query-coding-region; rework the existing ones to respect it.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 07 Feb 2009 17:13:37 +0000
parents 0347879667ed
children 33b8c874b2c8
comparison
equal deleted inserted replaced
4603:202cb69c4d87 4604:e0a8715fdb1f
286 286
287 (defvar default-query-coding-region-safe-charset-skip-chars-map 287 (defvar default-query-coding-region-safe-charset-skip-chars-map
288 #s(hash-table test equal data ()) 288 #s(hash-table test equal data ())
289 "A map from list of charsets to `skip-chars-forward' arguments for them.") 289 "A map from list of charsets to `skip-chars-forward' arguments for them.")
290 290
291 (defsubst query-coding-clear-highlights (begin end &optional buffer) 291 (defsubst query-coding-clear-highlights (begin end &optional buffer-or-string)
292 "Remove extent faces added by `query-coding-region' between BEGIN and END. 292 "Remove extent faces added by `query-coding-region' between BEGIN and END.
293 293
294 Optional argument BUFFER is the buffer to use, and defaults to the current 294 Optional argument BUFFER-OR-STRING is the buffer or string to use, and
295 buffer. 295 defaults to the current buffer.
296 296
297 The HIGHLIGHTP argument to `query-coding-region' indicates that it should 297 The HIGHLIGHTP argument to `query-coding-region' indicates that it should
298 display unencodable characters using `query-coding-warning-face'. After 298 display unencodable characters using `query-coding-warning-face'. After
299 this function has been called, this will no longer be the case. " 299 this function has been called, this will no longer be the case. "
300 (map-extents #'(lambda (extent ignored-arg) 300 (map-extents #'(lambda (extent ignored-arg)
301 (when (eq 'query-coding-warning-face 301 (when (eq 'query-coding-warning-face
302 (extent-face extent)) 302 (extent-face extent))
303 (delete-extent extent))) buffer begin end)) 303 (delete-extent extent))) buffer-or-string begin end))
304 304
305 (defun* default-query-coding-region (begin end coding-system 305 (defun* default-query-coding-region (begin end coding-system
306 &optional buffer errorp highlightp) 306 &optional buffer ignore-invalid-sequencesp
307 errorp highlightp)
307 "The default `query-coding-region' implementation. 308 "The default `query-coding-region' implementation.
308 309
309 Uses the `safe-charsets' and `safe-chars' coding system properties. 310 Uses the `safe-charsets' and `safe-chars' coding system properties.
310 The former is a list of XEmacs character sets that can be safely 311 The former is a list of XEmacs character sets that can be safely
311 encoded by CODING-SYSTEM; the latter a char table describing, in 312 encoded by CODING-SYSTEM; the latter a char table describing, in
312 addition, characters that can be safely encoded by CODING-SYSTEM." 313 addition, characters that can be safely encoded by CODING-SYSTEM.
314
315 Does not support IGNORE-INVALID-SEQUENCESP."
313 (check-argument-type #'coding-system-p 316 (check-argument-type #'coding-system-p
314 (setq coding-system (find-coding-system coding-system))) 317 (setq coding-system (find-coding-system coding-system)))
315 (check-argument-type #'integer-or-marker-p begin) 318 (check-argument-type #'integer-or-marker-p begin)
316 (check-argument-type #'integer-or-marker-p end) 319 (check-argument-type #'integer-or-marker-p end)
317 (let* ((safe-charsets 320 (let* ((safe-charsets
324 'safe-chars))) 327 'safe-chars)))
325 (skip-chars-arg 328 (skip-chars-arg
326 (gethash safe-charsets 329 (gethash safe-charsets
327 default-query-coding-region-safe-charset-skip-chars-map)) 330 default-query-coding-region-safe-charset-skip-chars-map))
328 (ranges (make-range-table)) 331 (ranges (make-range-table))
332 (case-fold-search nil)
329 fail-range-start fail-range-end char-after 333 fail-range-start fail-range-end char-after
330 looking-at-arg failed extent) 334 looking-at-arg failed extent)
331 ;; Coding systems with a value of t for safe-charsets support everything. 335 ;; Coding systems with a value of t for safe-charsets support everything.
332 (when (eq t safe-charsets) 336 (when (eq t safe-charsets)
333 (return-from default-query-coding-region (values t nil))) 337 (return-from default-query-coding-region (values t nil)))
399 (if failed 403 (if failed
400 (values nil ranges) 404 (values nil ranges)
401 (values t nil)))))) 405 (values t nil))))))
402 406
403 (defun query-coding-region (start end coding-system &optional buffer 407 (defun query-coding-region (start end coding-system &optional buffer
404 errorp highlight) 408 ignore-invalid-sequencesp errorp highlight)
405 "Work out whether CODING-SYSTEM can losslessly encode a region. 409 "Work out whether CODING-SYSTEM can losslessly encode a region.
406 410
407 START and END are the beginning and end of the region to check. 411 START and END are the beginning and end of the region to check.
408 CODING-SYSTEM is the coding system to try. 412 CODING-SYSTEM is the coding system to try.
409 413
410 Optional argument BUFFER is the buffer to check, and defaults to the current 414 Optional argument BUFFER is the buffer to check, and defaults to the current
411 buffer. Optional argument ERRORP says to signal a `text-conversion-error' 415 buffer.
412 if some character in the region cannot be encoded, and defaults to nil. 416
417 IGNORE-INVALID-SEQUENCESP, also an optional argument, says to treat XEmacs
418 characters which have an unambiguous encoded representation, despite being
419 undefined in what they represent, as encodable. These chiefly arise with
420 variable-length encodings like UTF-8 and UTF-16, where an invalid sequence
421 is passed through to XEmacs as a sequence of characters with a defined
422 correspondence to the octets on disk, but no non-error semantics; see the
423 `invalid-sequence-coding-system' argument to `set-language-info'.
424
425 They can also arise with fixed-length encodings like ISO 8859-7, where
426 certain octets on disk have undefined values, and treating them as
427 corresponding to the ISO 8859-1 characters with the same numerical values
428 may lead to data that is not understood by other applications.
429
430 Optional argument ERRORP says to signal a `text-conversion-error' if some
431 character in the region cannot be encoded, and defaults to nil.
413 432
414 Optional argument HIGHLIGHT says to display unencodable characters in the 433 Optional argument HIGHLIGHT says to display unencodable characters in the
415 region using `query-coding-warning-face'. It defaults to nil. 434 region using `query-coding-warning-face'. It defaults to nil.
416 435
417 This function returns a list; the intention is that callers use 436 This function returns a list; the intention is that callers use
418 `multiple-value-bind' or the related CL multiple value functions to deal 437 `multiple-value-bind' or the related CL multiple value functions to deal
419 with it. The first element is `t' if the region can be encoded using 438 with it. The first element is `t' if the region can be encoded using
420 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region 439 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region
421 can be encoded using CODING-SYSTEM; otherwise, it is a range table 440 can be encoded using CODING-SYSTEM; otherwise, it is a range table
422 describing the positions of the unencodable characters. See 441 describing the positions of the unencodable characters. Ranges that
423 `make-range-table'." 442 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP
443 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol
444 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map
445 to the symbol `unencodable'. See `make-range-table' for more details of
446 range tables."
424 (funcall (or (coding-system-get coding-system 'query-coding-function) 447 (funcall (or (coding-system-get coding-system 'query-coding-function)
425 #'default-query-coding-region) 448 #'default-query-coding-region)
426 start end coding-system buffer errorp highlight)) 449 start end coding-system buffer ignore-invalid-sequencesp errorp
450 highlight))
427 451
428 (define-compiler-macro query-coding-region (start end coding-system 452 (define-compiler-macro query-coding-region (start end coding-system
429 &optional buffer errorp highlight) 453 &optional buffer
454 ignore-invalid-sequencesp
455 errorp highlight)
430 `(funcall (or (coding-system-get ,coding-system 'query-coding-function) 456 `(funcall (or (coding-system-get ,coding-system 'query-coding-function)
431 #'default-query-coding-region) 457 #'default-query-coding-region)
432 ,start ,end ,coding-system ,@(append (if buffer (list buffer)) 458 ,start ,end ,coding-system ,@(append (when (or buffer
433 (if errorp (list errorp)) 459 ignore-invalid-sequencesp
434 (if highlight (list highlight))))) 460 errorp highlight)
435 461 (list buffer))
436 (defun query-coding-string (string coding-system &optional errorp highlight) 462 (when (or ignore-invalid-sequencesp
463 errorp highlight)
464 (list ignore-invalid-sequencesp))
465 (when (or errorp highlight)
466 (list errorp))
467 (when highlight (list highlight)))))
468
469 (defun query-coding-string (string coding-system &optional
470 ignore-invalid-sequencesp errorp highlight)
437 "Work out whether CODING-SYSTEM can losslessly encode STRING. 471 "Work out whether CODING-SYSTEM can losslessly encode STRING.
438 CODING-SYSTEM is the coding system to check. 472 CODING-SYSTEM is the coding system to check.
439 473
474 IGNORE-INVALID-SEQUENCESP, an optional argument, says to treat XEmacs
475 characters which have an unambiguous encoded representation, despite being
476 undefined in what they represent, as encodable. These chiefly arise with
477 variable-length encodings like UTF-8 and UTF-16, where an invalid sequence
478 is passed through to XEmacs as a sequence of characters with a defined
479 correspondence to the octets on disk, but no non-error semantics; see the
480 `invalid-sequence-coding-system' argument to `set-language-info'.
481
482 They can also arise with fixed-length encodings like ISO 8859-7, where
483 certain octets on disk have undefined values, and treating them as
484 corresponding to the ISO 8859-1 characters with the same numerical values
485 may lead to data that is not understood by other applications.
486
440 Optional argument ERRORP says to signal a `text-conversion-error' if some 487 Optional argument ERRORP says to signal a `text-conversion-error' if some
441 character in the region cannot be encoded, and defaults to nil. 488 character in the region cannot be encoded, and defaults to nil.
442 489
443 Optional argument HIGHLIGHT says to display unencodable characters in the 490 Optional argument HIGHLIGHT says to display unencodable characters in the
444 region using `query-coding-warning-face'. It defaults to nil. 491 region using `query-coding-warning-face'. It defaults to nil.
445 492
446 This function returns a list; the intention is that callers use use 493 This function returns a list; the intention is that callers use
447 `multiple-value-bind' or the related CL multiple value functions to deal 494 `multiple-value-bind' or the related CL multiple value functions to deal
448 with it. The first element is `t' if the string can be encoded using 495 with it. The first element is `t' if the region can be encoded using
449 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the string 496 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region
450 can be encoded using CODING-SYSTEM; otherwise, it is a range table 497 can be encoded using CODING-SYSTEM; otherwise, it is a range table
451 describing the positions of the unencodable characters. See 498 describing the positions of the unencodable characters. Ranges that
452 `make-range-table'." 499 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP
500 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol
501 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map
502 to the symbol `unencodable'. See `make-range-table' for more details of
503 range tables."
453 (with-temp-buffer 504 (with-temp-buffer
505 (when highlight
506 (query-coding-clear-highlights 0 (length string) string))
454 (insert string) 507 (insert string)
455 (multiple-value-bind (result ranges) 508 (multiple-value-bind (result ranges extent)
456 (query-coding-region (point-min) (point-max) coding-system 509 (query-coding-region (point-min) (point-max) coding-system
457 (current-buffer) errorp 510 (current-buffer) errorp
458 ;; #### Highlight won't work here, 511 nil ignore-invalid-sequencesp)
459 ;; query-coding-region may need to be modified.
460 highlight)
461 (unless result 512 (unless result
462 ;; Sigh, string indices are zero-based, buffer offsets are
463 ;; one-based.
464 (map-range-table 513 (map-range-table
465 #'(lambda (begin end value) 514 #'(lambda (begin end value)
515 ;; Sigh, string indices are zero-based, buffer offsets are
516 ;; one-based.
466 (remove-range-table begin end ranges) 517 (remove-range-table begin end ranges)
467 (put-range-table (1- begin) (1- end) value ranges)) 518 (put-range-table (decf begin) (decf end) value ranges)
519 (when highlight
520 (setq extent (make-extent begin end string))
521 (set-extent-priority extent (+ mouse-highlight-priority 2))
522 (set-extent-property extent 'duplicable t)
523 (set-extent-face extent 'query-coding-warning-face)))
468 ranges)) 524 ranges))
469 (values result ranges)))) 525 (values result ranges))))
470 526
471 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2. 527 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2.
472 (defun unencodable-char-position (start end coding-system 528 (defun unencodable-char-position (start end coding-system