Mercurial > hg > xemacs-beta
comparison lisp/coding.el @ 4604:e0a8715fdb1f
Support new IGNORE-INVALID-SEQUENCESP argument, #'query-coding-region.
lisp/ChangeLog addition:
2009-02-07 Aidan Kehoe <kehoea@parhasard.net>
* coding.el (query-coding-clear-highlights):
Rename the BUFFER argument to BUFFER-OR-STRING, describe it as
possibly being a string in its documentation.
(default-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document that this
function does not support it.
Bind case-fold-search to nil, we don't want this to influence what the
function thinks is encodable or not.
(query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does; reflect this new argument in the associated compiler macro.
(query-coding-string):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does. Support the HIGHLIGHT argument correctly.
* unicode.el (unicode-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, document what it
does, implement this. Document a potential problem.
Use #'query-coding-clear-highlights instead of reimplementing it
ourselves.
Remove some debugging messages.
* mule/arabic.el (iso-8859-6):
* mule/cyrillic.el (iso-8859-5):
* mule/greek.el (iso-8859-7):
* mule/hebrew.el (iso-8859-8):
* mule/latin.el (iso-8859-2):
* mule/latin.el (iso-8859-3):
* mule/latin.el (iso-8859-4):
* mule/latin.el (iso-8859-14):
* mule/latin.el (iso-8859-15):
* mule/latin.el (iso-8859-16):
* mule/latin.el (iso-8859-9):
* mule/latin.el (windows-1252):
* mule/mule-coding.el (iso-8859-1):
Avoid the assumption that characters not given an explicit mapping
in these coding systems map to the ISO 8859-1 characters
corresponding to the octets on disk; this makes it much more
reasonable to implement the IGNORE-INVALID-SEQUENCESP argument to
query-coding-region.
* mule/mule-cmds.el (set-language-info):
Correct the docstring.
* mule/mule-cmds.el (finish-set-language-environment):
Treat invalid Unicode sequences produced from
invalid-sequence-coding-system and corresponding to control
characters the same as control characters in redisplay.
* mule/mule-cmds.el:
Document that encode-coding-char is available in coding.el
* mule/mule-coding.el (make-8-bit-generate-helper):
Change to return the both the encode-program generated and the
relevant non-ASCII charset; update the docstring to reflect this.
* mule/mule-coding.el
(make-8-bit-generate-encode-program-and-skip-chars-strings):
Rename this function; have it return skip-chars-strings as well as
the encode program. Have these skip-chars-strings use ranges for
charsets, where possible.
* mule/mule-coding.el (make-8-bit-create-decode-encode-tables):
Revise this to allow people to specify explicitly characters that
should be undefined (= corresponding to keys in
unicode-error-default-translation-table), and treating unspecified
octets above #x7f as undefined by default.
* mule/mule-coding.el (8-bit-fixed-query-coding-region):
Add a new IGNORE-INVALID-SEQUENCESP argument, implement support
for it using the 8-bit-fixed-invalid-sequences-skip-chars coding
system property; remove some debugging messages.
* mule/mule-coding.el (make-8-bit-coding-system):
This function is dumped, autoloading it makes no sense.
Document what happens when characters above #x7f are not
specified, implement this.
* mule/vietnamese.el:
Correct spelling.
tests/ChangeLog addition:
2009-02-07 Aidan Kehoe <kehoea@parhasard.net>
* automated/query-coding-tests.el:
Add FAILING-CASE arguments to the Assert calls, making #'q-c-debug
mostly unnecessary. Remove #'q-c-debug.
Add new tests that use the IGNORE-INVALID-SEQUENCESP argument to
#'query-coding-region; rework the existing ones to respect it.
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 07 Feb 2009 17:13:37 +0000 |
parents | 0347879667ed |
children | 33b8c874b2c8 |
comparison
equal
deleted
inserted
replaced
4603:202cb69c4d87 | 4604:e0a8715fdb1f |
---|---|
286 | 286 |
287 (defvar default-query-coding-region-safe-charset-skip-chars-map | 287 (defvar default-query-coding-region-safe-charset-skip-chars-map |
288 #s(hash-table test equal data ()) | 288 #s(hash-table test equal data ()) |
289 "A map from list of charsets to `skip-chars-forward' arguments for them.") | 289 "A map from list of charsets to `skip-chars-forward' arguments for them.") |
290 | 290 |
291 (defsubst query-coding-clear-highlights (begin end &optional buffer) | 291 (defsubst query-coding-clear-highlights (begin end &optional buffer-or-string) |
292 "Remove extent faces added by `query-coding-region' between BEGIN and END. | 292 "Remove extent faces added by `query-coding-region' between BEGIN and END. |
293 | 293 |
294 Optional argument BUFFER is the buffer to use, and defaults to the current | 294 Optional argument BUFFER-OR-STRING is the buffer or string to use, and |
295 buffer. | 295 defaults to the current buffer. |
296 | 296 |
297 The HIGHLIGHTP argument to `query-coding-region' indicates that it should | 297 The HIGHLIGHTP argument to `query-coding-region' indicates that it should |
298 display unencodable characters using `query-coding-warning-face'. After | 298 display unencodable characters using `query-coding-warning-face'. After |
299 this function has been called, this will no longer be the case. " | 299 this function has been called, this will no longer be the case. " |
300 (map-extents #'(lambda (extent ignored-arg) | 300 (map-extents #'(lambda (extent ignored-arg) |
301 (when (eq 'query-coding-warning-face | 301 (when (eq 'query-coding-warning-face |
302 (extent-face extent)) | 302 (extent-face extent)) |
303 (delete-extent extent))) buffer begin end)) | 303 (delete-extent extent))) buffer-or-string begin end)) |
304 | 304 |
305 (defun* default-query-coding-region (begin end coding-system | 305 (defun* default-query-coding-region (begin end coding-system |
306 &optional buffer errorp highlightp) | 306 &optional buffer ignore-invalid-sequencesp |
307 errorp highlightp) | |
307 "The default `query-coding-region' implementation. | 308 "The default `query-coding-region' implementation. |
308 | 309 |
309 Uses the `safe-charsets' and `safe-chars' coding system properties. | 310 Uses the `safe-charsets' and `safe-chars' coding system properties. |
310 The former is a list of XEmacs character sets that can be safely | 311 The former is a list of XEmacs character sets that can be safely |
311 encoded by CODING-SYSTEM; the latter a char table describing, in | 312 encoded by CODING-SYSTEM; the latter a char table describing, in |
312 addition, characters that can be safely encoded by CODING-SYSTEM." | 313 addition, characters that can be safely encoded by CODING-SYSTEM. |
314 | |
315 Does not support IGNORE-INVALID-SEQUENCESP." | |
313 (check-argument-type #'coding-system-p | 316 (check-argument-type #'coding-system-p |
314 (setq coding-system (find-coding-system coding-system))) | 317 (setq coding-system (find-coding-system coding-system))) |
315 (check-argument-type #'integer-or-marker-p begin) | 318 (check-argument-type #'integer-or-marker-p begin) |
316 (check-argument-type #'integer-or-marker-p end) | 319 (check-argument-type #'integer-or-marker-p end) |
317 (let* ((safe-charsets | 320 (let* ((safe-charsets |
324 'safe-chars))) | 327 'safe-chars))) |
325 (skip-chars-arg | 328 (skip-chars-arg |
326 (gethash safe-charsets | 329 (gethash safe-charsets |
327 default-query-coding-region-safe-charset-skip-chars-map)) | 330 default-query-coding-region-safe-charset-skip-chars-map)) |
328 (ranges (make-range-table)) | 331 (ranges (make-range-table)) |
332 (case-fold-search nil) | |
329 fail-range-start fail-range-end char-after | 333 fail-range-start fail-range-end char-after |
330 looking-at-arg failed extent) | 334 looking-at-arg failed extent) |
331 ;; Coding systems with a value of t for safe-charsets support everything. | 335 ;; Coding systems with a value of t for safe-charsets support everything. |
332 (when (eq t safe-charsets) | 336 (when (eq t safe-charsets) |
333 (return-from default-query-coding-region (values t nil))) | 337 (return-from default-query-coding-region (values t nil))) |
399 (if failed | 403 (if failed |
400 (values nil ranges) | 404 (values nil ranges) |
401 (values t nil)))))) | 405 (values t nil)))))) |
402 | 406 |
403 (defun query-coding-region (start end coding-system &optional buffer | 407 (defun query-coding-region (start end coding-system &optional buffer |
404 errorp highlight) | 408 ignore-invalid-sequencesp errorp highlight) |
405 "Work out whether CODING-SYSTEM can losslessly encode a region. | 409 "Work out whether CODING-SYSTEM can losslessly encode a region. |
406 | 410 |
407 START and END are the beginning and end of the region to check. | 411 START and END are the beginning and end of the region to check. |
408 CODING-SYSTEM is the coding system to try. | 412 CODING-SYSTEM is the coding system to try. |
409 | 413 |
410 Optional argument BUFFER is the buffer to check, and defaults to the current | 414 Optional argument BUFFER is the buffer to check, and defaults to the current |
411 buffer. Optional argument ERRORP says to signal a `text-conversion-error' | 415 buffer. |
412 if some character in the region cannot be encoded, and defaults to nil. | 416 |
417 IGNORE-INVALID-SEQUENCESP, also an optional argument, says to treat XEmacs | |
418 characters which have an unambiguous encoded representation, despite being | |
419 undefined in what they represent, as encodable. These chiefly arise with | |
420 variable-length encodings like UTF-8 and UTF-16, where an invalid sequence | |
421 is passed through to XEmacs as a sequence of characters with a defined | |
422 correspondence to the octets on disk, but no non-error semantics; see the | |
423 `invalid-sequence-coding-system' argument to `set-language-info'. | |
424 | |
425 They can also arise with fixed-length encodings like ISO 8859-7, where | |
426 certain octets on disk have undefined values, and treating them as | |
427 corresponding to the ISO 8859-1 characters with the same numerical values | |
428 may lead to data that is not understood by other applications. | |
429 | |
430 Optional argument ERRORP says to signal a `text-conversion-error' if some | |
431 character in the region cannot be encoded, and defaults to nil. | |
413 | 432 |
414 Optional argument HIGHLIGHT says to display unencodable characters in the | 433 Optional argument HIGHLIGHT says to display unencodable characters in the |
415 region using `query-coding-warning-face'. It defaults to nil. | 434 region using `query-coding-warning-face'. It defaults to nil. |
416 | 435 |
417 This function returns a list; the intention is that callers use | 436 This function returns a list; the intention is that callers use |
418 `multiple-value-bind' or the related CL multiple value functions to deal | 437 `multiple-value-bind' or the related CL multiple value functions to deal |
419 with it. The first element is `t' if the region can be encoded using | 438 with it. The first element is `t' if the region can be encoded using |
420 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region | 439 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region |
421 can be encoded using CODING-SYSTEM; otherwise, it is a range table | 440 can be encoded using CODING-SYSTEM; otherwise, it is a range table |
422 describing the positions of the unencodable characters. See | 441 describing the positions of the unencodable characters. Ranges that |
423 `make-range-table'." | 442 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP |
443 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol | |
444 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map | |
445 to the symbol `unencodable'. See `make-range-table' for more details of | |
446 range tables." | |
424 (funcall (or (coding-system-get coding-system 'query-coding-function) | 447 (funcall (or (coding-system-get coding-system 'query-coding-function) |
425 #'default-query-coding-region) | 448 #'default-query-coding-region) |
426 start end coding-system buffer errorp highlight)) | 449 start end coding-system buffer ignore-invalid-sequencesp errorp |
450 highlight)) | |
427 | 451 |
428 (define-compiler-macro query-coding-region (start end coding-system | 452 (define-compiler-macro query-coding-region (start end coding-system |
429 &optional buffer errorp highlight) | 453 &optional buffer |
454 ignore-invalid-sequencesp | |
455 errorp highlight) | |
430 `(funcall (or (coding-system-get ,coding-system 'query-coding-function) | 456 `(funcall (or (coding-system-get ,coding-system 'query-coding-function) |
431 #'default-query-coding-region) | 457 #'default-query-coding-region) |
432 ,start ,end ,coding-system ,@(append (if buffer (list buffer)) | 458 ,start ,end ,coding-system ,@(append (when (or buffer |
433 (if errorp (list errorp)) | 459 ignore-invalid-sequencesp |
434 (if highlight (list highlight))))) | 460 errorp highlight) |
435 | 461 (list buffer)) |
436 (defun query-coding-string (string coding-system &optional errorp highlight) | 462 (when (or ignore-invalid-sequencesp |
463 errorp highlight) | |
464 (list ignore-invalid-sequencesp)) | |
465 (when (or errorp highlight) | |
466 (list errorp)) | |
467 (when highlight (list highlight))))) | |
468 | |
469 (defun query-coding-string (string coding-system &optional | |
470 ignore-invalid-sequencesp errorp highlight) | |
437 "Work out whether CODING-SYSTEM can losslessly encode STRING. | 471 "Work out whether CODING-SYSTEM can losslessly encode STRING. |
438 CODING-SYSTEM is the coding system to check. | 472 CODING-SYSTEM is the coding system to check. |
439 | 473 |
474 IGNORE-INVALID-SEQUENCESP, an optional argument, says to treat XEmacs | |
475 characters which have an unambiguous encoded representation, despite being | |
476 undefined in what they represent, as encodable. These chiefly arise with | |
477 variable-length encodings like UTF-8 and UTF-16, where an invalid sequence | |
478 is passed through to XEmacs as a sequence of characters with a defined | |
479 correspondence to the octets on disk, but no non-error semantics; see the | |
480 `invalid-sequence-coding-system' argument to `set-language-info'. | |
481 | |
482 They can also arise with fixed-length encodings like ISO 8859-7, where | |
483 certain octets on disk have undefined values, and treating them as | |
484 corresponding to the ISO 8859-1 characters with the same numerical values | |
485 may lead to data that is not understood by other applications. | |
486 | |
440 Optional argument ERRORP says to signal a `text-conversion-error' if some | 487 Optional argument ERRORP says to signal a `text-conversion-error' if some |
441 character in the region cannot be encoded, and defaults to nil. | 488 character in the region cannot be encoded, and defaults to nil. |
442 | 489 |
443 Optional argument HIGHLIGHT says to display unencodable characters in the | 490 Optional argument HIGHLIGHT says to display unencodable characters in the |
444 region using `query-coding-warning-face'. It defaults to nil. | 491 region using `query-coding-warning-face'. It defaults to nil. |
445 | 492 |
446 This function returns a list; the intention is that callers use use | 493 This function returns a list; the intention is that callers use |
447 `multiple-value-bind' or the related CL multiple value functions to deal | 494 `multiple-value-bind' or the related CL multiple value functions to deal |
448 with it. The first element is `t' if the string can be encoded using | 495 with it. The first element is `t' if the region can be encoded using |
449 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the string | 496 CODING-SYSTEM, or `nil' if not. The second element is `nil' if the region |
450 can be encoded using CODING-SYSTEM; otherwise, it is a range table | 497 can be encoded using CODING-SYSTEM; otherwise, it is a range table |
451 describing the positions of the unencodable characters. See | 498 describing the positions of the unencodable characters. Ranges that |
452 `make-range-table'." | 499 describe characters that would be ignored were IGNORE-INVALID-SEQUENCESP |
500 non-nil map to the symbol `invalid-sequence'; other ranges map to the symbol | |
501 `unencodable'. If IGNORE-INVALID-SEQUENCESP is non-nil, all ranges will map | |
502 to the symbol `unencodable'. See `make-range-table' for more details of | |
503 range tables." | |
453 (with-temp-buffer | 504 (with-temp-buffer |
505 (when highlight | |
506 (query-coding-clear-highlights 0 (length string) string)) | |
454 (insert string) | 507 (insert string) |
455 (multiple-value-bind (result ranges) | 508 (multiple-value-bind (result ranges extent) |
456 (query-coding-region (point-min) (point-max) coding-system | 509 (query-coding-region (point-min) (point-max) coding-system |
457 (current-buffer) errorp | 510 (current-buffer) errorp |
458 ;; #### Highlight won't work here, | 511 nil ignore-invalid-sequencesp) |
459 ;; query-coding-region may need to be modified. | |
460 highlight) | |
461 (unless result | 512 (unless result |
462 ;; Sigh, string indices are zero-based, buffer offsets are | |
463 ;; one-based. | |
464 (map-range-table | 513 (map-range-table |
465 #'(lambda (begin end value) | 514 #'(lambda (begin end value) |
515 ;; Sigh, string indices are zero-based, buffer offsets are | |
516 ;; one-based. | |
466 (remove-range-table begin end ranges) | 517 (remove-range-table begin end ranges) |
467 (put-range-table (1- begin) (1- end) value ranges)) | 518 (put-range-table (decf begin) (decf end) value ranges) |
519 (when highlight | |
520 (setq extent (make-extent begin end string)) | |
521 (set-extent-priority extent (+ mouse-highlight-priority 2)) | |
522 (set-extent-property extent 'duplicable t) | |
523 (set-extent-face extent 'query-coding-warning-face))) | |
468 ranges)) | 524 ranges)) |
469 (values result ranges)))) | 525 (values result ranges)))) |
470 | 526 |
471 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2. | 527 ;; Function docstring and API are taken from GNU coding.c version 1.353, GPLv2. |
472 (defun unencodable-char-position (start end coding-system | 528 (defun unencodable-char-position (start end coding-system |