Mercurial > hg > xemacs-beta
comparison lisp/mule/mule-coding.el @ 4690:257b468bf2ca
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we need such
functionality if we're going to have a reliable and portable
#'query-coding-region implementation. However, this change doesn't yet
provide #'query-coding-region for the mswindow-multibyte coding systems,
there should be no functional differences between an XEmacs with this change
and one without it.
src/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we
need such functionality if we're going to have a reliable and
portable #'query-coding-region implementation. However, this
change doesn't yet provide #'query-coding-region for the
mswindow-multibyte coding systems, there should be no functional
differences between an XEmacs with this change and one without it.
* mule-coding.c (struct fixed_width_coding_system):
Add a new coding system type, fixed_width, and implement it. It
uses the CCL infrastructure but has a much simpler creation API,
and its own query_method, formerly in lisp/mule/mule-coding.el.
* unicode.c:
Move the Unicode query method implementation here from
unicode.el.
* lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table
here.
* intl-win32.c (complex_vars_of_intl_win32):
Use Fmake_coding_system_internal, not Fmake_coding_system.
* general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence
here.
* file-coding.h (enum coding_system_variant):
Add fixed_width_coding_system here.
(struct coding_system_methods):
Add query_method and query_lstream_method to the coding system
methods.
Provide flags for the query methods.
Declare the default query method; initialise it correctly in
INITIALIZE_CODING_SYSTEM_TYPE.
* file-coding.c (default_query_method):
New function, the default query method for coding systems that do
not set it. Moved from coding.el.
(make_coding_system_1):
Accept new elements in PROPS in #'make-coding-system; aliases, a
list of aliases; safe-chars and safe-charsets (these were
previously accepted but not saved); and category.
(Fmake_coding_system_internal):
New function, what used to be #'make-coding-system--on Mule
builds, we've now moved some of the functionality of this to
Lisp.
(Fcoding_system_canonical_name_p):
Move this earlier in the file, since it's now called from within
make_coding_system_1.
(Fquery_coding_region):
Move the implementation of this here, from coding.el.
(complex_vars_of_file_coding):
Call Fmake_coding_system_internal, not Fmake_coding_system;
specify safe-charsets properties when we're a mule build.
* extents.h (mouse_highlight_priority, Fset_extent_priority,
Fset_extent_face, Fmap_extents):
Make these available to other C files.
lisp/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
* coding.el:
Consolidate code that depends on the presence or absence of Mule
at the end of this file.
(default-query-coding-region, query-coding-region):
Move these functions to C.
(default-query-coding-region-safe-charset-skip-chars-map):
Remove this variable, the corresponding C variable is
Vdefault_query_coding_region_chartab_cache in file-coding.c.
(query-coding-string): Update docstring to reflect actual multiple
values, be more careful about not modifying a range table that
we're currently mapping over.
(encode-coding-char): Make the implementation of this simpler.
(featurep 'mule): Autoload #'make-coding-system from
mule/make-coding-system.el if we're a mule build; provide an
appropriate compiler macro.
Do various non-mule compatibility things if we're not a mule
build.
* update-elc.el (additional-dump-dependencies):
Add mule/make-coding-system as a dump time dependency if we're a
mule build.
* unicode.el (ccl-encode-to-ucs-2):
(decode-char):
(encode-char):
Move these earlier in the file, for the sake of some byte compile
warnings.
(unicode-query-coding-region):
Move this to unicode.c
* mule/make-coding-system.el:
New file, not dumped. Contains the functionality to rework the
arguments necessary for fixed-width coding systems, and contains
the implementation of #'make-coding-system, which now calls
#'make-coding-system-internal.
* mule/vietnamese.el (viscii):
* mule/latin.el (iso-8859-2):
(windows-1250):
(iso-8859-3):
(iso-8859-4):
(iso-8859-14):
(iso-8859-15):
(iso-8859-16):
(iso-8859-9):
(macintosh):
(windows-1252):
* mule/hebrew.el (iso-8859-8):
* mule/greek.el (iso-8859-7):
(windows-1253):
* mule/cyrillic.el (iso-8859-5):
(koi8-r):
(koi8-u):
(windows-1251):
(alternativnyj):
(koi8-ru):
(koi8-t):
(koi8-c):
(koi8-o):
* mule/arabic.el (iso-8859-6):
(windows-1256):
Move all these coding systems to being of type fixed-width, not of
type CCL. This allows the distinct query-coding-region for them to
be in C, something which will eventually allow us to implement
query-coding-region for the mswindows-multibyte coding systems.
* mule/general-late.el (posix-charset-to-coding-system-hash):
Document why we're pre-emptively persuading the byte compiler that
the ELC for this file needs to be written using escape-quoted.
Call #'set-unicode-query-skip-chars-args, now the Unicode
query-coding-region implementation is in C.
* mule/thai-xtis.el (tis-620):
Don't bother checking whether we're XEmacs or not here.
* mule/mule-coding.el:
Move the eight bit fixed-width functionality from this file to
make-coding-system.el.
tests/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
* automated/mule-tests.el:
Check a coding system's type, not an 8-bit-fixed property, for
whether that coding system should be treated as a fixed-width
coding system.
* automated/query-coding-tests.el:
Don't test the query coding functionality for mswindows-multibyte
coding systems, it's not yet implemented.
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 19 Sep 2009 22:53:13 +0100 |
parents | c786c3fd0740 |
children | 308d34e9f07d |
comparison
equal
deleted
inserted
replaced
4689:0636c6ccb430 | 4690:257b468bf2ca |
---|---|
25 ;; Boston, MA 02111-1307, USA. | 25 ;; Boston, MA 02111-1307, USA. |
26 | 26 |
27 ;;; Commentary: | 27 ;;; Commentary: |
28 | 28 |
29 ;;; split off of mule.el and mostly moved to coding.el | 29 ;;; split off of mule.el and mostly moved to coding.el |
30 | |
31 ;; Needed for make-8-bit-coding-system. | |
32 (eval-when-compile (require 'ccl)) | |
33 | 30 |
34 ;;; Code: | 31 ;;; Code: |
35 | 32 |
36 (defun coding-system-force-on-output (coding-system register) | 33 (defun coding-system-force-on-output (coding-system register) |
37 "Return the 'force-on-output property of CODING-SYSTEM for the specified REGISTER." | 34 "Return the 'force-on-output property of CODING-SYSTEM for the specified REGISTER." |
220 (setq done t)) | 217 (setq done t)) |
221 (setq id (1+ id))))) | 218 (setq id (1+ id))))) |
222 (put symbol 'translation-hash-table-id id) | 219 (put symbol 'translation-hash-table-id id) |
223 id)) | 220 id)) |
224 | 221 |
225 (defvar make-8-bit-private-use-start (decode-char 'ucs #xE000) | |
226 "Start of a 256 code private use area for make-8-bit-coding-system. | |
227 | |
228 This is used to ensure that distinct octets on disk for a given coding | |
229 system map to distinct XEmacs characters, preventing a spurious changes when | |
230 a file is read, not changed, and then written. ") | |
231 | |
232 (defun make-8-bit-generate-helper (decode-table encode-table | |
233 encode-failure-octet) | |
234 "Helper function, `make-8-bit-generate-encode-program-and-skip-chars-strings', | |
235 which see. | |
236 | |
237 Deals with the case where ASCII and another character set can both be | |
238 encoded unambiguously and completely into the coding-system; if this is so, | |
239 returns a list comprised of such a ccl-program and the character set in | |
240 question. If not, it returns a list with both entries nil." | |
241 (let ((tentative-encode-program-parts | |
242 (eval-when-compile | |
243 (let* ((vec-len 128) | |
244 (compiled | |
245 (append | |
246 (ccl-compile | |
247 `(1 | |
248 (loop | |
249 (read-multibyte-character r0 r1) | |
250 (if (r0 == ,(charset-id 'ascii)) | |
251 (write r1) | |
252 ((if (r0 == #xABAB) | |
253 ;; #xBFFE is a sentinel in the compiled | |
254 ;; program. | |
255 ((r0 = r1 & #x7F) | |
256 (write r0 ,(make-vector vec-len #xBFFE))) | |
257 ((mule-to-unicode r0 r1) | |
258 (if (r0 == #xFFFD) | |
259 (write #xBEEF) | |
260 ((lookup-integer encode-table-sym r0 r3) | |
261 (if r7 | |
262 (write-multibyte-character r0 r3) | |
263 (write #xBEEF)))))))) | |
264 (repeat)))) nil)) | |
265 (first-part compiled) | |
266 (last-part | |
267 (member-if-not (lambda (entr) (eq #xBFFE entr)) | |
268 (member-if | |
269 (lambda (entr) (eq #xBFFE entr)) | |
270 first-part)))) | |
271 (while compiled | |
272 (when (eq #xBFFE (cadr compiled)) | |
273 (assert (= vec-len (search '(#xBFFE) (cdr compiled) | |
274 :test #'/=)) nil | |
275 "Strange ccl vector length") | |
276 (setcdr compiled nil)) | |
277 (setq compiled (cdr compiled))) | |
278 ;; Is the generated code as we expect it to be? | |
279 (assert (and (memq #xABAB first-part) | |
280 (memq #xBEEF14 last-part)) | |
281 nil | |
282 "This code assumes that the constant #xBEEF is #xBEEF14 in \ | |
283 compiled CCL code,\nand that the constant #xABAB is #xABAB. If that is | |
284 not the case, and it appears not to be--that's why you're getting this | |
285 message--it will not work. ") | |
286 (list first-part last-part vec-len)))) | |
287 (charset-lower -1) | |
288 (charset-upper -1) | |
289 worth-trying known-charsets encode-program | |
290 other-charset-vector ucs) | |
291 | |
292 (loop for char across decode-table | |
293 do (pushnew (char-charset char) known-charsets)) | |
294 (setq known-charsets (delq 'ascii known-charsets)) | |
295 | |
296 (loop for known-charset in known-charsets | |
297 do | |
298 ;; This is not possible for two dimensional charsets. | |
299 (when (eq 1 (charset-dimension known-charset)) | |
300 (if (eq 'control-1 known-charset) | |
301 (setq charset-lower 0 | |
302 charset-upper 31) | |
303 ;; There should be a nicer way to get the limits here. | |
304 (condition-case args-out-of-range | |
305 (make-char known-charset #x100) | |
306 (args-out-of-range | |
307 (setq charset-lower (third args-out-of-range) | |
308 charset-upper (fourth args-out-of-range))))) | |
309 (loop | |
310 for i from charset-lower to charset-upper | |
311 always (and (setq ucs | |
312 (encode-char (make-char known-charset i) 'ucs)) | |
313 (gethash ucs encode-table)) | |
314 finally (setq worth-trying known-charset)) | |
315 | |
316 ;; Only trying this for one charset at a time, the first find. | |
317 (when worth-trying (return)) | |
318 | |
319 ;; Okay, this charset is not worth trying, Try the next. | |
320 (setq charset-lower -1 | |
321 charset-upper -1 | |
322 worth-trying nil))) | |
323 | |
324 (when worth-trying | |
325 (setq other-charset-vector | |
326 (make-vector (third tentative-encode-program-parts) | |
327 encode-failure-octet)) | |
328 (loop for i from charset-lower to charset-upper | |
329 do (aset other-charset-vector i | |
330 (gethash (encode-char (make-char worth-trying i) | |
331 'ucs) encode-table))) | |
332 (setq encode-program | |
333 (nsublis | |
334 (list (cons #xABAB (charset-id worth-trying))) | |
335 (nconc | |
336 (copy-list (first | |
337 tentative-encode-program-parts)) | |
338 (append other-charset-vector nil) | |
339 (copy-tree (second | |
340 tentative-encode-program-parts)))))) | |
341 (values encode-program worth-trying))) | |
342 | |
343 (defun make-8-bit-generate-encode-program-and-skip-chars-strings | |
344 (decode-table encode-table encode-failure-octet) | |
345 "Generate a CCL program to encode a 8-bit fixed-width charset. | |
346 | |
347 DECODE-TABLE must have 256 non-cons entries, and will be regarded as | |
348 describing a map from the octet corresponding to an offset in the | |
349 table to the that entry in the table. ENCODE-TABLE is a hash table | |
350 map from unicode values to characters in the range [0,255]. | |
351 ENCODE-FAILURE-OCTET describes an integer between 0 and 255 | |
352 \(inclusive) to write in the event that a character cannot be encoded. " | |
353 (check-argument-type #'vectorp decode-table) | |
354 (check-argument-range (length decode-table) #x100 #x100) | |
355 (check-argument-type #'hash-table-p encode-table) | |
356 (check-argument-type #'integerp encode-failure-octet) | |
357 (check-argument-range encode-failure-octet #x00 #xFF) | |
358 (let ((encode-program nil) | |
359 (general-encode-program | |
360 (eval-when-compile | |
361 (let ((prog (append | |
362 (ccl-compile | |
363 `(1 | |
364 (loop | |
365 (read-multibyte-character r0 r1) | |
366 (mule-to-unicode r0 r1) | |
367 (if (r0 == #xFFFD) | |
368 (write #xBEEF) | |
369 ((lookup-integer encode-table-sym r0 r3) | |
370 (if r7 | |
371 (write-multibyte-character r0 r3) | |
372 (write #xBEEF)))) | |
373 (repeat)))) nil))) | |
374 (assert (memq #xBEEF14 prog) | |
375 nil | |
376 "This code assumes that the constant #xBEEF is #xBEEF14 \ | |
377 in compiled CCL code.\nIf that is not the case, and it appears not to | |
378 be--that's why you're getting this message--it will not work. ") | |
379 prog))) | |
380 (encode-program-with-ascii-optimisation | |
381 (eval-when-compile | |
382 (let ((prog (append | |
383 (ccl-compile | |
384 `(1 | |
385 (loop | |
386 (read-multibyte-character r0 r1) | |
387 (if (r0 == ,(charset-id 'ascii)) | |
388 (write r1) | |
389 ((mule-to-unicode r0 r1) | |
390 (if (r0 == #xFFFD) | |
391 (write #xBEEF) | |
392 ((lookup-integer encode-table-sym r0 r3) | |
393 (if r7 | |
394 (write-multibyte-character r0 r3) | |
395 (write #xBEEF)))))) | |
396 (repeat)))) nil))) | |
397 (assert (memq #xBEEF14 prog) | |
398 nil | |
399 "This code assumes that the constant #xBEEF is #xBEEF14 \ | |
400 in compiled CCL code.\nIf that is not the case, and it appears not to | |
401 be--that's why you're getting this message--it will not work. ") | |
402 prog))) | |
403 (ascii-encodes-as-itself nil) | |
404 (control-1-encodes-as-itself t) | |
405 (invalid-sequence-code-point-start | |
406 (eval-when-compile | |
407 (char-to-unicode | |
408 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) | |
409 further-char-set skip-chars invalid-sequences-skip-chars) | |
410 | |
411 ;; Is this coding system ASCII-compatible? If so, we can avoid the hash | |
412 ;; table lookup for those characters. | |
413 (loop | |
414 for i from #x00 to #x7f | |
415 always (eq (int-to-char i) (gethash i encode-table)) | |
416 finally (setq ascii-encodes-as-itself t)) | |
417 | |
418 ;; Note that this logic handles EBCDIC badly. For example, CP037, | |
419 ;; MIME name ebcdic-na, has the entire repertoire of ASCII and | |
420 ;; Latin 1, and thus a more optimal ccl encode program would check | |
421 ;; for those character sets and use tables. But for now, we do a | |
422 ;; hash table lookup for every character. | |
423 (if (null ascii-encodes-as-itself) | |
424 ;; General encode program. Pros; general and correct. Cons; | |
425 ;; slow, a hash table lookup + mule-unicode conversion is done | |
426 ;; for every character encoding. | |
427 (setq encode-program general-encode-program) | |
428 (multiple-value-setq | |
429 (encode-program further-char-set) | |
430 ;; Encode program with ascii-ascii mapping (based on a | |
431 ;; character's mule character set), and one other mule | |
432 ;; character set using table-based encoding, other | |
433 ;; character sets using hash table lookups. | |
434 ;; make-8-bit-non-ascii-completely-coveredp only returns | |
435 ;; such a mapping if some non-ASCII charset with | |
436 ;; characters in decode-table is entirely covered by | |
437 ;; encode-table. | |
438 (make-8-bit-generate-helper decode-table encode-table | |
439 encode-failure-octet)) | |
440 (unless encode-program | |
441 ;; If make-8-bit-non-ascii-completely-coveredp returned nil, | |
442 ;; but ASCII still encodes as itself, do one-to-one mapping | |
443 ;; for ASCII, and a hash table lookup for everything else. | |
444 (setq encode-program encode-program-with-ascii-optimisation))) | |
445 | |
446 (setq encode-program | |
447 (nsublis | |
448 (list (cons #xBEEF14 | |
449 (logior (lsh encode-failure-octet 8) | |
450 #x14))) | |
451 (copy-tree encode-program))) | |
452 (loop | |
453 for i from #x80 to #x9f | |
454 do (unless (= i (aref decode-table i)) | |
455 (setq control-1-encodes-as-itself nil) | |
456 (return))) | |
457 (loop | |
458 for i from #x00 to #xFF | |
459 initially (setq skip-chars | |
460 (cond | |
461 ((and ascii-encodes-as-itself | |
462 control-1-encodes-as-itself further-char-set) | |
463 (concat "\x00-\x9f" (charset-skip-chars-string | |
464 further-char-set))) | |
465 ((and ascii-encodes-as-itself | |
466 control-1-encodes-as-itself) | |
467 "\x00-\x9f") | |
468 ((null ascii-encodes-as-itself) | |
469 (skip-chars-quote (apply #'string | |
470 (append decode-table nil)))) | |
471 (further-char-set | |
472 (concat (charset-skip-chars-string 'ascii) | |
473 (charset-skip-chars-string further-char-set))) | |
474 (t | |
475 (charset-skip-chars-string 'ascii))) | |
476 invalid-sequences-skip-chars "") | |
477 with decoded-ucs = nil | |
478 with decoded = nil | |
479 with no-ascii-transparency-skip-chars-list = | |
480 (unless ascii-encodes-as-itself (append decode-table nil)) | |
481 ;; Can't use #'match-string here, see: | |
482 ;; http://mid.gmane.org/18829.34118.709782.704574@parhasard.net | |
483 with skip-chars-test = | |
484 #'(lambda (skip-chars-string testing) | |
485 (with-temp-buffer | |
486 (insert testing) | |
487 (goto-char (point-min)) | |
488 (skip-chars-forward skip-chars-string) | |
489 (= (point) (point-max)))) | |
490 do | |
491 (setq decoded (aref decode-table i) | |
492 decoded-ucs (char-to-unicode decoded)) | |
493 (cond | |
494 ((<= invalid-sequence-code-point-start decoded-ucs | |
495 (+ invalid-sequence-code-point-start #xFF)) | |
496 (setq invalid-sequences-skip-chars | |
497 (concat (string decoded) | |
498 invalid-sequences-skip-chars)) | |
499 (assert (not (funcall skip-chars-test skip-chars decoded)) | |
500 "This char should only be skipped with \ | |
501 `invalid-sequences-skip-chars', not by `skip-chars'")) | |
502 ((not (funcall skip-chars-test skip-chars decoded)) | |
503 (if ascii-encodes-as-itself | |
504 (setq skip-chars (concat skip-chars (string decoded))) | |
505 (push decoded no-ascii-transparency-skip-chars-list)))) | |
506 finally (unless ascii-encodes-as-itself | |
507 (setq skip-chars | |
508 (skip-chars-quote | |
509 (apply #'string | |
510 no-ascii-transparency-skip-chars-list))))) | |
511 (values encode-program skip-chars invalid-sequences-skip-chars))) | |
512 | |
513 (defun make-8-bit-create-decode-encode-tables (unicode-map) | |
514 "Return a list \(DECODE-TABLE ENCODE-TABLE) given UNICODE-MAP. | |
515 UNICODE-MAP should be an alist mapping from integer octet values to | |
516 characters with UCS code points; DECODE-TABLE will be a 256-element | |
517 vector, and ENCODE-TABLE will be a hash table mapping from 256 numbers | |
518 to 256 distinct characters. " | |
519 (check-argument-type #'listp unicode-map) | |
520 (let ((decode-table (make-vector 256 nil)) | |
521 (encode-table (make-hash-table :size 256)) | |
522 (private-use-start (encode-char make-8-bit-private-use-start 'ucs)) | |
523 (invalid-sequence-code-point-start | |
524 (eval-when-compile | |
525 (char-to-unicode | |
526 (aref (decode-coding-string "\xd8\x00\x00\x00" 'utf-16-be) 3)))) | |
527 desired-ucs decode-table-entry) | |
528 | |
529 (loop for (external internal) | |
530 in unicode-map | |
531 do | |
532 (aset decode-table external internal) | |
533 (assert (not (eq (encode-char internal 'ucs) -1)) | |
534 nil | |
535 "Looks like you're calling make-8-bit-coding-system in a \ | |
536 dumped file, \nand you're either not providing a literal UNICODE-MAP | |
537 or PROPS. Don't do that; make-8-bit-coding-system relies on sensible | |
538 Unicode mappings being available, which they are at compile time for | |
539 dumped files (but this requires the mentioned literals), but not, for | |
540 most of them, at run time. ") | |
541 | |
542 (puthash (encode-char internal 'ucs) | |
543 ;; This is semantically an integer, but Dave Love's design | |
544 ;; for lookup-integer in CCL means we need to store it as a | |
545 ;; character. | |
546 (int-to-char external) | |
547 encode-table)) | |
548 | |
549 ;; Now, go through the decode table. For octet values above #x7f, if the | |
550 ;; decode table entry is nil, this means that they have an undefined | |
551 ;; mapping (= they map to XEmacs characters with keys in | |
552 ;; unicode-error-default-translation-table); for octet values below or | |
553 ;; equal to #x7f, it means that they map to ASCII. | |
554 | |
555 ;; If any entry (whether below or above #x7f) in the decode-table | |
556 ;; already maps to some character with a key in | |
557 ;; unicode-error-default-translation-table, it is treated as an | |
558 ;; undefined octet by `query-coding-region'. That is, it is not | |
559 ;; necessary for an octet value to be above #x7f for this to happen. | |
560 | |
561 (dotimes (i 256) | |
562 (setq decode-table-entry (aref decode-table i)) | |
563 (if decode-table-entry | |
564 (when (get-char-table | |
565 decode-table-entry | |
566 unicode-error-default-translation-table) | |
567 ;; The caller is explicitly specifying that this octet | |
568 ;; corresponds to an invalid sequence on disk: | |
569 (assert (= (get-char-table | |
570 decode-table-entry | |
571 unicode-error-default-translation-table) i) | |
572 "Bad argument to `make-8-bit-coding-system'. | |
573 If you're going to designate an octet with value below #x80 as invalid | |
574 for this coding system, make sure to map it to the invalid sequence | |
575 character corresponding to its octet value on disk. ")) | |
576 | |
577 ;; decode-table-entry is nil; either the octet is to be treated as | |
578 ;; contributing to an error sequence (when (> #x7f i)), or it should | |
579 ;; be attempted to treat it as ASCII-equivalent. | |
580 (setq desired-ucs (or (and (< i #x80) i) | |
581 (+ invalid-sequence-code-point-start i))) | |
582 (while (gethash desired-ucs encode-table) | |
583 (assert (not (< i #x80)) | |
584 "UCS code point should not already be in encode-table!" | |
585 ;; There is one invalid sequence char per octet value; | |
586 ;; with eight-bit-fixed coding systems, it makes no sense | |
587 ;; for us to be multiply allocating them. | |
588 (gethash desired-ucs encode-table)) | |
589 (setq desired-ucs (+ private-use-start desired-ucs) | |
590 private-use-start (+ private-use-start 1))) | |
591 (puthash desired-ucs (int-to-char i) encode-table) | |
592 (setq desired-ucs (if (> desired-ucs #xFF) | |
593 (unicode-to-char desired-ucs) | |
594 ;; So we get Latin-1 when run at dump time, | |
595 ;; instead of JIT-allocated characters. | |
596 (int-to-char desired-ucs))) | |
597 (aset decode-table i desired-ucs))) | |
598 (values decode-table encode-table))) | |
599 | |
600 (defun make-8-bit-generate-decode-program (decode-table) | |
601 "Given DECODE-TABLE, generate a CCL program to decode an 8-bit charset. | |
602 DECODE-TABLE must have 256 non-cons entries, and will be regarded as | |
603 describing a map from the octet corresponding to an offset in the | |
604 table to the that entry in the table. " | |
605 (check-argument-type #'vectorp decode-table) | |
606 (check-argument-range (length decode-table) #x100 #x100) | |
607 (let ((decode-program-parts | |
608 (eval-when-compile | |
609 (let* ((compiled | |
610 (append | |
611 (ccl-compile | |
612 `(3 | |
613 ((read r0) | |
614 (loop | |
615 (write-read-repeat r0 ,(make-vector | |
616 256 'sentinel)))))) nil)) | |
617 (first-part compiled) | |
618 (last-part | |
619 (member-if-not #'symbolp | |
620 (member-if-not #'integerp first-part)))) | |
621 ;; Chop off the sentinel sentinel sentinel [..] part. | |
622 (while compiled | |
623 (if (symbolp (cadr compiled)) | |
624 (setcdr compiled nil)) | |
625 (setq compiled (cdr compiled))) | |
626 (list first-part last-part))))) | |
627 (nconc | |
628 ;; copy-list needed, because the structure of the literal provided | |
629 ;; by our eval-when-compile hangs around. | |
630 (copy-list (first decode-program-parts)) | |
631 (append decode-table nil) | |
632 (second decode-program-parts)))) | |
633 | |
634 (defun make-8-bit-choose-category (decode-table) | |
635 "Given DECODE-TABLE, return an appropriate coding category. | |
636 DECODE-TABLE is a 256-entry vector describing the mapping from octets on | |
637 disk to XEmacs characters for some fixed-width 8-bit coding system. " | |
638 (check-argument-type #'vectorp decode-table) | |
639 (check-argument-range (length decode-table) #x100 #x100) | |
640 (loop | |
641 named category | |
642 for i from #x80 to #x9F | |
643 do (unless (= i (aref decode-table i)) | |
644 (return-from category 'no-conversion)) | |
645 finally return 'iso-8-1)) | |
646 | |
647 (defun 8-bit-fixed-query-coding-region (begin end coding-system &optional | |
648 buffer ignore-invalid-sequencesp | |
649 errorp highlightp) | |
650 "The `query-coding-region' implementation for 8-bit-fixed coding systems. | |
651 | |
652 Uses the `8-bit-fixed-query-from-unicode' and `8-bit-fixed-query-skip-chars' | |
653 coding system properties. The former is a hash table mapping from valid | |
654 Unicode code points to on-disk octets in the coding system; the latter a set | |
655 of characters as used by `skip-chars-forward'. Both of these properties are | |
656 generated automatically by `make-8-bit-coding-system'. | |
657 | |
658 See that the documentation of `query-coding-region'; see also | |
659 `make-8-bit-coding-system'. " | |
660 (check-argument-type #'coding-system-p | |
661 (setq coding-system (find-coding-system coding-system))) | |
662 (check-argument-type #'integer-or-marker-p begin) | |
663 (check-argument-type #'integer-or-marker-p end) | |
664 (let ((from-unicode | |
665 (or (coding-system-get coding-system '8-bit-fixed-query-from-unicode) | |
666 (coding-system-get (coding-system-base coding-system) | |
667 '8-bit-fixed-query-from-unicode))) | |
668 (skip-chars-arg | |
669 (or (coding-system-get coding-system '8-bit-fixed-query-skip-chars) | |
670 (coding-system-get (coding-system-base coding-system) | |
671 '8-bit-fixed-query-skip-chars))) | |
672 (invalid-sequences-skip-chars | |
673 (or (coding-system-get coding-system | |
674 '8-bit-fixed-invalid-sequences-skip-chars) | |
675 (coding-system-get (coding-system-base coding-system) | |
676 '8-bit-fixed-invalid-sequences-skip-chars))) | |
677 (ranges (make-range-table)) | |
678 (case-fold-search nil) | |
679 char-after fail-range-start fail-range-end extent | |
680 failed invalid-sequences-looking-at failed-reason | |
681 previous-failed-reason) | |
682 (check-type from-unicode hash-table) | |
683 (check-type skip-chars-arg string) | |
684 (check-type invalid-sequences-skip-chars string) | |
685 (setq invalid-sequences-looking-at | |
686 (if (equal "" invalid-sequences-skip-chars) | |
687 ;; Regexp that will never match. | |
688 #r".\{0,0\}" | |
689 (concat "[" invalid-sequences-skip-chars "]"))) | |
690 (when ignore-invalid-sequencesp | |
691 (setq skip-chars-arg | |
692 (concat skip-chars-arg invalid-sequences-skip-chars))) | |
693 (save-excursion | |
694 (when highlightp | |
695 (query-coding-clear-highlights begin end buffer)) | |
696 (goto-char begin buffer) | |
697 (skip-chars-forward skip-chars-arg end buffer) | |
698 (while (< (point buffer) end) | |
699 (setq char-after (char-after (point buffer) buffer) | |
700 fail-range-start (point buffer)) | |
701 (while (and | |
702 (< (point buffer) end) | |
703 (or (and | |
704 (not (gethash (encode-char char-after 'ucs) from-unicode)) | |
705 (setq failed-reason 'unencodable)) | |
706 (and (not ignore-invalid-sequencesp) | |
707 (looking-at invalid-sequences-looking-at buffer) | |
708 (setq failed-reason 'invalid-sequence))) | |
709 (or (null previous-failed-reason) | |
710 (eq previous-failed-reason failed-reason))) | |
711 (forward-char 1 buffer) | |
712 (setq char-after (char-after (point buffer) buffer) | |
713 failed t | |
714 previous-failed-reason failed-reason)) | |
715 (if (= fail-range-start (point buffer)) | |
716 ;; The character can actually be encoded by the coding | |
717 ;; system; check the characters past it. | |
718 (forward-char 1 buffer) | |
719 ;; The character actually failed. | |
720 (when errorp | |
721 (error 'text-conversion-error | |
722 (format "Cannot encode %s using coding system" | |
723 (buffer-substring fail-range-start (point buffer) | |
724 buffer)) | |
725 (coding-system-name coding-system))) | |
726 (assert (not (null previous-failed-reason)) t | |
727 "previous-failed-reason should always be non-nil here") | |
728 (put-range-table fail-range-start | |
729 ;; If char-after is non-nil, we're not at | |
730 ;; the end of the buffer. | |
731 (setq fail-range-end (if char-after | |
732 (point buffer) | |
733 (point-max buffer))) | |
734 previous-failed-reason ranges) | |
735 (setq previous-failed-reason nil) | |
736 (when highlightp | |
737 (setq extent (make-extent fail-range-start fail-range-end buffer)) | |
738 (set-extent-priority extent (+ mouse-highlight-priority 2)) | |
739 (set-extent-face extent 'query-coding-warning-face)) | |
740 (skip-chars-forward skip-chars-arg end buffer))) | |
741 (if failed | |
742 (values nil ranges) | |
743 (values t nil))))) | |
744 | |
745 (defun make-8-bit-coding-system (name unicode-map &optional description props) | |
746 "Make and return a fixed-width 8-bit CCL coding system named NAME. | |
747 NAME must be a symbol, and UNICODE-MAP a list. | |
748 | |
749 UNICODE-MAP is a plist describing a map from octets in the coding | |
750 system NAME (as integers) to XEmacs characters. Those XEmacs | |
751 characters will be used explicitly on decoding, but for encoding (most | |
752 relevantly, on writing to disk) XEmacs characters that map to the same | |
753 Unicode code point will be unified. This means that the ISO-8859-? | |
754 characters that map to the same Unicode code point will not be | |
755 distinct when written to disk, which is normally what is intended; it | |
756 also means that East Asian Han characters from different XEmacs | |
757 character sets will not be distinct when written to disk, which is | |
758 less often what is intended. | |
759 | |
760 Any octets not mapped, and with values above #x7f, will be decoded into | |
761 XEmacs characters that reflect that their values are undefined. These | |
762 characters will be displayed in a language-environment-specific way. See | |
763 `unicode-error-default-translation-table' and the | |
764 `invalid-sequence-coding-system' argument to `set-language-info'. | |
765 | |
766 These characters will normally be treated as invalid when checking whether | |
767 text can be encoded with `query-coding-region'--see the | |
768 IGNORE-INVALID-SEQUENCESP argument to that function to avoid this. It is | |
769 possible to specify that octets with values less than #x80 (or indeed | |
770 greater than it) be treated in this way, by specifying explicitly that they | |
771 correspond to the character mapping to that octet in | |
772 `unicode-error-default-translation-table'. Far fewer coding systems | |
773 override the ASCII mapping, though, so this is not the default. | |
774 | |
775 DESCRIPTION and PROPS are as in `make-coding-system', which see. This | |
776 function also accepts two additional (optional) properties in PROPS; | |
777 `aliases', giving a list of aliases to be initialized for this | |
778 coding-system, and `encode-failure-octet', an integer between 0 and 256 to | |
779 write in place of XEmacs characters that cannot be encoded, defaulting to | |
780 the code for tilde `~'. " | |
781 (check-argument-type #'symbolp name) | |
782 (check-argument-type #'listp unicode-map) | |
783 (check-argument-type #'stringp | |
784 (or description | |
785 (setq description | |
786 (format "Coding system used for %s." name)))) | |
787 (check-valid-plist props) | |
788 (let ((encode-failure-octet (or (plist-get props 'encode-failure-octet) | |
789 (char-to-int ?~))) | |
790 (aliases (plist-get props 'aliases)) | |
791 (hash-table-sym (gentemp (format "%s-encode-table" name))) | |
792 encode-program decode-program result decode-table encode-table | |
793 skip-chars invalid-sequences-skip-chars) | |
794 | |
795 ;; Some more sanity checking. | |
796 (check-argument-range encode-failure-octet 0 #xFF) | |
797 (check-argument-type #'listp aliases) | |
798 | |
799 ;; Don't pass on our extra data to make-coding-system. | |
800 (setq props (plist-remprop props 'encode-failure-octet) | |
801 props (plist-remprop props 'aliases)) | |
802 | |
803 (multiple-value-setq | |
804 (decode-table encode-table) | |
805 (make-8-bit-create-decode-encode-tables unicode-map)) | |
806 | |
807 ;; Register the decode-table. | |
808 (define-translation-hash-table hash-table-sym encode-table) | |
809 | |
810 ;; Generate the programs and skip-chars strings. | |
811 (setq decode-program (make-8-bit-generate-decode-program decode-table)) | |
812 (multiple-value-setq | |
813 (encode-program skip-chars invalid-sequences-skip-chars) | |
814 (make-8-bit-generate-encode-program-and-skip-chars-strings | |
815 decode-table encode-table encode-failure-octet)) | |
816 | |
817 (unless (vectorp encode-program) | |
818 (setq encode-program | |
819 (apply #'vector | |
820 (nsublis (list (cons 'encode-table-sym hash-table-sym)) | |
821 (copy-tree encode-program))))) | |
822 (unless (vectorp decode-program) | |
823 (setq decode-program | |
824 (apply #'vector decode-program))) | |
825 | |
826 ;; And now generate the actual coding system. | |
827 (setq result | |
828 (make-coding-system | |
829 name 'ccl | |
830 description | |
831 (plist-put (plist-put props 'decode decode-program) | |
832 'encode encode-program))) | |
833 (coding-system-put name '8-bit-fixed t) | |
834 (coding-system-put name 'category | |
835 (make-8-bit-choose-category decode-table)) | |
836 (coding-system-put name '8-bit-fixed-query-skip-chars | |
837 skip-chars) | |
838 (coding-system-put name '8-bit-fixed-invalid-sequences-skip-chars | |
839 invalid-sequences-skip-chars) | |
840 (coding-system-put name '8-bit-fixed-query-from-unicode encode-table) | |
841 (coding-system-put name 'query-coding-function | |
842 #'8-bit-fixed-query-coding-region) | |
843 (coding-system-put (intern (format "%s-unix" name)) | |
844 'query-coding-function | |
845 #'8-bit-fixed-query-coding-region) | |
846 (coding-system-put (intern (format "%s-dos" name)) | |
847 'query-coding-function | |
848 #'8-bit-fixed-query-coding-region) | |
849 (coding-system-put (intern (format "%s-mac" name)) | |
850 'query-coding-function | |
851 #'8-bit-fixed-query-coding-region) | |
852 (loop for alias in aliases | |
853 do (define-coding-system-alias alias name)) | |
854 result)) | |
855 | |
856 (define-compiler-macro make-8-bit-coding-system (&whole form name unicode-map | |
857 &optional description props) | |
858 ;; We provide the compiler macro (= macro that is expanded only on | |
859 ;; compilation, and that can punt to a runtime version of the | |
860 ;; associate function if necessary) not for reasons of speed, though | |
861 ;; it does speed up things at runtime a little, but because the | |
862 ;; Unicode mappings are available at compile time in the dumped | |
863 ;; files, but they are not available at run time for the vast | |
864 ;; majority of them. | |
865 | |
866 (if (not (and (and (consp name) (eq (car name) 'quote)) | |
867 (and (consp unicode-map) (eq (car unicode-map) 'quote)) | |
868 (and (or (and (consp props) (eq (car props) 'quote)) | |
869 (null props))))) | |
870 ;; The call does not use literals; do it at runtime. | |
871 form | |
872 (setq name (cadr name) | |
873 unicode-map (cadr unicode-map) | |
874 props (if props (cadr props))) | |
875 (let ((encode-failure-octet | |
876 (or (plist-get props 'encode-failure-octet) (char-to-int ?~))) | |
877 (aliases (plist-get props 'aliases)) | |
878 encode-program decode-program | |
879 decode-table encode-table | |
880 skip-chars invalid-sequences-skip-chars) | |
881 | |
882 ;; Some sanity checking. | |
883 (check-argument-range encode-failure-octet 0 #xFF) | |
884 (check-argument-type #'listp aliases) | |
885 | |
886 ;; Don't pass on our extra data to make-coding-system. | |
887 (setq props (plist-remprop props 'encode-failure-octet) | |
888 props (plist-remprop props 'aliases)) | |
889 | |
890 ;; Work out encode-table and decode-table | |
891 (multiple-value-setq | |
892 (decode-table encode-table) | |
893 (make-8-bit-create-decode-encode-tables unicode-map)) | |
894 | |
895 ;; Generate the decode and encode programs, and the skip-chars | |
896 ;; arguments. | |
897 (setq decode-program (make-8-bit-generate-decode-program decode-table)) | |
898 (multiple-value-setq | |
899 (encode-program skip-chars invalid-sequences-skip-chars) | |
900 (make-8-bit-generate-encode-program-and-skip-chars-strings | |
901 decode-table encode-table encode-failure-octet)) | |
902 | |
903 ;; And return the generated code. | |
904 `(let ((encode-table-sym (gentemp (format "%s-encode-table" ',name))) | |
905 (encode-table ,encode-table)) | |
906 (define-translation-hash-table encode-table-sym encode-table) | |
907 (make-coding-system | |
908 ',name 'ccl ,description | |
909 (plist-put (plist-put ',props 'decode | |
910 ,(apply #'vector decode-program)) | |
911 'encode | |
912 (apply #'vector | |
913 (nsublis | |
914 (list (cons | |
915 'encode-table-sym | |
916 (symbol-value 'encode-table-sym))) | |
917 ',encode-program)))) | |
918 (coding-system-put ',name '8-bit-fixed t) | |
919 (coding-system-put ',name 'category | |
920 ',(make-8-bit-choose-category decode-table)) | |
921 (coding-system-put ',name '8-bit-fixed-query-skip-chars | |
922 ,skip-chars) | |
923 (coding-system-put ',name '8-bit-fixed-invalid-sequences-skip-chars | |
924 ,invalid-sequences-skip-chars) | |
925 (coding-system-put ',name '8-bit-fixed-query-from-unicode encode-table) | |
926 (coding-system-put ',name 'query-coding-function | |
927 #'8-bit-fixed-query-coding-region) | |
928 (coding-system-put ',(intern (format "%s-unix" name)) | |
929 'query-coding-function | |
930 #'8-bit-fixed-query-coding-region) | |
931 (coding-system-put ',(intern (format "%s-dos" name)) | |
932 'query-coding-function | |
933 #'8-bit-fixed-query-coding-region) | |
934 (coding-system-put ',(intern (format "%s-mac" name)) | |
935 'query-coding-function | |
936 #'8-bit-fixed-query-coding-region) | |
937 ,(macroexpand `(loop for alias in ',aliases | |
938 do (define-coding-system-alias alias | |
939 ',name))) | |
940 (find-coding-system ',name))))) | |
941 | |
942 ;; Ideally this would be in latin.el, but code-init.el uses it. | 222 ;; Ideally this would be in latin.el, but code-init.el uses it. |
943 (make-8-bit-coding-system | 223 (make-coding-system |
944 'iso-8859-1 | 224 'iso-8859-1 |
945 (loop | 225 'fixed-width |
946 for i from #x80 to #xff | |
947 collect (list i (int-char i))) ;; Identical to Latin-1. | |
948 "ISO-8859-1 (Latin-1)" | 226 "ISO-8859-1 (Latin-1)" |
949 '(mnemonic "Latin 1" | 227 (eval-when-compile |
950 documentation "The most used encoding of Western Europe and the Americas." | 228 `(unicode-map |
951 aliases (iso-latin-1 latin-1))) | 229 ,(loop |
230 for i from #x80 to #xff | |
231 collect (list i (int-char i))) ;; Identical to Latin-1. | |
232 mnemonic "Latin 1" | |
233 documentation "The most used encoding of Western Europe and the Americas." | |
234 aliases (iso-latin-1 latin-1)))) |