xemacs-beta: lisp/unicode.el comparison

comparison lisp/unicode.el @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.

author	Aidan Kehoe <kehoea@parhasard.net>
date	Sat, 19 Sep 2009 22:53:13 +0100
parents	75e7ab37b6c8
children	e29fcfd8df5f

comparison

equal deleted inserted replaced

-:0636c6ccb430
+:257b468bf2ca
 	composite ethiopic indian-1-column indian-2-column jit-ucs-charset-0
 	katakana-jisx0201 lao thai-tis620 thai-xtis tibetan tibetan-1-column
 	latin-jisx0201 chinese-cns11643-3 chinese-cns11643-4
 	chinese-cns11643-5 chinese-cns11643-6 chinese-cns11643-7)))))
-(make-coding-system
-'utf-16 'unicode
-"UTF-16"
-'(mnemonic "UTF-16"
-documentation
-"UTF-16 Unicode encoding -- the standard (almost-) fixed-width
-two-byte encoding, with surrogates.  It will be fixed-width if all
-characters are in the BMP (Basic Multilingual Plane -- first 65536
-codepoints).  Cannot represent characters with codepoints above
-0x10FFFF (a little more than 1,000,000).  Unicode and ISO guarantee
-never to encode any characters outside this range -- all the rest are
-for private, corporate or internal use."
-unicode-type utf-16))
-(define-coding-system-alias 'utf-16-be 'utf-16)
-(make-coding-system
-'utf-16-bom 'unicode
-"UTF-16 w/BOM"
-'(mnemonic "UTF16-BOM"
-documentation
-"UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
-The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
-0xFE and 0xFF, respectively, or reversed in a little-endian
-representation.  It has been sanctioned by the Unicode Consortium for
-use at the beginning of a Unicode stream as a marker of the byte order
-of the stream, and commonly appears in Unicode files under Microsoft
-Windows, where it also functions as a magic cookie identifying a
-Unicode file.  The character is called \"ZERO WIDTH NO-BREAK SPACE\"
-and is suitable as a byte-order marker because:
--- it has no displayable representation
--- due to its semantics it never normally appears at the beginning
-of a stream
--- its reverse U+FFFE is not a legal Unicode character
--- neither byte sequence is at all likely in any other standard
-encoding, particularly at the beginning of a stream
-This coding system will insert a BOM at the beginning of a stream when
-writing and strip it off when reading."
-unicode-type utf-16
-need-bom t))
-(make-coding-system
-'utf-16-little-endian 'unicode
-"UTF-16 Little Endian"
-'(mnemonic "UTF16-LE"
-documentation
-"Little-endian version of UTF-16 Unicode encoding.
-See `utf-16' coding system."
-unicode-type utf-16
-little-endian t))
-(define-coding-system-alias 'utf-16-le 'utf-16-little-endian)
-(make-coding-system
-'utf-16-little-endian-bom 'unicode
-"UTF-16 Little Endian w/BOM"
-'(mnemonic "MSW-Unicode"
-documentation
-"Little-endian version of UTF-16 Unicode encoding, with byte order mark.
-Standard encoding for representing Unicode under MS Windows.  See
-`utf-16-bom' coding system."
-unicode-type utf-16
-little-endian t
-need-bom t))
-(make-coding-system
-'ucs-4 'unicode
-"UCS-4"
-'(mnemonic "UCS4"
-documentation
-"UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
-unicode-type ucs-4))
-(make-coding-system
-'ucs-4-little-endian 'unicode
-"UCS-4 Little Endian"
-'(mnemonic "UCS4-LE"
-documentation
-;; #### I don't think this is permitted by ISO 10646, only Unicode.
-;; Call it UTF-32 instead?
-"Little-endian version of UCS-4 Unicode encoding.  See `ucs-4' coding system."
-unicode-type ucs-4
-little-endian t))
-(make-coding-system
-'utf-32 'unicode
-"UTF-32"
-'(mnemonic "UTF32"
-documentation
-"UTF-32 Unicode encoding -- fixed-width four-byte encoding,
-characters less than #x10FFFF are not supported.  "
-unicode-type utf-32))
-(make-coding-system
-'utf-32-little-endian 'unicode
-"UTF-32 Little Endian"
-'(mnemonic "UTF32-LE"
-documentation
-"Little-endian version of UTF-32 Unicode encoding.
-A fixed-width four-byte encoding, characters less than #x10FFFF are not
-supported.  "
-unicode-type ucs-4 little-endian t))
-(make-coding-system
-'utf-8 'unicode
-"UTF-8"
-'(mnemonic "UTF8"
-documentation "
-UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
-sharing the following principles with the Mule-internal encoding:
--- All ASCII characters (codepoints 0 through 127) are represented
-by themselves (i.e. using one byte, with the same value as the
-ASCII codepoint), and these bytes are disjoint from bytes
-representing non-ASCII characters.
-This means that any 8-bit clean application can safely process
-UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
-'/' byte is always a slash character, never the second byte of
-some other character, as with Big5, so a pathname encoded in
-UTF-8 can safely be split up into components and reassembled
-again using standard ASCII processes).
--- Leading bytes and non-leading bytes in the encoding of a
-character are disjoint, so moving backwards is easy.
--- Given only the leading byte, you know how many following bytes
-are present.
-"
-unicode-type utf-8))
-(make-coding-system
-'utf-8-bom 'unicode
-"UTF-8 w/BOM"
-'(mnemonic "MSW-UTF8"
-documentation
-"UTF-8 Unicode encoding, with byte order mark.
-Standard encoding for representing UTF-8 under MS Windows."
-unicode-type utf-8
-little-endian t
-need-bom t))
-(defun decode-char (quote-ucs code &optional restriction)
-"FSF compatibility--return Mule character with Unicode codepoint CODE.
-The second argument must be 'ucs, the third argument is ignored.  "
-;; We're prepared to accept invalid Unicode in unicode-to-char, but not in
-;; this function, which is the API that should actually be used, since
-;; it's available in GNU and in Mule-UCS.
-(check-argument-range code #x0 #x10FFFF)
-(assert (eq quote-ucs 'ucs) t
-	  "Sorry, decode-char doesn't yet support anything but the UCS.  ")
-(unicode-to-char code))
-(defun encode-char (char quote-ucs &optional restriction)
-"FSF compatibility--return the Unicode code point of CHAR.
-The second argument must be 'ucs, the third argument is ignored.  "
-(assert (eq quote-ucs 'ucs) t
-	  "Sorry, encode-char doesn't yet support anything but the UCS.  ")
-(char-to-unicode char))
 (defconst ccl-encode-to-ucs-2
 (eval-when-compile
 (let ((pre-existing
 ;; This is the compiled CCL program from the assert
 ;; below. Since this file is dumped and ccl.el isn't (and
 (when (featurep 'mule)
 (put 'ccl-encode-to-ucs-2 'ccl-program-idx
 (declare-fboundp
 	(register-ccl-program 'ccl-encode-to-ucs-2 ccl-encode-to-ucs-2))))
+(defun decode-char (quote-ucs code &optional restriction)
+"FSF compatibility--return Mule character with Unicode codepoint CODE.
+The second argument must be 'ucs, the third argument is ignored.  "
+;; We're prepared to accept invalid Unicode in unicode-to-char, but not in
+;; this function, which is the API that should actually be used, since
+;; it's available in GNU and in Mule-UCS.
+(check-argument-range code #x0 #x10FFFF)
+(assert (eq quote-ucs 'ucs) t
+	  "Sorry, decode-char doesn't yet support anything but the UCS.  ")
+(unicode-to-char code))
+(defun encode-char (char quote-ucs &optional restriction)
+"FSF compatibility--return the Unicode code point of CHAR.
+The second argument must be 'ucs, the third argument is ignored.  "
+(assert (eq quote-ucs 'ucs) t
+	  "Sorry, encode-char doesn't yet support anything but the UCS.  ")
+(char-to-unicode char))
+(make-coding-system
+'utf-16 'unicode
+"UTF-16"
+'(mnemonic "UTF-16"
+documentation
+"UTF-16 Unicode encoding -- the standard (almost-) fixed-width
+two-byte encoding, with surrogates.  It will be fixed-width if all
+characters are in the BMP (Basic Multilingual Plane -- first 65536
+codepoints).  Cannot represent characters with codepoints above
+0x10FFFF (a little more than 1,000,000).  Unicode and ISO guarantee
+never to encode any characters outside this range -- all the rest are
+for private, corporate or internal use."
+unicode-type utf-16))
+(define-coding-system-alias 'utf-16-be 'utf-16)
+(make-coding-system
+'utf-16-bom 'unicode
+"UTF-16 w/BOM"
+'(mnemonic "UTF16-BOM"
+documentation
+"UTF-16 Unicode encoding with byte order mark (BOM) at the beginning.
+The BOM is Unicode character U+FEFF -- i.e. the first two bytes are
+0xFE and 0xFF, respectively, or reversed in a little-endian
+representation.  It has been sanctioned by the Unicode Consortium for
+use at the beginning of a Unicode stream as a marker of the byte order
+of the stream, and commonly appears in Unicode files under Microsoft
+Windows, where it also functions as a magic cookie identifying a
+Unicode file.  The character is called \"ZERO WIDTH NO-BREAK SPACE\"
+and is suitable as a byte-order marker because:
+-- it has no displayable representation
+-- due to its semantics it never normally appears at the beginning
+of a stream
+-- its reverse U+FFFE is not a legal Unicode character
+-- neither byte sequence is at all likely in any other standard
+encoding, particularly at the beginning of a stream
+This coding system will insert a BOM at the beginning of a stream when
+writing and strip it off when reading."
+unicode-type utf-16
+need-bom t))
+(make-coding-system
+'utf-16-little-endian 'unicode
+"UTF-16 Little Endian"
+'(mnemonic "UTF16-LE"
+documentation
+"Little-endian version of UTF-16 Unicode encoding.
+See `utf-16' coding system."
+unicode-type utf-16
+little-endian t))
+(define-coding-system-alias 'utf-16-le 'utf-16-little-endian)
+(make-coding-system
+'utf-16-little-endian-bom 'unicode
+"UTF-16 Little Endian w/BOM"
+'(mnemonic "MSW-Unicode"
+documentation
+"Little-endian version of UTF-16 Unicode encoding, with byte order mark.
+Standard encoding for representing Unicode under MS Windows.  See
+`utf-16-bom' coding system."
+unicode-type utf-16
+little-endian t
+need-bom t))
+(make-coding-system
+'ucs-4 'unicode
+"UCS-4"
+'(mnemonic "UCS4"
+documentation
+"UCS-4 Unicode encoding -- fully fixed-width four-byte encoding."
+unicode-type ucs-4))
+(make-coding-system
+'ucs-4-little-endian 'unicode
+"UCS-4 Little Endian"
+'(mnemonic "UCS4-LE"
+documentation
+;; #### I don't think this is permitted by ISO 10646, only Unicode.
+;; Call it UTF-32 instead?
+"Little-endian version of UCS-4 Unicode encoding.  See `ucs-4' coding system."
+unicode-type ucs-4
+little-endian t))
+(make-coding-system
+'utf-32 'unicode
+"UTF-32"
+'(mnemonic "UTF32"
+documentation
+"UTF-32 Unicode encoding -- fixed-width four-byte encoding,
+characters less than #x10FFFF are not supported.  "
+unicode-type utf-32))
+(make-coding-system
+'utf-32-little-endian 'unicode
+"UTF-32 Little Endian"
+'(mnemonic "UTF32-LE"
+documentation
+"Little-endian version of UTF-32 Unicode encoding.
+A fixed-width four-byte encoding, characters less than #x10FFFF are not
+supported.  "
+unicode-type ucs-4 little-endian t))
+(make-coding-system
+'utf-8 'unicode
+"UTF-8"
+'(mnemonic "UTF8"
+documentation "
+UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding
+sharing the following principles with the Mule-internal encoding:
+-- All ASCII characters (codepoints 0 through 127) are represented
+by themselves (i.e. using one byte, with the same value as the
+ASCII codepoint), and these bytes are disjoint from bytes
+representing non-ASCII characters.
+This means that any 8-bit clean application can safely process
+UTF-8-encoded text as it were ASCII, with no corruption (e.g. a
+'/' byte is always a slash character, never the second byte of
+some other character, as with Big5, so a pathname encoded in
+UTF-8 can safely be split up into components and reassembled
+again using standard ASCII processes).
+-- Leading bytes and non-leading bytes in the encoding of a
+character are disjoint, so moving backwards is easy.
+-- Given only the leading byte, you know how many following bytes
+are present.
+"
+unicode-type utf-8))
+(make-coding-system
+'utf-8-bom 'unicode
+"UTF-8 w/BOM"
+'(mnemonic "MSW-UTF8"
+documentation
+"UTF-8 Unicode encoding, with byte order mark.
+Standard encoding for representing UTF-8 under MS Windows."
+unicode-type utf-8
+little-endian t
+need-bom t))
 ;; Now, create jit-ucs-charset-0 entries for those characters in Windows
 ;; Glyph List 4 that would otherwise end up in East Asian character sets.
 ;;
 ;; WGL4 is a character repertoire from Microsoft that gives a guideline
 begin end buffer))
 ;; Sure would be nice to be able to use defface here.
 (copy-face 'highlight 'unicode-invalid-sequence-warning-face)
-(defvar unicode-query-coding-skip-chars-arg nil ;; Set in general-late.el
-"Used by `unicode-query-coding-region' to skip chars with known mappings.")
-(defun unicode-query-coding-region (begin end coding-system
-				    &optional buffer ignore-invalid-sequencesp
-errorp highlightp)
-"The `query-coding-region' implementation for Unicode coding systems.
-Supports IGNORE-INVALID-SEQUENCESP, that is, XEmacs characters that reflect
-invalid octets on disk will be treated as encodable if this argument is
-specified, and as not encodable if it is not specified."
-;; Potential problem here; the octets that correspond to octets from #x00
-;; to #x7f on disk will be treated by utf-8 and utf-7 as invalid
-;; sequences, and thus, in theory, encodable.
-(check-argument-type #'coding-system-p
-(setq coding-system (find-coding-system coding-system)))
-(check-argument-type #'integer-or-marker-p begin)
-(check-argument-type #'integer-or-marker-p end)
-(let* ((skip-chars-arg (concat unicode-query-coding-skip-chars-arg
-				 (if ignore-invalid-sequencesp
-				     unicode-invalid-sequence-regexp-range
-				   "")))
-(ranges (make-range-table))
-(looking-at-arg (concat "[" skip-chars-arg "]"))
-(case-fold-search nil)
-	 (invalid-sequence-lower-unicode-bound
-	  (char-to-unicode
-	   (aref (decode-coding-string "\xd8\x00\x00\x00"
-				       'utf-16-be) 3)))
-	  (invalid-sequence-upper-unicode-bound
-	   (char-to-unicode
-	    (aref (decode-coding-string "\xd8\x00\x00\xFF"
-					'utf-16-be) 3)))
-fail-range-start fail-range-end char-after failed
-	 extent char-unicode failed-reason previous-failed-reason)
-(save-excursion
-(when highlightp
-(query-coding-clear-highlights begin end buffer))
-(goto-char begin buffer)
-(skip-chars-forward skip-chars-arg end buffer)
-(while (< (point buffer) end)
-(setq char-after (char-after (point buffer) buffer)
-fail-range-start (point buffer))
-(while (and
-(< (point buffer) end)
-(not (looking-at looking-at-arg))
-(or (and
-(= -1 (setq char-unicode (char-to-unicode char-after)))
-(setq failed-reason 'unencodable))
-(and (not ignore-invalid-sequencesp)
-;; The default case, with ignore-invalid-sequencesp
-;; not specified:
-;; If the character is in the Unicode range that
-;; corresponds to an invalid octet, we want to
-;; treat it as unencodable.
-(<= invalid-sequence-lower-unicode-bound
-char-unicode)
-(<= char-unicode
-			     invalid-sequence-upper-unicode-bound)
-(setq failed-reason 'invalid-sequence)))
-(or (null previous-failed-reason)
-(eq previous-failed-reason failed-reason)))
-(forward-char 1 buffer)
-(setq char-after (char-after (point buffer) buffer)
-failed t
-previous-failed-reason failed-reason))
-(if (= fail-range-start (point buffer))
-;; The character can actually be encoded by the coding
-;; system; check the characters past it.
-	    (forward-char 1 buffer)
-;; Can't be encoded; note this.
-(when errorp
-(error 'text-conversion-error
-(format "Cannot encode %s using coding system"
-(buffer-substring fail-range-start (point buffer)
-buffer))
-(coding-system-name coding-system)))
-(assert
-(not (null previous-failed-reason)) t
-"If we've got here, previous-failed-reason should be non-nil.")
-(put-range-table fail-range-start
-;; If char-after is non-nil, we're not at
-;; the end of the buffer.
-(setq fail-range-end (if char-after
-(point buffer)
-(point-max buffer)))
-previous-failed-reason ranges)
-(setq previous-failed-reason nil)
-(when highlightp
-(setq extent (make-extent fail-range-start fail-range-end buffer))
-(set-extent-priority extent (+ mouse-highlight-priority 2))
-(set-extent-face extent 'query-coding-warning-face)))
-(skip-chars-forward skip-chars-arg end buffer))
-(if failed
-(values nil ranges)
-(values t nil)))))
-(loop
-for coding-system in (coding-system-list)
-initially (unless (featurep 'mule) (return))
-do (when (eq 'unicode (coding-system-type coding-system))
-(coding-system-put coding-system 'query-coding-function
-			  #'unicode-query-coding-region)))
 (unless (featurep 'mule)
 ;; We do this in such a roundabout way--instead of having the above defun
 ;; and defvar calls inside a (when (featurep 'mule) ...) form--to have
 ;; make-docfile.c pick up symbol and function documentation correctly. An
 ;; alternative approach would be to fix make-docfile.c to be able to read

Mercurial > hg > xemacs-beta

comparison lisp/unicode.el @ 4690:257b468bf2ca