comparison lisp/unicode.el @ 4690:257b468bf2ca
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we need such
functionality if we're going to have a reliable and portable
#'query-coding-region implementation. However, this change doesn't yet
provide #'query-coding-region for the mswindows-multibyte coding systems;
there should be no functional differences between an XEmacs with this change
and one without it.
src/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we
need such functionality if we're going to have a reliable and
portable #'query-coding-region implementation. However, this
change doesn't yet provide #'query-coding-region for the
mswindows-multibyte coding systems; there should be no functional
differences between an XEmacs with this change and one without it.
* mule-coding.c (struct fixed_width_coding_system):
Add a new coding system type, fixed_width, and implement it. It
uses the CCL infrastructure but has a much simpler creation API,
and its own query_method, formerly in lisp/mule/mule-coding.el.
* unicode.c:
Move the Unicode query method implementation here from
unicode.el.
* lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table
here.
* intl-win32.c (complex_vars_of_intl_win32):
Use Fmake_coding_system_internal, not Fmake_coding_system.
* general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence
here.
* file-coding.h (enum coding_system_variant):
Add fixed_width_coding_system here.
(struct coding_system_methods):
Add query_method and query_lstream_method to the coding system
methods.
Provide flags for the query methods.
Declare the default query method; initialise it correctly in
INITIALIZE_CODING_SYSTEM_TYPE.
* file-coding.c (default_query_method):
New function, the default query method for coding systems that do
not set it. Moved from coding.el.
(make_coding_system_1):
Accept new elements in PROPS in #'make-coding-system; aliases, a
list of aliases; safe-chars and safe-charsets (these were
previously accepted but not saved); and category.
(Fmake_coding_system_internal):
New function, what used to be #'make-coding-system--on Mule
builds, we've now moved some of the functionality of this to
Lisp.
(Fcoding_system_canonical_name_p):
Move this earlier in the file, since it's now called from within
make_coding_system_1.
(Fquery_coding_region):
Move the implementation of this here, from coding.el.
(complex_vars_of_file_coding):
Call Fmake_coding_system_internal, not Fmake_coding_system;
specify safe-charsets properties when we're a mule build.
* extents.h (mouse_highlight_priority, Fset_extent_priority,
Fset_extent_face, Fmap_extents):
Make these available to other C files.
lisp/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
* coding.el:
Consolidate code that depends on the presence or absence of Mule
at the end of this file.
(default-query-coding-region, query-coding-region):
Move these functions to C.
(default-query-coding-region-safe-charset-skip-chars-map):
Remove this variable, the corresponding C variable is
Vdefault_query_coding_region_chartab_cache in file-coding.c.
(query-coding-string): Update docstring to reflect actual multiple
values, be more careful about not modifying a range table that
we're currently mapping over.
(encode-coding-char): Make the implementation of this simpler.
(featurep 'mule): Autoload #'make-coding-system from
mule/make-coding-system.el if we're a mule build; provide an
appropriate compiler macro.
Do various non-mule compatibility things if we're not a mule
build.
* update-elc.el (additional-dump-dependencies):
Add mule/make-coding-system as a dump time dependency if we're a
mule build.
* unicode.el (ccl-encode-to-ucs-2):
(decode-char):
(encode-char):
Move these earlier in the file, for the sake of some byte compile
warnings.
(unicode-query-coding-region):
Move this to unicode.c.
* mule/make-coding-system.el:
New file, not dumped. Contains the functionality to rework the
arguments necessary for fixed-width coding systems, and contains
the implementation of #'make-coding-system, which now calls
#'make-coding-system-internal.
* mule/vietnamese.el (viscii):
* mule/latin.el (iso-8859-2):
(windows-1250):
(iso-8859-3):
(iso-8859-4):
(iso-8859-14):
(iso-8859-15):
(iso-8859-16):
(iso-8859-9):
(macintosh):
(windows-1252):
* mule/hebrew.el (iso-8859-8):
* mule/greek.el (iso-8859-7):
(windows-1253):
* mule/cyrillic.el (iso-8859-5):
(koi8-r):
(koi8-u):
(windows-1251):
(alternativnyj):
(koi8-ru):
(koi8-t):
(koi8-c):
(koi8-o):
* mule/arabic.el (iso-8859-6):
(windows-1256):
Move all these coding systems to being of type fixed-width, not of
type CCL. This allows the distinct query-coding-region for them to
be in C, something which will eventually allow us to implement
query-coding-region for the mswindows-multibyte coding systems.
* mule/general-late.el (posix-charset-to-coding-system-hash):
Document why we're pre-emptively persuading the byte compiler that
the ELC for this file needs to be written using escape-quoted.
Call #'set-unicode-query-skip-chars-args, now the Unicode
query-coding-region implementation is in C.
* mule/thai-xtis.el (tis-620):
Don't bother checking whether we're XEmacs or not here.
* mule/mule-coding.el:
Move the eight bit fixed-width functionality from this file to
make-coding-system.el.
tests/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
* automated/mule-tests.el:
Check a coding system's type, not an 8-bit-fixed property, for
whether that coding system should be treated as a fixed-width
coding system.
* automated/query-coding-tests.el:
Don't test the query coding functionality for mswindows-multibyte
coding systems, it's not yet implemented.
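For readers who have not followed the Lisp version being removed from unicode.el, the scan it performs can be sketched in C roughly as follows. This is only an illustration of the technique (walk the text, classify each character, and merge consecutive failures that share a reason into one range); it is not the code added to unicode.c or file-coding.c. Every name here (query_result, fail_range, classify_latin1, query_code_points) is hypothetical, and the classifier's rules (code points up to 0xFF are encodable, surrogate code points count as invalid sequences) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes. They echo the succeeded, unencodable and
   invalid-sequence symbols named in the ChangeLog, but are not the
   XEmacs definitions. */
enum query_result {
  QUERY_SUCCEEDED,
  QUERY_UNENCODABLE,
  QUERY_INVALID_SEQUENCE
};

/* One run of consecutive characters that failed for the same reason. */
struct fail_range {
  size_t start;              /* index of the first failing character */
  size_t end;                /* one past the last failing character */
  enum query_result reason;
};

typedef enum query_result (*classify_fn)(unsigned long code_point);

/* Assumed example classifier: a Latin-1-like coding system. Surrogate
   code points stand in for "invalid sequence" characters here purely
   for illustration. */
enum query_result classify_latin1(unsigned long cp)
{
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return QUERY_INVALID_SEQUENCE;
  return cp <= 0xFF ? QUERY_SUCCEEDED : QUERY_UNENCODABLE;
}

/* Scan TEXT and record each run of failing characters in OUT.
   Returns the number of runs found, which may exceed OUT_CAP; only the
   first OUT_CAP runs are stored. A return value of 0 means the whole
   text is encodable. */
size_t query_code_points(const unsigned long *text, size_t len,
                         classify_fn classify,
                         struct fail_range *out, size_t out_cap)
{
  size_t found = 0;
  size_t i = 0;

  assert(classify != NULL);
  assert(text != NULL || len == 0);
  assert(out != NULL || out_cap == 0);

  while (i < len) {
    enum query_result reason = classify(text[i]);
    if (reason == QUERY_SUCCEEDED) {
      i++;                   /* encodable: skip, like skip-chars-forward */
      continue;
    }
    /* Extend the run while the failure reason stays the same. */
    size_t start = i;
    while (i < len && classify(text[i]) == reason)
      i++;
    if (found < out_cap) {
      out[found].start = start;
      out[found].end = i;
      out[found].reason = reason;
    }
    found++;
  }
  return found;
}
```

A caller would pass an array of code points and a buffer of fail_range entries; a change of reason in the middle of a run starts a new range, which mirrors the previous-failed-reason check in the Lisp loop.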
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 19 Sep 2009 22:53:13 +0100 |
parents | 75e7ab37b6c8 |
children | e29fcfd8df5f |
4689:0636c6ccb430 | 4690:257b468bf2ca |
---|---|
162 composite ethiopic indian-1-column indian-2-column jit-ucs-charset-0 | 162 composite ethiopic indian-1-column indian-2-column jit-ucs-charset-0 |
163 katakana-jisx0201 lao thai-tis620 thai-xtis tibetan tibetan-1-column | 163 katakana-jisx0201 lao thai-tis620 thai-xtis tibetan tibetan-1-column |
164 latin-jisx0201 chinese-cns11643-3 chinese-cns11643-4 | 164 latin-jisx0201 chinese-cns11643-3 chinese-cns11643-4 |
165 chinese-cns11643-5 chinese-cns11643-6 chinese-cns11643-7))))) | 165 chinese-cns11643-5 chinese-cns11643-6 chinese-cns11643-7))))) |
166 | 166 |
167 (make-coding-system | |
168 'utf-16 'unicode | |
169 "UTF-16" | |
170 '(mnemonic "UTF-16" | |
171 documentation | |
172 "UTF-16 Unicode encoding -- the standard (almost-) fixed-width | |
173 two-byte encoding, with surrogates. It will be fixed-width if all | |
174 characters are in the BMP (Basic Multilingual Plane -- first 65536 | |
175 codepoints). Cannot represent characters with codepoints above | |
176 0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee | |
177 never to encode any characters outside this range -- all the rest are | |
178 for private, corporate or internal use." | |
179 unicode-type utf-16)) | |
180 | |
181 (define-coding-system-alias 'utf-16-be 'utf-16) | |
182 | |
183 (make-coding-system | |
184 'utf-16-bom 'unicode | |
185 "UTF-16 w/BOM" | |
186 '(mnemonic "UTF16-BOM" | |
187 documentation | |
188 "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning. | |
189 The BOM is Unicode character U+FEFF -- i.e. the first two bytes are | |
190 0xFE and 0xFF, respectively, or reversed in a little-endian | |
191 representation. It has been sanctioned by the Unicode Consortium for | |
192 use at the beginning of a Unicode stream as a marker of the byte order | |
193 of the stream, and commonly appears in Unicode files under Microsoft | |
194 Windows, where it also functions as a magic cookie identifying a | |
195 Unicode file. The character is called \"ZERO WIDTH NO-BREAK SPACE\" | |
196 and is suitable as a byte-order marker because: | |
197 | |
198 -- it has no displayable representation | |
199 -- due to its semantics it never normally appears at the beginning | |
200 of a stream | |
201 -- its reverse U+FFFE is not a legal Unicode character | |
202 -- neither byte sequence is at all likely in any other standard | |
203 encoding, particularly at the beginning of a stream | |
204 | |
205 This coding system will insert a BOM at the beginning of a stream when | |
206 writing and strip it off when reading." | |
207 unicode-type utf-16 | |
208 need-bom t)) | |
209 | |
210 (make-coding-system | |
211 'utf-16-little-endian 'unicode | |
212 "UTF-16 Little Endian" | |
213 '(mnemonic "UTF16-LE" | |
214 documentation | |
215 "Little-endian version of UTF-16 Unicode encoding. | |
216 See `utf-16' coding system." | |
217 unicode-type utf-16 | |
218 little-endian t)) | |
219 | |
220 (define-coding-system-alias 'utf-16-le 'utf-16-little-endian) | |
221 | |
222 (make-coding-system | |
223 'utf-16-little-endian-bom 'unicode | |
224 "UTF-16 Little Endian w/BOM" | |
225 '(mnemonic "MSW-Unicode" | |
226 documentation | |
227 "Little-endian version of UTF-16 Unicode encoding, with byte order mark. | |
228 Standard encoding for representing Unicode under MS Windows. See | |
229 `utf-16-bom' coding system." | |
230 unicode-type utf-16 | |
231 little-endian t | |
232 need-bom t)) | |
233 | |
234 (make-coding-system | |
235 'ucs-4 'unicode | |
236 "UCS-4" | |
237 '(mnemonic "UCS4" | |
238 documentation | |
239 "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding." | |
240 unicode-type ucs-4)) | |
241 | |
242 (make-coding-system | |
243 'ucs-4-little-endian 'unicode | |
244 "UCS-4 Little Endian" | |
245 '(mnemonic "UCS4-LE" | |
246 documentation | |
247 ;; #### I don't think this is permitted by ISO 10646, only Unicode. | |
248 ;; Call it UTF-32 instead? | |
249 "Little-endian version of UCS-4 Unicode encoding. See `ucs-4' coding system." | |
250 unicode-type ucs-4 | |
251 little-endian t)) | |
252 | |
253 (make-coding-system | |
254 'utf-32 'unicode | |
255 "UTF-32" | |
256 '(mnemonic "UTF32" | |
257 documentation | |
258 "UTF-32 Unicode encoding -- fixed-width four-byte encoding, | |
259 characters less than #x10FFFF are not supported. " | |
260 unicode-type utf-32)) | |
261 | |
262 (make-coding-system | |
263 'utf-32-little-endian 'unicode | |
264 "UTF-32 Little Endian" | |
265 '(mnemonic "UTF32-LE" | |
266 documentation | |
267 "Little-endian version of UTF-32 Unicode encoding. | |
268 | |
269 A fixed-width four-byte encoding, characters less than #x10FFFF are not | |
270 supported. " | |
271 unicode-type ucs-4 little-endian t)) | |
272 | |
273 (make-coding-system | |
274 'utf-8 'unicode | |
275 "UTF-8" | |
276 '(mnemonic "UTF8" | |
277 documentation " | |
278 UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding | |
279 sharing the following principles with the Mule-internal encoding: | |
280 | |
281 -- All ASCII characters (codepoints 0 through 127) are represented | |
282 by themselves (i.e. using one byte, with the same value as the | |
283 ASCII codepoint), and these bytes are disjoint from bytes | |
284 representing non-ASCII characters. | |
285 | |
286 This means that any 8-bit clean application can safely process | |
287 UTF-8-encoded text as it were ASCII, with no corruption (e.g. a | |
288 '/' byte is always a slash character, never the second byte of | |
289 some other character, as with Big5, so a pathname encoded in | |
290 UTF-8 can safely be split up into components and reassembled | |
291 again using standard ASCII processes). | |
292 | |
293 -- Leading bytes and non-leading bytes in the encoding of a | |
294 character are disjoint, so moving backwards is easy. | |
295 | |
296 -- Given only the leading byte, you know how many following bytes | |
297 are present. | |
298 " | |
299 unicode-type utf-8)) | |
300 | |
301 (make-coding-system | |
302 'utf-8-bom 'unicode | |
303 "UTF-8 w/BOM" | |
304 '(mnemonic "MSW-UTF8" | |
305 documentation | |
306 "UTF-8 Unicode encoding, with byte order mark. | |
307 Standard encoding for representing UTF-8 under MS Windows." | |
308 unicode-type utf-8 | |
309 little-endian t | |
310 need-bom t)) | |
311 | |
312 (defun decode-char (quote-ucs code &optional restriction) | |
313 "FSF compatibility--return Mule character with Unicode codepoint CODE. | |
314 The second argument must be 'ucs, the third argument is ignored. " | |
315 ;; We're prepared to accept invalid Unicode in unicode-to-char, but not in | |
316 ;; this function, which is the API that should actually be used, since | |
317 ;; it's available in GNU and in Mule-UCS. | |
318 (check-argument-range code #x0 #x10FFFF) | |
319 (assert (eq quote-ucs 'ucs) t | |
320 "Sorry, decode-char doesn't yet support anything but the UCS. ") | |
321 (unicode-to-char code)) | |
322 | |
323 (defun encode-char (char quote-ucs &optional restriction) | |
324 "FSF compatibility--return the Unicode code point of CHAR. | |
325 The second argument must be 'ucs, the third argument is ignored. " | |
326 (assert (eq quote-ucs 'ucs) t | |
327 "Sorry, encode-char doesn't yet support anything but the UCS. ") | |
328 (char-to-unicode char)) | |
329 | |
330 (defconst ccl-encode-to-ucs-2 | 167 (defconst ccl-encode-to-ucs-2 |
331 (eval-when-compile | 168 (eval-when-compile |
332 (let ((pre-existing | 169 (let ((pre-existing |
333 ;; This is the compiled CCL program from the assert | 170 ;; This is the compiled CCL program from the assert |
334 ;; below. Since this file is dumped and ccl.el isn't (and | 171 ;; below. Since this file is dumped and ccl.el isn't (and |
368 | 205 |
369 (when (featurep 'mule) | 206 (when (featurep 'mule) |
370 (put 'ccl-encode-to-ucs-2 'ccl-program-idx | 207 (put 'ccl-encode-to-ucs-2 'ccl-program-idx |
371 (declare-fboundp | 208 (declare-fboundp |
372 (register-ccl-program 'ccl-encode-to-ucs-2 ccl-encode-to-ucs-2)))) | 209 (register-ccl-program 'ccl-encode-to-ucs-2 ccl-encode-to-ucs-2)))) |
210 | |
211 (defun decode-char (quote-ucs code &optional restriction) | |
212 "FSF compatibility--return Mule character with Unicode codepoint CODE. | |
213 The second argument must be 'ucs, the third argument is ignored. " | |
214 ;; We're prepared to accept invalid Unicode in unicode-to-char, but not in | |
215 ;; this function, which is the API that should actually be used, since | |
216 ;; it's available in GNU and in Mule-UCS. | |
217 (check-argument-range code #x0 #x10FFFF) | |
218 (assert (eq quote-ucs 'ucs) t | |
219 "Sorry, decode-char doesn't yet support anything but the UCS. ") | |
220 (unicode-to-char code)) | |
221 | |
222 (defun encode-char (char quote-ucs &optional restriction) | |
223 "FSF compatibility--return the Unicode code point of CHAR. | |
224 The second argument must be 'ucs, the third argument is ignored. " | |
225 (assert (eq quote-ucs 'ucs) t | |
226 "Sorry, encode-char doesn't yet support anything but the UCS. ") | |
227 (char-to-unicode char)) | |
228 | |
229 (make-coding-system | |
230 'utf-16 'unicode | |
231 "UTF-16" | |
232 '(mnemonic "UTF-16" | |
233 documentation | |
234 "UTF-16 Unicode encoding -- the standard (almost-) fixed-width | |
235 two-byte encoding, with surrogates. It will be fixed-width if all | |
236 characters are in the BMP (Basic Multilingual Plane -- first 65536 | |
237 codepoints). Cannot represent characters with codepoints above | |
238 0x10FFFF (a little more than 1,000,000). Unicode and ISO guarantee | |
239 never to encode any characters outside this range -- all the rest are | |
240 for private, corporate or internal use." | |
241 unicode-type utf-16)) | |
242 | |
243 (define-coding-system-alias 'utf-16-be 'utf-16) | |
244 | |
245 (make-coding-system | |
246 'utf-16-bom 'unicode | |
247 "UTF-16 w/BOM" | |
248 '(mnemonic "UTF16-BOM" | |
249 documentation | |
250 "UTF-16 Unicode encoding with byte order mark (BOM) at the beginning. | |
251 The BOM is Unicode character U+FEFF -- i.e. the first two bytes are | |
252 0xFE and 0xFF, respectively, or reversed in a little-endian | |
253 representation. It has been sanctioned by the Unicode Consortium for | |
254 use at the beginning of a Unicode stream as a marker of the byte order | |
255 of the stream, and commonly appears in Unicode files under Microsoft | |
256 Windows, where it also functions as a magic cookie identifying a | |
257 Unicode file. The character is called \"ZERO WIDTH NO-BREAK SPACE\" | |
258 and is suitable as a byte-order marker because: | |
259 | |
260 -- it has no displayable representation | |
261 -- due to its semantics it never normally appears at the beginning | |
262 of a stream | |
263 -- its reverse U+FFFE is not a legal Unicode character | |
264 -- neither byte sequence is at all likely in any other standard | |
265 encoding, particularly at the beginning of a stream | |
266 | |
267 This coding system will insert a BOM at the beginning of a stream when | |
268 writing and strip it off when reading." | |
269 unicode-type utf-16 | |
270 need-bom t)) | |
271 | |
272 (make-coding-system | |
273 'utf-16-little-endian 'unicode | |
274 "UTF-16 Little Endian" | |
275 '(mnemonic "UTF16-LE" | |
276 documentation | |
277 "Little-endian version of UTF-16 Unicode encoding. | |
278 See `utf-16' coding system." | |
279 unicode-type utf-16 | |
280 little-endian t)) | |
281 | |
282 (define-coding-system-alias 'utf-16-le 'utf-16-little-endian) | |
283 | |
284 (make-coding-system | |
285 'utf-16-little-endian-bom 'unicode | |
286 "UTF-16 Little Endian w/BOM" | |
287 '(mnemonic "MSW-Unicode" | |
288 documentation | |
289 "Little-endian version of UTF-16 Unicode encoding, with byte order mark. | |
290 Standard encoding for representing Unicode under MS Windows. See | |
291 `utf-16-bom' coding system." | |
292 unicode-type utf-16 | |
293 little-endian t | |
294 need-bom t)) | |
295 | |
296 (make-coding-system | |
297 'ucs-4 'unicode | |
298 "UCS-4" | |
299 '(mnemonic "UCS4" | |
300 documentation | |
301 "UCS-4 Unicode encoding -- fully fixed-width four-byte encoding." | |
302 unicode-type ucs-4)) | |
303 | |
304 (make-coding-system | |
305 'ucs-4-little-endian 'unicode | |
306 "UCS-4 Little Endian" | |
307 '(mnemonic "UCS4-LE" | |
308 documentation | |
309 ;; #### I don't think this is permitted by ISO 10646, only Unicode. | |
310 ;; Call it UTF-32 instead? | |
311 "Little-endian version of UCS-4 Unicode encoding. See `ucs-4' coding system." | |
312 unicode-type ucs-4 | |
313 little-endian t)) | |
314 | |
315 (make-coding-system | |
316 'utf-32 'unicode | |
317 "UTF-32" | |
318 '(mnemonic "UTF32" | |
319 documentation | |
320 "UTF-32 Unicode encoding -- fixed-width four-byte encoding, | |
321 characters less than #x10FFFF are not supported. " | |
322 unicode-type utf-32)) | |
323 | |
324 (make-coding-system | |
325 'utf-32-little-endian 'unicode | |
326 "UTF-32 Little Endian" | |
327 '(mnemonic "UTF32-LE" | |
328 documentation | |
329 "Little-endian version of UTF-32 Unicode encoding. | |
330 | |
331 A fixed-width four-byte encoding, characters less than #x10FFFF are not | |
332 supported. " | |
333 unicode-type ucs-4 little-endian t)) | |
334 | |
335 (make-coding-system | |
336 'utf-8 'unicode | |
337 "UTF-8" | |
338 '(mnemonic "UTF8" | |
339 documentation " | |
340 UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding | |
341 sharing the following principles with the Mule-internal encoding: | |
342 | |
343 -- All ASCII characters (codepoints 0 through 127) are represented | |
344 by themselves (i.e. using one byte, with the same value as the | |
345 ASCII codepoint), and these bytes are disjoint from bytes | |
346 representing non-ASCII characters. | |
347 | |
348 This means that any 8-bit clean application can safely process | |
349 UTF-8-encoded text as it were ASCII, with no corruption (e.g. a | |
350 '/' byte is always a slash character, never the second byte of | |
351 some other character, as with Big5, so a pathname encoded in | |
352 UTF-8 can safely be split up into components and reassembled | |
353 again using standard ASCII processes). | |
354 | |
355 -- Leading bytes and non-leading bytes in the encoding of a | |
356 character are disjoint, so moving backwards is easy. | |
357 | |
358 -- Given only the leading byte, you know how many following bytes | |
359 are present. | |
360 " | |
361 unicode-type utf-8)) | |
362 | |
363 (make-coding-system | |
364 'utf-8-bom 'unicode | |
365 "UTF-8 w/BOM" | |
366 '(mnemonic "MSW-UTF8" | |
367 documentation | |
368 "UTF-8 Unicode encoding, with byte order mark. | |
369 Standard encoding for representing UTF-8 under MS Windows." | |
370 unicode-type utf-8 | |
371 little-endian t | |
372 need-bom t)) | |
373 | 373 |
374 ;; Now, create jit-ucs-charset-0 entries for those characters in Windows | 374 ;; Now, create jit-ucs-charset-0 entries for those characters in Windows |
375 ;; Glyph List 4 that would otherwise end up in East Asian character sets. | 375 ;; Glyph List 4 that would otherwise end up in East Asian character sets. |
376 ;; | 376 ;; |
377 ;; WGL4 is a character repertoire from Microsoft that gives a guideline | 377 ;; WGL4 is a character repertoire from Microsoft that gives a guideline |
611 begin end buffer)) | 611 begin end buffer)) |
612 | 612 |
613 ;; Sure would be nice to be able to use defface here. | 613 ;; Sure would be nice to be able to use defface here. |
614 (copy-face 'highlight 'unicode-invalid-sequence-warning-face) | 614 (copy-face 'highlight 'unicode-invalid-sequence-warning-face) |
615 | 615 |
616 (defvar unicode-query-coding-skip-chars-arg nil ;; Set in general-late.el | |
617 "Used by `unicode-query-coding-region' to skip chars with known mappings.") | |
618 | |
619 (defun unicode-query-coding-region (begin end coding-system | |
620 &optional buffer ignore-invalid-sequencesp | |
621 errorp highlightp) | |
622 "The `query-coding-region' implementation for Unicode coding systems. | |
623 | |
624 Supports IGNORE-INVALID-SEQUENCESP, that is, XEmacs characters that reflect | |
625 invalid octets on disk will be treated as encodable if this argument is | |
626 specified, and as not encodable if it is not specified." | |
627 | |
628 ;; Potential problem here; the octets that correspond to octets from #x00 | |
629 ;; to #x7f on disk will be treated by utf-8 and utf-7 as invalid | |
630 ;; sequences, and thus, in theory, encodable. | |
631 | |
632 (check-argument-type #'coding-system-p | |
633 (setq coding-system (find-coding-system coding-system))) | |
634 (check-argument-type #'integer-or-marker-p begin) | |
635 (check-argument-type #'integer-or-marker-p end) | |
636 (let* ((skip-chars-arg (concat unicode-query-coding-skip-chars-arg | |
637 (if ignore-invalid-sequencesp | |
638 unicode-invalid-sequence-regexp-range | |
639 ""))) | |
640 (ranges (make-range-table)) | |
641 (looking-at-arg (concat "[" skip-chars-arg "]")) | |
642 (case-fold-search nil) | |
643 (invalid-sequence-lower-unicode-bound | |
644 (char-to-unicode | |
645 (aref (decode-coding-string "\xd8\x00\x00\x00" | |
646 'utf-16-be) 3))) | |
647 (invalid-sequence-upper-unicode-bound | |
648 (char-to-unicode | |
649 (aref (decode-coding-string "\xd8\x00\x00\xFF" | |
650 'utf-16-be) 3))) | |
651 fail-range-start fail-range-end char-after failed | |
652 extent char-unicode failed-reason previous-failed-reason) | |
653 (save-excursion | |
654 (when highlightp | |
655 (query-coding-clear-highlights begin end buffer)) | |
656 (goto-char begin buffer) | |
657 (skip-chars-forward skip-chars-arg end buffer) | |
658 (while (< (point buffer) end) | |
659 (setq char-after (char-after (point buffer) buffer) | |
660 fail-range-start (point buffer)) | |
661 (while (and | |
662 (< (point buffer) end) | |
663 (not (looking-at looking-at-arg)) | |
664 (or (and | |
665 (= -1 (setq char-unicode (char-to-unicode char-after))) | |
666 (setq failed-reason 'unencodable)) | |
667 (and (not ignore-invalid-sequencesp) | |
668 ;; The default case, with ignore-invalid-sequencesp | |
669 ;; not specified: | |
670 ;; If the character is in the Unicode range that | |
671 ;; corresponds to an invalid octet, we want to | |
672 ;; treat it as unencodable. | |
673 (<= invalid-sequence-lower-unicode-bound | |
674 char-unicode) | |
675 (<= char-unicode | |
676 invalid-sequence-upper-unicode-bound) | |
677 (setq failed-reason 'invalid-sequence))) | |
678 (or (null previous-failed-reason) | |
679 (eq previous-failed-reason failed-reason))) | |
680 (forward-char 1 buffer) | |
681 (setq char-after (char-after (point buffer) buffer) | |
682 failed t | |
683 previous-failed-reason failed-reason)) | |
684 (if (= fail-range-start (point buffer)) | |
685 ;; The character can actually be encoded by the coding | |
686 ;; system; check the characters past it. | |
687 (forward-char 1 buffer) | |
688 ;; Can't be encoded; note this. | |
689 (when errorp | |
690 (error 'text-conversion-error | |
691 (format "Cannot encode %s using coding system" | |
692 (buffer-substring fail-range-start (point buffer) | |
693 buffer)) | |
694 (coding-system-name coding-system))) | |
695 (assert | |
696 (not (null previous-failed-reason)) t | |
697 "If we've got here, previous-failed-reason should be non-nil.") | |
698 (put-range-table fail-range-start | |
699 ;; If char-after is non-nil, we're not at | |
700 ;; the end of the buffer. | |
701 (setq fail-range-end (if char-after | |
702 (point buffer) | |
703 (point-max buffer))) | |
704 previous-failed-reason ranges) | |
705 (setq previous-failed-reason nil) | |
706 (when highlightp | |
707 (setq extent (make-extent fail-range-start fail-range-end buffer)) | |
708 (set-extent-priority extent (+ mouse-highlight-priority 2)) | |
709 (set-extent-face extent 'query-coding-warning-face))) | |
710 (skip-chars-forward skip-chars-arg end buffer)) | |
711 (if failed | |
712 (values nil ranges) | |
713 (values t nil))))) | |
714 | |
715 (loop | |
716 for coding-system in (coding-system-list) | |
717 initially (unless (featurep 'mule) (return)) | |
718 do (when (eq 'unicode (coding-system-type coding-system)) | |
719 (coding-system-put coding-system 'query-coding-function | |
720 #'unicode-query-coding-region))) | |
721 | |
722 (unless (featurep 'mule) | 616 (unless (featurep 'mule) |
723 ;; We do this in such a roundabout way--instead of having the above defun | 617 ;; We do this in such a roundabout way--instead of having the above defun |
724 ;; and defvar calls inside a (when (featurep 'mule) ...) form--to have | 618 ;; and defvar calls inside a (when (featurep 'mule) ...) form--to have |
725 ;; make-docfile.c pick up symbol and function documentation correctly. An | 619 ;; make-docfile.c pick up symbol and function documentation correctly. An |
726 ;; alternative approach would be to fix make-docfile.c to be able to read | 620 ;; alternative approach would be to fix make-docfile.c to be able to read |