annotate src/unicode.c @ 5576:071b810ceb18
Declare labels inline where appropriate; use #'labels, not #'flet, in tests.
lisp/ChangeLog addition:
2011-10-03 Aidan Kehoe <kehoea@parhasard.net>
* simple.el (handle-pre-motion-command-current-command-is-motion):
Implement #'keysyms-equal with #'labels + (declare (inline ...)),
instead of abusing macrolet to the same end.
* specifier.el (let-specifier):
* mule/mule-cmds.el (describe-language-environment):
* mule/mule-cmds.el (set-language-environment-coding-systems):
* mule/mule-x-init.el (x-use-halfwidth-roman-font):
* faces.el (Face-frob-property):
* keymap.el (key-sequence-list-description):
* lisp-mode.el (construct-lisp-mode-menu):
* loadhist.el (unload-feature):
* mouse.el (default-mouse-track-check-for-activation):
Declare various labels inline in dumped files when that reduces
the size of the dumped image. Declaring labels inline is normally
only worthwhile for inner loops and so on, but it's a reasonable
exercise of the related code to have these changes in core.
tests/ChangeLog addition:
2011-10-03 Aidan Kehoe <kehoea@parhasard.net>
* automated/case-tests.el (uni-mappings):
* automated/database-tests.el (delete-database-files):
* automated/hash-table-tests.el (iterations):
* automated/lisp-tests.el (test1):
* automated/lisp-tests.el (a):
* automated/lisp-tests.el (cl-floor):
* automated/lisp-tests.el (foo):
* automated/lisp-tests.el (list-nreverse):
* automated/lisp-tests.el (needs-lexical-context):
* automated/mule-tests.el (featurep):
* automated/os-tests.el (original-string):
* automated/os-tests.el (with):
* automated/symbol-tests.el (check-weak-list-unique):
Replace #'flet with #'labels where appropriate in these tests,
following my own advice on style in the docstrings of those
functions.
| author | Aidan Kehoe <kehoea@parhasard.net> |
|---|---|
| date | Mon, 03 Oct 2011 20:16:14 +0100 |
| parents | 4dee0387b9de |
| children | 56144c8593a8 |
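
The FORMAT comment in the annotated source below describes from_unicode_table as a tree of 256-element arrays in which every untouched row points at one shared blank table. A minimal C sketch of the two-level case follows. It assumes hypothetical names (`blank_1`, `blank_2`, `init_blank_tables`, `new_two_level_table`, `store_entry`, `lookup_entry`, `free_two_level_table`) that are not part of the XEmacs API, and it uses plain `malloc` where the real file uses `xnew_array`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is an illustration, not XEmacs code. */
#define TABLE_SIZE 256
#define NO_ENTRY ((short) -1)

static short blank_1[TABLE_SIZE];   /* shared leaf: every entry is NO_ENTRY */
static short *blank_2[TABLE_SIZE];  /* every row points at the shared leaf */

/* Fill in the shared blank tables; call once before anything else. */
static void
init_blank_tables (void)
{
  int i;
  for (i = 0; i < TABLE_SIZE; i++)
    {
      blank_1[i] = NO_ENTRY;
      blank_2[i] = blank_1;
    }
}

/* Create a two-level table whose rows all share the blank leaf.
   Returns NULL if allocation fails. */
static short **
new_two_level_table (void)
{
  short **tab = malloc (TABLE_SIZE * sizeof (short *));
  if (tab)
    memcpy (tab, blank_2, TABLE_SIZE * sizeof (short *));
  return tab;
}

/* Store VALUE for CODE (0..0xFFFF).  A private leaf is allocated the
   first time a row is written.  Returns 0 on success, -1 on failure. */
static int
store_entry (short **tab, int code, short value)
{
  int hi, lo;
  assert (code >= 0 && code <= 0xFFFF);
  hi = (code >> 8) & 255;
  lo = code & 255;
  if (tab[hi] == blank_1)
    {
      short *leaf = malloc (TABLE_SIZE * sizeof (short));
      if (!leaf)
        return -1;
      memcpy (leaf, blank_1, TABLE_SIZE * sizeof (short));
      tab[hi] = leaf;
    }
  tab[hi][lo] = value;
  return 0;
}

/* Look up CODE; untouched rows fall through to the blank leaf. */
static short
lookup_entry (short **tab, int code)
{
  assert (code >= 0 && code <= 0xFFFF);
  return tab[(code >> 8) & 255][code & 255];
}

/* Free the table, skipping rows that still point at the shared leaf. */
static void
free_two_level_table (short **tab)
{
  int i;
  for (i = 0; i < TABLE_SIZE; i++)
    if (tab[i] != blank_1)
      free (tab[i]);
  free (tab);
}
```

A caller would run `init_blank_tables ()` once, create a table, and call `store_entry (tab, 0x3042, 0x2422)` with some sample value; a later `lookup_entry (tab, 0x3042)` returns that value, while any code point in an untouched row still yields -1 without that row ever being allocated. Sharing the blank leaf is what keeps sparse tables small, and it is why the freeing code has to compare each row against the blank table before releasing it.
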
| rev | line source |
|---|---|
| 771 | 1 /* Code to handle Unicode conversion. |
| 4834 | 2 Copyright (C) 2000, 2001, 2002, 2003, 2004, 2005, 2010 Ben Wing. |
| 771 | 3 |
| 4 This file is part of XEmacs. | |
| 5 | |
| 5402 | 6 XEmacs is free software: you can redistribute it and/or modify it |
| 771 | 7 under the terms of the GNU General Public License as published by the |
| 5402 | 8 Free Software Foundation, either version 3 of the License, or (at your |
| 9 option) any later version. | |
| 771 | 10 |
| 11 XEmacs is distributed in the hope that it will be useful, but WITHOUT | |
| 12 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | |
| 13 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License | |
| 14 for more details. | |
| 15 | |
| 16 You should have received a copy of the GNU General Public License | |
| 5402 | 17 along with XEmacs. If not, see <http://www.gnu.org/licenses/>. */ |
| 771 | 18 |
| 19 /* Synched up with: FSF 20.3. Not in FSF. */ | |
| 20 | |
| 21 /* Authorship: | |
| 22 | |
| 23 Current primary author: Ben Wing <ben@xemacs.org> | |
| 24 | |
| 25 Written by Ben Wing <ben@xemacs.org>, June, 2001. | |
| 26 Separated out into this file, August, 2001. | |
| 27 Includes Unicode coding systems, some parts of which have been written | |
| 877 | 28 by someone else. #### Morioka and Hayashi, I think. |
| 771 | 29 |
| 30 As of September 2001, the detection code is here and abstraction of the | |
| 877 | 31 detection system is finished. The unicode detectors have been rewritten |
| 771 | 32 to include multiple levels of likelihood. |
| 33 */ | |
| 34 | |
| 35 #include <config.h> | |
| 36 #include "lisp.h" | |
| 37 | |
| 38 #include "charset.h" | |
| 39 #include "file-coding.h" | |
| 40 #include "opaque.h" | |
| 41 | |
| 4690 | 42 #include "buffer.h" |
| 43 #include "rangetab.h" | |
| 44 #include "extents.h" | |
| 45 | |
| 771 | 46 #include "sysfile.h" |
| 47 | |
| 2367 | 48 /* For more info about how Unicode works under Windows, see intl-win32.c. */ |
| 49 | |
| 50 /* Info about Unicode translation tables [ben]: | |
| 51 | |
| 52 FORMAT: | |
| 53 ------- | |
| 54 | |
| 55 We currently use the following format for tables: | |
| 56 | |
| 57 If dimension == 1, to_unicode_table is a 96-element array of ints | |
| 58 (Unicode code points); else, it's a 96-element array of int * pointers, | |
| 59 each of which points to a 96-element array of ints. If no elements in a | |
| 60 row have been filled in, the pointer will point to a default empty | |
| 61 table; that way, memory usage is more reasonable but lookup still fast. | |
| 62 | |
| 63 -- If from_unicode_levels == 1, from_unicode_table is a 256-element | |
| 64 array of shorts (octet 1 in high byte, octet 2 in low byte; we don't | |
| 65 store Ichars directly to save space). | |
| 66 | |
| 67 -- If from_unicode_levels == 2, from_unicode_table is a 256-element | |
| 68 array of short * pointers, each of which points to a 256-element array | |
| 69 of shorts. | |
| 70 | |
| 71 -- If from_unicode_levels == 3, from_unicode_table is a 256-element | |
| 72 array of short ** pointers, each of which points to a 256-element array | |
| 73 of short * pointers, each of which points to a 256-element array of | |
| 74 shorts. | |
| 75 | |
| 76 -- If from_unicode_levels == 4, same thing but one level deeper. | |
| 77 | |
| 78 Just as for to_unicode_table, we use default tables to fill in all | |
| 79 entries with no values in them. | |
| 80 | |
| 81 #### An obvious space-saving optimization is to use variable-sized | |
| 82 tables, where each table instead of just being a 256-element array, is a | |
| 83 structure with a start value, an end value, and a variable number of | |
| 84 entries (END - START + 1). Only 8 bits are needed for END and START, | |
| 85 and could be stored at the end to avoid alignment problems. However, | |
| 86 before charging off and implementing this, we need to consider whether | |
| 87 it's worth it: | |
| 88 | |
| 89 (1) Most tables will be highly localized in which code points are | |
| 90 defined, heavily reducing the possible memory waste. Before doing any | |
| 91 rewriting, write some code to see how much memory is actually being | |
| 92 wasted (i.e. ratio of empty entries to total # of entries) and only | |
| 93 start rewriting if it's unacceptably high. You have to check over all | |
| 94 charsets. | |
| 95 | |
| 96 (2) Since entries are usually added one at a time, you have to be very | |
| 97 careful when creating the tables to avoid realloc()/free() thrashing in | |
| 98 the common case when you are in an area of high localization and are | |
| 99 going to end up using most entries in the table. You'd certainly want | |
| 100 to allow only certain sizes, not arbitrary ones (probably powers of 2, | |
| 101 where you want the entire block including the START/END values to fit | |
| 102 into a power of 2, minus any malloc overhead if there is any -- there's | |
| 103 none under gmalloc.c, and probably most system malloc() functions are | |
| 104 quite smart nowadays and also have no overhead). You could optimize | |
| 105 somewhat during the in-C initializations, because you can compute the | |
| 106 actual usage of various tables by scanning the entries you're going to | |
| 107 add in a separate pass before adding them. (You could actually do the | |
| 108 same thing when entries are added on the Lisp level by making the | |
| 109 assumption that all the entries will come in one after another before | |
| 110 any use is made of the data. So as they're coming in, you just store | |
| 111 them in a big long list, and the first time you need to retrieve an | |
| 112 entry, you compute the whole table at once.) You'd still have to deal | |
| 113 with the possibility of later entries coming in, though. | |
| 114 | |
| 115 (3) You do lose some speed using START/END values, since you need a | |
| 116 couple of comparisons at each level. This could easily make each single | |
| 117 lookup become 3-4 times slower. The Unicode book considers this a big | |
| 118 issue, and recommends against variable-sized tables for this reason; | |
| 119 however, they almost certainly have in mind applications that primarily | |
| 120 involve conversion of large amounts of data. Most Unicode strings that | |
| 121 are translated in XEmacs are fairly small. The only place where this | |
| 122 might matter is in loading large files -- e.g. a 3-megabyte | |
| 123 Unicode-encoded file. So think about this, and maybe do a trial | |
| 124 implementation where you don't worry too much about the intricacies of | |
| 125 (2) and just implement some basic "multiply by 1.5" trick or something | |
| 126 to do the resizing. There is a very good FAQ on Unicode called | |
| 127 something like the Linux-Unicode How-To (it should be part of the Linux | |
| 128 How-To's, I think), that lists the url of a guy with a whole bunch of | |
| 129 unicode files you can use to stress-test your implementations, and he's | |
| 130 highly likely to have a good multi-megabyte Unicode-encoded file (with | |
| 131 normal text in it -- if you created your own just by creating repeated | |
| 132 strings of letters and numbers, you probably wouldn't get accurate | |
| 133 results). | |
| 134 | |
| 135 INITIALIZATION: | |
| 136 --------------- | |
| 137 | |
| 138 There are advantages and disadvantages to loading the tables at | |
| 139 run-time. | |
| 140 | |
| 141 Advantages: | |
| 142 | |
| 143 They're big, and it's very fast to recreate them (a fraction of a second | |
| 144 on modern processors). | |
| 145 | |
| 146 Disadvantages: | |
| 147 | |
| 148 (1) User-defined charsets: It would be inconvenient to require all | |
| 149 dumped user-defined charsets to be reloaded at init time. | |
| 150 | |
| 151 NB With run-time loading, we load in init-mule-at-startup, in | |
| 152 mule-cmds.el. This is called from startup.el, which is quite late in | |
| 153 the initialization process -- but data-directory isn't set until then. | |
| 154 With dump-time loading, you still can't dump in a Japanese directory | |
| 155 (again, until we move to Unicode internally), but this is not such an | |
| 156 imposition. | |
| 157 | |
| 158 | |
| 159 */ | |
| 160 | |
| 771 | 161 /* #### WARNING! The current sledgehammer routines have a fundamental |
| 162 problem in that they can't handle two characters mapping to a | |
| 163 single Unicode codepoint or vice-versa in a single charset table. | |
| 164 It's not clear there is any way to handle this and still make the | |
| 877 | 165 sledgehammer routines useful. |
| 166 | |
| 167 Inquiring Minds Want To Know Dept: does the above WARNING mean that | |
| 168 _if_ it happens, then it will signal error, or then it will do | |
| 169 something evil and unpredictable? Signaling an error is OK: for | |
| 170 all national standards, the national to Unicode map is an inclusion | |
| 171 (1-to-1). Any character set that does not behave that way is | |
| 1318 | 172 broken according to the Unicode standard. |
| 173 | |
| 2500 | 174 Answer: You will get an ABORT(), since the purpose of the sledgehammer |
| 1318 | 175 routines is self-checking. The above problem with non-1-to-1 mapping |
| 176 occurs in the Big5 tables, as provided by the Unicode Consortium. */ | |
| 877 | 177 |
| 771 | 178 /* #define SLEDGEHAMMER_CHECK_UNICODE */ |
| 179 | |
| 180 /* When MULE is not defined, we may still need some Unicode support -- | |
| 181 in particular, some Windows API's always want Unicode, and the way | |
| 182 we've set up the Unicode encapsulation, we may as well go ahead and | |
| 183 always use the Unicode versions of split API's. (It would be | |
| 184 trickier to not use them, and pointless -- under NT, the ANSI API's | |
| 185 call the Unicode ones anyway, so in the case of structures, we'd be | |
| 186 converting from Unicode to ANSI structures, only to have the OS | |
| 187 convert them back.) */ | |
| 188 | |
| 189 Lisp_Object Qunicode; | |
| 4096 | 190 Lisp_Object Qutf_16, Qutf_8, Qucs_4, Qutf_7, Qutf_32; |
| 771 | 191 Lisp_Object Qneed_bom; |
| 192 | |
| 193 Lisp_Object Qutf_16_little_endian, Qutf_16_bom; | |
| 194 Lisp_Object Qutf_16_little_endian_bom; | |
| 195 | |
| 985 | 196 Lisp_Object Qutf_8_bom; |
| 197 | |
| 4690 | 198 #ifdef MULE |
| 199 /* These range tables are not directly accessible from Lisp: */ | |
| 200 static Lisp_Object Vunicode_invalid_and_query_skip_chars; | |
| 201 static Lisp_Object Vutf_8_invalid_and_query_skip_chars; | |
| 202 static Lisp_Object Vunicode_query_skip_chars; | |
| 203 | |
| 204 static Lisp_Object Vunicode_query_string, Vunicode_invalid_string, | |
| 205 Vutf_8_invalid_string; | |
| 206 #endif /* MULE */ | |
| 207 | |
| 3952 | 208 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this |
| 209 algorithm. | |
| 210 | |
| 211 (They also give another, really verbose one, as part of their explanation | |
| 212 of the various planes of the encoding, but we won't use that.) */ | |
| 213 | |
| 214 #define UTF_16_LEAD_OFFSET (0xD800 - (0x10000 >> 10)) | |
| 215 #define UTF_16_SURROGATE_OFFSET (0x10000 - (0xD800 << 10) - 0xDC00) | |
| 216 | |
| 217 #define utf_16_surrogates_to_code(lead, trail) \ | |
| 218 (((lead) << 10) + (trail) + UTF_16_SURROGATE_OFFSET) | |
| 219 | |
| 220 #define CODE_TO_UTF_16_SURROGATES(codepoint, lead, trail) do { \ | |
| 221 int __ctu16s_code = (codepoint); \ | |
| 222 lead = UTF_16_LEAD_OFFSET + (__ctu16s_code >> 10); \ | |
| 223 trail = 0xDC00 + (__ctu16s_code & 0x3FF); \ | |
| 224 } while (0) | |
| 225 | |
| 771 | 226 #ifdef MULE |
| 227 | |
| 3352 | 228 /* Using ints for to_unicode is OK (as long as they are >= 32 bits). |
| 229 In from_unicode, we're converting from Mule characters, which means | |
| 230 that the values being converted to are only 96x96, and we can save | |
| 231 space by using shorts (signedness doesn't matter). */ | |
| 771 | 232 static int *to_unicode_blank_1; |
| 233 static int **to_unicode_blank_2; | |
| 234 | |
| 235 static short *from_unicode_blank_1; | |
| 236 static short **from_unicode_blank_2; | |
| 237 static short ***from_unicode_blank_3; | |
| 238 static short ****from_unicode_blank_4; | |
| 239 | |
| 1204 | 240 static const struct memory_description to_unicode_level_0_desc_1[] = { |
| 771 | 241 { XD_END } |
| 242 }; | |
| 243 | |
| 1204 | 244 static const struct sized_memory_description to_unicode_level_0_desc = { |
| 245 sizeof (int), to_unicode_level_0_desc_1 | |
| 771 | 246 }; |
| 247 | |
| 1204 | 248 static const struct memory_description to_unicode_level_1_desc_1[] = { |
| 2551 | 249 { XD_BLOCK_PTR, 0, 96, { &to_unicode_level_0_desc } }, |
| 771 | 250 { XD_END } |
| 251 }; | |
| 252 | |
| 1204 | 253 static const struct sized_memory_description to_unicode_level_1_desc = { |
| 254 sizeof (void *), to_unicode_level_1_desc_1 | |
| 771 | 255 }; |
| 256 | |
| 1204 | 257 static const struct memory_description to_unicode_description_1[] = { |
| 2551 | 258 { XD_BLOCK_PTR, 1, 96, { &to_unicode_level_0_desc } }, |
| 259 { XD_BLOCK_PTR, 2, 96, { &to_unicode_level_1_desc } }, | |
| 771 | 260 { XD_END } |
| 261 }; | |
| 262 | |
| 263 /* Not static because each charset has a set of to and from tables and | |
| 264 needs to describe them to pdump. */ | |
| 1204 | 265 const struct sized_memory_description to_unicode_description = { |
| 266 sizeof (void *), to_unicode_description_1 | |
| 267 }; | |
| 268 | |
| 2367 | 269 /* Used only for to_unicode_blank_2 */ |
| 270 static const struct memory_description to_unicode_level_2_desc_1[] = { | |
| 2551 | 271 { XD_BLOCK_PTR, 0, 96, { &to_unicode_level_1_desc } }, |
| 2367 | 272 { XD_END } |
| 273 }; | |
| 274 | |
| 1204 | 275 static const struct memory_description from_unicode_level_0_desc_1[] = { |
| 771 | 276 { XD_END } |
| 277 }; | |
| 278 | |
| 1204 | 279 static const struct sized_memory_description from_unicode_level_0_desc = { |
| 280 sizeof (short), from_unicode_level_0_desc_1 | |
| 771 | 281 }; |
| 282 | |
| 1204 | 283 static const struct memory_description from_unicode_level_1_desc_1[] = { |
| 2551 | 284 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_0_desc } }, |
| 771 | 285 { XD_END } |
| 286 }; | |
| 287 | |
| 1204 | 288 static const struct sized_memory_description from_unicode_level_1_desc = { |
| 289 sizeof (void *), from_unicode_level_1_desc_1 | |
| 771 | 290 }; |
| 291 | |
| 1204 | 292 static const struct memory_description from_unicode_level_2_desc_1[] = { |
| 2551 | 293 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_1_desc } }, |
| 771 | 294 { XD_END } |
| 295 }; | |
| 296 | |
| 1204 | 297 static const struct sized_memory_description from_unicode_level_2_desc = { |
| 298 sizeof (void *), from_unicode_level_2_desc_1 | |
| 771 | 299 }; |
| 300 | |
| 1204 | 301 static const struct memory_description from_unicode_level_3_desc_1[] = { |
| 2551 | 302 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_2_desc } }, |
| 771 | 303 { XD_END } |
| 304 }; | |
| 305 | |
| 1204 | 306 static const struct sized_memory_description from_unicode_level_3_desc = { |
| 307 sizeof (void *), from_unicode_level_3_desc_1 | |
| 771 | 308 }; |
| 309 | |
| 1204 | 310 static const struct memory_description from_unicode_description_1[] = { |
| 2551 | 311 { XD_BLOCK_PTR, 1, 256, { &from_unicode_level_0_desc } }, |
| 312 { XD_BLOCK_PTR, 2, 256, { &from_unicode_level_1_desc } }, | |
| 313 { XD_BLOCK_PTR, 3, 256, { &from_unicode_level_2_desc } }, | |
| 314 { XD_BLOCK_PTR, 4, 256, { &from_unicode_level_3_desc } }, | |
| 771 | 315 { XD_END } |
| 316 }; | |
| 317 | |
| 318 /* Not static because each charset has a set of to and from tables and | |
| 319 needs to describe them to pdump. */ | |
| 1204 | 320 const struct sized_memory_description from_unicode_description = { |
| 321 sizeof (void *), from_unicode_description_1 | |
| 771 | 322 }; |
| 323 | |
| 2367 | 324 /* Used only for from_unicode_blank_4 */ |
| 325 static const struct memory_description from_unicode_level_4_desc_1[] = { | |
| 2551 | 326 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_3_desc } }, |
| 2367 | 327 { XD_END } |
| 328 }; | |
| 329 | |
| 771 | 330 static Lisp_Object_dynarr *unicode_precedence_dynarr; |
| 331 | |
| 1204 | 332 static const struct memory_description lod_description_1[] = { |
| 333 XD_DYNARR_DESC (Lisp_Object_dynarr, &lisp_object_description), | |
| 771 | 334 { XD_END } |
| 335 }; | |
| 336 | |
| 1204 | 337 static const struct sized_memory_description lisp_object_dynarr_description = { |
| 771 | 338 sizeof (Lisp_Object_dynarr), |
| 339 lod_description_1 | |
| 340 }; | |
| 341 | |
| 342 Lisp_Object Vlanguage_unicode_precedence_list; | |
| 343 Lisp_Object Vdefault_unicode_precedence_list; | |
| 344 | |
| 345 Lisp_Object Qignore_first_column; | |
| 346 | |
| 3439 | 347 Lisp_Object Vcurrent_jit_charset; |
| 348 Lisp_Object Qlast_allocated_character; | |
| 349 Lisp_Object Qccl_encode_to_ucs_2; | |
| 350 | |
| 4268 | 351 Lisp_Object Vnumber_of_jit_charsets; |
| 352 Lisp_Object Vlast_jit_charset_final; | |
| 353 Lisp_Object Vcharset_descr; | |
| 354 | |
| 355 | |
| 771 | 356 |
| 357 /************************************************************************/ | |
| 358 /* Unicode implementation */ | |
| 359 /************************************************************************/ | |
| 360 | |
| 361 #define BREAKUP_UNICODE_CODE(val, u1, u2, u3, u4, levels) \ | |
| 362 do { \ | |
| 363 int buc_val = (val); \ | |
| 364 \ | |
| 365 (u1) = buc_val >> 24; \ | |
| 366 (u2) = (buc_val >> 16) & 255; \ | |
| 367 (u3) = (buc_val >> 8) & 255; \ | |
| 368 (u4) = buc_val & 255; \ | |
| 369 (levels) = (buc_val <= 0xFF ? 1 : \ | |
| 370 buc_val <= 0xFFFF ? 2 : \ | |
| 371 buc_val <= 0xFFFFFF ? 3 : \ | |
| 372 4); \ | |
| 373 } while (0) | |
| 374 | |
| 375 static void | |
| 376 init_blank_unicode_tables (void) | |
| 377 { | |
| 378 int i; | |
| 379 | |
| 380 from_unicode_blank_1 = xnew_array (short, 256); | |
| 381 from_unicode_blank_2 = xnew_array (short *, 256); | |
| 382 from_unicode_blank_3 = xnew_array (short **, 256); | |
| 383 from_unicode_blank_4 = xnew_array (short ***, 256); | |
| 384 for (i = 0; i < 256; i++) | |
| 385 { | |
| 877 | 386 /* #### IMWTK: Why does using -1 here work? Simply because there are |
| 1318 | 387 no existing 96x96 charsets? |
| 388 | |
| 389 Answer: I don't understand the concern. -1 indicates there is no | |
| 390 entry for this particular codepoint, which is always the case for | |
| 391 blank tables. */ | |
| 771 | 392 from_unicode_blank_1[i] = (short) -1; |
| 393 from_unicode_blank_2[i] = from_unicode_blank_1; | |
| 394 from_unicode_blank_3[i] = from_unicode_blank_2; | |
| 395 from_unicode_blank_4[i] = from_unicode_blank_3; | |
| 396 } | |
| 397 | |
| 398 to_unicode_blank_1 = xnew_array (int, 96); | |
| 399 to_unicode_blank_2 = xnew_array (int *, 96); | |
| 400 for (i = 0; i < 96; i++) | |
| 401 { | |
| 877 | 402 /* Here -1 is guaranteed OK. */ |
| 771 | 403 to_unicode_blank_1[i] = -1; |
| 404 to_unicode_blank_2[i] = to_unicode_blank_1; | |
| 405 } | |
| 406 } | |
| 407 | |
| 408 static void * | |
| 409 create_new_from_unicode_table (int level) | |
| 410 { | |
| 411 switch (level) | |
| 412 { | |
| 413 /* WARNING: If you are thinking of compressing these, keep in | |
| 414 mind that sizeof (short) does not equal sizeof (short *). */ | |
| 415 case 1: | |
| 416 { | |
| 417 short *newtab = xnew_array (short, 256); | |
| 418 memcpy (newtab, from_unicode_blank_1, 256 * sizeof (short)); | |
| 419 return newtab; | |
| 420 } | |
| 421 case 2: | |
| 422 { | |
| 423 short **newtab = xnew_array (short *, 256); | |
| 424 memcpy (newtab, from_unicode_blank_2, 256 * sizeof (short *)); | |
| 425 return newtab; | |
| 426 } | |
| 427 case 3: | |
| 428 { | |
| 429 short ***newtab = xnew_array (short **, 256); | |
| 430 memcpy (newtab, from_unicode_blank_3, 256 * sizeof (short **)); | |
| 431 return newtab; | |
| 432 } | |
| 433 case 4: | |
| 434 { | |
| 435 short ****newtab = xnew_array (short ***, 256); | |
| 436 memcpy (newtab, from_unicode_blank_4, 256 * sizeof (short ***)); | |
| 437 return newtab; | |
| 438 } | |
| 439 default: | |
| 2500 | 440 ABORT (); |
| 771 | 441 return 0; |
| 442 } | |
| 443 } | |
| 444 | |
| 877 | 445 /* Allocate and blank the tables. |
| 1318 | 446 Loading them up is done by load-unicode-mapping-table. */ |
| 771 | 447 void |
| 448 init_charset_unicode_tables (Lisp_Object charset) | |
| 449 { | |
| 450 if (XCHARSET_DIMENSION (charset) == 1) | |
| 451 { | |
| 452 int *to_table = xnew_array (int, 96); | |
| 453 memcpy (to_table, to_unicode_blank_1, 96 * sizeof (int)); | |
| 454 XCHARSET_TO_UNICODE_TABLE (charset) = to_table; | |
| 455 } | |
| 456 else | |
| 457 { | |
| 458 int **to_table = xnew_array (int *, 96); | |
| 459 memcpy (to_table, to_unicode_blank_2, 96 * sizeof (int *)); | |
| 460 XCHARSET_TO_UNICODE_TABLE (charset) = to_table; | |
| 461 } | |
| 462 | |
| 463 { | |
| 2367 | 464 XCHARSET_FROM_UNICODE_TABLE (charset) = |
| 465 create_new_from_unicode_table (1); | |
| 771 | 466 XCHARSET_FROM_UNICODE_LEVELS (charset) = 1; |
| 467 } | |
| 468 } | |
| 469 | |
| 470 static void | |
| 471 free_from_unicode_table (void *table, int level) | |
| 472 { | |
| 473 int i; | |
| 474 | |
| 475 switch (level) | |
| 476 { | |
| 477 case 2: | |
| 478 { | |
| 479 short **tab = (short **) table; | |
| 480 for (i = 0; i < 256; i++) | |
| 481 { | |
| 482 if (tab[i] != from_unicode_blank_1) | |
| 483 free_from_unicode_table (tab[i], 1); | |
| 484 } | |
| 485 break; | |
| 486 } | |
| 487 case 3: | |
| 488 { | |
| 489 short ***tab = (short ***) table; | |
| 490 for (i = 0; i < 256; i++) | |
| 491 { | |
| 492 if (tab[i] != from_unicode_blank_2) | |
| 493 free_from_unicode_table (tab[i], 2); | |
| 494 } | |
| 495 break; | |
| 496 } | |
| 497 case 4: | |
| 498 { | |
| 499 short ****tab = (short ****) table; | |
| 500 for (i = 0; i < 256; i++) | |
| 501 { | |
| 502 if (tab[i] != from_unicode_blank_3) | |
| 503 free_from_unicode_table (tab[i], 3); | |
| 504 } | |
| 505 break; | |
| 506 } | |
| 507 } | |
| 508 | |
| 4976 | 509 xfree (table); |
| 771 | 510 } |
| 511 | |
| 512 static void | |
| 513 free_to_unicode_table (void *table, int level) | |
| 514 { | |
| 515 if (level == 2) | |
| 516 { | |
| 517 int i; | |
| 518 int **tab = (int **) table; | |
| 519 | |
| 520 for (i = 0; i < 96; i++) | |
| 521 { | |
| 522 if (tab[i] != to_unicode_blank_1) | |
| 523 free_to_unicode_table (tab[i], 1); | |
| 524 } | |
| 525 } | |
| 526 | |
| 4976 | 527 xfree (table); |
| 771 | 528 } |
| 529 | |
| 530 void | |
| 531 free_charset_unicode_tables (Lisp_Object charset) | |
| 532 { | |
| 533 free_to_unicode_table (XCHARSET_TO_UNICODE_TABLE (charset), | |
| 534 XCHARSET_DIMENSION (charset)); | |
| 535 free_from_unicode_table (XCHARSET_FROM_UNICODE_TABLE (charset), | |
| 536 XCHARSET_FROM_UNICODE_LEVELS (charset)); | |
| 537 } | |
| 538 | |
| 539 #ifdef MEMORY_USAGE_STATS | |
| 540 | |
| 541 static Bytecount | |
| 542 compute_from_unicode_table_size_1 (void *table, int level, | |
| 5157 | 543 struct usage_stats *stats) |
| 771 | 544 { |
| 545 int i; | |
| 546 Bytecount size = 0; | |
| 547 | |
| 548 switch (level) | |
| 549 { | |
| 550 case 2: | |
| 551 { | |
| 552 short **tab = (short **) table; | |
| 553 for (i = 0; i < 256; i++) | |
| 554 { | |
| 555 if (tab[i] != from_unicode_blank_1) | |
| 556 size += compute_from_unicode_table_size_1 (tab[i], 1, stats); | |
| 557 } | |
| 558 break; | |
| 559 } | |
| 560 case 3: | |
| 561 { | |
| 562 short ***tab = (short ***) table; | |
| 563 for (i = 0; i < 256; i++) | |
| 564 { | |
| 565 if (tab[i] != from_unicode_blank_2) | |
| 566 size += compute_from_unicode_table_size_1 (tab[i], 2, stats); | |
| 567 } | |
| 568 break; | |
| 569 } | |
| 570 case 4: | |
| 571 { | |
| 572 short ****tab = (short ****) table; | |
| 573 for (i = 0; i < 256; i++) | |
| 574 { | |
| 575 if (tab[i] != from_unicode_blank_3) | |
| 576 size += compute_from_unicode_table_size_1 (tab[i], 3, stats); | |
| 577 } | |
| 578 break; | |
| 579 } | |
| 580 } | |
| 581 | |
| 3024 | 582 size += malloced_storage_size (table, |
| 771 | 583 256 * (level == 1 ? sizeof (short) : |
| 584 sizeof (void *)), | |
| 585 stats); | |
| 586 return size; | |
| 587 } | |
| 588 | |
| 589 static Bytecount | |
| 590 compute_to_unicode_table_size_1 (void *table, int level, | |
| 5157 | 591 struct usage_stats *stats) |
| 771 | 592 { |
| 593 Bytecount size = 0; | |
| 594 | |
| 595 if (level == 2) | |
| 596 { | |
| 597 int i; | |
| 598 int **tab = (int **) table; | |
| 599 | |
| 600 for (i = 0; i < 96; i++) | |
| 601 { | |
| 602 if (tab[i] != to_unicode_blank_1) | |
| 603 size += compute_to_unicode_table_size_1 (tab[i], 1, stats); | |
| 604 } | |
| 605 } | |
| 606 | |
| 3024 | 607 size += malloced_storage_size (table, |
| 771 | 608 96 * (level == 1 ? sizeof (int) : |
| 609 sizeof (void *)), | |
| 610 stats); | |
| 611 return size; | |
| 612 } | |
| 613 | |
| 614 Bytecount | |
| 615 compute_from_unicode_table_size (Lisp_Object charset, | |
| 5157 | 616 struct usage_stats *stats) |
| 771 | 617 { |
| 618 return (compute_from_unicode_table_size_1 | |
| 619 (XCHARSET_FROM_UNICODE_TABLE (charset), | |
| 620 XCHARSET_FROM_UNICODE_LEVELS (charset), | |
| 621 stats)); | |
| 622 } | |
| 623 | |
| 624 Bytecount | |
| 625 compute_to_unicode_table_size (Lisp_Object charset, | |
| 5157 | 626 struct usage_stats *stats) |
| 771 | 627 { |
| 628 return (compute_to_unicode_table_size_1 | |
| 629 (XCHARSET_TO_UNICODE_TABLE (charset), | |
| 630 XCHARSET_DIMENSION (charset), | |
| 631 stats)); | |
| 632 } | |
| 633 | |
| 634 #endif | |
| 635 | |
| 636 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
| 637 | |
| 638 /* "Sledgehammer checks" are checks that verify the self-consistency | |
| 639 of an entire structure every time a change is about to be made or | |
| 640 has been made to the structure. Not fast but a pretty much | |
| 641 sure-fire way of flushing out any incorrectnesses in the algorithms | |
| 642 that create the structure. | |
| 643 | |
| 644 Checking only after a change has been made will speed things up by | |
| 645 a factor of 2, but it doesn't absolutely prove that the code just | |
| 646 checked caused the problem; perhaps it happened elsewhere, either | |
| 647 in some code you forgot to sledgehammer check or as a result of | |
| 648 data corruption. */ | |
| 649 | |
| 650 static void | |
| 651 assert_not_any_blank_table (void *tab) | |
| 652 { | |
| 653 assert (tab != from_unicode_blank_1); | |
| 654 assert (tab != from_unicode_blank_2); | |
| 655 assert (tab != from_unicode_blank_3); | |
| 656 assert (tab != from_unicode_blank_4); | |
| 657 assert (tab != to_unicode_blank_1); | |
| 658 assert (tab != to_unicode_blank_2); | |
| 659 assert (tab); | |
| 660 } | |
| 661 | |
| 662 static void | |
| 663 sledgehammer_check_from_table (Lisp_Object charset, void *table, int level, | |
| 664 int codetop) | |
| 665 { | |
| 666 int i; | |
| 667 | |
| 668 switch (level) | |
| 669 { | |
| 670 case 1: | |
| 671 { | |
| 672 short *tab = (short *) table; | |
| 673 for (i = 0; i < 256; i++) | |
| 674 { | |
| 675 if (tab[i] != -1) | |
| 676 { | |
| 677 Lisp_Object char_charset; | |
| 678 int c1, c2; | |
| 679 | |
| 867 | 680 assert (valid_ichar_p (tab[i])); |
| 681 BREAKUP_ICHAR (tab[i], char_charset, c1, c2); | |
| 771 | 682 assert (EQ (charset, char_charset)); |
| 683 if (XCHARSET_DIMENSION (charset) == 1) | |
| 684 { | |
| 685 int *to_table = | |
| 686 (int *) XCHARSET_TO_UNICODE_TABLE (charset); | |
| 687 assert_not_any_blank_table (to_table); | |
| 688 assert (to_table[c1 - 32] == (codetop << 8) + i); | |
| 689 } | |
| 690 else | |
| 691 { | |
| 692 int **to_table = | |
| 693 (int **) XCHARSET_TO_UNICODE_TABLE (charset); | |
| 694 assert_not_any_blank_table (to_table); | |
| 695 assert_not_any_blank_table (to_table[c1 - 32]); | |
| 696 assert (to_table[c1 - 32][c2 - 32] == (codetop << 8) + i); | |
| 697 } | |
| 698 } | |
| 699 } | |
| 700 break; | |
| 701 } | |
| 702 case 2: | |
| 703 { | |
| 704 short **tab = (short **) table; | |
| 705 for (i = 0; i < 256; i++) | |
| 706 { | |
| 707 if (tab[i] != from_unicode_blank_1) | |
| 708 sledgehammer_check_from_table (charset, tab[i], 1, | |
| 709 (codetop << 8) + i); | |
| 710 } | |
| 711 break; | |
| 712 } | |
| 713 case 3: | |
| 714 { | |
| 715 short ***tab = (short ***) table; | |
| 716 for (i = 0; i < 256; i++) | |
| 717 { | |
| 718 if (tab[i] != from_unicode_blank_2) | |
| 719 sledgehammer_check_from_table (charset, tab[i], 2, | |
| 720 (codetop << 8) + i); | |
| 721 } | |
| 722 break; | |
| 723 } | |
| 724 case 4: | |
| 725 { | |
| 726 short ****tab = (short ****) table; | |
| 727 for (i = 0; i < 256; i++) | |
| 728 { | |
| 729 if (tab[i] != from_unicode_blank_3) | |
| 730 sledgehammer_check_from_table (charset, tab[i], 3, | |
| 731 (codetop << 8) + i); | |
| 732 } | |
| 733 break; | |
| 734 } | |
| 735 default: | |
| 2500 | 736 ABORT (); |
| 771 | 737 } |
| 738 } | |
| 739 | |
| 740 static void | |
| 741 sledgehammer_check_to_table (Lisp_Object charset, void *table, int level, | |
| 742 int codetop) | |
| 743 { | |
| 744 int i; | |
| 745 | |
| 746 switch (level) | |
| 747 { | |
| 748 case 1: | |
| 749 { | |
| 750 int *tab = (int *) table; | |
| 751 | |
| 752 if (XCHARSET_CHARS (charset) == 94) | |
| 753 { | |
| 754 assert (tab[0] == -1); | |
| 755 assert (tab[95] == -1); | |
| 756 } | |
| 757 | |
| 758 for (i = 0; i < 96; i++) | |
| 759 { | |
| 760 if (tab[i] != -1) | |
| 761 { | |
| 762 int u4, u3, u2, u1, levels; | |
| 867 | 763 Ichar ch; |
| 764 Ichar this_ch; | |
| 771 | 765 short val; |
| 766 void *frtab = XCHARSET_FROM_UNICODE_TABLE (charset); | |
| 767 | |
| 768 if (XCHARSET_DIMENSION (charset) == 1) | |
| 867 | 769 this_ch = make_ichar (charset, i + 32, 0); |
| 771 | 770 else |
| 867 | 771 this_ch = make_ichar (charset, codetop + 32, i + 32); |
| 771 | 772 |
| 773 assert (tab[i] >= 0); | |
| 774 BREAKUP_UNICODE_CODE (tab[i], u4, u3, u2, u1, levels); | |
| 775 assert (levels <= XCHARSET_FROM_UNICODE_LEVELS (charset)); | |
| 776 | |
| 777 switch (XCHARSET_FROM_UNICODE_LEVELS (charset)) | |
| 778 { | |
| 779 case 1: val = ((short *) frtab)[u1]; break; | |
| 780 case 2: val = ((short **) frtab)[u2][u1]; break; | |
| 781 case 3: val = ((short ***) frtab)[u3][u2][u1]; break; | |
| 782 case 4: val = ((short ****) frtab)[u4][u3][u2][u1]; break; | |
| 2500 | 783 default: ABORT (); |
| 771 | 784 } |
| 785 | |
| 867 | 786 ch = make_ichar (charset, val >> 8, val & 0xFF); |
| 771 | 787 assert (ch == this_ch); |
| 788 | |
| 789 switch (XCHARSET_FROM_UNICODE_LEVELS (charset)) | |
| 790 { | |
| 791 case 4: | |
| 792 assert_not_any_blank_table (frtab); | |
| 793 frtab = ((short ****) frtab)[u4]; | |
| 794 /* fall through */ | |
| 795 case 3: | |
| 796 assert_not_any_blank_table (frtab); | |
| 797 frtab = ((short ***) frtab)[u3]; | |
| 798 /* fall through */ | |
| 799 case 2: | |
| 800 assert_not_any_blank_table (frtab); | |
| 801 frtab = ((short **) frtab)[u2]; | |
| 802 /* fall through */ | |
| 803 case 1: | |
| 804 assert_not_any_blank_table (frtab); | |
| 805 break; | |
| 2500 | 806 default: ABORT (); |
| 771 | 807 } |
| 808 } | |
| 809 } | |
| 810 break; | |
| 811 } | |
| 812 case 2: | |
| 813 { | |
| 814 int **tab = (int **) table; | |
| 815 | |
| 816 if (XCHARSET_CHARS (charset) == 94) | |
| 817 { | |
| 818 assert (tab[0] == to_unicode_blank_1); | |
| 819 assert (tab[95] == to_unicode_blank_1); | |
| 820 } | |
| 821 | |
| 822 for (i = 0; i < 96; i++) | |
| 823 { | |
| 824 if (tab[i] != to_unicode_blank_1) | |
| 825 sledgehammer_check_to_table (charset, tab[i], 1, i); | |
| 826 } | |
| 827 break; | |
| 828 } | |
| 829 default: | |
| 2500 | 830 ABORT (); |
| 771 | 831 } |
| 832 } | |
| 833 | |
| 834 static void | |
| 835 sledgehammer_check_unicode_tables (Lisp_Object charset) | |
| 836 { | |
| 837 /* verify that the blank tables have not been modified */ | |
| 838 int i; | |
| 839 int from_level = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
| 840 int to_level = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
| 841 | |
| 842 for (i = 0; i < 256; i++) | |
| 843 { | |
| 844 assert (from_unicode_blank_1[i] == (short) -1); | |
| 845 assert (from_unicode_blank_2[i] == from_unicode_blank_1); | |
| 846 assert (from_unicode_blank_3[i] == from_unicode_blank_2); | |
| 847 assert (from_unicode_blank_4[i] == from_unicode_blank_3); | |
| 848 } | |
| 849 | |
| 850 for (i = 0; i < 96; i++) | |
| 851 { | |
| 852 assert (to_unicode_blank_1[i] == -1); | |
| 853 assert (to_unicode_blank_2[i] == to_unicode_blank_1); | |
| 854 } | |
| 855 | |
| 856 assert (from_level >= 1 && from_level <= 4); | |
| 857 | |
| 858 sledgehammer_check_from_table (charset, | |
| 859 XCHARSET_FROM_UNICODE_TABLE (charset), | |
| 860 from_level, 0); | |
| 861 | |
| 862 sledgehammer_check_to_table (charset, | |
| 863 XCHARSET_TO_UNICODE_TABLE (charset), | |
| 864 XCHARSET_DIMENSION (charset), 0); | |
| 865 } | |
| 866 | |
| 867 #endif /* SLEDGEHAMMER_CHECK_UNICODE */ | |
| 868 | |
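The "sledgehammer check" idea described above can be illustrated on a much smaller structure. The following is a minimal sketch, not the XEmacs implementation: `toy_bimap` and all `toy_*` names are hypothetical, and the table sizes are arbitrary. It keeps a forward and a reverse table and verifies that the two agree over the whole structure both before and after every change.

```c
#include <assert.h>

#define TOY_FWD_SIZE 96   /* hypothetical: slots in the forward table */
#define TOY_REV_SIZE 256  /* hypothetical: slots in the reverse table */

struct toy_bimap
{
  int fwd[TOY_FWD_SIZE];    /* slot -> code, or -1 if unset */
  short rev[TOY_REV_SIZE];  /* code -> slot, or -1 if unset */
};

static void
toy_bimap_init (struct toy_bimap *m)
{
  int i;
  for (i = 0; i < TOY_FWD_SIZE; i++)
    m->fwd[i] = -1;
  for (i = 0; i < TOY_REV_SIZE; i++)
    m->rev[i] = -1;
}

/* Walk the entire structure: every forward entry must be mirrored by the
   reverse table and vice versa.  Returns 1 if consistent, 0 otherwise. */
static int
toy_bimap_consistent (const struct toy_bimap *m)
{
  int i;
  for (i = 0; i < TOY_FWD_SIZE; i++)
    if (m->fwd[i] != -1
        && (m->fwd[i] < 0 || m->fwd[i] >= TOY_REV_SIZE
            || m->rev[m->fwd[i]] != i))
      return 0;
  for (i = 0; i < TOY_REV_SIZE; i++)
    if (m->rev[i] != -1
        && (m->rev[i] < 0 || m->rev[i] >= TOY_FWD_SIZE
            || m->fwd[m->rev[i]] != i))
      return 0;
  return 1;
}

static void
toy_bimap_set (struct toy_bimap *m, int slot, int code)
{
  assert (toy_bimap_consistent (m));   /* check before the change */

  /* Drop any stale pairing so the two tables stay mirror images. */
  if (m->fwd[slot] != -1)
    m->rev[m->fwd[slot]] = -1;
  if (m->rev[code] != -1)
    m->fwd[m->rev[code]] = -1;

  m->fwd[slot] = code;
  m->rev[code] = (short) slot;

  assert (toy_bimap_consistent (m));   /* check after the change */
}
```

Checking on both sides of the change doubles the cost, but a failure in the "before" check points at some other code path, while a failure in the "after" check points at the change just made.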
| 869 static void | |
| 867 | 870 set_unicode_conversion (Ichar chr, int code) |
| 771 | 871 { |
| 872 Lisp_Object charset; | |
| 873 int c1, c2; | |
| 874 | |
| 867 | 875 BREAKUP_ICHAR (chr, charset, c1, c2); |
| 771 | 876 |
| 877 | 877 /* I tried an assert on code > 255 || chr == code, but that fails because |
| 878 Mule gives many Latin characters separate code points for different | |
| 879 ISO 8859 coded character sets. Obvious in hindsight.... */ | |
| 880 assert (!EQ (charset, Vcharset_ascii) || chr == code); | |
| 881 assert (!EQ (charset, Vcharset_latin_iso8859_1) || chr == code); | |
| 882 assert (!EQ (charset, Vcharset_control_1) || chr == code); | |
| 883 | |
| 884 /* This assert is needed because it is simply unimplemented. */ | |
| 771 | 885 assert (!EQ (charset, Vcharset_composite)); |
| 886 | |
| 887 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
| 888 sledgehammer_check_unicode_tables (charset); | |
| 889 #endif | |
| 890 | |
| 2704 | 891 if (EQ(charset, Vcharset_ascii) || EQ(charset, Vcharset_control_1)) |
| 892 return; | |
| 893 | |
| 771 | 894 /* First, the char -> unicode translation */ |
| 895 | |
| 896 if (XCHARSET_DIMENSION (charset) == 1) | |
| 897 { | |
| 898 int *to_table = (int *) XCHARSET_TO_UNICODE_TABLE (charset); | |
| 899 to_table[c1 - 32] = code; | |
| 900 } | |
| 901 else | |
| 902 { | |
| 903 int **to_table_2 = (int **) XCHARSET_TO_UNICODE_TABLE (charset); | |
| 904 int *to_table_1; | |
| 905 | |
| 906 assert (XCHARSET_DIMENSION (charset) == 2); | |
| 907 to_table_1 = to_table_2[c1 - 32]; | |
| 908 if (to_table_1 == to_unicode_blank_1) | |
| 909 { | |
| 910 to_table_1 = xnew_array (int, 96); | |
| 911 memcpy (to_table_1, to_unicode_blank_1, 96 * sizeof (int)); | |
| 912 to_table_2[c1 - 32] = to_table_1; | |
| 913 } | |
| 914 to_table_1[c2 - 32] = code; | |
| 915 } | |
| 916 | |
| 917 /* Then, unicode -> char: much harder */ | |
| 918 | |
| 919 { | |
| 920 int charset_levels; | |
| 921 int u4, u3, u2, u1; | |
| 922 int code_levels; | |
| 923 BREAKUP_UNICODE_CODE (code, u4, u3, u2, u1, code_levels); | |
| 924 | |
| 925 charset_levels = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
| 926 | |
| 927 /* Make sure the charset's tables have at least as many levels as | |
| 928 the code point has: Note that the charset is guaranteed to have | |
| 929 at least one level, because it was created that way */ | |
| 930 if (charset_levels < code_levels) | |
| 931 { | |
| 932 int i; | |
| 933 | |
| 934 assert (charset_levels > 0); | |
| 935 for (i = 2; i <= code_levels; i++) | |
| 936 { | |
| 937 if (charset_levels < i) | |
| 938 { | |
| 939 void *old_table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
| 940 void *table = create_new_from_unicode_table (i); | |
| 941 XCHARSET_FROM_UNICODE_TABLE (charset) = table; | |
| 942 | |
| 943 switch (i) | |
| 944 { | |
| 945 case 2: | |
| 946 ((short **) table)[0] = (short *) old_table; | |
| 947 break; | |
| 948 case 3: | |
| 949 ((short ***) table)[0] = (short **) old_table; | |
| 950 break; | |
| 951 case 4: | |
| 952 ((short ****) table)[0] = (short ***) old_table; | |
| 953 break; | |
| 2500 | 954 default: ABORT (); |
| 771 | 955 } |
| 956 } | |
| 957 } | |
| 958 | |
| 959 charset_levels = code_levels; | |
| 960 XCHARSET_FROM_UNICODE_LEVELS (charset) = code_levels; | |
| 961 } | |
| 962 | |
| 963 /* Now, make sure there is a non-default table at each level */ | |
| 964 { | |
| 965 int i; | |
| 966 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
| 967 | |
| 968 for (i = charset_levels; i >= 2; i--) | |
| 969 { | |
| 970 switch (i) | |
| 971 { | |
| 972 case 4: | |
| 973 if (((short ****) table)[u4] == from_unicode_blank_3) | |
| 974 ((short ****) table)[u4] = | |
| 975 ((short ***) create_new_from_unicode_table (3)); | |
| 976 table = ((short ****) table)[u4]; | |
| 977 break; | |
| 978 case 3: | |
| 979 if (((short ***) table)[u3] == from_unicode_blank_2) | |
| 980 ((short ***) table)[u3] = | |
| 981 ((short **) create_new_from_unicode_table (2)); | |
| 982 table = ((short ***) table)[u3]; | |
| 983 break; | |
| 984 case 2: | |
| 985 if (((short **) table)[u2] == from_unicode_blank_1) | |
| 986 ((short **) table)[u2] = | |
| 987 ((short *) create_new_from_unicode_table (1)); | |
| 988 table = ((short **) table)[u2]; | |
| 989 break; | |
| 2500 | 990 default: ABORT (); |
| 771 | 991 } |
| 992 } | |
| 993 } | |
| 994 | |
| 995 /* Finally, set the character */ | |
| 996 | |
| 997 { | |
| 998 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
| 999 switch (charset_levels) | |
| 1000 { | |
| 1001 case 1: ((short *) table)[u1] = (c1 << 8) + c2; break; | |
| 1002 case 2: ((short **) table)[u2][u1] = (c1 << 8) + c2; break; | |
| 1003 case 3: ((short ***) table)[u3][u2][u1] = (c1 << 8) + c2; break; | |
| 1004 case 4: ((short ****) table)[u4][u3][u2][u1] = (c1 << 8) + c2; break; | |
| 2500 | 1005 default: ABORT (); |
| 771 | 1006 } |
| 1007 } | |
| 1008 } | |
| 1009 | |
| 1010 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
| 1011 sledgehammer_check_unicode_tables (charset); | |
| 1012 #endif | |
| 1013 } | |
| 1014 | |
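The unicode -> char half of `set_unicode_conversion` combines three ideas: growing the table by putting the old root in slot 0 of a new root, sharing one read-only "blank" leaf among all empty slots, and replacing the blank with a private leaf before the first write. The sketch below shows these on a simplified two-level table for codes up to 0xFFFF. It is an illustration under stated assumptions: every `toy_*` name is hypothetical, and plain `malloc` stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdlib.h>

static short toy_blank_1[256];   /* shared leaf: every entry is -1 */
static int toy_blanks_ready;

struct toy_from_table
{
  int levels;   /* 1 or 2 in this sketch */
  void *root;   /* short * when levels == 1, short ** when levels == 2 */
};

static short *
toy_new_leaf (void)
{
  short *leaf = malloc (256 * sizeof (short));
  int i;
  if (!leaf)
    abort ();
  for (i = 0; i < 256; i++)
    leaf[i] = -1;
  return leaf;
}

static short **
toy_new_level2 (void)
{
  short **tab = malloc (256 * sizeof (short *));
  int i;
  if (!tab)
    abort ();
  for (i = 0; i < 256; i++)
    tab[i] = toy_blank_1;        /* empty slots share the blank leaf */
  return tab;
}

static void
toy_table_init (struct toy_from_table *t)
{
  int i;
  if (!toy_blanks_ready)
    {
      for (i = 0; i < 256; i++)
        toy_blank_1[i] = -1;
      toy_blanks_ready = 1;
    }
  t->levels = 1;
  t->root = toy_new_leaf ();
}

static void
toy_table_set (struct toy_from_table *t, int code, short value)
{
  int u2 = (code >> 8) & 0xFF, u1 = code & 0xFF;
  short *leaf;

  assert (code >= 0 && code <= 0xFFFF);
  if (t->levels == 1 && u2 != 0)
    {
      /* Grow: the old one-level table becomes slot 0 of the new root. */
      short **tab = toy_new_level2 ();
      tab[0] = (short *) t->root;
      t->root = tab;
      t->levels = 2;
    }

  if (t->levels == 1)
    leaf = (short *) t->root;
  else
    {
      short **tab = (short **) t->root;
      if (tab[u2] == toy_blank_1)
        tab[u2] = toy_new_leaf ();   /* never write into the shared blank */
      leaf = tab[u2];
    }
  leaf[u1] = value;
}

static short
toy_table_get (const struct toy_from_table *t, int code)
{
  int u2 = (code >> 8) & 0xFF, u1 = code & 0xFF;
  if (code < 0 || code > 0xFFFF)
    return -1;
  if (t->levels == 1)
    return u2 == 0 ? ((short *) t->root)[u1] : -1;
  return ((short **) t->root)[u2][u1];
}
```

Because unset slots all point at the same blank leaf, a lookup never needs a null check: indexing through a blank simply yields -1.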
| 788 | 1015 int |
| 867 | 1016 ichar_to_unicode (Ichar chr) |
| 771 | 1017 { |
| 1018 Lisp_Object charset; | |
| 1019 int c1, c2; | |
| 1020 | |
| 867 | 1021 type_checking_assert (valid_ichar_p (chr)); |
| 877 | 1022 /* This shortcut depends on the representation of an Ichar, see text.c. */ |
| 771 | 1023 if (chr < 256) |
| 1024 return (int) chr; | |
| 1025 | |
| 867 | 1026 BREAKUP_ICHAR (chr, charset, c1, c2); |
| 771 | 1027 if (EQ (charset, Vcharset_composite)) |
| 1028 return -1; /* #### don't know how to handle */ | |
| 1029 else if (XCHARSET_DIMENSION (charset) == 1) | |
| 1030 return ((int *) XCHARSET_TO_UNICODE_TABLE (charset))[c1 - 32]; | |
| 1031 else | |
| 1032 return ((int **) XCHARSET_TO_UNICODE_TABLE (charset))[c1 - 32][c2 - 32]; | |
| 1033 } | |
| 1034 | |
| 867 | 1035 static Ichar |
| 3439 | 1036 get_free_codepoint(Lisp_Object charset) |
| 1037 { | |
| 1038 Lisp_Object name = Fcharset_name(charset); | |
| 1039 Lisp_Object zeichen = Fget(name, Qlast_allocated_character, Qnil); | |
| 1040 Ichar res; | |
| 1041 | |
| 1042 /* Only allow this with the 96x96 character sets we are using for | |
| 1043 temporary Unicode support. */ | |
| 1044 assert(2 == XCHARSET_DIMENSION(charset) && 96 == XCHARSET_CHARS(charset)); | |
| 1045 | |
| 1046 if (!NILP(zeichen)) | |
| 1047 { | |
| 1048 int c1, c2; | |
| 1049 | |
| 1050 BREAKUP_ICHAR(XCHAR(zeichen), charset, c1, c2); | |
| 1051 | |
| 1052 if (127 == c1 && 127 == c2) | |
| 1053 { | |
| 1054 /* We've already used the highest-numbered character in this | |
| 1055 set--tell our caller to create another. */ | |
| 1056 return -1; | |
| 1057 } | |
| 1058 | |
| 1059 if (127 == c2) | |
| 1060 { | |
| 1061 ++c1; | |
| 1062 c2 = 0x20; | |
| 1063 } | |
| 1064 else | |
| 1065 { | |
| 1066 ++c2; | |
| 1067 } | |
| 1068 | |
| 1069 res = make_ichar(charset, c1, c2); | |
| 1070 Fput(name, Qlast_allocated_character, make_char(res)); | |
| 1071 } | |
| 1072 else | |
| 1073 { | |
| 1074 res = make_ichar(charset, 32, 32); | |
| 1075 Fput(name, Qlast_allocated_character, make_char(res)); | |
| 1076 } | |
| 1077 return res; | |
| 1078 } | |
| 1079 | |
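The allocation order used by `get_free_codepoint` is a plain row-major walk over a 96x96 grid whose octets run from 32 to 127. A minimal sketch of that walk, detached from Lisp objects, might look as follows; `toy_allocator` and `toy_next_position` are hypothetical names, and the state lives in a struct rather than on a symbol's property list.

```c
#include <assert.h>

struct toy_allocator
{
  int have_last;  /* 0 until the first allocation has been made */
  int c1, c2;     /* octets of the last allocated position, each 32..127 */
};

/* Advance to the next free position.  Returns 1 and stores the octets in
   *C1 and *C2, or returns 0 (leaving them untouched) once position
   (127, 127) has already been handed out. */
static int
toy_next_position (struct toy_allocator *a, int *c1, int *c2)
{
  if (!a->have_last)
    {
      a->c1 = 32;
      a->c2 = 32;
      a->have_last = 1;
    }
  else if (a->c1 == 127 && a->c2 == 127)
    return 0;                 /* grid exhausted; caller needs a new grid */
  else if (a->c2 == 127)
    {
      a->c1++;                /* wrap to the start of the next row */
      a->c2 = 32;
    }
  else
    a->c2++;

  *c1 = a->c1;
  *c2 = a->c2;
  return 1;
}
```

One grid yields 96 * 96 = 9216 positions, which is why the caller has to be prepared to create a further character set when the current one runs out.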
| 1080 /* The just-in-time creation of XEmacs characters that correspond to unknown | |
| 1081 Unicode code points happens when: | |
| 1082 | |
| 1083 1. The lookup would otherwise fail. | |
| 1084 | |
| 1085 2. The charsets array is nil or the default. | |
| 1086 | |
| 1087 If there are no free code points in the just-in-time Unicode character | |
| 1088 set, and the charsets array is the default unicode precedence list, | |
| 1089 create a new just-in-time Unicode character set, add it at the end of the | |
| 1090 unicode precedence list, create the XEmacs character in that character | |
| 1091 set, and return it. */ | |
| 1092 | |
| 1093 static Ichar | |
| 877 | 1094 unicode_to_ichar (int code, Lisp_Object_dynarr *charsets) |
| 771 | 1095 { |
| 1096 int u1, u2, u3, u4; | |
| 1097 int code_levels; | |
| 1098 int i; | |
| 1099 int n = Dynarr_length (charsets); | |
| 1100 | |
| 1101 type_checking_assert (code >= 0); | |
| 877 | 1102 /* This shortcut depends on the representation of an Ichar, see text.c. |
| 1103 Note that it may _not_ be extended to U+00A0 to U+00FF (many ISO 8859 | |
| 893 | 1104 coded character sets have points that map into that region, so this |
| 1105 function is many-valued). */ | |
| 877 | 1106 if (code < 0xA0) |
| 867 | 1107 return (Ichar) code; |
| 771 | 1108 |
| 1109 BREAKUP_UNICODE_CODE (code, u4, u3, u2, u1, code_levels); | |
| 1110 | |
| 1111 for (i = 0; i < n; i++) | |
| 1112 { | |
| 1113 Lisp_Object charset = Dynarr_at (charsets, i); | |
| 1114 int charset_levels = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
| 1115 if (charset_levels >= code_levels) | |
| 1116 { | |
| 1117 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
| 1118 short retval; | |
| 1119 | |
| 1120 switch (charset_levels) | |
| 1121 { | |
| 1122 case 1: retval = ((short *) table)[u1]; break; | |
| 1123 case 2: retval = ((short **) table)[u2][u1]; break; | |
| 1124 case 3: retval = ((short ***) table)[u3][u2][u1]; break; | |
| 1125 case 4: retval = ((short ****) table)[u4][u3][u2][u1]; break; | |
| 2500 | 1126 default: ABORT (); retval = 0; |
| 771 | 1127 } |
| 1128 | |
| 1129 if (retval != -1) | |
| 867 | 1130 return make_ichar (charset, retval >> 8, retval & 0xFF); |
| 771 | 1131 } |
| 1132 } | |
| 3439 | 1133 |
| 1134 /* Only do the magic just-in-time assignment if we're using the default | |
| 1135 list. */ | |
| 1136 if (unicode_precedence_dynarr == charsets) | |
| 1137 { | |
| 1138 if (NILP (Vcurrent_jit_charset) || | |
| 1139 (-1 == (i = get_free_codepoint(Vcurrent_jit_charset)))) | |
| 1140 { | |
| 3452 | 1141 Ibyte setname[32]; |
| 4268 | 1142 int number_of_jit_charsets = XINT (Vnumber_of_jit_charsets); |
| 1143 Ascbyte last_jit_charset_final = XCHAR (Vlast_jit_charset_final); | |
| 1144 | |
| 1145 /* This final byte shit is, umm, not that cool. */ | |
| 1146 assert (last_jit_charset_final >= 0x30); | |
| 3439 | 1147 |
| 3452 | 1148 /* Assertion added partly because our Win32 layer doesn't |
| 1149 support snprintf; with this, we're sure it won't overflow | |
| 1150 the buffer. */ | |
| 1151 assert(100 > number_of_jit_charsets); | |
| 1152 | |
| 4268 | 1153 qxesprintf(setname, "jit-ucs-charset-%d", number_of_jit_charsets); |
| 1154 | |
| 3439 | 1155 Vcurrent_jit_charset = Fmake_charset |
| 4953 | 1156 (intern_istring (setname), Vcharset_descr, |
| 3439 | 1157 /* Set encode-as-utf-8 to t, to have this character set written |
| 1158 using UTF-8 escapes in escape-quoted and ctext. This | |
| 1159 sidesteps the fact that our internal character -> Unicode | |
| 1160 mapping is not stable from one invocation to the next. */ | |
| 1161 nconc2 (list2(Qencode_as_utf_8, Qt), | |
| 1162 nconc2 (list6(Qcolumns, make_int(1), Qchars, make_int(96), | |
| 1163 Qdimension, make_int(2)), | |
| 3659 | 1164 list6(Qregistries, Qunicode_registries, |
| 4268 | 1165 Qfinal, make_char(last_jit_charset_final), |
| 3439 | 1166 /* This CCL program is initialised in |
| 1167 unicode.el. */ | |
| 1168 Qccl_program, Qccl_encode_to_ucs_2)))); | |
| 4268 | 1169 |
| 1170 /* Record for the Unicode infrastructure that we've created | |
| 1171 this character set. */ | |
| 1172 Vnumber_of_jit_charsets = make_int (number_of_jit_charsets + 1); | |
| 1173 Vlast_jit_charset_final = make_char (last_jit_charset_final + 1); | |
| 3439 | 1174 |
| 1175 i = get_free_codepoint(Vcurrent_jit_charset); | |
| 1176 } | |
| 1177 | |
| 1178 if (-1 != i) | |
| 1179 { | |
| 1180 set_unicode_conversion((Ichar)i, code); | |
| 1181 /* No need to add the charset to the end of the list; it's done | |
| 1182 automatically. */ | |
| 1183 } | |
| 1184 } | |
| 1185 return (Ichar) i; | |
| 771 | 1186 } |
| 1187 | |
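Both `set_unicode_conversion` and `unicode_to_ichar` store a character's two octets in a single `short` as `(c1 << 8) + c2` and use -1 as the "no mapping" marker. The sketch below, with hypothetical `toy_*` names, shows why the marker is unambiguous under the assumption that each octet is at most 127: the packed value is then at most 0x7F7F and can never be negative.

```c
#include <assert.h>

#define TOY_UNSET ((short) -1)   /* hypothetical name for the -1 marker */

/* Pack two octets (each assumed to be in 0..127) into one short. */
static short
toy_pack (int c1, int c2)
{
  assert (c1 >= 0 && c1 <= 127 && c2 >= 0 && c2 <= 127);
  return (short) ((c1 << 8) + c2);
}

/* Split a packed value back into its octets.  V must not be TOY_UNSET. */
static void
toy_unpack (short v, int *c1, int *c2)
{
  assert (v != TOY_UNSET);
  *c1 = v >> 8;
  *c2 = v & 0xFF;
}
```

For a one-dimensional charset the second octet is simply 0, so the same packing works for both dimensions.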
| 877 | 1188 /* Add charsets to precedence list. |
| 1189 LIST must be a list of charsets. Charsets which are in the list more | |
| 1190 than once are given the precedence implied by their earliest appearance. | |
| 1191 Later appearances are ignored. */ | |
| 771 | 1192 static void |
| 1193 add_charsets_to_precedence_list (Lisp_Object list, int *lbs, | |
| 1194 Lisp_Object_dynarr *dynarr) | |
| 1195 { | |
| 1196 { | |
| 1197 EXTERNAL_LIST_LOOP_2 (elt, list) | |
| 1198 { | |
| 1199 Lisp_Object charset = Fget_charset (elt); | |
| 778 | 1200 int lb = XCHARSET_LEADING_BYTE (charset); |
| 771 | 1201 if (lbs[lb - MIN_LEADING_BYTE] == 0) |
| 1202 { | |
| 877 | 1203 Dynarr_add (dynarr, charset); |
| 771 | 1204 lbs[lb - MIN_LEADING_BYTE] = 1; |
| 1205 } | |
| 1206 } | |
| 1207 } | |
| 1208 } | |
| 1209 | |
| 877 | 1210 /* Rebuild the charset precedence array. |
| 1211 The "charsets preferred for the current language" get highest precedence, | |
| 1212 followed by the "charsets preferred by default", ordered as in | |
| 1213 Vlanguage_unicode_precedence_list and Vdefault_unicode_precedence_list, | |
| 1214 respectively. All remaining charsets follow in an arbitrary order. */ | |
| 771 | 1215 void |
| 1216 recalculate_unicode_precedence (void) | |
| 1217 { | |
| 1218 int lbs[NUM_LEADING_BYTES]; | |
| 1219 int i; | |
| 1220 | |
| 1221 for (i = 0; i < NUM_LEADING_BYTES; i++) | |
| 1222 lbs[i] = 0; | |
| 1223 | |
| 1224 Dynarr_reset (unicode_precedence_dynarr); | |
| 1225 | |
| 1226 add_charsets_to_precedence_list (Vlanguage_unicode_precedence_list, | |
| 1227 lbs, unicode_precedence_dynarr); | |
| 1228 add_charsets_to_precedence_list (Vdefault_unicode_precedence_list, | |
| 1229 lbs, unicode_precedence_dynarr); | |
| 1230 | |
| 1231 for (i = 0; i < NUM_LEADING_BYTES; i++) | |
| 1232 { | |
| 1233 if (lbs[i] == 0) | |
| 1234 { | |
| 826 | 1235 Lisp_Object charset = charset_by_leading_byte (i + MIN_LEADING_BYTE); |
| 771 | 1236 if (!NILP (charset)) |
| 1237 Dynarr_add (unicode_precedence_dynarr, charset); | |
| 1238 } | |
| 1239 } | |
| 1240 } | |
| 1241 | |
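The rebuild performed by `recalculate_unicode_precedence` is a three-pass merge with a "seen" array: language-specific entries first, then default entries, then everything not yet listed. A minimal sketch with small integer ids standing in for charsets (all `toy_*` names and `TOY_NUM_IDS` are hypothetical):

```c
#include <assert.h>
#include <string.h>

#define TOY_NUM_IDS 8   /* hypothetical: ids are 0 .. TOY_NUM_IDS - 1 */

/* Append each id in SRC that has not been seen yet to OUT, marking it as
   seen.  Returns the new length of OUT. */
static int
toy_add_ids (const int *src, int n_src, int *seen, int *out, int n_out)
{
  int i;
  for (i = 0; i < n_src; i++)
    if (!seen[src[i]])
      {
        out[n_out++] = src[i];
        seen[src[i]] = 1;   /* later duplicates are ignored */
      }
  return n_out;
}

/* Build the full precedence order into OUT, which must have room for
   TOY_NUM_IDS entries.  Returns the number of entries written. */
static int
toy_build_precedence (const int *lang, int n_lang,
                      const int *dflt, int n_dflt, int *out)
{
  int seen[TOY_NUM_IDS];
  int i, n = 0;

  memset (seen, 0, sizeof seen);
  n = toy_add_ids (lang, n_lang, seen, out, n);
  n = toy_add_ids (dflt, n_dflt, seen, out, n);

  /* Everything not mentioned in either list follows. */
  for (i = 0; i < TOY_NUM_IDS; i++)
    if (!seen[i])
      out[n++] = i;
  return n;
}
```

In this sketch the leftover ids come out in ascending order; the comment on the original function only promises an arbitrary order for that tail.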
| 877 | 1242 DEFUN ("unicode-precedence-list", |
| 1243 Funicode_precedence_list, | |
| 1244 0, 0, 0, /* | |
| 1245 Return the precedence order among charsets used for Unicode decoding. | |
| 1246 | |
| 1247 Value is a list of charsets, which are searched in order for a translation | |
| 1248 matching a given Unicode character. | |
| 1249 | |
| 1250 The highest precedence is given to the language-specific precedence list of | |
| 1251 charsets, defined by `set-language-unicode-precedence-list'. These are | |
| 1252 followed by charsets in the default precedence list, defined by | |
| 1253 `set-default-unicode-precedence-list'. Charsets occurring multiple times are | |
| 5384 | 1254 given precedence according to their first occurrence in either list. These |
| 877 | 1255 are followed by the remaining charsets, in some arbitrary order. |
| 771 | 1256 |
| 1257 The language-specific precedence list is meant to be set as part of the | |
| 1258 language environment initialization; the default precedence list is meant | |
| 1259 to be set by the user. | |
| 1318 | 1260 |
| 1261 #### NOTE: This interface may be changed. | |
| 771 | 1262 */ |
| 877 | 1263 ()) |
| 1264 { | |
| 1265 int i; | |
| 1266 Lisp_Object list = Qnil; | |
| 1267 | |
| 1268 for (i = Dynarr_length (unicode_precedence_dynarr) - 1; i >= 0; i--) | |
| 1269 list = Fcons (Dynarr_at (unicode_precedence_dynarr, i), list); | |
| 1270 return list; | |
| 1271 } | |
| 1272 | |
| 1273 | |
| 1274 /* #### This interface is wrong. Cyrillic users and Chinese users are going | |
| 1275 to have varying opinions about whether ISO Cyrillic, KOI8-R, or Windows | |
| 1276 1251 should take precedence, and whether Big Five or CNS should take | |
| 1277 precedence, respectively. This means that users are sometimes going to | |
| 1278 want to set Vlanguage_unicode_precedence_list. | |
| 1279 Furthermore, this should be language-local (buffer-local would be a | |
| 1318 | 1280 reasonable approximation). |
| 1281 | |
| 1282 Answer: You are right, this needs rethinking. */ | |
| 877 | 1283 DEFUN ("set-language-unicode-precedence-list", |
| 1284 Fset_language_unicode_precedence_list, | |
| 1285 1, 1, 0, /* | |
| 1286 Set the language-specific precedence of charsets in Unicode decoding. | |
| 1287 LIST is a list of charsets. | |
| 1288 See `unicode-precedence-list' for more information. | |
| 1318 | 1289 |
| 1290 #### NOTE: This interface may be changed. | |
| 877 | 1291 */ |
| 771 | 1292 (list)) |
| 1293 { | |
| 1294 { | |
| 1295 EXTERNAL_LIST_LOOP_2 (elt, list) | |
| 1296 Fget_charset (elt); | |
| 1297 } | |
| 1298 | |
| 1299 Vlanguage_unicode_precedence_list = list; | |
| 1300 recalculate_unicode_precedence (); | |
| 1301 return Qnil; | |
| 1302 } | |
| 1303 | |
| 1304 DEFUN ("language-unicode-precedence-list", | |
| 1305 Flanguage_unicode_precedence_list, | |
| 1306 0, 0, 0, /* | |
| 1307 Return the language-specific precedence list used for Unicode decoding. | |
| 877 | 1308 See `unicode-precedence-list' for more information. |
| 1318 | 1309 |
| 1310 #### NOTE: This interface may be changed. | |
| 771 | 1311 */ |
| 1312 ()) | |
| 1313 { | |
| 1314 return Vlanguage_unicode_precedence_list; | |
| 1315 } | |
| 1316 | |
| 1317 DEFUN ("set-default-unicode-precedence-list", | |
| 1318 Fset_default_unicode_precedence_list, | |
| 1319 1, 1, 0, /* | |
| 1320 Set the default precedence list used for Unicode decoding. | |
| 877 | 1321 This is intended to be set by the user. See |
| 1322 `unicode-precedence-list' for more information. | |
| 1318 | 1323 |
| 1324 #### NOTE: This interface may be changed. | |
| 771 | 1325 */ |
| 1326 (list)) | |
| 1327 { | |
| 1328 { | |
| 1329 EXTERNAL_LIST_LOOP_2 (elt, list) | |
| 1330 Fget_charset (elt); | |
| 1331 } | |
| 1332 | |
| 1333 Vdefault_unicode_precedence_list = list; | |
| 1334 recalculate_unicode_precedence (); | |
| 1335 return Qnil; | |
| 1336 } | |
| 1337 | |
| 1338 DEFUN ("default-unicode-precedence-list", | |
| 1339 Fdefault_unicode_precedence_list, | |
| 1340 0, 0, 0, /* | |
| 1341 Return the default precedence list used for Unicode decoding. | |
| 877 | 1342 See `unicode-precedence-list' for more information. |
| 1318 | 1343 |
| 1344 #### NOTE: This interface may be changed. | |
| 771 | 1345 */ |
| 1346 ()) | |
| 1347 { | |
| 1348 return Vdefault_unicode_precedence_list; | |
| 1349 } | |
| 1350 | |
| 1351 DEFUN ("set-unicode-conversion", Fset_unicode_conversion, | |
| 1352 2, 2, 0, /* | |
| 1353 Add conversion information between Unicode codepoints and characters. | |
| 877 | 1354 Conversions for U+0000 to U+00FF are hardwired to ASCII, Control-1, and |
| 1355 Latin-1. Attempts to set these values will raise an error. | |
| 1356 | |
| 771 | 1357 CHARACTER is one of the following: |
| 1358 | |
| 1359 -- A character (in which case CODE must be a non-negative integer; values | |
| 1360 above 2^20 - 1 are allowed for the purpose of specifying private | |
| 877 | 1361 characters, but are illegal in standard Unicode---they will cause errors |
| 1362 when converted to utf-16) | |
| 771 | 1363 -- A vector of characters (in which case CODE must be a vector of integers |
| 1364 of the same length) | |
| 1365 */ | |
| 1366 (character, code)) | |
| 1367 { | |
| 1368 Lisp_Object charset; | |
| 877 | 1369 int ichar, unicode; |
| 771 | 1370 |
| 1371 CHECK_CHAR (character); | |
| 5307 | 1372 |
| 1373 check_integer_range (code, Qzero, make_integer (EMACS_INT_MAX)); | |
| 771 | 1374 |
| 877 | 1375 unicode = XINT (code); |
| 1376 ichar = XCHAR (character); | |
| 1377 charset = ichar_charset (ichar); | |
| 1378 | |
| 1379 /* The translations of ASCII, Control-1, and Latin-1 code points are | |
| 1380 hard-coded in ichar_to_unicode and unicode_to_ichar. | |
| 1381 | |
| 1382 Checking unicode < 256 && ichar != unicode is wrong because Mule gives | |
| 1383 many Latin characters code points in a few different character sets. */ | |
| 1384 if ((EQ (charset, Vcharset_ascii) || | |
| 1385 EQ (charset, Vcharset_control_1) || | |
| 1386 EQ (charset, Vcharset_latin_iso8859_1)) | |
| 1387 && unicode != ichar) | |
| 893 | 1388 signal_error (Qinvalid_argument, "Can't change Unicode translation for ASCII, Control-1 or Latin-1 character", |
| 771 | 1389 character); |
| 1390 | |
| 877 | 1391 /* #### Composite characters are not properly implemented yet. */ |
| 1392 if (EQ (charset, Vcharset_composite)) | |
| 1393 signal_error (Qinvalid_argument, "Can't set Unicode translation for Composite char", | |
| 1394 character); | |
| 1395 | |
| 1396 set_unicode_conversion (ichar, unicode); | |
| 771 | 1397 return Qnil; |
| 1398 } | |
| 1399 | |
| 1400 #endif /* MULE */ | |
| 1401 | |
| 800 | 1402 DEFUN ("char-to-unicode", Fchar_to_unicode, 1, 1, 0, /* |
| 771 | 1403 Convert character to Unicode codepoint. |
| 3025 | 1404 When there is no international support (i.e. the `mule' feature is not |
| 877 | 1405 present), this function simply does `char-to-int'. |
| 771 | 1406 */ |
| 1407 (character)) | |
| 1408 { | |
| 1409 CHECK_CHAR (character); | |
| 1410 #ifdef MULE | |
| 867 | 1411 return make_int (ichar_to_unicode (XCHAR (character))); |
| 771 | 1412 #else |
| 1413 return Fchar_to_int (character); | |
| 1414 #endif /* MULE */ | |
| 1415 } | |
| 1416 | |
| 800 | 1417 DEFUN ("unicode-to-char", Funicode_to_char, 1, 2, 0, /* |
| 771 | 1418 Convert Unicode codepoint to character. |
| 1419 CODE should be a non-negative integer. | |
| 1420 If CHARSETS is given, it should be a list of charsets, and only those | |
| 1421 charsets will be consulted, in the given order, for a translation. | |
| 1422 Otherwise, the default ordering of all charsets will be used (see | |
| 1423 `unicode-precedence-list'). | |
| 1424 | |
| 3025 | 1425 When there is no international support (i.e. the `mule' feature is not |
| 877 | 1426 present), this function simply does `int-to-char' and ignores the CHARSETS |
| 1427 argument. | |
| 2622 | 1428 |
| 3439 | 1429 If the CODE would not otherwise be converted to an XEmacs character, and the |
| 1430 list of character sets to be consulted is nil or the default, a new XEmacs | |
| 1431 character will be created for it in one of the `jit-ucs-charset' Mule | |
| 4268 | 1432 character sets, and that character will be returned. |
| 1433 | |
| 1434 This is limited to around 400,000 characters per XEmacs session, though, so | |
| 1435 while normal usage will not be problematic, things like: | |
| 1436 | |
| 1437 \(dotimes (i #x110000) (decode-char 'ucs i)) | |
| 1438 | |
| 1439 will eventually error. The long-term solution to this is Unicode as an | |
| 1440 internal encoding. | |
| 771 | 1441 */ |
| 2333 | 1442 (code, USED_IF_MULE (charsets))) |
| 771 | 1443 { |
| 1444 #ifdef MULE | |
| 1445 Lisp_Object_dynarr *dyn; | |
| 1446 int lbs[NUM_LEADING_BYTES]; | |
| 1447 int c; | |
| 1448 | |
| 5307 | 1449 check_integer_range (code, Qzero, make_integer (EMACS_INT_MAX)); |
| 771 | 1450 c = XINT (code); |
| 1451 { | |
| 1452 EXTERNAL_LIST_LOOP_2 (elt, charsets) | |
| 1453 Fget_charset (elt); | |
| 1454 } | |
| 1455 | |
| 1456 if (NILP (charsets)) | |
| 1457 { | |
| 877 | 1458 Ichar ret = unicode_to_ichar (c, unicode_precedence_dynarr); |
| 771 | 1459 if (ret == -1) |
| 1460 return Qnil; | |
| 1461 return make_char (ret); | |
| 1462 } | |
| 1463 | |
| 1464 dyn = Dynarr_new (Lisp_Object); | |
| 1465 memset (lbs, 0, NUM_LEADING_BYTES * sizeof (int)); | |
| 1466 add_charsets_to_precedence_list (charsets, lbs, dyn); | |
| 1467 { | |
| 877 | 1468 Ichar ret = unicode_to_ichar (c, dyn); |
| 771 | 1469 Dynarr_free (dyn); |
| 1470 if (ret == -1) | |
| 1471 return Qnil; | |
| 1472 return make_char (ret); | |
| 1473 } | |
| 1474 #else | |
| 5307 | 1475 check_integer_range (code, Qzero, make_integer (EMACS_INT_MAX)); |
| 771 | 1476 return Fint_to_char (code); |
| 1477 #endif /* MULE */ | |
| 1478 } | |
| 1479 | |
| 872 | 1480 #ifdef MULE |
| 1481 | |
| 771 | 1482 static Lisp_Object |
| 1483 cerrar_el_fulano (Lisp_Object fulano) | |
| 1484 { | |
| 1485 FILE *file = (FILE *) get_opaque_ptr (fulano); | |
| 1486 retry_fclose (file); | |
| 1487 return Qnil; | |
| 1488 } | |
| 1489 | |
| 1318 | 1490 DEFUN ("load-unicode-mapping-table", Fload_unicode_mapping_table, |
| 771 | 1491 2, 6, 0, /* |
| 877 | 1492 Load Unicode tables with the Unicode mapping data in FILENAME for CHARSET. |
| 771 | 1493 Data is text, in the form of one translation per line -- charset |
| 1494 codepoint followed by Unicode codepoint. Numbers are decimal or hex | |
| 1495 \(preceded by 0x). Comments are marked with a #. Charset codepoints | |
| 877 | 1496 for two-dimensional charsets have the first octet stored in the |
| 771 | 1497 high 8 bits of the hex number and the second in the low 8 bits. |
| 1498 | |
| 1499 If START and END are given, only charset codepoints within the given | |
| 877 | 1500 range will be processed. (START and END apply to the codepoints in the |
| 1501 file, before OFFSET is applied.) | |
| 771 | 1502 |
| 877 | 1503 If OFFSET is given, that value will be added to all charset codepoints |
| 1504 in the file to obtain the internal charset codepoint. \(We assume | |
| 1505 that octets in the table are in the range 33 to 126 or 32 to 127. If | |
| 1506 you have a table in ku-ten form, with octets in the range 1 to 94, you | |
| 1507 will have to use an offset of 8224, i.e. 0x2020.) | |
| 771 | 1508 |
| 1509 FLAGS, if specified, control further how the tables are interpreted | |
| 877 | 1510 and are used to special-case certain known format deviations in the |
| 1511 Unicode tables or in the charset: | |
| 771 | 1512 |
| 1513 `ignore-first-column' | |
| 877 | 1514 The JIS X 0208 tables have 3 columns of data instead of 2. The first |
| 1515 column contains the Shift-JIS codepoint, which we ignore. | |
| 771 | 1516 `big5' |
| 877 | 1517 The charset codepoints are Big Five codepoints; convert them to the |
| 1518 hacked-up Mule codepoint in `chinese-big5-1' or `chinese-big5-2'. | |
| 771 | 1519 */ |
| 1520 (filename, charset, start, end, offset, flags)) | |
| 1521 { | |
| 1522 int st = 0, en = INT_MAX, of = 0; | |
| 1523 FILE *file; | |
| 1524 struct gcpro gcpro1; | |
| 1525 char line[1025]; | |
| 1526 int fondo = specpdl_depth (); | |
| 1527 int ignore_first_column = 0; | |
| 1528 int big5 = 0; | |
| 1529 | |
| 1530 CHECK_STRING (filename); | |
| 1531 charset = Fget_charset (charset); | |
| 1532 if (!NILP (start)) | |
| 1533 { | |
| 1534 CHECK_INT (start); | |
| 1535 st = XINT (start); | |
| 1536 } | |
| 1537 if (!NILP (end)) | |
| 1538 { | |
| 1539 CHECK_INT (end); | |
| 1540 en = XINT (end); | |
| 1541 } | |
| 1542 if (!NILP (offset)) | |
| 1543 { | |
| 1544 CHECK_INT (offset); | |
| 1545 of = XINT (offset); | |
| 1546 } | |
| 1547 | |
| 1548 if (!LISTP (flags)) | |
| 1549 flags = list1 (flags); | |
| 1550 | |
| 1551 { | |
| 1552 EXTERNAL_LIST_LOOP_2 (elt, flags) | |
| 1553 { | |
| 1554 if (EQ (elt, Qignore_first_column)) | |
| 1555 ignore_first_column = 1; | |
| 1556 else if (EQ (elt, Qbig5)) | |
| 1557 big5 = 1; | |
| 1558 else | |
| 1559 invalid_constant | |
| 1318 | 1560 ("Unrecognized `load-unicode-mapping-table' flag", elt); |
| 771 | 1561 } |
| 1562 } | |
| 1563 | |
| 1564 GCPRO1 (filename); | |
| 1565 filename = Fexpand_file_name (filename, Qnil); | |
| 1566 file = qxe_fopen (XSTRING_DATA (filename), READ_TEXT); | |
| 1567 if (!file) | |
| 1568 report_file_error ("Cannot open", filename); | |
| 1569 record_unwind_protect (cerrar_el_fulano, make_opaque_ptr (file)); | |
| 1570 while (fgets (line, sizeof (line), file)) | |
| 1571 { | |
| 1572 char *p = line; | |
| 1573 int cp1, cp2, endcount; | |
| 1574 int cp1high, cp1low; | |
| 1575 int dummy; | |
| 1576 | |
| 1577 while (*p) /* erase all comments out of the line */ | |
| 1578 { | |
| 1579 if (*p == '#') | |
| 1580 *p = '\0'; | |
| 1581 else | |
| 1582 p++; | |
| 1583 } | |
| 1584 /* see if line is nothing but whitespace and skip if so */ | |
| 1585 p = line + strspn (line, " \t\n\r\f"); | |
| 1586 if (!*p) | |
| 1587 continue; | |
| 1588 /* NOTE: It appears that MS Windows and Newlib sscanf() have | |
| 1589 different interpretations for whitespace (== "skip all whitespace | |
| 1590 at processing point"): Newlib requires at least one corresponding | |
| 1591 whitespace character in the input, but MS allows none. The | |
| 1592 following would be easier to write if we could count on the MS | |
| 1593 interpretation. | |
| 1594 | |
| 1595 Also, the return value does NOT include %n storage. */ | |
| 1596 if ((!ignore_first_column ? | |
| 1597 sscanf (p, "%i %i%n", &cp1, &cp2, &endcount) < 2 : | |
| 1598 sscanf (p, "%i %i %i%n", &dummy, &cp1, &cp2, &endcount) < 3) | |
| 2367 | 1599 /* #### Temporary code! Cygwin newlib fucked up scanf() handling |
| 1600 of numbers beginning 0x0... starting in 04/2004, in an attempt | |
| 1601 to fix another bug. A partial fix for this was put in in | |
| 1602 06/2004, but as of 10/2004 the value of ENDCOUNT returned in | |
| 1603 such case is still wrong. If this gets fixed soon, remove | |
| 1604 this code. --ben */ | |
| 1605 #ifndef CYGWIN_SCANF_BUG | |
| 1606 || *(p + endcount + strspn (p + endcount, " \t\n\r\f")) | |
| 1607 #endif | |
| 1608 ) | |
| 771 | 1609 { |
| 793 | 1610 warn_when_safe (Qunicode, Qwarning, |
| 771 | 1611 "Unrecognized line in translation file %s:\n%s", |
| 1612 XSTRING_DATA (filename), line); | |
| 1613 continue; | |
| 1614 } | |
| 1615 if (cp1 >= st && cp1 <= en) | |
| 1616 { | |
| 1617 cp1 += of; | |
| 1618 if (cp1 < 0 || cp1 >= 65536) | |
| 1619 { | |
| 1620 out_of_range: | |
| 793 | 1621 warn_when_safe (Qunicode, Qwarning, |
| 1622 "Out of range first codepoint 0x%x in " | |
| 1623 "translation file %s:\n%s", | |
| 771 | 1624 cp1, XSTRING_DATA (filename), line); |
| 1625 continue; | |
| 1626 } | |
| 1627 | |
| 1628 cp1high = cp1 >> 8; | |
| 1629 cp1low = cp1 & 255; | |
| 1630 | |
| 1631 if (big5) | |
| 1632 { | |
| 867 | 1633 Ichar ch = decode_big5_char (cp1high, cp1low); |
| 771 | 1634 if (ch == -1) |
| 793 | 1635 |
| 1636 warn_when_safe (Qunicode, Qwarning, | |
| 1637 "Out of range Big5 codepoint 0x%x in " | |
| 1638 "translation file %s:\n%s", | |
| 771 | 1639 cp1, XSTRING_DATA (filename), line); |
| 1640 else | |
| 1641 set_unicode_conversion (ch, cp2); | |
| 1642 } | |
| 1643 else | |
| 1644 { | |
| 1645 int l1, h1, l2, h2; | |
| 867 | 1646 Ichar emch; |
| 771 | 1647 |
| 1648 switch (XCHARSET_TYPE (charset)) | |
| 1649 { | |
| 1650 case CHARSET_TYPE_94: l1 = 33; h1 = 126; l2 = 0; h2 = 0; break; | |
| 1651 case CHARSET_TYPE_96: l1 = 32; h1 = 127; l2 = 0; h2 = 0; break; | |
| 1652 case CHARSET_TYPE_94X94: l1 = 33; h1 = 126; l2 = 33; h2 = 126; | |
| 1653 break; | |
| 1654 case CHARSET_TYPE_96X96: l1 = 32; h1 = 127; l2 = 32; h2 = 127; | |
| 1655 break; | |
| 2500 | 1656 default: ABORT (); l1 = 0; h1 = 0; l2 = 0; h2 = 0; |
| 771 | 1657 } |
| 1658 | |
| 1659 if (cp1high < l2 || cp1high > h2 || cp1low < l1 || cp1low > h1) | |
| 1660 goto out_of_range; | |
| 1661 | |
| 867 | 1662 emch = (cp1high == 0 ? make_ichar (charset, cp1low, 0) : |
| 1663 make_ichar (charset, cp1high, cp1low)); | |
| 771 | 1664 set_unicode_conversion (emch, cp2); |
| 1665 } | |
| 1666 } | |
| 1667 } | |
| 1668 | |
| 1669 if (ferror (file)) | |
| 1670 report_file_error ("IO error when reading", filename); | |
| 1671 | |
| 1672 unbind_to (fondo); /* close file */ | |
| 1673 UNGCPRO; | |
| 1674 return Qnil; | |
| 1675 } | |
| 1676 | |
| 1677 #endif /* MULE */ | |
| 1678 | |
| 1679 | |
| 1680 /************************************************************************/ | |
| 1681 /* Unicode coding system */ | |
| 1682 /************************************************************************/ | |
| 1683 | |
| 1684 struct unicode_coding_system | |
| 1685 { | |
| 1686 enum unicode_type type; | |
| 1887 | 1687 unsigned int little_endian :1; |
| 1688 unsigned int need_bom :1; | |
| 771 | 1689 }; |
| 1690 | |
| 1691 #define CODING_SYSTEM_UNICODE_TYPE(codesys) \ | |
| 1692 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->type) | |
| 1693 #define XCODING_SYSTEM_UNICODE_TYPE(codesys) \ | |
| 1694 CODING_SYSTEM_UNICODE_TYPE (XCODING_SYSTEM (codesys)) | |
| 1695 #define CODING_SYSTEM_UNICODE_LITTLE_ENDIAN(codesys) \ | |
| 1696 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->little_endian) | |
| 1697 #define XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN(codesys) \ | |
| 1698 CODING_SYSTEM_UNICODE_LITTLE_ENDIAN (XCODING_SYSTEM (codesys)) | |
| 1699 #define CODING_SYSTEM_UNICODE_NEED_BOM(codesys) \ | |
| 1700 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->need_bom) | |
| 1701 #define XCODING_SYSTEM_UNICODE_NEED_BOM(codesys) \ | |
| 1702 CODING_SYSTEM_UNICODE_NEED_BOM (XCODING_SYSTEM (codesys)) | |
| 1703 | |
| 1704 struct unicode_coding_stream | |
| 1705 { | |
| 1706 /* decode */ | |
| 1707 unsigned char counter; | |
| 4096 | 1708 unsigned char indicated_length; |
| 771 | 1709 int seen_char; |
| 1710 /* encode */ | |
| 1711 Lisp_Object current_charset; | |
| 1712 int current_char_boundary; | |
| 1713 int wrote_bom; | |
| 1714 }; | |
| 1715 | |
| 1204 | 1716 static const struct memory_description unicode_coding_system_description[] = { |
| 771 | 1717 { XD_END } |
| 1718 }; | |
| 1719 | |
| 1204 | 1720 DEFINE_CODING_SYSTEM_TYPE_WITH_DATA (unicode); |
| 1721 | |
| 771 | 1722 static void |
| 1723 decode_unicode_char (int ch, unsigned_char_dynarr *dst, | |
| 1887 | 1724 struct unicode_coding_stream *data, |
| 1725 unsigned int ignore_bom) | |
| 771 | 1726 { |
| 1727 if (ch == 0xFEFF && !data->seen_char && ignore_bom) | |
| 1728 ; | |
| 1729 else | |
| 1730 { | |
| 1731 #ifdef MULE | |
| 877 | 1732 Ichar chr = unicode_to_ichar (ch, unicode_precedence_dynarr); |
| 771 | 1733 |
| 1734 if (chr != -1) | |
| 1735 { | |
| 867 | 1736 Ibyte work[MAX_ICHAR_LEN]; |
| 771 | 1737 int len; |
| 1738 | |
| 867 | 1739 len = set_itext_ichar (work, chr); |
| 771 | 1740 Dynarr_add_many (dst, work, len); |
| 1741 } | |
| 1742 else | |
| 1743 { | |
| 1744 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); | |
| 1745 Dynarr_add (dst, 34 + 128); | |
| 1746 Dynarr_add (dst, 46 + 128); | |
| 1747 } | |
| 1748 #else | |
| 867 | 1749 Dynarr_add (dst, (Ibyte) ch); |
| 771 | 1750 #endif /* MULE */ |
| 1751 } | |
| 1752 | |
| 1753 data->seen_char = 1; | |
| 1754 } | |
| 1755 | |
| 4096 | 1756 #define DECODE_ERROR_OCTET(octet, dst, data, ignore_bom) \ |
| 1757 decode_unicode_char ((octet) + UNICODE_ERROR_OCTET_RANGE_START, \ | |
| 1758 dst, data, ignore_bom) | |
| 1759 | |
| 1760 static inline void | |
| 1761 indicate_invalid_utf_8 (unsigned char indicated_length, | |
| 1762 unsigned char counter, | |
| 1763 int ch, unsigned_char_dynarr *dst, | |
| 1764 struct unicode_coding_stream *data, | |
| 1765 unsigned int ignore_bom) | |
| 1766 { | |
| 1767 Binbyte stored = indicated_length - counter; | |
| 1768 Binbyte mask = "\x00\x00\xC0\xE0\xF0\xF8\xFC"[indicated_length]; | |
| 1769 | |
| 1770 while (stored > 0) | |
| 1771 { | |
| 1772 DECODE_ERROR_OCTET (((ch >> (6 * (stored - 1))) & 0x3f) | mask, | |
| 1773 dst, data, ignore_bom); | |
| 1774 mask = 0x80, stored--; | |
| 1775 } | |
| 1776 } | |
| 1777 | |
| 771 | 1778 static void |
| 1779 encode_unicode_char_1 (int code, unsigned_char_dynarr *dst, | |
| 4096 | 1780 enum unicode_type type, unsigned int little_endian, |
| 1781 int write_error_characters_as_such) | |
| 771 | 1782 { |
| 1783 switch (type) | |
| 1784 { | |
| 1785 case UNICODE_UTF_16: | |
| 1786 if (little_endian) | |
| 1787 { | |
| 3952 | 1788 if (code < 0x10000) { |
| 1789 Dynarr_add (dst, (unsigned char) (code & 255)); | |
| 1790 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
| 4096 | 1791 } else if (write_error_characters_as_such && |
| 1792 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
| 1793 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
| 1794 { | |
| 1795 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
| 1796 } | |
| 1797 else if (code < 0x110000) | |
| 1798 { | |
| 1799 /* Little endian; least significant byte first. */ | |
| 1800 int first, second; | |
| 1801 | |
| 1802 CODE_TO_UTF_16_SURROGATES(code, first, second); | |
| 1803 | |
| 1804 Dynarr_add (dst, (unsigned char) (first & 255)); | |
| 1805 Dynarr_add (dst, (unsigned char) ((first >> 8) & 255)); | |
| 1806 | |
| 1807 Dynarr_add (dst, (unsigned char) (second & 255)); | |
| 1808 Dynarr_add (dst, (unsigned char) ((second >> 8) & 255)); | |
| 1809 } | |
| 1810 else | |
| 1811 { | |
| 1812 /* Not valid Unicode. Pass U+FFFD, least significant byte | |
| 1813 first. */ | |
| 1814 Dynarr_add (dst, (unsigned char) 0xFD); | |
| 1815 Dynarr_add (dst, (unsigned char) 0xFF); | |
| 1816 } | |
| 771 | 1817 } |
| 1818 else | |
| 1819 { | |
| 3952 | 1820 if (code < 0x10000) { |
| 1821 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
| 1822 Dynarr_add (dst, (unsigned char) (code & 255)); | |
| 4096 | 1823 } else if (write_error_characters_as_such && |
| 1824 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
| 1825 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
| 1826 { | |
| 1827 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
| 1828 } | |
| 1829 else if (code < 0x110000) | |
| 1830 { | |
| 1831 /* Big endian; most significant byte first. */ | |
| 1832 int first, second; | |
| 1833 | |
| 1834 CODE_TO_UTF_16_SURROGATES(code, first, second); | |
| 1835 | |
| 1836 Dynarr_add (dst, (unsigned char) ((first >> 8) & 255)); | |
| 1837 Dynarr_add (dst, (unsigned char) (first & 255)); | |
| 1838 | |
| 1839 Dynarr_add (dst, (unsigned char) ((second >> 8) & 255)); | |
| 1840 Dynarr_add (dst, (unsigned char) (second & 255)); | |
| 1841 } | |
| 1842 else | |
| 1843 { | |
| 1844 /* Not valid Unicode. Pass U+FFFD, most significant byte | |
| 1845 first. */ | |
| 1846 Dynarr_add (dst, (unsigned char) 0xFF); | |
| 1847 Dynarr_add (dst, (unsigned char) 0xFD); | |
| 1848 } | |
| 771 | 1849 } |
| 1850 break; | |
| 1851 | |
| 1852 case UNICODE_UCS_4: | |
| 4096 | 1853 case UNICODE_UTF_32: |
| 771 | 1854 if (little_endian) |
| 1855 { | |
| 4096 | 1856 if (write_error_characters_as_such && |
| 1857 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
| 1858 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
| 1859 { | |
| 1860 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
| 1861 } | |
| 1862 else | |
| 1863 { | |
| 1864 /* We generate and accept incorrect sequences here, which is | |
| 1865 okay, in the interest of preservation of the user's | |
| 1866 data. */ | |
| 1867 Dynarr_add (dst, (unsigned char) (code & 255)); | |
| 1868 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
| 1869 Dynarr_add (dst, (unsigned char) ((code >> 16) & 255)); | |
| 1870 Dynarr_add (dst, (unsigned char) (code >> 24)); | |
| 1871 } | |
| 771 | 1872 } |
| 1873 else | |
| 1874 { | |
| 4096 | 1875 if (write_error_characters_as_such && |
| 1876 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
| 1877 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
| 1878 { | |
| 1879 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
| 1880 } | |
| 1881 else | |
| 1882 { | |
| 1883 /* We generate and accept incorrect sequences here, which is okay, | |
| 1884 in the interest of preservation of the user's data. */ | |
| 1885 Dynarr_add (dst, (unsigned char) (code >> 24)); | |
| 1886 Dynarr_add (dst, (unsigned char) ((code >> 16) & 255)); | |
| 1887 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
| 1888 Dynarr_add (dst, (unsigned char) (code & 255)); | |
| 1889 } | |
| 771 | 1890 } |
| 1891 break; | |
| 1892 | |
| 1893 case UNICODE_UTF_8: | |
| 1894 if (code <= 0x7f) | |
| 1895 { | |
| 1896 Dynarr_add (dst, (unsigned char) code); | |
| 1897 } | |
| 1898 else if (code <= 0x7ff) | |
| 1899 { | |
| 1900 Dynarr_add (dst, (unsigned char) ((code >> 6) | 0xc0)); | |
| 1901 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
| 1902 } | |
| 1903 else if (code <= 0xffff) | |
| 1904 { | |
| 1905 Dynarr_add (dst, (unsigned char) ((code >> 12) | 0xe0)); | |
| 1906 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
| 1907 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
| 1908 } | |
| 1909 else if (code <= 0x1fffff) | |
| 1910 { | |
| 1911 Dynarr_add (dst, (unsigned char) ((code >> 18) | 0xf0)); | |
| 1912 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
| 1913 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
| 1914 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
| 1915 } | |
| 1916 else if (code <= 0x3ffffff) | |
| 1917 { | |
| 4096 | 1918 |
| 1919 #if !(UNICODE_ERROR_OCTET_RANGE_START > 0x1fffff \ | |
| 1920 && UNICODE_ERROR_OCTET_RANGE_START < 0x3ffffff) | |
| 1921 #error "This code needs to be rewritten. " | |
| 1922 #endif | |
| 1923 if (write_error_characters_as_such && | |
| 1924 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
| 1925 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
| 1926 { | |
| 1927 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
| 1928 } | |
| 1929 else | |
| 1930 { | |
| 1931 Dynarr_add (dst, (unsigned char) ((code >> 24) | 0xf8)); | |
| 1932 Dynarr_add (dst, (unsigned char) (((code >> 18) & 0x3f) | 0x80)); | |
| 1933 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
| 1934 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
| 1935 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
| 1936 } | |
| 771 | 1937 } |
| 1938 else | |
| 1939 { | |
| 1940 Dynarr_add (dst, (unsigned char) ((code >> 30) | 0xfc)); | |
| 1941 Dynarr_add (dst, (unsigned char) (((code >> 24) & 0x3f) | 0x80)); | |
| 1942 Dynarr_add (dst, (unsigned char) (((code >> 18) & 0x3f) | 0x80)); | |
| 1943 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
| 1944 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
| 1945 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
| 1946 } | |
| 1947 break; | |
| 1948 | |
| 2500 | 1949 case UNICODE_UTF_7: ABORT (); |
| 771 | 1950 |
| 2500 | 1951 default: ABORT (); |
| 771 | 1952 } |
| 1953 } | |
| 1954 | |
| 3439 | 1955 /* Also used in mule-coding.c for UTF-8 handling in ISO 2022-oriented |
| 1956 encodings. */ | |
| 1957 void | |
| 2333 | 1958 encode_unicode_char (Lisp_Object USED_IF_MULE (charset), int h, |
| 1959 int USED_IF_MULE (l), unsigned_char_dynarr *dst, | |
| 4096 | 1960 enum unicode_type type, unsigned int little_endian, |
| 1961 int write_error_characters_as_such) | |
| 771 | 1962 { |
| 1963 #ifdef MULE | |
| 867 | 1964 int code = ichar_to_unicode (make_ichar (charset, h & 127, l & 127)); |
| 771 | 1965 |
| 1966 if (code == -1) | |
| 1967 { | |
| 1968 if (type != UNICODE_UTF_16 && | |
| 1969 XCHARSET_DIMENSION (charset) == 2 && | |
| 1970 XCHARSET_CHARS (charset) == 94) | |
| 1971 { | |
| 1972 unsigned char final = XCHARSET_FINAL (charset); | |
| 1973 | |
| 1974 if (('@' <= final) && (final < 0x7f)) | |
| 1975 code = (0xe00000 + (final - '@') * 94 * 94 | |
| 1976 + ((h & 127) - 33) * 94 + (l & 127) - 33); | |
| 1977 else | |
| 1978 code = '?'; | |
| 1979 } | |
| 1980 else | |
| 1981 code = '?'; | |
| 1982 } | |
| 1983 #else | |
| 1984 int code = h; | |
| 1985 #endif /* MULE */ | |
| 1986 | |
| 4096 | 1987 encode_unicode_char_1 (code, dst, type, little_endian, |
| 1988 write_error_characters_as_such); | |
| 771 | 1989 } |
| 1990 | |
| 1991 static Bytecount | |
| 1992 unicode_convert (struct coding_stream *str, const UExtbyte *src, | |
| 1993 unsigned_char_dynarr *dst, Bytecount n) | |
| 1994 { | |
| 1995 unsigned int ch = str->ch; | |
| 1996 struct unicode_coding_stream *data = CODING_STREAM_TYPE_DATA (str, unicode); | |
| 1997 enum unicode_type type = | |
| 1998 XCODING_SYSTEM_UNICODE_TYPE (str->codesys); | |
| 1887 | 1999 unsigned int little_endian = |
| 2000 XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (str->codesys); | |
| 2001 unsigned int ignore_bom = XCODING_SYSTEM_UNICODE_NEED_BOM (str->codesys); | |
| 771 | 2002 Bytecount orign = n; |
| 2003 | |
| 2004 if (str->direction == CODING_DECODE) | |
| 2005 { | |
| 2006 unsigned char counter = data->counter; | |
| 4096 | 2007 unsigned char indicated_length |
| 2008 = data->indicated_length; | |
| 771 | 2009 |
| 2010 while (n--) | |
| 2011 { | |
| 2012 UExtbyte c = *src++; | |
| 2013 | |
| 2014 switch (type) | |
| 2015 { | |
| 2016 case UNICODE_UTF_8: | |
| 4096 | 2017 if (0 == counter) |
| 2018 { | |
| 2019 if (0 == (c & 0x80)) | |
| 2020 { | |
| 2021 /* ASCII. */ | |
| 2022 decode_unicode_char (c, dst, data, ignore_bom); | |
| 2023 } | |
| 2024 else if (0 == (c & 0x40)) | |
| 2025 { | |
| 2026 /* Highest bit set, second highest not--there's | |
| 2027 something wrong. */ | |
| 2028 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
| 2029 } | |
| 2030 else if (0 == (c & 0x20)) | |
| 2031 { | |
| 2032 ch = c & 0x1f; | |
| 2033 counter = 1; | |
| 2034 indicated_length = 2; | |
| 2035 } | |
| 2036 else if (0 == (c & 0x10)) | |
| 2037 { | |
| 2038 ch = c & 0x0f; | |
| 2039 counter = 2; | |
| 2040 indicated_length = 3; | |
| 2041 } | |
| 2042 else if (0 == (c & 0x08)) | |
| 2043 { | |
| 2044 ch = c & 0x0f; | |
| 2045 counter = 3; | |
| 2046 indicated_length = 4; | |
| 2047 } | |
| 2048 else | |
| 2049 { | |
| 2050 /* We don't support lengths longer than 4 in | |
| 2051 external-format data. */ | |
| 2052 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
| 2053 | |
| 2054 } | |
| 2055 } | |
| 2056 else | |
| 2057 { | |
| 2058 /* counter != 0 */ | |
| 2059 if ((0 == (c & 0x80)) || (0 != (c & 0x40))) | |
| 2060 { | |
| 2061 indicate_invalid_utf_8(indicated_length, | |
| 2062 counter, | |
| 2063 ch, dst, data, ignore_bom); | |
| 2064 if (c & 0x80) | |
| 2065 { | |
| 2066 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
| 2067 } | |
| 2068 else | |
| 2069 { | |
| 2070 /* The character just read is ASCII. Treat it as | |
| 2071 such. */ | |
| 2072 decode_unicode_char (c, dst, data, ignore_bom); | |
| 2073 } | |
| 2074 ch = 0; | |
| 2075 counter = 0; | |
| 2076 } | |
| 2077 else | |
| 2078 { | |
| 2079 ch = (ch << 6) | (c & 0x3f); | |
| 2080 counter--; | |
| 2081 /* Just processed the final byte. Emit the character. */ | |
| 2082 if (!counter) | |
| 2083 { | |
| 2084 /* Don't accept over-long sequences, surrogates, | |
| 2085 or codes above #x10FFFF. */ | |
| 2086 if ((ch < 0x80) || | |
| 2087 ((ch < 0x800) && indicated_length > 2) || | |
| 2088 ((ch < 0x10000) && indicated_length > 3) || | |
| 2089 valid_utf_16_surrogate(ch) || (ch > 0x10FFFF)) | |
| 2090 { | |
| 2091 indicate_invalid_utf_8(indicated_length, | |
| 2092 counter, | |
| 2093 ch, dst, data, | |
| 2094 ignore_bom); | |
| 2095 } | |
| 2096 else | |
| 2097 { | |
| 2098 decode_unicode_char (ch, dst, data, ignore_bom); | |
| 2099 } | |
| 2100 ch = 0; | |
| 2101 } | |
| 2102 } | |
| 771 | 2103 } |
| 2104 break; | |
| 2105 | |
| 2106 case UNICODE_UTF_16: | |
| 3952 | 2107 |
| 771 | 2108 if (little_endian) |
| 2109 ch = (c << counter) | ch; | |
| 2110 else | |
| 2111 ch = (ch << 8) | c; | |
| 4096 | 2112 |
| 771 | 2113 counter += 8; |
| 3952 | 2114 |
| 4096 | 2115 if (16 == counter) |
| 2116 { | |
| 771 | 2117 int tempch = ch; |
| 4096 | 2118 |
| 2119 if (valid_utf_16_first_surrogate(ch)) | |
| 2120 { | |
| 2121 break; | |
| 2122 } | |
| 771 | 2123 ch = 0; |
| 2124 counter = 0; | |
| 2125 decode_unicode_char (tempch, dst, data, ignore_bom); | |
| 2126 } | |
| 4096 | 2127 else if (32 == counter) |
| 3952 | 2128 { |
| 2129 int tempch; | |
| 4096 | 2130 |
| 4583 | 2131 if (little_endian) |
| 4096 | 2132 { |
| 4583 | 2133 if (!valid_utf_16_last_surrogate(ch >> 16)) |
| 2134 { | |
| 2135 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
| 2136 ignore_bom); | |
| 2137 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2138 ignore_bom); | |
| 2139 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2140 ignore_bom); | |
| 2141 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
| 2142 ignore_bom); | |
| 2143 } | |
| 2144 else | |
| 2145 { | |
| 2146 tempch = utf_16_surrogates_to_code((ch & 0xffff), | |
| 2147 (ch >> 16)); | |
| 2148 decode_unicode_char(tempch, dst, data, ignore_bom); | |
| 2149 } | |
| 4096 | 2150 } |
| 4583 | 2151 else |
| 2152 { | |
| 2153 if (!valid_utf_16_last_surrogate(ch & 0xFFFF)) | |
| 2154 { | |
| 2155 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
| 2156 ignore_bom); | |
| 2157 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2158 ignore_bom); | |
| 2159 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2160 ignore_bom); | |
| 2161 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
| 2162 ignore_bom); | |
| 2163 } | |
| 2164 else | |
| 2165 { | |
| 2166 tempch = utf_16_surrogates_to_code((ch >> 16), | |
| 2167 (ch & 0xffff)); | |
| 2168 decode_unicode_char(tempch, dst, data, ignore_bom); | |
| 2169 } | |
| 2170 } | |
| 2171 | |
| 3952 | 2172 ch = 0; |
| 2173 counter = 0; | |
| 4096 | 2174 } |
| 2175 else | |
| 2176 assert(8 == counter || 24 == counter); | |
| 771 | 2177 break; |
| 2178 | |
| 2179 case UNICODE_UCS_4: | |
| 4096 | 2180 case UNICODE_UTF_32: |
| 771 | 2181 if (little_endian) |
| 2182 ch = (c << counter) | ch; | |
| 2183 else | |
| 2184 ch = (ch << 8) | c; | |
| 2185 counter += 8; | |
| 2186 if (counter == 32) | |
| 2187 { | |
| 4096 | 2188 if (ch > 0x10ffff) |
| 2189 { | |
| 2190 /* ch is not a legal Unicode character. We're fine | |
| 2191 with that in UCS-4, though not in UTF-32. */ | |
| 2192 if (UNICODE_UCS_4 == type && ch < 0x80000000) | |
| 2193 { | |
| 2194 decode_unicode_char (ch, dst, data, ignore_bom); | |
| 2195 } | |
| 2196 else if (little_endian) | |
| 2197 { | |
| 2198 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
| 2199 ignore_bom); | |
| 2200 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2201 ignore_bom); | |
| 2202 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2203 ignore_bom); | |
| 2204 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
| 2205 ignore_bom); | |
| 2206 } | |
| 2207 else | |
| 2208 { | |
| 2209 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
| 2210 ignore_bom); | |
| 2211 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2212 ignore_bom); | |
| 2213 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2214 ignore_bom); | |
| 2215 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
| 2216 ignore_bom); | |
| 2217 } | |
| 2218 } | |
| 2219 else | |
| 2220 { | |
| 2221 decode_unicode_char (ch, dst, data, ignore_bom); | |
| 2222 } | |
| 771 | 2223 ch = 0; |
| 2224 counter = 0; | |
| 2225 } | |
| 2226 break; | |
| 2227 | |
| 2228 case UNICODE_UTF_7: | |
| 2500 | 2229 ABORT (); |
| 771 | 2230 break; |
| 2231 | |
| 2500 | 2232 default: ABORT (); |
| 771 | 2233 } |
| 2234 | |
| 2235 } | |
| 4096 | 2236 |
| 4688 | 2237 if (str->eof && counter) |
| 4096 | 2238 { |
| 2239 switch (type) | |
| 2240 { | |
| 2241 case UNICODE_UTF_8: | |
| 2242 indicate_invalid_utf_8(indicated_length, | |
| 2243 counter, ch, dst, data, | |
| 2244 ignore_bom); | |
| 2245 break; | |
| 2246 | |
| 2247 case UNICODE_UTF_16: | |
| 2248 case UNICODE_UCS_4: | |
| 2249 case UNICODE_UTF_32: | |
| 2250 if (8 == counter) | |
| 2251 { | |
| 2252 DECODE_ERROR_OCTET (ch, dst, data, ignore_bom); | |
| 2253 } | |
| 2254 else if (16 == counter) | |
| 2255 { | |
| 2256 if (little_endian) | |
| 2257 { | |
| 2258 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
| 2259 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2260 ignore_bom); | |
| 2261 } | |
| 2262 else | |
| 2263 { | |
| 2264 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2265 ignore_bom); | |
| 2266 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
| 2267 } | |
| 2268 } | |
| 2269 else if (24 == counter) | |
| 2270 { | |
| 2271 if (little_endian) | |
| 2272 { | |
| 2273 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
| 2274 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2275 ignore_bom); | |
| 2276 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2277 ignore_bom); | |
| 2278 } | |
| 2279 else | |
| 2280 { | |
| 2281 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
| 2282 ignore_bom); | |
| 2283 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
| 2284 ignore_bom); | |
| 2285 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
| 2286 ignore_bom); | |
| 2287 } | |
| 2288 } | |
| 2289 else assert(0); | |
| 2290 break; | |
| 2291 } | |
| 2292 ch = 0; | |
| 4688 | 2293 counter = 0; |
| 4096 | 2294 } |
| 771 | 2295 |
| 2296 data->counter = counter; | |
| 4096 | 2297 data->indicated_length = indicated_length; |
| 771 | 2298 } |
| 2299 else | |
| 2300 { | |
| 2301 unsigned char char_boundary = data->current_char_boundary; | |
| 2302 Lisp_Object charset = data->current_charset; | |
| 2303 | |
| 2304 #ifdef ENABLE_COMPOSITE_CHARS | |
| 2305 /* flags for handling composite chars. We do a little switcheroo | |
| 2306 on the source while we're outputting the composite char. */ | |
| 2307 Bytecount saved_n = 0; | |
| 867 | 2308 const Ibyte *saved_src = NULL; |
| 771 | 2309 int in_composite = 0; |
| 2310 | |
| 2311 back_to_square_n: | |
| 2312 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 2313 | |
| 2314 if (XCODING_SYSTEM_UNICODE_NEED_BOM (str->codesys) && !data->wrote_bom) | |
| 2315 { | |
| 4096 | 2316 encode_unicode_char_1 (0xFEFF, dst, type, little_endian, 1); |
| 771 | 2317 data->wrote_bom = 1; |
| 2318 } | |
| 2319 | |
| 2320 while (n--) | |
| 2321 { | |
| 867 | 2322 Ibyte c = *src++; |
| 771 | 2323 |
| 2324 #ifdef MULE | |
| 826 | 2325 if (byte_ascii_p (c)) |
| 771 | 2326 #endif /* MULE */ |
| 2327 { /* Processing ASCII character */ | |
| 2328 ch = 0; | |
| 2329 encode_unicode_char (Vcharset_ascii, c, 0, dst, type, | |
| 4096 | 2330 little_endian, 1); |
| 771 | 2331 |
| 2332 char_boundary = 1; | |
| 2333 } | |
| 2334 #ifdef MULE | |
| 867 | 2335 else if (ibyte_leading_byte_p (c) || ibyte_leading_byte_p (ch)) |
| 771 | 2336 { /* Processing Leading Byte */ |
| 2337 ch = 0; | |
| 826 | 2338 charset = charset_by_leading_byte (c); |
| 2339 if (leading_byte_prefix_p(c)) | |
| 771 | 2340 ch = c; |
| 2341 char_boundary = 0; | |
| 2342 } | |
| 2343 else | |
| 2344 { /* Processing Non-ASCII character */ | |
| 2345 char_boundary = 1; | |
| 2346 if (EQ (charset, Vcharset_control_1)) | |
| 2704 | 2347 /* See: |
| 2348 | |
| 2349 (Info-goto-node "(internals)Internal String Encoding") | |
| 2350 | |
| 2351 for the rationale behind subtracting #xa0 from the | |
| 2352 character's code. */ | |
| 2353 encode_unicode_char (Vcharset_control_1, c - 0xa0, 0, dst, | |
| 4096 | 2354 type, little_endian, 1); |
| 771 | 2355 else |
| 2356 { | |
| 2357 switch (XCHARSET_REP_BYTES (charset)) | |
| 2358 { | |
| 2359 case 2: | |
| 2360 encode_unicode_char (charset, c, 0, dst, type, | |
| 4096 | 2361 little_endian, 1); |
| 771 | 2362 break; |
| 2363 case 3: | |
| 2364 if (XCHARSET_PRIVATE_P (charset)) | |
| 2365 { | |
| 2366 encode_unicode_char (charset, c, 0, dst, type, | |
| 4096 | 2367 little_endian, 1); |
| 771 | 2368 ch = 0; |
| 2369 } | |
| 2370 else if (ch) | |
| 2371 { | |
| 2372 #ifdef ENABLE_COMPOSITE_CHARS | |
| 2373 if (EQ (charset, Vcharset_composite)) | |
| 2374 { | |
| 2375 if (in_composite) | |
| 2376 { | |
| 2377 /* #### Bother! We don't know how to | |
| 2378 handle this yet. */ | |
| 2379 encode_unicode_char (Vcharset_ascii, '~', 0, | |
| 2380 dst, type, | |
| 4096 | 2381 little_endian, 1); |
| 771 | 2382 } |
| 2383 else | |
| 2384 { | |
| 867 | 2385 Ichar emch = make_ichar (Vcharset_composite, |
| 771 | 2386 ch & 0x7F, |
| 2387 c & 0x7F); | |
| 2388 Lisp_Object lstr = | |
| 2389 composite_char_string (emch); | |
| 2390 saved_n = n; | |
| 2391 saved_src = src; | |
| 2392 in_composite = 1; | |
| 2393 src = XSTRING_DATA (lstr); | |
| 2394 n = XSTRING_LENGTH (lstr); | |
| 2395 } | |
| 2396 } | |
| 2397 else | |
| 2398 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 2399 encode_unicode_char (charset, ch, c, dst, type, | |
| 4096 | 2400 little_endian, 1); |
| 771 | 2401 ch = 0; |
| 2402 } | |
| 2403 else | |
| 2404 { | |
| 2405 ch = c; | |
| 2406 char_boundary = 0; | |
| 2407 } | |
| 2408 break; | |
| 2409 case 4: | |
| 2410 if (ch) | |
| 2411 { | |
| 2412 encode_unicode_char (charset, ch, c, dst, type, | |
| 4096 | 2413 little_endian, 1); |
| 771 | 2414 ch = 0; |
| 2415 } | |
| 2416 else | |
| 2417 { | |
| 2418 ch = c; | |
| 2419 char_boundary = 0; | |
| 2420 } | |
| 2421 break; | |
| 2422 default: | |
| 2500 | 2423 ABORT (); |
| 771 | 2424 } |
| 2425 } | |
| 2426 } | |
| 2427 #endif /* MULE */ | |
| 2428 } | |
| 2429 | |
| 2430 #ifdef ENABLE_COMPOSITE_CHARS | |
| 2431 if (in_composite) | |
| 2432 { | |
| 2433 n = saved_n; | |
| 2434 src = saved_src; | |
| 2435 in_composite = 0; | |
| 2436 goto back_to_square_n; /* Wheeeeeeeee ..... */ | |
| 2437 } | |
| 2438 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 2439 | |
| 2440 data->current_char_boundary = char_boundary; | |
| 2441 data->current_charset = charset; | |
| 2442 | |
| 2443 /* La palabra se hizo carne! */ | |
| 2444 /* A palavra fez-se carne! */ | |
| 2445 /* Whatever. */ | |
| 2446 } | |
| 2447 | |
| 2448 str->ch = ch; | |
| 2449 return orign; | |
| 2450 } | |
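The encoder above writes a byte order mark once per stream, guarded by `data->wrote_bom`. A minimal sketch of that write-once pattern for UTF-16 follows. The names `bom_state`, `put_utf16_unit` and `maybe_write_bom` are hypothetical and are not part of unicode.c; the real code goes through `encode_unicode_char_1`, whose internals are not shown in this listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stream state; only the wrote_bom flag mirrors the listing. */
struct bom_state
{
  int wrote_bom;
};

/* Store one 16-bit code unit in the requested byte order.
   Returns the number of bytes written (always 2). */
static size_t
put_utf16_unit (unsigned int unit, int little_endian, unsigned char *out)
{
  assert (unit <= 0xFFFF);
  if (little_endian)
    {
      out[0] = unit & 0xFF;
      out[1] = (unit >> 8) & 0xFF;
    }
  else
    {
      out[0] = (unit >> 8) & 0xFF;
      out[1] = unit & 0xFF;
    }
  return 2;
}

/* Emit U+FEFF the first time it is requested, then nothing.
   Returns the number of bytes written. */
static size_t
maybe_write_bom (struct bom_state *st, int need_bom, int little_endian,
                 unsigned char *out)
{
  if (!need_bom || st->wrote_bom)
    return 0;
  st->wrote_bom = 1;
  return put_utf16_unit (0xFEFF, little_endian, out);
}
```

Keeping the flag in the stream state rather than in a local variable matters because the conversion routine may be called many times for one stream, one chunk at a time.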
| 2451 | |
| 2452 /* DEFINE_DETECTOR (utf_7); */ | |
| 2453 DEFINE_DETECTOR (utf_8); | |
| 2454 DEFINE_DETECTOR_CATEGORY (utf_8, utf_8); | |
| 985 | 2455 DEFINE_DETECTOR_CATEGORY (utf_8, utf_8_bom); |
| 771 | 2456 DEFINE_DETECTOR (ucs_4); |
| 2457 DEFINE_DETECTOR_CATEGORY (ucs_4, ucs_4); | |
| 2458 DEFINE_DETECTOR (utf_16); | |
| 2459 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16); | |
| 2460 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian); | |
| 2461 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_bom); | |
| 2462 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom); | |
| 2463 | |
| 2464 struct ucs_4_detector | |
| 2465 { | |
| 2466 int in_ucs_4_byte; | |
| 2467 }; | |
| 2468 | |
| 2469 static void | |
| 2470 ucs_4_detect (struct detection_state *st, const UExtbyte *src, | |
| 2471 Bytecount n) | |
| 2472 { | |
| 2473 struct ucs_4_detector *data = DETECTION_STATE_DATA (st, ucs_4); | |
| 2474 | |
| 2475 while (n--) | |
| 2476 { | |
| 2477 UExtbyte c = *src++; | |
| 2478 switch (data->in_ucs_4_byte) | |
| 2479 { | |
| 2480 case 0: | |
| 2481 if (c >= 128) | |
| 2482 { | |
| 2483 DET_RESULT (st, ucs_4) = DET_NEARLY_IMPOSSIBLE; | |
| 2484 return; | |
| 2485 } | |
| 2486 else | |
| 2487 data->in_ucs_4_byte++; | |
| 2488 break; | |
| 2489 case 3: | |
| 2490 data->in_ucs_4_byte = 0; | |
| 2491 break; | |
| 2492 default: | |
| 2493 data->in_ucs_4_byte++; | |
| 2494 } | |
| 2495 } | |
| 2496 | |
| 2497 /* !!#### write this for real */ | |
| 2498 DET_RESULT (st, ucs_4) = DET_AS_LIKELY_AS_UNLIKELY; | |
| 2499 } | |
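The state machine in `ucs_4_detect` can be read in isolation: it tracks the position inside each 4-byte group and rules the encoding out as soon as the first byte of a group is 128 or above. The sketch below restates that check without the detector macros. `ucs4_state` and `ucs4_still_possible` are hypothetical names, and the 0/1 return value stands in for the `DET_*` results, which are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Position (0..3) inside the current 4-byte group, kept across calls. */
struct ucs4_state
{
  int in_ucs_4_byte;
};

/* Returns 0 as soon as a group starts with a byte >= 128,
   1 if nothing seen so far rules the encoding out. */
static int
ucs4_still_possible (struct ucs4_state *st, const unsigned char *src,
                     size_t n)
{
  while (n--)
    {
      unsigned char c = *src++;
      switch (st->in_ucs_4_byte)
        {
        case 0:
          if (c >= 128)
            return 0;
          st->in_ucs_4_byte++;
          break;
        case 3:
          st->in_ucs_4_byte = 0;
          break;
        default:
          st->in_ucs_4_byte++;
        }
    }
  return 1;
}
```

Because the position is stored in the state structure, the input can be fed in arbitrary chunks and the group boundaries are still tracked correctly.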
| 2500 | |
| 2501 struct utf_16_detector | |
| 2502 { | |
| 2503 unsigned int seen_ffff:1; | |
| 2504 unsigned int seen_forward_bom:1; | |
| 2505 unsigned int seen_rev_bom:1; | |
| 2506 int byteno; | |
| 2507 int prev_char; | |
| 2508 int text, rev_text; | |
| 1267 | 2509 int sep, rev_sep; |
| 2510 int num_ascii; | |
| 771 | 2511 }; |
| 2512 | |
| 2513 static void | |
| 2514 utf_16_detect (struct detection_state *st, const UExtbyte *src, | |
| 2515 Bytecount n) | |
| 2516 { | |
| 2517 struct utf_16_detector *data = DETECTION_STATE_DATA (st, utf_16); | |
| 2518 | |
| 2519 while (n--) | |
| 2520 { | |
| 2521 UExtbyte c = *src++; | |
| 2522 int prevc = data->prev_char; | |
| 2523 if (data->byteno == 1 && c == 0xFF && prevc == 0xFE) | |
| 2524 data->seen_forward_bom = 1; | |
| 2525 else if (data->byteno == 1 && c == 0xFE && prevc == 0xFF) | |
| 2526 data->seen_rev_bom = 1; | |
| 2527 | |
| 2528 if (data->byteno & 1) | |
| 2529 { | |
| 2530 if (c == 0xFF && prevc == 0xFF) | |
| 2531 data->seen_ffff = 1; | |
| 2532 if (prevc == 0 | |
| 2533 && (c == '\r' || c == '\n' | |
| 2534 || (c >= 0x20 && c <= 0x7E))) | |
| 2535 data->text++; | |
| 2536 if (c == 0 | |
| 2537 && (prevc == '\r' || prevc == '\n' | |
| 2538 || (prevc >= 0x20 && prevc <= 0x7E))) | |
| 2539 data->rev_text++; | |
| 1267 | 2540 /* #### 0x2028 is LINE SEPARATOR and 0x2029 is PARAGRAPH SEPARATOR. |
| 2541 I used to count these in text and rev_text but that is very bad, | |
| 2542 as 0x2028 is also space + left-paren in ASCII, which is extremely | |
| 2543 common. So, what do we do with these? */ | |
| 771 | 2544 if (prevc == 0x20 && (c == 0x28 || c == 0x29)) |
| 1267 | 2545 data->sep++; |
| 771 | 2546 if (c == 0x20 && (prevc == 0x28 || prevc == 0x29)) |
| 1267 | 2547 data->rev_sep++; |
| 771 | 2548 } |
| 2549 | |
| 1267 | 2550 if ((c >= ' ' && c <= '~') || c == '\n' || c == '\r' || c == '\t' || |
| 2551 c == '\f' || c == '\v') | |
| 2552 data->num_ascii++; | |
| 771 | 2553 data->byteno++; |
| 2554 data->prev_char = c; | |
| 2555 } | |
| 2556 | |
| 2557 { | |
| 2558 int variance_indicates_big_endian = | |
| 2559 (data->text >= 10 | |
| 2560 && (data->rev_text == 0 | |
| 2561 || data->text / data->rev_text >= 10)); | |
| 2562 int variance_indicates_little_endian = | |
| 2563 (data->rev_text >= 10 | |
| 2564 && (data->text == 0 | |
| 2565 || data->rev_text / data->text >= 10)); | |
| 2566 | |
| 2567 if (data->seen_ffff) | |
| 2568 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2569 else if (data->seen_forward_bom) | |
| 2570 { | |
| 2571 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2572 if (variance_indicates_big_endian) | |
| 2573 DET_RESULT (st, utf_16_bom) = DET_NEAR_CERTAINTY; | |
| 2574 else if (variance_indicates_little_endian) | |
| 2575 DET_RESULT (st, utf_16_bom) = DET_SOMEWHAT_LIKELY; | |
| 2576 else | |
| 2577 DET_RESULT (st, utf_16_bom) = DET_QUITE_PROBABLE; | |
| 2578 } | |
| 2579 else if (data->seen_forward_bom) | |
| 2580 { | |
| 2581 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2582 if (variance_indicates_big_endian) | |
| 2583 DET_RESULT (st, utf_16_bom) = DET_NEAR_CERTAINTY; | |
| 2584 else if (variance_indicates_little_endian) | |
| 2585 /* #### may need to rethink */ | |
| 2586 DET_RESULT (st, utf_16_bom) = DET_SOMEWHAT_LIKELY; | |
| 2587 else | |
| 2588 /* #### may need to rethink */ | |
| 2589 DET_RESULT (st, utf_16_bom) = DET_QUITE_PROBABLE; | |
| 2590 } | |
| 2591 else if (data->seen_rev_bom) | |
| 2592 { | |
| 2593 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2594 if (variance_indicates_little_endian) | |
| 2595 DET_RESULT (st, utf_16_little_endian_bom) = DET_NEAR_CERTAINTY; | |
| 2596 else if (variance_indicates_big_endian) | |
| 2597 /* #### may need to rethink */ | |
| 2598 DET_RESULT (st, utf_16_little_endian_bom) = DET_SOMEWHAT_LIKELY; | |
| 2599 else | |
| 2600 /* #### may need to rethink */ | |
| 2601 DET_RESULT (st, utf_16_little_endian_bom) = DET_QUITE_PROBABLE; | |
| 2602 } | |
| 2603 else if (variance_indicates_big_endian) | |
| 2604 { | |
| 2605 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2606 DET_RESULT (st, utf_16) = DET_SOMEWHAT_LIKELY; | |
| 2607 DET_RESULT (st, utf_16_little_endian) = DET_SOMEWHAT_UNLIKELY; | |
| 2608 } | |
| 2609 else if (variance_indicates_little_endian) | |
| 2610 { | |
| 2611 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
| 2612 DET_RESULT (st, utf_16) = DET_SOMEWHAT_UNLIKELY; | |
| 2613 DET_RESULT (st, utf_16_little_endian) = DET_SOMEWHAT_LIKELY; | |
| 2614 } | |
| 2615 else | |
| 1267 | 2616 { |
| 2617 /* #### FUCKME! There should really be an ASCII detector. This | |
| 2618 would rule out the need to have this built-in here as | |
| 2619 well. --ben */ | |
| 1292 | 2620 int pct_ascii = data->byteno ? (100 * data->num_ascii) / data->byteno |
| 2621 : 100; | |
| 1267 | 2622 |
| 2623 if (pct_ascii > 90) | |
| 2624 SET_DET_RESULTS (st, utf_16, DET_QUITE_IMPROBABLE); | |
| 2625 else if (pct_ascii > 75) | |
| 2626 SET_DET_RESULTS (st, utf_16, DET_SOMEWHAT_UNLIKELY); | |
| 2627 else | |
| 2628 SET_DET_RESULTS (st, utf_16, DET_AS_LIKELY_AS_UNLIKELY); | |
| 2629 } | |
| 771 | 2630 } |
| 2631 } | |
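The endianness guess in `utf_16_detect` rests on two counters: `text` counts byte pairs that look like big-endian ASCII (zero byte first), `rev_text` counts pairs that look like little-endian ASCII, and one side "wins" when it reaches 10 and outnumbers the other by a factor of 10. A sketch of just that part, assuming the whole input is available in one buffer, might look like this. `utf16_counts`, `count_utf16_text` and `dominates` are hypothetical names; the BOM, 0xFFFF and separator checks from the listing are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Same notion of "plain text" as the listing: CR, LF or printable ASCII. */
static int
is_plain_text_byte (int c)
{
  return c == '\r' || c == '\n' || (c >= 0x20 && c <= 0x7E);
}

struct utf16_counts
{
  int text;      /* pairs of the form 00 xx, as in big-endian ASCII */
  int rev_text;  /* pairs of the form xx 00, as in little-endian ASCII */
};

/* Examine each aligned byte pair (offsets 0-1, 2-3, ...). */
static struct utf16_counts
count_utf16_text (const unsigned char *src, size_t n)
{
  struct utf16_counts k = { 0, 0 };
  size_t i;

  for (i = 1; i < n; i += 2)
    {
      int prevc = src[i - 1];
      int c = src[i];

      if (prevc == 0 && is_plain_text_byte (c))
        k.text++;
      if (c == 0 && is_plain_text_byte (prevc))
        k.rev_text++;
    }
  return k;
}

/* The "variance" test: at least 10 hits, and at least 10 times the other
   count.  The b == 0 check also avoids a division by zero. */
static int
dominates (int a, int b)
{
  return a >= 10 && (b == 0 || a / b >= 10);
}
```

With these helpers, `dominates (k.text, k.rev_text)` corresponds to `variance_indicates_big_endian` and `dominates (k.rev_text, k.text)` to `variance_indicates_little_endian`.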
| 2632 | |
| 2633 struct utf_8_detector | |
| 2634 { | |
| 985 | 2635 int byteno; |
| 2636 int first_byte; | |
| 2637 int second_byte; | |
| 1267 | 2638 int prev_byte; |
| 771 | 2639 int in_utf_8_byte; |
| 1267 | 2640 int recent_utf_8_sequence; |
| 2641 int seen_bogus_utf8; | |
| 2642 int seen_really_bogus_utf8; | |
| 2643 int seen_2byte_sequence; | |
| 2644 int seen_longer_sequence; | |
| 2645 int seen_iso2022_esc; | |
| 2646 int seen_iso_shift; | |
| 1887 | 2647 unsigned int seen_utf_bom:1; |
| 771 | 2648 }; |
| 2649 | |
| 2650 static void | |
| 2651 utf_8_detect (struct detection_state *st, const UExtbyte *src, | |
| 2652 Bytecount n) | |
| 2653 { | |
| 2654 struct utf_8_detector *data = DETECTION_STATE_DATA (st, utf_8); | |
| 2655 | |
| 2656 while (n--) | |
| 2657 { | |
| 2658 UExtbyte c = *src++; | |
| 985 | 2659 switch (data->byteno) |
| 2660 { | |
| 2661 case 0: | |
| 2662 data->first_byte = c; | |
| 2663 break; | |
| 2664 case 1: | |
| 2665 data->second_byte = c; | |
| 2666 break; | |
| 2667 case 2: | |
| 2668 if (data->first_byte == 0xef && | |
| 2669 data->second_byte == 0xbb && | |
| 2670 c == 0xbf) | |
| 1267 | 2671 data->seen_utf_bom = 1; |
| 985 | 2672 break; |
| 2673 } | |
| 2674 | |
| 771 | 2675 switch (data->in_utf_8_byte) |
| 2676 { | |
| 2677 case 0: | |
| 1267 | 2678 if (data->prev_byte == ISO_CODE_ESC && c >= 0x28 && c <= 0x2F) |
| 2679 data->seen_iso2022_esc++; | |
| 2680 else if (c == ISO_CODE_SI || c == ISO_CODE_SO) | |
| 2681 data->seen_iso_shift++; | |
| 771 | 2682 else if (c >= 0xfc) |
| 2683 data->in_utf_8_byte = 5; | |
| 2684 else if (c >= 0xf8) | |
| 2685 data->in_utf_8_byte = 4; | |
| 2686 else if (c >= 0xf0) | |
| 2687 data->in_utf_8_byte = 3; | |
| 2688 else if (c >= 0xe0) | |
| 2689 data->in_utf_8_byte = 2; | |
| 2690 else if (c >= 0xc0) | |
| 2691 data->in_utf_8_byte = 1; | |
| 2692 else if (c >= 0x80) | |
| 1267 | 2693 data->seen_bogus_utf8++; |
| 2694 if (data->in_utf_8_byte > 0) | |
| 2695 data->recent_utf_8_sequence = data->in_utf_8_byte; | |
| 771 | 2696 break; |
| 2697 default: | |
| 2698 if ((c & 0xc0) != 0x80) | |
| 1267 | 2699 data->seen_really_bogus_utf8++; |
| 2700 else | |
| 771 | 2701 { |
| 1267 | 2702 data->in_utf_8_byte--; |
| 2703 if (data->in_utf_8_byte == 0) | |
| 2704 { | |
| 2705 if (data->recent_utf_8_sequence == 1) | |
| 2706 data->seen_2byte_sequence++; | |
| 2707 else | |
| 2708 { | |
| 2709 assert (data->recent_utf_8_sequence >= 2); | |
| 2710 data->seen_longer_sequence++; | |
| 2711 } | |
| 2712 } | |
| 771 | 2713 } |
| 2714 } | |
| 985 | 2715 |
| 2716 data->byteno++; | |
| 1267 | 2717 data->prev_byte = c; |
| 771 | 2718 } |
| 1267 | 2719 |
| 2720 /* either BOM or no BOM, but not both */ | |
| 2721 SET_DET_RESULTS (st, utf_8, DET_NEARLY_IMPOSSIBLE); | |
| 2722 | |
| 2723 | |
| 2724 if (data->seen_utf_bom) | |
| 2725 DET_RESULT (st, utf_8_bom) = DET_NEAR_CERTAINTY; | |
| 2726 else | |
| 2727 { | |
| 2728 if (data->seen_really_bogus_utf8 || | |
| 2729 data->seen_bogus_utf8 >= 2) | |
| 2730 ; /* bogus */ | |
| 2731 else if (data->seen_bogus_utf8) | |
| 2732 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
| 2733 else if ((data->seen_longer_sequence >= 5 || | |
| 2734 data->seen_2byte_sequence >= 10) && | |
| 2735 (!(data->seen_iso2022_esc + data->seen_iso_shift) || | |
| 2736 (data->seen_longer_sequence * 2 + data->seen_2byte_sequence) / | |
| 2737 (data->seen_iso2022_esc + data->seen_iso_shift) >= 10)) | |
| 2738 /* heuristics, heuristics, we love heuristics */ | |
| 2739 DET_RESULT (st, utf_8) = DET_QUITE_PROBABLE; | |
| 2740 else if (data->seen_iso2022_esc || | |
| 2741 data->seen_iso_shift >= 3) | |
| 2742 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
| 2743 else if (data->seen_longer_sequence || | |
| 2744 data->seen_2byte_sequence) | |
| 2745 DET_RESULT (st, utf_8) = DET_SOMEWHAT_LIKELY; | |
| 2746 else if (data->seen_iso_shift) | |
| 2747 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
| 2748 else | |
| 2749 DET_RESULT (st, utf_8) = DET_AS_LIKELY_AS_UNLIKELY; | |
| 2750 } | |
| 771 | 2751 } |
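Two pieces of `utf_8_detect` are easy to isolate: the lead-byte thresholds that decide how many continuation bytes to expect, and the distinction between a stray continuation byte ("bogus") and a missing one ("really bogus"). The sketch below mirrors those thresholds, including the legacy 5- and 6-byte forms the listing accepts. All names are hypothetical, and the ISO 2022 escape and shift counters are deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Number of continuation bytes announced by a lead byte; 0 for anything
   that does not start a multi-byte sequence. */
static int
utf8_trailing_bytes (unsigned char c)
{
  if (c >= 0xfc) return 5;
  if (c >= 0xf8) return 4;
  if (c >= 0xf0) return 3;
  if (c >= 0xe0) return 2;
  if (c >= 0xc0) return 1;
  return 0;
}

/* True if the buffer starts with the UTF-8 encoding of U+FEFF. */
static int
has_utf8_bom (const unsigned char *src, size_t n)
{
  return n >= 3 && src[0] == 0xef && src[1] == 0xbb && src[2] == 0xbf;
}

struct utf8_scan
{
  int pending;       /* continuation bytes still expected */
  int bogus;         /* continuation byte where a lead byte was expected */
  int really_bogus;  /* non-continuation byte inside a sequence */
  int sequences;     /* completed multi-byte sequences */
};

static void
utf8_scan_bytes (struct utf8_scan *st, const unsigned char *src, size_t n)
{
  while (n--)
    {
      unsigned char c = *src++;

      if (st->pending == 0)
        {
          st->pending = utf8_trailing_bytes (c);
          if (st->pending == 0 && c >= 0x80)
            st->bogus++;
        }
      else if ((c & 0xc0) != 0x80)
        st->really_bogus++;   /* like the listing, pending is not reset */
      else if (--st->pending == 0)
        st->sequences++;
    }
}
```

In the listing, a single `really_bogus` hit or two `bogus` hits are enough to leave the UTF-8 categories at `DET_NEARLY_IMPOSSIBLE`, so these two counters carry most of the negative evidence.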
| 2752 | |
| 2753 static void | |
| 2754 unicode_init_coding_stream (struct coding_stream *str) | |
| 2755 { | |
| 2756 struct unicode_coding_stream *data = | |
| 2757 CODING_STREAM_TYPE_DATA (str, unicode); | |
| 2758 xzero (*data); | |
| 2759 data->current_charset = Qnil; | |
| 2760 } | |
| 2761 | |
| 2762 static void | |
| 2763 unicode_rewind_coding_stream (struct coding_stream *str) | |
| 2764 { | |
| 2765 unicode_init_coding_stream (str); | |
| 2766 } | |
| 2767 | |
| 2768 static int | |
| 2769 unicode_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) | |
| 2770 { | |
| 3767 | 2771 if (EQ (key, Qunicode_type)) |
| 771 | 2772 { |
| 2773 enum unicode_type type; | |
| 2774 | |
| 2775 if (EQ (value, Qutf_8)) | |
| 2776 type = UNICODE_UTF_8; | |
| 2777 else if (EQ (value, Qutf_16)) | |
| 2778 type = UNICODE_UTF_16; | |
| 2779 else if (EQ (value, Qutf_7)) | |
| 2780 type = UNICODE_UTF_7; | |
| 2781 else if (EQ (value, Qucs_4)) | |
| 2782 type = UNICODE_UCS_4; | |
| 4096 | 2783 else if (EQ (value, Qutf_32)) |
| 2784 type = UNICODE_UTF_32; | |
| 771 | 2785 else |
| 2786 invalid_constant ("Invalid Unicode type", key); | |
| 2787 | |
| 2788 XCODING_SYSTEM_UNICODE_TYPE (codesys) = type; | |
| 2789 } | |
| 2790 else if (EQ (key, Qlittle_endian)) | |
| 2791 XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (codesys) = !NILP (value); | |
| 2792 else if (EQ (key, Qneed_bom)) | |
| 2793 XCODING_SYSTEM_UNICODE_NEED_BOM (codesys) = !NILP (value); | |
| 2794 else | |
| 2795 return 0; | |
| 2796 return 1; | |
| 2797 } | |
| 2798 | |
| 2799 static Lisp_Object | |
| 2800 unicode_getprop (Lisp_Object coding_system, Lisp_Object prop) | |
| 2801 { | |
| 3767 | 2802 if (EQ (prop, Qunicode_type)) |
| 771 | 2803 { |
| 2804 switch (XCODING_SYSTEM_UNICODE_TYPE (coding_system)) | |
| 2805 { | |
| 2806 case UNICODE_UTF_16: return Qutf_16; | |
| 2807 case UNICODE_UTF_8: return Qutf_8; | |
| 2808 case UNICODE_UTF_7: return Qutf_7; | |
| 2809 case UNICODE_UCS_4: return Qucs_4; | |
| 4096 | 2810 case UNICODE_UTF_32: return Qutf_32; |
| 2500 | 2811 default: ABORT (); |
| 771 | 2812 } |
| 2813 } | |
| 2814 else if (EQ (prop, Qlittle_endian)) | |
| 2815 return XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (coding_system) ? Qt : Qnil; | |
| 2816 else if (EQ (prop, Qneed_bom)) | |
| 2817 return XCODING_SYSTEM_UNICODE_NEED_BOM (coding_system) ? Qt : Qnil; | |
| 2818 return Qunbound; | |
| 2819 } | |
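`unicode_putprop` and `unicode_getprop` are inverse mappings between Lisp symbols and `enum unicode_type`. The same round trip can be sketched with plain strings in place of symbols. The enum order, the string spellings and both function names below are assumptions made for illustration; the listing only shows the symbol variables (`Qutf_8` and so on), not their printed names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the enum used in the listing. */
enum unicode_type
{
  UNICODE_UTF_16,
  UNICODE_UTF_8,
  UNICODE_UTF_7,
  UNICODE_UCS_4,
  UNICODE_UTF_32
};

/* Indexed by enum unicode_type; spellings are assumed, not taken
   from the source. */
static const char *const type_names[] =
  { "utf-16", "utf-8", "utf-7", "ucs-4", "utf-32" };

#define N_TYPE_NAMES (sizeof type_names / sizeof type_names[0])

/* "putprop" direction: returns 1 and stores the type on success,
   0 for an unknown name. */
static int
unicode_type_from_name (const char *name, enum unicode_type *out)
{
  size_t i;

  for (i = 0; i < N_TYPE_NAMES; i++)
    if (strcmp (name, type_names[i]) == 0)
      {
        *out = (enum unicode_type) i;
        return 1;
      }
  return 0;
}

/* "getprop" direction. */
static const char *
unicode_type_name (enum unicode_type type)
{
  assert ((size_t) type < N_TYPE_NAMES);
  return type_names[type];
}
```

A single table serving both directions keeps the two mappings from drifting apart when a type is added, which the paired `if` chain and `switch` in the listing have to guarantee by hand.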
| 2820 | |
| 2821 static void | |
| 2286 | 2822 unicode_print (Lisp_Object cs, Lisp_Object printcharfun, |
| 2823 int UNUSED (escapeflag)) | |
| 771 | 2824 { |
| 3767 | 2825 write_fmt_string_lisp (printcharfun, "(%s", 1, |
| 2826 unicode_getprop (cs, Qunicode_type)); | |
| 771 | 2827 if (XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (cs)) |
|
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4834
diff
changeset
|
2828 write_ascstring (printcharfun, ", little-endian"); |
| 771 | 2829 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs)) |
| 4952 | 2830 write_ascstring (printcharfun, ", need-bom"); |
| 2831 write_ascstring (printcharfun, ")"); | |
| 771 | 2832 } |
| 2833 | |
|
4690
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2834 #ifdef MULE |
| 2835 DEFUN ("set-unicode-query-skip-chars-args", Fset_unicode_query_skip_chars_args, | |
| 2836 3, 3, 0, /* | |
| 2837 Specify strings as matching characters known to Unicode coding systems. | |
| 2838 | |
| 2839 QUERY-STRING is a string matching characters that can unequivocally be | |
| 2840 encoded by the Unicode coding systems. | |
| 2841 | |
| 2842 INVALID-STRING is a string to match XEmacs characters that represent known | |
| 2843 octets on disk, but that are invalid sequences according to Unicode. | |
| 2844 | |
| 2845 UTF-8-INVALID-STRING is a more restrictive string to match XEmacs characters | |
| 2846 that are invalid UTF-8 octets. | |
| 2847 | |
| 2848 All three strings are in the format accepted by `skip-chars-forward'. | |
| 2849 */ | |
| 2850 (query_string, invalid_string, utf_8_invalid_string)) | |
| 2851 { | |
| 2852 CHECK_STRING (query_string); | |
| 2853 CHECK_STRING (invalid_string); | |
| 2854 CHECK_STRING (utf_8_invalid_string); | |
| 2855 | |
| 2856 Vunicode_query_string = query_string; | |
| 2857 Vunicode_invalid_string = invalid_string; | |
| 2858 Vutf_8_invalid_string = utf_8_invalid_string; | |
| 2859 | |
| 2860 return Qnil; | |
| 2861 } | |
| 2862 | |
| 2863 static void | |
| 2864 add_lisp_string_to_skip_chars_range (Lisp_Object string, Lisp_Object rtab, | |
| 2865 Lisp_Object value) | |
| 2866 { | |
| 2867 Ibyte *p, *pend; | |
| 2868 Ichar c; | |
| 2869 | |
| 2870 p = XSTRING_DATA (string); | |
| 2871 pend = p + XSTRING_LENGTH (string); | |
| 2872 | |
| 2873 while (p != pend) | |
| 2874 { | |
| 2875 c = itext_ichar (p); | |
| 2876 | |
| 2877 INC_IBYTEPTR (p); | |
| 2878 | |
| 2879 if (c == '\\') | |
| 2880 { | |
| 2881 if (p == pend) break; | |
| 2882 c = itext_ichar (p); | |
| 2883 INC_IBYTEPTR (p); | |
| 2884 } | |
| 2885 | |
| 2886 if (p != pend && *p == '-') | |
| 2887 { | |
| 2888 Ichar cend; | |
| 2889 | |
| 2890 /* Skip over the dash. */ | |
| 2891 p++; | |
| 2892 if (p == pend) break; | |
| 2893 cend = itext_ichar (p); | |
| 2894 | |
| 2895 Fput_range_table (make_int (c), make_int (cend), value, | |
| 2896 rtab); | |
| 2897 | |
| 2898 INC_IBYTEPTR (p); | |
| 2899 } | |
| 2900 else | |
| 2901 { | |
| 2902 Fput_range_table (make_int (c), make_int (c), value, rtab); | |
| 2903 } | |
| 2904 } | |
| 2905 } | |
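`add_lisp_string_to_skip_chars_range` walks a `skip-chars-forward` style specification, treating a backslash as "take the next character literally" and `X-Y` as an inclusive range. The sketch below does the same walk over a plain C string, assuming single-byte characters only, and collects the ranges in an array instead of a range table. `char_range` and `parse_skip_chars` are hypothetical names, not XEmacs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct char_range
{
  unsigned char first;
  unsigned char last;   /* inclusive */
};

/* Parse SPEC into at most MAX ranges stored in OUT; returns the number
   stored.  As in the listing, a trailing backslash or a trailing
   "X-" ends the parse without recording anything for it. */
static size_t
parse_skip_chars (const char *spec, struct char_range *out, size_t max)
{
  const unsigned char *p = (const unsigned char *) spec;
  const unsigned char *pend = p + strlen (spec);
  size_t count = 0;

  while (p != pend && count < max)
    {
      unsigned char c = *p++;

      if (c == '\\')
        {
          if (p == pend)
            break;
          c = *p++;            /* escaped character, taken literally */
        }

      if (p != pend && *p == '-')
        {
          p++;                 /* skip over the dash */
          if (p == pend)
            break;
          out[count].first = c;
          out[count].last = *p++;
        }
      else
        {
          out[count].first = c;
          out[count].last = c;
        }
      count++;
    }
  return count;
}
```

For example, the specification `a-z0-9_` yields three ranges, while `\-x` yields a literal dash followed by `x`, because the escaped dash is consumed as a start character before the range check runs.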
| 2906 | |
| 2907 /* This function wouldn't be necessary if initialised range tables were | |
| 2908 dumped properly; see | |
| 2909 http://mid.gmane.org/18179.49815.622843.336527@parhasard.net . */ | |
| 2910 static void | |
| 2911 initialize_unicode_query_range_tables_from_strings (void) | |
| 2912 { | |
| 2913 CHECK_STRING (Vunicode_query_string); | |
| 2914 CHECK_STRING (Vunicode_invalid_string); | |
| 2915 CHECK_STRING (Vutf_8_invalid_string); | |
| 2916 | |
| 2917 Vunicode_query_skip_chars = Fmake_range_table (Qstart_closed_end_closed); | |
| 2918 | |
| 2919 add_lisp_string_to_skip_chars_range (Vunicode_query_string, | |
| 2920 Vunicode_query_skip_chars, | |
| 2921 Qsucceeded); | |
| 2922 | |
| 2923 Vunicode_invalid_and_query_skip_chars | |
| 2924 = Fcopy_range_table (Vunicode_query_skip_chars); | |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2925 |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2926 add_lisp_string_to_skip_chars_range (Vunicode_invalid_string, |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2927 Vunicode_invalid_and_query_skip_chars, |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2928 Qinvalid_sequence); |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2929 |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2930 Vutf_8_invalid_and_query_skip_chars |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2931 = Fcopy_range_table (Vunicode_query_skip_chars); |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2932 |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2933 add_lisp_string_to_skip_chars_range (Vutf_8_invalid_string, |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2934 Vutf_8_invalid_and_query_skip_chars, |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2935 Qinvalid_sequence); |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2936 } |
|
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2937 |
| | 2938 static Lisp_Object |
| | 2939 unicode_query (Lisp_Object codesys, struct buffer *buf, Charbpos end, |
| | 2940 int flags) |
| | 2941 { |
| | 2942 Charbpos pos = BUF_PT (buf), fail_range_start, fail_range_end; |
| | 2943 Charbpos pos_byte = BYTE_BUF_PT (buf); |
| | 2944 Lisp_Object skip_chars_range_table, result = Qnil; |
| | 2945 enum query_coding_failure_reasons failed_reason, |
| | 2946 previous_failed_reason = query_coding_succeeded; |
| 4824 (c12b646d84ee: "changes to get things to compile under latest cygwin" Ben Wing <ben@xemacs.org>; parents: 4770) | 2947 int checked_unicode, |
| | 2948 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START, |
| | 2949 invalid_upper_limit = -1, |
| | 2950 unicode_type = XCODING_SYSTEM_UNICODE_TYPE (codesys); |
| 4690 (257b468bf2ca: "Move the #'query-coding-region implementation to C." Aidan Kehoe <kehoea@parhasard.net>; parents: 4688) | 2951 |
| | 2952 if (flags & QUERY_METHOD_HIGHLIGHT && |
| | 2953 /* If we're being called really early, live without highlights getting |
| | 2954 cleared properly: */ |
| | 2955 !(UNBOUNDP (XSYMBOL (Qquery_coding_clear_highlights)->function))) |
| | 2956 { |
| | 2957 /* It's okay to call Lisp here, the only non-stack object we may have |
| | 2958 allocated up to this point is skip_chars_range_table, and that's |
| | 2959 reachable from its entry in Vfixed_width_query_ranges_cache. */ |
| | 2960 call3 (Qquery_coding_clear_highlights, make_int (pos), make_int (end), |
| | 2961 wrap_buffer (buf)); |
| | 2962 } |
| | 2963 |
| | 2964 if (NILP (Vunicode_query_skip_chars)) |
| | 2965 { |
| | 2966 initialize_unicode_query_range_tables_from_strings(); |
| | 2967 } |
| | 2968 |
| | 2969 if (flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) |
| | 2970 { |
| | 2971 switch (unicode_type) |
| | 2972 { |
| | 2973 case UNICODE_UTF_8: |
| | 2974 skip_chars_range_table = Vutf_8_invalid_and_query_skip_chars; |
| | 2975 break; |
| | 2976 case UNICODE_UTF_7: |
| | 2977 /* #### See above. */ |
| | 2978 return Qunbound; |
| | 2979 break; |
| | 2980 default: |
| | 2981 skip_chars_range_table = Vunicode_invalid_and_query_skip_chars; |
| | 2982 break; |
| | 2983 } |
| | 2984 } |
| | 2985 else |
| | 2986 { |
| | 2987 switch (unicode_type) |
| | 2988 { |
| | 2989 case UNICODE_UTF_8: |
| | 2990 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START + 0x80; |
| | 2991 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF; |
| | 2992 break; |
| | 2993 case UNICODE_UTF_7: |
| | 2994 /* #### Work out what to do here in reality, read the spec and decide |
| | 2995 which octets are invalid. */ |
| | 2996 return Qunbound; |
| | 2997 break; |
| | 2998 default: |
| | 2999 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START; |
| | 3000 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF; |
| | 3001 break; |
| | 3002 } |
| | 3003 |
| | 3004 skip_chars_range_table = Vunicode_query_skip_chars; |
| | 3005 } |
| | 3006 |
| | 3007 while (pos < end) |
| | 3008 { |
| | 3009 Ichar ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte); |
| | 3010 if ((ch < 0x100 ? 1 : |
| | 3011 (!EQ (Qnil, Fget_range_table (make_int (ch), skip_chars_range_table, |
| | 3012 Qnil))))) |
| | 3013 { |
| | 3014 pos++; |
| | 3015 INC_BYTEBPOS (buf, pos_byte); |
| | 3016 } |
| | 3017 else |
| | 3018 { |
| | 3019 fail_range_start = pos; |
| | 3020 while ((pos < end) && |
| | 3021 ((checked_unicode = ichar_to_unicode (ch), |
| | 3022 -1 == checked_unicode |
| | 3023 && (failed_reason = query_coding_unencodable)) |
| | 3024 \|\| (!(flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) && |
| | 3025 (invalid_lower_limit <= checked_unicode) && |
| | 3026 (checked_unicode <= invalid_upper_limit) |
| | 3027 && (failed_reason = query_coding_invalid_sequence))) |
| | 3028 && (previous_failed_reason == query_coding_succeeded |
| | 3029 \|\| previous_failed_reason == failed_reason)) |
| | 3030 { |
| | 3031 pos++; |
| | 3032 INC_BYTEBPOS (buf, pos_byte); |
| | 3033 ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte); |
| | 3034 previous_failed_reason = failed_reason; |
| | 3035 } |
| | 3036 |
| | 3037 if (fail_range_start == pos) |
| | 3038 { |
| | 3039 /* The character can actually be encoded; move on. */ |
| | 3040 pos++; |
| | 3041 INC_BYTEBPOS (buf, pos_byte); |
| | 3042 } |
| | 3043 else |
| | 3044 { |
| | 3045 assert (previous_failed_reason == query_coding_invalid_sequence |
| | 3046 \|\| previous_failed_reason == query_coding_unencodable); |
| | 3047 |
| | 3048 if (flags & QUERY_METHOD_ERRORP) |
| | 3049 { |
| 4952 (19a72041c5ed: "Mule-izing, various fixes related to char * arguments" Ben Wing <ben@xemacs.org>; parents: 4834) | 3050 signal_error_2 |
| | 3051 (Qtext_conversion_error, |
| | 3052 "Cannot encode using coding system", |
| | 3053 make_string_from_buffer (buf, fail_range_start, |
| | 3054 pos - fail_range_start), |
| | 3055 XCODING_SYSTEM_NAME (codesys)); |
| 4690 | 3056 } |
| | 3057 |
| | 3058 if (NILP (result)) |
| | 3059 { |
| | 3060 result = Fmake_range_table (Qstart_closed_end_open); |
| | 3061 } |
| | 3062 |
| | 3063 fail_range_end = pos; |
| | 3064 |
| | 3065 Fput_range_table (make_int (fail_range_start), |
| | 3066 make_int (fail_range_end), |
| | 3067 (previous_failed_reason |
| | 3068 == query_coding_unencodable ? |
| | 3069 Qunencodable : Qinvalid_sequence), |
| | 3070 result); |
| | 3071 previous_failed_reason = query_coding_succeeded; |
| | 3072 |
| | 3073 if (flags & QUERY_METHOD_HIGHLIGHT) |
| | 3074 { |
| | 3075 Lisp_Object extent |
| | 3076 = Fmake_extent (make_int (fail_range_start), |
| | 3077 make_int (fail_range_end), |
| | 3078 wrap_buffer (buf)); |
| | 3079 |
| | 3080 Fset_extent_priority |
| | 3081 (extent, make_int (2 + mouse_highlight_priority)); |
| | 3082 Fset_extent_face (extent, Qquery_coding_warning_face); |
| | 3083 } |
| | 3084 } |
| | 3085 } |
| | 3086 } |
| | 3087 |
| | 3088 return result; |
| | 3089 } |
| | 3090 #else /* !MULE */ |
| 4770 (b9aaf2a18957: "Add missing return value type to unicode_query." Stephen J. Turnbull <stephen@xemacs.org>; parents: 4690) | 3091 static Lisp_Object |
| 4690 | 3092 unicode_query (Lisp_Object UNUSED (codesys), |
| | 3093 struct buffer * UNUSED (buf), |
| | 3094 Charbpos UNUSED (end), int UNUSED (flags)) |
| | 3095 { |
| | 3096 return Qnil; |
| | 3097 } |
| | 3098 #endif |
| | 3099 |
| 771 | 3100 int |
| 2286 | 3101 dfc_coding_system_is_unicode ( |
| | 3102 #ifdef WIN32_ANY |
| | 3103 Lisp_Object codesys |
| | 3104 #else |
| | 3105 Lisp_Object UNUSED (codesys) |
| | 3106 #endif |
| | 3107 ) |
| 771 | 3108 { |
| 1315 | 3109 #ifdef WIN32_ANY |
| 771 | 3110 codesys = Fget_coding_system (codesys); |
| | 3111 return (EQ (XCODING_SYSTEM_TYPE (codesys), Qunicode) && |
| | 3112 XCODING_SYSTEM_UNICODE_TYPE (codesys) == UNICODE_UTF_16 && |
| | 3113 XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (codesys)); |
| | 3114 |
| | 3115 #else |
| | 3116 return 0; |
| | 3117 #endif |
| | 3118 } |
| | 3119 |
| | 3120 |
| | 3121 /************************************************************************/ |
| | 3122 /* Initialization */ |
| | 3123 /************************************************************************/ |
| | 3124 |
| | 3125 void |
| | 3126 syms_of_unicode (void) |
| | 3127 { |
| | 3128 #ifdef MULE |
| 877 | 3129 DEFSUBR (Funicode_precedence_list); |
| 771 | 3130 DEFSUBR (Fset_language_unicode_precedence_list); |
| | 3131 DEFSUBR (Flanguage_unicode_precedence_list); |
| | 3132 DEFSUBR (Fset_default_unicode_precedence_list); |
| | 3133 DEFSUBR (Fdefault_unicode_precedence_list); |
| | 3134 DEFSUBR (Fset_unicode_conversion); |
| | 3135 |
| 1318 | 3136 DEFSUBR (Fload_unicode_mapping_table); |
| 771 | 3137 |
| 4690 | 3138 DEFSUBR (Fset_unicode_query_skip_chars_args); |
| 3139 | |
| 3439 | 3140 DEFSYMBOL (Qccl_encode_to_ucs_2); |
| 3141 DEFSYMBOL (Qlast_allocated_character); | |
| 771 | 3142 DEFSYMBOL (Qignore_first_column); |
| 3659 | 3143 |
| 3144 DEFSYMBOL (Qunicode_registries); | |
| 771 | 3145 #endif /* MULE */ |
| 3146 | |
| 800 | 3147 DEFSUBR (Fchar_to_unicode); |
| 3148 DEFSUBR (Funicode_to_char); | |
| 771 | 3149 |
| 3150 DEFSYMBOL (Qunicode); | |
| 3151 DEFSYMBOL (Qucs_4); | |
| 3152 DEFSYMBOL (Qutf_16); | |
| 4096 | 3153 DEFSYMBOL (Qutf_32); |
| 771 | 3154 DEFSYMBOL (Qutf_8); |
| 3155 DEFSYMBOL (Qutf_7); | |
| 3156 | |
| 3157 DEFSYMBOL (Qneed_bom); | |
| 3158 | |
| 3159 DEFSYMBOL (Qutf_16); | |
| 3160 DEFSYMBOL (Qutf_16_little_endian); | |
| 3161 DEFSYMBOL (Qutf_16_bom); | |
| 3162 DEFSYMBOL (Qutf_16_little_endian_bom); | |
| 985 | 3163 |
| 3164 DEFSYMBOL (Qutf_8); | |
| 3165 DEFSYMBOL (Qutf_8_bom); | |
| 771 | 3166 } |
| 3167 | |
| 3168 void | |
| 3169 coding_system_type_create_unicode (void) | |
| 3170 { | |
| 3171 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p"); | |
| 3172 CODING_SYSTEM_HAS_METHOD (unicode, print); | |
| 3173 CODING_SYSTEM_HAS_METHOD (unicode, convert); | |
| 4690 | 3174 CODING_SYSTEM_HAS_METHOD (unicode, query); |
| 771 | 3175 CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream); |
| 3176 CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream); | |
| 3177 CODING_SYSTEM_HAS_METHOD (unicode, putprop); | |
| 3178 CODING_SYSTEM_HAS_METHOD (unicode, getprop); | |
| 3179 | |
| 3180 INITIALIZE_DETECTOR (utf_8); | |
| 3181 DETECTOR_HAS_METHOD (utf_8, detect); | |
| 3182 INITIALIZE_DETECTOR_CATEGORY (utf_8, utf_8); | |
| 985 | 3183 INITIALIZE_DETECTOR_CATEGORY (utf_8, utf_8_bom); |
| 771 | 3184 |
| 3185 INITIALIZE_DETECTOR (ucs_4); | |
| 3186 DETECTOR_HAS_METHOD (ucs_4, detect); | |
| 3187 INITIALIZE_DETECTOR_CATEGORY (ucs_4, ucs_4); | |
| 3188 | |
| 3189 INITIALIZE_DETECTOR (utf_16); | |
| 3190 DETECTOR_HAS_METHOD (utf_16, detect); | |
| 3191 INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16); | |
| 3192 INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian); | |
| 3193 INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_bom); | |
| 3194 INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom); | |
| 3195 } | |
| 3196 | |
| 3197 void | |
| 3198 reinit_coding_system_type_create_unicode (void) | |
| 3199 { | |
| 3200 REINITIALIZE_CODING_SYSTEM_TYPE (unicode); | |
| 3201 } | |
| 3202 | |
| 3203 void | |
| 3204 vars_of_unicode (void) | |
| 3205 { | |
| 3206 Fprovide (intern ("unicode")); | |
| 3207 | |
| 3208 #ifdef MULE | |
| 4270 | 3209 staticpro (&Vnumber_of_jit_charsets); |
| 3210 Vnumber_of_jit_charsets = make_int (0); | |
| 3211 staticpro (&Vlast_jit_charset_final); | |
| 3212 Vlast_jit_charset_final = make_char (0x30); | |
| 3213 staticpro (&Vcharset_descr); | |
| 3214 Vcharset_descr | |
|
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4834
diff
changeset
|
3215 = build_defer_string ("Mule charset for otherwise unknown Unicode code points."); |
| 4270 | 3216 |
| 771 | 3217 staticpro (&Vlanguage_unicode_precedence_list); |
| 3218 Vlanguage_unicode_precedence_list = Qnil; | |
| 3219 | |
| 3220 staticpro (&Vdefault_unicode_precedence_list); | |
| 3221 Vdefault_unicode_precedence_list = Qnil; | |
| 3222 | |
| 3223 unicode_precedence_dynarr = Dynarr_new (Lisp_Object); | |
| 2367 | 3224 dump_add_root_block_ptr (&unicode_precedence_dynarr, |
| 771 | 3225 &lisp_object_dynarr_description); |
| 2367 | 3226 |
| 3659 | 3227 |
| 3228 | |
| 2367 | 3229 init_blank_unicode_tables (); |
| 3230 | |
| 3439 | 3231 staticpro (&Vcurrent_jit_charset); |
| 3232 Vcurrent_jit_charset = Qnil; | |
| 3233 | |
| 2367 | 3234 /* Note that the "block" we are describing is a single pointer, and hence |
| 3235 we could potentially use dump_add_root_block_ptr(). However, given | |
| 3236 the way the descriptions are written, we couldn't use them, and would | |
| 3237 have to write new descriptions for each of the pointers below, since | |
| 3238 we would have to make use of a description with an XD_BLOCK_ARRAY | |
| 3239 in it. */ | |
| 3240 | |
| 3241 dump_add_root_block (&to_unicode_blank_1, sizeof (void *), | |
| 3242 to_unicode_level_1_desc_1); | |
| 3243 dump_add_root_block (&to_unicode_blank_2, sizeof (void *), | |
| 3244 to_unicode_level_2_desc_1); | |
| 3245 | |
| 3246 dump_add_root_block (&from_unicode_blank_1, sizeof (void *), | |
| 3247 from_unicode_level_1_desc_1); | |
| 3248 dump_add_root_block (&from_unicode_blank_2, sizeof (void *), | |
| 3249 from_unicode_level_2_desc_1); | |
| 3250 dump_add_root_block (&from_unicode_blank_3, sizeof (void *), | |
| 3251 from_unicode_level_3_desc_1); | |
| 3252 dump_add_root_block (&from_unicode_blank_4, sizeof (void *), | |
| 3253 from_unicode_level_4_desc_1); | |
| 3659 | 3254 |
| 3255 DEFVAR_LISP ("unicode-registries", &Qunicode_registries /* | |
| 3256 Vector describing the X11 registries searched when using fallback fonts. | |
| 3257 | |
| 3258 "Fallback fonts" here includes by default those fonts used by redisplay when | |
| 3259 displaying charsets for which the `encode-as-utf-8' property is true, and | |
| 3260 those used when no font matching the charset's registries property has been | |
| 3261 found (that is, they're probably Mule-specific charsets like Ethiopic or | |
| 3262 IPA.) | |
| 3263 */ ); | |
| 4952 | 3264 Qunicode_registries = vector1(build_ascstring("iso10646-1")); |
| 4690 | 3265 |
| 3266 /* Initialised in lisp/mule/general-late.el, by a call to | |
| 3267 #'set-unicode-query-skip-chars-args. Or at least they would be, but we | |
| 3268 can't do this at dump time right now, initialised range tables aren't | |
| 3269 dumped properly. */ | |
| 3270 staticpro (&Vunicode_invalid_and_query_skip_chars); | |
| 3271 Vunicode_invalid_and_query_skip_chars = Qnil; | |
| 3272 staticpro (&Vutf_8_invalid_and_query_skip_chars); | |
| 3273 Vutf_8_invalid_and_query_skip_chars = Qnil; | |
| 3274 staticpro (&Vunicode_query_skip_chars); | |
| 3275 Vunicode_query_skip_chars = Qnil; | |
| 3276 | |
| 3277 /* If we could dump the range table above these wouldn't be necessary: */ | |
| 3278 staticpro (&Vunicode_query_string); | |
| 3279 Vunicode_query_string = Qnil; | |
| 3280 staticpro (&Vunicode_invalid_string); | |
| 3281 Vunicode_invalid_string = Qnil; | |
| 3282 staticpro (&Vutf_8_invalid_string); | |
| 3283 Vutf_8_invalid_string = Qnil; | |
| 771 | 3284 #endif /* MULE */ |
| 3285 } | |
|
4834
b3ea9c582280
Use new cygwin_conv_path API with Cygwin 1.7 for converting names between Win32 and POSIX, UTF-8-aware, with attendant changes elsewhere
Ben Wing <ben@xemacs.org>
parents:
4824
diff
changeset
|
3286 |
| 3287 void | |
| 3288 complex_vars_of_unicode (void) | |
| 3289 { | |
| 3290 /* We used to define this in unicode.el. But we need it early for | |
| 3291 Cygwin 1.7 -- used in LOCAL_FILE_FORMAT_TO_TSTR() et al. */ | |
| 3292 Fmake_coding_system_internal | |
| 3293 (Qutf_8, Qunicode, | |
| 4952 | 3294 build_defer_string ("UTF-8"), |
|
5345
db326b8fe982
Use Ben's recently-introduced listu (), where appropriate.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5307
diff
changeset
|
3295 listu (Qdocumentation, |
| 3296 build_defer_string ( | |
| 4834 | 3297 "UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding\n" |
| 3298 "sharing the following principles with the Mule-internal encoding:\n" | |
| 3299 "\n" | |
| 3300 " -- All ASCII characters (codepoints 0 through 127) are represented\n" | |
| 3301 " by themselves (i.e. using one byte, with the same value as the\n" | |
| 3302 " ASCII codepoint), and these bytes are disjoint from bytes\n" | |
| 3303 " representing non-ASCII characters.\n" | |
| 3304 "\n" | |
| 3305 " This means that any 8-bit clean application can safely process\n" | |
| 3306 " UTF-8-encoded text as it were ASCII, with no corruption (e.g. a\n" | |
| 3307 " '/' byte is always a slash character, never the second byte of\n" | |
| 3308 " some other character, as with Big5, so a pathname encoded in\n" | |
| 3309 " UTF-8 can safely be split up into components and reassembled\n" | |
| 3310 " again using standard ASCII processes).\n" | |
| 3311 "\n" | |
| 3312 " -- Leading bytes and non-leading bytes in the encoding of a\n" | |
| 3313 " character are disjoint, so moving backwards is easy.\n" | |
| 3314 "\n" | |
| 3315 " -- Given only the leading byte, you know how many following bytes\n" | |
| 3316 " are present.\n" | |
| 3317 ), | |
| 5345 | 3318 Qmnemonic, build_ascstring ("UTF8"), |
| 3319 Qunicode_type, Qutf_8, | |
| 3320 Qunbound)); | |
| 4834 | 3321 } |
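
The documentation string registered for the `utf-8` coding system above names three properties: ASCII bytes stand for themselves, leading and non-leading bytes are disjoint, and the leading byte alone gives the sequence length. The following standalone sketch illustrates them. It is not code from unicode.c: the function names are hypothetical, and it assumes the four-byte limit of RFC 3629, which may differ from the sequence lengths XEmacs itself accepts.

```c
#include <stddef.h>

/* Hypothetical helper, not part of unicode.c.  Returns the total number
   of bytes in the sequence introduced by LEAD, or 0 if LEAD cannot start
   a sequence.  Assumes RFC 3629 UTF-8 (at most four bytes).  */
int
utf8_sequence_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* ASCII: represented by itself */
  if (lead >= 0xC2 && lead <= 0xDF)
    return 2;
  if (lead >= 0xE0 && lead <= 0xEF)
    return 3;
  if (lead >= 0xF0 && lead <= 0xF4)
    return 4;
  return 0;                     /* non-leading byte or invalid lead */
}

/* Non-leading bytes always have the bit pattern 10xxxxxx, so they can
   never be mistaken for ASCII or for a leading byte.  */
int
utf8_is_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Moving backwards: step back from byte offset POS to the start of the
   preceding character by skipping over non-leading bytes.  Assumes BUF
   holds well-formed UTF-8 and POS lies on a character boundary.  */
size_t
utf8_previous_char_start (const unsigned char *buf, size_t pos)
{
  if (pos == 0)
    return 0;
  pos--;
  while (pos > 0 && utf8_is_continuation (buf[pos]))
    pos--;
  return pos;
}
```

Because a `/` byte (0x2F) is below 0x80, `utf8_is_continuation` is false for it and `utf8_sequence_length` reports a one-byte character, which is why splitting a UTF-8 pathname on `/` bytes is safe in the way the documentation string describes.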
