Mercurial > hg > xemacs-beta
annotate src/text.c @ 5868:da732079c58d
Correct some code with badly-placed parentheses, thank you Mats Lidell.
lisp/ChangeLog addition:
2015-03-16 Aidan Kehoe <kehoea@parhasard.net>
* tty-init.el (make-frame-after-init-entry-point):
Some parentheses were placed badly here with the last change,
thank you Mats for pointing it out; in passing, change to a
version of the code that doesn't create a string for garbage, not
that it matters.
| author | Aidan Kehoe <kehoea@parhasard.net> |
|---|---|
| date | Mon, 16 Mar 2015 00:40:31 +0000 |
| parents | 15041705c196 |
| children |
| rev | line source |
|---|---|
| 2367 | 1 /* Text manipulation primitives for XEmacs. |
| 771 | 2 Copyright (C) 1995 Sun Microsystems, Inc. |
| 2367 | 3 Copyright (C) 1995, 1996, 2000, 2001, 2002, 2003, 2004 Ben Wing. |
| 771 | 4 Copyright (C) 1999 Martin Buchholz. |
| 5 | |
| 6 This file is part of XEmacs. | |
| 7 | |
|
5402
308d34e9f07d
Changed bulk of GPLv2 or later files identified by script
Mats Lidell <matsl@xemacs.org>
parents:
5191
diff
changeset
|
8 XEmacs is free software: you can redistribute it and/or modify it |
| 771 | 9 under the terms of the GNU General Public License as published by the |
|
5402
308d34e9f07d
Changed bulk of GPLv2 or later files identified by script
Mats Lidell <matsl@xemacs.org>
parents:
5191
diff
changeset
|
10 Free Software Foundation, either version 3 of the License, or (at your |
|
308d34e9f07d
Changed bulk of GPLv2 or later files identified by script
Mats Lidell <matsl@xemacs.org>
parents:
5191
diff
changeset
|
11 option) any later version. |
| 771 | 12 |
| 13 XEmacs is distributed in the hope that it will be useful, but WITHOUT | |
| 14 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | |
| 15 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License | |
| 16 for more details. | |
| 17 | |
| 18 You should have received a copy of the GNU General Public License | |
|
5402
308d34e9f07d
Changed bulk of GPLv2 or later files identified by script
Mats Lidell <matsl@xemacs.org>
parents:
5191
diff
changeset
|
19 along with XEmacs. If not, see <http://www.gnu.org/licenses/>. */ |
| 771 | 20 |
| 21 /* Synched up with: Not in FSF. */ | |
| 22 | |
| 23 /* Authorship: | |
| 24 */ | |
| 25 | |
| 26 #include <config.h> | |
| 27 #include "lisp.h" | |
| 28 | |
| 29 #include "buffer.h" | |
| 30 #include "charset.h" | |
| 31 #include "file-coding.h" | |
| 32 #include "lstream.h" | |
| 1292 | 33 #include "profile.h" |
| 771 | 34 |
| 35 | |
| 36 /************************************************************************/ | |
| 37 /* long comments */ | |
| 38 /************************************************************************/ | |
| 39 | |
| 2367 | 40 /* NB: Everything below was written by Ben Wing except as otherwise noted. */ |
| 41 | |
| 42 /************************************************************************/ | |
| 43 /* */ | |
| 44 /* */ | |
| 45 /* Part A: More carefully-written documentation */ | |
| 46 /* */ | |
| 47 /* */ | |
| 48 /************************************************************************/ | |
| 49 | |
| 50 /* Authorship: Ben Wing | |
| 51 | |
| 771 | 52 |
| 826 | 53 ========================================================================== |
| 2367 | 54 7. Handling non-default formats |
| 826 | 55 ========================================================================== |
| 771 | 56 |
| 2367 | 57 We support, at least to some extent, formats other than the default |
| 58 variable-width format, for speed; all of these alternative formats are | |
| 59 fixed-width. Currently we only handle these non-default formats in | |
| 60 buffers, because access to their text is strictly controlled and thus | |
| 61 the details of the format mostly compartmentalized. The only really | |
| 62 tricky part is the search code -- the regex, Boyer-Moore, and | |
| 63 simple-search algorithms in search.c and regex.c. All other code that | |
| 64 knows directly about the buffer representation is the basic code to | |
| 65 modify or retrieve the buffer text. | |
| 66 | |
| 67 Supporting fixed-width formats in Lisp strings is harder, but possible | |
| 68 -- FSF currently does this, for example. In this case, however, | |
| 69 probably only 8-bit-fixed is reasonable for Lisp strings -- getting | |
| 70 non-ASCII-compatible fixed-width formats to work is much, much harder | |
| 71 because a lot of code assumes that strings are ASCII-compatible | |
| 72 (i.e. ASCII + other characters represented exclusively using high-bit | |
| 73 bytes) and a lot of code mixes Lisp strings and non-Lisp strings freely. | |
| 74 | |
| 75 The different possible fixed-width formats are 8-bit fixed, 16-bit | |
| 76 fixed, and 32-bit fixed. The latter can represent all possible | |
| 77 characters, but at a substantial memory penalty. The other two can | |
| 78 represent only a subset of the possible characters. How these subsets | |
| 79 are defined can be simple or very tricky. | |
| 80 | |
| 81 Currently we support only the default format and the 8-bit fixed format, | |
| 82 and in the latter, we only allow these to be the first 256 characters in | |
| 83 an Ichar (ASCII and Latin 1). | |
| 84 | |
| 85 One reasonable approach for 8-bit fixed is to allow the upper half to | |
| 86 represent any 1-byte charset, which is specified on a per-buffer basis. | |
| 87 This should work fairly well in practice since most documents are in | |
| 88 only one foreign language (possibly with some English mixed in). I | |
| 89 think FSF does something like this; or at least, they have something | |
| 90 called nonascii-translation-table and use it when converting from | |
| 91 8-bit-fixed text ("unibyte text") to default text ("multibyte text"). | |
| 92 With 16-bit fixed, you could do something like assign chunks of the 64K | |
| 93 worth of characters to charsets as they're encountered in documents. | |
| 94 This should work well with most Asian documents. | |
| 95 | |
| 96 If/when we switch to using Unicode internally, we might have formats more | |
| 97 like this: | |
| 98 | |
| 99 -- UTF-8 or some extension as the default format. Perl uses an | |
| 100 extension that handles 64-bit chars and requires as much as 13 bytes per | |
| 101 char, vs. the standard of 31-bit chars and 6 bytes max. UTF-8 has the | |
| 102 same basic properties as our own variable-width format (see text.c, | |
| 103 Internal String Encoding) and so most code would not need to be changed. | |
| 104 | |
| 105 -- UTF-16 as a "pseudo-fixed" format (i.e. 16-bit fixed plus surrogates | |
| 106 for representing characters not in the BMP, aka >= 65536). The vast | |
| 107 majority of documents will have no surrogates in them so byte/char | |
| 108 conversion will be very fast. | |
| 109 | |
| 110 -- an 8-bit fixed format, like currently. | |
| 111 | |
| 112 -- possibly, UCS-4 as a 32-bit fixed format. | |
| 113 | |
| 114 The fixed-width formats essentially treat the buffer as an array of | |
| 115 8-bit, 16-bit or 32-bit integers. This means that how they are stored | |
| 116 in memory (in particular, big-endian or little-endian) depends on the | |
| 117 native format of the machine's processor. It also means we have to | |
| 118 worry a bit about alignment (basically, we just need to keep the gap an | |
| 119 integral size of the character size, and get things aligned properly | |
| 120 when converting the buffer between formats). | |
| 826 | 121 |
| 122 ========================================================================== | |
| 2367 | 123 8. Using UTF-16 as the default text format |
| 826 | 124 ========================================================================== |
| 125 | |
| 2367 | 126 NOTE: The Eistring API is (or should be) Mule-correct even without |
| 127 an ASCII-compatible internal representation. | |
| 128 | |
| 129 #### Currently, the assumption that text units are one byte in size is | |
| 130 embedded throughout XEmacs, and `Ibyte *' is used where `Itext *' should | |
| 131 be. The way to fix this is to (among other things) | |
| 132 | |
| 133 (a) review all places referencing `Ibyte' and `Ibyte *', change them to | |
| 134 use Itext, and fix up the code. | |
| 135 (b) change XSTRING_DATA to be of type Itext * | |
| 136 (c) review all uses of XSTRING_DATA | |
| 137 (d) eliminate XSTRING_LENGTH, splitting it into XSTRING_BYTE_LENGTH and | |
| 138 XSTRING_TEXT_LENGTH and reviewing all places referencing this | |
| 139 (e) make similar changes to other API's that refer to the "length" of | |
| 140 something, such as qxestrlen() and eilen() | |
| 141 (f) review all use of `CIbyte *'. Currently this is usually a way of | |
| 142 passing literal ASCII text strings in places that want internal text. | |
| 143 Either create separate _ascii() and _itext() versions of the | |
| 144 functions taking CIbyte *, or make use of something like the | |
| 145 WEXTTEXT() macro, which will generate wide strings as appropriate. | |
| 146 (g) review all uses of Bytecount and see which ones should be Textcount. | |
| 147 (h) put in error-checking code that will be tripped as often as possible | |
| 148 when doing anything with internal text, and check to see that ASCII | |
| 149 text has not mistakenly filtered in. This should be fairly easy as | |
| 150 ASCII text will generally be entirely spaces and letters whereas every | |
| 151 second byte of Unicode text will generally be a null byte. Either we | |
| 152 abort if the second bytes are entirely letters and numbers, or, | |
| 153 perhaps better, do the equivalent of a non-MULE build, where we should | |
| 154 be dealing entirely with 8-bit characters, and assert that the high | |
| 155 bytes of each pair are null. | |
| 156 (i) review places where xmalloc() is called. If we convert each use of | |
| 157 xmalloc() to instead be xnew_array() or some other typed routine, | |
| 158 then we will find every place that allocates space for Itext and | |
| 159 assumes it is based on one-byte units. | |
| 160 (j) encourage the use of ITEXT_ZTERM_SIZE instead of '+ 1' whenever we | |
| 161 are adding space for a zero-terminator, to emphasize what we are | |
| 162 doing and make sure the calculations are correct. Similarly for | |
| 163 EXTTEXT_ZTERM_SIZE. | |
| 164 (k) Note that the qxestr*() functions, among other things, will need to | |
| 165 be rewritten. | |
| 166 | |
| 167 Note that this is a lot of work, and is not high on the list of priorities | |
| 168 currently. | |
| 826 | 169 |
| 170 ========================================================================== | |
| 2367 | 171 9. Miscellaneous |
| 826 | 172 ========================================================================== |
| 173 | |
| 174 A. Unicode Support | |
| 771 | 175 |
| 1292 | 176 Unicode support is very desirable. Currrently we know how to handle |
| 177 externally-encoded Unicode data in various encodings -- UTF-16, UTF-8, | |
| 178 etc. However, we really need to represent Unicode characters internally | |
| 179 as-is, rather than converting to some language-specific character set. | |
| 180 For efficiency, we should represent Unicode characters using 3 bytes | |
| 181 rather than 4. This means we need to find leading bytes for Unicode. | |
| 182 Given that there are 65,536 characters in Unicode and we can attach | |
| 183 96x96 = 9,216 characters per leading byte, we need eight leading bytes | |
| 184 for Unicode. We currently have four free (0x9A - 0x9D), and with a | |
| 185 little bit of rearranging we can get five: ASCII doesn't really need to | |
| 186 take up a leading byte. (We could just as well use 0x7F, with a little | |
| 187 change to the functions that assume that 0x80 is the lowest leading | |
| 188 byte.) This means we still need to dump three leading bytes and move | |
| 189 them into private space. The CNS charsets are good candidates since | |
| 190 they are rarely used, and JAPANESE_JISX0208_1978 is becoming less and | |
| 191 less used and could also be dumped. | |
| 826 | 192 |
| 193 B. Composite Characters | |
| 194 | |
| 195 Composite characters are characters constructed by overstriking two | |
| 771 | 196 or more regular characters. |
| 197 | |
| 198 1) The old Mule implementation involves storing composite characters | |
| 199 in a buffer as a tag followed by all of the actual characters | |
| 200 used to make up the composite character. I think this is a bad | |
| 201 idea; it greatly complicates code that wants to handle strings | |
| 202 one character at a time because it has to deal with the possibility | |
| 203 of great big ungainly characters. It's much more reasonable to | |
| 204 simply store an index into a table of composite characters. | |
| 205 | |
| 206 2) The current implementation only allows for 16,384 separate | |
| 207 composite characters over the lifetime of the XEmacs process. | |
| 208 This could become a potential problem if the user | |
| 209 edited lots of different files that use composite characters. | |
| 210 Due to FSF bogosity, increasing the number of allowable | |
| 211 composite characters under Mule would decrease the number | |
| 212 of possible faces that can exist. Mule already has shrunk | |
| 213 this to 2048, and further shrinkage would become uncomfortable. | |
| 214 No such problems exist in XEmacs. | |
| 215 | |
| 3498 | 216 Composite characters could be represented as 0x8D C1 C2 C3, where each |
| 217 C[1-3] is in the range 0xA0 - 0xFF. This allows for slightly under | |
| 218 2^20 (one million) composite characters over the XEmacs process | |
| 219 lifetime. Or you could use 0x8D C1 C2 C3 C4, allowing for about 85 | |
| 220 million (slightly over 2^26) composite characters. | |
| 826 | 221 |
| 2367 | 222 ========================================================================== |
| 223 10. Internal API's | |
| 224 ========================================================================== | |
| 225 | |
| 226 All of these are documented in more detail in text.h. | |
| 227 | |
| 228 @enumerate | |
| 229 @item | |
| 230 Basic internal-format API's | |
| 231 | |
| 232 These are simple functions and macros to convert between text | |
| 233 representation and characters, move forward and back in text, etc. | |
| 234 | |
| 235 @item | |
| 236 The DFC API | |
| 237 | |
| 238 This is for conversion between internal and external text. Note that | |
| 239 there is also the "new DFC" API, which *returns* a pointer to the | |
| 240 converted text (in alloca space), rather than storing it into a | |
| 241 variable. | |
| 242 | |
| 243 @item | |
| 244 The Eistring API | |
| 245 | |
| 4073 | 246 \(This API is currently under-used) When doing simple things with |
| 2367 | 247 internal text, the basic internal-format API's are enough. But to do |
| 248 things like delete or replace a substring, concatenate various strings, | |
| 249 etc. is difficult to do cleanly because of the allocation issues. | |
| 250 The Eistring API is designed to deal with this, and provides a clean | |
| 251 way of modifying and building up internal text. (Note that the former | |
| 252 lack of this API has meant that some code uses Lisp strings to do | |
| 253 similar manipulations, resulting in excess garbage and increased | |
| 254 garbage collection.) | |
| 255 | |
| 256 NOTE: The Eistring API is (or should be) Mule-correct even without | |
| 257 an ASCII-compatible internal representation. | |
| 258 @end enumerate | |
| 259 | |
| 260 ========================================================================== | |
| 261 11. Other Sources of Documentation | |
| 262 ========================================================================== | |
| 263 | |
| 264 man/lispref/mule.texi | |
| 265 @enumerate | |
| 266 @item | |
| 267 another intro to characters, encodings, etc; #### Merge with the | |
| 268 above info | |
| 269 @item | |
| 270 documentation of ISO-2022 | |
| 271 @item | |
| 272 The charset and coding-system Lisp API's | |
| 273 @item | |
| 274 The CCL conversion language for writing encoding conversions | |
| 275 @item | |
| 276 The Latin-Unity package for unifying Latin charsets | |
| 277 @end enumerate | |
| 278 | |
| 279 man/internals/internals.texi (the Internals manual) | |
| 280 @enumerate | |
| 281 @item | |
| 282 "Coding for Mule" -- how to write Mule-aware code | |
| 283 @item | |
| 284 "Modules for Internationalization" | |
| 285 @item | |
| 286 "The Text in a Buffer" -- more about the different ways of | |
| 287 viewing buffer positions; #### Merge with the above info | |
| 288 @item | |
| 289 "MULE Character Sets and Encodings" -- yet another intro | |
| 290 to characters, encodings, etc; #### Merge with the | |
| 291 above info; also some documentation of Japanese EUC and JIS7, | |
| 292 and CCL internals | |
| 293 @end enumerate | |
| 294 | |
| 295 text.h -- info about specific XEmacs-C API's for handling internal and | |
| 296 external text | |
| 297 | |
| 298 intl-win32.c -- Windows-specific I18N information | |
| 299 | |
| 300 lisp.h -- some info appears alongside the definitions of the basic | |
| 301 character-related types | |
| 302 | |
| 303 unicode.c -- documentation about Unicode translation tables | |
| 826 | 304 */ |
| 771 | 305 |
| 2367 | 306 |
| 307 /************************************************************************/ | |
| 308 /* */ | |
| 309 /* */ | |
| 310 /* Part B: Random proposals for work to be done */ | |
| 311 /* */ | |
| 312 /* */ | |
| 313 /************************************************************************/ | |
| 314 | |
| 315 | |
| 316 /* | |
| 317 | |
| 318 | |
| 319 ========================================================================== | |
| 320 - Mule design issues (ben) | |
| 321 ========================================================================== | |
| 322 | |
| 323 circa 1999 | |
| 324 | |
| 325 Here is a more detailed list of Mule-related projects that we will be | |
| 326 working on. They are more or less ordered according to how we will | |
| 327 proceed, but it's not exact. In particular, there will probably be | |
| 328 time overlap among adjacent projects. | |
| 329 | |
| 330 @enumerate | |
| 331 @item | |
| 332 Modify the internal/external conversion macros to allow for | |
| 333 MS Windows support. | |
| 334 | |
| 335 @item | |
| 336 Modify the buffer macros to allow for more than one internal | |
| 337 representation, e.g. fixed width and variable width. | |
| 338 | |
| 339 @item | |
| 340 Review the existing Mule code, especially the lisp code, for code | |
| 341 quality issues and improve the cleanliness of it. Also work on | |
| 342 creating a specification for the Mule API. | |
| 343 | |
| 344 @item | |
| 345 Write some more automated mule tests. | |
| 346 | |
| 347 @item | |
| 348 Integrate Tomohiko's UTF-2000 code, fixing it up so that nothing is | |
| 349 broken when the UTF-2000 configure option is not enabled. | |
| 350 | |
| 351 @item | |
| 352 Fix up the MS Windows code to be Mule-correct, so that you can | |
| 353 compile with Mule support under MS windows and have a working | |
| 354 XEmacs, at least just with Latin-1. | |
| 355 | |
| 356 @item | |
| 357 Implement a scheme to guarantee no corruption of files, even with | |
| 358 an incorrect coding system - in particular, guarantee no corruption | |
| 359 of binary files. | |
| 360 | |
| 361 @item | |
| 362 Make the text property support in XEmacs robust with respect to | |
| 363 string and text operations, so that the `no corruption' support in | |
| 364 the previous entry works properly, even if a lot of cutting and | |
| 365 pasting is done. | |
| 366 | |
| 367 @item | |
| 368 Improve the handling of auto-detection so that, when there is any | |
| 369 possibility at all of mistake, the user is informed of the detected | |
| 370 encoding and given the choice of choosing other possibilities. | |
| 371 | |
| 372 @item | |
| 373 Improve the support for different language environments in XEmacs, | |
| 374 for example, the priority of coding systems used in auto-detection | |
| 375 should properly reflect the language environment. This probably | |
| 376 necessitates rethinking the current `coding system priority' | |
| 377 scheme. | |
| 378 | |
| 379 @item | |
| 380 Do quality work to improve the existing UTF-2000 implementation. | |
| 381 | |
| 382 @item | |
| 383 Implement preliminary support for 8-bit fixed width | |
| 384 representation. First, we will only implement 7-bit support, and | |
| 385 will fall back to variable width as soon as any non-ASCII | |
| 386 character is encountered. Then we will improve the support to | |
| 387 handle an arbitrary character set in the upper half of the 8-bit space. | |
| 388 | |
| 389 @item | |
| 390 Investigate any remaining hurdles to making --with-mule be the | |
| 391 default configure option. | |
| 392 @end enumerate | |
| 393 | |
| 394 ========================================================================== | |
| 395 - Mule design issues (stephen) | |
| 396 ========================================================================== | |
| 397 | |
| 398 What I see as Mule priorities (in rough benefit order, I am not taking | |
| 399 account of difficulty, nor the fact that some - eg 8 & 10 - will | |
| 400 probably come as packages): | |
| 401 | |
| 402 @enumerate | |
| 403 @item | |
| 404 Fix the autodetect problem (by making the coding priority list | |
| 405 user-configurable, as short as he likes, even null, with "binary" | |
| 406 as the default). | |
| 407 @item | |
| 408 Document the language environments and other Mule "APIs" as | |
| 409 implemented (since there is no real design spec). Check to see | |
| 410 how and where they are broken. | |
| 411 @item | |
| 412 Make the Mule menu useful to non-ISO-2022-literate folks. | |
| 413 @item | |
| 414 Redo the lstreams stuff to make it easy and robust to "pipeline", | |
| 415 eg, libz | gnupg | jis2mule. | |
| 416 @item | |
| 417 Make Custom Mule-aware. (This probably depends on a sensible | |
| 418 fonts model.) | |
| 419 @item | |
| 420 Implement the "literal byte stream" memory feature. | |
| 421 @item | |
| 422 Study the FSF implementation of Mule for background for 7 & 8. | |
| 423 @item | |
| 424 Identify desirable Mule features (eg, i18n-ized messages as above, | |
| 425 collating tables by language environment, etc). (New features | |
| 426 might have priority as high as 9.) | |
| 427 @item | |
| 428 Specify Mule UIs, APIs, etc, and design and (re)implement them. | |
| 429 @item | |
| 430 Implement the 8-bit-wide buffer optimization. | |
| 431 @item | |
| 432 Move the internal encoding to UTF-32 (subject to Olivier's caveats | |
| 433 regarding compose characters), with the variable-width char | |
| 434 buffers using UTF-8. | |
| 435 @item | |
| 436 Implement the 16- and 32-bit-wide buffer optimizations. | |
| 437 @end enumerate | |
| 438 | |
| 439 ========================================================================== | |
| 440 - Mule design issues "short term" (ben) | |
| 441 ========================================================================== | |
| 442 | |
| 443 @enumerate | |
| 444 @item | |
| 445 Finish changes in fixup/directory, get in CVS. | |
| 446 | |
| 447 (Test with and without "quick-build", to see if really faster) | |
| 448 (need autoconf) | |
| 449 | |
| 450 @item | |
| 451 Finish up Windows/Mule changes. Outline of this elsewhere; Do | |
| 452 *minimal* effort. | |
| 453 | |
| 454 @item | |
| 455 Continue work on Windows stability, e.g. go through existing notes | |
| 456 on Windows Mule-ization + extract all info. | |
| 457 | |
| 458 @item | |
| 459 Get Unicode translation tables integrated. | |
| 460 | |
| 461 Finish UCS2/UTF16 coding system. | |
| 462 | |
| 463 @item | |
| 464 Make sure coding system priority list is language-environment specific. | |
| 465 | |
| 466 @item | |
| 467 Consider moving language selection Menu up to be parallel with Mule menu. | |
| 468 | |
| 469 @item | |
| 470 Check to make sure we grok the default locale at startup under | |
| 471 Windows and understand the Windows locales. Finish implementation | |
| 472 of mswindows-multibyte and make sure it groks all the locales. | |
| 473 | |
| 474 @item | |
| 475 Do the above as best as we can without using Unicode tables. | |
| 476 | |
| 477 @item | |
| 478 Start tagging all text with a language text property, | |
| 479 indicating the current language environment when the text was input. | |
| 480 | |
| 481 @item | |
| 482 Make sure we correctly accept input of non-ASCII chars | |
| 483 (probably already do!) | |
| 484 | |
| 485 @item | |
| 486 Implement active language/keyboard switching under Windows. | |
| 487 | |
| 488 @item | |
| 489 Look into implementing support for "MS IME" protocol (Microsoft | |
| 490 fancy built-in Asian input methods). | |
| 491 | |
| 492 @item | |
| 493 Redo implementation of mswindows-multibyte and internal display to | |
| 494 entirely use translation to/from Unicode for increased accuracy. | |
| 495 | |
| 496 @item | |
| 497 Implement buf<->char improvements from FSF. Also implement | |
| 498 my string byte<->char optimization structure. | |
| 499 | |
| 500 @item | |
| 501 Integrate all Mule DOCS from 20.6 or 21.0. Try to add sections | |
| 502 for what we've added. | |
| 503 | |
| 504 @item | |
| 505 Implement 8-bit fixed width optimizations. Then work on 16-bit. | |
| 506 @end enumerate | |
| 507 | |
| 508 ========================================================================== | |
| 509 - Mule design issues (more) (ben) | |
| 510 ========================================================================== | |
| 511 | |
| 512 Get minimal Mule for Windows working using Ikeyama's patches. At | |
| 513 first, rely on his conversion of internal -> external | |
| 514 locale-specific but very soon (as soon as we get translation | |
| 515 tables) can switch to using Unicode versions of display funs, which | |
| 516 will allow many more charsets to be handled and in a more | |
| 517 consistent fashion. | |
| 518 | |
| 519 i.e. to convert an internal string to an external format, at first | |
| 520 we use our own knowledge of the Microsoft locale file formats but | |
| 521 an alternative is to convert to Unicode and use Microsoft's | |
| 522 convert-Unicode-to-locale encoding functions. This gains us a | |
| 523 great deal of generality, since in practice all charset caching | |
| 524 points can be wrapped into Unicode caching points. | |
| 525 | |
| 526 This requires adding UCS2 support, which I'm doing. This support | |
| 527 would let us convert internal -> Unicode, which is exactly what we | |
| 528 want. | |
| 529 | |
| 530 At first, though, I would do the UCS2 support, but leave the | |
| 531 existing way of doing things in redisplay. Meanwhile, I'd go | |
| 532 through and fix up the places in the code that assume we are | |
| 533 dealing with unibytes. | |
| 534 | |
| 535 After this, the font problems will be fixed , we should have a | |
| 536 pretty well working XEmacs + MULE under Windows. The only real | |
| 537 other work is the clipboard code, which should be straightforward. | |
| 538 | |
| 539 ========================================================================== | |
| 540 - Mule design discussion | |
| 541 ========================================================================== | |
| 542 | |
| 543 -------------------------------------------------------------------------- | |
| 544 | |
| 545 Ben | |
| 546 | |
| 547 April 11, 2000 | |
| 548 | |
| 549 Well yes, this was the whole point of my "no lossage" proposal of being | |
| 550 able to undo any coding-system transformation on a buffer. The idea was | |
|
5384
3889ef128488
Fix misspelled words, and some grammar, across the entire source tree.
Jerry James <james@xemacs.org>
parents:
5191
diff
changeset
|
551 to figure out which transformations were definitely reversible, and for |
| 2367 | 552 all the others, cache the original text in a text property. This way, you |
| 553 could probably still do a fairly good job at constructing a good reversal | |
| 554 even after you've gone into the text and added, deleted, and rearranged | |
| 555 some things. | |
| 556 | |
| 557 But you could implement it much more simply and usefully by just | |
| 558 determining, for any text being decoded into mule-internal, can we go back | |
| 559 and read the source again? If not, remember the entire file (GNUS | |
| 560 message, etc) in text properties. Then, implement the UI interface (like | |
| 561 Netscape's) on top of that. This way, you have something that at least | |
| 562 works, but it might be inefficient. All we would need to do is work on | |
| 563 making the | |
| 564 underlying implementation more efficient. | |
| 565 | |
| 566 Are you interested in doing this? It would be a huge win for users. | |
| 567 Hrvoje Niksic wrote: | |
| 568 | |
| 569 > Ben Wing <ben@666.com> writes: | |
| 570 > | |
| 571 > > let me know exactly what "rethink" functionality you want and i'll | |
| 572 > > come up with an interface. perhaps you just want something like | |
| 573 > > netscape's encoding menu, where if you switch encodings, it reloads | |
| 574 > > and reencodes? | |
| 575 > | |
| 576 > It might be a bit more complex than that. In many cases, it's hard or | |
| 577 > impossible to meaningfully "reload" -- for instance, this | |
| 578 > functionality should be available while editing a Gnus message, as | |
| 579 > well as while visiting a file. | |
| 580 > | |
| 581 > For the special case of Latin-N <-> Latin-M conversion, things could | |
| 582 > be done easily -- to convert from N to M, you only need to convert | |
| 583 > internal representation back to N, and then convert it forth to M. | |
| 584 | |
| 585 -------------------------------------------------------------------------- | |
| 586 April 11, 2000 | |
| 587 | |
| 588 Well yes, this was the whole point of my "no lossage" proposal of being | |
| 589 able to undo any coding-system transformation on a buffer. The idea was | |
|
5384
3889ef128488
Fix misspelled words, and some grammar, across the entire source tree.
Jerry James <james@xemacs.org>
parents:
5191
diff
changeset
|
590 to figure out which transformations were definitely reversible, and for |
| 2367 | 591 all the others, cache the original text in a text property. This way, you |
| 592 could probably still do a fairly good job at constructing a good reversal | |
| 593 even after you've gone into the text and added, deleted, and rearranged | |
| 594 some things. | |
| 595 | |
| 596 But you could implement it much more simply and usefully by just | |
| 597 determining, for any text being decoded into mule-internal, can we go back | |
| 598 and read the source again? If not, remember the entire file (GNUS | |
| 599 message, etc) in text properties. Then, implement the UI interface (like | |
| 600 Netscape's) on top of that. This way, you have something that at least | |
| 601 works, but it might be inefficient. All we would need to do is work on | |
| 602 making the | |
| 603 underlying implementation more efficient. | |
| 604 | |
| 605 Are you interested in doing this? It would be a huge win for users. | |
| 606 Hrvoje Niksic wrote: | |
| 607 | |
| 608 > Ben Wing <ben@666.com> writes: | |
| 609 > | |
| 610 > > let me know exactly what "rethink" functionality you want and i'll | |
| 611 > > come up with an interface. perhaps you just want something like | |
| 612 > > netscape's encoding menu, where if you switch encodings, it reloads | |
| 613 > > and reencodes? | |
| 614 > | |
| 615 > It might be a bit more complex than that. In many cases, it's hard or | |
| 616 > impossible to meaningfully "reload" -- for instance, this | |
| 617 > functionality should be available while editing a Gnus message, as | |
| 618 > well as while visiting a file. | |
| 619 > | |
| 620 > For the special case of Latin-N <-> Latin-M conversion, things could | |
| 621 > be done easily -- to convert from N to M, you only need to convert | |
| 622 > internal representation back to N, and then convert it forth to M. | |
| 623 | |
| 624 | |
| 625 ------------------------------------------------------------------------ | |
| 626 | |
| 627 ========================================================================== | |
| 628 - Redoing translation macros [old] | |
| 629 ========================================================================== | |
| 630 | |
| 631 Currently the translation macros (the macros with names such as | |
| 632 GET_C_STRING_CTEXT_DATA_ALLOCA) have names that are difficult to parse | |
| 633 or remember, and are not all that general. In the process of | |
| 634 reviewing the Windows code so that it could be muleized, I discovered | |
| 635 that these macros need to be extended in various ways to allow for | |
| 636 the Windows code to be easily muleized. | |
| 637 | |
| 638 Since the macros needed to be changed anyways, I figured it would be a | |
| 639 good time to redo them properly. I propose new macros which have | |
| 640 names like this: | |
| 641 | |
| 642 @itemize @bullet | |
| 643 @item | |
| 644 <A>_TO_EXTERNAL_FORMAT_<B> | |
| 645 @item | |
| 646 <A>_TO_EXTERNAL_FORMAT_<B>_1 | |
| 647 @item | |
| 648 <C>_TO_INTERNAL_FORMAT_<D> | |
| 649 @item | |
| 650 <C>_TO_INTERNAL_FORMAT_<D>_1 | |
| 651 @end itemize | |
| 652 | |
| 653 A and C represent the source of the data, and B and D represent the | |
| 654 sink of the data. | |
| 655 | |
| 656 All of these macros call either the functions | |
| 657 convert_to_external_format or convert_to_internal_format internally, | |
| 658 with some massaging of the arguments. | |
| 659 | |
| 660 All of these macros take the following arguments: | |
| 661 | |
| 662 @itemize @bullet | |
| 663 @item | |
| 664 First, one or two arguments indicating the source of the data. | |
| 665 @item | |
| 666 Second, an argument indicating the coding system. (In order to avoid | |
| 667 an excessive number of macros, we no longer provide separate macros | |
| 668 for specific coding systems.) | |
| 669 @item | |
| 670 Third, one or two arguments indicating the sink of the data. | |
| 671 @item | |
| 672 Fourth, optionally, arguments indicating the error behavior and the | |
| 673 warning class (these arguments are only present in the _1 versions | |
| 674 of the macros). The other, shorter named macros are trivial | |
| 675 interfaces onto these macros with the error behavior being | |
| 676 ERROR_ME_WARN, with the warning class being Vstandard_warning_class. | |
| 677 @end itemize | |
| 678 | |
| 679 <A> can be one of the following: | |
| 680 @itemize @bullet | |
| 681 @item | |
| 682 LISP (which means a Lisp string) Takes one argument, a Lisp Object. | |
| 683 @item | |
| 684 LSTREAM (which indicates an lstream) Takes one argument, an | |
| 685 lstream. The data is read from the lstream until EOF is reached. | |
| 686 @item | |
| 687 DATA (which indicates a raw memory area) Takes two arguments, a | |
| 688 pointer and a length in bytes. | |
| 689 (You must never use this if the source of the data is a Lisp string, | |
| 690 because of the possibility of relocation during garbage collection.) | |
| 691 @end itemize | |
| 692 | |
| 693 <B> can be one of the following: | |
| 694 @itemize @bullet | |
| 695 @item | |
| 696 ALLOCA (which means that the resulting data is stored in alloca()ed | |
| 697 memory. Two arguments should be specified, a pointer and a length, | |
| 698 which should be lvalues.) | |
| 699 @item | |
| 700 MALLOC (which means that the resulting data is stored in malloc()ed | |
| 701 memory. Two arguments should be specified, a pointer and a | |
| 702 length. The memory must be free()d by the caller. | |
| 703 @item | |
| 704 OPAQUE (which means the resulting data is stored in an opaque Lisp | |
| 705 Object. This takes one argument, a lvalue Lisp Object. | |
| 706 @item | |
| 707 LSTREAM. The data is written to an lstream. | |
| 708 @end itemize | |
| 709 | |
| 710 <C> can be one of the : | |
| 711 @itemize @bullet | |
| 712 @item | |
| 713 DATA | |
| 714 @item | |
| 715 LSTREAM | |
| 716 @end itemize | |
| 717 (just like <A> above) | |
| 718 | |
| 719 <D> can be one of | |
| 720 @itemize @bullet | |
| 721 @item | |
| 722 ALLOCA | |
| 723 @item | |
| 724 MALLOC | |
| 725 @item | |
| 726 LISP This means a Lisp String. | |
| 727 @item | |
| 728 BUFFER The resulting data is inserted into a buffer at the buffer's | |
| 729 value of point. | |
| 730 @item | |
| 731 LSTREAM The data is written to the lstream. | |
| 732 @end itemize | |
| 733 | |
| 734 | |
| 735 Note that I have eliminated the FORMAT argument of previous macros, | |
| 736 and replaced it with a coding system. This was made possible by | |
| 737 coding system aliases. In place of old `format's, we use a `virtual | |
| 738 coding system', which is aliased to the actual coding system. | |
| 739 | |
| 740 The value of the coding system argument can be anything that is legal | |
| 741 input to get_coding_system, i.e. a symbol or a coding system object. | |
| 742 | |
| 743 ========================================================================== | |
| 744 - creation of generic macros for accessing internally formatted data [old] | |
| 745 ========================================================================== | |
| 746 | |
| 747 I have a design; it's all written down (I did it in Tsukuba), and I just have | |
| 748 to have it transcribed. It's higher level than the macros, though; it's Lisp | |
| 749 primitives that I'm designing. | |
| 750 | |
| 751 As for the design of the macros, don't worry so much about all files having to | |
| 752 get included (which is inevitable with macros), but about how the files are | |
| 753 separated. Your design might go like this: | |
| 754 | |
| 755 @enumerate | |
| 756 @item | |
| 757 you have generic macro interfaces, which specify a particular | |
| 758 behavior but not an implementation. these generic macros have | |
| 759 complementary versions for buffers and for strings (and the buffer | |
| 760 or string is an argument to all of the macros), and do such things | |
| 761 as convert between byte and char indices, retrieve the character at | |
| 762 a particular byte or char index, increment or decrement a byte | |
| 763 index to the beginning of the next or previous character, indicate | |
| 764 the number of bytes occupied by the character at a particular byte | |
| 765 or character index, etc. These are similar to what's already out | |
| 766 there except that they confound buffers and strings and that they | |
| 767 can also work with actual char *'s, which I think is a really bad | |
| 768 idea because it encourages code to "assume" that the representation | |
| 769 is ASCII compatible, which is might not be (e.g. 16-bit fixed | |
| 770 width). In fact, one thing I'm planning on doing is redefining | |
| 771 Bufbyte as a struct, for debugging purposes, to catch all places | |
| 772 that cavalierly compare them with ASCII char's. Note also that I | |
| 773 really want to rename Bufpos and Bytind, which are confusing and | |
| 774 wrong in that they also apply to strings. They should be Bytepos | |
| 775 and Charpos, or something like that, to go along with Bytecount and | |
| 776 Charcount. Similarly, Bufbyte is similarly a misnomer and should be | |
| 777 Intbyte -- a byte in the internal string representation (any of the | |
| 778 internal representations) of a string or buffer. Corresponding to | |
| 779 this is Extbyte (which we already have), a byte in any external | |
| 780 string representation. We also have Extcount, which makes sense, | |
| 781 and we might possibly want Extcharcount, the number of characters | |
| 782 in an external string representation; but that gets sticky in modal | |
| 783 encodings, and it's not clear how useful it would be. | |
| 784 | |
| 785 @item | |
| 786 for all generic macro interfaces, there are specific versions of | |
| 787 each of them for each possible representation (pure ASCII in the | |
| 788 non-Mule world, Mule standard, UTF-8, 8-bit fixed, 16-bit fixed, | |
| 789 32-bit fixed, etc.; there may well be more than one possible 16-bit | |
| 790 fixed version, as well). Each representation has a corresponding | |
| 791 prefix, e.g. MULE_ or FIXED16_ or whatever, which is prefixed onto | |
| 792 the generic macro names. The resulting macros perform the | |
| 793 operation defined for the macro, but assume, and only work | |
| 794 correctly with, text in the corresponding representation. | |
| 795 | |
| 796 @item | |
| 797 The definition of the generic versions merely conditionalizes on | |
| 798 the appropriate things (i.e. bit flags in the buffer or string | |
| 799 object) and calls the appropriate representation-specific version. | |
| 800 There may be more than one definition (protected by ifdefs, of | |
| 801 course), or one definition that amalgamated out of many ifdef'ed | |
| 802 sections. | |
| 803 | |
| 804 @item | |
| 805 You should probably put each different representation in its own | |
| 806 header file, e.g. charset-mule.h or charset-fixed16.h or | |
| 807 charset-ascii.h or whatever. Then put the main macros into | |
| 808 charset.h, and conditionalize in this file appropriately to include | |
| 809 the other ones. That way, code that actually needs to play around | |
| 810 with internal-format text at this level can include "charset.h" | |
| 811 (certainly a much better place than buffer.h), and everyone else | |
| 812 uses higher-level routines. The representation-specific macros | |
| 813 should not normally be used *directly* at all; they are invoked | |
| 814 automatically from the generic macros. However, code that needs to | |
| 815 be highly, highly optimized might choose to take a loop and write | |
| 816 two versions of it, one for each representation, to avoid the | |
| 817 per-loop-iteration cost of a comparison. Until the macro interface | |
| 818 is rock stable and solid, we should strongly discourage such | |
| 819 nanosecond optimizations. | |
| 820 @end enumerate | |
| 821 | |
| 822 ========================================================================== | |
| 823 - UTF-16 compatible representation | |
| 824 ========================================================================== | |
| 825 | |
| 826 NOTE: One possible default internal representation that was compatible | |
| 827 with UTF16 but allowed all possible chars in UCS4 would be to take a | |
| 828 more-or-less unused range of 2048 chars (not from the private area | |
| 829 because Microsoft actually uses up most or all of it with EUDC chars). | |
| 830 Let's say we picked A400 - ABFF. Then, we'd have: | |
| 831 | |
| 832 0000 - FFFF Simple chars | |
| 833 | |
| 834 D[8-B]xx D[C-F]xx Surrogate char, represents 1M chars | |
| 835 | |
| 836 A[4-B]xx D[C-F]xx D[C-F]xx Surrogate char, represents 2G chars | |
| 837 | |
| 838 This is exactly the same number of chars as UCS-4 handles, and it follows the | |
| 839 same property as UTF8 and Mule-internal: | |
| 840 | |
| 841 @enumerate | |
| 842 @item | |
| 843 There are two disjoint groupings of units, one representing leading units | |
| 844 and one representing non-leading units. | |
| 845 @item | |
| 846 Given a leading unit, you immediately know how many units follow to make | |
| 847 up a valid char, irrespective of any other context. | |
| 848 @end enumerate | |
| 849 | |
| 850 Note that A4xx is actually currently assigned to Yi. Since this is an | |
| 851 internal representation, we could just move these elsewhere. | |
| 852 | |
| 853 An alternative is to pick two disjoint ranges, e.g. 2D00 - 2DFF and | |
| 854 A500 - ABFF. | |
| 855 | |
| 856 ========================================================================== | |
| 857 New API for char->font mapping | |
| 858 ========================================================================== | |
| 859 - ; supersedes charset-registry and CCL; | |
| 860 supports all windows systems; powerful enough for Unicode; etc. | |
| 861 | |
| 862 (charset-font-mapping charset) | |
| 863 | |
| 864 font-mapping-specifier string | |
| 865 | |
| 866 char-font-mapping-table | |
| 867 | |
| 868 char-table, specifier; elements of char table are either strings (which | |
| 869 specify a registry or comparable font property, or vectors of a string | |
| 870 (same) followed by keyword-value pairs (optional). The only allowable | |
| 871 keyword currently is :ccl-program, which specifies a CCL program to map | |
| 872 the characters into font indices. Other keywords may be added | |
| 873 e.g. allowing Elisp fragments instead of CCL programs, also allowed is | |
| 874 [inherit], which inherits from the next less-specific char-table in the | |
| 875 specifier. | |
| 876 | |
| 877 The preferred interface onto this mapping (which should be portable | |
| 878 across Emacsen) is | |
| 879 | |
| 880 (set-char-font-mapping key value &optional locale tag-set how-to-add) | |
| 881 | |
| 882 where key is a char, range or charset (as for put-char-table), value is | |
| 883 as above, and the other arguments are standard for specifiers. This | |
| 884 automatically creates a char table in the locale, as necessary (all | |
| 885 elements default to [inherit]). On GNU Emacs, some specifiers arguments | |
| 886 may be unimplemented. | |
| 887 | |
| 888 (char-font-mapping key value &optional locale) | |
| 889 works vaguely like get-specifier? But does inheritance processing. | |
| 890 locale should clearly default here to current-buffer | |
| 891 | |
| 892 #### should get-specifier as well? Would make it work most like | |
| 893 #### buffer-local variables. | |
| 894 | |
| 895 NB. set-charset-registry and set-charset-ccl-program are obsoleted. | |
| 896 | |
| 897 ========================================================================== | |
| 898 Implementing fixed-width 8,16,32 bit buffer optimizations | |
| 899 ========================================================================== | |
| 900 | |
| 901 Add set-buffer-optimization (buffer &rest keywords) for | |
| 902 controlling these things. | |
| 903 | |
| 904 Also, put in hack so that correct arglist can be retrieved by | |
| 905 Lisp code. | |
| 906 | |
| 907 Look at the way keyword primitives are currently handled; make | |
| 908 sure it works and is documented, etc. | |
| 909 | |
| 910 Implement 8-bit fixed width optimization. Take the things that | |
| 911 know about the actual implementation and put them in a single | |
| 912 file, in essence creating an abstraction layer to allow | |
| 913 pluggable internal representations. Implement a fairly general | |
| 914 scheme for mapping between character codes in the 8 bits or 16 | |
| 915 bits representation and on actual charset characters. As part of | |
| 916 set-buffer-optimization, you can specify a list of character sets | |
| 917 to be used in the 8 bit to 16 bit, etc. world. You can also | |
| 918 request that the buffer be in 8, 16, etc. if possible. | |
| 919 | |
| 920 -> set defaults wrt this. | |
| 921 -> perhaps this should be just buffer properties. | |
| 922 -> this brings up the idea of default properties on an object. | |
| 923 -> Implement default-put, default-get, etc. | |
| 924 | |
| 925 What happens when a character not assigned in the range gets | |
| 926 added? Then, must convert to variable width of some sort. | |
| 927 | |
| 928 Note: at first, possibly we just convert whole hog to get things | |
| 929 right. Then we'd have to poy alternative to characters that got | |
| 930 added + deleted that were unassigned in the fixed width. When | |
| 931 this goes to zero and there's been enough time (heuristics), we | |
| 932 go back to fixed. | |
| 933 | |
| 934 Side note: We could dynamically build up the set of assigned | |
| 935 chars as they go. Conceivably this could even go down to the | |
| 936 single char level: Just keep a big array of mapping from 16 bit | |
| 937 values to chars, and add empty time, a char has been encountered | |
| 938 that wasn't there before. Problem need inverse mapping. | |
| 939 | |
| 940 -> Possibility; chars are actual objects, not just numbers. | |
| 941 Then you could keep track of such info in the chars itself. | |
| 942 *Think about this.* | |
| 943 | |
| 944 Eventually, we might consider allowing mixed fixed-width, | |
| 945 variable-width buffer encodings. Then, we use range tables to | |
| 946 indicate which sections are fixed and which variable and INC_CHAR does | |
| 947 something like this: binary search to find the current range, which | |
| 948 indicates whether it's fixed or variable, and tells us what the | |
| 949 increment is. We can cache this info and use it next time to speed | |
| 950 up. | |
| 951 | |
| 952 -> We will then have two partially shared range tables - one for | |
| 953 overall fixed width vs. variable width, and possibly one containing | |
| 954 this same info, but partitioning the variable width in one. Maybe | |
| 955 need fancier nested range table model. | |
| 956 | |
| 957 ========================================================================== | |
| 958 Expansion of display table and case mapping table support for all | |
| 959 chars, not just ASCII/Latin1. | |
| 960 ========================================================================== | |
| 961 | |
| 962 ========================================================================== | |
| 963 Improved flexibility for display tables, and evaluation of its | |
| 964 features to make sure it meshes with and complements the char<->font | |
| 965 mapping API mentioned earlier | |
| 966 ========================================================================== | |
| 967 | |
| 968 ========================================================================== | |
| 969 String access speedup: | |
| 970 ========================================================================== | |
| 971 | |
| 972 For strings larger than some size in bytes (10?), keep extra fields of | |
| 973 info: length in chars, and a (char, byte) pair in the middle to speed | |
| 974 up sequential access. | |
| 975 | |
| 976 (Better idea: do this for any size string, but only if it contains | |
| 977 non-ASCII chars. Then if info is missing, we know string is | |
| 978 ASCII-only.) | |
| 979 | |
| 980 Use a string-extra-info object, replacing string property slot and | |
| 981 containing fields for string mod tick, string extents, string props, | |
| 982 and string char length, and cached (char,byte) pair. | |
| 983 string-extra-info (or string-auxiliary?) objects could be in frob | |
| 984 blocks, esp. if creating frob blocks is easy + worth it. | |
| 985 | |
| 986 - Caching of char<->byte conversions in strings - should make nearly | |
| 987 all operations on strings O(N) | |
| 988 | |
| 989 ========================================================================== | |
| 990 Improvements in buffer char<->byte mapping | |
| 991 ========================================================================== | |
| 992 | |
| 993 - Range table implementation - especially when there are few runs of | |
| 994 different widths, e.g. recently converted from fixed-width | |
| 995 optimization to variable width | |
| 996 | |
| 997 Range Tables to speed up Bufpos <-> Bytind caching | |
| 998 ================================================== | |
| 999 | |
| 1000 This describes an alternative implementation using ranges. We | |
| 1001 maintain a range table of all spans of characters of a fixed width. | |
| 1002 Updating this table could take time if there are a large number of | |
| 1003 spans; but constant factors of operations should be quick. This method really wins | |
| 1004 when you have 8-bit buffers just converted to variable width, where | |
| 1005 there will be few spans. More specifically, lookup in this range | |
| 1006 table is O(log N) and can be done with simple binary search, which is | |
| 1007 very fast. If we maintain the ranges using a gap array, updating this | |
| 1008 table will be fast for local operations, which is most of the time. | |
| 1009 | |
| 1010 We will also provide (at first, at least) a Lisp function to set the | |
| 1011 caching mechanism explicitly - either range tables or the existing | |
| 1012 implementation. Eventually, we want to improve things, to the point | |
| 1013 where we automatically pick the right caching for the situation and | |
| 1014 have more caching schemes implemented. | |
| 1015 | |
| 1016 ========================================================================== | |
| 1017 - Robustify Text Properties | |
| 1018 ========================================================================== | |
| 1019 | |
| 1020 ========================================================================== | |
| 1021 Support for unified internal representation, e.g. Unicode | |
| 1022 ========================================================================== | |
| 1023 | |
| 1024 Start tagging all text with a language text property, | |
| 1025 indicating the current language environment when the text was input. | |
| 1026 (needs "Robustify Text Properties") | |
| 1027 | |
| 1028 ========================================================================== | |
| 1029 - Generalized Coding Systems | |
| 1030 ========================================================================== | |
| 1031 | |
| 1032 - Lisp API for Defining Coding Systems | |
| 1033 | |
| 1034 User-defined coding systems. | |
| 1035 | |
| 1036 (define-coding-system-type 'type | |
| 1037 :encode-function fun | |
| 1038 :decode-function fun | |
| 1039 :detect-function fun | |
| 1040 :buffering (number = at least this many chars | |
| 1041 line = buffer up to end of line | |
| 1042 regexp = buffer until this regexp is found in match | |
| 1043 source data. match data will be appropriate when fun is | |
| 1044 called | |
| 1045 | |
| 1046 encode fun is called as | |
| 1047 | |
| 1048 (encode instream outstream) | |
| 1049 | |
| 1050 should read data from instream and write converted result onto | |
| 1051 outstream. Can leave some data stuff in stream, it will reappear | |
| 1052 next time. Generally, there is a finite amount of data in instream | |
| 1053 and further attempts to read lead to would-block errors or retvals. | |
| 1054 Can use instream properties to record state. May use read-stream | |
| 1055 functionality to read everything into a vector or string. | |
| 1056 | |
| 1057 ->Need vectors + string exposed to resizing of Lisp implementation | |
| 1058 where necessary. | |
| 1059 | |
| 1060 ========================================================================== | |
| 1061 Support Windows Active Kbd Switching, Far East IME API (done already?) | |
| 1062 ========================================================================== | |
| 1063 | |
| 1064 ========================================================================== | |
| 1065 - UI/design changes for Coding System Pipelining | |
| 1066 ========================================================================== | |
| 1067 | |
| 1068 ------------------------------------------------------------------ | |
| 1069 CODING-SYSTEM CHAINS | |
| 1070 ------------------------------------------------------------------ | |
| 1071 | |
| 1072 sjt sez: | |
| 1073 | |
| 1074 There should be no elementary coding systems in the Lisp API, only | |
| 1075 chains. Chains should be declared, not computed, as a sequence of coding | |
| 1076 formats. (Probably the internal representation can be a vector for | |
| 1077 efficiency but programmers would probably rather work with lists.) A | |
| 1078 stream has a token type. Most streams are octet streams. Text is a | |
| 1079 stream of characters (in _internal_ format; a file on disk is not text!) | |
| 1080 An octet-stream has no implicit semantics, so its format must always be | |
| 1081 specified. The only type currently having semantics is characters. This | |
| 1082 means that the chain [euc-jp -> internal -> shift_jis) may be specified | |
| 1083 (euc-jp, shift_jis), and if no euc-jp -> shift_jis converter is | |
| 1084 available, then the chain is automatically constructed. (N.B. I f we | |
| 1085 have fixed width buffers in the future, then we could have ASCII -> 8-bit | |
| 1086 char -> 16-bit char -> ISO-2022-JP (with escape sequences). | |
| 1087 | |
| 1088 EOL handling is a char <-> char coding. It should not be part of another | |
| 1089 coding system except as a convenience for users. For text coding, | |
| 1090 automatically insert EOL handlers between char <-> octet boundaries. | |
| 1091 | |
| 1092 ------------------------------------------------------------------ | |
| 1093 ABOUT DETECTION | |
| 1094 ------------------------------------------------------------------ | |
| 1095 | |
| 1096 | |
| 1097 ------------------------------------------------------------------ | |
| 1098 EFFICIENCY OF CODING CONVERSION WITH MULTIPLE COPIES/CHAINS | |
| 1099 ------------------------------------------------------------------ | |
| 1100 | |
| 1101 A comment in encode_decode_coding_region(): | |
| 1102 | |
| 1103 The chain of streams looks like this: | |
| 1104 | |
| 1105 [BUFFER] <----- (( read from/send to loop )) | |
| 1106 ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is | |
| 1107 in bytes] | |
| 1108 ------> [ENCODE/DECODE AS SPECIFIED] | |
| 1109 ------> [BYTE->CHAR i.e. DECODE AS BINARY | |
| 1110 if sink is in bytes] | |
| 1111 ------> [AUTODETECT EOL if | |
| 1112 we're decoding and | |
| 1113 coding system calls | |
| 1114 for this] | |
| 1115 ------> [BUFFER] | |
| 1116 | |
| 1117 sjt (?) responds: | |
| 1118 | |
| 1119 Of course, this is just horrible. BYTE<->CHAR should only be available | |
| 1120 to I/O routines. It should not be visible to Mule proper. | |
| 1121 | |
| 1122 A comment on the implementation. Hrvoje and Kyle worry about the | |
| 1123 inefficiency of repeated copying among buffers that chained coding | |
| 1124 systems entail. But this may not be as time inefficient as it appears | |
| 1125 in the Mule ("house rules") context. The issue is how do you do chain | |
| 1126 coding systems without copying? In theory you could have | |
| 1127 | |
| 1128 IChar external_to_raw (ExtChar *cp, State *s); | |
| 1129 IChar decode_utf16 (IChar c, State *s); | |
| 1130 IChar decode_crlf (ExtChar *cp, State *s); | |
| 1131 | |
| 1132 typedef Ichar (*Converter[]) (Ichar, State*); | |
| 1133 | |
| 1134 Converter utf16[2] = { &decode_utf16, &decode_crlf }; | |
| 1135 | |
| 1136 void convert (ExtChar *inbuf, IChar *outbuf, Converter cvtr) | |
| 1137 { | |
| 1138 int i; | |
| 1139 ExtChar c; | |
| 1140 State s; | |
| 1141 | |
| 1142 while (c = external_to_raw (*inbuf++, &s)) | |
| 1143 { | |
| 1144 for (i = 0; i < sizeof(cvtr)/sizeof(Converter); ++i) | |
| 1145 if (s.ready) | |
| 1146 c = (*cvtr[i]) (c, &s); | |
| 1147 } | |
| 1148 if (s.ready) | |
| 1149 *outbuf++ = c; | |
| 1150 } | |
| 1151 | |
| 1152 But this is a lot of function calls; what Ben is doing is basically | |
| 1153 reducing this to one call per buffer-full. The only way to avoid this | |
| 1154 is to hardcode all the "interesting" coding systems, maybe using | |
| 1155 inline or macros to give structure. But this is still a huge amount | |
| 1156 of work, and code. | |
| 1157 | |
| 1158 One advantage to the call-per-char approach is that we might be able | |
| 1159 to do something about the marker/extent destruction that coding | |
| 1160 normally entails. | |
| 1161 | |
| 1162 ben sez: | |
| 1163 | |
| 1164 it should be possible to preserve the markers/extents without | |
| 1165 switching completely to one-call-per-char -- we could at least do one | |
| 1166 call per "run", where a run is more or less the maximal stretch of | |
| 1167 text not overlapping any markers or extent boundaries. (It's a bit | |
| 1168 more complicated if we want to properly support the different extent | |
| 1169 begins/ends; in some cases we might have to pump a single character | |
| 1170 adjacent to where two extents meet.) The "stateless" way that I wrote | |
| 1171 all of the conversion routines may be a real hassle but it allows | |
| 1172 something like this to work without too much problem -- pump in one | |
| 1173 run at a time into one end of the chain, do a flush after each | |
| 1174 iteration, and stick what comes out the other end in its place. | |
| 1175 | |
| 1176 ------------------------------------------------------------------ | |
| 1177 ABOUT FORMATS | |
| 1178 ------------------------------------------------------------------ | |
| 1179 | |
| 1180 when calling make-coding-system, the name can be a cons of (format1 . | |
| 1181 format2), specifying that it decodes format1->format2 and encodes the other | |
| 1182 way. if only one name is given, that is assumed to be format1, and the | |
| 1183 other is either `external' or `internal' depending on the end type. | |
| 1184 normally the user when decoding gives the decoding order in formats, but | |
| 1185 can leave off the last one, `internal', which is assumed. a multichain | |
| 1186 might look like gzip|multibyte|unicode, using the coding systems named | |
| 1187 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works | |
| 1188 is by searching for gzip->multibyte; if not found, look for gzip->external | |
| 1189 or gzip->internal. (In general we automatically do conversion between | |
| 1190 internal and external as necessary: thus gzip|crlf does the expected, and | |
| 1191 maps to gzip->external, external->internal, crlf->internal, which when | |
| 1192 fully specified would be gzip|external:external|internal:crlf|internal -- | |
| 1193 see below.) To forcibly fit together two converters that have explicitly | |
| 1194 specified and incompatible names (say you have unicode->multibyte and | |
| 1195 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this | |
| 1196 case are compatible), you can force-cast using :, like this: | |
| 1197 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between | |
| 1198 internal and external formats, the conversion happens automatically.) | |
| 1199 | |
| 1200 -------------------------------------------------------------------------- | |
| 1201 ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS | |
| 1202 -------------------------------------------------------------------------- | |
| 1203 | |
| 1204 -- there's the problem that XEmacs can't be run in a directory with | |
| 1205 non-ASCII/Latin-1 chars in it, since it will be doing Unicode | |
| 1206 processing before we've had a chance to load the tables. In fact, | |
| 1207 even finding the tables in such a situation is problematic using | |
| 1208 the normal commands. my idea is to eventually load the stuff | |
| 1209 extremely extremely early, at the same time as the pdump data gets | |
| 1210 loaded. in fact, the unicode table data (stored in an efficient | |
| 1211 binary format) can even be stuck into the pdump file (which would | |
| 1212 mean as a resource to the executable, for windows). we'd need to | |
| 1213 extend pdump a bit: to allow for attaching extra data to the pdump | |
| 1214 file. (something like pdump_attach_extra_data (addr, length) | |
| 1215 returns a number of some sort, an index into the file, which you | |
| 1216 can then retrieve with pdump_load_extra_data(), which returns an | |
| 1217 addr (mmap()ed or loaded), and later you pdump_unload_extra_data() | |
| 1218 when finished. we'd probably also need | |
| 1219 pdump_attach_extra_data_append(), which appends data to the data | |
| 1220 just written out with pdump_attach_extra_data(). this way, | |
| 1221 multiple tables in memory can be written out into one contiguous | |
| 1222 table. (we'd use the tar-like trick of allowing new blocks to be | |
| 1223 written without going back to change the old blocks -- we just rely | |
| 1224 on the end of file/end of memory.) this same mechanism could be | |
| 1225 extracted out of pdump and used to handle the non-pdump situation | |
| 1226 (or alternatively, we could just dump either the memory image of | |
| 1227 the tables themselves or the compressed binary version). in the | |
| 1228 case of extra unicode tables not known about at compile time that | |
| 1229 get loaded before dumping, we either just dump them into the image | |
| 1230 (pdump and all) or extract them into the compressed binary format, | |
| 1231 free the original tables, and treat them like all other tables. | |
| 1232 | |
| 1233 | |
| 1234 ========================================================================== | |
| 1235 - Generalized language appropriate word wrapping (requires | |
| 1236 layout-exposing API defined in BIDI section) | |
| 1237 ========================================================================== | |
| 1238 | |
| 1239 ========================================================================== | |
| 1240 - Make Custom Mule-aware | |
| 1241 ========================================================================== | |
| 1242 | |
| 1243 ========================================================================== | |
| 1244 - Composite character support | |
| 1245 ========================================================================== | |
| 1246 | |
| 1247 ========================================================================== | |
| 1248 - Language appropriate sorting and searching | |
| 1249 ========================================================================== | |
| 1250 | |
| 1251 ========================================================================== | |
| 1252 - Glyph shaping for Arabic and Devanagari | |
| 1253 ========================================================================== | |
| 1254 | |
| 1255 - (needs to be handled mostly | |
| 1256 at C level, as part of layout; luckily it's entirely local in its | |
| 1257 changes, as this is not hard) | |
| 1258 | |
| 1259 | |
| 1260 ========================================================================== | |
| 1261 Consider moving language selection Menu up to be parallel with Mule menu | |
| 1262 ========================================================================== | |
| 1263 | |
| 1264 */ | |
| 1265 | |
| 1266 | |
| 771 | 1267 |
| 1268 /************************************************************************/ | |
| 1269 /* declarations */ | |
| 1270 /************************************************************************/ | |
| 1271 | |
| 1272 Eistring the_eistring_zero_init, the_eistring_malloc_zero_init; | |
| 1273 | |
| 1274 #define MAX_CHARBPOS_GAP_SIZE_3 (65535/3) | |
| 1275 #define MAX_BYTEBPOS_GAP_SIZE_3 (3 * MAX_CHARBPOS_GAP_SIZE_3) | |
| 1276 | |
| 1277 short three_to_one_table[1 + MAX_BYTEBPOS_GAP_SIZE_3]; | |
| 1278 | |
| 1279 #ifdef MULE | |
| 1280 | |
| 1281 /* Table of number of bytes in the string representation of a character | |
| 1282 indexed by the first byte of that representation. | |
| 1283 | |
| 1284 rep_bytes_by_first_byte(c) is more efficient than the equivalent | |
| 1285 canonical computation: | |
| 1286 | |
| 826 | 1287 XCHARSET_REP_BYTES (charset_by_leading_byte (c)) */ |
| 771 | 1288 |
| 1289 const Bytecount rep_bytes_by_first_byte[0xA0] = | |
| 1290 { /* 0x00 - 0x7f are for straight ASCII */ | |
| 1291 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1292 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1293 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1294 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1295 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1296 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1297 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1298 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, | |
| 1299 /* 0x80 - 0x8f are for Dimension-1 official charsets */ | |
| 1300 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, | |
| 1301 /* 0x90 - 0x9d are for Dimension-2 official charsets */ | |
| 1302 /* 0x9e is for Dimension-1 private charsets */ | |
| 1303 /* 0x9f is for Dimension-2 private charsets */ | |
| 1304 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4 | |
| 1305 }; | |
| 1306 | |
| 1307 #ifdef ENABLE_COMPOSITE_CHARS | |
| 1308 | |
| 1309 /* Hash tables for composite chars. One maps string representing | |
| 1310 composed chars to their equivalent chars; one goes the | |
| 1311 other way. */ | |
| 1312 Lisp_Object Vcomposite_char_char2string_hash_table; | |
| 1313 Lisp_Object Vcomposite_char_string2char_hash_table; | |
| 1314 | |
| 1315 static int composite_char_row_next; | |
| 1316 static int composite_char_col_next; | |
| 1317 | |
| 1318 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 1319 | |
| 1320 #endif /* MULE */ | |
| 1321 | |
| 1292 | 1322 Lisp_Object QSin_char_byte_conversion; |
| 1323 Lisp_Object QSin_internal_external_conversion; | |
| 1324 | |
|
5863
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
1325 Fixnum Vchar_code_limit; |
| 771 | 1326 |
| 1327 /************************************************************************/ | |
| 1328 /* qxestr***() functions */ | |
| 1329 /************************************************************************/ | |
| 1330 | |
| 1331 /* Most are inline functions in lisp.h */ | |
| 1332 | |
| 1333 int | |
| 867 | 1334 qxesprintf (Ibyte *buffer, const CIbyte *format, ...) |
| 771 | 1335 { |
| 1336 va_list args; | |
| 1337 int retval; | |
| 1338 | |
| 1339 va_start (args, format); | |
| 2367 | 1340 retval = vsprintf ((Chbyte *) buffer, format, args); |
| 771 | 1341 va_end (args); |
| 1342 | |
| 1343 return retval; | |
| 1344 } | |
| 1345 | |
| 1346 /* strcasecmp() implementation from BSD */ | |
| 867 | 1347 static Ibyte strcasecmp_charmap[] = { |
| 1429 | 1348 0000, 0001, 0002, 0003, 0004, 0005, 0006, 0007, |
| 1349 0010, 0011, 0012, 0013, 0014, 0015, 0016, 0017, | |
| 1350 0020, 0021, 0022, 0023, 0024, 0025, 0026, 0027, | |
| 1351 0030, 0031, 0032, 0033, 0034, 0035, 0036, 0037, | |
| 1352 0040, 0041, 0042, 0043, 0044, 0045, 0046, 0047, | |
| 1353 0050, 0051, 0052, 0053, 0054, 0055, 0056, 0057, | |
| 1354 0060, 0061, 0062, 0063, 0064, 0065, 0066, 0067, | |
| 1355 0070, 0071, 0072, 0073, 0074, 0075, 0076, 0077, | |
| 1356 0100, 0141, 0142, 0143, 0144, 0145, 0146, 0147, | |
| 1357 0150, 0151, 0152, 0153, 0154, 0155, 0156, 0157, | |
| 1358 0160, 0161, 0162, 0163, 0164, 0165, 0166, 0167, | |
| 1359 0170, 0171, 0172, 0133, 0134, 0135, 0136, 0137, | |
| 1360 0140, 0141, 0142, 0143, 0144, 0145, 0146, 0147, | |
| 1361 0150, 0151, 0152, 0153, 0154, 0155, 0156, 0157, | |
| 1362 0160, 0161, 0162, 0163, 0164, 0165, 0166, 0167, | |
| 1363 0170, 0171, 0172, 0173, 0174, 0175, 0176, 0177, | |
| 1364 0200, 0201, 0202, 0203, 0204, 0205, 0206, 0207, | |
| 1365 0210, 0211, 0212, 0213, 0214, 0215, 0216, 0217, | |
| 1366 0220, 0221, 0222, 0223, 0224, 0225, 0226, 0227, | |
| 1367 0230, 0231, 0232, 0233, 0234, 0235, 0236, 0237, | |
| 1368 0240, 0241, 0242, 0243, 0244, 0245, 0246, 0247, | |
| 1369 0250, 0251, 0252, 0253, 0254, 0255, 0256, 0257, | |
| 1370 0260, 0261, 0262, 0263, 0264, 0265, 0266, 0267, | |
| 1371 0270, 0271, 0272, 0273, 0274, 0275, 0276, 0277, | |
| 1372 0300, 0301, 0302, 0303, 0304, 0305, 0306, 0307, | |
| 1373 0310, 0311, 0312, 0313, 0314, 0315, 0316, 0317, | |
| 1374 0320, 0321, 0322, 0323, 0324, 0325, 0326, 0327, | |
| 1375 0330, 0331, 0332, 0333, 0334, 0335, 0336, 0337, | |
| 1376 0340, 0341, 0342, 0343, 0344, 0345, 0346, 0347, | |
| 1377 0350, 0351, 0352, 0353, 0354, 0355, 0356, 0357, | |
| 1378 0360, 0361, 0362, 0363, 0364, 0365, 0366, 0367, | |
| 1379 0370, 0371, 0372, 0373, 0374, 0375, 0376, 0377 | |
| 771 | 1380 }; |
| 1381 | |
| 1382 /* A version that works like generic strcasecmp() -- only collapsing | |
| 1383 case in ASCII A-Z/a-z. This is safe on Mule strings due to the | |
| 1384 current representation. | |
| 1385 | |
| 1386 This version was written by some Berkeley coder, favoring | |
| 1387 nanosecond improvements over clarity. In all other versions below, | |
| 1388 we use symmetrical algorithms that may sacrifice a few machine | |
| 1389 cycles but are MUCH MUCH clearer, which counts a lot more. | |
| 1390 */ | |
| 1391 | |
| 1392 int | |
| 867 | 1393 qxestrcasecmp (const Ibyte *s1, const Ibyte *s2) |
| 771 | 1394 { |
| 867 | 1395 Ibyte *cm = strcasecmp_charmap; |
| 771 | 1396 |
| 1397 while (cm[*s1] == cm[*s2++]) | |
| 1398 if (*s1++ == '\0') | |
| 1399 return (0); | |
| 1400 | |
| 1401 return (cm[*s1] - cm[*--s2]); | |
| 1402 } | |
| 1403 | |
| 1404 int | |
| 2367 | 1405 ascii_strcasecmp (const Ascbyte *s1, const Ascbyte *s2) |
| 771 | 1406 { |
| 867 | 1407 return qxestrcasecmp ((const Ibyte *) s1, (const Ibyte *) s2); |
| 771 | 1408 } |
| 1409 | |
| 1410 int | |
| 2367 | 1411 qxestrcasecmp_ascii (const Ibyte *s1, const Ascbyte *s2) |
| 771 | 1412 { |
| 867 | 1413 return qxestrcasecmp (s1, (const Ibyte *) s2); |
| 771 | 1414 } |
| 1415 | |
| 1416 /* An internationalized version that collapses case in a general fashion. | |
| 1417 */ | |
| 1418 | |
| 1419 int | |
| 867 | 1420 qxestrcasecmp_i18n (const Ibyte *s1, const Ibyte *s2) |
| 771 | 1421 { |
| 1422 while (*s1 && *s2) | |
| 1423 { | |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1424 if (CANONCASE (0, itext_ichar (s1)) != |
|
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1425 CANONCASE (0, itext_ichar (s2))) |
| 771 | 1426 break; |
| 867 | 1427 INC_IBYTEPTR (s1); |
| 1428 INC_IBYTEPTR (s2); | |
| 771 | 1429 } |
| 1430 | |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1431 return (CANONCASE (0, itext_ichar (s1)) - |
|
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1432 CANONCASE (0, itext_ichar (s2))); |
| 771 | 1433 } |
| 1434 | |
| 1435 /* The only difference between these next two and | |
| 1436 qxememcasecmp()/qxememcasecmp_i18n() is that these two will stop if | |
| 1437 both strings are equal and less than LEN in length, while | |
| 1438 the mem...() versions would would run off the end. */ | |
| 1439 | |
| 1440 int | |
| 867 | 1441 qxestrncasecmp (const Ibyte *s1, const Ibyte *s2, Bytecount len) |
| 771 | 1442 { |
| 867 | 1443 Ibyte *cm = strcasecmp_charmap; |
| 771 | 1444 |
| 1445 while (len--) | |
| 1446 { | |
| 1447 int diff = cm[*s1] - cm[*s2]; | |
| 1448 if (diff != 0) | |
| 1449 return diff; | |
| 1450 if (!*s1) | |
| 1451 return 0; | |
| 1452 s1++, s2++; | |
| 1453 } | |
| 1454 | |
| 1455 return 0; | |
| 1456 } | |
| 1457 | |
| 1458 int | |
| 2367 | 1459 ascii_strncasecmp (const Ascbyte *s1, const Ascbyte *s2, Bytecount len) |
| 771 | 1460 { |
| 867 | 1461 return qxestrncasecmp ((const Ibyte *) s1, (const Ibyte *) s2, len); |
| 771 | 1462 } |
| 1463 | |
| 1464 int | |
| 2367 | 1465 qxestrncasecmp_ascii (const Ibyte *s1, const Ascbyte *s2, Bytecount len) |
| 771 | 1466 { |
| 867 | 1467 return qxestrncasecmp (s1, (const Ibyte *) s2, len); |
| 771 | 1468 } |
| 1469 | |
| 801 | 1470 /* Compare LEN_FROM_S1 worth of characters from S1 with the same number of |
| 1471 characters from S2, case insensitive. NOTE: Downcasing can convert | |
| 1472 characters from one length in bytes to another, so reversing S1 and S2 | |
| 1473 is *NOT* a symmetric operations! You must choose a length that agrees | |
| 1474 with S1. */ | |
| 1475 | |
| 771 | 1476 int |
| 867 | 1477 qxestrncasecmp_i18n (const Ibyte *s1, const Ibyte *s2, |
| 801 | 1478 Bytecount len_from_s1) |
| 771 | 1479 { |
| 801 | 1480 while (len_from_s1 > 0) |
| 771 | 1481 { |
| 867 | 1482 const Ibyte *old_s1 = s1; |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1483 int diff = (CANONCASE (0, itext_ichar (s1)) - |
|
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1484 CANONCASE (0, itext_ichar (s2))); |
| 771 | 1485 if (diff != 0) |
| 1486 return diff; | |
| 1487 if (!*s1) | |
| 1488 return 0; | |
| 867 | 1489 INC_IBYTEPTR (s1); |
| 1490 INC_IBYTEPTR (s2); | |
| 801 | 1491 len_from_s1 -= s1 - old_s1; |
| 771 | 1492 } |
| 1493 | |
| 1494 return 0; | |
| 1495 } | |
| 1496 | |
| 1497 int | |
| 867 | 1498 qxememcmp (const Ibyte *s1, const Ibyte *s2, Bytecount len) |
| 771 | 1499 { |
| 1500 return memcmp (s1, s2, len); | |
| 1501 } | |
| 1502 | |
| 1503 int | |
| 867 | 1504 qxememcmp4 (const Ibyte *s1, Bytecount len1, |
| 1505 const Ibyte *s2, Bytecount len2) | |
| 801 | 1506 { |
| 1507 int retval = qxememcmp (s1, s2, min (len1, len2)); | |
| 1508 if (retval) | |
| 1509 return retval; | |
| 1510 return len1 - len2; | |
| 1511 } | |
| 1512 | |
| 1513 int | |
| 867 | 1514 qxememcasecmp (const Ibyte *s1, const Ibyte *s2, Bytecount len) |
| 771 | 1515 { |
| 867 | 1516 Ibyte *cm = strcasecmp_charmap; |
| 771 | 1517 |
| 1518 while (len--) | |
| 1519 { | |
| 1520 int diff = cm[*s1] - cm[*s2]; | |
| 1521 if (diff != 0) | |
| 1522 return diff; | |
| 1523 s1++, s2++; | |
| 1524 } | |
| 1525 | |
| 1526 return 0; | |
| 1527 } | |
| 1528 | |
| 1529 int | |
| 867 | 1530 qxememcasecmp4 (const Ibyte *s1, Bytecount len1, |
| 1531 const Ibyte *s2, Bytecount len2) | |
| 771 | 1532 { |
| 801 | 1533 int retval = qxememcasecmp (s1, s2, min (len1, len2)); |
| 1534 if (retval) | |
| 1535 return retval; | |
| 1536 return len1 - len2; | |
| 1537 } | |
| 1538 | |
| 1539 /* Do a character-by-character comparison, returning "which is greater" by | |
| 867 | 1540 comparing the Ichar values. (#### Should have option to compare Unicode |
| 801 | 1541 points) */ |
| 1542 | |
| 1543 int | |
| 867 | 1544 qxetextcmp (const Ibyte *s1, Bytecount len1, |
| 1545 const Ibyte *s2, Bytecount len2) | |
| 801 | 1546 { |
| 1547 while (len1 > 0 && len2 > 0) | |
| 771 | 1548 { |
| 867 | 1549 const Ibyte *old_s1 = s1; |
| 1550 const Ibyte *old_s2 = s2; | |
| 1551 int diff = itext_ichar (s1) - itext_ichar (s2); | |
| 801 | 1552 if (diff != 0) |
| 1553 return diff; | |
| 867 | 1554 INC_IBYTEPTR (s1); |
| 1555 INC_IBYTEPTR (s2); | |
| 801 | 1556 len1 -= s1 - old_s1; |
| 1557 len2 -= s2 - old_s2; | |
| 1558 } | |
| 1559 | |
| 1560 assert (len1 >= 0 && len2 >= 0); | |
| 1561 return len1 - len2; | |
| 1562 } | |
| 1563 | |
| 1564 int | |
| 867 | 1565 qxetextcmp_matching (const Ibyte *s1, Bytecount len1, |
| 1566 const Ibyte *s2, Bytecount len2, | |
| 801 | 1567 Charcount *matching) |
| 1568 { | |
| 1569 *matching = 0; | |
| 1570 while (len1 > 0 && len2 > 0) | |
| 1571 { | |
| 867 | 1572 const Ibyte *old_s1 = s1; |
| 1573 const Ibyte *old_s2 = s2; | |
| 1574 int diff = itext_ichar (s1) - itext_ichar (s2); | |
| 801 | 1575 if (diff != 0) |
| 1576 return diff; | |
| 867 | 1577 INC_IBYTEPTR (s1); |
| 1578 INC_IBYTEPTR (s2); | |
| 801 | 1579 len1 -= s1 - old_s1; |
| 1580 len2 -= s2 - old_s2; | |
| 1581 (*matching)++; | |
| 1582 } | |
| 1583 | |
| 1584 assert (len1 >= 0 && len2 >= 0); | |
| 1585 return len1 - len2; | |
| 1586 } | |
| 1587 | |
| 1588 /* Do a character-by-character comparison, returning "which is greater" by | |
| 867 | 1589 comparing the Ichar values, case insensitively (by downcasing both |
| 801 | 1590 first). (#### Should have option to compare Unicode points) |
| 1591 | |
| 1592 In this case, both lengths must be specified becaused downcasing can | |
| 1593 convert characters from one length in bytes to another; therefore, two | |
| 1594 blocks of text of different length might be equal. If both compare | |
| 1595 equal up to the limit in length of one but not the other, the longer one | |
| 1596 is "greater". */ | |
| 1597 | |
| 1598 int | |
| 867 | 1599 qxetextcasecmp (const Ibyte *s1, Bytecount len1, |
| 1600 const Ibyte *s2, Bytecount len2) | |
| 801 | 1601 { |
| 1602 while (len1 > 0 && len2 > 0) | |
| 1603 { | |
| 867 | 1604 const Ibyte *old_s1 = s1; |
| 1605 const Ibyte *old_s2 = s2; | |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1606 int diff = (CANONCASE (0, itext_ichar (s1)) - |
|
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1607 CANONCASE (0, itext_ichar (s2))); |
| 771 | 1608 if (diff != 0) |
| 1609 return diff; | |
| 867 | 1610 INC_IBYTEPTR (s1); |
| 1611 INC_IBYTEPTR (s2); | |
| 801 | 1612 len1 -= s1 - old_s1; |
| 1613 len2 -= s2 - old_s2; | |
| 771 | 1614 } |
| 1615 | |
| 801 | 1616 assert (len1 >= 0 && len2 >= 0); |
| 1617 return len1 - len2; | |
| 1618 } | |
| 1619 | |
| 1620 /* Like qxetextcasecmp() but also return number of characters at | |
| 1621 beginning that match. */ | |
| 1622 | |
| 1623 int | |
| 867 | 1624 qxetextcasecmp_matching (const Ibyte *s1, Bytecount len1, |
| 1625 const Ibyte *s2, Bytecount len2, | |
| 801 | 1626 Charcount *matching) |
| 1627 { | |
| 1628 *matching = 0; | |
| 1629 while (len1 > 0 && len2 > 0) | |
| 1630 { | |
| 867 | 1631 const Ibyte *old_s1 = s1; |
| 1632 const Ibyte *old_s2 = s2; | |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1633 int diff = (CANONCASE (0, itext_ichar (s1)) - |
|
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1634 CANONCASE (0, itext_ichar (s2))); |
| 801 | 1635 if (diff != 0) |
| 1636 return diff; | |
| 867 | 1637 INC_IBYTEPTR (s1); |
| 1638 INC_IBYTEPTR (s2); | |
| 801 | 1639 len1 -= s1 - old_s1; |
| 1640 len2 -= s2 - old_s2; | |
| 1641 (*matching)++; | |
| 1642 } | |
| 1643 | |
| 1644 assert (len1 >= 0 && len2 >= 0); | |
| 1645 return len1 - len2; | |
| 771 | 1646 } |
| 1647 | |
| 1648 int | |
|
4906
6ef8256a020a
implement equalp in C, fix case-folding, add equal() method for keymaps
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
1649 lisp_strcasecmp_ascii (Lisp_Object s1, Lisp_Object s2) |
| 771 | 1650 { |
| 867 | 1651 Ibyte *cm = strcasecmp_charmap; |
| 1652 Ibyte *p1 = XSTRING_DATA (s1); | |
| 1653 Ibyte *p2 = XSTRING_DATA (s2); | |
| 1654 Ibyte *e1 = p1 + XSTRING_LENGTH (s1); | |
| 1655 Ibyte *e2 = p2 + XSTRING_LENGTH (s2); | |
| 771 | 1656 |
| 1657 /* again, we use a symmetric algorithm and favor clarity over | |
| 1658 nanosecond improvements. */ | |
| 1659 while (1) | |
| 1660 { | |
| 1661 /* if we reached the end of either string, compare lengths. | |
| 1662 do NOT compare the final null byte against anything, in case | |
| 1663 the other string also has a null byte at that position. */ | |
| 1664 if (p1 == e1 || p2 == e2) | |
| 1665 return e1 - e2; | |
| 1666 if (cm[*p1] != cm[*p2]) | |
| 1667 return cm[*p1] - cm[*p2]; | |
| 1668 p1++, p2++; | |
| 1669 } | |
| 1670 } | |
| 1671 | |
| 1672 int | |
| 1673 lisp_strcasecmp_i18n (Lisp_Object s1, Lisp_Object s2) | |
| 1674 { | |
| 801 | 1675 return qxetextcasecmp (XSTRING_DATA (s1), XSTRING_LENGTH (s1), |
| 1676 XSTRING_DATA (s2), XSTRING_LENGTH (s2)); | |
| 771 | 1677 } |
| 1678 | |
| 2367 | 1679 /* Compare a wide string with an ASCII string */ |
| 1680 | |
| 1681 int | |
| 1682 wcscmp_ascii (const wchar_t *s1, const Ascbyte *s2) | |
| 1683 { | |
| 1684 while (*s1 && *s2) | |
| 1685 { | |
| 2956 | 1686 if (*s1 != (wchar_t) *s2) |
| 2367 | 1687 break; |
| 1688 s1++, s2++; | |
| 1689 } | |
| 1690 | |
| 1691 return *s1 - *s2; | |
| 1692 } | |
| 1693 | |
| 1694 int | |
| 1695 wcsncmp_ascii (const wchar_t *s1, const Ascbyte *s2, Charcount len) | |
| 1696 { | |
| 1697 while (len--) | |
| 1698 { | |
| 1699 int diff = *s1 - *s2; | |
| 1700 if (diff != 0) | |
| 1701 return diff; | |
| 1702 if (!*s1) | |
| 1703 return 0; | |
| 1704 s1++, s2++; | |
| 1705 } | |
| 1706 | |
| 1707 return 0; | |
| 1708 } | |
| 1709 | |
| 771 | 1710 |
| 1711 /************************************************************************/ | |
| 1712 /* conversion between textual representations */ | |
| 1713 /************************************************************************/ | |
| 1714 | |
| 1715 /* NOTE: Does not reset the Dynarr. */ | |
| 1716 | |
| 1717 void | |
| 867 | 1718 convert_ibyte_string_into_ichar_dynarr (const Ibyte *str, Bytecount len, |
| 2367 | 1719 Ichar_dynarr *dyn) |
| 771 | 1720 { |
| 867 | 1721 const Ibyte *strend = str + len; |
| 771 | 1722 |
| 1723 while (str < strend) | |
| 1724 { | |
| 867 | 1725 Ichar ch = itext_ichar (str); |
| 771 | 1726 Dynarr_add (dyn, ch); |
| 867 | 1727 INC_IBYTEPTR (str); |
| 771 | 1728 } |
| 1729 } | |
| 1730 | |
| 1731 Charcount | |
| 867 | 1732 convert_ibyte_string_into_ichar_string (const Ibyte *str, Bytecount len, |
| 2367 | 1733 Ichar *arr) |
| 771 | 1734 { |
| 867 | 1735 const Ibyte *strend = str + len; |
| 771 | 1736 Charcount newlen = 0; |
| 1737 while (str < strend) | |
| 1738 { | |
| 867 | 1739 Ichar ch = itext_ichar (str); |
| 771 | 1740 arr[newlen++] = ch; |
| 867 | 1741 INC_IBYTEPTR (str); |
| 771 | 1742 } |
| 1743 return newlen; | |
| 1744 } | |
| 1745 | |
| 867 | 1746 /* Convert an array of Ichars into the equivalent string representation. |
| 1747 Store into the given Ibyte dynarr. Does not reset the dynarr. | |
| 771 | 1748 Does not add a terminating zero. */ |
| 1749 | |
| 1750 void | |
| 867 | 1751 convert_ichar_string_into_ibyte_dynarr (Ichar *arr, int nels, |
| 1752 Ibyte_dynarr *dyn) | |
| 771 | 1753 { |
| 867 | 1754 Ibyte str[MAX_ICHAR_LEN]; |
| 771 | 1755 int i; |
| 1756 | |
| 1757 for (i = 0; i < nels; i++) | |
| 1758 { | |
| 867 | 1759 Bytecount len = set_itext_ichar (str, arr[i]); |
| 771 | 1760 Dynarr_add_many (dyn, str, len); |
| 1761 } | |
| 1762 } | |
| 1763 | |
| 867 | 1764 /* Convert an array of Ichars into the equivalent string representation. |
| 771 | 1765 Malloc the space needed for this and return it. If LEN_OUT is not a |
| 867 | 1766 NULL pointer, store into LEN_OUT the number of Ibytes in the |
| 1767 malloc()ed string. Note that the actual number of Ibytes allocated | |
| 771 | 1768 is one more than this: the returned string is zero-terminated. */ |
| 1769 | |
| 867 | 1770 Ibyte * |
| 1771 convert_ichar_string_into_malloced_string (Ichar *arr, int nels, | |
| 826 | 1772 Bytecount *len_out) |
| 771 | 1773 { |
| 1774 /* Damn zero-termination. */ | |
| 2367 | 1775 Ibyte *str = alloca_ibytes (nels * MAX_ICHAR_LEN + 1); |
| 867 | 1776 Ibyte *strorig = str; |
| 771 | 1777 Bytecount len; |
| 1778 | |
| 1779 int i; | |
| 1780 | |
| 1781 for (i = 0; i < nels; i++) | |
| 867 | 1782 str += set_itext_ichar (str, arr[i]); |
| 771 | 1783 *str = '\0'; |
| 1784 len = str - strorig; | |
| 2367 | 1785 str = xnew_ibytes (1 + len); |
| 771 | 1786 memcpy (str, strorig, 1 + len); |
| 1787 if (len_out) | |
| 1788 *len_out = len; | |
| 1789 return str; | |
| 1790 } | |
| 1791 | |
| 826 | 1792 #define COPY_TEXT_BETWEEN_FORMATS(srcfmt, dstfmt) \ |
| 1793 do \ | |
| 1794 { \ | |
| 1795 if (dst) \ | |
| 1796 { \ | |
| 867 | 1797 Ibyte *dstend = dst + dstlen; \ |
| 1798 Ibyte *dstp = dst; \ | |
| 1799 const Ibyte *srcend = src + srclen; \ | |
| 1800 const Ibyte *srcp = src; \ | |
| 826 | 1801 \ |
| 1802 while (srcp < srcend) \ | |
| 1803 { \ | |
| 867 | 1804 Ichar ch = itext_ichar_fmt (srcp, srcfmt, srcobj); \ |
| 1805 Bytecount len = ichar_len_fmt (ch, dstfmt); \ | |
| 826 | 1806 \ |
| 1807 if (dstp + len <= dstend) \ | |
| 1808 { \ | |
| 2956 | 1809 (void) set_itext_ichar_fmt (dstp, ch, dstfmt, dstobj); \ |
| 826 | 1810 dstp += len; \ |
| 1811 } \ | |
| 1812 else \ | |
| 1813 break; \ | |
| 867 | 1814 INC_IBYTEPTR_FMT (srcp, srcfmt); \ |
| 826 | 1815 } \ |
| 1816 text_checking_assert (srcp <= srcend); \ | |
| 1817 if (src_used) \ | |
| 1818 *src_used = srcp - src; \ | |
| 1819 return dstp - dst; \ | |
| 1820 } \ | |
| 1821 else \ | |
| 1822 { \ | |
| 867 | 1823 const Ibyte *srcend = src + srclen; \ |
| 1824 const Ibyte *srcp = src; \ | |
| 826 | 1825 Bytecount total = 0; \ |
| 1826 \ | |
| 1827 while (srcp < srcend) \ | |
| 1828 { \ | |
| 867 | 1829 total += ichar_len_fmt (itext_ichar_fmt (srcp, srcfmt, \ |
| 826 | 1830 srcobj), dstfmt); \ |
| 867 | 1831 INC_IBYTEPTR_FMT (srcp, srcfmt); \ |
| 826 | 1832 } \ |
| 1833 text_checking_assert (srcp == srcend); \ | |
| 1834 if (src_used) \ | |
| 1835 *src_used = srcp - src; \ | |
| 1836 return total; \ | |
| 1837 } \ | |
| 1838 } \ | |
| 1839 while (0) | |
| 1840 | |
| 1841 /* Copy as much text from SRC/SRCLEN to DST/DSTLEN as will fit, converting | |
| 1842 from SRCFMT/SRCOBJ to DSTFMT/DSTOBJ. Return number of bytes stored into | |
| 1843 DST as return value, and number of bytes copied from SRC through | |
| 1844 SRC_USED (if not NULL). If DST is NULL, don't actually store anything | |
| 1845 and just return the size needed to store all the text. Will not copy | |
| 1846 partial characters into DST. */ | |
| 1847 | |
| 1848 Bytecount | |
| 867 | 1849 copy_text_between_formats (const Ibyte *src, Bytecount srclen, |
| 826 | 1850 Internal_Format srcfmt, |
| 2333 | 1851 Lisp_Object USED_IF_MULE (srcobj), |
| 867 | 1852 Ibyte *dst, Bytecount dstlen, |
| 826 | 1853 Internal_Format dstfmt, |
| 2333 | 1854 Lisp_Object USED_IF_MULE (dstobj), |
| 826 | 1855 Bytecount *src_used) |
| 1856 { | |
| 1857 if (srcfmt == dstfmt && | |
| 1858 objects_have_same_internal_representation (srcobj, dstobj)) | |
| 1859 { | |
| 1860 if (dst) | |
| 1861 { | |
| 1862 srclen = min (srclen, dstlen); | |
| 867 | 1863 srclen = validate_ibyte_string_backward (src, srclen); |
| 826 | 1864 memcpy (dst, src, srclen); |
| 1865 if (src_used) | |
| 1866 *src_used = srclen; | |
| 1867 return srclen; | |
| 1868 } | |
| 1869 else | |
| 1870 return srclen; | |
| 1871 } | |
| 1872 /* Everything before the final else statement is an optimization. | |
| 1873 The inner loops inside COPY_TEXT_BETWEEN_FORMATS() have a number | |
| 1874 of calls to *_fmt(), each of which has a switch statement in it. | |
| 1875 By using constants as the FMT argument, these switch statements | |
| 1876 will be optimized out of existence. */ | |
| 1877 #define ELSE_FORMATS(fmt1, fmt2) \ | |
| 1878 else if (srcfmt == fmt1 && dstfmt == fmt2) \ | |
| 1879 COPY_TEXT_BETWEEN_FORMATS (fmt1, fmt2) | |
| 1880 ELSE_FORMATS (FORMAT_DEFAULT, FORMAT_8_BIT_FIXED); | |
| 1881 ELSE_FORMATS (FORMAT_8_BIT_FIXED, FORMAT_DEFAULT); | |
| 1882 ELSE_FORMATS (FORMAT_DEFAULT, FORMAT_32_BIT_FIXED); | |
| 1883 ELSE_FORMATS (FORMAT_32_BIT_FIXED, FORMAT_DEFAULT); | |
| 1884 else | |
| 1885 COPY_TEXT_BETWEEN_FORMATS (srcfmt, dstfmt); | |
| 1886 #undef ELSE_FORMATS | |
| 1887 } | |
| 1888 | |
| 1889 /* Copy as much buffer text in BUF, starting at POS, of length LEN, as will | |
| 1890 fit into DST/DSTLEN, converting to DSTFMT. Return number of bytes | |
| 1891 stored into DST as return value, and number of bytes copied from BUF | |
| 1892 through SRC_USED (if not NULL). If DST is NULL, don't actually store | |
| 1893 anything and just return the size needed to store all the text. */ | |
| 1894 | |
| 1895 Bytecount | |
| 1896 copy_buffer_text_out (struct buffer *buf, Bytebpos pos, | |
| 867 | 1897 Bytecount len, Ibyte *dst, Bytecount dstlen, |
| 826 | 1898 Internal_Format dstfmt, Lisp_Object dstobj, |
| 1899 Bytecount *src_used) | |
| 1900 { | |
| 1901 Bytecount dst_used = 0; | |
| 1902 if (src_used) | |
| 1903 *src_used = 0; | |
| 1904 | |
| 1905 { | |
| 1906 BUFFER_TEXT_LOOP (buf, pos, len, runptr, runlen) | |
| 1907 { | |
| 1908 Bytecount the_src_used, the_dst_used; | |
| 1909 | |
| 1910 the_dst_used = copy_text_between_formats (runptr, runlen, | |
| 1911 BUF_FORMAT (buf), | |
| 1912 wrap_buffer (buf), | |
| 1913 dst, dstlen, dstfmt, | |
| 1914 dstobj, &the_src_used); | |
| 1915 dst_used += the_dst_used; | |
| 1916 if (src_used) | |
| 1917 *src_used += the_src_used; | |
| 1918 if (dst) | |
| 1919 { | |
| 1920 dst += the_dst_used; | |
| 1921 dstlen -= the_dst_used; | |
| 841 | 1922 /* Stop if we didn't use all of the source text. Also stop |
| 1923 if the destination is full. We need the first test because | |
| 1924 there might be a couple bytes left in the destination, but | |
| 1925 not enough to fit a full character. The first test will in | |
| 1926 fact catch the vast majority of cases where the destination | |
| 1927 is empty, too -- but in case the destination holds *exactly* | |
| 1928 the run length, we put in the second check. (It shouldn't | |
| 1929 really matter though -- next time through we'll just get a | |
| 1930 0.) */ | |
| 1931 if (the_src_used < runlen || !dstlen) | |
| 826 | 1932 break; |
| 1933 } | |
| 1934 } | |
| 1935 } | |
| 1936 | |
| 1937 return dst_used; | |
| 1938 } | |
| 1939 | |
| 771 | 1940 |
| 1941 /************************************************************************/ | |
| 1942 /* charset properties of strings */ | |
| 1943 /************************************************************************/ | |
| 1944 | |
| 1945 void | |
| 2333 | 1946 find_charsets_in_ibyte_string (unsigned char *charsets, |
| 1947 const Ibyte *USED_IF_MULE (str), | |
| 1948 Bytecount USED_IF_MULE (len)) | |
| 771 | 1949 { |
| 1950 #ifndef MULE | |
| 1951 /* Telescope this. */ | |
| 1952 charsets[0] = 1; | |
| 1953 #else | |
| 867 | 1954 const Ibyte *strend = str + len; |
| 771 | 1955 memset (charsets, 0, NUM_LEADING_BYTES); |
| 1956 | |
| 1957 /* #### SJT doesn't like this. */ | |
| 1958 if (len == 0) | |
| 1959 { | |
| 1960 charsets[XCHARSET_LEADING_BYTE (Vcharset_ascii) - MIN_LEADING_BYTE] = 1; | |
| 1961 return; | |
| 1962 } | |
| 1963 | |
| 1964 while (str < strend) | |
| 1965 { | |
| 867 | 1966 charsets[ichar_leading_byte (itext_ichar (str)) - MIN_LEADING_BYTE] = |
| 771 | 1967 1; |
| 867 | 1968 INC_IBYTEPTR (str); |
| 771 | 1969 } |
| 1970 #endif | |
| 1971 } | |
| 1972 | |
| 1973 void | |
| 2333 | 1974 find_charsets_in_ichar_string (unsigned char *charsets, |
| 1975 const Ichar *USED_IF_MULE (str), | |
| 1976 Charcount USED_IF_MULE (len)) | |
| 771 | 1977 { |
| 1978 #ifndef MULE | |
| 1979 /* Telescope this. */ | |
| 1980 charsets[0] = 1; | |
| 1981 #else | |
| 1982 int i; | |
| 1983 | |
| 1984 memset (charsets, 0, NUM_LEADING_BYTES); | |
| 1985 | |
| 1986 /* #### SJT doesn't like this. */ | |
| 1987 if (len == 0) | |
| 1988 { | |
| 1989 charsets[XCHARSET_LEADING_BYTE (Vcharset_ascii) - MIN_LEADING_BYTE] = 1; | |
| 1990 return; | |
| 1991 } | |
| 1992 | |
| 1993 for (i = 0; i < len; i++) | |
| 1994 { | |
| 867 | 1995 charsets[ichar_leading_byte (str[i]) - MIN_LEADING_BYTE] = 1; |
| 771 | 1996 } |
| 1997 #endif | |
| 1998 } | |
| 1999 | |
| 3571 | 2000 /* A couple of these functions should only be called on a non-Mule build. */ |
| 2001 #ifdef MULE | |
| 2002 #define ASSERT_BUILT_WITH_MULE() assert(1) | |
| 2003 #else /* MULE */ | |
| 2004 #define ASSERT_BUILT_WITH_MULE() assert(0) | |
| 2005 #endif /* MULE */ | |
| 2006 | |
| 771 | 2007 int |
| 867 | 2008 ibyte_string_displayed_columns (const Ibyte *str, Bytecount len) |
| 771 | 2009 { |
| 2010 int cols = 0; | |
| 867 | 2011 const Ibyte *end = str + len; |
| 3571 | 2012 Ichar ch; |
| 2013 | |
| 2014 ASSERT_BUILT_WITH_MULE(); | |
| 771 | 2015 |
| 2016 while (str < end) | |
| 2017 { | |
| 3571 | 2018 ch = itext_ichar (str); |
| 867 | 2019 cols += XCHARSET_COLUMNS (ichar_charset (ch)); |
| 2020 INC_IBYTEPTR (str); | |
| 771 | 2021 } |
| 2022 | |
| 2023 return cols; | |
| 2024 } | |
| 2025 | |
| 2026 int | |
| 3571 | 2027 ichar_string_displayed_columns (const Ichar * USED_IF_MULE(str), Charcount len) |
| 771 | 2028 { |
| 2029 int cols = 0; | |
| 2030 int i; | |
| 2031 | |
| 3571 | 2032 ASSERT_BUILT_WITH_MULE(); |
| 2033 | |
| 771 | 2034 for (i = 0; i < len; i++) |
| 867 | 2035 cols += XCHARSET_COLUMNS (ichar_charset (str[i])); |
| 771 | 2036 |
| 2037 return cols; | |
| 2038 } | |
| 2039 | |
| 2040 Charcount | |
| 2333 | 2041 ibyte_string_nonascii_chars (const Ibyte *USED_IF_MULE (str), |
| 2042 Bytecount USED_IF_MULE (len)) | |
| 771 | 2043 { |
| 2044 #ifdef MULE | |
| 867 | 2045 const Ibyte *end = str + len; |
| 771 | 2046 Charcount retval = 0; |
| 2047 | |
| 2048 while (str < end) | |
| 2049 { | |
| 826 | 2050 if (!byte_ascii_p (*str)) |
| 771 | 2051 retval++; |
| 867 | 2052 INC_IBYTEPTR (str); |
| 771 | 2053 } |
| 2054 | |
| 2055 return retval; | |
| 2056 #else | |
| 2057 return 0; | |
| 2058 #endif | |
| 2059 } | |
| 2060 | |
| 2061 | |
| 2062 /***************************************************************************/ | |
| 2063 /* Eistring helper functions */ | |
| 2064 /***************************************************************************/ | |
| 2065 | |
| 2066 int | |
| 867 | 2067 eistr_casefiddle_1 (Ibyte *olddata, Bytecount len, Ibyte *newdata, |
| 771 | 2068 int downp) |
| 2069 { | |
| 867 | 2070 Ibyte *endp = olddata + len; |
| 2071 Ibyte *newp = newdata; | |
| 771 | 2072 int changedp = 0; |
| 2073 | |
| 2074 while (olddata < endp) | |
| 2075 { | |
| 867 | 2076 Ichar c = itext_ichar (olddata); |
| 2077 Ichar newc; | |
| 771 | 2078 |
| 2079 if (downp) | |
| 2080 newc = DOWNCASE (0, c); | |
| 2081 else | |
| 2082 newc = UPCASE (0, c); | |
| 2083 | |
| 2084 if (c != newc) | |
| 2085 changedp = 1; | |
| 2086 | |
| 867 | 2087 newp += set_itext_ichar (newp, newc); |
| 2088 INC_IBYTEPTR (olddata); | |
| 771 | 2089 } |
| 2090 | |
| 2091 *newp = '\0'; | |
| 2092 | |
| 2093 return changedp ? newp - newdata : 0; | |
| 2094 } | |
| 2095 | |
| 2096 int | |
| 2097 eifind_large_enough_buffer (int oldbufsize, int needed_size) | |
| 2098 { | |
| 2099 while (oldbufsize < needed_size) | |
| 2100 { | |
| 2101 oldbufsize = oldbufsize * 3 / 2; | |
| 2102 oldbufsize = max (oldbufsize, 32); | |
| 2103 } | |
| 2104 | |
| 2105 return oldbufsize; | |
| 2106 } | |
| 2107 | |
| 2108 void | |
| 2109 eito_malloc_1 (Eistring *ei) | |
| 2110 { | |
| 2111 if (ei->mallocp_) | |
| 2112 return; | |
| 2113 ei->mallocp_ = 1; | |
| 2114 if (ei->data_) | |
| 2115 { | |
| 867 | 2116 Ibyte *newdata; |
| 771 | 2117 |
| 2118 ei->max_size_allocated_ = | |
| 2119 eifind_large_enough_buffer (0, ei->bytelen_ + 1); | |
| 2367 | 2120 newdata = xnew_ibytes (ei->max_size_allocated_); |
| 771 | 2121 memcpy (newdata, ei->data_, ei->bytelen_ + 1); |
| 2122 ei->data_ = newdata; | |
| 2123 } | |
| 2124 | |
| 2125 if (ei->extdata_) | |
| 2126 { | |
| 2367 | 2127 Extbyte *newdata = xnew_extbytes (ei->extlen_ + 2); |
| 771 | 2128 |
| 2129 memcpy (newdata, ei->extdata_, ei->extlen_); | |
| 2130 /* Double null-terminate in case of Unicode data */ | |
| 2131 newdata[ei->extlen_] = '\0'; | |
| 2132 newdata[ei->extlen_ + 1] = '\0'; | |
| 2133 ei->extdata_ = newdata; | |
| 2134 } | |
| 2135 } | |
| 2136 | |
| 2137 int | |
| 2138 eicmp_1 (Eistring *ei, Bytecount off, Charcount charoff, | |
| 867 | 2139 Bytecount len, Charcount charlen, const Ibyte *data, |
| 2421 | 2140 const Eistring *ei2, int is_ascii, int fold_case) |
| 771 | 2141 { |
| 3462 | 2142 assert ((data == 0) != (ei == 0)); |
| 2143 assert ((is_ascii != 0) == (data != 0)); | |
| 2144 assert (fold_case >= 0 && fold_case <= 2); | |
| 771 | 2145 assert ((off < 0) != (charoff < 0)); |
| 3462 | 2146 |
| 771 | 2147 if (off < 0) |
| 2148 { | |
| 2149 off = charcount_to_bytecount (ei->data_, charoff); | |
| 2150 if (charlen < 0) | |
| 2151 len = -1; | |
| 2152 else | |
| 2153 len = charcount_to_bytecount (ei->data_ + off, charlen); | |
| 2154 } | |
| 2155 if (len < 0) | |
| 2156 len = ei->bytelen_ - off; | |
| 2157 | |
| 2158 assert (off >= 0 && off <= ei->bytelen_); | |
| 2159 assert (len >= 0 && off + len <= ei->bytelen_); | |
| 2160 | |
| 2161 { | |
| 2162 Bytecount dstlen; | |
| 867 | 2163 const Ibyte *src = ei->data_, *dst; |
| 771 | 2164 |
| 2165 if (data) | |
| 2166 { | |
| 2167 dst = data; | |
| 2168 dstlen = qxestrlen (data); | |
| 2169 } | |
| 2170 else | |
| 2171 { | |
| 2172 dst = ei2->data_; | |
| 2173 dstlen = ei2->bytelen_; | |
| 2174 } | |
| 2175 | |
| 2421 | 2176 if (is_ascii) |
| 2367 | 2177 ASSERT_ASCTEXT_ASCII_LEN ((Ascbyte *) dst, dstlen); |
| 771 | 2178 |
| 801 | 2179 return (fold_case == 0 ? qxememcmp4 (src, len, dst, dstlen) : |
| 2180 fold_case == 1 ? qxememcasecmp4 (src, len, dst, dstlen) : | |
| 2181 qxetextcasecmp (src, len, dst, dstlen)); | |
| 771 | 2182 } |
| 2183 } | |
| 2184 | |
| 867 | 2185 Ibyte * |
| 826 | 2186 eicpyout_malloc_fmt (Eistring *eistr, Bytecount *len_out, Internal_Format fmt, |
| 2286 | 2187 Lisp_Object UNUSED (object)) |
| 771 | 2188 { |
| 867 | 2189 Ibyte *ptr; |
| 771 | 2190 |
| 2191 assert (fmt == FORMAT_DEFAULT); | |
| 867 | 2192 ptr = xnew_array (Ibyte, eistr->bytelen_ + 1); |
| 771 | 2193 if (len_out) |
| 2194 *len_out = eistr->bytelen_; | |
| 2195 memcpy (ptr, eistr->data_, eistr->bytelen_ + 1); | |
| 2196 return ptr; | |
| 2197 } | |
| 2198 | |
| 2199 | |
| 2200 /************************************************************************/ | |
| 2201 /* Charcount/Bytecount conversion */ | |
| 2202 /************************************************************************/ | |
| 2203 | |
| 2204 /* Optimization. Do it. Live it. Love it. */ | |
| 2205 | |
| 2206 #ifdef MULE | |
| 2207 | |
| 826 | 2208 /* Function equivalents of bytecount_to_charcount/charcount_to_bytecount. |
| 2209 These work on strings of all sizes but are more efficient than a simple | |
| 2210 loop on large strings and probably less efficient on sufficiently small | |
| 2211 strings. */ | |
| 2212 | |
| 2213 Charcount | |
| 867 | 2214 bytecount_to_charcount_fun (const Ibyte *ptr, Bytecount len) |
| 826 | 2215 { |
| 2216 Charcount count = 0; | |
| 867 | 2217 const Ibyte *end = ptr + len; |
| 826 | 2218 while (1) |
| 2219 { | |
| 867 | 2220 const Ibyte *newptr = skip_ascii (ptr, end); |
| 826 | 2221 count += newptr - ptr; |
| 2222 ptr = newptr; | |
| 2223 if (ptr == end) | |
| 2224 break; | |
| 2225 { | |
| 2226 /* Optimize for successive characters from the same charset */ | |
| 867 | 2227 Ibyte leading_byte = *ptr; |
| 826 | 2228 int bytes = rep_bytes_by_first_byte (leading_byte); |
| 2229 while (ptr < end && *ptr == leading_byte) | |
| 2230 ptr += bytes, count++; | |
| 2231 } | |
| 771 | 2232 } |
| 2233 | |
| 2234 /* Bomb out if the specified substring ends in the middle | |
| 2235 of a character. Note that we might have already gotten | |
| 2236 a core dump above from an invalid reference, but at least | |
| 2237 we will get no farther than here. | |
| 2238 | |
| 2239 This also catches len < 0. */ | |
| 800 | 2240 text_checking_assert (ptr == end); |
| 771 | 2241 |
| 2242 return count; | |
| 2243 } | |
| 2244 | |
|
5784
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2245 /* Return the character count of an lstream or coding buffer of |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2246 internal-format text, counting partial characters at the beginning of the |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2247 buffer as whole characters, and *not* counting partial characters at the |
|
5785
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2248 end of the buffer. The result of this function is subtracted from the |
|
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2249 character count given by the coding system character tell methods, and we |
|
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2250 need to treat each buffer in the same way to avoid double-counting. */ |
|
5784
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2251 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2252 Charcount |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2253 buffered_bytecount_to_charcount (const Ibyte *bufptr, Bytecount len) |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2254 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2255 Boolint partial_first = 0; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2256 Bytecount impartial; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2257 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2258 if (valid_ibyteptr_p (bufptr)) |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2259 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2260 if (rep_bytes_by_first_byte (*bufptr) > len) |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2261 { |
|
5785
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2262 /* This is a partial last character. Return 0, avoid treating it |
|
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2263 as a partial first character, since that would lead to it being |
|
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2264 counted twice. */ |
|
7343a186a475
Correct some partial character accounting, buffered_bytecount_to_charcount().
Aidan Kehoe <kehoea@parhasard.net>
parents:
5784
diff
changeset
|
2265 return (Charcount) 0; |
|
5784
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2266 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2267 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2268 else |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2269 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2270 const Ibyte *newstart = bufptr, *limit = newstart + len; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2271 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2272 /* Our consumer has the start of a partial character, we have the |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2273 rest. */ |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2274 while (newstart < limit && !valid_ibyteptr_p (newstart)) |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2275 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2276 newstart++; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2277 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2278 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2279 partial_first = 1; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2280 bufptr = newstart; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2281 len = limit - newstart; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2282 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2283 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2284 if (len && valid_ibyteptr_p (bufptr)) |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2285 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2286 /* There's at least one valid starting char in the string, |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2287 validate_ibyte_string_backward won't run off the begining. */ |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2288 impartial = validate_ibyte_string_backward (bufptr, len); |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2289 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2290 else |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2291 { |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2292 impartial = 0; |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2293 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2294 |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2295 return (Charcount) partial_first + bytecount_to_charcount (bufptr, |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2296 impartial); |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2297 } |
|
0cb4f494a548
Have the result of coding_character_tell() reflect str->convert_to, too.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5774
diff
changeset
|
2298 |
| 771 | 2299 Bytecount |
| 867 | 2300 charcount_to_bytecount_fun (const Ibyte *ptr, Charcount len) |
| 771 | 2301 { |
| 867 | 2302 const Ibyte *newptr = ptr; |
| 826 | 2303 while (1) |
| 771 | 2304 { |
| 867 | 2305 const Ibyte *newnewptr = skip_ascii (newptr, newptr + len); |
| 826 | 2306 len -= newnewptr - newptr; |
| 2307 newptr = newnewptr; | |
| 2308 if (!len) | |
| 2309 break; | |
| 2310 { | |
| 2311 /* Optimize for successive characters from the same charset */ | |
| 867 | 2312 Ibyte leading_byte = *newptr; |
| 826 | 2313 int bytes = rep_bytes_by_first_byte (leading_byte); |
| 2314 while (len > 0 && *newptr == leading_byte) | |
| 2315 newptr += bytes, len--; | |
| 2316 } | |
| 771 | 2317 } |
| 2318 return newptr - ptr; | |
| 2319 } | |
| 2320 | |
| 2367 | 2321 /* Function equivalent of charcount_to_bytecount_down. This works on strings |
| 2322 of all sizes but is more efficient than a simple loop on large strings | |
| 2323 and probably less efficient on sufficiently small strings. */ | |
| 2324 | |
| 2325 Bytecount | |
| 2326 charcount_to_bytecount_down_fun (const Ibyte *ptr, Charcount len) | |
| 2327 { | |
| 2328 const Ibyte *newptr = ptr; | |
| 2329 while (1) | |
| 2330 { | |
| 2331 const Ibyte *newnewptr = skip_ascii_down (newptr, newptr - len); | |
| 2332 len -= newptr - newnewptr; | |
| 2333 newptr = newnewptr; | |
| 2334 /* Skip over all non-ASCII chars, counting the length and | |
| 2335 stopping if it's zero */ | |
| 2336 while (len && !byte_ascii_p (*(newptr - 1))) | |
| 2337 if (ibyte_first_byte_p (*--newptr)) | |
| 2338 len--; | |
| 2339 if (!len) | |
| 2340 break; | |
| 2341 } | |
| 2342 text_checking_assert (ptr - newptr >= 0); | |
| 2343 return ptr - newptr; | |
| 2344 } | |
| 2345 | |
| 771 | 2346 /* The next two functions are the actual meat behind the |
| 2347 charbpos-to-bytebpos and bytebpos-to-charbpos conversions. Currently | |
| 2348 the method they use is fairly unsophisticated; see buffer.h. | |
| 2349 | |
| 2350 Note that charbpos_to_bytebpos_func() is probably the most-called | |
| 2351 function in all of XEmacs. Therefore, it must be FAST FAST FAST. | |
| 2352 This is the reason why so much of the code is duplicated. | |
| 2353 | |
| 2354 Similar considerations apply to bytebpos_to_charbpos_func(), although | |
| 2355 less so because the function is not called so often. | |
| 2367 | 2356 */ |
| 2357 | |
| 2358 /* | |
| 2359 | |
| 2360 Info on Byte-Char conversion: | |
| 2361 | |
| 2362 (Info-goto-node "(internals)Byte-Char Position Conversion") | |
| 2363 */ | |
| 2364 | |
| 2365 #ifdef OLD_BYTE_CHAR | |
| 771 | 2366 static int not_very_random_number; |
| 2367 | 2367 #endif /* OLD_BYTE_CHAR */ |
| 2368 | |
| 2369 #define OLD_LOOP | |
| 2370 | |
| 2371 /* If we are this many characters away from any known position, cache the | |
| 2372 new position in the buffer's char-byte cache. */ | |
| 2373 #define FAR_AWAY_DISTANCE 5000 | |
| 2374 | |
| 2375 /* Converting between character positions and byte positions. */ | |
| 2376 | |
| 2377 /* There are several places in the buffer where we know | |
| 2378 the correspondence: BEG, BEGV, PT, GPT, ZV and Z, | |
| 2379 and everywhere there is a marker. So we find the one of these places | |
| 2380 that is closest to the specified position, and scan from there. */ | |
| 2381 | |
| 2382 /* This macro is a subroutine of charbpos_to_bytebpos_func. | |
| 2383 Note that it is desirable that BYTEPOS is not evaluated | |
| 2384 except when we really want its value. */ | |
| 2385 | |
| 2386 #define CONSIDER(CHARPOS, BYTEPOS) \ | |
| 2387 do \ | |
| 2388 { \ | |
| 2389 Charbpos this_charpos = (CHARPOS); \ | |
| 2390 int changed = 0; \ | |
| 2391 \ | |
| 2392 if (this_charpos == x) \ | |
| 2393 { \ | |
| 2394 retval = (BYTEPOS); \ | |
| 2395 goto done; \ | |
| 2396 } \ | |
| 2397 else if (this_charpos > x) \ | |
| 2398 { \ | |
| 2399 if (this_charpos < best_above) \ | |
| 2400 { \ | |
| 2401 best_above = this_charpos; \ | |
| 2402 best_above_byte = (BYTEPOS); \ | |
| 2403 changed = 1; \ | |
| 2404 } \ | |
| 2405 } \ | |
| 2406 else if (this_charpos > best_below) \ | |
| 2407 { \ | |
| 2408 best_below = this_charpos; \ | |
| 2409 best_below_byte = (BYTEPOS); \ | |
| 2410 changed = 1; \ | |
| 2411 } \ | |
| 2412 \ | |
| 2413 if (changed) \ | |
| 2414 { \ | |
| 2415 if (best_above - best_below == best_above_byte - best_below_byte) \ | |
| 2416 { \ | |
| 2417 retval = best_below_byte + (x - best_below); \ | |
| 2418 goto done; \ | |
| 2419 } \ | |
| 2420 } \ | |
| 2421 } \ | |
| 2422 while (0) | |
| 2423 | |
| 771 | 2424 |
| 2425 Bytebpos | |
| 2426 charbpos_to_bytebpos_func (struct buffer *buf, Charbpos x) | |
| 2427 { | |
| 2367 | 2428 #ifdef OLD_BYTE_CHAR |
| 771 | 2429 Charbpos bufmin; |
| 2430 Charbpos bufmax; | |
| 2431 Bytebpos bytmin; | |
| 2432 Bytebpos bytmax; | |
| 2433 int size; | |
| 2434 int forward_p; | |
| 2435 int diff_so_far; | |
| 2436 int add_to_cache = 0; | |
| 2367 | 2437 #endif /* OLD_BYTE_CHAR */ |
| 2438 | |
| 2439 Charbpos best_above, best_below; | |
| 2440 Bytebpos best_above_byte, best_below_byte; | |
| 2441 int i; | |
| 2442 struct buffer_text *t; | |
| 2443 Bytebpos retval; | |
| 2444 | |
| 1292 | 2445 PROFILE_DECLARE (); |
| 771 | 2446 |
| 1292 | 2447 PROFILE_RECORD_ENTERING_SECTION (QSin_char_byte_conversion); |
| 2448 | |
| 2367 | 2449 best_above = BUF_Z (buf); |
| 2450 best_above_byte = BYTE_BUF_Z (buf); | |
| 2451 | |
| 2452 /* In this case, we simply have all one-byte characters. But this should | |
| 2453 have been intercepted before, in charbpos_to_bytebpos(). */ | |
| 2454 text_checking_assert (best_above != best_above_byte); | |
| 2455 | |
| 2456 best_below = BUF_BEG (buf); | |
| 2457 best_below_byte = BYTE_BUF_BEG (buf); | |
| 2458 | |
| 2459 /* We find in best_above and best_above_byte | |
| 2460 the closest known point above CHARPOS, | |
| 2461 and in best_below and best_below_byte | |
| 2462 the closest known point below CHARPOS, | |
| 2463 | |
| 2464 If at any point we can tell that the space between those | |
| 2465 two best approximations is all single-byte, | |
| 2466 we interpolate the result immediately. */ | |
| 2467 | |
| 2468 CONSIDER (BUF_PT (buf), BYTE_BUF_PT (buf)); | |
| 2469 CONSIDER (BUF_GPT (buf), BYTE_BUF_GPT (buf)); | |
| 2470 CONSIDER (BUF_BEGV (buf), BYTE_BUF_BEGV (buf)); | |
| 2471 CONSIDER (BUF_ZV (buf), BYTE_BUF_ZV (buf)); | |
| 2472 | |
| 2473 t = buf->text; | |
| 2474 CONSIDER (t->cached_charpos, t->cached_bytepos); | |
| 2475 | |
| 2476 /* Check the most recently entered positions first */ | |
| 2477 | |
| 2478 for (i = t->next_cache_pos - 1; i >= 0; i--) | |
| 2479 { | |
| 2480 CONSIDER (t->mule_charbpos_cache[i], t->mule_bytebpos_cache[i]); | |
| 2481 | |
| 2482 /* If we are down to a range of 50 chars, | |
| 2483 don't bother checking any other markers; | |
| 2484 scan the intervening chars directly now. */ | |
| 2485 if (best_above - best_below < 50) | |
| 2486 break; | |
| 2487 } | |
| 2488 | |
| 2489 /* We get here if we did not exactly hit one of the known places. | |
| 2490 We have one known above and one known below. | |
| 2491 Scan, counting characters, from whichever one is closer. */ | |
| 2492 | |
| 2493 if (x - best_below < best_above - x) | |
| 2494 { | |
| 2495 int record = x - best_below > FAR_AWAY_DISTANCE; | |
| 2496 | |
| 2497 #ifdef OLD_LOOP /* old code */ | |
| 2498 while (best_below != x) | |
| 2499 { | |
| 2500 best_below++; | |
| 2501 INC_BYTEBPOS (buf, best_below_byte); | |
| 2502 } | |
| 2503 #else | |
| 2504 text_checking_assert (BUF_FORMAT (buf) == FORMAT_DEFAULT); | |
| 2505 /* The gap should not occur between best_below and x, or we will be | |
| 2506 screwed in using charcount_to_bytecount(). It should not be exactly | |
| 2507 at x either, because we already should have caught that. */ | |
| 2508 text_checking_assert | |
| 2509 (BUF_CEILING_OF_IGNORE_ACCESSIBLE (buf, best_below) > x); | |
| 2510 | |
| 2511 /* Using charcount_to_bytecount() is potentially a lot faster than a | |
| 2512 simple loop using INC_BYTEBPOS() because (a) the checks for gap | |
| 2513 and buffer format are factored out instead of getting checked | |
| 2514 every time; (b) the checking goes 4 or 8 bytes at a time in ASCII | |
| 2515 text. | |
| 2516 */ | |
| 2517 best_below_byte += | |
| 2518 charcount_to_bytecount | |
| 2519 (BYTE_BUF_BYTE_ADDRESS (buf, best_below_byte), x - best_below); | |
| 2520 best_below = x; | |
| 2521 #endif /* 0 */ | |
| 2522 | |
| 2523 /* If this position is quite far from the nearest known position, | |
| 2524 cache the correspondence. | |
| 2525 | |
| 2526 NB FSF does this: "... by creating a marker here. | |
| 2527 It will last until the next GC." | |
| 2528 */ | |
| 2529 | |
| 2530 if (record) | |
| 2531 { | |
| 2532 /* If we have run out of positions to record, discard some of the | |
| 2533 old ones. I used to use a circular buffer, which avoids the | |
| 2534 need to block-move any memory. But it makes it more difficult | |
| 2535 to keep track of which positions haven't been used -- commonly | |
| 2536 we haven't yet filled out anywhere near the whole set of | |
| 2537 positions and don't want to check them all. We should not be | |
| 2538 recording that often, and block-moving is extremely fast in | |
| 2539 any case. --ben */ | |
| 2540 if (t->next_cache_pos == NUM_CACHED_POSITIONS) | |
| 2541 { | |
| 2542 memmove (t->mule_charbpos_cache, | |
| 2543 t->mule_charbpos_cache + NUM_MOVED_POSITIONS, | |
| 2544 sizeof (Charbpos) * | |
| 2545 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 2546 memmove (t->mule_bytebpos_cache, | |
| 2547 t->mule_bytebpos_cache + NUM_MOVED_POSITIONS, | |
| 2548 sizeof (Bytebpos) * | |
| 2549 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 2550 t->next_cache_pos -= NUM_MOVED_POSITIONS; | |
| 2551 } | |
| 2552 t->mule_charbpos_cache[t->next_cache_pos] = best_below; | |
| 2553 t->mule_bytebpos_cache[t->next_cache_pos] = best_below_byte; | |
| 2554 t->next_cache_pos++; | |
| 2555 } | |
| 2556 | |
| 2557 t->cached_charpos = best_below; | |
| 2558 t->cached_bytepos = best_below_byte; | |
| 2559 | |
| 2560 retval = best_below_byte; | |
| 2561 text_checking_assert (best_below_byte >= best_below); | |
| 2562 goto done; | |
| 2563 } | |
| 2564 else | |
| 2565 { | |
| 2566 int record = best_above - x > FAR_AWAY_DISTANCE; | |
| 2567 | |
| 2568 #ifdef OLD_LOOP | |
| 2569 while (best_above != x) | |
| 2570 { | |
| 2571 best_above--; | |
| 2572 DEC_BYTEBPOS (buf, best_above_byte); | |
| 2573 } | |
| 2574 #else | |
| 2575 text_checking_assert (BUF_FORMAT (buf) == FORMAT_DEFAULT); | |
| 2576 /* The gap should not occur between best_above and x, or we will be | |
| 2577 screwed in using charcount_to_bytecount_down(). It should not be | |
| 2578 exactly at x either, because we already should have caught | |
| 2579 that. */ | |
| 2580 text_checking_assert | |
| 2581 (BUF_FLOOR_OF_IGNORE_ACCESSIBLE (buf, best_above) < x); | |
| 2582 | |
| 2583 /* Using charcount_to_bytecount_down() is potentially a lot faster | |
| 2584 than a simple loop using DEC_BYTEBPOS(); see above. */ | |
| 2585 best_above_byte -= | |
| 2586 charcount_to_bytecount_down | |
| 2587 /* BYTE_BUF_BYTE_ADDRESS will return a value on the high side of the | |
| 2588 gap if we are at the gap, which is the wrong side. So do the | |
| 2589 following trick instead. */ | |
| 2590 (BYTE_BUF_BYTE_ADDRESS_BEFORE (buf, best_above_byte) + 1, | |
| 2591 best_above - x); | |
| 2592 best_above = x; | |
| 2593 #endif /* SLEDGEHAMMER_CHECK_TEXT */ | |
| 2594 | |
| 2595 | |
| 2596 /* If this position is quite far from the nearest known position, | |
| 2597 cache the correspondence. | |
| 2598 | |
| 2599 NB FSF does this: "... by creating a marker here. | |
| 2600 It will last until the next GC." | |
| 2601 */ | |
| 2602 if (record) | |
| 2603 { | |
| 2604 if (t->next_cache_pos == NUM_CACHED_POSITIONS) | |
| 2605 { | |
| 2606 memmove (t->mule_charbpos_cache, | |
| 2607 t->mule_charbpos_cache + NUM_MOVED_POSITIONS, | |
| 2608 sizeof (Charbpos) * | |
| 2609 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 2610 memmove (t->mule_bytebpos_cache, | |
| 2611 t->mule_bytebpos_cache + NUM_MOVED_POSITIONS, | |
| 2612 sizeof (Bytebpos) * | |
| 2613 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 2614 t->next_cache_pos -= NUM_MOVED_POSITIONS; | |
| 2615 } | |
| 2616 t->mule_charbpos_cache[t->next_cache_pos] = best_above; | |
| 2617 t->mule_bytebpos_cache[t->next_cache_pos] = best_above_byte; | |
| 2618 t->next_cache_pos++; | |
| 2619 } | |
| 2620 | |
| 2621 t->cached_charpos = best_above; | |
| 2622 t->cached_bytepos = best_above_byte; | |
| 2623 | |
| 2624 retval = best_above_byte; | |
| 2625 text_checking_assert (best_above_byte >= best_above); | |
| 2626 goto done; | |
| 2627 } | |
| 2628 | |
| 2629 #ifdef OLD_BYTE_CHAR | |
| 2630 | |
| 771 | 2631 bufmin = buf->text->mule_bufmin; |
| 2632 bufmax = buf->text->mule_bufmax; | |
| 2633 bytmin = buf->text->mule_bytmin; | |
| 2634 bytmax = buf->text->mule_bytmax; | |
| 2635 size = (1 << buf->text->mule_shifter) + !!buf->text->mule_three_p; | |
| 2636 | |
| 2637 /* The basic idea here is that we shift the "known region" up or down | |
| 2638 until it overlaps the specified position. We do this by moving | |
| 2639 the upper bound of the known region up one character at a time, | |
| 2640 and moving the lower bound of the known region up as necessary | |
| 2641 when the size of the character just seen changes. | |
| 2642 | |
| 2643 We optimize this, however, by first shifting the known region to | |
| 2644 one of the cached points if it's close by. (We don't check BEG or | |
| 2645 Z, even though they're cached; most of the time these will be the | |
| 2646 same as BEGV and ZV, and when they're not, they're not likely | |
| 2647 to be used.) */ | |
| 2648 | |
| 2649 if (x > bufmax) | |
| 2650 { | |
| 2651 Charbpos diffmax = x - bufmax; | |
| 2652 Charbpos diffpt = x - BUF_PT (buf); | |
| 2653 Charbpos diffzv = BUF_ZV (buf) - x; | |
| 2654 /* #### This value could stand some more exploration. */ | |
| 2655 Charcount heuristic_hack = (bufmax - bufmin) >> 2; | |
| 2656 | |
| 2657 /* Check if the position is closer to PT or ZV than to the | |
| 2658 end of the known region. */ | |
| 2659 | |
| 2660 if (diffpt < 0) | |
| 2661 diffpt = -diffpt; | |
| 2662 if (diffzv < 0) | |
| 2663 diffzv = -diffzv; | |
| 2664 | |
| 2665 /* But also implement a heuristic that favors the known region | |
| 2666 over PT or ZV. The reason for this is that switching to | |
| 2667 PT or ZV will wipe out the knowledge in the known region, | |
| 2668 which might be annoying if the known region is large and | |
| 2669 PT or ZV is not that much closer than the end of the known | |
| 2670 region. */ | |
| 2671 | |
| 2672 diffzv += heuristic_hack; | |
| 2673 diffpt += heuristic_hack; | |
| 2674 if (diffpt < diffmax && diffpt <= diffzv) | |
| 2675 { | |
| 2676 bufmax = bufmin = BUF_PT (buf); | |
| 826 | 2677 bytmax = bytmin = BYTE_BUF_PT (buf); |
| 771 | 2678 /* We set the size to 1 even though it doesn't really |
| 2679 matter because the new known region contains no | |
| 2680 characters. We do this because this is the most | |
| 2681 likely size of the characters around the new known | |
| 2682 region, and we avoid potential yuckiness that is | |
| 2683 done when size == 3. */ | |
| 2684 size = 1; | |
| 2685 } | |
| 2686 if (diffzv < diffmax) | |
| 2687 { | |
| 2688 bufmax = bufmin = BUF_ZV (buf); | |
| 826 | 2689 bytmax = bytmin = BYTE_BUF_ZV (buf); |
| 771 | 2690 size = 1; |
| 2691 } | |
| 2692 } | |
| 800 | 2693 #ifdef ERROR_CHECK_TEXT |
| 771 | 2694 else if (x >= bufmin) |
| 2500 | 2695 ABORT (); |
| 771 | 2696 #endif |
| 2697 else | |
| 2698 { | |
| 2699 Charbpos diffmin = bufmin - x; | |
| 2700 Charbpos diffpt = BUF_PT (buf) - x; | |
| 2701 Charbpos diffbegv = x - BUF_BEGV (buf); | |
| 2702 /* #### This value could stand some more exploration. */ | |
| 2703 Charcount heuristic_hack = (bufmax - bufmin) >> 2; | |
| 2704 | |
| 2705 if (diffpt < 0) | |
| 2706 diffpt = -diffpt; | |
| 2707 if (diffbegv < 0) | |
| 2708 diffbegv = -diffbegv; | |
| 2709 | |
| 2710 /* But also implement a heuristic that favors the known region -- | |
| 2711 see above. */ | |
| 2712 | |
| 2713 diffbegv += heuristic_hack; | |
| 2714 diffpt += heuristic_hack; | |
| 2715 | |
| 2716 if (diffpt < diffmin && diffpt <= diffbegv) | |
| 2717 { | |
| 2718 bufmax = bufmin = BUF_PT (buf); | |
| 826 | 2719 bytmax = bytmin = BYTE_BUF_PT (buf); |
| 771 | 2720 /* We set the size to 1 even though it doesn't really |
| 2721 matter because the new known region contains no | |
| 2722 characters. We do this because this is the most | |
| 2723 likely size of the characters around the new known | |
| 2724 region, and we avoid potential yuckiness that is | |
| 2725 done when size == 3. */ | |
| 2726 size = 1; | |
| 2727 } | |
| 2728 if (diffbegv < diffmin) | |
| 2729 { | |
| 2730 bufmax = bufmin = BUF_BEGV (buf); | |
| 826 | 2731 bytmax = bytmin = BYTE_BUF_BEGV (buf); |
| 771 | 2732 size = 1; |
| 2733 } | |
| 2734 } | |
| 2735 | |
| 2736 diff_so_far = x > bufmax ? x - bufmax : bufmin - x; | |
| 2737 if (diff_so_far > 50) | |
| 2738 { | |
| 2739 /* If we have to move more than a certain amount, then look | |
| 2740 into our cache. */ | |
| 2741 int minval = INT_MAX; | |
| 2742 int found = 0; | |
| 2743 int i; | |
| 2744 | |
| 2745 add_to_cache = 1; | |
| 2746 /* I considered keeping the positions ordered. This would speed | |
| 2747 up this loop, but updating the cache would take longer, so | |
| 2748 it doesn't seem like it would really matter. */ | |
| 2367 | 2749 for (i = 0; i < NUM_CACHED_POSITIONS; i++) |
| 771 | 2750 { |
| 2751 int diff = buf->text->mule_charbpos_cache[i] - x; | |
| 2752 | |
| 2753 if (diff < 0) | |
| 2754 diff = -diff; | |
| 2755 if (diff < minval) | |
| 2756 { | |
| 2757 minval = diff; | |
| 2758 found = i; | |
| 2759 } | |
| 2760 } | |
| 2761 | |
| 2762 if (minval < diff_so_far) | |
| 2763 { | |
| 2764 bufmax = bufmin = buf->text->mule_charbpos_cache[found]; | |
| 2765 bytmax = bytmin = buf->text->mule_bytebpos_cache[found]; | |
| 2766 size = 1; | |
| 2767 } | |
| 2768 } | |
| 2769 | |
| 2770 /* It's conceivable that the caching above could lead to X being | |
| 2771 the same as one of the range edges. */ | |
| 2772 if (x >= bufmax) | |
| 2773 { | |
| 2774 Bytebpos newmax; | |
| 2775 Bytecount newsize; | |
| 2776 | |
| 2777 forward_p = 1; | |
| 2778 while (x > bufmax) | |
| 2779 { | |
| 2780 newmax = bytmax; | |
| 2781 | |
| 2782 INC_BYTEBPOS (buf, newmax); | |
| 2783 newsize = newmax - bytmax; | |
| 2784 if (newsize != size) | |
| 2785 { | |
| 2786 bufmin = bufmax; | |
| 2787 bytmin = bytmax; | |
| 2788 size = newsize; | |
| 2789 } | |
| 2790 bytmax = newmax; | |
| 2791 bufmax++; | |
| 2792 } | |
| 2793 retval = bytmax; | |
| 2794 | |
| 2795 /* #### Should go past the found location to reduce the number | |
| 2796 of times that this function is called */ | |
| 2797 } | |
| 2798 else /* x < bufmin */ | |
| 2799 { | |
| 2800 Bytebpos newmin; | |
| 2801 Bytecount newsize; | |
| 2802 | |
| 2803 forward_p = 0; | |
| 2804 while (x < bufmin) | |
| 2805 { | |
| 2806 newmin = bytmin; | |
| 2807 | |
| 2808 DEC_BYTEBPOS (buf, newmin); | |
| 2809 newsize = bytmin - newmin; | |
| 2810 if (newsize != size) | |
| 2811 { | |
| 2812 bufmax = bufmin; | |
| 2813 bytmax = bytmin; | |
| 2814 size = newsize; | |
| 2815 } | |
| 2816 bytmin = newmin; | |
| 2817 bufmin--; | |
| 2818 } | |
| 2819 retval = bytmin; | |
| 2820 | |
| 2821 /* #### Should go past the found location to reduce the number | |
| 2822 of times that this function is called | |
| 2823 */ | |
| 2824 } | |
| 2825 | |
| 2826 /* If size is three, than we have to max sure that the range we | |
| 2827 discovered isn't too large, because we use a fixed-length | |
| 2828 table to divide by 3. */ | |
| 2829 | |
| 2830 if (size == 3) | |
| 2831 { | |
| 2832 int gap = bytmax - bytmin; | |
| 2833 buf->text->mule_three_p = 1; | |
| 2834 buf->text->mule_shifter = 1; | |
| 2835 | |
| 2836 if (gap > MAX_BYTEBPOS_GAP_SIZE_3) | |
| 2837 { | |
| 2838 if (forward_p) | |
| 2839 { | |
| 2840 bytmin = bytmax - MAX_BYTEBPOS_GAP_SIZE_3; | |
| 2841 bufmin = bufmax - MAX_CHARBPOS_GAP_SIZE_3; | |
| 2842 } | |
| 2843 else | |
| 2844 { | |
| 2845 bytmax = bytmin + MAX_BYTEBPOS_GAP_SIZE_3; | |
| 2846 bufmax = bufmin + MAX_CHARBPOS_GAP_SIZE_3; | |
| 2847 } | |
| 2848 } | |
| 2849 } | |
| 2850 else | |
| 2851 { | |
| 2852 buf->text->mule_three_p = 0; | |
| 2853 if (size == 4) | |
| 2854 buf->text->mule_shifter = 2; | |
| 2855 else | |
| 2856 buf->text->mule_shifter = size - 1; | |
| 2857 } | |
| 2858 | |
| 2859 buf->text->mule_bufmin = bufmin; | |
| 2860 buf->text->mule_bufmax = bufmax; | |
| 2861 buf->text->mule_bytmin = bytmin; | |
| 2862 buf->text->mule_bytmax = bytmax; | |
| 2863 | |
| 2864 if (add_to_cache) | |
| 2865 { | |
| 2866 int replace_loc; | |
| 2867 | |
| 2868 /* We throw away a "random" cached value and replace it with | |
| 2869 the new value. It doesn't actually have to be very random | |
| 2870 at all, just evenly distributed. | |
| 2871 | |
| 2872 #### It would be better to use a least-recently-used algorithm | |
| 2873 or something that tries to space things out, but I'm not sure | |
| 2874 it's worth it to go to the trouble of maintaining that. */ | |
| 2875 not_very_random_number += 621; | |
| 2876 replace_loc = not_very_random_number & 15; | |
| 2877 buf->text->mule_charbpos_cache[replace_loc] = x; | |
| 2878 buf->text->mule_bytebpos_cache[replace_loc] = retval; | |
| 2879 } | |
| 2880 | |
| 2367 | 2881 #endif /* OLD_BYTE_CHAR */ |
| 2882 | |
| 2883 done: | |
| 1292 | 2884 PROFILE_RECORD_EXITING_SECTION (QSin_char_byte_conversion); |
| 2885 | |
| 771 | 2886 return retval; |
| 2887 } | |
| 2888 | |
| 2367 | 2889 #undef CONSIDER |
| 2890 | |
| 2891 /* bytepos_to_charpos returns the char position corresponding to BYTEPOS. */ | |
| 2892 | |
| 2893 /* This macro is a subroutine of bytebpos_to_charbpos_func. | |
| 2894 It is used when BYTEPOS is actually the byte position. */ | |
| 2895 | |
| 2896 #define CONSIDER(BYTEPOS, CHARPOS) \ | |
| 2897 do \ | |
| 2898 { \ | |
| 2899 Bytebpos this_bytepos = (BYTEPOS); \ | |
| 2900 int changed = 0; \ | |
| 2901 \ | |
| 2902 if (this_bytepos == x) \ | |
| 2903 { \ | |
| 2904 retval = (CHARPOS); \ | |
| 2905 goto done; \ | |
| 2906 } \ | |
| 2907 else if (this_bytepos > x) \ | |
| 2908 { \ | |
| 2909 if (this_bytepos < best_above_byte) \ | |
| 2910 { \ | |
| 2911 best_above = (CHARPOS); \ | |
| 2912 best_above_byte = this_bytepos; \ | |
| 2913 changed = 1; \ | |
| 2914 } \ | |
| 2915 } \ | |
| 2916 else if (this_bytepos > best_below_byte) \ | |
| 2917 { \ | |
| 2918 best_below = (CHARPOS); \ | |
| 2919 best_below_byte = this_bytepos; \ | |
| 2920 changed = 1; \ | |
| 2921 } \ | |
| 2922 \ | |
| 2923 if (changed) \ | |
| 2924 { \ | |
| 2925 if (best_above - best_below == best_above_byte - best_below_byte) \ | |
| 2926 { \ | |
| 2927 retval = best_below + (x - best_below_byte); \ | |
| 2928 goto done; \ | |
| 2929 } \ | |
| 2930 } \ | |
| 2931 } \ | |
| 2932 while (0) | |
| 2933 | |
| 771 | 2934 /* The logic in this function is almost identical to the logic in |
| 2935 the previous function. */ | |
| 2936 | |
| 2937 Charbpos | |
| 2938 bytebpos_to_charbpos_func (struct buffer *buf, Bytebpos x) | |
| 2939 { | |
| 2367 | 2940 #ifdef OLD_BYTE_CHAR |
| 771 | 2941 Charbpos bufmin; |
| 2942 Charbpos bufmax; | |
| 2943 Bytebpos bytmin; | |
| 2944 Bytebpos bytmax; | |
| 2945 int size; | |
| 2946 int forward_p; | |
| 2947 int diff_so_far; | |
| 2948 int add_to_cache = 0; | |
| 2367 | 2949 #endif /* OLD_BYTE_CHAR */ |
| 2950 | |
| 2951 Charbpos best_above, best_above_byte; | |
| 2952 Bytebpos best_below, best_below_byte; | |
| 2953 int i; | |
| 2954 struct buffer_text *t; | |
| 2955 Charbpos retval; | |
| 2956 | |
| 1292 | 2957 PROFILE_DECLARE (); |
| 771 | 2958 |
| 1292 | 2959 PROFILE_RECORD_ENTERING_SECTION (QSin_char_byte_conversion); |
| 2960 | |
| 2367 | 2961 best_above = BUF_Z (buf); |
| 2962 best_above_byte = BYTE_BUF_Z (buf); | |
| 2963 | |
| 2964 /* In this case, we simply have all one-byte characters. But this should | |
| 2965 have been intercepted before, in bytebpos_to_charbpos(). */ | |
| 2966 text_checking_assert (best_above != best_above_byte); | |
| 2967 | |
| 2968 best_below = BUF_BEG (buf); | |
| 2969 best_below_byte = BYTE_BUF_BEG (buf); | |
| 2970 | |
| 2971 CONSIDER (BYTE_BUF_PT (buf), BUF_PT (buf)); | |
| 2972 CONSIDER (BYTE_BUF_GPT (buf), BUF_GPT (buf)); | |
| 2973 CONSIDER (BYTE_BUF_BEGV (buf), BUF_BEGV (buf)); | |
| 2974 CONSIDER (BYTE_BUF_ZV (buf), BUF_ZV (buf)); | |
| 2975 | |
| 2976 t = buf->text; | |
| 2977 CONSIDER (t->cached_bytepos, t->cached_charpos); | |
| 2978 | |
| 2979 /* Check the most recently entered positions first */ | |
| 2980 | |
| 2981 for (i = t->next_cache_pos - 1; i >= 0; i--) | |
| 2982 { | |
| 2983 CONSIDER (t->mule_bytebpos_cache[i], t->mule_charbpos_cache[i]); | |
| 2984 | |
| 2985 /* If we are down to a range of 50 chars, | |
| 2986 don't bother checking any other markers; | |
| 2987 scan the intervening chars directly now. */ | |
| 2988 if (best_above - best_below < 50) | |
| 2989 break; | |
| 2990 } | |
| 2991 | |
| 2992 /* We get here if we did not exactly hit one of the known places. | |
| 2993 We have one known above and one known below. | |
| 2994 Scan, counting characters, from whichever one is closer. */ | |
| 2995 | |
| 2996 if (x - best_below_byte < best_above_byte - x) | |
| 2997 { | |
| 2998 int record = x - best_below_byte > 5000; | |
| 2999 | |
| 3000 #ifdef OLD_LOOP /* old code */ | |
|
4526
38493c0fb952
Fix accidental deletion in src/text.c.
Stephen J. Turnbull <stephen@xemacs.org>
parents:
4525
diff
changeset
|
3001 while (best_below_byte < x) |
| 2367 | 3002 { |
| 3003 best_below++; | |
| 3004 INC_BYTEBPOS (buf, best_below_byte); | |
| 3005 } | |
| 3006 #else | |
| 3007 text_checking_assert (BUF_FORMAT (buf) == FORMAT_DEFAULT); | |
| 3008 /* The gap should not occur between best_below and x, or we will be | |
| 3009 screwed in using charcount_to_bytecount(). It should not be exactly | |
| 3010 at x either, because we already should have caught that. */ | |
| 3011 text_checking_assert | |
| 3012 (BYTE_BUF_CEILING_OF_IGNORE_ACCESSIBLE (buf, best_below_byte) > x); | |
| 3013 | |
| 3014 /* Using bytecount_to_charcount() is potentially a lot faster than | |
| 3015 a simple loop above using INC_BYTEBPOS(); see above. | |
| 3016 */ | |
| 3017 best_below += | |
| 3018 bytecount_to_charcount | |
| 3019 (BYTE_BUF_BYTE_ADDRESS (buf, best_below_byte), x - best_below_byte); | |
| 3020 best_below_byte = x; | |
| 3021 #endif | |
| 3022 | |
| 3023 /* If this position is quite far from the nearest known position, | |
| 3024 cache the correspondence. | |
| 3025 | |
| 3026 NB FSF does this: "... by creating a marker here. | |
| 3027 It will last until the next GC." | |
| 3028 */ | |
| 3029 | |
| 3030 if (record) | |
| 3031 { | |
| 3032 if (t->next_cache_pos == NUM_CACHED_POSITIONS) | |
| 3033 { | |
| 3034 memmove (t->mule_charbpos_cache, | |
| 3035 t->mule_charbpos_cache + NUM_MOVED_POSITIONS, | |
| 3036 sizeof (Charbpos) * | |
| 3037 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 3038 memmove (t->mule_bytebpos_cache, | |
| 3039 t->mule_bytebpos_cache + NUM_MOVED_POSITIONS, | |
| 3040 sizeof (Bytebpos) * | |
| 3041 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 3042 t->next_cache_pos -= NUM_MOVED_POSITIONS; | |
| 3043 } | |
| 3044 t->mule_charbpos_cache[t->next_cache_pos] = best_below; | |
| 3045 t->mule_bytebpos_cache[t->next_cache_pos] = best_below_byte; | |
| 3046 t->next_cache_pos++; | |
| 3047 } | |
| 3048 | |
| 3049 | |
| 3050 t->cached_charpos = best_below; | |
| 3051 t->cached_bytepos = best_below_byte; | |
| 3052 | |
| 3053 retval = best_below; | |
| 3054 text_checking_assert (best_below_byte >= best_below); | |
| 3055 goto done; | |
| 3056 } | |
| 3057 else | |
| 3058 { | |
| 3059 int record = best_above_byte - x > 5000; | |
| 3060 | |
| 3061 #ifdef OLD_LOOP /* old code */ | |
| 3062 while (best_above_byte > x) | |
| 3063 { | |
| 3064 best_above--; | |
| 3065 DEC_BYTEBPOS (buf, best_above_byte); | |
| 3066 } | |
| 3067 #else | |
| 3068 text_checking_assert (BUF_FORMAT (buf) == FORMAT_DEFAULT); | |
| 3069 /* The gap should not occur between best_above and x, or we will be | |
| 3070 screwed in using bytecount_to_charcount_down(). It should not be | |
| 3071 exactly at x either, because we already should have caught | |
| 3072 that. */ | |
| 3073 text_checking_assert | |
| 3074 (BYTE_BUF_FLOOR_OF_IGNORE_ACCESSIBLE (buf, best_above_byte) < x); | |
| 3075 | |
| 3076 /* Using bytecount_to_charcount_down() is potentially a lot faster | |
| 3077 than a simple loop using INC_BYTEBPOS(); see above. */ | |
| 3078 best_above -= | |
| 3079 bytecount_to_charcount_down | |
| 3080 /* BYTE_BUF_BYTE_ADDRESS will return a value on the high side of the | |
| 3081 gap if we are at the gap, which is the wrong side. So do the | |
| 3082 following trick instead. */ | |
| 3083 (BYTE_BUF_BYTE_ADDRESS_BEFORE (buf, best_above_byte) + 1, | |
| 3084 best_above_byte - x); | |
| 3085 best_above_byte = x; | |
| 3086 #endif | |
| 3087 | |
| 3088 | |
| 3089 /* If this position is quite far from the nearest known position, | |
| 3090 cache the correspondence. | |
| 3091 | |
| 3092 NB FSF does this: "... by creating a marker here. | |
| 3093 It will last until the next GC." | |
| 3094 */ | |
| 3095 if (record) | |
| 3096 { | |
| 3097 if (t->next_cache_pos == NUM_CACHED_POSITIONS) | |
| 3098 { | |
| 3099 memmove (t->mule_charbpos_cache, | |
| 3100 t->mule_charbpos_cache + NUM_MOVED_POSITIONS, | |
| 3101 sizeof (Charbpos) * | |
| 3102 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 3103 memmove (t->mule_bytebpos_cache, | |
| 3104 t->mule_bytebpos_cache + NUM_MOVED_POSITIONS, | |
| 3105 sizeof (Bytebpos) * | |
| 3106 (NUM_CACHED_POSITIONS - NUM_MOVED_POSITIONS)); | |
| 3107 t->next_cache_pos -= NUM_MOVED_POSITIONS; | |
| 3108 } | |
| 3109 t->mule_charbpos_cache[t->next_cache_pos] = best_above; | |
| 3110 t->mule_bytebpos_cache[t->next_cache_pos] = best_above_byte; | |
| 3111 t->next_cache_pos++; | |
| 3112 } | |
| 3113 | |
| 3114 t->cached_charpos = best_above; | |
| 3115 t->cached_bytepos = best_above_byte; | |
| 3116 | |
| 3117 retval = best_above; | |
| 3118 text_checking_assert (best_above_byte >= best_above); | |
| 3119 goto done; | |
| 3120 } | |
| 3121 | |
| 3122 #ifdef OLD_BYTE_CHAR | |
| 3123 | |
| 771 | 3124 bufmin = buf->text->mule_bufmin; |
| 3125 bufmax = buf->text->mule_bufmax; | |
| 3126 bytmin = buf->text->mule_bytmin; | |
| 3127 bytmax = buf->text->mule_bytmax; | |
| 3128 size = (1 << buf->text->mule_shifter) + !!buf->text->mule_three_p; | |
| 3129 | |
| 3130 /* The basic idea here is that we shift the "known region" up or down | |
| 3131 until it overlaps the specified position. We do this by moving | |
| 3132 the upper bound of the known region up one character at a time, | |
| 3133 and moving the lower bound of the known region up as necessary | |
| 3134 when the size of the character just seen changes. | |
| 3135 | |
| 3136 We optimize this, however, by first shifting the known region to | |
| 826 | 3137 one of the cached points if it's close by. (We don't check BYTE_BEG or |
| 3138 BYTE_Z, even though they're cached; most of the time these will be the | |
| 3139 same as BYTE_BEGV and BYTE_ZV, and when they're not, they're not likely | |
| 771 | 3140 to be used.) */ |
| 3141 | |
| 3142 if (x > bytmax) | |
| 3143 { | |
| 3144 Bytebpos diffmax = x - bytmax; | |
| 826 | 3145 Bytebpos diffpt = x - BYTE_BUF_PT (buf); |
| 3146 Bytebpos diffzv = BYTE_BUF_ZV (buf) - x; | |
| 771 | 3147 /* #### This value could stand some more exploration. */ |
| 3148 Bytecount heuristic_hack = (bytmax - bytmin) >> 2; | |
| 3149 | |
| 3150 /* Check if the position is closer to PT or ZV than to the | |
| 3151 end of the known region. */ | |
| 3152 | |
| 3153 if (diffpt < 0) | |
| 3154 diffpt = -diffpt; | |
| 3155 if (diffzv < 0) | |
| 3156 diffzv = -diffzv; | |
| 3157 | |
| 3158 /* But also implement a heuristic that favors the known region | |
| 826 | 3159 over BYTE_PT or BYTE_ZV. The reason for this is that switching to |
| 3160 BYTE_PT or BYTE_ZV will wipe out the knowledge in the known region, | |
| 771 | 3161 which might be annoying if the known region is large and |
| 826 | 3162 BYTE_PT or BYTE_ZV is not that much closer than the end of the known |
| 771 | 3163 region. */ |
| 3164 | |
| 3165 diffzv += heuristic_hack; | |
| 3166 diffpt += heuristic_hack; | |
| 3167 if (diffpt < diffmax && diffpt <= diffzv) | |
| 3168 { | |
| 3169 bufmax = bufmin = BUF_PT (buf); | |
| 826 | 3170 bytmax = bytmin = BYTE_BUF_PT (buf); |
| 771 | 3171 /* We set the size to 1 even though it doesn't really |
| 3172 matter because the new known region contains no | |
| 3173 characters. We do this because this is the most | |
| 3174 likely size of the characters around the new known | |
| 3175 region, and we avoid potential yuckiness that is | |
| 3176 done when size == 3. */ | |
| 3177 size = 1; | |
| 3178 } | |
| 3179 if (diffzv < diffmax) | |
| 3180 { | |
| 3181 bufmax = bufmin = BUF_ZV (buf); | |
| 826 | 3182 bytmax = bytmin = BYTE_BUF_ZV (buf); |
| 771 | 3183 size = 1; |
| 3184 } | |
| 3185 } | |
| 800 | 3186 #ifdef ERROR_CHECK_TEXT |
| 771 | 3187 else if (x >= bytmin) |
| 2500 | 3188 ABORT (); |
| 771 | 3189 #endif |
| 3190 else | |
| 3191 { | |
| 3192 Bytebpos diffmin = bytmin - x; | |
| 826 | 3193 Bytebpos diffpt = BYTE_BUF_PT (buf) - x; |
| 3194 Bytebpos diffbegv = x - BYTE_BUF_BEGV (buf); | |
| 771 | 3195 /* #### This value could stand some more exploration. */ |
| 3196 Bytecount heuristic_hack = (bytmax - bytmin) >> 2; | |
| 3197 | |
| 3198 if (diffpt < 0) | |
| 3199 diffpt = -diffpt; | |
| 3200 if (diffbegv < 0) | |
| 3201 diffbegv = -diffbegv; | |
| 3202 | |
| 3203 /* But also implement a heuristic that favors the known region -- | |
| 3204 see above. */ | |
| 3205 | |
| 3206 diffbegv += heuristic_hack; | |
| 3207 diffpt += heuristic_hack; | |
| 3208 | |
| 3209 if (diffpt < diffmin && diffpt <= diffbegv) | |
| 3210 { | |
| 3211 bufmax = bufmin = BUF_PT (buf); | |
| 826 | 3212 bytmax = bytmin = BYTE_BUF_PT (buf); |
| 771 | 3213 /* We set the size to 1 even though it doesn't really |
| 3214 matter because the new known region contains no | |
| 3215 characters. We do this because this is the most | |
| 3216 likely size of the characters around the new known | |
| 3217 region, and we avoid potential yuckiness that is | |
| 3218 done when size == 3. */ | |
| 3219 size = 1; | |
| 3220 } | |
| 3221 if (diffbegv < diffmin) | |
| 3222 { | |
| 3223 bufmax = bufmin = BUF_BEGV (buf); | |
| 826 | 3224 bytmax = bytmin = BYTE_BUF_BEGV (buf); |
| 771 | 3225 size = 1; |
| 3226 } | |
| 3227 } | |
| 3228 | |
| 3229 diff_so_far = x > bytmax ? x - bytmax : bytmin - x; | |
| 3230 if (diff_so_far > 50) | |
| 3231 { | |
| 3232 /* If we have to move more than a certain amount, then look | |
| 3233 into our cache. */ | |
| 3234 int minval = INT_MAX; | |
| 3235 int found = 0; | |
| 3236 int i; | |
| 3237 | |
| 3238 add_to_cache = 1; | |
| 3239 /* I considered keeping the positions ordered. This would speed | |
| 3240 up this loop, but updating the cache would take longer, so | |
| 3241 it doesn't seem like it would really matter. */ | |
| 2367 | 3242 for (i = 0; i < NUM_CACHED_POSITIONS; i++) |
| 771 | 3243 { |
| 3244 int diff = buf->text->mule_bytebpos_cache[i] - x; | |
| 3245 | |
| 3246 if (diff < 0) | |
| 3247 diff = -diff; | |
| 3248 if (diff < minval) | |
| 3249 { | |
| 3250 minval = diff; | |
| 3251 found = i; | |
| 3252 } | |
| 3253 } | |
| 3254 | |
| 3255 if (minval < diff_so_far) | |
| 3256 { | |
| 3257 bufmax = bufmin = buf->text->mule_charbpos_cache[found]; | |
| 3258 bytmax = bytmin = buf->text->mule_bytebpos_cache[found]; | |
| 3259 size = 1; | |
| 3260 } | |
| 3261 } | |
| 3262 | |
| 3263 /* It's conceivable that the caching above could lead to X being | |
| 3264 the same as one of the range edges. */ | |
| 3265 if (x >= bytmax) | |
| 3266 { | |
| 3267 Bytebpos newmax; | |
| 3268 Bytecount newsize; | |
| 3269 | |
| 3270 forward_p = 1; | |
| 3271 while (x > bytmax) | |
| 3272 { | |
| 3273 newmax = bytmax; | |
| 3274 | |
| 3275 INC_BYTEBPOS (buf, newmax); | |
| 3276 newsize = newmax - bytmax; | |
| 3277 if (newsize != size) | |
| 3278 { | |
| 3279 bufmin = bufmax; | |
| 3280 bytmin = bytmax; | |
| 3281 size = newsize; | |
| 3282 } | |
| 3283 bytmax = newmax; | |
| 3284 bufmax++; | |
| 3285 } | |
| 3286 retval = bufmax; | |
| 3287 | |
| 3288 /* #### Should go past the found location to reduce the number | |
| 3289 of times that this function is called */ | |
| 3290 } | |
| 3291 else /* x <= bytmin */ | |
| 3292 { | |
| 3293 Bytebpos newmin; | |
| 3294 Bytecount newsize; | |
| 3295 | |
| 3296 forward_p = 0; | |
| 3297 while (x < bytmin) | |
| 3298 { | |
| 3299 newmin = bytmin; | |
| 3300 | |
| 3301 DEC_BYTEBPOS (buf, newmin); | |
| 3302 newsize = bytmin - newmin; | |
| 3303 if (newsize != size) | |
| 3304 { | |
| 3305 bufmax = bufmin; | |
| 3306 bytmax = bytmin; | |
| 3307 size = newsize; | |
| 3308 } | |
| 3309 bytmin = newmin; | |
| 3310 bufmin--; | |
| 3311 } | |
| 3312 retval = bufmin; | |
| 3313 | |
| 3314 /* #### Should go past the found location to reduce the number | |
| 3315 of times that this function is called | |
| 3316 */ | |
| 3317 } | |
| 3318 | |
| 3319 /* If size is three, than we have to max sure that the range we | |
| 3320 discovered isn't too large, because we use a fixed-length | |
| 3321 table to divide by 3. */ | |
| 3322 | |
| 3323 if (size == 3) | |
| 3324 { | |
| 3325 int gap = bytmax - bytmin; | |
| 3326 buf->text->mule_three_p = 1; | |
| 3327 buf->text->mule_shifter = 1; | |
| 3328 | |
| 3329 if (gap > MAX_BYTEBPOS_GAP_SIZE_3) | |
| 3330 { | |
| 3331 if (forward_p) | |
| 3332 { | |
| 3333 bytmin = bytmax - MAX_BYTEBPOS_GAP_SIZE_3; | |
| 3334 bufmin = bufmax - MAX_CHARBPOS_GAP_SIZE_3; | |
| 3335 } | |
| 3336 else | |
| 3337 { | |
| 3338 bytmax = bytmin + MAX_BYTEBPOS_GAP_SIZE_3; | |
| 3339 bufmax = bufmin + MAX_CHARBPOS_GAP_SIZE_3; | |
| 3340 } | |
| 3341 } | |
| 3342 } | |
| 3343 else | |
| 3344 { | |
| 3345 buf->text->mule_three_p = 0; | |
| 3346 if (size == 4) | |
| 3347 buf->text->mule_shifter = 2; | |
| 3348 else | |
| 3349 buf->text->mule_shifter = size - 1; | |
| 3350 } | |
| 3351 | |
| 3352 buf->text->mule_bufmin = bufmin; | |
| 3353 buf->text->mule_bufmax = bufmax; | |
| 3354 buf->text->mule_bytmin = bytmin; | |
| 3355 buf->text->mule_bytmax = bytmax; | |
| 3356 | |
| 3357 if (add_to_cache) | |
| 3358 { | |
| 3359 int replace_loc; | |
| 3360 | |
| 3361 /* We throw away a "random" cached value and replace it with | |
| 3362 the new value. It doesn't actually have to be very random | |
| 3363 at all, just evenly distributed. | |
| 3364 | |
| 3365 #### It would be better to use a least-recently-used algorithm | |
| 3366 or something that tries to space things out, but I'm not sure | |
| 3367 it's worth it to go to the trouble of maintaining that. */ | |
| 3368 not_very_random_number += 621; | |
| 3369 replace_loc = not_very_random_number & 15; | |
| 3370 buf->text->mule_charbpos_cache[replace_loc] = retval; | |
| 3371 buf->text->mule_bytebpos_cache[replace_loc] = x; | |
| 3372 } | |
| 2367 | 3373 #endif /* OLD_BYTE_CHAR */ |
| 3374 | |
| 3375 done: | |
| 1292 | 3376 PROFILE_RECORD_EXITING_SECTION (QSin_char_byte_conversion); |
| 3377 | |
| 771 | 3378 return retval; |
| 3379 } | |
| 3380 | |
| 3381 /* Text of length BYTELENGTH and CHARLENGTH (in different units) | |
| 3382 was inserted at charbpos START. */ | |
| 3383 | |
| 3384 void | |
| 3385 buffer_mule_signal_inserted_region (struct buffer *buf, Charbpos start, | |
| 3386 Bytecount bytelength, | |
| 3387 Charcount charlength) | |
| 3388 { | |
| 2367 | 3389 #ifdef OLD_BYTE_CHAR |
| 771 | 3390 int size = (1 << buf->text->mule_shifter) + !!buf->text->mule_three_p; |
| 2367 | 3391 #endif /* OLD_BYTE_CHAR */ |
| 771 | 3392 int i; |
| 3393 | |
| 3394 /* Adjust the cache of known positions. */ | |
| 2367 | 3395 for (i = 0; i < buf->text->next_cache_pos; i++) |
| 771 | 3396 { |
| 3397 | |
| 3398 if (buf->text->mule_charbpos_cache[i] > start) | |
| 3399 { | |
| 3400 buf->text->mule_charbpos_cache[i] += charlength; | |
| 3401 buf->text->mule_bytebpos_cache[i] += bytelength; | |
| 3402 } | |
| 3403 } | |
| 3404 | |
| 2367 | 3405 /* Adjust the special cached position. */ |
| 3406 | |
| 3407 if (buf->text->cached_charpos > start) | |
| 3408 { | |
| 3409 buf->text->cached_charpos += charlength; | |
| 3410 buf->text->cached_bytepos += bytelength; | |
| 3411 } | |
| 3412 | |
| 3413 #ifdef OLD_BYTE_CHAR | |
| 771 | 3414 if (start >= buf->text->mule_bufmax) |
| 826 | 3415 return; |
| 771 | 3416 |
| 3417 /* The insertion is either before the known region, in which case | |
| 3418 it shoves it forward; or within the known region, in which case | |
| 3419 it shoves the end forward. (But it may make the known region | |
| 3420 inconsistent, so we may have to shorten it.) */ | |
| 3421 | |
| 3422 if (start <= buf->text->mule_bufmin) | |
| 3423 { | |
| 3424 buf->text->mule_bufmin += charlength; | |
| 3425 buf->text->mule_bufmax += charlength; | |
| 3426 buf->text->mule_bytmin += bytelength; | |
| 3427 buf->text->mule_bytmax += bytelength; | |
| 3428 } | |
| 3429 else | |
| 3430 { | |
| 3431 Charbpos end = start + charlength; | |
| 3432 /* the insertion point divides the known region in two. | |
| 3433 Keep the longer half, at least, and expand into the | |
| 3434 inserted chunk as much as possible. */ | |
| 3435 | |
| 3436 if (start - buf->text->mule_bufmin > buf->text->mule_bufmax - start) | |
| 3437 { | |
| 3438 Bytebpos bytestart = (buf->text->mule_bytmin | |
| 3439 + size * (start - buf->text->mule_bufmin)); | |
| 3440 Bytebpos bytenew; | |
| 3441 | |
| 3442 while (start < end) | |
| 3443 { | |
| 3444 bytenew = bytestart; | |
| 3445 INC_BYTEBPOS (buf, bytenew); | |
| 3446 if (bytenew - bytestart != size) | |
| 3447 break; | |
| 3448 start++; | |
| 3449 bytestart = bytenew; | |
| 3450 } | |
| 3451 if (start != end) | |
| 3452 { | |
| 3453 buf->text->mule_bufmax = start; | |
| 3454 buf->text->mule_bytmax = bytestart; | |
| 3455 } | |
| 3456 else | |
| 3457 { | |
| 3458 buf->text->mule_bufmax += charlength; | |
| 3459 buf->text->mule_bytmax += bytelength; | |
| 3460 } | |
| 3461 } | |
| 3462 else | |
| 3463 { | |
| 3464 Bytebpos byteend = (buf->text->mule_bytmin | |
| 3465 + size * (start - buf->text->mule_bufmin) | |
| 3466 + bytelength); | |
| 3467 Bytebpos bytenew; | |
| 3468 | |
| 3469 buf->text->mule_bufmax += charlength; | |
| 3470 buf->text->mule_bytmax += bytelength; | |
| 3471 | |
| 3472 while (end > start) | |
| 3473 { | |
| 3474 bytenew = byteend; | |
| 3475 DEC_BYTEBPOS (buf, bytenew); | |
| 3476 if (byteend - bytenew != size) | |
| 3477 break; | |
| 3478 end--; | |
| 3479 byteend = bytenew; | |
| 3480 } | |
| 3481 if (start != end) | |
| 3482 { | |
| 3483 buf->text->mule_bufmin = end; | |
| 3484 buf->text->mule_bytmin = byteend; | |
| 3485 } | |
| 3486 } | |
| 3487 } | |
| 2367 | 3488 #endif /* OLD_BYTE_CHAR */ |
| 771 | 3489 } |
| 3490 | |
| 826 | 3491 /* Text from START to END (equivalent in Bytebpos's: from BYTE_START to |
| 3492 BYTE_END) was deleted. */ | |
| 771 | 3493 |
| 3494 void | |
| 3495 buffer_mule_signal_deleted_region (struct buffer *buf, Charbpos start, | |
| 826 | 3496 Charbpos end, Bytebpos byte_start, |
| 3497 Bytebpos byte_end) | |
| 771 | 3498 { |
| 3499 int i; | |
| 3500 | |
| 3501 /* Adjust the cache of known positions. */ | |
| 2367 | 3502 for (i = 0; i < buf->text->next_cache_pos; i++) |
| 771 | 3503 { |
| 3504 /* After the end; gets shoved backward */ | |
| 3505 if (buf->text->mule_charbpos_cache[i] > end) | |
| 3506 { | |
| 3507 buf->text->mule_charbpos_cache[i] -= end - start; | |
| 826 | 3508 buf->text->mule_bytebpos_cache[i] -= byte_end - byte_start; |
| 771 | 3509 } |
| 3510 /* In the range; moves to start of range */ | |
| 3511 else if (buf->text->mule_charbpos_cache[i] > start) | |
| 3512 { | |
| 3513 buf->text->mule_charbpos_cache[i] = start; | |
| 826 | 3514 buf->text->mule_bytebpos_cache[i] = byte_start; |
| 771 | 3515 } |
| 3516 } | |
| 3517 | |
| 2367 | 3518 /* Adjust the special cached position. */ |
| 3519 | |
| 3520 /* After the end; gets shoved backward */ | |
| 3521 if (buf->text->cached_charpos > end) | |
| 3522 { | |
| 3523 buf->text->cached_charpos -= end - start; | |
| 3524 buf->text->cached_bytepos -= byte_end - byte_start; | |
| 3525 } | |
| 3526 /* In the range; moves to start of range */ | |
| 3527 else if (buf->text->cached_charpos > start) | |
| 3528 { | |
| 3529 buf->text->cached_charpos = start; | |
| 3530 buf->text->cached_bytepos = byte_start; | |
| 3531 } | |
| 3532 | |
| 3533 #ifdef OLD_BYTE_CHAR | |
| 771 | 3534 /* We don't care about any text after the end of the known region. */ |
| 3535 | |
| 3536 end = min (end, buf->text->mule_bufmax); | |
| 826 | 3537 byte_end = min (byte_end, buf->text->mule_bytmax); |
| 771 | 3538 if (start >= end) |
| 826 | 3539 return; |
| 771 | 3540 |
| 3541 /* The end of the known region offsets by the total amount of deletion, | |
| 3542 since it's all before it. */ | |
| 3543 | |
| 3544 buf->text->mule_bufmax -= end - start; | |
| 826 | 3545 buf->text->mule_bytmax -= byte_end - byte_start; |
| 771 | 3546 |
| 3547 /* Now we don't care about any text after the start of the known region. */ | |
| 3548 | |
| 3549 end = min (end, buf->text->mule_bufmin); | |
| 826 | 3550 byte_end = min (byte_end, buf->text->mule_bytmin); |
| 771 | 3551 if (start < end) |
| 3552 { | |
| 3553 buf->text->mule_bufmin -= end - start; | |
| 826 | 3554 buf->text->mule_bytmin -= byte_end - byte_start; |
| 771 | 3555 } |
| 2367 | 3556 #endif /* OLD_BYTE_CHAR */ |
| 771 | 3557 } |
| 3558 | |
| 3559 #endif /* MULE */ | |
| 3560 | |
| 3561 | |
| 3562 /************************************************************************/ | |
| 3563 /* verifying buffer and string positions */ | |
| 3564 /************************************************************************/ | |
| 3565 | |
| 3566 /* Functions below are tagged with either _byte or _char indicating | |
| 3567 whether they return byte or character positions. For a buffer, | |
| 3568 a character position is a "Charbpos" and a byte position is a "Bytebpos". | |
| 3569 For strings, these are sometimes typed using "Charcount" and | |
| 3570 "Bytecount". */ | |
| 3571 | |
| 3572 /* Flags for the functions below are: | |
| 3573 | |
| 3574 GB_ALLOW_PAST_ACCESSIBLE | |
| 3575 | |
| 3576 Allow positions to range over the entire buffer (BUF_BEG to BUF_Z), | |
| 3577 rather than just the accessible portion (BUF_BEGV to BUF_ZV). | |
| 3578 For strings, this flag has no effect. | |
| 3579 | |
| 3580 GB_COERCE_RANGE | |
| 3581 | |
| 3582 If the position is outside the allowable range, return the lower | |
| 3583 or upper bound of the range, whichever is closer to the specified | |
| 3584 position. | |
| 3585 | |
| 3586 GB_NO_ERROR_IF_BAD | |
| 3587 | |
| 3588 If the position is outside the allowable range, return -1. | |
| 3589 | |
| 3590 GB_NEGATIVE_FROM_END | |
| 3591 | |
| 3592 If a value is negative, treat it as an offset from the end. | |
| 3593 Only applies to strings. | |
| 3594 | |
| 3595 The following additional flags apply only to the functions | |
| 3596 that return ranges: | |
| 3597 | |
| 3598 GB_ALLOW_NIL | |
| 3599 | |
| 3600 Either or both positions can be nil. If FROM is nil, | |
| 3601 FROM_OUT will contain the lower bound of the allowed range. | |
| 3602 If TO is nil, TO_OUT will contain the upper bound of the | |
| 3603 allowed range. | |
| 3604 | |
| 3605 GB_CHECK_ORDER | |
| 3606 | |
| 3607 FROM must contain the lower bound and TO the upper bound | |
| 3608 of the range. If the positions are reversed, an error is | |
| 3609 signalled. | |
| 3610 | |
| 3611 The following is a combination flag: | |
| 3612 | |
| 3613 GB_HISTORICAL_STRING_BEHAVIOR | |
| 3614 | |
| 3615 Equivalent to (GB_NEGATIVE_FROM_END | GB_ALLOW_NIL). | |
| 3616 */ | |
| 3617 | |
| 3618 /* Return a buffer position stored in a Lisp_Object. Full | |
| 3619 error-checking is done on the position. Flags can be specified to | |
| 3620 control the behavior of out-of-range values. The default behavior | |
| 3621 is to require that the position is within the accessible part of | |
| 3622 the buffer (BEGV and ZV), and to signal an error if the position is | |
| 3623 out of range. | |
| 3624 | |
| 3625 */ | |
| 3626 | |
| 3627 Charbpos | |
| 3628 get_buffer_pos_char (struct buffer *b, Lisp_Object pos, unsigned int flags) | |
| 3629 { | |
| 3630 /* Does not GC */ | |
| 3631 Charbpos ind; | |
| 3632 Charbpos min_allowed, max_allowed; | |
| 3633 | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
3634 CHECK_FIXNUM_COERCE_MARKER (pos); |
|
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
3635 ind = XFIXNUM (pos); |
| 771 | 3636 min_allowed = flags & GB_ALLOW_PAST_ACCESSIBLE ? BUF_BEG (b) : BUF_BEGV (b); |
| 3637 max_allowed = flags & GB_ALLOW_PAST_ACCESSIBLE ? BUF_Z (b) : BUF_ZV (b); | |
| 3638 | |
| 3639 if (ind < min_allowed || ind > max_allowed) | |
| 3640 { | |
| 3641 if (flags & GB_COERCE_RANGE) | |
| 3642 ind = ind < min_allowed ? min_allowed : max_allowed; | |
| 3643 else if (flags & GB_NO_ERROR_IF_BAD) | |
| 3644 ind = -1; | |
| 3645 else | |
| 3646 { | |
| 793 | 3647 Lisp_Object buffer = wrap_buffer (b); |
| 3648 | |
| 771 | 3649 args_out_of_range (buffer, pos); |
| 3650 } | |
| 3651 } | |
| 3652 | |
| 3653 return ind; | |
| 3654 } | |
| 3655 | |
| 3656 Bytebpos | |
| 3657 get_buffer_pos_byte (struct buffer *b, Lisp_Object pos, unsigned int flags) | |
| 3658 { | |
| 3659 Charbpos bpos = get_buffer_pos_char (b, pos, flags); | |
| 3660 if (bpos < 0) /* could happen with GB_NO_ERROR_IF_BAD */ | |
| 3661 return -1; | |
| 3662 return charbpos_to_bytebpos (b, bpos); | |
| 3663 } | |
| 3664 | |
| 3665 /* Return a pair of buffer positions representing a range of text, | |
| 3666 taken from a pair of Lisp_Objects. Full error-checking is | |
| 3667 done on the positions. Flags can be specified to control the | |
| 3668 behavior of out-of-range values. The default behavior is to | |
| 3669 allow the range bounds to be specified in either order | |
| 3670 (however, FROM_OUT will always be the lower bound of the range | |
| 3671 and TO_OUT the upper bound),to require that the positions | |
| 3672 are within the accessible part of the buffer (BEGV and ZV), | |
| 3673 and to signal an error if the positions are out of range. | |
| 3674 */ | |
| 3675 | |
| 3676 void | |
| 3677 get_buffer_range_char (struct buffer *b, Lisp_Object from, Lisp_Object to, | |
| 826 | 3678 Charbpos *from_out, Charbpos *to_out, |
| 3679 unsigned int flags) | |
| 771 | 3680 { |
| 3681 /* Does not GC */ | |
| 3682 Charbpos min_allowed, max_allowed; | |
| 3683 | |
| 3684 min_allowed = (flags & GB_ALLOW_PAST_ACCESSIBLE) ? | |
| 3685 BUF_BEG (b) : BUF_BEGV (b); | |
| 3686 max_allowed = (flags & GB_ALLOW_PAST_ACCESSIBLE) ? | |
| 3687 BUF_Z (b) : BUF_ZV (b); | |
| 3688 | |
| 3689 if (NILP (from) && (flags & GB_ALLOW_NIL)) | |
| 3690 *from_out = min_allowed; | |
| 3691 else | |
| 3692 *from_out = get_buffer_pos_char (b, from, flags | GB_NO_ERROR_IF_BAD); | |
| 3693 | |
| 3694 if (NILP (to) && (flags & GB_ALLOW_NIL)) | |
| 3695 *to_out = max_allowed; | |
| 3696 else | |
| 3697 *to_out = get_buffer_pos_char (b, to, flags | GB_NO_ERROR_IF_BAD); | |
| 3698 | |
| 3699 if ((*from_out < 0 || *to_out < 0) && !(flags & GB_NO_ERROR_IF_BAD)) | |
| 3700 { | |
| 793 | 3701 Lisp_Object buffer = wrap_buffer (b); |
| 3702 | |
| 771 | 3703 args_out_of_range_3 (buffer, from, to); |
| 3704 } | |
| 3705 | |
| 3706 if (*from_out >= 0 && *to_out >= 0 && *from_out > *to_out) | |
| 3707 { | |
| 3708 if (flags & GB_CHECK_ORDER) | |
| 3709 invalid_argument_2 ("start greater than end", from, to); | |
| 3710 else | |
| 3711 { | |
| 3712 Charbpos temp = *from_out; | |
| 3713 *from_out = *to_out; | |
| 3714 *to_out = temp; | |
| 3715 } | |
| 3716 } | |
| 3717 } | |
| 3718 | |
| 3719 void | |
| 3720 get_buffer_range_byte (struct buffer *b, Lisp_Object from, Lisp_Object to, | |
| 826 | 3721 Bytebpos *from_out, Bytebpos *to_out, |
| 3722 unsigned int flags) | |
| 771 | 3723 { |
| 3724 Charbpos s, e; | |
| 3725 | |
| 3726 get_buffer_range_char (b, from, to, &s, &e, flags); | |
| 3727 if (s >= 0) | |
| 3728 *from_out = charbpos_to_bytebpos (b, s); | |
| 3729 else /* could happen with GB_NO_ERROR_IF_BAD */ | |
| 3730 *from_out = -1; | |
| 3731 if (e >= 0) | |
| 3732 *to_out = charbpos_to_bytebpos (b, e); | |
| 3733 else | |
| 3734 *to_out = -1; | |
| 3735 } | |
| 3736 | |
| 3737 static Charcount | |
| 3738 get_string_pos_char_1 (Lisp_Object string, Lisp_Object pos, unsigned int flags, | |
| 3739 Charcount known_length) | |
| 3740 { | |
| 3741 Charcount ccpos; | |
| 3742 Charcount min_allowed = 0; | |
| 3743 Charcount max_allowed = known_length; | |
| 3744 | |
| 3745 /* Computation of KNOWN_LENGTH is potentially expensive so we pass | |
| 3746 it in. */ | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
3747 CHECK_FIXNUM (pos); |
|
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
3748 ccpos = XFIXNUM (pos); |
| 771 | 3749 if (ccpos < 0 && flags & GB_NEGATIVE_FROM_END) |
| 3750 ccpos += max_allowed; | |
| 3751 | |
| 3752 if (ccpos < min_allowed || ccpos > max_allowed) | |
| 3753 { | |
| 3754 if (flags & GB_COERCE_RANGE) | |
| 3755 ccpos = ccpos < min_allowed ? min_allowed : max_allowed; | |
| 3756 else if (flags & GB_NO_ERROR_IF_BAD) | |
| 3757 ccpos = -1; | |
| 3758 else | |
| 3759 args_out_of_range (string, pos); | |
| 3760 } | |
| 3761 | |
| 3762 return ccpos; | |
| 3763 } | |
| 3764 | |
| 3765 Charcount | |
| 3766 get_string_pos_char (Lisp_Object string, Lisp_Object pos, unsigned int flags) | |
| 3767 { | |
| 3768 return get_string_pos_char_1 (string, pos, flags, | |
| 826 | 3769 string_char_length (string)); |
| 771 | 3770 } |
| 3771 | |
| 3772 Bytecount | |
| 3773 get_string_pos_byte (Lisp_Object string, Lisp_Object pos, unsigned int flags) | |
| 3774 { | |
| 3775 Charcount ccpos = get_string_pos_char (string, pos, flags); | |
| 3776 if (ccpos < 0) /* could happen with GB_NO_ERROR_IF_BAD */ | |
| 3777 return -1; | |
| 793 | 3778 return string_index_char_to_byte (string, ccpos); |
| 771 | 3779 } |
| 3780 | |
| 3781 void | |
| 3782 get_string_range_char (Lisp_Object string, Lisp_Object from, Lisp_Object to, | |
| 3783 Charcount *from_out, Charcount *to_out, | |
| 3784 unsigned int flags) | |
| 3785 { | |
| 3786 Charcount min_allowed = 0; | |
| 826 | 3787 Charcount max_allowed = string_char_length (string); |
| 771 | 3788 |
| 3789 if (NILP (from) && (flags & GB_ALLOW_NIL)) | |
| 3790 *from_out = min_allowed; | |
| 3791 else | |
| 3792 *from_out = get_string_pos_char_1 (string, from, | |
| 3793 flags | GB_NO_ERROR_IF_BAD, | |
| 3794 max_allowed); | |
| 3795 | |
| 3796 if (NILP (to) && (flags & GB_ALLOW_NIL)) | |
| 3797 *to_out = max_allowed; | |
| 3798 else | |
| 3799 *to_out = get_string_pos_char_1 (string, to, | |
| 3800 flags | GB_NO_ERROR_IF_BAD, | |
| 3801 max_allowed); | |
| 3802 | |
| 3803 if ((*from_out < 0 || *to_out < 0) && !(flags & GB_NO_ERROR_IF_BAD)) | |
| 3804 args_out_of_range_3 (string, from, to); | |
| 3805 | |
| 3806 if (*from_out >= 0 && *to_out >= 0 && *from_out > *to_out) | |
| 3807 { | |
| 3808 if (flags & GB_CHECK_ORDER) | |
| 3809 invalid_argument_2 ("start greater than end", from, to); | |
| 3810 else | |
| 3811 { | |
| 3812 Charbpos temp = *from_out; | |
| 3813 *from_out = *to_out; | |
| 3814 *to_out = temp; | |
| 3815 } | |
| 3816 } | |
| 3817 } | |
| 3818 | |
| 3819 void | |
| 3820 get_string_range_byte (Lisp_Object string, Lisp_Object from, Lisp_Object to, | |
| 3821 Bytecount *from_out, Bytecount *to_out, | |
| 3822 unsigned int flags) | |
| 3823 { | |
| 3824 Charcount s, e; | |
| 3825 | |
| 3826 get_string_range_char (string, from, to, &s, &e, flags); | |
| 3827 if (s >= 0) | |
| 793 | 3828 *from_out = string_index_char_to_byte (string, s); |
| 771 | 3829 else /* could happen with GB_NO_ERROR_IF_BAD */ |
| 3830 *from_out = -1; | |
| 3831 if (e >= 0) | |
| 793 | 3832 *to_out = string_index_char_to_byte (string, e); |
| 771 | 3833 else |
| 3834 *to_out = -1; | |
| 3835 | |
| 3836 } | |
| 3837 | |
| 826 | 3838 Charxpos |
| 771 | 3839 get_buffer_or_string_pos_char (Lisp_Object object, Lisp_Object pos, |
| 3840 unsigned int flags) | |
| 3841 { | |
| 3842 return STRINGP (object) ? | |
| 3843 get_string_pos_char (object, pos, flags) : | |
| 3844 get_buffer_pos_char (XBUFFER (object), pos, flags); | |
| 3845 } | |
| 3846 | |
| 826 | 3847 Bytexpos |
| 771 | 3848 get_buffer_or_string_pos_byte (Lisp_Object object, Lisp_Object pos, |
| 3849 unsigned int flags) | |
| 3850 { | |
| 3851 return STRINGP (object) ? | |
| 3852 get_string_pos_byte (object, pos, flags) : | |
| 3853 get_buffer_pos_byte (XBUFFER (object), pos, flags); | |
| 3854 } | |
| 3855 | |
| 3856 void | |
| 3857 get_buffer_or_string_range_char (Lisp_Object object, Lisp_Object from, | |
| 826 | 3858 Lisp_Object to, Charxpos *from_out, |
| 3859 Charxpos *to_out, unsigned int flags) | |
| 771 | 3860 { |
| 3861 if (STRINGP (object)) | |
| 3862 get_string_range_char (object, from, to, from_out, to_out, flags); | |
| 3863 else | |
| 826 | 3864 get_buffer_range_char (XBUFFER (object), from, to, from_out, to_out, |
| 3865 flags); | |
| 771 | 3866 } |
| 3867 | |
| 3868 void | |
| 3869 get_buffer_or_string_range_byte (Lisp_Object object, Lisp_Object from, | |
| 826 | 3870 Lisp_Object to, Bytexpos *from_out, |
| 3871 Bytexpos *to_out, unsigned int flags) | |
| 771 | 3872 { |
| 3873 if (STRINGP (object)) | |
| 3874 get_string_range_byte (object, from, to, from_out, to_out, flags); | |
| 3875 else | |
| 826 | 3876 get_buffer_range_byte (XBUFFER (object), from, to, from_out, to_out, |
| 3877 flags); | |
| 771 | 3878 } |
| 3879 | |
| 826 | 3880 Charxpos |
| 771 | 3881 buffer_or_string_accessible_begin_char (Lisp_Object object) |
| 3882 { | |
| 3883 return STRINGP (object) ? 0 : BUF_BEGV (XBUFFER (object)); | |
| 3884 } | |
| 3885 | |
| 826 | 3886 Charxpos |
| 771 | 3887 buffer_or_string_accessible_end_char (Lisp_Object object) |
| 3888 { | |
| 3889 return STRINGP (object) ? | |
| 826 | 3890 string_char_length (object) : BUF_ZV (XBUFFER (object)); |
| 771 | 3891 } |
| 3892 | |
| 826 | 3893 Bytexpos |
| 771 | 3894 buffer_or_string_accessible_begin_byte (Lisp_Object object) |
| 3895 { | |
| 826 | 3896 return STRINGP (object) ? 0 : BYTE_BUF_BEGV (XBUFFER (object)); |
| 771 | 3897 } |
| 3898 | |
| 826 | 3899 Bytexpos |
| 771 | 3900 buffer_or_string_accessible_end_byte (Lisp_Object object) |
| 3901 { | |
| 3902 return STRINGP (object) ? | |
| 826 | 3903 XSTRING_LENGTH (object) : BYTE_BUF_ZV (XBUFFER (object)); |
| 771 | 3904 } |
| 3905 | |
| 826 | 3906 Charxpos |
| 771 | 3907 buffer_or_string_absolute_begin_char (Lisp_Object object) |
| 3908 { | |
| 3909 return STRINGP (object) ? 0 : BUF_BEG (XBUFFER (object)); | |
| 3910 } | |
| 3911 | |
| 826 | 3912 Charxpos |
| 771 | 3913 buffer_or_string_absolute_end_char (Lisp_Object object) |
| 3914 { | |
| 3915 return STRINGP (object) ? | |
| 826 | 3916 string_char_length (object) : BUF_Z (XBUFFER (object)); |
| 3917 } | |
| 3918 | |
| 3919 Bytexpos | |
| 3920 buffer_or_string_absolute_begin_byte (Lisp_Object object) | |
| 3921 { | |
| 3922 return STRINGP (object) ? 0 : BYTE_BUF_BEG (XBUFFER (object)); | |
| 3923 } | |
| 3924 | |
| 3925 Bytexpos | |
| 3926 buffer_or_string_absolute_end_byte (Lisp_Object object) | |
| 3927 { | |
| 3928 return STRINGP (object) ? | |
| 3929 XSTRING_LENGTH (object) : BYTE_BUF_Z (XBUFFER (object)); | |
| 3930 } | |
| 3931 | |
| 3932 Charbpos | |
| 3933 charbpos_clip_to_bounds (Charbpos lower, Charbpos num, Charbpos upper) | |
| 3934 { | |
| 3935 return (num < lower ? lower : | |
| 3936 num > upper ? upper : | |
| 3937 num); | |
| 771 | 3938 } |
| 3939 | |
| 3940 Bytebpos | |
| 826 | 3941 bytebpos_clip_to_bounds (Bytebpos lower, Bytebpos num, Bytebpos upper) |
| 3942 { | |
| 3943 return (num < lower ? lower : | |
| 3944 num > upper ? upper : | |
| 3945 num); | |
| 3946 } | |
| 3947 | |
| 3948 Charxpos | |
| 3949 charxpos_clip_to_bounds (Charxpos lower, Charxpos num, Charxpos upper) | |
| 771 | 3950 { |
| 826 | 3951 return (num < lower ? lower : |
| 3952 num > upper ? upper : | |
| 3953 num); | |
| 3954 } | |
| 3955 | |
| 3956 Bytexpos | |
| 3957 bytexpos_clip_to_bounds (Bytexpos lower, Bytexpos num, Bytexpos upper) | |
| 3958 { | |
| 3959 return (num < lower ? lower : | |
| 3960 num > upper ? upper : | |
| 3961 num); | |
| 771 | 3962 } |
| 3963 | |
| 826 | 3964 /* These could be implemented in terms of the get_buffer_or_string() |
| 3965 functions above, but those are complicated and handle lots of weird | |
| 3966 cases stemming from uncertain external input. */ | |
| 3967 | |
| 3968 Charxpos | |
| 3969 buffer_or_string_clip_to_accessible_char (Lisp_Object object, Charxpos pos) | |
| 3970 { | |
| 3971 return (charxpos_clip_to_bounds | |
| 3972 (pos, buffer_or_string_accessible_begin_char (object), | |
| 3973 buffer_or_string_accessible_end_char (object))); | |
| 3974 } | |
| 3975 | |
| 3976 Bytexpos | |
| 3977 buffer_or_string_clip_to_accessible_byte (Lisp_Object object, Bytexpos pos) | |
| 771 | 3978 { |
| 826 | 3979 return (bytexpos_clip_to_bounds |
| 3980 (pos, buffer_or_string_accessible_begin_byte (object), | |
| 3981 buffer_or_string_accessible_end_byte (object))); | |
| 3982 } | |
| 3983 | |
| 3984 Charxpos | |
| 3985 buffer_or_string_clip_to_absolute_char (Lisp_Object object, Charxpos pos) | |
| 3986 { | |
| 3987 return (charxpos_clip_to_bounds | |
| 3988 (pos, buffer_or_string_absolute_begin_char (object), | |
| 3989 buffer_or_string_absolute_end_char (object))); | |
| 3990 } | |
| 3991 | |
| 3992 Bytexpos | |
| 3993 buffer_or_string_clip_to_absolute_byte (Lisp_Object object, Bytexpos pos) | |
| 3994 { | |
| 3995 return (bytexpos_clip_to_bounds | |
| 3996 (pos, buffer_or_string_absolute_begin_byte (object), | |
| 3997 buffer_or_string_absolute_end_byte (object))); | |
| 771 | 3998 } |
| 3999 | |
| 4000 | |
| 4001 /************************************************************************/ | |
| 4002 /* Implement TO_EXTERNAL_FORMAT, TO_INTERNAL_FORMAT */ | |
| 4003 /************************************************************************/ | |
| 4004 | |
| 4005 typedef struct | |
| 4006 { | |
| 867 | 4007 Dynarr_declare (Ibyte_dynarr *); |
| 4008 } Ibyte_dynarr_dynarr; | |
| 771 | 4009 |
| 4010 typedef struct | |
| 4011 { | |
| 4012 Dynarr_declare (Extbyte_dynarr *); | |
| 4013 } Extbyte_dynarr_dynarr; | |
| 4014 | |
| 4015 static Extbyte_dynarr_dynarr *conversion_out_dynarr_list; | |
| 867 | 4016 static Ibyte_dynarr_dynarr *conversion_in_dynarr_list; |
| 771 | 4017 |
| 4018 static int dfc_convert_to_external_format_in_use; | |
| 4019 static int dfc_convert_to_internal_format_in_use; | |
| 4020 | |
| 4021 void | |
| 4022 dfc_convert_to_external_format (dfc_conversion_type source_type, | |
| 4023 dfc_conversion_data *source, | |
| 4024 Lisp_Object coding_system, | |
| 4025 dfc_conversion_type sink_type, | |
| 4026 dfc_conversion_data *sink) | |
| 4027 { | |
| 4028 /* It's guaranteed that many callers are not prepared for GC here, | |
| 4029 esp. given that this code conversion occurs in many very hidden | |
| 4030 places. */ | |
| 1292 | 4031 int count; |
| 771 | 4032 Extbyte_dynarr *conversion_out_dynarr; |
| 1292 | 4033 PROFILE_DECLARE (); |
| 4034 | |
| 2367 | 4035 assert (!inhibit_non_essential_conversion_operations); |
| 1292 | 4036 PROFILE_RECORD_ENTERING_SECTION (QSin_internal_external_conversion); |
| 4037 | |
| 4038 count = begin_gc_forbidden (); | |
| 771 | 4039 |
| 4040 type_checking_assert | |
| 4041 (((source_type == DFC_TYPE_DATA) || | |
| 4042 (source_type == DFC_TYPE_LISP_LSTREAM && LSTREAMP (source->lisp_object)) || | |
| 4043 (source_type == DFC_TYPE_LISP_STRING && STRINGP (source->lisp_object))) | |
| 4044 && | |
| 4045 ((sink_type == DFC_TYPE_DATA) || | |
| 4046 (sink_type == DFC_TYPE_LISP_LSTREAM && LSTREAMP (source->lisp_object)))); | |
| 4047 | |
| 4048 if (Dynarr_length (conversion_out_dynarr_list) <= | |
| 4049 dfc_convert_to_external_format_in_use) | |
| 4050 Dynarr_add (conversion_out_dynarr_list, Dynarr_new (Extbyte)); | |
| 4051 conversion_out_dynarr = Dynarr_at (conversion_out_dynarr_list, | |
| 4052 dfc_convert_to_external_format_in_use); | |
| 4053 Dynarr_reset (conversion_out_dynarr); | |
| 4054 | |
| 853 | 4055 internal_bind_int (&dfc_convert_to_external_format_in_use, |
| 4056 dfc_convert_to_external_format_in_use + 1); | |
| 4057 | |
| 771 | 4058 coding_system = get_coding_system_for_text_file (coding_system, 0); |
| 4059 | |
| 4060 /* Here we optimize in the case where the coding system does no | |
| 4061 conversion. However, we don't want to optimize in case the source | |
| 4062 or sink is an lstream, since writing to an lstream can cause a | |
| 4063 garbage collection, and this could be problematic if the source | |
| 4064 is a lisp string. */ | |
| 4065 if (source_type != DFC_TYPE_LISP_LSTREAM && | |
| 4066 sink_type != DFC_TYPE_LISP_LSTREAM && | |
| 4067 coding_system_is_binary (coding_system)) | |
| 4068 { | |
| 867 | 4069 const Ibyte *ptr; |
| 771 | 4070 Bytecount len; |
| 4071 | |
| 4072 if (source_type == DFC_TYPE_LISP_STRING) | |
| 4073 { | |
| 4074 ptr = XSTRING_DATA (source->lisp_object); | |
| 4075 len = XSTRING_LENGTH (source->lisp_object); | |
| 4076 } | |
| 4077 else | |
| 4078 { | |
| 867 | 4079 ptr = (Ibyte *) source->data.ptr; |
| 771 | 4080 len = source->data.len; |
| 4081 } | |
| 4082 | |
| 4083 #ifdef MULE | |
| 4084 { | |
| 867 | 4085 const Ibyte *end; |
| 771 | 4086 for (end = ptr + len; ptr < end;) |
| 4087 { | |
| 867 | 4088 Ibyte c = |
| 826 | 4089 (byte_ascii_p (*ptr)) ? *ptr : |
| 771 | 4090 (*ptr == LEADING_BYTE_CONTROL_1) ? (*(ptr+1) - 0x20) : |
| 4091 (*ptr == LEADING_BYTE_LATIN_ISO8859_1) ? (*(ptr+1)) : | |
| 4092 '~'; | |
| 4093 | |
| 4094 Dynarr_add (conversion_out_dynarr, (Extbyte) c); | |
| 867 | 4095 INC_IBYTEPTR (ptr); |
| 771 | 4096 } |
| 800 | 4097 text_checking_assert (ptr == end); |
| 771 | 4098 } |
| 4099 #else | |
| 4100 Dynarr_add_many (conversion_out_dynarr, ptr, len); | |
| 4101 #endif | |
| 4102 | |
| 4103 } | |
| 1315 | 4104 #ifdef WIN32_ANY |
| 771 | 4105 /* Optimize the common case involving Unicode where only ASCII is involved */ |
| 4106 else if (source_type != DFC_TYPE_LISP_LSTREAM && | |
| 4107 sink_type != DFC_TYPE_LISP_LSTREAM && | |
| 4108 dfc_coding_system_is_unicode (coding_system)) | |
| 4109 { | |
| 867 | 4110 const Ibyte *ptr, *p; |
| 771 | 4111 Bytecount len; |
| 867 | 4112 const Ibyte *end; |
| 771 | 4113 |
| 4114 if (source_type == DFC_TYPE_LISP_STRING) | |
| 4115 { | |
| 4116 ptr = XSTRING_DATA (source->lisp_object); | |
| 4117 len = XSTRING_LENGTH (source->lisp_object); | |
| 4118 } | |
| 4119 else | |
| 4120 { | |
| 867 | 4121 ptr = (Ibyte *) source->data.ptr; |
| 771 | 4122 len = source->data.len; |
| 4123 } | |
| 4124 end = ptr + len; | |
| 4125 | |
| 4126 for (p = ptr; p < end; p++) | |
| 4127 { | |
| 826 | 4128 if (!byte_ascii_p (*p)) |
| 771 | 4129 goto the_hard_way; |
| 4130 } | |
| 4131 | |
| 4132 for (p = ptr; p < end; p++) | |
| 4133 { | |
| 4134 Dynarr_add (conversion_out_dynarr, (Extbyte) (*p)); | |
| 4135 Dynarr_add (conversion_out_dynarr, (Extbyte) '\0'); | |
| 4136 } | |
| 4137 } | |
| 1315 | 4138 #endif /* WIN32_ANY */ |
| 771 | 4139 else |
| 4140 { | |
| 4141 Lisp_Object streams_to_delete[3]; | |
| 4142 int delete_count; | |
| 4143 Lisp_Object instream, outstream; | |
| 4144 Lstream *reader, *writer; | |
| 4145 | |
| 1315 | 4146 #ifdef WIN32_ANY |
| 771 | 4147 the_hard_way: |
| 1315 | 4148 #endif /* WIN32_ANY */ |
| 771 | 4149 delete_count = 0; |
| 4150 if (source_type == DFC_TYPE_LISP_LSTREAM) | |
| 4151 instream = source->lisp_object; | |
| 4152 else if (source_type == DFC_TYPE_DATA) | |
| 4153 streams_to_delete[delete_count++] = instream = | |
| 4154 make_fixed_buffer_input_stream (source->data.ptr, source->data.len); | |
| 4155 else | |
| 4156 { | |
| 4157 type_checking_assert (source_type == DFC_TYPE_LISP_STRING); | |
| 4158 streams_to_delete[delete_count++] = instream = | |
| 4159 /* This will GCPRO the Lisp string */ | |
| 4160 make_lisp_string_input_stream (source->lisp_object, 0, -1); | |
| 4161 } | |
| 4162 | |
| 4163 if (sink_type == DFC_TYPE_LISP_LSTREAM) | |
| 4164 outstream = sink->lisp_object; | |
| 4165 else | |
| 4166 { | |
| 4167 type_checking_assert (sink_type == DFC_TYPE_DATA); | |
| 4168 streams_to_delete[delete_count++] = outstream = | |
| 4169 make_dynarr_output_stream | |
| 4170 ((unsigned_char_dynarr *) conversion_out_dynarr); | |
| 4171 } | |
| 4172 | |
| 4173 streams_to_delete[delete_count++] = outstream = | |
| 800 | 4174 make_coding_output_stream (XLSTREAM (outstream), coding_system, |
| 4175 CODING_ENCODE, 0); | |
| 771 | 4176 |
| 4177 reader = XLSTREAM (instream); | |
| 4178 writer = XLSTREAM (outstream); | |
| 4179 /* decoding_stream will gc-protect outstream */ | |
| 1204 | 4180 { |
| 4181 struct gcpro gcpro1, gcpro2; | |
| 4182 GCPRO2 (instream, outstream); | |
| 4183 | |
| 4184 while (1) | |
| 4185 { | |
| 4186 Bytecount size_in_bytes; | |
| 4187 char tempbuf[1024]; /* some random amount */ | |
| 4188 | |
| 4189 size_in_bytes = Lstream_read (reader, tempbuf, sizeof (tempbuf)); | |
| 4190 | |
| 4191 if (size_in_bytes == 0) | |
| 4192 break; | |
| 4193 else if (size_in_bytes < 0) | |
| 4194 signal_error (Qtext_conversion_error, | |
| 4195 "Error converting to external format", Qunbound); | |
| 4196 | |
| 4197 if (Lstream_write (writer, tempbuf, size_in_bytes) < 0) | |
| 4198 signal_error (Qtext_conversion_error, | |
| 4199 "Error converting to external format", Qunbound); | |
| 4200 } | |
| 4201 | |
| 4202 /* Closing writer will close any stream at the other end of writer. */ | |
| 4203 Lstream_close (writer); | |
| 4204 Lstream_close (reader); | |
| 4205 UNGCPRO; | |
| 4206 } | |
| 771 | 4207 |
| 4208 /* The idea is that this function will create no garbage. */ | |
| 4209 while (delete_count) | |
| 4210 Lstream_delete (XLSTREAM (streams_to_delete [--delete_count])); | |
| 4211 } | |
| 4212 | |
| 4213 unbind_to (count); | |
| 4214 | |
| 4215 if (sink_type != DFC_TYPE_LISP_LSTREAM) | |
| 4216 { | |
| 4217 sink->data.len = Dynarr_length (conversion_out_dynarr); | |
| 4218 /* double zero-extend because we may be dealing with Unicode data */ | |
| 4219 Dynarr_add (conversion_out_dynarr, '\0'); | |
| 4220 Dynarr_add (conversion_out_dynarr, '\0'); | |
| 4967 | 4221 sink->data.ptr = Dynarr_begin (conversion_out_dynarr); |
| 771 | 4222 } |
| 1292 | 4223 |
| 4224 PROFILE_RECORD_EXITING_SECTION (QSin_internal_external_conversion); | |
| 771 | 4225 } |
| 4226 | |
| 4227 void | |
| 4228 dfc_convert_to_internal_format (dfc_conversion_type source_type, | |
| 4229 dfc_conversion_data *source, | |
| 4230 Lisp_Object coding_system, | |
| 4231 dfc_conversion_type sink_type, | |
| 4232 dfc_conversion_data *sink) | |
| 4233 { | |
| 4234 /* It's guaranteed that many callers are not prepared for GC here, | |
| 4235 esp. given that this code conversion occurs in many very hidden | |
| 4236 places. */ | |
| 1292 | 4237 int count; |
| 867 | 4238 Ibyte_dynarr *conversion_in_dynarr; |
| 2421 | 4239 Lisp_Object underlying_cs; |
| 1292 | 4240 PROFILE_DECLARE (); |
| 4241 | |
| 2367 | 4242 assert (!inhibit_non_essential_conversion_operations); |
| 1292 | 4243 PROFILE_RECORD_ENTERING_SECTION (QSin_internal_external_conversion); |
| 4244 | |
| 4245 count = begin_gc_forbidden (); | |
| 771 | 4246 |
| 4247 type_checking_assert | |
| 4248 ((source_type == DFC_TYPE_DATA || | |
| 4249 source_type == DFC_TYPE_LISP_LSTREAM) | |
| 4250 && | |
| 4251 (sink_type == DFC_TYPE_DATA || | |
| 4252 sink_type == DFC_TYPE_LISP_LSTREAM)); | |
| 4253 | |
| 4254 if (Dynarr_length (conversion_in_dynarr_list) <= | |
| 4255 dfc_convert_to_internal_format_in_use) | |
| 867 | 4256 Dynarr_add (conversion_in_dynarr_list, Dynarr_new (Ibyte)); |
| 771 | 4257 conversion_in_dynarr = Dynarr_at (conversion_in_dynarr_list, |
| 4258 dfc_convert_to_internal_format_in_use); | |
| 4259 Dynarr_reset (conversion_in_dynarr); | |
| 4260 | |
| 853 | 4261 internal_bind_int (&dfc_convert_to_internal_format_in_use, |
| 4262 dfc_convert_to_internal_format_in_use + 1); | |
| 4263 | |
| 2421 | 4264 /* The second call does the equivalent of both calls, but we need |
| 4265 the result after the first call (which wraps just a to-text | |
| 4266 converter) as well as the result after the second call (which | |
| 4267 also wraps an EOL-detection converter). */ | |
| 4268 underlying_cs = get_coding_system_for_text_file (coding_system, 0); | |
| 4269 coding_system = get_coding_system_for_text_file (underlying_cs, 1); | |
| 771 | 4270 |
| 4271 if (source_type != DFC_TYPE_LISP_LSTREAM && | |
| 4272 sink_type != DFC_TYPE_LISP_LSTREAM && | |
| 2421 | 4273 coding_system_is_binary (underlying_cs)) |
| 771 | 4274 { |
| 4275 #ifdef MULE | |
| 2421 | 4276 const Ibyte *ptr; |
| 771 | 4277 Bytecount len = source->data.len; |
| 2421 | 4278 const Ibyte *end; |
| 4279 | |
| 4280 /* Make sure no EOL conversion is needed. With a little work we | |
| 4281 could handle EOL conversion as well but it may not be needed as an | |
| 4282 optimization. */ | |
| 4283 if (!EQ (coding_system, underlying_cs)) | |
| 4284 { | |
| 4285 for (ptr = (const Ibyte *) source->data.ptr, end = ptr + len; | |
| 4286 ptr < end; ptr++) | |
| 4287 { | |
| 4288 if (*ptr == '\r' || *ptr == '\n') | |
| 4289 goto the_hard_way; | |
| 4290 } | |
| 4291 } | |
| 4292 | |
| 4293 for (ptr = (const Ibyte *) source->data.ptr, end = ptr + len; | |
| 4294 ptr < end; ptr++) | |
| 771 | 4295 { |
| 867 | 4296 Ibyte c = *ptr; |
| 771 | 4297 |
| 826 | 4298 if (byte_ascii_p (c)) |
| 771 | 4299 Dynarr_add (conversion_in_dynarr, c); |
| 826 | 4300 else if (byte_c1_p (c)) |
| 771 | 4301 { |
| 4302 Dynarr_add (conversion_in_dynarr, LEADING_BYTE_CONTROL_1); | |
| 4303 Dynarr_add (conversion_in_dynarr, c + 0x20); | |
| 4304 } | |
| 4305 else | |
| 4306 { | |
| 4307 Dynarr_add (conversion_in_dynarr, LEADING_BYTE_LATIN_ISO8859_1); | |
| 4308 Dynarr_add (conversion_in_dynarr, c); | |
| 4309 } | |
| 4310 } | |
| 4311 #else | |
| 4312 Dynarr_add_many (conversion_in_dynarr, source->data.ptr, source->data.len); | |
| 4313 #endif | |
| 4314 } | |
| 1315 | 4315 #ifdef WIN32_ANY |
| 1292 | 4316 /* Optimize the common case involving Unicode where only ASCII/Latin-1 is |
| 4317 involved */ | |
| 771 | 4318 else if (source_type != DFC_TYPE_LISP_LSTREAM && |
| 4319 sink_type != DFC_TYPE_LISP_LSTREAM && | |
| 2421 | 4320 dfc_coding_system_is_unicode (underlying_cs)) |
| 771 | 4321 { |
| 2421 | 4322 const Ibyte *ptr; |
| 771 | 4323 Bytecount len = source->data.len; |
| 2421 | 4324 const Ibyte *end; |
| 771 | 4325 |
| 4326 if (len & 1) | |
| 4327 goto the_hard_way; | |
| 4328 | |
| 2421 | 4329 /* Make sure only ASCII/Latin-1 is involved */ |
| 4330 for (ptr = (const Ibyte *) source->data.ptr + 1, end = ptr + len; | |
| 4331 ptr < end; ptr += 2) | |
| 771 | 4332 { |
| 4333 if (*ptr) | |
| 4334 goto the_hard_way; | |
| 4335 } | |
| 4336 | |
| 2421 | 4337 /* Make sure no EOL conversion is needed. With a little work we |
| 4338 could handle EOL conversion as well but it may not be needed as an | |
| 4339 optimization. */ | |
| 4340 if (!EQ (coding_system, underlying_cs)) | |
| 4341 { | |
| 4342 for (ptr = (const Ibyte *) source->data.ptr, end = ptr + len; | |
| 4343 ptr < end; ptr += 2) | |
| 4344 { | |
| 4345 if (*ptr == '\r' || *ptr == '\n') | |
| 4346 goto the_hard_way; | |
| 4347 } | |
| 4348 } | |
| 4349 | |
| 4350 for (ptr = (const Ibyte *) source->data.ptr, end = ptr + len; | |
| 4351 ptr < end; ptr += 2) | |
| 771 | 4352 { |
| 867 | 4353 Ibyte c = *ptr; |
| 771 | 4354 |
| 826 | 4355 if (byte_ascii_p (c)) |
| 771 | 4356 Dynarr_add (conversion_in_dynarr, c); |
| 4357 #ifdef MULE | |
| 826 | 4358 else if (byte_c1_p (c)) |
| 771 | 4359 { |
| 4360 Dynarr_add (conversion_in_dynarr, LEADING_BYTE_CONTROL_1); | |
| 4361 Dynarr_add (conversion_in_dynarr, c + 0x20); | |
| 4362 } | |
| 4363 else | |
| 4364 { | |
| 4365 Dynarr_add (conversion_in_dynarr, LEADING_BYTE_LATIN_ISO8859_1); | |
| 4366 Dynarr_add (conversion_in_dynarr, c); | |
| 4367 } | |
| 4368 #endif /* MULE */ | |
| 4369 } | |
| 4370 } | |
| 1315 | 4371 #endif /* WIN32_ANY */ |
| 771 | 4372 else |
| 4373 { | |
| 4374 Lisp_Object streams_to_delete[3]; | |
| 4375 int delete_count; | |
| 4376 Lisp_Object instream, outstream; | |
| 4377 Lstream *reader, *writer; | |
| 4378 | |
| 2421 | 4379 #if defined (WIN32_ANY) || defined (MULE) |
| 771 | 4380 the_hard_way: |
| 2421 | 4381 #endif |
| 771 | 4382 delete_count = 0; |
| 4383 if (source_type == DFC_TYPE_LISP_LSTREAM) | |
| 4384 instream = source->lisp_object; | |
| 4385 else | |
| 4386 { | |
| 4387 type_checking_assert (source_type == DFC_TYPE_DATA); | |
| 4388 streams_to_delete[delete_count++] = instream = | |
| 4389 make_fixed_buffer_input_stream (source->data.ptr, source->data.len); | |
| 4390 } | |
| 4391 | |
| 4392 if (sink_type == DFC_TYPE_LISP_LSTREAM) | |
| 4393 outstream = sink->lisp_object; | |
| 4394 else | |
| 4395 { | |
| 4396 type_checking_assert (sink_type == DFC_TYPE_DATA); | |
| 4397 streams_to_delete[delete_count++] = outstream = | |
| 4398 make_dynarr_output_stream | |
| 4399 ((unsigned_char_dynarr *) conversion_in_dynarr); | |
| 4400 } | |
| 4401 | |
| 4402 streams_to_delete[delete_count++] = outstream = | |
| 800 | 4403 make_coding_output_stream (XLSTREAM (outstream), coding_system, |
| 4404 CODING_DECODE, 0); | |
| 771 | 4405 |
| 4406 reader = XLSTREAM (instream); | |
| 4407 writer = XLSTREAM (outstream); | |
| 1204 | 4408 { |
| 4409 struct gcpro gcpro1, gcpro2; | |
| 4410 /* outstream will gc-protect its sink stream, if necessary */ | |
| 4411 GCPRO2 (instream, outstream); | |
| 4412 | |
| 4413 while (1) | |
| 4414 { | |
| 4415 Bytecount size_in_bytes; | |
| 4416 char tempbuf[1024]; /* some random amount */ | |
| 4417 | |
| 4418 size_in_bytes = Lstream_read (reader, tempbuf, sizeof (tempbuf)); | |
| 4419 | |
| 4420 if (size_in_bytes == 0) | |
| 4421 break; | |
| 4422 else if (size_in_bytes < 0) | |
| 4423 signal_error (Qtext_conversion_error, | |
| 4424 "Error converting to internal format", Qunbound); | |
| 4425 | |
| 4426 if (Lstream_write (writer, tempbuf, size_in_bytes) < 0) | |
| 4427 signal_error (Qtext_conversion_error, | |
| 4428 "Error converting to internal format", Qunbound); | |
| 4429 } | |
| 4430 | |
| 4431 /* Closing writer will close any stream at the other end of writer. */ | |
| 4432 Lstream_close (writer); | |
| 4433 Lstream_close (reader); | |
| 4434 UNGCPRO; | |
| 4435 } | |
| 771 | 4436 |
| 4437 /* The idea is that this function will create no garbage. */ | |
| 4438 while (delete_count) | |
| 4439 Lstream_delete (XLSTREAM (streams_to_delete [--delete_count])); | |
| 4440 } | |
| 4441 | |
| 4442 unbind_to (count); | |
| 4443 | |
| 4444 if (sink_type != DFC_TYPE_LISP_LSTREAM) | |
| 4445 { | |
| 4446 sink->data.len = Dynarr_length (conversion_in_dynarr); | |
| 4447 Dynarr_add (conversion_in_dynarr, '\0'); /* remember to NUL-terminate! */ | |
| 4448 /* The macros don't currently distinguish between internal and | |
| 4449 external sinks, and allocate and copy two extra bytes in both | |
| 4450 cases. So we add a second zero, just like for external data | |
| 4451 (in that case, because we may be converting to Unicode). */ | |
| 4452 Dynarr_add (conversion_in_dynarr, '\0'); | |
| 4967 | 4453 sink->data.ptr = Dynarr_begin (conversion_in_dynarr); |
| 771 | 4454 } |
| 1292 | 4455 |
| 4456 PROFILE_RECORD_EXITING_SECTION (QSin_internal_external_conversion); | |
| 771 | 4457 } |
| 4458 | |
| 1318 | 4459 /* ----------------------------------------------------------------------- */ |
| 2367 | 4460 /* Alloca-conversion helpers */ |
| 4461 /* ----------------------------------------------------------------------- */ | |
| 4462 | |
| 4463 /* For alloca(), things are trickier because the calling function needs to | |
| 4464 allocate. This means that the caller needs to do the following: | |
| 4465 | |
| 4466 (a) invoke us to do the conversion, remember the data and return the size. | |
| 4467 (b) alloca() the proper size. | |
| 4468 (c) invoke us again to copy the data. | |
| 4469 | |
| 4470 We need to handle the possibility of two or more invocations of the | |
| 4471 converter in the same expression. In such cases it's conceivable that | |
| 4472 the evaluation of the sub-expressions will be overlapping (e.g. one size | |
| 4473 function called, then the other one called, then the copy functions | |
| 4474 called). To handle this, we keep a list of active data, indexed by the | |
| 4475 src expression. (We use the stringize operator to avoid evaluating the | |
| 4476 expression multiple times.) If the caller uses the exact same src | |
| 4477 expression twice in two converter calls in the same subexpression, we | |
| 2500 | 4478 will lose, but at least we can check for this and ABORT(). We could |
| 2367 | 4479 conceivably try to index on other parameters as well, but there is not |
| 4480 really any point. */ | |
| 4481 | |
| 4482 alloca_convert_vals_dynarr *active_alloca_convert; | |
| 4483 | |
| 4484 int | |
| 4485 find_pos_of_existing_active_alloca_convert (const char *srctext) | |
| 4486 { | |
| 4487 alloca_convert_vals *vals = NULL; | |
| 4488 int i; | |
| 4489 | |
| 4490 if (!active_alloca_convert) | |
| 4491 active_alloca_convert = Dynarr_new (alloca_convert_vals); | |
| 4492 | |
| 4493 for (i = 0; i < Dynarr_length (active_alloca_convert); i++) | |
| 4494 { | |
| 4495 vals = Dynarr_atp (active_alloca_convert, i); | |
| 2385 | 4496 /* On my system, two different occurrences of the same stringized |
| 4497 argument always point to the same string. However, on someone | |
| 4498 else's system, that wasn't the case. We check for equality | |
| 4499 first, since it seems systems work my way more than the other | |
| 4500 way. */ | |
| 4501 if (vals->srctext == srctext || !strcmp (vals->srctext, srctext)) | |
| 2367 | 4502 return i; |
| 4503 } | |
| 4504 | |
| 4505 return -1; | |
| 4506 } | |
| 4507 | |
| 4508 /* ----------------------------------------------------------------------- */ | |
| 1318 | 4509 /* New-style DFC converters (data is returned rather than stored into var) */ |
| 4510 /* ----------------------------------------------------------------------- */ | |
| 4511 | |
| 4512 /* We handle here the cases where SRC is a Lisp_Object, internal data | |
| 4513 (sized or unsized), or external data (sized or unsized), and return type | |
| 4514 is unsized alloca() or malloc() data. If the return type is a | |
|
4953
304aebb79cd3
function renamings to track names of char typedefs
Ben Wing <ben@xemacs.org>
parents:
4952
diff
changeset
|
4515 Lisp_Object, use build_extstring() for unsized external data, |
|
304aebb79cd3
function renamings to track names of char typedefs
Ben Wing <ben@xemacs.org>
parents:
4952
diff
changeset
|
4516 make_extstring() for sized external data. If the return type needs to |
| 1318 | 4517 be sized data, use the *_TO_SIZED_*() macros, and for other more |
| 4518 complicated cases, use the original TO_*_FORMAT() macros. */ | |
| 4519 | |
| 4520 static void | |
| 4521 new_dfc_convert_now_damn_it (const void *src, Bytecount src_size, | |
| 4522 enum new_dfc_src_type type, | |
| 4523 void **dst, Bytecount *dst_size, | |
| 4524 Lisp_Object codesys) | |
| 4525 { | |
| 4526 /* #### In the case of alloca(), it would be a bit more efficient, for | |
| 4527 small strings, to use static Dynarr's like are used internally in | |
| 4528 TO_*_FORMAT(), or some other way of avoiding malloc() followed by | |
| 4529 free(). I doubt it really matters, though. */ | |
| 4530 | |
| 4531 switch (type) | |
| 4532 { | |
| 4533 case DFC_EXTERNAL: | |
| 4534 TO_INTERNAL_FORMAT (C_STRING, src, | |
| 4535 MALLOC, (*dst, *dst_size), codesys); | |
| 4536 break; | |
| 4537 | |
| 4538 case DFC_SIZED_EXTERNAL: | |
| 4539 TO_INTERNAL_FORMAT (DATA, (src, src_size), | |
| 4540 MALLOC, (*dst, *dst_size), codesys); | |
| 4541 break; | |
| 4542 | |
| 4543 case DFC_INTERNAL: | |
| 4544 TO_EXTERNAL_FORMAT (C_STRING, src, | |
| 4545 MALLOC, (*dst, *dst_size), codesys); | |
| 4546 break; | |
| 4547 | |
| 4548 case DFC_SIZED_INTERNAL: | |
| 4549 TO_EXTERNAL_FORMAT (DATA, (src, src_size), | |
| 4550 MALLOC, (*dst, *dst_size), codesys); | |
| 4551 break; | |
| 4552 | |
| 4553 case DFC_LISP_STRING: | |
| 5013 | 4554 TO_EXTERNAL_FORMAT (LISP_STRING, GET_LISP_FROM_VOID (src), |
| 1318 | 4555 MALLOC, (*dst, *dst_size), codesys); |
| 4556 break; | |
| 4557 | |
| 4558 default: | |
| 2500 | 4559 ABORT (); |
| 1318 | 4560 } |
| 2367 | 4561 |
| 4562 /* The size is always + 2 because we have double zero-termination at the | |
| 4563 end of all data (for Unicode-correctness). */ | |
| 4564 *dst_size += 2; | |
| 4565 } | |
| 4566 | |
| 4567 Bytecount | |
| 4568 new_dfc_convert_size (const char *srctext, const void *src, | |
| 4569 Bytecount src_size, enum new_dfc_src_type type, | |
| 4570 Lisp_Object codesys) | |
| 4571 { | |
| 4572 alloca_convert_vals vals; | |
| 4573 | |
| 2721 | 4574 int i = find_pos_of_existing_active_alloca_convert (srctext); |
| 4575 assert (i < 0); | |
| 2367 | 4576 |
| 4577 vals.srctext = srctext; | |
| 4578 | |
| 4579 new_dfc_convert_now_damn_it (src, src_size, type, &vals.dst, &vals.dst_size, | |
| 4580 codesys); | |
| 4581 | |
| 4582 Dynarr_add (active_alloca_convert, vals); | |
| 4583 return vals.dst_size; | |
| 4584 } | |
| 4585 | |
| 4586 void * | |
| 4587 new_dfc_convert_copy_data (const char *srctext, void *alloca_data) | |
| 4588 { | |
| 4589 alloca_convert_vals *vals; | |
| 4590 int i = find_pos_of_existing_active_alloca_convert (srctext); | |
| 4591 | |
| 4592 assert (i >= 0); | |
| 4593 vals = Dynarr_atp (active_alloca_convert, i); | |
| 4594 assert (alloca_data); | |
| 4595 memcpy (alloca_data, vals->dst, vals->dst_size); | |
|
4976
16112448d484
Rename xfree(FOO, TYPE) -> xfree(FOO)
Ben Wing <ben@xemacs.org>
parents:
4967
diff
changeset
|
4596 xfree (vals->dst); |
| 2367 | 4597 Dynarr_delete (active_alloca_convert, i); |
| 4598 return alloca_data; | |
| 1318 | 4599 } |
| 4600 | |
| 4601 void * | |
| 4602 new_dfc_convert_malloc (const void *src, Bytecount src_size, | |
| 4603 enum new_dfc_src_type type, Lisp_Object codesys) | |
| 4604 { | |
| 4605 void *dst; | |
| 4606 Bytecount dst_size; | |
| 4607 | |
| 4608 new_dfc_convert_now_damn_it (src, src_size, type, &dst, &dst_size, codesys); | |
| 4609 return dst; | |
| 4610 } | |
| 4611 | |
| 771 | 4612 |
| 4613 /************************************************************************/ | |
| 867 | 4614 /* Basic Ichar functions */ |
| 771 | 4615 /************************************************************************/ |
| 4616 | |
| 4617 #ifdef MULE | |
| 4618 | |
| 4619 /* Convert a non-ASCII Mule character C into a one-character Mule-encoded | |
| 4620 string in STR. Returns the number of bytes stored. | |
| 867 | 4621 Do not call this directly. Use the macro set_itext_ichar() instead. |
| 771 | 4622 */ |
| 4623 | |
| 4624 Bytecount | |
| 867 | 4625 non_ascii_set_itext_ichar (Ibyte *str, Ichar c) |
| 771 | 4626 { |
| 867 | 4627 Ibyte *p; |
| 4628 Ibyte lb; | |
| 771 | 4629 int c1, c2; |
| 4630 Lisp_Object charset; | |
| 4631 | |
| 4632 p = str; | |
| 867 | 4633 BREAKUP_ICHAR (c, charset, c1, c2); |
| 4634 lb = ichar_leading_byte (c); | |
| 826 | 4635 if (leading_byte_private_p (lb)) |
| 4636 *p++ = private_leading_byte_prefix (lb); | |
| 771 | 4637 *p++ = lb; |
| 4638 if (EQ (charset, Vcharset_control_1)) | |
| 4639 c1 += 0x20; | |
| 4640 *p++ = c1 | 0x80; | |
| 4641 if (c2) | |
| 4642 *p++ = c2 | 0x80; | |
| 4643 | |
| 4644 return (p - str); | |
| 4645 } | |
| 4646 | |
| 4647 /* Return the first character from a Mule-encoded string in STR, | |
| 4648 assuming it's non-ASCII. Do not call this directly. | |
| 867 | 4649 Use the macro itext_ichar() instead. */ |
| 4650 | |
| 4651 Ichar | |
| 4652 non_ascii_itext_ichar (const Ibyte *str) | |
| 771 | 4653 { |
| 867 | 4654 Ibyte i0 = *str, i1, i2 = 0; |
| 771 | 4655 Lisp_Object charset; |
| 4656 | |
| 4657 if (i0 == LEADING_BYTE_CONTROL_1) | |
| 867 | 4658 return (Ichar) (*++str - 0x20); |
| 771 | 4659 |
| 826 | 4660 if (leading_byte_prefix_p (i0)) |
| 771 | 4661 i0 = *++str; |
| 4662 | |
| 4663 i1 = *++str & 0x7F; | |
| 4664 | |
| 826 | 4665 charset = charset_by_leading_byte (i0); |
| 771 | 4666 if (XCHARSET_DIMENSION (charset) == 2) |
| 4667 i2 = *++str & 0x7F; | |
| 4668 | |
| 867 | 4669 return make_ichar (charset, i1, i2); |
| 771 | 4670 } |
| 4671 | |
| 867 | 4672 /* Return whether CH is a valid Ichar, assuming it's non-ASCII. |
| 4673 Do not call this directly. Use the macro valid_ichar_p() instead. */ | |
| 771 | 4674 |
| 4675 int | |
| 867 | 4676 non_ascii_valid_ichar_p (Ichar ch) |
| 771 | 4677 { |
| 4678 int f1, f2, f3; | |
| 4679 | |
| 3498 | 4680 /* Must have only lowest 21 bits set */ |
|
5863
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
4681 if (ch & ~(CHAR_CODE_LIMIT - 1)) |
| 771 | 4682 return 0; |
| 4683 | |
| 867 | 4684 f1 = ichar_field1 (ch); |
| 4685 f2 = ichar_field2 (ch); | |
| 4686 f3 = ichar_field3 (ch); | |
| 771 | 4687 |
| 4688 if (f1 == 0) | |
| 4689 { | |
| 4690 /* dimension-1 char */ | |
| 4691 Lisp_Object charset; | |
| 4692 | |
| 4693 /* leading byte must be correct */ | |
| 867 | 4694 if (f2 < MIN_ICHAR_FIELD2_OFFICIAL || |
| 4695 (f2 > MAX_ICHAR_FIELD2_OFFICIAL && f2 < MIN_ICHAR_FIELD2_PRIVATE) || | |
| 4696 f2 > MAX_ICHAR_FIELD2_PRIVATE) | |
| 771 | 4697 return 0; |
| 4698 /* octet not out of range */ | |
| 4699 if (f3 < 0x20) | |
| 4700 return 0; | |
| 4701 /* charset exists */ | |
| 4702 /* | |
| 4703 NOTE: This takes advantage of the fact that | |
| 4704 FIELD2_TO_OFFICIAL_LEADING_BYTE and | |
| 4705 FIELD2_TO_PRIVATE_LEADING_BYTE are the same. | |
| 4706 */ | |
| 826 | 4707 charset = charset_by_leading_byte (f2 + FIELD2_TO_OFFICIAL_LEADING_BYTE); |
| 771 | 4708 if (EQ (charset, Qnil)) |
| 4709 return 0; | |
| 4710 /* check range as per size (94 or 96) of charset */ | |
| 4711 return ((f3 > 0x20 && f3 < 0x7f) || XCHARSET_CHARS (charset) == 96); | |
| 4712 } | |
| 4713 else | |
| 4714 { | |
| 4715 /* dimension-2 char */ | |
| 4716 Lisp_Object charset; | |
| 4717 | |
| 4718 /* leading byte must be correct */ | |
| 867 | 4719 if (f1 < MIN_ICHAR_FIELD1_OFFICIAL || |
| 4720 (f1 > MAX_ICHAR_FIELD1_OFFICIAL && f1 < MIN_ICHAR_FIELD1_PRIVATE) || | |
| 4721 f1 > MAX_ICHAR_FIELD1_PRIVATE) | |
| 771 | 4722 return 0; |
| 4723 /* octets not out of range */ | |
| 4724 if (f2 < 0x20 || f3 < 0x20) | |
| 4725 return 0; | |
| 4726 | |
| 4727 #ifdef ENABLE_COMPOSITE_CHARS | |
| 4728 if (f1 + FIELD1_TO_OFFICIAL_LEADING_BYTE == LEADING_BYTE_COMPOSITE) | |
| 4729 { | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4730 if (UNBOUNDP (Fgethash (make_fixnum (ch), |
| 771 | 4731 Vcomposite_char_char2string_hash_table, |
| 4732 Qunbound))) | |
| 4733 return 0; | |
| 4734 return 1; | |
| 4735 } | |
| 4736 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 4737 | |
| 4738 /* charset exists */ | |
| 867 | 4739 if (f1 <= MAX_ICHAR_FIELD1_OFFICIAL) |
| 771 | 4740 charset = |
| 826 | 4741 charset_by_leading_byte (f1 + FIELD1_TO_OFFICIAL_LEADING_BYTE); |
| 771 | 4742 else |
| 4743 charset = | |
| 826 | 4744 charset_by_leading_byte (f1 + FIELD1_TO_PRIVATE_LEADING_BYTE); |
| 771 | 4745 |
| 4746 if (EQ (charset, Qnil)) | |
| 4747 return 0; | |
| 4748 /* check range as per size (94x94 or 96x96) of charset */ | |
| 4749 return ((f2 != 0x20 && f2 != 0x7F && f3 != 0x20 && f3 != 0x7F) || | |
| 4750 XCHARSET_CHARS (charset) == 96); | |
| 4751 } | |
| 4752 } | |
| 4753 | |
| 4754 /* Copy the character pointed to by SRC into DST. Do not call this | |
| 867 | 4755 directly. Use the macro itext_copy_ichar() instead. |
| 771 | 4756 Return the number of bytes copied. */ |
| 4757 | |
| 4758 Bytecount | |
| 867 | 4759 non_ascii_itext_copy_ichar (const Ibyte *src, Ibyte *dst) |
| 771 | 4760 { |
| 826 | 4761 Bytecount bytes = rep_bytes_by_first_byte (*src); |
| 771 | 4762 Bytecount i; |
| 4763 for (i = bytes; i; i--, dst++, src++) | |
| 4764 *dst = *src; | |
| 4765 return bytes; | |
| 4766 } | |
| 4767 | |
| 4768 #endif /* MULE */ | |
| 4769 | |
| 4770 | |
| 4771 /************************************************************************/ | |
| 867 | 4772 /* streams of Ichars */ |
| 771 | 4773 /************************************************************************/ |
| 4774 | |
| 4775 #ifdef MULE | |
| 4776 | |
| 867 | 4777 /* Treat a stream as a stream of Ichar's rather than a stream of bytes. |
| 771 | 4778 The functions below are not meant to be called directly; use |
| 4779 the macros in insdel.h. */ | |
| 4780 | |
| 867 | 4781 Ichar |
| 4782 Lstream_get_ichar_1 (Lstream *stream, int ch) | |
| 771 | 4783 { |
| 867 | 4784 Ibyte str[MAX_ICHAR_LEN]; |
| 4785 Ibyte *strptr = str; | |
| 771 | 4786 Bytecount bytes; |
| 4787 | |
| 867 | 4788 str[0] = (Ibyte) ch; |
| 771 | 4789 |
| 826 | 4790 for (bytes = rep_bytes_by_first_byte (ch) - 1; bytes; bytes--) |
| 771 | 4791 { |
| 4792 int c = Lstream_getc (stream); | |
| 800 | 4793 text_checking_assert (c >= 0); |
| 867 | 4794 *++strptr = (Ibyte) c; |
| 771 | 4795 } |
| 867 | 4796 return itext_ichar (str); |
| 771 | 4797 } |
| 4798 | |
| 4799 int | |
| 867 | 4800 Lstream_fput_ichar (Lstream *stream, Ichar ch) |
| 771 | 4801 { |
| 867 | 4802 Ibyte str[MAX_ICHAR_LEN]; |
| 4803 Bytecount len = set_itext_ichar (str, ch); | |
| 771 | 4804 return Lstream_write (stream, str, len); |
| 4805 } | |
| 4806 | |
| 4807 void | |
| 867 | 4808 Lstream_funget_ichar (Lstream *stream, Ichar ch) |
| 771 | 4809 { |
| 867 | 4810 Ibyte str[MAX_ICHAR_LEN]; |
| 4811 Bytecount len = set_itext_ichar (str, ch); | |
| 771 | 4812 Lstream_unread (stream, str, len); |
| 4813 } | |
| 4814 | |
| 4815 #endif /* MULE */ | |
| 4816 | |
| 4817 | |
| 4818 /************************************************************************/ | |
| 4819 /* Lisp primitives for working with characters */ | |
| 4820 /************************************************************************/ | |
| 4821 | |
| 4822 DEFUN ("make-char", Fmake_char, 2, 3, 0, /* | |
| 4823 Make a character from CHARSET and octets ARG1 and ARG2. | |
| 4824 ARG2 is required only for characters from two-dimensional charsets. | |
| 4825 | |
| 4826 Each octet should be in the range 32 through 127 for a 96 or 96x96 | |
| 4827 charset and 33 through 126 for a 94 or 94x94 charset. (Most charsets | |
| 4828 are either 96 or 94x94.) Note that this is 32 more than the values | |
| 4829 typically given for 94x94 charsets. When two octets are required, the | |
| 4830 order is "standard" -- the same as appears in ISO-2022 encodings, | |
| 4831 reference tables, etc. | |
| 4832 | |
| 4833 \(Note the following non-obvious result: Computerized translation | |
| 4834 tables often encode the two octets as the high and low bytes, | |
| 4835 respectively, of a hex short, while when there's only one octet, it | |
| 4836 goes in the low byte. When decoding such a value, you need to treat | |
| 4837 the two cases differently when calling make-char: One is (make-char | |
| 4838 CHARSET HIGH LOW), the other is (make-char CHARSET LOW).) | |
| 4839 | |
| 4840 For example, (make-char 'latin-iso8859-2 185) or (make-char | |
| 4841 'latin-iso8859-2 57) will return the Latin 2 character s with caron. | |
| 4842 | |
| 4843 As another example, the Japanese character for "kawa" (stream), which | |
| 4844 looks something like this: | |
| 4845 | |
| 4846 | | | |
| 4847 | | | | |
| 4848 | | | | |
| 4849 | | | | |
| 4850 / | | |
| 4851 | |
| 4852 appears in the Unicode Standard (version 2.0) on page 7-287 with the | |
| 4853 following values (see also page 7-4): | |
| 4854 | |
| 4855 U 5DDD (Unicode) | |
| 4856 G 0-2008 (GB 2312-80) | |
| 4857 J 0-3278 (JIS X 0208-1990) | |
| 4858 K 0-8425 (KS C 5601-1987) | |
| 4859 B A474 (Big Five) | |
| 4860 C 1-4455 (CNS 11643-1986 (1st plane)) | |
| 4861 A 213C34 (ANSI Z39.64-1989) | |
| 4862 | |
| 4863 These are equivalent to: | |
| 4864 | |
| 4865 \(make-char 'chinese-gb2312 52 40) | |
| 4866 \(make-char 'japanese-jisx0208 64 110) | |
| 4867 \(make-char 'korean-ksc5601 116 57) | |
| 4868 \(make-char 'chinese-cns11643-1 76 87) | |
| 4869 \(decode-big5-char '(164 . 116)) | |
| 4870 | |
| 4871 \(All codes above are two decimal numbers except for Big Five and ANSI | |
| 4872 Z39.64, which we don't support. We add 32 to each of the decimal | |
| 4873 numbers. Big Five is split in a rather hackish fashion into two | |
| 4874 charsets, `big5-1' and `big5-2', due to its excessive size -- 94x157, | |
| 4875 with the first codepoint in the range 0xA1 to 0xFE and the second in | |
| 4876 the range 0x40 to 0x7E or 0xA1 to 0xFE. `decode-big5-char' is used to | |
| 4877 generate the char from its codes, and `encode-big5-char' extracts the | |
| 4878 codes.) | |
| 4879 | |
| 4880 When compiled without MULE, this function does not do much, but it's | |
| 4881 provided for compatibility. In this case, the following CHARSET symbols | |
| 4882 are allowed: | |
| 4883 | |
| 4884 `ascii' -- ARG1 should be in the range 0 through 127. | |
| 4885 `control-1' -- ARG1 should be in the range 128 through 159. | |
| 4886 else -- ARG1 is coerced to be between 0 and 255, and then the high | |
| 4887 bit is set. | |
| 4888 | |
| 4889 `int-to-char of the resulting ARG1' is returned, and ARG2 is always ignored. | |
| 4890 */ | |
| 2333 | 4891 (charset, arg1, USED_IF_MULE (arg2))) |
| 771 | 4892 { |
| 4893 #ifdef MULE | |
| 4894 Lisp_Charset *cs; | |
| 4895 int a1, a2; | |
| 4896 int lowlim, highlim; | |
| 4897 | |
| 4898 charset = Fget_charset (charset); | |
| 4899 cs = XCHARSET (charset); | |
| 4900 | |
| 788 | 4901 get_charset_limits (charset, &lowlim, &highlim); |
| 771 | 4902 |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4903 CHECK_FIXNUM (arg1); |
| 771 | 4904 /* It is useful (and safe, according to Olivier Galibert) to strip |
| 4905 the 8th bit off ARG1 and ARG2 because it allows programmers to | |
| 4906 write (make-char 'latin-iso8859-2 CODE) where code is the actual | |
| 4907 Latin 2 code of the character. */ | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4908 a1 = XFIXNUM (arg1) & 0x7f; |
| 771 | 4909 if (a1 < lowlim || a1 > highlim) |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4910 args_out_of_range_3 (arg1, make_fixnum (lowlim), make_fixnum (highlim)); |
| 771 | 4911 |
| 4912 if (CHARSET_DIMENSION (cs) == 1) | |
| 4913 { | |
| 4914 if (!NILP (arg2)) | |
| 4915 invalid_argument | |
| 4916 ("Charset is of dimension one; second octet must be nil", arg2); | |
| 867 | 4917 return make_char (make_ichar (charset, a1, 0)); |
| 771 | 4918 } |
| 4919 | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4920 CHECK_FIXNUM (arg2); |
|
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4921 a2 = XFIXNUM (arg2) & 0x7f; |
| 771 | 4922 if (a2 < lowlim || a2 > highlim) |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4923 args_out_of_range_3 (arg2, make_fixnum (lowlim), make_fixnum (highlim)); |
| 771 | 4924 |
| 867 | 4925 return make_char (make_ichar (charset, a1, a2)); |
| 771 | 4926 #else |
| 4927 int a1; | |
| 4928 int lowlim, highlim; | |
| 4929 | |
| 4930 if (EQ (charset, Qascii)) lowlim = 0, highlim = 127; | |
| 4931 else if (EQ (charset, Qcontrol_1)) lowlim = 0, highlim = 31; | |
| 4932 else lowlim = 0, highlim = 127; | |
| 4933 | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4934 CHECK_FIXNUM (arg1); |
| 771 | 4935 /* It is useful (and safe, according to Olivier Galibert) to strip |
| 4936 the 8th bit off ARG1 and ARG2 because it allows programmers to | |
| 4937 write (make-char 'latin-iso8859-2 CODE) where code is the actual | |
| 4938 Latin 2 code of the character. */ | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4939 a1 = XFIXNUM (arg1) & 0x7f; |
| 771 | 4940 if (a1 < lowlim || a1 > highlim) |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4941 args_out_of_range_3 (arg1, make_fixnum (lowlim), make_fixnum (highlim)); |
| 771 | 4942 |
| 4943 if (EQ (charset, Qascii)) | |
| 4944 return make_char (a1); | |
| 4945 return make_char (a1 + 128); | |
| 4946 #endif /* MULE */ | |
| 4947 } | |
| 4948 | |
| 4949 #ifdef MULE | |
| 4950 | |
| 4951 DEFUN ("char-charset", Fchar_charset, 1, 1, 0, /* | |
| 4952 Return the character set of char CH. | |
| 4953 */ | |
| 4954 (ch)) | |
| 4955 { | |
| 4956 CHECK_CHAR_COERCE_INT (ch); | |
| 4957 | |
| 826 | 4958 return XCHARSET_NAME (charset_by_leading_byte |
| 867 | 4959 (ichar_leading_byte (XCHAR (ch)))); |
| 771 | 4960 } |
| 4961 | |
| 4962 DEFUN ("char-octet", Fchar_octet, 1, 2, 0, /* | |
| 4963 Return the octet numbered N (should be 0 or 1) of char CH. | |
| 4964 N defaults to 0 if omitted. | |
| 4965 */ | |
| 4966 (ch, n)) | |
| 4967 { | |
| 4968 Lisp_Object charset; | |
| 4969 int octet0, octet1; | |
| 4970 | |
| 4971 CHECK_CHAR_COERCE_INT (ch); | |
| 4972 | |
| 867 | 4973 BREAKUP_ICHAR (XCHAR (ch), charset, octet0, octet1); |
| 771 | 4974 |
| 4975 if (NILP (n) || EQ (n, Qzero)) | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4976 return make_fixnum (octet0); |
|
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4977 else if (EQ (n, make_fixnum (1))) |
|
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
4978 return make_fixnum (octet1); |
| 771 | 4979 else |
| 4980 invalid_constant ("Octet number must be 0 or 1", n); | |
| 4981 } | |
| 4982 | |
| 3724 | 4983 #endif /* MULE */ |
| 4984 | |
| 771 | 4985 DEFUN ("split-char", Fsplit_char, 1, 1, 0, /* |
| 4986 Return list of charset and one or two position-codes of CHAR. | |
| 4987 */ | |
| 4988 (character)) | |
| 4989 { | |
| 4990 /* This function can GC */ | |
| 4991 struct gcpro gcpro1, gcpro2; | |
| 4992 Lisp_Object charset = Qnil; | |
| 4993 Lisp_Object rc = Qnil; | |
| 4994 int c1, c2; | |
| 4995 | |
| 4996 GCPRO2 (charset, rc); | |
| 4997 CHECK_CHAR_COERCE_INT (character); | |
| 4998 | |
| 867 | 4999 BREAKUP_ICHAR (XCHAR (character), charset, c1, c2); |
| 771 | 5000 |
| 3724 | 5001 if (XCHARSET_DIMENSION (charset) == 2) |
| 771 | 5002 { |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
5003 rc = list3 (XCHARSET_NAME (charset), make_fixnum (c1), make_fixnum (c2)); |
| 771 | 5004 } |
| 5005 else | |
| 5006 { | |
|
5581
56144c8593a8
Mechanically change INT to FIXNUM in our sources.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5474
diff
changeset
|
5007 rc = list2 (XCHARSET_NAME (charset), make_fixnum (c1)); |
| 771 | 5008 } |
| 5009 UNGCPRO; | |
| 5010 | |
| 5011 return rc; | |
| 5012 } | |
| 5013 | |
| 5014 | |
| 5015 /************************************************************************/ | |
| 5016 /* composite character functions */ | |
| 5017 /************************************************************************/ | |
| 5018 | |
| 5019 #ifdef ENABLE_COMPOSITE_CHARS | |
| 5020 | |
| 867 | 5021 Ichar |
| 5022 lookup_composite_char (Ibyte *str, int len) | |
| 771 | 5023 { |
| 5024 Lisp_Object lispstr = make_string (str, len); | |
| 5025 Lisp_Object ch = Fgethash (lispstr, | |
| 5026 Vcomposite_char_string2char_hash_table, | |
| 5027 Qunbound); | |
| 867 | 5028 Ichar emch; |
| 771 | 5029 |
| 5030 if (UNBOUNDP (ch)) | |
| 5031 { | |
| 5032 if (composite_char_row_next >= 128) | |
| 5033 invalid_operation ("No more composite chars available", lispstr); | |
| 867 | 5034 emch = make_ichar (Vcharset_composite, composite_char_row_next, |
| 771 | 5035 composite_char_col_next); |
| 5036 Fputhash (make_char (emch), lispstr, | |
| 5037 Vcomposite_char_char2string_hash_table); | |
| 5038 Fputhash (lispstr, make_char (emch), | |
| 5039 Vcomposite_char_string2char_hash_table); | |
| 5040 composite_char_col_next++; | |
| 5041 if (composite_char_col_next >= 128) | |
| 5042 { | |
| 5043 composite_char_col_next = 32; | |
| 5044 composite_char_row_next++; | |
| 5045 } | |
| 5046 } | |
| 5047 else | |
| 5048 emch = XCHAR (ch); | |
| 5049 return emch; | |
| 5050 } | |
| 5051 | |
| 5052 Lisp_Object | |
| 867 | 5053 composite_char_string (Ichar ch) |
| 771 | 5054 { |
| 5055 Lisp_Object str = Fgethash (make_char (ch), | |
| 5056 Vcomposite_char_char2string_hash_table, | |
| 5057 Qunbound); | |
| 5058 assert (!UNBOUNDP (str)); | |
| 5059 return str; | |
| 5060 } | |
| 5061 | |
| 826 | 5062 DEFUN ("make-composite-char", Fmake_composite_char, 1, 1, 0, /* |
| 771 | 5063 Convert a string into a single composite character. |
| 5064 The character is the result of overstriking all the characters in | |
| 5065 the string. | |
| 5066 */ | |
| 5067 (string)) | |
| 5068 { | |
| 5069 CHECK_STRING (string); | |
| 5070 return make_char (lookup_composite_char (XSTRING_DATA (string), | |
| 5071 XSTRING_LENGTH (string))); | |
| 5072 } | |
| 5073 | |
| 826 | 5074 DEFUN ("composite-char-string", Fcomposite_char_string, 1, 1, 0, /* |
| 771 | 5075 Return a string of the characters comprising a composite character. |
| 5076 */ | |
| 5077 (ch)) | |
| 5078 { | |
| 867 | 5079 Ichar emch; |
| 771 | 5080 |
| 5081 CHECK_CHAR (ch); | |
| 5082 emch = XCHAR (ch); | |
| 867 | 5083 if (ichar_leading_byte (emch) != LEADING_BYTE_COMPOSITE) |
| 771 | 5084 invalid_argument ("Must be composite char", ch); |
| 5085 return composite_char_string (emch); | |
| 5086 } | |
| 5087 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 5088 | |
| 5089 | |
| 5090 /************************************************************************/ | |
| 5091 /* initialization */ | |
| 5092 /************************************************************************/ | |
| 5093 | |
| 5094 void | |
| 1204 | 5095 reinit_eistring_early (void) |
| 771 | 5096 { |
| 5097 the_eistring_malloc_zero_init = the_eistring_zero_init; | |
| 5098 the_eistring_malloc_zero_init.mallocp_ = 1; | |
| 5099 } | |
| 5100 | |
| 5101 void | |
| 814 | 5102 init_eistring_once_early (void) |
| 5103 { | |
| 1204 | 5104 reinit_eistring_early (); |
| 814 | 5105 } |
| 5106 | |
| 5107 void | |
| 771 | 5108 syms_of_text (void) |
| 5109 { | |
| 5110 DEFSUBR (Fmake_char); | |
| 3724 | 5111 DEFSUBR (Fsplit_char); |
| 771 | 5112 |
| 5113 #ifdef MULE | |
| 5114 DEFSUBR (Fchar_charset); | |
| 5115 DEFSUBR (Fchar_octet); | |
| 5116 | |
| 5117 #ifdef ENABLE_COMPOSITE_CHARS | |
| 5118 DEFSUBR (Fmake_composite_char); | |
| 5119 DEFSUBR (Fcomposite_char_string); | |
| 5120 #endif | |
| 5121 #endif /* MULE */ | |
| 5122 } | |
| 5123 | |
| 5124 void | |
| 5125 reinit_vars_of_text (void) | |
| 5126 { | |
| 5127 int i; | |
| 5128 | |
| 867 | 5129 conversion_in_dynarr_list = Dynarr_new2 (Ibyte_dynarr_dynarr, |
| 5130 Ibyte_dynarr *); | |
| 771 | 5131 conversion_out_dynarr_list = Dynarr_new2 (Extbyte_dynarr_dynarr, |
| 5132 Extbyte_dynarr *); | |
| 5133 | |
| 5134 for (i = 0; i <= MAX_BYTEBPOS_GAP_SIZE_3; i++) | |
| 5135 three_to_one_table[i] = i / 3; | |
| 5136 } | |
| 5137 | |
| 5138 void | |
| 5139 vars_of_text (void) | |
| 5140 { | |
|
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
5141 QSin_char_byte_conversion = build_defer_string ("(in char-byte conversion)"); |
| 1292 | 5142 staticpro (&QSin_char_byte_conversion); |
| 5143 QSin_internal_external_conversion = | |
|
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4526
diff
changeset
|
5144 build_defer_string ("(in internal-external conversion)"); |
| 1292 | 5145 staticpro (&QSin_internal_external_conversion); |
| 5146 | |
|
5863
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5147 DEFVAR_CONST_INT ("char-code-limit", &Vchar_code_limit /* |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5148 Exclusive upper bound on the values return by `char-int'. |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5149 |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5150 Note that not every fixnum with a value below `char-code-limit' has an |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5151 associated character; check with `char-int-p' if necessary. |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5152 */); |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5153 Vchar_code_limit = CHAR_CODE_LIMIT; |
|
15041705c196
Provide `char-code-limit', implement the GNU equivalent in terms of it.
Aidan Kehoe <kehoea@parhasard.net>
parents:
5785
diff
changeset
|
5154 |
| 771 | 5155 #ifdef ENABLE_COMPOSITE_CHARS |
| 5156 /* #### not dumped properly */ | |
| 5157 composite_char_row_next = 32; | |
| 5158 composite_char_col_next = 32; | |
| 5159 | |
| 5160 Vcomposite_char_string2char_hash_table = | |
|
5191
71ee43b8a74d
Add #'equalp as a hash test by default; add #'define-hash-table-test, GNU API
Aidan Kehoe <kehoea@parhasard.net>
parents:
5013
diff
changeset
|
5161 make_lisp_hash_table (500, HASH_TABLE_NON_WEAK, Qequal); |
| 771 | 5162 Vcomposite_char_char2string_hash_table = |
|
5191
71ee43b8a74d
Add #'equalp as a hash test by default; add #'define-hash-table-test, GNU API
Aidan Kehoe <kehoea@parhasard.net>
parents:
5013
diff
changeset
|
5163 make_lisp_hash_table (500, HASH_TABLE_NON_WEAK, Qeq); |
| 771 | 5164 staticpro (&Vcomposite_char_string2char_hash_table); |
| 5165 staticpro (&Vcomposite_char_char2string_hash_table); | |
| 5166 #endif /* ENABLE_COMPOSITE_CHARS */ | |
| 5167 } |
