comparison src/file-coding.c @ 2297:13a418960a88
[xemacs-hg @ 2004-09-22 02:05:42 by stephent]
various doc patches <87isa7awrh.fsf@tleepslib.sk.tsukuba.ac.jp>
author | stephent |
---|---|
date | Wed, 22 Sep 2004 02:06:52 +0000 |
parents | 04bc9d2f42c7 |
children | ecf1ebac70d8 |
2296:a58ea4d0d0cd | 2297:13a418960a88 |
---|---|
66 levels of likelihood to increase the reliability of the algorithm. | 66 levels of likelihood to increase the reliability of the algorithm. |
67 | 67 |
68 October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined. | 68 October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined. |
69 Removed the conditionals. | 69 Removed the conditionals. |
70 */ | 70 */ |
71 | |
72 /* sjt sez: | |
73 | |
74 There should be no elementary coding systems in the Lisp API, only chains. | |
75 Chains should be declared, not computed, as a sequence of coding formats. | |
76 (Probably the internal representation can be a vector for efficiency but | |
77 programmers would probably rather work with lists.) A stream has a token | |
78 type. Most streams are octet streams. Text is a stream of characters (in | |
79 _internal_ format; a file on disk is not text!) An octet-stream has no | |
80 implicit semantics, so its format must always be specified. The only type | |
81 currently having semantics is characters. This means that the chain [euc-jp | |
82 -> internal -> shift_jis] may be specified (euc-jp, shift_jis), and if no | |
83 euc-jp -> shift_jis converter is available, then the chain is automatically | |
84 constructed. (N.B. If we have fixed width buffers in the future, then we | |
85 could have ASCII -> 8-bit char -> 16-bit char -> ISO-2022-JP (with escape | |
86 sequences). | |
87 | |
88 EOL handling is a char <-> char coding. It should not be part of another | |
89 coding system except as a convenience for users. For text coding, | |
90 automatically insert EOL handlers between char <-> octet boundaries. | |
91 */ | |
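
To make sjt's point about declared chains concrete, here is a minimal sketch (the names and the `have_direct_converter` predicate are purely illustrative, not the XEmacs representation) of expanding a declared chain by inserting the internal pivot wherever no direct elementary converter exists:

```c
/* Illustrative only.  Expand a declared chain such as (euc-jp shift_jis)
   by inserting the "internal" pivot between any two formats that lack a
   direct elementary converter. */
#include <stddef.h>
#include <string.h>

/* In this sketch, direct converters exist only to or from the internal
   character representation. */
static int
have_direct_converter (const char *from, const char *to)
{
  return strcmp (from, "internal") == 0 || strcmp (to, "internal") == 0;
}

/* Copy DECLARED (N names) into OUT (capacity OUT_MAX), inserting
   "internal" where needed; return the expanded length. */
static size_t
expand_chain (const char *declared[], size_t n,
              const char *out[], size_t out_max)
{
  size_t i, len = 0;
  for (i = 0; i < n; i++)
    {
      if (i > 0 && len < out_max
          && !have_direct_converter (declared[i - 1], declared[i]))
        out[len++] = "internal";
      if (len < out_max)
        out[len++] = declared[i];
    }
  return len;
}
```

With this, a declared (euc-jp, shift_jis) expands to euc-jp -> internal -> shift_jis, which is exactly the automatic construction described above.
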
71 | 92 |
72 /* Comments about future work | 93 /* Comments about future work |
73 | 94 |
74 ------------------------------------------------------------------ | 95 ------------------------------------------------------------------ |
75 ABOUT DETECTION | 96 ABOUT DETECTION |
155 in CRLF format. The EOL detector (which really detects *plain text* | 176 in CRLF format. The EOL detector (which really detects *plain text* |
156 with a particular EOL type) would return at most level 0 for all | 177 with a particular EOL type) would return at most level 0 for all |
157 results until the text file is reached, whereas the base64, gzip or | 178 results until the text file is reached, whereas the base64, gzip or |
158 euc-jp decoders will return higher. Once the text file is reached, | 179 euc-jp decoders will return higher. Once the text file is reached, |
159 the EOL detector will return 0 or higher for the CRLF encoding, and | 180 the EOL detector will return 0 or higher for the CRLF encoding, and |
160 all other decoders will return 0 or lower; thus, we will successfully | 181 all other detectors will return 0 or lower; thus, we will successfully |
161 proceed through CRLF decoding, or at worst prompt the user. (The only | 182 proceed through CRLF decoding, or at worst prompt the user. (The only |
162 external-vs-internal distinction that might make sense here is to | 183 external-vs-internal distinction that might make sense here is to |
163 favor coding systems of the correct source type over those that | 184 favor coding systems of the correct source type over those that |
164 require conversion between external and internal; if done right, this | 185 require conversion between external and internal; if done right, this |
165 could allow the CRLF detector to return level 1 for all CRLF-encoded | 186 could allow the CRLF detector to return level 1 for all CRLF-encoded |
168 interfere with other decoders. On the other hand, this | 189 interfere with other decoders. On the other hand, this |
169 external-vs-internal distinction may not matter at all -- with | 190 external-vs-internal distinction may not matter at all -- with |
170 automatic internal-external conversion, CRLF decoding can occur | 191 automatic internal-external conversion, CRLF decoding can occur |
171 before or after decoding of euc-jp, base64, iso2022, or similar, | 192 before or after decoding of euc-jp, base64, iso2022, or similar, |
172 without any difference in the final results.) | 193 without any difference in the final results.) |
194 | |
195 #### What are we trying to say? In base64, the CRLF decoding before | |
196 base64 decoding is irrelevant; the CR and LF will be thrown out, as whitespace | |
197 is not significant in base64. | |
198 | |
199 [sjt considers all of this to be rather bogus. Ideas like "greater | |
200 certainty" and "distinctive" can and should be quantified. The issue | |
201 of proper table organization should be a question of optimization.] | |
202 | |
203 [sjt wonders if it might not be a good idea to use Unicode's newline | |
204 character as the internal representation so that (for non-Unicode | |
205 coding systems) we can catch EOL bugs on Unix too.] | |
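
As a rough illustration of the level scheme discussed above (hypothetical types, not the real detector API), choosing a decoder by the highest reported likelihood level might look like:

```c
/* Illustrative detector selection: each detector reports a likelihood
   level for the data, and the highest level wins.  A real implementation
   would prompt the user when the best level is 0 or below, or when
   several detectors tie. */
#include <stddef.h>
#include <limits.h>

typedef int (*detector_fn) (const unsigned char *data, size_t len);

struct detector
{
  const char *name;      /* e.g. "base64", "gzip", "euc-jp", "crlf" */
  detector_fn detect;    /* returns a level; 0 means "no opinion" */
};

static const char *
pick_coding_system (const struct detector *dets, size_t ndets,
                    const unsigned char *data, size_t len,
                    int *best_level_out)
{
  const char *best = NULL;
  int best_level = INT_MIN;
  size_t i;

  for (i = 0; i < ndets; i++)
    {
      int level = dets[i].detect (data, len);
      if (level > best_level)
        {
          best_level = level;
          best = dets[i].name;
        }
    }
  *best_level_out = best_level;
  return best;
}
```
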
173 | 206 |
174 -- There need to be two priority lists and two | 207 -- There need to be two priority lists and two |
175 category->coding-system lists. One is general, the other | 208 category->coding-system lists. One is general, the other |
176 category->langenv-specific. The user sets the former, the langenv | 209 category->langenv-specific. The user sets the former, the langenv |
177 category->the latter. The langenv-specific entries take precedence | 210 category->the latter. The langenv-specific entries take precedence |
219 then re-decode it, perhaps multiple times, as we get better | 252 then re-decode it, perhaps multiple times, as we get better |
220 detection results. | 253 detection results. |
221 | 254 |
222 -- Clearly some of these are more important than others. at the | 255 -- Clearly some of these are more important than others. at the |
223 very least, the "better means of presentation" should be | 256 very least, the "better means of presentation" should be |
224 implementation as soon as possibl, along with a very simple means | 257 implemented as soon as possible, along with a very simple means |
225 of fail-safe whenever the data is readily available, e.g. it's | 258 of fail-safe whenever the data is readily available, e.g. it's |
226 coming from a file, which is the most common scenario. | 259 coming from a file, which is the most common scenario. |
260 | |
261 --ben [at least that's what sjt thinks] | |
262 | |
263 ***** | |
264 | |
265 While this is clearly something of an improvement over earlier designs, | |
266 it doesn't deal with the most important issue: to do better than categories | |
267 (which in the medium term is mostly going to mean "which flavor of Unicode | |
268 is this?"), we need to look at statistical behavior rather than ruling out | |
269 categories via presence of specific sequences. This means the stream | |
270 processor should | |
271 | |
272 (1) keep octet distributions (octet, 2-, 3-, 4- octet sequences) | |
273 (2) in some kind of compressed form | |
274 (3) look for "skip features" (eg, characteristic behavior of leading | |
275 bytes for UTF-7, UTF-8, UTF-16, Mule code) | |
276 (4) pick up certain "simple" regexps | |
277 (5) provide "triggers" to determine when statistical detectors should be | |
278 invoked, such as octet count | |
279 (6) and "magic" like Unicode signatures or file(1) magic. | |
280 | |
281 --sjt | |
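
As a rough sketch of items (1) and (3) in sjt's list (and nothing like the actual detectors), a stream processor could tally octet frequencies while tracking a cheap UTF-8 "skip feature" -- every octet >= 0x80 must fit the lead/continuation pattern:

```c
/* Illustrative stream statistics: single-octet distribution plus a
   UTF-8 plausibility flag.  Longer n-gram tables and triggers would
   hang off the same structure. */
#include <stddef.h>
#include <string.h>

struct octet_stats
{
  unsigned long count[256];   /* single-octet distribution */
  unsigned long total;        /* can serve as a trigger threshold */
  int utf8_plausible;         /* cleared on an invalid UTF-8 sequence */
  int utf8_pending;           /* continuation octets still expected */
};

static void
octet_stats_init (struct octet_stats *s)
{
  memset (s, 0, sizeof *s);
  s->utf8_plausible = 1;
}

static void
octet_stats_feed (struct octet_stats *s, unsigned char c)
{
  s->count[c]++;
  s->total++;

  if (s->utf8_pending > 0)
    {
      if ((c & 0xC0) == 0x80)
        s->utf8_pending--;
      else
        {
          s->utf8_plausible = 0;
          s->utf8_pending = 0;
        }
    }
  else if (c >= 0x80)
    {
      if ((c & 0xE0) == 0xC0)      s->utf8_pending = 1;
      else if ((c & 0xF0) == 0xE0) s->utf8_pending = 2;
      else if ((c & 0xF8) == 0xF0) s->utf8_pending = 3;
      else                         s->utf8_plausible = 0;  /* bad lead octet */
    }
}
```
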
227 | 282 |
228 | 283 |
229 ------------------------------------------------------------------ | 284 ------------------------------------------------------------------ |
230 ABOUT FORMATS | 285 ABOUT FORMATS |
231 ------------------------------------------------------------------ | 286 ------------------------------------------------------------------ |
307 there's also a global one, and presumably all coding systems | 362 there's also a global one, and presumably all coding systems |
308 not on other list get appended to the end (and perhaps not | 363 not on other list get appended to the end (and perhaps not |
309 checked at all when doing safe-checking?). safe-checking | 364 checked at all when doing safe-checking?). safe-checking |
310 should work something like this: compile a list of all | 365 should work something like this: compile a list of all |
311 charsets used in the buffer, along with a count of chars | 366 charsets used in the buffer, along with a count of chars |
312 used. that way, "slightly unsafe" charsets can perhaps be | 367 used. that way, "slightly unsafe" coding systems can perhaps |
313 presented at the end, which will lose only a few characters | 368 be presented at the end, which will lose only a few characters |
314 and are perhaps what the users were looking for. | 369 and are perhaps what the users were looking for. |
370 | |
371 [sjt sez this whole step is a crock. If a universal coding system | |
372 is unacceptable, the user had better know what he/she is doing, | |
373 and explicitly specify a lossy encoding. | |
374 In principle, we can simply check for characters being writable as | |
375 we go along. Eg, via an "unrepresentable character handler." We | |
376 still have the buffer contents. If we can't successfully save, | |
377 then ask the user what to do. (Do we ever simply destroy previous | |
378 file version before completing a write?)] | |
315 | 379 |
316 2. when actually writing out, we need error checking in case an | 380 2. when actually writing out, we need error checking in case an |
317 individual char in a charset can't be written even though the | 381 individual char in a charset can't be written even though the |
318 charsets are safe. again, the user gets the choice of other | 382 charsets are safe. again, the user gets the choice of other |
319 reasonable coding systems. | 383 reasonable coding systems. |
320 | 384 |
385 [sjt -- something is very confused, here; safe charsets should be | |
386 defined as those charsets all of whose characters can be encoded.] | |
387 | |
321 3. same thing (error checking, list of alternatives, etc.) needs | 388 3. same thing (error checking, list of alternatives, etc.) needs |
322 to happen when reading! all of this will be a lot of work! | 389 to happen when reading! all of this will be a lot of work! |
323 | 390 |
324 | 391 |
325 --ben | 392 --ben |
393 | |
394 I don't much like Ben's scheme. First, this isn't an issue of I/O, | |
395 it's a coding issue. It can happen in many places, not just on stream | |
396 I/O. Error checking should take place on all translations. Second, | |
397 the two-pass algorithm should be avoided if possible. In some cases | |
398 (eg, output to a tty) we won't be able to go back and change the | |
399 previously output data. Third, the whole idea of having a buffer full | |
400 of arbitrary characters which we're going to somehow shoehorn into a | |
401 file based on some twit user's less than informed idea of a coding system | |
402 is kind of laughable from the start. If we're going to say that a buffer | |
403 has a coding system, shouldn't we enforce restrictions on what you can | |
404 put into it? Fourth, what's the point of having safe charsets if some | |
405 of the characters in them are unsafe? Fifth, what makes you think we're | |
406 going to have a list of charsets? It seems to me that there might be | |
407 reasons to have user-defined charsets (eg, "German" vs "French" subsets | |
408 of ISO 8859/15). Sixth, the idea of having language environment determine | |
409 precedence doesn't seem very useful to me. Users who are working with a | |
410 language that corresponds to the language environment are not going to | |
411 run into safe charsets problems. It's users who are outside of their | |
412 usual language environment who run into trouble. Also, the reason for | |
413 specifying anything other than a universal coding system is normally | |
414 restrictions imposed by other users or applications. Seventh, the | |
415 statistical feedback isn't terribly useful. Users rarely "want" a | |
416 coding system, they want their file saved in a useful way. We could | |
417 add a FORCE argument to conversions for those who really want a specific | |
418 coding system. But mostly, a user might want to edit out a few unsafe | |
419 characters. So (up to some maximum) we should keep a list of unsafe | |
420 text positions, and provide a convenient function for traversing them. | |
421 | |
422 --sjt | |
326 */ | 423 */ |
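
sjt's closing suggestion -- keep a bounded list of unsafe text positions and provide a way to walk them -- could look something like the following sketch. `encodable_p` is a stand-in for whatever per-character check a coding system provides; it is not a real XEmacs function.

```c
/* Record (up to a maximum) the positions of characters the target
   coding system cannot represent, so the user can be walked through
   them instead of silently losing data. */
#include <stddef.h>

#define MAX_UNSAFE 64

struct unsafe_positions
{
  size_t pos[MAX_UNSAFE];
  size_t n;        /* number of positions recorded */
  size_t dropped;  /* how many more there were beyond MAX_UNSAFE */
};

static void
collect_unsafe (const unsigned int *chars, size_t len,
                int (*encodable_p) (unsigned int),
                struct unsafe_positions *out)
{
  size_t i;
  out->n = 0;
  out->dropped = 0;
  for (i = 0; i < len; i++)
    if (!encodable_p (chars[i]))
      {
        if (out->n < MAX_UNSAFE)
          out->pos[out->n++] = i;
        else
          out->dropped++;
      }
}
```
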
327 | 424 |
328 #include <config.h> | 425 #include <config.h> |
329 #include "lisp.h" | 426 #include "lisp.h" |
330 | 427 |
495 | 592 |
496 #ifdef HAVE_ZLIB | 593 #ifdef HAVE_ZLIB |
497 Lisp_Object Qgzip; | 594 Lisp_Object Qgzip; |
498 #endif | 595 #endif |
499 | 596 |
500 /* Maps coding system names to either coding system objects or (for | 597 /* Maps symbols (coding system names) to either coding system objects or |
501 aliases) other names. */ | 598 (for aliases) other names. */ |
502 static Lisp_Object Vcoding_system_hash_table; | 599 static Lisp_Object Vcoding_system_hash_table; |
503 | 600 |
504 int enable_multibyte_characters; | 601 int enable_multibyte_characters; |
505 | 602 |
506 EXFUN (Fcopy_coding_system, 2); | 603 EXFUN (Fcopy_coding_system, 2); |
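
The hash table above maps a symbol either to a coding system object or, for an alias, to another symbol, so a lookup has to chase alias entries until it reaches an object. A sketch of that loop (not the actual lookup function in this file), using only accessors visible here and bounding the depth in case of an accidental alias cycle, might look like:

```c
/* Illustrative alias resolution against Vcoding_system_hash_table. */
static Lisp_Object
resolve_coding_system_name (Lisp_Object name)
{
  Lisp_Object entry = name;
  int depth;

  /* Aliases map symbol -> symbol; real entries map symbol -> object. */
  for (depth = 0; depth < 32; depth++)
    {
      Lisp_Object next = Fgethash (entry, Vcoding_system_hash_table, Qnil);
      if (NILP (next))
        return Qnil;                /* undefined name */
      if (CODING_SYSTEMP (next))
        return next;                /* found the real coding system */
      entry = next;                 /* another alias; keep following */
    }
  return Qnil;                      /* give up on a cycle */
}
```
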
908 cscl->internal : cscl->normal) | 1005 cscl->internal : cscl->normal) |
909 *coding_system_list = Fcons (key, *coding_system_list); | 1006 *coding_system_list = Fcons (key, *coding_system_list); |
910 return 0; | 1007 return 0; |
911 } | 1008 } |
912 | 1009 |
1010 /* #### should we specify a convention for "all coding systems"? */ | |
913 DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /* | 1011 DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /* |
914 Return a list of the names of all defined coding systems. | 1012 Return a list of the names of all defined coding systems. |
915 If INTERNAL is nil, only the normal (non-internal) coding systems are | 1013 If INTERNAL is nil, only the normal (non-internal) coding systems are |
916 included. (Internal coding systems are created for various internal | 1014 included. (Internal coding systems are created for various internal |
917 purposes, such as implementing EOL types of CRLF and CR; generally, you do | 1015 purposes, such as implementing EOL types of CRLF and CR; generally, you do |
1556 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not | 1654 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not |
1557 yet implemented). `utf-16' is the basic two-byte encoding; | 1655 yet implemented). `utf-16' is the basic two-byte encoding; |
1558 `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible | 1656 `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible |
1559 variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using | 1657 variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using |
1560 only characters that will safely pass through all mail gateways. | 1658 only characters that will safely pass through all mail gateways. |
1659 [[ This should be \"transformation format\". There should also be | |
1660 `ucs-2' (or `bmp' -- no surrogates) and `utf-32' (range checked). ]] | |
1561 | 1661 |
1562 'little-endian | 1662 'little-endian |
1563 If non-nil, `utf-16' and `ucs-4' will write out the groups of two | 1663 If non-nil, `utf-16' and `ucs-4' will write out the groups of two |
1564 or four bytes little-endian instead of big-endian. This is required, | 1664 or four bytes little-endian instead of big-endian. This is required, |
1565 for example, under Windows. | 1665 for example, under Windows. |
1567 'need-bom | 1667 'need-bom |
1568 If non-nil, a byte order mark (BOM, or Unicode FEFF) should be | 1668 If non-nil, a byte order mark (BOM, or Unicode FEFF) should be |
1569 written out at the beginning of the data. This serves both to | 1669 written out at the beginning of the data. This serves both to |
1570 identify the endianness of the following data and to mark the | 1670 identify the endianness of the following data and to mark the |
1571 data as Unicode (at least, this is how Windows uses it). | 1671 data as Unicode (at least, this is how Windows uses it). |
1572 | 1672 [[ The correct term is \"signature\", since this technique may also |
1673 be used with UTF-8. That is the term used in the standard. ]] | |
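
Sniffing the signature that the `need-bom` property writes out only requires looking at the first few octets. A stand-alone sketch, returning descriptive labels rather than actual coding-system names:

```c
/* Detect a Unicode signature ("BOM") at the start of a buffer. */
#include <stddef.h>

static const char *
sniff_unicode_signature (const unsigned char *buf, size_t len)
{
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    return "utf-8 (signature)";
  if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
    return "utf-16 big-endian";
  if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
    return "utf-16 little-endian";
  return NULL;   /* no signature; fall back to other detection */
}
```
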
1573 | 1674 |
1574 | 1675 |
1575 The following additional properties are recognized if TYPE is | 1676 The following additional properties are recognized if TYPE is |
1576 'mswindows-multibyte: | 1677 'mswindows-multibyte: |
1577 | 1678 |
1594 `mswindows-system-default-locale', respectively. | 1695 `mswindows-system-default-locale', respectively. |
1595 | 1696 |
1596 | 1697 |
1597 | 1698 |
1598 The following additional properties are recognized if TYPE is 'undecided: | 1699 The following additional properties are recognized if TYPE is 'undecided: |
1700 [[ Doesn't GNU use \"detect-*\" for the following two? ]] | |
1599 | 1701 |
1600 'do-eol | 1702 'do-eol |
1601 Do EOL detection. | 1703 Do EOL detection. |
1602 | 1704 |
1603 'do-coding | 1705 'do-coding |
1667 (object)) | 1769 (object)) |
1668 { | 1770 { |
1669 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil)) | 1771 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil)) |
1670 ? Qt : Qnil; | 1772 ? Qt : Qnil; |
1671 } | 1773 } |
1774 | |
1775 /* #### Shouldn't this really be a find/get pair? */ | |
1672 | 1776 |
1673 DEFUN ("coding-system-alias-p", Fcoding_system_alias_p, 1, 1, 0, /* | 1777 DEFUN ("coding-system-alias-p", Fcoding_system_alias_p, 1, 1, 0, /* |
1674 Return t if OBJECT is a coding system alias. | 1778 Return t if OBJECT is a coding system alias. |
1675 All coding system aliases are created by `define-coding-system-alias'. | 1779 All coding system aliases are created by `define-coding-system-alias'. |
1676 */ | 1780 */ |
1792 } | 1896 } |
1793 | 1897 |
1794 Fputhash (alias, aliasee, Vcoding_system_hash_table); | 1898 Fputhash (alias, aliasee, Vcoding_system_hash_table); |
1795 | 1899 |
1796 /* Set up aliases for subsidiaries. | 1900 /* Set up aliases for subsidiaries. |
1797 #### There must be a better way to handle subsidiary coding systems. */ | 1901 #### There must be a better way to handle subsidiary coding systems. |
1902 Inquiring Minds Want To Know: shouldn't they always be chains? */ | |
1798 { | 1903 { |
1799 static const char *suffixes[] = { "-unix", "-dos", "-mac" }; | 1904 static const char *suffixes[] = { "-unix", "-dos", "-mac" }; |
1800 int i; | 1905 int i; |
1801 for (i = 0; i < countof (suffixes); i++) | 1906 for (i = 0; i < countof (suffixes); i++) |
1802 { | 1907 { |
1867 | 1972 |
1868 DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io, | 1973 DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io, |
1869 1, 1, 0, /* | 1974 1, 1, 0, /* |
1870 Return the coding system actually used for I/O. | 1975 Return the coding system actually used for I/O. |
1871 In some cases (e.g. when a particular EOL type is specified) this won't be | 1976 In some cases (e.g. when a particular EOL type is specified) this won't be |
1872 the coding system itself. This can be useful when trying to track down | 1977 the coding system itself. This can be useful when trying to determine |
1873 more closely how exactly data is decoded. | 1978 precisely how data was decoded. |
1874 */ | 1979 */ |
1875 (coding_system)) | 1980 (coding_system)) |
1876 { | 1981 { |
1877 Lisp_Object canon; | 1982 Lisp_Object canon; |
1878 | 1983 |
2000 stream for both. "Decoding" may involve the extra step of autodetection | 2105 stream for both. "Decoding" may involve the extra step of autodetection |
2001 of the data format, but that's only because of the conventional | 2106 of the data format, but that's only because of the conventional |
2002 definition of decoding as converting from external- to | 2107 definition of decoding as converting from external- to |
2003 internal-formatted data. | 2108 internal-formatted data. |
2004 | 2109 |
2110 [[ REWRITE ME! ]] | |
2111 | |
2005 #### We really need to abstract out the concept of "data formats" and | 2112 #### We really need to abstract out the concept of "data formats" and |
2006 define "converters" that convert from and to specified formats, | 2113 define "converters" that convert from and to specified formats, |
2007 eliminating the idea of decoding and encoding. When specifying a | 2114 eliminating the idea of decoding and encoding. When specifying a |
2008 conversion process, we need to give the data formats themselves, not the | 2115 conversion process, we need to give the data formats themselves, not the |
2009 conversion processes -- e.g. a coding system called "Unicode->multibyte" | 2116 conversion processes -- e.g. a coding system called "Unicode->multibyte" |
2050 after existing "rejected" data from the last conversion. */ | 2157 after existing "rejected" data from the last conversion. */ |
2051 Bytecount rejected = Dynarr_length (str->convert_from); | 2158 Bytecount rejected = Dynarr_length (str->convert_from); |
2052 /* #### 1024 is arbitrary; we really need to separate 0 from EOF, | 2159 /* #### 1024 is arbitrary; we really need to separate 0 from EOF, |
2053 and when we get 0, keep taking more data until we don't get 0 -- | 2160 and when we get 0, keep taking more data until we don't get 0 -- |
2054 we don't know how much data the conversion routine might need | 2161 we don't know how much data the conversion routine might need |
2055 before it can generate any data of its own */ | 2162 before it can generate any data of its own (eg, bzip2). */ |
2056 Bytecount readmore = | 2163 Bytecount readmore = |
2057 str->one_byte_at_a_time ? (Bytecount) 1 : | 2164 str->one_byte_at_a_time ? (Bytecount) 1 : |
2058 max (size, (Bytecount) 1024); | 2165 max (size, (Bytecount) 1024); |
2059 | 2166 |
2060 Dynarr_add_many (str->convert_from, 0, readmore); | 2167 Dynarr_add_many (str->convert_from, 0, readmore); |
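
The "#### 1024 is arbitrary" comment above asks for a loop that keeps feeding the converter until it either produces output or the source truly hits EOF, since a converter such as bzip2 may need a whole block before it can emit anything. Here is a self-contained sketch of that pattern; the callback types and names are invented, not the Lstream API:

```c
#include <stddef.h>
#include <string.h>

/* READ_SRC returns 0 only at true EOF; CONVERT may consume input yet
   produce nothing until it has seen enough data. */
typedef size_t (*source_fn) (void *ctx, unsigned char *buf, size_t want);
typedef size_t (*convert_fn) (void *ctx, const unsigned char *in, size_t inlen,
                              unsigned char *out, size_t outmax,
                              size_t *consumed);

struct conv_stream
{
  unsigned char pending[4096];   /* raw data the converter hasn't used yet */
  size_t pending_len;
  void *src_ctx, *cvt_ctx;
  source_fn read_src;
  convert_fn convert;
};

static size_t
conv_stream_read (struct conv_stream *s, unsigned char *out, size_t outmax)
{
  for (;;)
    {
      size_t consumed = 0;
      size_t produced = s->convert (s->cvt_ctx, s->pending, s->pending_len,
                                    out, outmax, &consumed);
      /* Drop whatever the converter consumed. */
      memmove (s->pending, s->pending + consumed, s->pending_len - consumed);
      s->pending_len -= consumed;

      if (produced > 0)
        return produced;                  /* real output at last */

      if (s->pending_len == sizeof s->pending)
        return 0;                         /* converter stuck; give up */

      {
        size_t got = s->read_src (s->src_ctx,
                                  s->pending + s->pending_len,
                                  sizeof s->pending - s->pending_len);
        if (got == 0)
          return 0;                       /* true EOF, nothing more to emit */
        s->pending_len += got;
      }
    }
}
```
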
2460 ostr = XLSTREAM (outstream); | 2567 ostr = XLSTREAM (outstream); |
2461 istr = XLSTREAM (instream); | 2568 istr = XLSTREAM (instream); |
2462 | 2569 |
2463 /* The chain of streams looks like this: | 2570 /* The chain of streams looks like this: |
2464 | 2571 |
2465 [BUFFER] <----- send through | 2572 [BUFFER] <----- (( read from/send to loop )) |
2466 ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is | 2573 ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is |
2467 in bytes] | 2574 in bytes] |
2468 ------> [ENCODE/DECODE AS SPECIFIED] | 2575 ------> [ENCODE/DECODE AS SPECIFIED] |
2469 ------> [BYTE->CHAR i.e. DECODE AS BINARY | 2576 ------> [BYTE->CHAR i.e. DECODE AS BINARY |
2470 if sink is in bytes] | 2577 if sink is in bytes] |
2471 ------> [AUTODETECT EOL if | 2578 ------> [AUTODETECT EOL if |
2472 we're decoding and | 2579 we're decoding and |
2473 coding system calls | 2580 coding system calls |
2474 for this] | 2581 for this] |
2475 ------> [BUFFER] | 2582 ------> [BUFFER] |
2583 */ | |
2584 /* Of course, this is just horrible. BYTE<->CHAR should only be available | |
2585 to I/O routines. It should not be visible to Mule proper. | |
2586 | |
2587 A comment on the implementation. Hrvoje and Kyle worry about the | |
2588 inefficiency of repeated copying among buffers that chained coding | |
2589 systems entail. But this may not be as time inefficient as it appears | |
2590 in the Mule ("house rules") context. The issue is how do you do chain | |
2591 coding systems without copying? In theory you could have | |
2592 | |
2593 IChar external_to_raw (ExtChar *cp, State *s); | |
2594 IChar decode_utf16 (IChar c, State *s); | |
2595 IChar decode_crlf (ExtChar *cp, State *s); | |
2596 | |
2597 typedef Ichar (*Converter[]) (Ichar, State*); | |
2598 | |
2599 Converter utf16[2] = { &decode_utf16, &decode_crlf }; | |
2600 | |
2601 void convert (ExtChar *inbuf, IChar *outbuf, Converter cvtr) | |
2602 { | |
2603 int i; | |
2604 ExtChar c; | |
2605 State s; | |
2606 | |
2607 while (c = external_to_raw (*inbuf++, &s)) | |
2608 { | |
2609 for (i = 0; i < sizeof(cvtr)/sizeof(Converter); ++i) | |
2610 if (s.ready) | |
2611 c = (*cvtr[i]) (c, &s); | |
2612 } | |
2613 if (s.ready) | |
2614 *outbuf++ = c; | |
2615 } | |
2616 | |
2617 But this is a lot of function calls; what Ben is doing is basically | |
2618 reducing this to one call per buffer-full. The only way to avoid this | |
2619 is to hardcode all the "interesting" coding systems, maybe using | |
2620 inline or macros to give structure. But this is still a huge amount | |
2621 of work, and code. | |
2622 | |
2623 One advantage to the call-per-char approach is that we might be able | |
2624 to do something about the marker/extent destruction that coding | |
2625 normally entails. | |
2476 */ | 2626 */ |
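
sjt's pseudocode above has a few slips as written: the output step sits outside the loop, `sizeof` is applied to an array parameter that has decayed to a pointer, and IChar/Ichar are spelled inconsistently. A tightened version that actually compiles keeps the same shape; the types are invented and the converters are stubs:

```c
#include <stddef.h>

typedef int ExtChar;
typedef int IChar;

typedef struct { int ready; } State;

typedef IChar (*Converter) (IChar c, State *s);

/* Stubs standing in for real converters. */
static IChar external_to_raw (ExtChar c, State *s) { s->ready = 1; return (IChar) c; }
static IChar decode_utf16 (IChar c, State *s) { (void) s; return c; }
static IChar decode_crlf  (IChar c, State *s) { (void) s; return c; }

static const Converter utf16_chain[] = { decode_utf16, decode_crlf };

static void
convert (const ExtChar *inbuf, IChar *outbuf,
         const Converter *cvtr, size_t ncvtr)
{
  IChar c;
  State s = { 0 };

  while ((c = external_to_raw (*inbuf++, &s)) != 0)
    {
      size_t i;
      for (i = 0; i < ncvtr; ++i)
        if (s.ready)
          c = cvtr[i] (c, &s);
      if (s.ready)
        *outbuf++ = c;   /* emit inside the loop, once per character */
    }
}

/* e.g.: convert (in, out, utf16_chain,
                  sizeof utf16_chain / sizeof utf16_chain[0]); */
```

The cost sjt points out still stands: this is one function call per converter per character, which is what the buffer-at-a-time chaining avoids.
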
2477 while (1) | 2627 while (1) |
2478 { | 2628 { |
2479 char tempbuf[1024]; /* some random amount */ | 2629 char tempbuf[1024]; /* some random amount */ |
2480 Charbpos newpos, even_newer_pos; | 2630 Charbpos newpos, even_newer_pos; |
2795 static void | 2945 static void |
2796 chain_finalize_coding_stream_1 (struct chain_coding_stream *data) | 2946 chain_finalize_coding_stream_1 (struct chain_coding_stream *data) |
2797 { | 2947 { |
2798 if (data->lstreams) | 2948 if (data->lstreams) |
2799 { | 2949 { |
2800 /* Order of deletion is important here! Delete from the head of the | 2950 /* During GC, these objects are unmarked, and are about to be freed. |
2801 chain and work your way towards the tail. In general, when you | 2951 We do NOT want them on the free list, and that will cause lots of |
2802 delete an object, there should be *NO* pointers to it anywhere. | 2952 nastiness including crashes. Just let them be freed normally. */ |
2803 Deleting back-to-front would be a problem because there are | |
2804 pointers going forward. If there were pointers in both | |
2805 directions, you'd have to disconnect the pointers to a particular | |
2806 object before deleting it. */ | |
2807 if (!gc_in_progress) | 2953 if (!gc_in_progress) |
2808 { | 2954 { |
2809 int i; | 2955 int i; |
2810 /* During GC, these objects are unmarked, and are about to be | 2956 /* Order of deletion is important here! Delete from the head of |
2811 freed. We do NOT want them on the free list, and that will | 2957 the chain and work your way towards the tail. In general, |
2812 cause lots of nastiness including crashes. Just let them be | 2958 when you delete an object, there should be *NO* pointers to it |
2813 freed normally. */ | 2959 anywhere. Deleting back-to-front would be a problem because |
2960 there are pointers going forward. If there were pointers in | |
2961 both directions, you'd have to disconnect the pointers to a | |
2962 particular object before deleting it. */ | |
2814 for (i = 0; i < data->lstream_count; i++) | 2963 for (i = 0; i < data->lstream_count; i++) |
2815 Lstream_delete (XLSTREAM ((data->lstreams)[i])); | 2964 Lstream_delete (XLSTREAM ((data->lstreams)[i])); |
2816 } | 2965 } |
2817 xfree (data->lstreams, Lisp_Object *); | 2966 xfree (data->lstreams, Lisp_Object *); |
2818 } | 2967 } |
2925 /************************************************************************/ | 3074 /************************************************************************/ |
2926 | 3075 |
2927 /* "No conversion"; used for binary files. We use quotes because there | 3076 /* "No conversion"; used for binary files. We use quotes because there |
2928 really is some conversion being applied (it does byte<->char | 3077 really is some conversion being applied (it does byte<->char |
2929 conversion), but it appears to the user as if the text is read in | 3078 conversion), but it appears to the user as if the text is read in |
2930 without conversion. */ | 3079 without conversion. |
3080 | |
3081 #### Shouldn't we _call_ it that, then? And while we're at it, | |
3082 separate it into "to_internal" and "to_external"? */ | |
2931 DEFINE_CODING_SYSTEM_TYPE (no_conversion); | 3083 DEFINE_CODING_SYSTEM_TYPE (no_conversion); |
2932 | 3084 |
2933 /* This is used when reading in "binary" files -- i.e. files that may | 3085 /* This is used when reading in "binary" files -- i.e. files that may |
2934 contain all 256 possible byte values and that are not to be | 3086 contain all 256 possible byte values and that are not to be |
2935 interpreted as being in any particular encoding. */ | 3087 interpreted as being in any particular encoding. */ |
2971 assert (ch == 0); | 3123 assert (ch == 0); |
2972 if (c == LEADING_BYTE_LATIN_ISO8859_1 || | 3124 if (c == LEADING_BYTE_LATIN_ISO8859_1 || |
2973 c == LEADING_BYTE_CONTROL_1) | 3125 c == LEADING_BYTE_CONTROL_1) |
2974 ch = c; | 3126 ch = c; |
2975 else | 3127 else |
3128 /* #### This is just plain unacceptable. */ | |
2976 Dynarr_add (dst, '~'); /* untranslatable character */ | 3129 Dynarr_add (dst, '~'); /* untranslatable character */ |
2977 } | 3130 } |
2978 else | 3131 else |
2979 { | 3132 { |
2980 if (ch == LEADING_BYTE_LATIN_ISO8859_1) | 3133 if (ch == LEADING_BYTE_LATIN_ISO8859_1) |
3022 character-to-character, and works (when encoding) *BEFORE* sending | 3175 character-to-character, and works (when encoding) *BEFORE* sending |
3023 data to the main encoding routine -- thus, that routine must handle | 3176 data to the main encoding routine -- thus, that routine must handle |
3024 different EOL types itself if it does line-oriented type processing. | 3177 different EOL types itself if it does line-oriented type processing. |
3025 This is unavoidable because we don't know whether the output of the | 3178 This is unavoidable because we don't know whether the output of the |
3026 main encoding routine is ASCII compatible (Unicode is definitely not, | 3179 main encoding routine is ASCII compatible (Unicode is definitely not, |
3027 for example). | 3180 for example). [[ sjt sez this is bogus. There should be _no_ EOL |
3181 processing (or processing of any kind) after conversion to external. ]] | |
3028 | 3182 |
3029 There is one parameter: `subtype', either `cr', `lf', `crlf', or nil. | 3183 There is one parameter: `subtype', either `cr', `lf', `crlf', or nil. |
3030 */ | 3184 */ |
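
For the `crlf` and `cr` subtypes, the char-to-char EOL decode described above reduces to a simple rewrite. A stand-alone sketch follows; a real stream version would also have to hold back a trailing CR in case its LF arrives in the next chunk:

```c
/* Convert CR and CRLF line endings in SRC to LF, writing to DST
   (which must hold at least LEN chars).  Returns the output length. */
#include <stddef.h>

static size_t
decode_eol (const char *src, size_t len, char *dst)
{
  size_t i, out = 0;
  for (i = 0; i < len; i++)
    {
      if (src[i] == '\r')
        {
          dst[out++] = '\n';
          if (i + 1 < len && src[i + 1] == '\n')
            i++;                /* swallow the LF of a CRLF pair */
        }
      else
        dst[out++] = src[i];
    }
  return out;
}
```
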
3031 | 3185 |
3032 struct convert_eol_coding_system | 3186 struct convert_eol_coding_system |
4808 | 4962 |
4809 QScoding_system_cookie = build_string (";;;###coding system: "); | 4963 QScoding_system_cookie = build_string (";;;###coding system: "); |
4810 staticpro (&QScoding_system_cookie); | 4964 staticpro (&QScoding_system_cookie); |
4811 | 4965 |
4812 #ifdef HAVE_DEFAULT_EOL_DETECTION | 4966 #ifdef HAVE_DEFAULT_EOL_DETECTION |
4813 /* WARNING: The existing categories are intimately tied to the function | 4967 /* #### Find a more appropriate place for this comment. |
4968 WARNING: The existing categories are intimately tied to the function | |
4814 `coding-system-category' in coding.el. If you change a category, or | 4969 `coding-system-category' in coding.el. If you change a category, or |
4815 change the layout of any coding system associated with a category, you | 4970 change the layout of any coding system associated with a category, you |
4816 need to check that function and make sure it's written properly. */ | 4971 need to check that function and make sure it's written properly. */ |
4817 | 4972 |
4818 Fprovide (intern ("unix-default-eol-detection")); | 4973 Fprovide (intern ("unix-default-eol-detection")); |
4871 Information is displayed on stderr. | 5026 Information is displayed on stderr. |
4872 */ ); | 5027 */ ); |
4873 Vdebug_coding_detection = Qnil; | 5028 Vdebug_coding_detection = Qnil; |
4874 #endif | 5029 #endif |
4875 } | 5030 } |
5031 | |
5032 /* #### reformat this for consistent appearance? */ | |
4876 | 5033 |
4877 void | 5034 void |
4878 complex_vars_of_file_coding (void) | 5035 complex_vars_of_file_coding (void) |
4879 { | 5036 { |
4880 Fmake_coding_system | 5037 Fmake_coding_system |