comparison src/file-coding.c @ 2297:13a418960a88

[xemacs-hg @ 2004-09-22 02:05:42 by stephent] various doc patches <87isa7awrh.fsf@tleepslib.sk.tsukuba.ac.jp>
author stephent
date Wed, 22 Sep 2004 02:06:52 +0000
parents 04bc9d2f42c7
children ecf1ebac70d8
comparing 2296:a58ea4d0d0cd with 2297:13a418960a88
@@ -66,6 +66,27 @@
 levels of likelihood to increase the reliability of the algorithm.
 
 October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined.
 Removed the conditionals.
 */
+
+/* sjt sez:
+
+There should be no elementary coding systems in the Lisp API, only chains.
+Chains should be declared, not computed, as a sequence of coding formats.
+(Probably the internal representation can be a vector for efficiency, but
+programmers would probably rather work with lists.) A stream has a token
+type. Most streams are octet streams. Text is a stream of characters (in
+_internal_ format; a file on disk is not text!) An octet-stream has no
+implicit semantics, so its format must always be specified. The only type
+currently having semantics is characters. This means that the chain [euc-jp
+-> internal -> shift_jis] may be specified (euc-jp, shift_jis), and if no
+euc-jp -> shift_jis converter is available, then the chain is automatically
+constructed. (N.B. If we have fixed-width buffers in the future, then we
+could have ASCII -> 8-bit char -> 16-bit char -> ISO-2022-JP (with escape
+sequences).)
+
+EOL handling is a char <-> char coding. It should not be part of another
+coding system except as a convenience for users. For text coding,
+automatically insert EOL handlers between char <-> octet boundaries.
+*/
 
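sjt's note above describes automatic chain construction: a chain is requested as a (from, to) pair, and if no direct converter exists, the conversion pivots through the internal format. A minimal sketch of that idea follows; `Converter`, `build_chain`, and the `"internal"` pivot name are illustrative assumptions, not the real file-coding.c API.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for a declared conversion step. */
typedef struct
{
  const char *from;
  const char *to;
} Converter;

/* Build the conversion chain for FROM -> TO.  HAVE_DIRECT says whether
   a direct FROM -> TO converter is registered.  Returns the number of
   converters written into CHAIN (at most 2). */
static int
build_chain (const char *from, const char *to, int have_direct,
             Converter chain[2])
{
  if (have_direct)
    {
      chain[0].from = from;
      chain[0].to = to;
      return 1;
    }
  /* No direct converter: construct FROM -> internal -> TO automatically. */
  chain[0].from = from;
  chain[0].to = "internal";
  chain[1].from = "internal";
  chain[1].to = to;
  return 2;
}
```

The declared (euc-jp, shift_jis) chain thus becomes euc-jp -> internal -> shift_jis exactly when no direct converter is registered.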
@@ -72,4 +93,4 @@
 /* Comments about future work
 
 ------------------------------------------------------------------
 ABOUT DETECTION
@@ -155,11 +176,11 @@
 in CRLF format. The EOL detector (which really detects *plain text*
 with a particular EOL type) would return at most level 0 for all
 results until the text file is reached, whereas the base64, gzip or
 euc-jp decoders will return higher. Once the text file is reached,
 the EOL detector will return 0 or higher for the CRLF encoding, and
-all other decoders will return 0 or lower; thus, we will successfully
+all other detectors will return 0 or lower; thus, we will successfully
 proceed through CRLF decoding, or at worst prompt the user. (The only
 external-vs-internal distinction that might make sense here is to
 favor coding systems of the correct source type over those that
 require conversion between external and internal; if done right, this
 could allow the CRLF detector to return level 1 for all CRLF-encoded
@@ -168,6 +189,18 @@
 interfere with other decoders. On the other hand, this
 external-vs-internal distinction may not matter at all -- with
 automatic internal-external conversion, CRLF decoding can occur
 before or after decoding of euc-jp, base64, iso2022, or similar,
 without any difference in the final results.)
+
+#### What are we trying to say? In base64, the CRLF decoding before
+base64 decoding is irrelevant; it will be thrown out, as whitespace
+is not significant in base64.
+
+[sjt considers all of this to be rather bogus. Ideas like "greater
+certainty" and "distinctive" can and should be quantified. The issue
+of proper table organization should be a question of optimization.]
+
+[sjt wonders if it might not be a good idea to use Unicode's newline
+character as the internal representation so that (for non-Unicode
+coding systems) we can catch EOL bugs on Unix too.]
 
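The likelihood scheme discussed above boils down to each detector reporting a level for its category and the best level winning. A minimal sketch, assuming a fixed detector count and integer levels (both illustrative, not the actual detection API):

```c
#include <assert.h>

#define NUM_DETECTORS 4         /* illustrative only */

/* Return the index of the most likely category given each detector's
   reported level (e.g. negative = unlikely, 0 = consistent but no
   positive evidence, positive = likely).  Ties go to the earlier,
   i.e. higher-priority, detector, mirroring the priority-list idea. */
static int
most_likely_category (const int levels[NUM_DETECTORS])
{
  int best = 0, i;
  for (i = 1; i < NUM_DETECTORS; i++)
    if (levels[i] > levels[best])
      best = i;
  return best;
}
```

In the CRLF example above, the EOL detector's level stays at or below 0 until plain text is reached, at which point it outranks the other detectors.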
@@ -174,4 +207,4 @@
 -- There need to be two priority lists and two
 category->coding-system lists. One is general, the other
 langenv-specific. The user sets the former, the langenv
 the latter. The langenv-specific entries take precedence
@@ -219,10 +252,32 @@
 then re-decode it, perhaps multiple times, as we get better
 detection results.
 
 -- Clearly some of these are more important than others. At the
 very least, the "better means of presentation" should be
-implementation as soon as possibl, along with a very simple means
+implemented as soon as possible, along with a very simple means
 of fail-safe whenever the data is readily available, e.g. it's
 coming from a file, which is the most common scenario.
+
+--ben [at least that's what sjt thinks]
+
+*****
+
+While this is clearly something of an improvement over earlier designs,
+it doesn't deal with the most important issue: to do better than categories
+(which in the medium term is mostly going to mean "which flavor of Unicode
+is this?"), we need to look at statistical behavior rather than ruling out
+categories via presence of specific sequences. This means the stream
+processor should
+
+(1) keep octet distributions (octet, 2-, 3-, 4-octet sequences)
+(2) in some kind of compressed form
+(3) look for "skip features" (eg, characteristic behavior of leading
+bytes for UTF-7, UTF-8, UTF-16, Mule code)
+(4) pick up certain "simple" regexps
+(5) provide "triggers" to determine when statistical detectors should be
+invoked, such as octet count
+(6) and "magic" like Unicode signatures or file(1) magic.
+
+--sjt
 
 
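Points (1) and (5) of sjt's list can be sketched concretely: the stream processor accumulates an octet distribution and "triggers" the statistical detectors only once enough data has been seen. The structure and threshold below are illustrative assumptions, not an actual XEmacs interface:

```c
#include <assert.h>

/* Single-octet distribution for a stream being detected. */
typedef struct
{
  unsigned long count[256];     /* how often each octet occurred */
  unsigned long total;          /* total octets seen so far */
} OctetDist;

/* Accumulate N octets of BUF into the distribution. */
static void
octet_dist_add (OctetDist *d, const unsigned char *buf, unsigned long n)
{
  unsigned long i;
  for (i = 0; i < n; i++)
    d->count[buf[i]]++;
  d->total += n;
}

/* Trigger: statistical detectors fire only once THRESHOLD octets have
   accumulated. */
static int
octet_dist_ready (const OctetDist *d, unsigned long threshold)
{
  return d->total >= threshold;
}
```

The 2-, 3-, and 4-octet sequence distributions of point (1) would extend this with (compressed) tables keyed on short octet n-grams.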
@@ -229,3 +284,3 @@
 ------------------------------------------------------------------
 ABOUT FORMATS
 ------------------------------------------------------------------
@@ -307,9 +362,18 @@
 there's also a global one, and presumably all coding systems
 not on the other list get appended to the end (and perhaps not
 checked at all when doing safe-checking?). Safe-checking
 should work something like this: compile a list of all
 charsets used in the buffer, along with a count of chars
-used. that way, "slightly unsafe" charsets can perhaps be
-presented at the end, which will lose only a few characters
-and are perhaps what the users were looking for.
+used. That way, "slightly unsafe" coding systems can perhaps
+be presented at the end, which will lose only a few characters
+and are perhaps what the users were looking for.
+
+[sjt sez this whole step is a crock. If a universal coding system
+is unacceptable, the user had better know what he/she is doing,
+and explicitly specify a lossy encoding.
+In principle, we can simply check for characters being writable as
+we go along. Eg, via an "unrepresentable character handler." We
+still have the buffer contents. If we can't successfully save,
+then ask the user what to do. (Do we ever simply destroy a previous
+file version before completing a write?)]
 
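The safe-checking pass described above, one sweep over the buffer compiling per-charset character counts, can be sketched as follows. The small integer charset ids and the classifier callback are stand-ins for the real Mule charset machinery:

```c
#include <assert.h>

#define NUM_CHARSETS 4          /* illustrative only */

/* Count, for each charset, how many characters in CHARS use it.
   CHARSET_OF maps a character to a small charset id. */
static void
count_charsets (const int *chars, int len, int (*charset_of) (int),
                int counts[NUM_CHARSETS])
{
  int i;
  for (i = 0; i < NUM_CHARSETS; i++)
    counts[i] = 0;
  for (i = 0; i < len; i++)
    counts[charset_of (chars[i])]++;
}

/* Toy classifier: 0 for ASCII, 1 for everything else. */
static int
toy_charset_of (int c)
{
  return (c >= 0 && c < 128) ? 0 : 1;
}
```

With such counts in hand, a coding system missing only a charset with a small count is exactly the "slightly unsafe, loses only a few characters" case the comment wants to surface at the end of the list.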
@@ -316,11 +380,44 @@
 2. when actually writing out, we need error checking in case an
 individual char in a charset can't be written even though the
 charsets are safe. Again, the user gets the choice of other
 reasonable coding systems.
 
+[sjt -- something is very confused here; safe charsets should be
+defined as those charsets all of whose characters can be encoded.]
+
 3. same thing (error checking, list of alternatives, etc.) needs
 to happen when reading! all of this will be a lot of work!
 
 
 --ben
+
+I don't much like Ben's scheme. First, this isn't an issue of I/O;
+it's a coding issue. It can happen in many places, not just on stream
+I/O. Error checking should take place on all translations. Second,
+the two-pass algorithm should be avoided if possible. In some cases
+(eg, output to a tty) we won't be able to go back and change the
+previously output data. Third, the whole idea of having a buffer full
+of arbitrary characters which we're going to somehow shoehorn into a
+file based on some twit user's less-than-informed idea of a coding system
+is kind of laughable from the start. If we're going to say that a buffer
+has a coding system, shouldn't we enforce restrictions on what you can
+put into it? Fourth, what's the point of having safe charsets if some
+of the characters in them are unsafe? Fifth, what makes you think we're
+going to have a list of charsets? It seems to me that there might be
+reasons to have user-defined charsets (eg, "German" vs "French" subsets
+of ISO 8859/15). Sixth, the idea of having the language environment determine
+precedence doesn't seem very useful to me. Users who are working with a
+language that corresponds to the language environment are not going to
+run into safe-charsets problems. It's users who are outside of their
+usual language environment who run into trouble. Also, the reason for
+specifying anything other than a universal coding system is normally
+restrictions imposed by other users or applications. Seventh, the
+statistical feedback isn't terribly useful. Users rarely "want" a
+coding system; they want their file saved in a useful way. We could
+add a FORCE argument to conversions for those who really want a specific
+coding system. But mostly, a user might want to edit out a few unsafe
+characters. So (up to some maximum) we should keep a list of unsafe
+text positions, and provide a convenient function for traversing them.
+
+--sjt
 */
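sjt's closing suggestion, keeping a bounded list of unsafe text positions for the user to traverse, can be sketched directly. The predicate callback stands in for a real safe-charset/safe-char check, and all names here are hypothetical:

```c
#include <assert.h>

#define MAX_UNSAFE 16           /* "up to some maximum" */

/* Scan CHARS and record (up to MAX_UNSAFE) the positions of characters
   the target coding system cannot represent, as decided by
   REPRESENTABLE_P.  Returns the number of positions recorded. */
static int
collect_unsafe_positions (const int *chars, int len,
                          int (*representable_p) (int),
                          int positions[MAX_UNSAFE])
{
  int i, n = 0;
  for (i = 0; i < len && n < MAX_UNSAFE; i++)
    if (!representable_p (chars[i]))
      positions[n++] = i;
  return n;
}

/* Toy predicate: only ASCII is representable. */
static int
ascii_p (int c)
{
  return c >= 0 && c < 128;
}
```

A traversal command would then simply step point through the recorded positions, letting the user edit out or replace each offending character before retrying the save.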
@@ -327,4 +424,4 @@
 
 #include <config.h>
 #include "lisp.h"
 
@@ -495,12 +592,12 @@
 
 #ifdef HAVE_ZLIB
 Lisp_Object Qgzip;
 #endif
 
-/* Maps coding system names to either coding system objects or (for
-   aliases) other names. */
+/* Maps symbols (coding system names) to either coding system objects or
+   (for aliases) other names. */
 static Lisp_Object Vcoding_system_hash_table;
 
 int enable_multibyte_characters;
 
 EXFUN (Fcopy_coding_system, 2);
@@ -908,10 +1005,11 @@
       cscl->internal : cscl->normal)
     *coding_system_list = Fcons (key, *coding_system_list);
   return 0;
 }
 
+/* #### should we specify a convention for "all coding systems"? */
 DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /*
 Return a list of the names of all defined coding systems.
 If INTERNAL is nil, only the normal (non-internal) coding systems are
 included. (Internal coding systems are created for various internal
 purposes, such as implementing EOL types of CRLF and CR; generally, you do
@@ -1556,10 +1654,12 @@
 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not
     yet implemented). `utf-16' is the basic two-byte encoding;
     `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible
     variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using
     only characters that will safely pass through all mail gateways.
+    [[ This should be \"transformation format\". There should also be
+    `ucs-2' (or `bmp' -- no surrogates) and `utf-32' (range checked). ]]
 
 'little-endian
     If non-nil, `utf-16' and `ucs-4' will write out the groups of two
     or four bytes little-endian instead of big-endian. This is required,
     for example, under Windows.
@@ -1567,7 +1667,9 @@
 'need-bom
     If non-nil, a byte order mark (BOM, or Unicode FEFF) should be
     written out at the beginning of the data. This serves both to
     identify the endianness of the following data and to mark the
     data as Unicode (at least, this is how Windows uses it).
+    [[ The correct term is \"signature\", since this technique may also
+    be used with UTF-8. That is the term used in the standard. ]]
 
 
@@ -1574,4 +1675,4 @@
 
 The following additional properties are recognized if TYPE is
 'mswindows-multibyte:
 
@@ -1594,10 +1695,11 @@
 `mswindows-system-default-locale', respectively.
 
 
 
 The following additional properties are recognized if TYPE is 'undecided:
+[[ Doesn't GNU use \"detect-*\" for the following two? ]]
 
 'do-eol
     Do EOL detection.
 
 'do-coding
@@ -1667,10 +1769,12 @@
   (object))
 {
   return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil))
     ? Qt : Qnil;
 }
+
+/* #### Shouldn't this really be a find/get pair? */
 
 DEFUN ("coding-system-alias-p", Fcoding_system_alias_p, 1, 1, 0, /*
 Return t if OBJECT is a coding system alias.
 All coding system aliases are created by `define-coding-system-alias'.
 */
@@ -1792,11 +1896,12 @@
     }
 
   Fputhash (alias, aliasee, Vcoding_system_hash_table);
 
   /* Set up aliases for subsidiaries.
-     #### There must be a better way to handle subsidiary coding systems. */
+     #### There must be a better way to handle subsidiary coding systems.
+     Inquiring Minds Want To Know: shouldn't they always be chains? */
   {
     static const char *suffixes[] = { "-unix", "-dos", "-mac" };
     int i;
     for (i = 0; i < countof (suffixes); i++)
       {
@@ -1867,12 +1972,12 @@
 
 DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io,
        1, 1, 0, /*
 Return the coding system actually used for I/O.
 In some cases (e.g. when a particular EOL type is specified) this won't be
-the coding system itself. This can be useful when trying to track down
-more closely how exactly data is decoded.
+the coding system itself. This can be useful when trying to determine
+precisely how data was decoded.
 */
        (coding_system))
 {
   Lisp_Object canon;
 
@@ -2000,10 +2105,12 @@
 stream for both. "Decoding" may involve the extra step of autodetection
 of the data format, but that's only because of the conventional
 definition of decoding as converting from external- to
 internal-formatted data.
 
+[[ REWRITE ME! ]]
+
 #### We really need to abstract out the concept of "data formats" and
 define "converters" that convert from and to specified formats,
 eliminating the idea of decoding and encoding. When specifying a
 conversion process, we need to give the data formats themselves, not the
 conversion processes -- e.g. a coding system called "Unicode->multibyte"
@@ -2050,11 +2157,11 @@
      after existing "rejected" data from the last conversion. */
   Bytecount rejected = Dynarr_length (str->convert_from);
   /* #### 1024 is arbitrary; we really need to separate 0 from EOF,
      and when we get 0, keep taking more data until we don't get 0 --
      we don't know how much data the conversion routine might need
-     before it can generate any data of its own */
+     before it can generate any data of its own (eg, bzip2). */
   Bytecount readmore =
     str->one_byte_at_a_time ? (Bytecount) 1 :
       max (size, (Bytecount) 1024);
 
   Dynarr_add_many (str->convert_from, 0, readmore);
@@ -2460,21 +2567,64 @@
   ostr = XLSTREAM (outstream);
   istr = XLSTREAM (instream);
 
   /* The chain of streams looks like this:
 
-     [BUFFER] <----- send through
+     [BUFFER] <----- (( read from/send to loop ))
              ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is
                       in bytes]
              ------> [ENCODE/DECODE AS SPECIFIED]
              ------> [BYTE->CHAR i.e. DECODE AS BINARY
                       if sink is in bytes]
              ------> [AUTODETECT EOL if
                       we're decoding and
                       coding system calls
                       for this]
              ------> [BUFFER]
+  */
+  /* Of course, this is just horrible. BYTE<->CHAR should only be available
+     to I/O routines. It should not be visible to Mule proper.
+
+     A comment on the implementation. Hrvoje and Kyle worry about the
+     inefficiency of repeated copying among buffers that chained coding
+     systems entail. But this may not be as time-inefficient as it appears
+     in the Mule ("house rules") context. The issue is how do you do chained
+     coding systems without copying? In theory you could have
+
+     IChar external_to_raw (ExtChar c, State *s);
+     IChar decode_utf16 (IChar c, State *s);
+     IChar decode_crlf (IChar c, State *s);
+
+     typedef IChar (*Converter) (IChar, State *);
+
+     Converter utf16[2] = { &decode_utf16, &decode_crlf };
+
+     void convert (ExtChar *inbuf, IChar *outbuf, Converter cvtr[], int ncvt)
+     {
+       int i;
+       IChar c;
+       State s;
+
+       while ((c = external_to_raw (*inbuf++, &s)))
+         {
+           for (i = 0; i < ncvt; ++i)
+             if (s.ready)
+               c = (*cvtr[i]) (c, &s);
+           if (s.ready)
+             *outbuf++ = c;
+         }
+     }
+
+     But this is a lot of function calls; what Ben is doing is basically
+     reducing this to one call per buffer-full. The only way to avoid this
+     is to hardcode all the "interesting" coding systems, maybe using
+     inline functions or macros to give structure. But this is still a huge
+     amount of work, and code.
+
+     One advantage of the call-per-char approach is that we might be able
+     to do something about the marker/extent destruction that coding
+     normally entails.
   */
   while (1)
     {
       char tempbuf[1024]; /* some random amount */
       Charbpos newpos, even_newer_pos;
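The call-per-char chain sketched in sjt's comment can be made compilable with identity "decoders" standing in for real ones; `IChar`, `ExtChar`, and `State` below are toy assumptions, not the actual XEmacs types:

```c
#include <assert.h>

typedef int IChar;
typedef unsigned char ExtChar;

typedef struct { int ready; } State;   /* toy converter state */

typedef IChar (*Converter) (IChar, State *);

/* Stand-in decoder: passes every character through unchanged. */
static IChar
decode_identity (IChar c, State *s)
{
  s->ready = 1;
  return c;
}

/* Run each converter in the chain over every input char; a 0 octet
   terminates the input.  Returns the number of output chars. */
static int
convert (const ExtChar *inbuf, IChar *outbuf, Converter *cvtr, int ncvt)
{
  int n = 0, i;
  IChar c;
  State s = { 1 };

  while ((c = (IChar) *inbuf++) != 0)
    {
      for (i = 0; i < ncvt; i++)
        if (s.ready)
          c = cvtr[i] (c, &s);
      if (s.ready)
        outbuf[n++] = c;
    }
  return n;
}
```

This makes sjt's cost point concrete: every input character pays `ncvt` indirect calls, which the buffer-at-a-time lstream design amortizes to one call per buffer-full.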
@@ -2795,24 +2945,23 @@
 static void
 chain_finalize_coding_stream_1 (struct chain_coding_stream *data)
 {
   if (data->lstreams)
     {
-      /* Order of deletion is important here! Delete from the head of the
-         chain and work your way towards the tail. In general, when you
-         delete an object, there should be *NO* pointers to it anywhere.
-         Deleting back-to-front would be a problem because there are
-         pointers going forward. If there were pointers in both
-         directions, you'd have to disconnect the pointers to a particular
-         object before deleting it. */
+      /* During GC, these objects are unmarked, and are about to be freed.
+         We do NOT want them on the free list, as that would cause lots of
+         nastiness including crashes. Just let them be freed normally. */
       if (!gc_in_progress)
         {
           int i;
-          /* During GC, these objects are unmarked, and are about to be
-             freed. We do NOT want them on the free list, and that will
-             cause lots of nastiness including crashes. Just let them be
-             freed normally. */
+          /* Order of deletion is important here! Delete from the head of
+             the chain and work your way towards the tail. In general,
+             when you delete an object, there should be *NO* pointers to it
+             anywhere. Deleting back-to-front would be a problem because
+             there are pointers going forward. If there were pointers in
+             both directions, you'd have to disconnect the pointers to a
+             particular object before deleting it. */
           for (i = 0; i < data->lstream_count; i++)
             Lstream_delete (XLSTREAM ((data->lstreams)[i]));
         }
       xfree (data->lstreams, Lisp_Object *);
     }
@@ -2925,11 +3074,14 @@
 /************************************************************************/
 
 /* "No conversion"; used for binary files. We use quotes because there
    really is some conversion being applied (it does byte<->char
    conversion), but it appears to the user as if the text is read in
-   without conversion. */
+   without conversion.
+
+   #### Shouldn't we _call_ it that, then? And while we're at it,
+   separate it into "to_internal" and "to_external"? */
 DEFINE_CODING_SYSTEM_TYPE (no_conversion);
 
 /* This is used when reading in "binary" files -- i.e. files that may
    contain all 256 possible byte values and that are not to be
    interpreted as being in any particular encoding. */
@@ -2971,10 +3123,11 @@
       assert (ch == 0);
       if (c == LEADING_BYTE_LATIN_ISO8859_1 ||
           c == LEADING_BYTE_CONTROL_1)
         ch = c;
       else
+        /* #### This is just plain unacceptable. */
         Dynarr_add (dst, '~'); /* untranslatable character */
     }
   else
     {
       if (ch == LEADING_BYTE_LATIN_ISO8859_1)
@@ -3022,11 +3175,12 @@
    character-to-character, and works (when encoding) *BEFORE* sending
    data to the main encoding routine -- thus, that routine must handle
    different EOL types itself if it does line-oriented type processing.
    This is unavoidable because we don't know whether the output of the
    main encoding routine is ASCII compatible (Unicode is definitely not,
-   for example).
+   for example). [[ sjt sez this is bogus. There should be _no_ EOL
+   processing (or processing of any kind) after conversion to external. ]]
 
    There is one parameter: `subtype', either `cr', `lf', `crlf', or nil.
 */
 
 struct convert_eol_coding_system
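The char <-> char EOL coding described above is, at its core, a simple transformation. A sketch for the `crlf' subtype when decoding, operating buffer-to-buffer for clarity even though the real converter is of course an lstream:

```c
#include <assert.h>

/* Decode `crlf' EOLs in a character buffer: each CRLF pair becomes a
   single LF; everything else (including a lone CR) passes through.
   Returns the number of output characters. */
static int
decode_eol_crlf (const int *in, int inlen, int *out)
{
  int i, n = 0;
  for (i = 0; i < inlen; i++)
    {
      if (in[i] == '\r' && i + 1 < inlen && in[i + 1] == '\n')
        {
          out[n++] = '\n';
          i++;                  /* swallow the LF of the CRLF pair */
        }
      else
        out[n++] = in[i];
    }
  return n;
}
```

The `cr' subtype would instead map each lone CR to LF, and encoding runs the transformations in reverse; because this works on characters, it slots in at the char <-> octet boundary exactly as the comment prescribes.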
@@ -4808,11 +4962,12 @@
 
   QScoding_system_cookie = build_string (";;;###coding system: ");
   staticpro (&QScoding_system_cookie);
 
 #ifdef HAVE_DEFAULT_EOL_DETECTION
-  /* WARNING: The existing categories are intimately tied to the function
+  /* #### Find a more appropriate place for this comment.
+     WARNING: The existing categories are intimately tied to the function
      `coding-system-category' in coding.el. If you change a category, or
      change the layout of any coding system associated with a category, you
      need to check that function and make sure it's written properly. */
 
   Fprovide (intern ("unix-default-eol-detection"));
@@ -4871,10 +5026,12 @@
 Information is displayed on stderr.
 */ );
   Vdebug_coding_detection = Qnil;
 #endif
 }
+
+/* #### reformat this for consistent appearance? */
 
 void
 complex_vars_of_file_coding (void)
 {
   Fmake_coding_system