comparison src/file-coding.c @ 771:943eaba38521

[xemacs-hg @ 2002-03-13 08:51:24 by ben] The big ben-mule-21-5 check-in! Various files were added and deleted. See CHANGES-ben-mule. There are still some test suite failures. No crashes, though. Many of the failures have to do with problems in the test suite itself rather than in the actual code. I'll be addressing these in the next day or so -- none of the test suite failures are at all critical. Meanwhile I'll be trying to address the biggest issues -- i.e. build or run failures, which will almost certainly happen on various platforms. All comments should be sent to ben@xemacs.org -- use a Cc: if necessary when sending to mailing lists. There will be pre- and post- tags, something like pre-ben-mule-21-5-merge-in, and post-ben-mule-21-5-merge-in.
author ben
date Wed, 13 Mar 2002 08:54:06 +0000
parents fdefd0186b75
children 2923009caf47
comparison
equal deleted inserted replaced
770:336a418893b5 771:943eaba38521
1 /* Code conversion functions. 1 /* Text encoding conversion functions; coding-system object.
2 #### rename me to coding-system.c or coding.c
2 Copyright (C) 1991, 1995 Free Software Foundation, Inc. 3 Copyright (C) 1991, 1995 Free Software Foundation, Inc.
3 Copyright (C) 1995 Sun Microsystems, Inc. 4 Copyright (C) 1995 Sun Microsystems, Inc.
5 Copyright (C) 2000, 2001, 2002 Ben Wing.
4 6
5 This file is part of XEmacs. 7 This file is part of XEmacs.
6 8
7 XEmacs is free software; you can redistribute it and/or modify it 9 XEmacs is free software; you can redistribute it and/or modify it
8 under the terms of the GNU General Public License as published by the 10 under the terms of the GNU General Public License as published by the
17 You should have received a copy of the GNU General Public License 19 You should have received a copy of the GNU General Public License
18 along with XEmacs; see the file COPYING. If not, write to 20 along with XEmacs; see the file COPYING. If not, write to
19 the Free Software Foundation, Inc., 59 Temple Place - Suite 330, 21 the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
20 Boston, MA 02111-1307, USA. */ 22 Boston, MA 02111-1307, USA. */
21 23
22 /* Synched up with: Mule 2.3. Not in FSF. */ 24 /* Synched up with: Not in FSF. */
23 25
24 /* Rewritten by Ben Wing <ben@xemacs.org>. */ 26 /* Authorship:
27
28 Current primary author: Ben Wing <ben@xemacs.org>
29
30 Rewritten by Ben Wing <ben@xemacs.org>, based originally on coding.c
31 from Mule 2.? but probably does not share one line of code with that
32 original source. Rewriting work started around Dec. 1994. or Jan. 1995.
33 Proceeded in earnest till Nov. 1995.
34
35 Around Feb. 17, 1998, Andy Piper renamed what was then mule-coding.c to
36 file-coding.c, with the intention of using it to do end-of-line conversion
37 on non-MULE machines (specifically, on Windows machines). He separated
38 out the MULE stuff from non-MULE using ifdef's, and searched throughout
39 the rest of the source tree looking for coding-system-related code that
40 was ifdef MULE but should be ifdef HAVE_CODING_SYSTEMS.
41
42 Sept. 4 - 8, 1998, Tomohiko Morioka added the UCS_4 and UTF_8 coding system
43 types, providing a primitive means of decoding and encoding externally-
44 formatted Unicode/UCS_4 and Unicode/UTF_8 data.
45
46 January 25, 2000, Martin Buchholz redid and fleshed out the coding
47 system alias handling that was first added in prototype form by
48 Hrjove Niksic, April 15, 1999.
49
50 April to May 2000, Ben Wing: More major reorganization. Adding features
51 needed for MS Windows (multibyte, unicode, unicode-to-multibyte), the
52 "chain" coding system for chaining two together, and doing a lot of
53 reorganization in preparation for properly abstracting out the different
54 coding system types.
55
56 June 2001, Ben Wing: Added Unicode support. Eliminated previous
57 junky Unicode translation support.
58
59 August 2001, Ben Wing: Moved Unicode support to unicode.c. Finished
60 abstracting everything except detection, which is hard to abstract (see
61 just below).
62
63 September 2001, Ben Wing: Moved Mule code to mule-coding.c, Windows code
64 to intl-win32.c. Lots more rewriting; very little code is untouched
65 from before April 2000. Abstracted the detection code, added multiple
66 levels of likelihood to increase the reliability of the algorithm.
67
68 October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined.
69 Removed the conditionals.
70 */
71
72 /* Comments about future work
73
74 ------------------------------------------------------------------
75 ABOUT DETECTION
76 ------------------------------------------------------------------
77
78 however, in general the detection code has major problems and needs lots
79 of work:
80
81 -- instead of merely "yes" or "no" for particular categories, we need a
82 more flexible system, with various levels of likelihood. Currently
83 I've created a system with six levels, as follows:
84
85 [see file-coding.h]
86
87 Let's consider what this might mean for an ASCII text detector. (In
88 order to have accurate detection, especially given the iteration I
89 proposed below, we need active detectors for *all* types of data we
90 might reasonably encounter, such as ASCII text files, binary files,
91 and possibly other sorts of ASCII files, and not assume that simply
92 "falling back to no detection" will work at all well.)
93
94 An ASCII text detector DOES NOT report ASCII text as level 0, since
95 that's what the detector is looking for. Such a detector ideally
96 wants all bytes in the range 0x20 - 0x7E (no high bytes!), except for
97 whitespace control chars and perhaps a few others; LF, CR, or CRLF
98 sequences at regular intervals (where "regular" might mean an average
99 < 100 chars and 99% < 300 for code and other stuff of the "text file
100 w/line breaks" variety, but for the "text file w/o line breaks"
101 variety, excluding blank lines, averages could easily be 600 or more
102 with 2000-3000 char "lines" not so uncommon); similar statistical
103 variance between odds and evens (not Unicode); frequent occurrences of
104 the space character; letters more common than non-letters; etc. Also
105 checking for too little variability between frequencies of characters
106 and for exclusion of particular characters based on character ranges
107 can catch ASCII encodings like base-64, UUEncode, UTF-7, etc.
108 Granted, this doesn't even apply to everything called "ASCII", and we
109 could potentially distinguish off ASCII for code, ASCII for text,
110 etc. as separate categories. However, it does give us a lot to work
111 off of, in deciding what likelihood to choose -- and it shows there's
112 in fact a lot of detectable patterns to look for even in something
113 seemingly so generic as ASCII. The detector would report most text
114 files in level 1 or level 2. EUC encodings, Shift-JIS, etc. probably
115 go to level -1 because they also pass the EOL test and all other tests
116 for the ASCII part of the text, but have lots of high bytes, which in
117 essence turn them into binary. Aberrant text files like something in
118 BASE64 encoding might get placed in level 0, because they pass most
119 tests but fail dramatically the frequency test; but they should not be
120 reported as any lower, because that would cause explicit prompting,
121 and the user should be able any valid text file without prompting.
122 The escape sequences and the base-64-type checks might send 7-bit
123 iso2022 to 0, but probably not -1, for similar reasons.
124
125 -- The assumed algorithm for the above detection levels is to in essence
126 sort categories first by detection level and then by priority.
127 Perhaps, however, we would want smarter algorithms, or at least
128 something user-controllable -- in particular, when (other than no
129 category at level 0 or greater) do we prompt the user to pick a
130 category?
131
132 -- Improvements in how the detection algorithm works: we want to handle
133 lots of different ways something could be encoded, including multiple
134 stacked encodings. trying to specify a series of detection levels
135 (check for base64 first, then check for gzip, then check for an i18n
136 decoding, then for crlf) won't generally work. for example, what
137 about the same encoding appearing more than once? for example, take
138 euc-jp, base64'd, then gzip'd, then base64'd again: this could well
139 happen, and you could specify the encodings specifically as
140 base64|gzip|base64|euc-jp, but we'd like to autodetect it without
141 worrying about exactly what order these things appear in. we should
142 allow for iterating over detection/decoding cycles until we reach
143 some maximum (we got stuck in a loop, due to incorrect category
144 tables or detection algorithms), have no reported detection levels
145 over -1, or we end up with no change after a decoding pass (i.e. the
146 coding system associated with a chosen category was `no-conversion'
147 or something equivalent). it might make sense to divide things into
148 two phases (internal and external), where the internal phase has a
149 separate category list and would probably mostly end up handling EOL
150 detection; but the i think about it, the more i disagree. with
151 properly written detectors, and properly organized tables (in
152 general, those decodings that are more "distinctive" and thus
153 detectable with greater certainty go lower on the list), we shouldn't
154 need two phases. for example, let's say the example above was also
155 in CRLF format. The EOL detector (which really detects *plain text*
156 with a particular EOL type) would return at most level 0 for all
157 results until the text file is reached, whereas the base64, gzip or
158 euc-jp decoders will return higher. Once the text file is reached,
159 the EOL detector will return 0 or higher for the CRLF encoding, and
160 all other decoders will return 0 or lower; thus, we will successfully
161 proceed through CRLF decoding, or at worst prompt the user. (The only
162 external-vs-internal distinction that might make sense here is to
163 favor coding systems of the correct source type over those that
164 require conversion between external and internal; if done right, this
165 could allow the CRLF detector to return level 1 for all CRLF-encoded
166 text files, even those that look like Base-64 or similar encoding, so
167 that CRLF encoding will always get decoded without prompting, but not
168 interfere with other decoders. On the other hand, this
169 external-vs-internal distinction may not matter at all -- with
170 automatic internal-external conversion, CRLF decoding can occur
171 before or after decoding of euc-jp, base64, iso2022, or similar,
172 without any difference in the final results.)
173
174 -- There need to be two priority lists and two
175 category->coding-system lists. Once is general, the other
176 category->langenv-specific. The user sets the former, the langenv
177 category->the latter. The langenv-specific entries take precedence
178 category->over the others. This works similarly to the
179 category->category->Unicode charset priority list.
180
181 -- The simple list of coding categories per detectors is not enough.
182 Instead of coding categories, we need parameters. For example,
183 Unicode might have separate detectors for UTF-8, UTF-7, UTF-16,
184 and perhaps UCS-4; or UTF-16/UCS-4 would be one detection type.
185 UTF-16 would have parameters such as "little-endian" and "needs BOM",
186 and possibly another one like "collapse/expand/leave alone composite
187 sequences" once we add this support. Usually these parameters
188 correspond directly to a coding system parameter. Different
189 likelihood values can be specified for each parameter as well as for
190 the detection type as a whole. The user can specify particular
191 coding systems for a particular combination of detection type and
192 parameters, or can give "default parameters" associated with a
193 detection type. In the latter case, we create a new coding system as
194 necessary that corresponds to the detected type and parameters.
195
196 -- a better means of presentation. rather than just coming up
197 with the new file decoded according to the detected coding
198 system, allow the user to browse through the file and
199 conveniently reject it if it looks wrong; then detection
200 starts again, but with that possibility removed. in cases where
201 certainty is low and thus more than one possibility is presented,
202 the user can browse each one and select one or reject them all.
203
204 -- fail-safe: even after the user has made a choice, if they
205 later on realize they have the wrong coding system, they can
206 go back, and we've squirreled away the original data so they
207 can start the process over. this may be tricky.
208
209 -- using a larger buffer for detection. we use just a small
210 piece, which can give quite random results. we may need to
211 buffer up all the data we look through because we can't
212 necessarily rewind. the idea is we proceed until we get a
213 result that's at least at a certain level of certainty
214 (e.g. "probable") or we reached a maximum limit of how much
215 we want to buffer.
216
217 -- dealing with interactive systems. we might need to go ahead
218 and present the data before we've finished detection, and
219 then re-decode it, perhaps multiple times, as we get better
220 detection results.
221
222 -- Clearly some of these are more important than others. at the
223 very least, the "better means of presentation" should be
224 implementation as soon as possibl, along with a very simple means
225 of fail-safe whenever the data is readibly available, e.g. it's
226 coming from a file, which is the most common scenario.
227
228
229 ------------------------------------------------------------------
230 ABOUT FORMATS
231 ------------------------------------------------------------------
232
233 when calling make-coding-system, the name can be a cons of (format1 .
234 format2), specifying that it decodes format1->format2 and encodes the other
235 way. if only one name is given, that is assumed to be format1, and the
236 other is either `external' or `internal' depending on the end type.
237 normally the user when decoding gives the decoding order in formats, but
238 can leave off the last one, `internal', which is assumed. a multichain
239 might look like gzip|multibyte|unicode, using the coding systems named
240 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
241 is by searching for gzip->multibyte; if not found, look for gzip->external
242 or gzip->internal. (In general we automatically do conversion between
243 internal and external as necessary: thus gzip|crlf does the expected, and
244 maps to gzip->external, external->internal, crlf->internal, which when
245 fully specified would be gzip|external:external|internal:crlf|internal --
246 see below.) To forcibly fit together two converters that have explicitly
247 specified and incompatible names (say you have unicode->multibyte and
248 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
249 case are compatible), you can force-cast using :, like this:
250 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
251 internal and external formats, the conversion happens automatically.)
252
253 --------------------------------------------------------------------------
254 ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS
255 --------------------------------------------------------------------------
256
257 -- there's the problem that XEmacs can't be run in a directory with
258 non-ASCII/Latin-1 chars in it, since it will be doing Unicode
259 processing before we've had a chance to load the tables. In fact,
260 even finding the tables in such a situation is problematic using
261 the normal commands. my idea is to eventually load the stuff
262 extremely extremely early, at the same time as the pdump data gets
263 loaded. in fact, the unicode table data (stored in an efficient
264 binary format) can even be stuck into the pdump file (which would
265 mean as a resource to the executable, for windows). we'd need to
266 extend pdump a bit: to allow for attaching extra data to the pdump
267 file. (something like pdump_attach_extra_data (addr, length)
268 returns a number of some sort, an index into the file, which you
269 can then retrieve with pdump_load_extra_data(), which returns an
270 addr (mmap()ed or loaded), and later you pdump_unload_extra_data()
271 when finished. we'd probably also need
272 pdump_attach_extra_data_append(), which appends data to the data
273 just written out with pdump_attach_extra_data(). this way,
274 multiple tables in memory can be written out into one contiguous
275 table. (we'd use the tar-like trick of allowing new blocks to be
276 written without going back to change the old blocks -- we just rely
277 on the end of file/end of memory.) this same mechanism could be
278 extracted out of pdump and used to handle the non-pdump situation
279 (or alternatively, we could just dump either the memory image of
280 the tables themselves or the compressed binary version). in the
281 case of extra unicode tables not known about at compile time that
282 get loaded before dumping, we either just dump them into the image
283 (pdump and all) or extract them into the compressed binary format,
284 free the original tables, and treat them like all other tables.
285
286 --------------------------------------------------------------------------
287 HANDLING WRITING A FILE SAFELY, WITHOUT DATA LOSS
288 --------------------------------------------------------------------------
289
290 -- When writing a file, we need error detection; otherwise somebody
291 will create a Unicode file without realizing the coding system
292 of the buffer is Raw, and then lose all the non-ASCII/Latin-1
293 text when it's written out. We need two levels
294
295 1. first, a "safe-charset" level that checks before any actual
296 encoding to see if all characters in the document can safely
297 be represented using the given coding system. FSF has a
298 "safe-charset" property of coding systems, but it's stupid
299 because this information can be automatically derived from
300 the coding system, at least the vast majority of the time.
301 What we need is some sort of
302 alternative-coding-system-precedence-list, langenv-specific,
303 where everything on it can be checked for safe charsets and
304 then the user given a list of possibilities. When the user
305 does "save with specified encoding", they should see the same
306 precedence list. Again like with other precedence lists,
307 there's also a global one, and presumably all coding systems
308 not on other list get appended to the end (and perhaps not
309 checked at all when doing safe-checking?). safe-checking
310 should work something like this: compile a list of all
311 charsets used in the buffer, along with a count of chars
312 used. that way, "slightly unsafe" charsets can perhaps be
313 presented at the end, which will lose only a few characters
314 and are perhaps what the users were looking for.
315
316 2. when actually writing out, we need error checking in case an
317 individual char in a charset can't be written even though the
318 charsets are safe. again, the user gets the choice of other
319 reasonable coding systems.
320
321 3. same thing (error checking, list of alternatives, etc.) needs
322 to happen when reading! all of this will be a lot of work!
323
324
325 --ben
326 */
25 327
26 #include <config.h> 328 #include <config.h>
27 #include "lisp.h" 329 #include "lisp.h"
28 330
29 #include "buffer.h" 331 #include "buffer.h"
30 #include "elhash.h" 332 #include "elhash.h"
31 #include "insdel.h" 333 #include "insdel.h"
32 #include "lstream.h" 334 #include "lstream.h"
33 #include "opaque.h" 335 #include "opaque.h"
34 #ifdef MULE 336 #include "file-coding.h"
35 #include "mule-ccl.h" 337
36 #include "chartab.h" 338 #ifdef HAVE_ZLIB
339 #include "zlib.h"
37 #endif 340 #endif
38 #include "file-coding.h"
39 341
40 Lisp_Object Vkeyboard_coding_system; 342 Lisp_Object Vkeyboard_coding_system;
41 Lisp_Object Vterminal_coding_system; 343 Lisp_Object Vterminal_coding_system;
42 Lisp_Object Vcoding_system_for_read; 344 Lisp_Object Vcoding_system_for_read;
43 Lisp_Object Vcoding_system_for_write; 345 Lisp_Object Vcoding_system_for_write;
44 Lisp_Object Vfile_name_coding_system; 346 Lisp_Object Vfile_name_coding_system;
45 347
46 /* Table of symbols identifying each coding category. */ 348 #ifdef DEBUG_XEMACS
47 Lisp_Object coding_category_symbol[CODING_CATEGORY_LAST]; 349 Lisp_Object Vdebug_coding_detection;
48
49
50
51 struct file_coding_dump {
52 /* Coding system currently associated with each coding category. */
53 Lisp_Object coding_category_system[CODING_CATEGORY_LAST];
54
55 /* Table of all coding categories in decreasing order of priority.
56 This describes a permutation of the possible coding categories. */
57 int coding_category_by_priority[CODING_CATEGORY_LAST];
58
59 #ifdef MULE
60 Lisp_Object ucs_to_mule_table[65536];
61 #endif 350 #endif
62 } *fcd; 351
63 352 typedef struct coding_system_type_entry
64 static const struct lrecord_description fcd_description_1[] = { 353 {
65 { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, coding_category_system), CODING_CATEGORY_LAST }, 354 struct coding_system_methods *meths;
66 #ifdef MULE 355 } coding_system_type_entry;
67 { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, ucs_to_mule_table), countof (fcd->ucs_to_mule_table) }, 356
68 #endif 357 typedef struct
358 {
359 Dynarr_declare (coding_system_type_entry);
360 } coding_system_type_entry_dynarr;
361
362 static coding_system_type_entry_dynarr *the_coding_system_type_entry_dynarr;
363
364 static const struct lrecord_description cste_description_1[] = {
365 { XD_STRUCT_PTR, offsetof (coding_system_type_entry, meths), 1, &coding_system_methods_description },
69 { XD_END } 366 { XD_END }
70 }; 367 };
71 368
72 static const struct struct_description fcd_description = { 369 static const struct struct_description cste_description = {
73 sizeof (struct file_coding_dump), 370 sizeof (coding_system_type_entry),
74 fcd_description_1 371 cste_description_1
75 }; 372 };
76 373
77 Lisp_Object mule_to_ucs_table; 374 static const struct lrecord_description csted_description_1[] = {
375 XD_DYNARR_DESC (coding_system_type_entry_dynarr, &cste_description),
376 { XD_END }
377 };
378
379 static const struct struct_description csted_description = {
380 sizeof (coding_system_type_entry_dynarr),
381 csted_description_1
382 };
383
384 static Lisp_Object Vcoding_system_type_list;
385
386 /* Coding system currently associated with each coding category. */
387 Lisp_Object coding_category_system[MAX_DETECTOR_CATEGORIES];
388
389 /* Table of all coding categories in decreasing order of priority.
390 This describes a permutation of the possible coding categories. */
391 int coding_category_by_priority[MAX_DETECTOR_CATEGORIES];
392
393 /* Value used with to give a unique name to nameless coding systems */
394 int coding_system_tick;
395
396 int coding_detector_count;
397 int coding_detector_category_count;
398
399 detector_dynarr *all_coding_detectors;
400
401 static const struct lrecord_description struct_detector_category_description_1[]
402 =
403 {
404 { XD_LISP_OBJECT, offsetof (struct detector_category, sym) },
405 { XD_END }
406 };
407
408 static const struct struct_description struct_detector_category_description =
409 {
410 sizeof (struct detector_category),
411 struct_detector_category_description_1
412 };
413
414 static const struct lrecord_description detector_category_dynarr_description_1[] =
415 {
416 XD_DYNARR_DESC (detector_category_dynarr,
417 &struct_detector_category_description),
418 { XD_END }
419 };
420
421 static const struct struct_description detector_category_dynarr_description = {
422 sizeof (detector_category_dynarr),
423 detector_category_dynarr_description_1
424 };
425
426 static const struct lrecord_description struct_detector_description_1[]
427 =
428 {
429 { XD_STRUCT_PTR, offsetof (struct detector, cats), 1,
430 &detector_category_dynarr_description },
431 { XD_END }
432 };
433
434 static const struct struct_description struct_detector_description =
435 {
436 sizeof (struct detector),
437 struct_detector_description_1
438 };
439
440 static const struct lrecord_description detector_dynarr_description_1[] =
441 {
442 XD_DYNARR_DESC (detector_dynarr, &struct_detector_description),
443 { XD_END }
444 };
445
446 static const struct struct_description detector_dynarr_description = {
447 sizeof (detector_dynarr),
448 detector_dynarr_description_1
449 };
78 450
79 Lisp_Object Qcoding_systemp; 451 Lisp_Object Qcoding_systemp;
80 452
81 Lisp_Object Qraw_text, Qno_conversion, Qccl, Qiso2022; 453 Lisp_Object Qraw_text;
82 /* Qinternal in general.c */
83 454
84 Lisp_Object Qmnemonic, Qeol_type; 455 Lisp_Object Qmnemonic, Qeol_type;
85 Lisp_Object Qcr, Qcrlf, Qlf; 456 Lisp_Object Qcr, Qcrlf, Qlf;
86 Lisp_Object Qeol_cr, Qeol_crlf, Qeol_lf; 457 Lisp_Object Qeol_cr, Qeol_crlf, Qeol_lf;
87 Lisp_Object Qpost_read_conversion; 458 Lisp_Object Qpost_read_conversion;
88 Lisp_Object Qpre_write_conversion; 459 Lisp_Object Qpre_write_conversion;
89 460
90 #ifdef MULE 461 Lisp_Object Qtranslation_table_for_decode;
91 Lisp_Object Qucs4, Qutf8; 462 Lisp_Object Qtranslation_table_for_encode;
92 Lisp_Object Qbig5, Qshift_jis; 463 Lisp_Object Qsafe_chars;
93 Lisp_Object Qcharset_g0, Qcharset_g1, Qcharset_g2, Qcharset_g3; 464 Lisp_Object Qsafe_charsets;
94 Lisp_Object Qforce_g0_on_output, Qforce_g1_on_output; 465 Lisp_Object Qmime_charset;
95 Lisp_Object Qforce_g2_on_output, Qforce_g3_on_output; 466 Lisp_Object Qvalid_codes;
96 Lisp_Object Qno_iso6429; 467
97 Lisp_Object Qinput_charset_conversion, Qoutput_charset_conversion; 468 Lisp_Object Qno_conversion;
469 Lisp_Object Qconvert_eol;
98 Lisp_Object Qescape_quoted; 470 Lisp_Object Qescape_quoted;
99 Lisp_Object Qshort, Qno_ascii_eol, Qno_ascii_cntl, Qseven, Qlock_shift; 471 Lisp_Object Qencode, Qdecode;
472
473 Lisp_Object Qconvert_eol_lf, Qconvert_eol_cr, Qconvert_eol_crlf;
474 Lisp_Object Qconvert_eol_autodetect;
475
476 Lisp_Object Qnear_certainty, Qquite_probable, Qsomewhat_likely;
477 Lisp_Object Qas_likely_as_unlikely, Qsomewhat_unlikely, Qquite_improbable;
478 Lisp_Object Qnearly_impossible;
479
480 Lisp_Object Qdo_eol, Qdo_coding;
481
482 Lisp_Object Qcanonicalize_after_coding;
483
484 /* This is used to convert autodetected coding systems into existing
485 systems. For example, the chain undecided->convert-eol-autodetect may
486 have its separate parts detected as mswindows-multibyte and
487 convert-eol-crlf, and the result needs to be mapped to
488 mswindows-multibyte-dos. */
489 /* #### It's not clear we need this whole chain-canonicalize mechanism
490 any more. */
491 static Lisp_Object Vchain_canonicalize_hash_table;
492
493 #ifdef HAVE_ZLIB
494 Lisp_Object Qgzip;
100 #endif 495 #endif
101 Lisp_Object Qencode, Qdecode; 496
102 497 /* Maps coding system names to either coding system objects or (for
103 Lisp_Object Vcoding_system_hash_table; 498 aliases) other names. */
499 static Lisp_Object Vcoding_system_hash_table;
104 500
105 int enable_multibyte_characters; 501 int enable_multibyte_characters;
106 502
107 #ifdef MULE
108 /* Additional information used by the ISO2022 decoder and detector. */
109 struct iso2022_decoder
110 {
111 /* CHARSET holds the character sets currently assigned to the G0
112 through G3 variables. It is initialized from the array
113 INITIAL_CHARSET in CODESYS. */
114 Lisp_Object charset[4];
115
116 /* Which registers are currently invoked into the left (GL) and
117 right (GR) halves of the 8-bit encoding space? */
118 int register_left, register_right;
119
120 /* ISO_ESC holds a value indicating part of an escape sequence
121 that has already been seen. */
122 enum iso_esc_flag esc;
123
124 /* This records the bytes we've seen so far in an escape sequence,
125 in case the sequence is invalid (we spit out the bytes unchanged). */
126 unsigned char esc_bytes[8];
127
128 /* Index for next byte to store in ISO escape sequence. */
129 int esc_bytes_index;
130
131 #ifdef ENABLE_COMPOSITE_CHARS
132 /* Stuff seen so far when composing a string. */
133 unsigned_char_dynarr *composite_chars;
134 #endif
135
136 /* If we saw an invalid designation sequence for a particular
137 register, we flag it here and switch to ASCII. The next time we
138 see a valid designation for this register, we turn off the flag
139 and do the designation normally, but pretend the sequence was
140 invalid. The effect of all this is that (most of the time) the
141 escape sequences for both the switch to the unknown charset, and
142 the switch back to the known charset, get inserted literally into
143 the buffer and saved out as such. The hope is that we can
144 preserve the escape sequences so that the resulting written out
145 file makes sense. If we don't do any of this, the designation
146 to the invalid charset will be preserved but that switch back
147 to the known charset will probably get eaten because it was
148 the same charset that was already present in the register. */
149 unsigned char invalid_designated[4];
150
151 /* We try to do similar things as above for direction-switching
152 sequences. If we encountered a direction switch while an
153 invalid designation was present, or an invalid designation
154 just after a direction switch (i.e. no valid designation
155 encountered yet), we insert the direction-switch escape
156 sequence literally into the output stream, and later on
157 insert the corresponding direction-restoring escape sequence
158 literally also. */
159 unsigned int switched_dir_and_no_valid_charset_yet :1;
160 unsigned int invalid_switch_dir :1;
161
162 /* Tells the decoder to output the escape sequence literally
163 even though it was valid. Used in the games we play to
164 avoid lossage when we encounter invalid designations. */
165 unsigned int output_literally :1;
166 /* We encountered a direction switch followed by an invalid
167 designation. We didn't output the direction switch
168 literally because we didn't know about the invalid designation;
169 but we have to do so now. */
170 unsigned int output_direction_sequence :1;
171 };
172 #endif /* MULE */
173 EXFUN (Fcopy_coding_system, 2); 503 EXFUN (Fcopy_coding_system, 2);
174 #ifdef MULE
175 struct detection_state;
176 static int detect_coding_sjis (struct detection_state *st,
177 const Extbyte *src, Bytecount n);
178 static void decode_coding_sjis (Lstream *decoding, const Extbyte *src,
179 unsigned_char_dynarr *dst, Bytecount n);
180 static void encode_coding_sjis (Lstream *encoding, const Intbyte *src,
181 unsigned_char_dynarr *dst, Bytecount n);
182 static int detect_coding_big5 (struct detection_state *st,
183 const Extbyte *src, Bytecount n);
184 static void decode_coding_big5 (Lstream *decoding, const Extbyte *src,
185 unsigned_char_dynarr *dst, Bytecount n);
186 static void encode_coding_big5 (Lstream *encoding, const Intbyte *src,
187 unsigned_char_dynarr *dst, Bytecount n);
188 static int detect_coding_ucs4 (struct detection_state *st,
189 const Extbyte *src, Bytecount n);
190 static void decode_coding_ucs4 (Lstream *decoding, const Extbyte *src,
191 unsigned_char_dynarr *dst, Bytecount n);
192 static void encode_coding_ucs4 (Lstream *encoding, const Intbyte *src,
193 unsigned_char_dynarr *dst, Bytecount n);
194 static int detect_coding_utf8 (struct detection_state *st,
195 const Extbyte *src, Bytecount n);
196 static void decode_coding_utf8 (Lstream *decoding, const Extbyte *src,
197 unsigned_char_dynarr *dst, Bytecount n);
198 static void encode_coding_utf8 (Lstream *encoding, const Intbyte *src,
199 unsigned_char_dynarr *dst, Bytecount n);
200 static int postprocess_iso2022_mask (int mask);
201 static void reset_iso2022 (Lisp_Object coding_system,
202 struct iso2022_decoder *iso);
203 static int detect_coding_iso2022 (struct detection_state *st,
204 const Extbyte *src, Bytecount n);
205 static void decode_coding_iso2022 (Lstream *decoding, const Extbyte *src,
206 unsigned_char_dynarr *dst, Bytecount n);
207 static void encode_coding_iso2022 (Lstream *encoding, const Intbyte *src,
208 unsigned_char_dynarr *dst, Bytecount n);
209 #endif /* MULE */
210 static void decode_coding_no_conversion (Lstream *decoding, const Extbyte *src,
211 unsigned_char_dynarr *dst, Bytecount n);
212 static void encode_coding_no_conversion (Lstream *encoding, const Intbyte *src,
213 unsigned_char_dynarr *dst, Bytecount n);
214 static void mule_decode (Lstream *decoding, const Extbyte *src,
215 unsigned_char_dynarr *dst, Bytecount n);
216 static void mule_encode (Lstream *encoding, const Intbyte *src,
217 unsigned_char_dynarr *dst, Bytecount n);
218
219 typedef struct codesys_prop codesys_prop;
220 struct codesys_prop
221 {
222 Lisp_Object sym;
223 int prop_type;
224 };
225
226 typedef struct
227 {
228 Dynarr_declare (codesys_prop);
229 } codesys_prop_dynarr;
230
231 static const struct lrecord_description codesys_prop_description_1[] = {
232 { XD_LISP_OBJECT, offsetof (codesys_prop, sym) },
233 { XD_END }
234 };
235
236 static const struct struct_description codesys_prop_description = {
237 sizeof (codesys_prop),
238 codesys_prop_description_1
239 };
240
241 static const struct lrecord_description codesys_prop_dynarr_description_1[] = {
242 XD_DYNARR_DESC (codesys_prop_dynarr, &codesys_prop_description),
243 { XD_END }
244 };
245
246 static const struct struct_description codesys_prop_dynarr_description = {
247 sizeof (codesys_prop_dynarr),
248 codesys_prop_dynarr_description_1
249 };
250
251 codesys_prop_dynarr *the_codesys_prop_dynarr;
252
253 enum codesys_prop_enum
254 {
255 CODESYS_PROP_ALL_OK,
256 CODESYS_PROP_ISO2022,
257 CODESYS_PROP_CCL
258 };
259 504
260 505
261 /************************************************************************/ 506 /************************************************************************/
262 /* Coding system functions */ 507 /* Coding system object methods */
263 /************************************************************************/ 508 /************************************************************************/
264
265 static Lisp_Object mark_coding_system (Lisp_Object);
266 static void print_coding_system (Lisp_Object, Lisp_Object, int);
267 static void finalize_coding_system (void *header, int for_disksave);
268
269 #ifdef MULE
270 static const struct lrecord_description ccs_description_1[] = {
271 { XD_LISP_OBJECT, offsetof (charset_conversion_spec, from_charset) },
272 { XD_LISP_OBJECT, offsetof (charset_conversion_spec, to_charset) },
273 { XD_END }
274 };
275
276 static const struct struct_description ccs_description = {
277 sizeof (charset_conversion_spec),
278 ccs_description_1
279 };
280
281 static const struct lrecord_description ccsd_description_1[] = {
282 XD_DYNARR_DESC (charset_conversion_spec_dynarr, &ccs_description),
283 { XD_END }
284 };
285
286 static const struct struct_description ccsd_description = {
287 sizeof (charset_conversion_spec_dynarr),
288 ccsd_description_1
289 };
290 #endif
291
292 static const struct lrecord_description coding_system_description[] = {
293 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) },
294 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, doc_string) },
295 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) },
296 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) },
297 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) },
298 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_lf) },
299 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_crlf) },
300 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_cr) },
301 #ifdef MULE
302 { XD_LISP_OBJECT_ARRAY, offsetof (Lisp_Coding_System, iso2022.initial_charset), 4 },
303 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.input_conv), 1, &ccsd_description },
304 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.output_conv), 1, &ccsd_description },
305 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.decode) },
306 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.encode) },
307 #endif
308 { XD_END }
309 };
310
311 DEFINE_LRECORD_IMPLEMENTATION ("coding-system", coding_system,
312 mark_coding_system, print_coding_system,
313 finalize_coding_system,
314 0, 0, coding_system_description,
315 Lisp_Coding_System);
316 509
317 static Lisp_Object 510 static Lisp_Object
318 mark_coding_system (Lisp_Object obj) 511 mark_coding_system (Lisp_Object obj)
319 { 512 {
320 Lisp_Coding_System *codesys = XCODING_SYSTEM (obj); 513 Lisp_Coding_System *codesys = XCODING_SYSTEM (obj);
321 514
322 mark_object (CODING_SYSTEM_NAME (codesys)); 515 mark_object (CODING_SYSTEM_NAME (codesys));
323 mark_object (CODING_SYSTEM_DOC_STRING (codesys)); 516 mark_object (CODING_SYSTEM_DESCRIPTION (codesys));
324 mark_object (CODING_SYSTEM_MNEMONIC (codesys)); 517 mark_object (CODING_SYSTEM_MNEMONIC (codesys));
518 mark_object (CODING_SYSTEM_DOCUMENTATION (codesys));
325 mark_object (CODING_SYSTEM_EOL_LF (codesys)); 519 mark_object (CODING_SYSTEM_EOL_LF (codesys));
326 mark_object (CODING_SYSTEM_EOL_CRLF (codesys)); 520 mark_object (CODING_SYSTEM_EOL_CRLF (codesys));
327 mark_object (CODING_SYSTEM_EOL_CR (codesys)); 521 mark_object (CODING_SYSTEM_EOL_CR (codesys));
328 522 mark_object (CODING_SYSTEM_SUBSIDIARY_PARENT (codesys));
329 switch (CODING_SYSTEM_TYPE (codesys)) 523 mark_object (CODING_SYSTEM_CANONICAL (codesys));
330 { 524
331 #ifdef MULE 525 MAYBE_CODESYSMETH (codesys, mark, (obj));
332 int i;
333 case CODESYS_ISO2022:
334 for (i = 0; i < 4; i++)
335 mark_object (CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i));
336 if (codesys->iso2022.input_conv)
337 {
338 for (i = 0; i < Dynarr_length (codesys->iso2022.input_conv); i++)
339 {
340 struct charset_conversion_spec *ccs =
341 Dynarr_atp (codesys->iso2022.input_conv, i);
342 mark_object (ccs->from_charset);
343 mark_object (ccs->to_charset);
344 }
345 }
346 if (codesys->iso2022.output_conv)
347 {
348 for (i = 0; i < Dynarr_length (codesys->iso2022.output_conv); i++)
349 {
350 struct charset_conversion_spec *ccs =
351 Dynarr_atp (codesys->iso2022.output_conv, i);
352 mark_object (ccs->from_charset);
353 mark_object (ccs->to_charset);
354 }
355 }
356 break;
357
358 case CODESYS_CCL:
359 mark_object (CODING_SYSTEM_CCL_DECODE (codesys));
360 mark_object (CODING_SYSTEM_CCL_ENCODE (codesys));
361 break;
362 #endif /* MULE */
363 default:
364 break;
365 }
366 526
367 mark_object (CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys)); 527 mark_object (CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys));
368 return CODING_SYSTEM_POST_READ_CONVERSION (codesys); 528 return CODING_SYSTEM_POST_READ_CONVERSION (codesys);
529 }
530
531 static void
532 print_coding_system_properties (Lisp_Object obj, Lisp_Object printcharfun)
533 {
534 Lisp_Coding_System *c = XCODING_SYSTEM (obj);
535 print_internal (c->methods->type, printcharfun, 1);
536 MAYBE_CODESYSMETH (c, print, (obj, printcharfun, 1));
537 if (CODING_SYSTEM_EOL_TYPE (c) != EOL_AUTODETECT)
538 write_fmt_string_lisp (printcharfun, " eol-type=%s",
539 1, Fcoding_system_property (obj, Qeol_type));
369 } 540 }
370 541
371 static void 542 static void
372 print_coding_system (Lisp_Object obj, Lisp_Object printcharfun, 543 print_coding_system (Lisp_Object obj, Lisp_Object printcharfun,
373 int escapeflag) 544 int escapeflag)
374 { 545 {
375 Lisp_Coding_System *c = XCODING_SYSTEM (obj); 546 Lisp_Coding_System *c = XCODING_SYSTEM (obj);
376 if (print_readably) 547 if (print_readably)
377 printing_unreadable_object ("#<coding_system 0x%x>", c->header.uid); 548 printing_unreadable_object
378 549 ("printing unreadable object #<coding-system 0x%x>", c->header.uid);
379 write_c_string ("#<coding_system ", printcharfun); 550
380 print_internal (c->name, printcharfun, 1); 551 write_fmt_string_lisp (printcharfun, "#<coding-system %s ", 1, c->name);
552 print_coding_system_properties (obj, printcharfun);
381 write_c_string (">", printcharfun); 553 write_c_string (">", printcharfun);
554 }
555
556 /* Print an abbreviated version of a coding system (but still containing
557 all the information), for use within a coding system print method. */
558
559 static void
560 print_coding_system_in_print_method (Lisp_Object cs, Lisp_Object printcharfun,
561 int escapeflag)
562 {
563 print_internal (XCODING_SYSTEM_NAME (cs), printcharfun, 0);
564 write_c_string ("[", printcharfun);
565 print_coding_system_properties (cs, printcharfun);
566 write_c_string ("]", printcharfun);
382 } 567 }
383 568
384 static void 569 static void
385 finalize_coding_system (void *header, int for_disksave) 570 finalize_coding_system (void *header, int for_disksave)
386 { 571 {
387 Lisp_Coding_System *c = (Lisp_Coding_System *) header; 572 Lisp_Object cs = wrap_coding_system ((Lisp_Coding_System *) header);
388 /* Since coding systems never go away, this function is not 573 /* Since coding systems never go away, this function is not
389 necessary. But it would be necessary if we changed things 574 necessary. But it would be necessary if we changed things
390 so that coding systems could go away. */ 575 so that coding systems could go away. */
391 if (!for_disksave) /* see comment in lstream.c */ 576 if (!for_disksave) /* see comment in lstream.c */
392 { 577 MAYBE_XCODESYSMETH (cs, finalize, (cs));
393 switch (CODING_SYSTEM_TYPE (c)) 578 }
394 { 579
395 #ifdef MULE 580 static Bytecount
396 case CODESYS_ISO2022: 581 sizeof_coding_system (const void *header)
397 if (c->iso2022.input_conv) 582 {
398 { 583 const Lisp_Coding_System *p = (const Lisp_Coding_System *) header;
399 Dynarr_free (c->iso2022.input_conv); 584 return offsetof (Lisp_Coding_System, data) + p->methods->extra_data_size;
400 c->iso2022.input_conv = 0; 585 }
401 } 586
402 if (c->iso2022.output_conv) 587 static const struct lrecord_description coding_system_methods_description_1[]
403 { 588 = {
404 Dynarr_free (c->iso2022.output_conv); 589 { XD_LISP_OBJECT,
405 c->iso2022.output_conv = 0; 590 offsetof (struct coding_system_methods, type) },
406 } 591 { XD_LISP_OBJECT,
407 break; 592 offsetof (struct coding_system_methods, predicate_symbol) },
408 #endif /* MULE */ 593 { XD_END }
409 default: 594 };
410 break; 595
411 } 596 const struct struct_description coding_system_methods_description = {
412 } 597 sizeof (struct coding_system_methods),
413 } 598 coding_system_methods_description_1
414 599 };
415 static eol_type_t 600
416 symbol_to_eol_type (Lisp_Object symbol) 601 const struct lrecord_description coding_system_empty_extra_description[] = {
417 { 602 { XD_END }
418 CHECK_SYMBOL (symbol); 603 };
419 if (NILP (symbol)) return EOL_AUTODETECT; 604
420 if (EQ (symbol, Qlf)) return EOL_LF; 605 static const struct lrecord_description coding_system_description[] =
421 if (EQ (symbol, Qcrlf)) return EOL_CRLF; 606 {
422 if (EQ (symbol, Qcr)) return EOL_CR; 607 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, methods), 1,
423 608 &coding_system_methods_description },
424 invalid_constant ("Unrecognized eol type", symbol); 609 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) },
425 return EOL_AUTODETECT; /* not reached */ 610 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, description) },
426 } 611 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) },
427 612 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, documentation) },
428 static Lisp_Object 613 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) },
429 eol_type_to_symbol (eol_type_t type) 614 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) },
430 { 615 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, text_file_wrapper) },
431 switch (type) 616 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, auto_eol_wrapper) },
432 { 617 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[0]) },
433 default: abort (); 618 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[1]) },
434 case EOL_LF: return Qlf; 619 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[2]) },
435 case EOL_CRLF: return Qcrlf; 620 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, subsidiary_parent) },
436 case EOL_CR: return Qcr; 621 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, canonical) },
437 case EOL_AUTODETECT: return Qnil; 622 { XD_CODING_SYSTEM_END }
438 } 623 };
439 } 624
440 625 DEFINE_LRECORD_SEQUENCE_IMPLEMENTATION ("coding-system", coding_system,
441 static void 626 mark_coding_system,
442 setup_eol_coding_systems (Lisp_Coding_System *codesys) 627 print_coding_system,
443 { 628 finalize_coding_system,
444 Lisp_Object codesys_obj; 629 0, 0, coding_system_description,
445 int len = string_length (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name); 630 sizeof_coding_system,
446 char *codesys_name = (char *) alloca (len + 7); 631 Lisp_Coding_System);
447 int mlen = -1; 632
448 char *codesys_mnemonic=0; 633
449 634 /************************************************************************/
450 Lisp_Object codesys_name_sym, sub_codesys_obj; 635 /* Creating coding systems */
451 636 /************************************************************************/
452 /* kludge */ 637
453 638 static struct coding_system_methods *
454 XSETCODING_SYSTEM (codesys_obj, codesys); 639 decode_coding_system_type (Lisp_Object type, Error_Behavior errb)
455 640 {
456 memcpy (codesys_name, 641 int i;
457 string_data (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name), len); 642
458 643 for (i = 0; i < Dynarr_length (the_coding_system_type_entry_dynarr); i++)
459 if (STRINGP (CODING_SYSTEM_MNEMONIC (codesys))) 644 {
460 { 645 if (EQ (type,
461 mlen = XSTRING_LENGTH (CODING_SYSTEM_MNEMONIC (codesys)); 646 Dynarr_at (the_coding_system_type_entry_dynarr, i).meths->type))
462 codesys_mnemonic = (char *) alloca (mlen + 7); 647 return Dynarr_at (the_coding_system_type_entry_dynarr, i).meths;
463 memcpy (codesys_mnemonic, 648 }
464 XSTRING_DATA (CODING_SYSTEM_MNEMONIC (codesys)), mlen); 649
465 } 650 maybe_invalid_constant ("Invalid coding system type", type,
466 651 Qcoding_system, errb);
467 #define DEFINE_SUB_CODESYS(op_sys, op_sys_abbr, Type) do { \ 652
468 strcpy (codesys_name + len, "-" op_sys); \ 653 return 0;
469 if (mlen != -1) \ 654 }
470 strcpy (codesys_mnemonic + mlen, op_sys_abbr); \ 655
471 codesys_name_sym = intern (codesys_name); \ 656 static int
472 sub_codesys_obj = Fcopy_coding_system (codesys_obj, codesys_name_sym); \ 657 valid_coding_system_type_p (Lisp_Object type)
473 XCODING_SYSTEM_EOL_TYPE (sub_codesys_obj) = Type; \ 658 {
474 if (mlen != -1) \ 659 return decode_coding_system_type (type, ERROR_ME_NOT) != 0;
475 XCODING_SYSTEM_MNEMONIC(sub_codesys_obj) = \ 660 }
476 build_string (codesys_mnemonic); \ 661
477 CODING_SYSTEM_##Type (codesys) = sub_codesys_obj; \ 662 DEFUN ("valid-coding-system-type-p", Fvalid_coding_system_type_p, 1, 1, 0, /*
478 } while (0) 663 Given a CODING-SYSTEM-TYPE, return non-nil if it is valid.
479 664 Valid types depend on how XEmacs was compiled but may include
480 DEFINE_SUB_CODESYS("unix", "", EOL_LF); 665 'undecided, 'chain, 'integer, 'ccl, 'iso2022, 'big5, 'shift-jis,
481 DEFINE_SUB_CODESYS("dos", ":T", EOL_CRLF); 666 'utf-16, 'ucs-4, 'utf-8, etc.
482 DEFINE_SUB_CODESYS("mac", ":t", EOL_CR); 667 */
668 (coding_system_type))
669 {
670 return valid_coding_system_type_p (coding_system_type) ? Qt : Qnil;
671 }
672
673 DEFUN ("coding-system-type-list", Fcoding_system_type_list, 0, 0, 0, /*
674 Return a list of valid coding system types.
675 */
676 ())
677 {
678 return Fcopy_sequence (Vcoding_system_type_list);
679 }
680
681 void
682 add_entry_to_coding_system_type_list (struct coding_system_methods *meths)
683 {
684 struct coding_system_type_entry entry;
685
686 entry.meths = meths;
687 Dynarr_add (the_coding_system_type_entry_dynarr, entry);
688 Vcoding_system_type_list = Fcons (meths->type, Vcoding_system_type_list);
483 } 689 }
484 690
485 DEFUN ("coding-system-p", Fcoding_system_p, 1, 1, 0, /* 691 DEFUN ("coding-system-p", Fcoding_system_p, 1, 1, 0, /*
486 Return t if OBJECT is a coding system. 692 Return t if OBJECT is a coding system.
487 A coding system is an object that defines how text containing multiple 693 A coding system is an object that defines how text containing multiple
527 while (1) 733 while (1)
528 { 734 {
529 coding_system_or_name = 735 coding_system_or_name =
530 Fgethash (coding_system_or_name, Vcoding_system_hash_table, Qnil); 736 Fgethash (coding_system_or_name, Vcoding_system_hash_table, Qnil);
531 737
532 if (CODING_SYSTEMP (coding_system_or_name) || NILP (coding_system_or_name)) 738 if (CODING_SYSTEMP (coding_system_or_name)
739 || NILP (coding_system_or_name))
533 return coding_system_or_name; 740 return coding_system_or_name;
534 } 741 }
535 } 742 }
536 743
537 DEFUN ("get-coding-system", Fget_coding_system, 1, 1, 0, /* 744 DEFUN ("get-coding-system", Fget_coding_system, 1, 1, 0, /*
546 if (NILP (coding_system)) 753 if (NILP (coding_system))
547 invalid_argument ("No such coding system", name); 754 invalid_argument ("No such coding system", name);
548 return coding_system; 755 return coding_system;
549 } 756 }
550 757
551 /* We store the coding systems in hash tables with the names as the key and the 758 int
552 actual coding system object as the value. Occasionally we need to use them 759 coding_system_is_binary (Lisp_Object coding_system)
553 in a list format. These routines provide us with that. */ 760 {
761 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system);
762 return
763 (EQ (CODING_SYSTEM_TYPE (cs), Qno_conversion) &&
764 CODING_SYSTEM_EOL_TYPE (cs) == EOL_LF &&
765 EQ (CODING_SYSTEM_POST_READ_CONVERSION (cs), Qnil) &&
766 EQ (CODING_SYSTEM_PRE_WRITE_CONVERSION (cs), Qnil));
767 }
768
769 static Lisp_Object
770 coding_system_real_canonical (Lisp_Object cs)
771 {
772 if (!NILP (XCODING_SYSTEM_CANONICAL (cs)))
773 return XCODING_SYSTEM_CANONICAL (cs);
774 return cs;
775 }
776
777 /* Return true if coding system is of the "standard" type that decodes
778 bytes into characters (suitable for decoding a text file). */
779 int
780 coding_system_is_for_text_file (Lisp_Object coding_system)
781 {
782 return (XCODESYSMETH_OR_GIVEN
783 (coding_system, conversion_end_type,
784 (coding_system_real_canonical (coding_system)),
785 DECODES_BYTE_TO_CHARACTER) ==
786 DECODES_BYTE_TO_CHARACTER);
787 }
788
789 static int
790 decoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex)
791 {
792 enum source_sink_type type =
793 XCODESYSMETH_OR_GIVEN (cs, conversion_end_type,
794 (coding_system_real_canonical (cs)),
795 DECODES_BYTE_TO_CHARACTER);
796 if (sex == CODING_SOURCE)
797 return (type == DECODES_CHARACTER_TO_CHARACTER ||
798 type == DECODES_CHARACTER_TO_BYTE);
799 else
800 return (type == DECODES_CHARACTER_TO_CHARACTER ||
801 type == DECODES_BYTE_TO_CHARACTER);
802 }
803
804 static int
805 encoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex)
806 {
807 return decoding_source_sink_type_is_char (cs,
808 /* Sex change */
809 sex == CODING_SOURCE ?
810 CODING_SINK : CODING_SOURCE);
811 }
812
813 /* Like Ffind_coding_system() but check that the coding system is of the
814 "standard" type that decodes bytes into characters (suitable for
815 decoding a text file), and if not, returns an appropriate wrapper that
816 does. Also, if EOL_WRAP is non-zero, check whether this coding system
817 wants EOL auto-detection, and if so, wrap with a convert-eol coding
818 system to do this. */
819
820 Lisp_Object
821 find_coding_system_for_text_file (Lisp_Object name, int eol_wrap)
822 {
823 Lisp_Object coding_system = Ffind_coding_system (name);
824 Lisp_Object wrapper = coding_system;
825
826 if (NILP (coding_system))
827 return Qnil;
828 if (!coding_system_is_for_text_file (coding_system))
829 {
830 wrapper = XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system);
831 if (NILP (wrapper))
832 {
833 Lisp_Object chain;
834 if (!decoding_source_sink_type_is_char (coding_system, CODING_SINK))
835 chain = list2 (coding_system, Qbinary);
836 else
837 chain = list1 (coding_system);
838 if (decoding_source_sink_type_is_char (coding_system, CODING_SOURCE))
839 chain = Fcons (Qbinary, chain);
840 wrapper =
841 make_internal_coding_system
842 (coding_system,
843 "internal-text-file-wrapper",
844 Qchain,
845 Qunbound, list4 (Qchain, chain,
846 Qcanonicalize_after_coding, coding_system));
847 XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system) = wrapper;
848 }
849 }
850
851 if (!eol_wrap || XCODING_SYSTEM_EOL_TYPE (coding_system) != EOL_AUTODETECT)
852 return wrapper;
853
854 coding_system = wrapper;
855 wrapper = XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system);
856 if (!NILP (wrapper))
857 return wrapper;
858 wrapper =
859 make_internal_coding_system
860 (coding_system,
861 "internal-auto-eol-wrapper",
862 Qundecided, Qunbound,
863 list4 (Qcoding_system, coding_system,
864 Qdo_eol, Qt));
865 XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system) = wrapper;
866 return wrapper;
867 }
868
869 /* Like Fget_coding_system() but verify that the coding system is of the
870 "standard" type that decodes bytes into characters (suitable for
871 decoding a text file), and if not, returns an appropriate wrapper that
872 does. Also, if EOL_WRAP is non-zero, check whether this coding system
873 wants EOL auto-detection, and if so, wrap with a convert-eol coding
874 system to do this. */
875
876 Lisp_Object
877 get_coding_system_for_text_file (Lisp_Object name, int eol_wrap)
878 {
879 Lisp_Object coding_system = find_coding_system_for_text_file (name,
880 eol_wrap);
881 if (NILP (coding_system))
882 invalid_argument ("No such coding system", name);
883 return coding_system;
884 }
885
886 /* We store the coding systems in hash tables with the names as the
887 key and the actual coding system object as the value. Occasionally
888 we need to use them in a list format. These routines provide us
889 with that. */
554 struct coding_system_list_closure 890 struct coding_system_list_closure
555 { 891 {
556 Lisp_Object *coding_system_list; 892 Lisp_Object *coding_system_list;
893 int normal;
894 int internal;
557 }; 895 };
558 896
559 static int 897 static int
560 add_coding_system_to_list_mapper (Lisp_Object key, Lisp_Object value, 898 add_coding_system_to_list_mapper (Lisp_Object key, Lisp_Object value,
561 void *coding_system_list_closure) 899 void *coding_system_list_closure)
563 /* This function can GC */ 901 /* This function can GC */
564 struct coding_system_list_closure *cscl = 902 struct coding_system_list_closure *cscl =
565 (struct coding_system_list_closure *) coding_system_list_closure; 903 (struct coding_system_list_closure *) coding_system_list_closure;
566 Lisp_Object *coding_system_list = cscl->coding_system_list; 904 Lisp_Object *coding_system_list = cscl->coding_system_list;
567 905
568 *coding_system_list = Fcons (key, *coding_system_list); 906 /* We can't just use VALUE because KEY might be an alias, and we need
907 the real coding system object. */
908 if (XCODING_SYSTEM (Ffind_coding_system (key))->internal_p ?
909 cscl->internal : cscl->normal)
910 *coding_system_list = Fcons (key, *coding_system_list);
569 return 0; 911 return 0;
570 } 912 }
571 913
572 DEFUN ("coding-system-list", Fcoding_system_list, 0, 0, 0, /* 914 DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /*
573 Return a list of the names of all defined coding systems. 915 Return a list of the names of all defined coding systems.
916 If INTERNAL is nil, only the normal (non-internal) coding systems are
917 included. (Internal coding systems are created for various internal
918 purposes, such as implementing EOL types of CRLF and CR; generally, you do
919 not want to see these.) If it is t, only the internal coding systems are
920 included. If it is any other non-nil value both normal and internal are
921 included.
574 */ 922 */
575 ()) 923 (internal))
576 { 924 {
577 Lisp_Object coding_system_list = Qnil; 925 Lisp_Object coding_system_list = Qnil;
578 struct gcpro gcpro1; 926 struct gcpro gcpro1;
579 struct coding_system_list_closure coding_system_list_closure; 927 struct coding_system_list_closure coding_system_list_closure;
580 928
581 GCPRO1 (coding_system_list); 929 GCPRO1 (coding_system_list);
582 coding_system_list_closure.coding_system_list = &coding_system_list; 930 coding_system_list_closure.coding_system_list = &coding_system_list;
931 coding_system_list_closure.normal = !EQ (internal, Qt);
932 coding_system_list_closure.internal = !NILP (internal);
583 elisp_maphash (add_coding_system_to_list_mapper, Vcoding_system_hash_table, 933 elisp_maphash (add_coding_system_to_list_mapper, Vcoding_system_hash_table,
584 &coding_system_list_closure); 934 &coding_system_list_closure);
585 UNGCPRO; 935 UNGCPRO;
586 936
587 return coding_system_list; 937 return coding_system_list;
595 coding_system = Fget_coding_system (coding_system); 945 coding_system = Fget_coding_system (coding_system);
596 return XCODING_SYSTEM_NAME (coding_system); 946 return XCODING_SYSTEM_NAME (coding_system);
597 } 947 }
598 948
599 static Lisp_Coding_System * 949 static Lisp_Coding_System *
600 allocate_coding_system (enum coding_system_type type, Lisp_Object name) 950 allocate_coding_system (struct coding_system_methods *codesys_meths,
601 { 951 Bytecount data_size,
952 Lisp_Object name)
953 {
954 Bytecount total_size = offsetof (Lisp_Coding_System, data) + data_size;
602 Lisp_Coding_System *codesys = 955 Lisp_Coding_System *codesys =
603 alloc_lcrecord_type (Lisp_Coding_System, &lrecord_coding_system); 956 (Lisp_Coding_System *) alloc_lcrecord (total_size, &lrecord_coding_system);
604 957
605 zero_lcrecord (codesys); 958 zero_sized_lcrecord (codesys, total_size);
959 codesys->methods = codesys_meths;
606 CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = Qnil; 960 CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = Qnil;
607 CODING_SYSTEM_POST_READ_CONVERSION (codesys) = Qnil; 961 CODING_SYSTEM_POST_READ_CONVERSION (codesys) = Qnil;
608 CODING_SYSTEM_EOL_TYPE (codesys) = EOL_AUTODETECT; 962 CODING_SYSTEM_EOL_TYPE (codesys) = EOL_LF;
609 CODING_SYSTEM_EOL_CRLF (codesys) = Qnil; 963 CODING_SYSTEM_EOL_CRLF (codesys) = Qnil;
610 CODING_SYSTEM_EOL_CR (codesys) = Qnil; 964 CODING_SYSTEM_EOL_CR (codesys) = Qnil;
611 CODING_SYSTEM_EOL_LF (codesys) = Qnil; 965 CODING_SYSTEM_EOL_LF (codesys) = Qnil;
612 CODING_SYSTEM_TYPE (codesys) = type; 966 CODING_SYSTEM_SUBSIDIARY_PARENT (codesys) = Qnil;
967 CODING_SYSTEM_CANONICAL (codesys) = Qnil;
613 CODING_SYSTEM_MNEMONIC (codesys) = Qnil; 968 CODING_SYSTEM_MNEMONIC (codesys) = Qnil;
614 #ifdef MULE 969 CODING_SYSTEM_DOCUMENTATION (codesys) = Qnil;
615 if (type == CODESYS_ISO2022) 970 CODING_SYSTEM_TEXT_FILE_WRAPPER (codesys) = Qnil;
616 { 971 CODING_SYSTEM_AUTO_EOL_WRAPPER (codesys) = Qnil;
617 int i; 972 CODING_SYSTEM_NAME (codesys) = name;
618 for (i = 0; i < 4; i++) 973
619 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i) = Qnil; 974 MAYBE_CODESYSMETH (codesys, init, (wrap_coding_system (codesys)));
620 }
621 else if (type == CODESYS_CCL)
622 {
623 CODING_SYSTEM_CCL_DECODE (codesys) = Qnil;
624 CODING_SYSTEM_CCL_ENCODE (codesys) = Qnil;
625 }
626 #endif /* MULE */
627 CODING_SYSTEM_NAME (codesys) = name;
628 975
629 return codesys; 976 return codesys;
630 } 977 }
631 978
632 #ifdef MULE 979 static enum eol_type
633 /* Given a list of charset conversion specs as specified in a Lisp 980 symbol_to_eol_type (Lisp_Object symbol)
634 program, parse it into STORE_HERE. */ 981 {
635 982 CHECK_SYMBOL (symbol);
983 if (NILP (symbol)) return EOL_AUTODETECT;
984 if (EQ (symbol, Qlf)) return EOL_LF;
985 if (EQ (symbol, Qcrlf)) return EOL_CRLF;
986 if (EQ (symbol, Qcr)) return EOL_CR;
987
988 invalid_constant ("Unrecognized eol type", symbol);
989 return EOL_AUTODETECT; /* not reached */
990 }
991
992 static Lisp_Object
993 eol_type_to_symbol (enum eol_type type)
994 {
995 switch (type)
996 {
997 default: abort ();
998 case EOL_LF: return Qlf;
999 case EOL_CRLF: return Qcrlf;
1000 case EOL_CR: return Qcr;
1001 case EOL_AUTODETECT: return Qnil;
1002 }
1003 }
1004
1005 struct subsidiary_type
1006 {
1007 Char_ASCII *extension;
1008 Char_ASCII *mnemonic_ext;
1009 enum eol_type eol;
1010 };
1011
1012 static struct subsidiary_type coding_subsidiary_list[] =
1013 { { "-unix", "", EOL_LF },
1014 { "-dos", ":T", EOL_CRLF },
1015 { "-mac", ":t", EOL_CR } };
1016
1017 /* kludge */
636 static void 1018 static void
637 parse_charset_conversion_specs (charset_conversion_spec_dynarr *store_here, 1019 setup_eol_coding_systems (Lisp_Object codesys)
638 Lisp_Object spec_list) 1020 {
639 { 1021 int len = string_length (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name);
640 Lisp_Object rest; 1022 Intbyte *codesys_name = (Intbyte *) alloca (len + 7);
641 1023 int mlen = -1;
642 EXTERNAL_LIST_LOOP (rest, spec_list) 1024 Intbyte *codesys_mnemonic = 0;
643 { 1025 Lisp_Object codesys_name_sym, sub_codesys;
644 Lisp_Object car = XCAR (rest); 1026 int i;
645 Lisp_Object from, to; 1027
646 struct charset_conversion_spec spec; 1028 memcpy (codesys_name,
647 1029 string_data (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name), len);
648 if (!CONSP (car) || !CONSP (XCDR (car)) || !NILP (XCDR (XCDR (car)))) 1030
649 invalid_argument ("Invalid charset conversion spec", car); 1031 if (STRINGP (XCODING_SYSTEM_MNEMONIC (codesys)))
650 from = Fget_charset (XCAR (car)); 1032 {
651 to = Fget_charset (XCAR (XCDR (car))); 1033 mlen = XSTRING_LENGTH (XCODING_SYSTEM_MNEMONIC (codesys));
652 if (XCHARSET_TYPE (from) != XCHARSET_TYPE (to)) 1034 codesys_mnemonic = (Intbyte *) alloca (mlen + 7);
653 invalid_operation_2 1035 memcpy (codesys_mnemonic,
654 ("Attempted conversion between different charset types", 1036 XSTRING_DATA (XCODING_SYSTEM_MNEMONIC (codesys)), mlen);
655 from, to); 1037 }
656 spec.from_charset = from; 1038
657 spec.to_charset = to; 1039 /* Create three "subsidiary" coding systems, decoding data encoded using
658 1040 each of the three EOL types. We do this for each subsidiary by
659 Dynarr_add (store_here, spec); 1041 copying the original coding system, setting the EOL type
660 } 1042 appropriately, and setting the CANONICAL member of the new coding
661 } 1043 system to be a chain consisting of the original coding system followed
662 1044 by a convert-eol coding system to do the EOL decoding. For EOL type
663 /* Given a dynarr LOAD_HERE of internally-stored charset conversion 1045 LF, however, we don't need any decoding, so we skip creating a
664 specs, return the equivalent as the Lisp programmer would see it. 1046 CANONICAL.
665 1047
666 If LOAD_HERE is 0, return Qnil. */ 1048 If the original coding system is not a text-type coding system
1049 (decodes byte->char), we need to coerce it to one by the appropriate
1050 wrapping in CANONICAL. */
1051
1052 for (i = 0; i < countof (coding_subsidiary_list); i++)
1053 {
1054 Char_ASCII *extension = coding_subsidiary_list[i].extension;
1055 Char_ASCII *mnemonic_ext = coding_subsidiary_list[i].mnemonic_ext;
1056 enum eol_type eol = coding_subsidiary_list[i].eol;
1057
1058 qxestrcpy_c (codesys_name + len, extension);
1059 codesys_name_sym = intern_int (codesys_name);
1060 if (mlen != -1)
1061 qxestrcpy_c (codesys_mnemonic + mlen, mnemonic_ext);
1062
1063 sub_codesys = Fcopy_coding_system (codesys, codesys_name_sym);
1064 if (mlen != -1)
1065 XCODING_SYSTEM_MNEMONIC (sub_codesys) =
1066 build_intstring (codesys_mnemonic);
1067
1068 if (eol != EOL_LF)
1069 {
1070 Lisp_Object chain = list2 (get_coding_system_for_text_file
1071 (codesys, 0),
1072 eol == EOL_CR ? Qconvert_eol_cr :
1073 Qconvert_eol_crlf);
1074 Lisp_Object canon =
1075 make_internal_coding_system
1076 (sub_codesys, "internal-subsidiary-eol-wrapper",
1077 Qchain, Qunbound,
1078 mlen != -1 ?
1079 list6 (Qmnemonic, build_intstring (codesys_mnemonic),
1080 Qchain, chain,
1081 Qcanonicalize_after_coding, sub_codesys) :
1082 list4 (Qchain, chain,
1083 Qcanonicalize_after_coding, sub_codesys));
1084 XCODING_SYSTEM_CANONICAL (sub_codesys) = canon;
1085 }
1086 XCODING_SYSTEM_EOL_TYPE (sub_codesys) = eol;
1087 XCODING_SYSTEM_SUBSIDIARY_PARENT (sub_codesys) = codesys;
1088 XCODING_SYSTEM (codesys)->eol[eol] = sub_codesys;
1089 }
1090 }
1091
1092 /* Basic function to create new coding systems. For `make-coding-system',
1093 NAME-OR-EXISTING is the NAME argument, PREFIX is null, and TYPE,
1094 DESCRIPTION, and PROPS are the same. All created coding systems are put
1095 in a hash table indexed by NAME.
1096
1097 If PREFIX is a string, NAME-OR-EXISTING should specify an existing
1098 coding system (or nil), and an internal coding system will be created.
1099 The name of the coding system will be constructed by combining PREFIX
1100 with the name of the existing coding system (if given), and a number
1101 will be appended to insure uniqueness. In such a case, if Qunbound is
1102 given for DESCRIPTION, the description gets created based on the
1103 generated name. Also, if no mnemonic is given in the properties list, a
1104 mnemonic is created based on the generated name.
1105
1106 For internal coding systems, the coding system is marked as internal
1107 (see `coding-system-list'), and no subsidiaries will be created or
1108 eol-wrapping will happen. Otherwise:
1109
1110 -- if the eol-type property is `lf' or t, the coding system is merely
1111 created and returned. (For t, the coding system will be wrapped with
1112 an EOL autodetector when it's used to read a file.)
1113
1114 -- if eol-type is `crlf' or `cr', after the coding system object is
1115 created, it will be wrapped in a chain with the appropriate
1116 convert-eol coding system (either `convert-eol-crlf' or
1117 `convert-eol-cr'), so that CRLF->LF or CR->LF conversion is done at
1118 decoding time, and the opposite at encoding time. The resulting
1119 chain becomes the CANONICAL field of the coding system object.
1120
1121 -- if eol-type is nil or omitted, "subsidiaries" are generated: Three
1122 coding systems where the original coding system (before wrapping with
1123 convert-eol-autodetect) is either unwrapped or wrapped with
1124 convert-eol-crlf or convert-eol-cr, respectively, so that coding systems
1125 to handle LF, CRLF, and CR end-of-line indicators are created. (This
1126 crazy crap is based on existing behavior in other Mule versions,
1127 including FSF Emacs.)
1128 */
667 1129
668 static Lisp_Object 1130 static Lisp_Object
669 unparse_charset_conversion_specs (charset_conversion_spec_dynarr *load_here) 1131 make_coding_system_1 (Lisp_Object name_or_existing, Char_ASCII *prefix,
670 { 1132 Lisp_Object type, Lisp_Object description,
671 int i; 1133 Lisp_Object props)
672 Lisp_Object result; 1134 {
673 1135 Lisp_Coding_System *cs;
674 if (!load_here) 1136 int need_to_setup_eol_systems = 1;
675 return Qnil; 1137 enum eol_type eol_wrapper = EOL_AUTODETECT;
676 for (i = 0, result = Qnil; i < Dynarr_length (load_here); i++) 1138 struct coding_system_methods *meths;
677 { 1139 Lisp_Object csobj;
678 struct charset_conversion_spec *ccs = Dynarr_atp (load_here, i); 1140 Lisp_Object defmnem = Qnil;
679 result = Fcons (list2 (ccs->from_charset, ccs->to_charset), result); 1141
680 } 1142 if (NILP (type))
681 1143 type = Qundecided;
682 return Fnreverse (result); 1144 meths = decode_coding_system_type (type, ERROR_ME);
683 } 1145
684 1146 if (prefix)
685 #endif /* MULE */ 1147 {
1148 Intbyte *newname =
1149 emacs_sprintf_malloc (NULL, "%s-%s-%d",
1150 prefix,
1151 NILP (name_or_existing) ? (Intbyte *) "nil" :
1152 XSTRING_DATA (Fsymbol_name (XCODING_SYSTEM_NAME
1153 (name_or_existing))),
1154 ++coding_system_tick);
1155 name_or_existing = intern_int (newname);
1156 xfree (newname);
1157
1158 if (UNBOUNDP (description))
1159 {
1160 newname =
1161 emacs_sprintf_malloc
1162 (NULL, "For Internal Use (%s)",
1163 XSTRING_DATA (Fsymbol_name (name_or_existing)));
1164 description = build_intstring (newname);
1165 xfree (newname);
1166 }
1167
1168 newname = emacs_sprintf_malloc (NULL, "Int%d", coding_system_tick);
1169 defmnem = build_intstring (newname);
1170 }
1171 else
1172 CHECK_SYMBOL (name_or_existing);
1173
1174 if (!NILP (Ffind_coding_system (name_or_existing)))
1175 invalid_operation ("Cannot redefine existing coding system",
1176 name_or_existing);
1177
1178 cs = allocate_coding_system (meths, meths->extra_data_size,
1179 name_or_existing);
1180 XSETCODING_SYSTEM (csobj, cs);
1181
1182 cs->internal_p = !!prefix;
1183
1184 if (NILP (description))
1185 description = build_string ("");
1186 else
1187 CHECK_STRING (description);
1188 CODING_SYSTEM_DESCRIPTION (cs) = description;
1189
1190 if (!NILP (defmnem))
1191 CODING_SYSTEM_MNEMONIC (cs) = defmnem;
1192
1193 {
1194 EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props)
1195 {
1196 int recognized = 1;
1197
1198 if (EQ (key, Qmnemonic))
1199 {
1200 if (!NILP (value))
1201 CHECK_STRING (value);
1202 CODING_SYSTEM_MNEMONIC (cs) = value;
1203 }
1204
1205 else if (EQ (key, Qdocumentation))
1206 {
1207 if (!NILP (value))
1208 CHECK_STRING (value);
1209 CODING_SYSTEM_DOCUMENTATION (cs) = value;
1210 }
1211
1212 else if (EQ (key, Qeol_type))
1213 {
1214 need_to_setup_eol_systems = NILP (value);
1215 if (EQ (value, Qt))
1216 value = Qnil;
1217 eol_wrapper = symbol_to_eol_type (value);
1218 }
1219
1220 else if (EQ (key, Qpost_read_conversion))
1221 CODING_SYSTEM_POST_READ_CONVERSION (cs) = value;
1222 else if (EQ (key, Qpre_write_conversion))
1223 CODING_SYSTEM_PRE_WRITE_CONVERSION (cs) = value;
1224 /* FSF compatibility */
1225 else if (EQ (key, Qtranslation_table_for_decode))
1226 ;
1227 else if (EQ (key, Qtranslation_table_for_encode))
1228 ;
1229 else if (EQ (key, Qsafe_chars))
1230 ;
1231 else if (EQ (key, Qsafe_charsets))
1232 ;
1233 else if (EQ (key, Qmime_charset))
1234 ;
1235 else if (EQ (key, Qvalid_codes))
1236 ;
1237 else
1238 recognized = CODESYSMETH_OR_GIVEN (cs, putprop,
1239 (csobj, key, value), 0);
1240
1241 if (!recognized)
1242 invalid_constant ("Unrecognized property", key);
1243 }
1244 }
1245
1246 {
1247 XCODING_SYSTEM_CANONICAL (csobj) =
1248 CODESYSMETH_OR_GIVEN (cs, canonicalize, (csobj), Qnil);
1249 XCODING_SYSTEM_EOL_TYPE (csobj) = EOL_AUTODETECT; /* for copy-coding-system
1250 below */
1251
1252 if (need_to_setup_eol_systems && !cs->internal_p)
1253 setup_eol_coding_systems (csobj);
1254 else if (eol_wrapper == EOL_CR || eol_wrapper == EOL_CRLF)
1255 {
1256 /* If a specific eol-type (other than LF) was specified, we handle
1257 this by converting the coding system into a chain that wraps the
1258 coding system along with a convert-eol system after it, in
1259 exactly that same switcheroo fashion that the normal
1260 canonicalize method works -- BUT we will run into a problem if
1261 we do it the obvious way, because when `chain' creates its
1262 substreams, the substream containing the coding system we're
1263 creating will have canonicalization expansion done on it,
1264 leading to infinite recursion. So we have to generate a new,
1265 internal coding system with the previous value of CANONICAL. */
1266 Intbyte *newname =
1267 emacs_sprintf_malloc
1268 (NULL, "internal-eol-copy-%s-%d",
1269 XSTRING_DATA (Fsymbol_name (name_or_existing)),
1270 ++coding_system_tick);
1271 Lisp_Object newnamesym = intern_int (newname);
1272 Lisp_Object copied = Fcopy_coding_system (csobj, newnamesym);
1273 xfree (newname);
1274
1275 XCODING_SYSTEM_CANONICAL (csobj) =
1276 make_internal_coding_system
1277 (csobj,
1278 "internal-eol-wrapper",
1279 Qchain, Qunbound,
1280 list4 (Qchain,
1281 list2 (copied,
1282 eol_wrapper == EOL_CR ?
1283 Qconvert_eol_cr :
1284 Qconvert_eol_crlf),
1285 Qcanonicalize_after_coding,
1286 csobj));
1287 }
1288 XCODING_SYSTEM_EOL_TYPE (csobj) = eol_wrapper;
1289 }
1290
1291 Fputhash (name_or_existing, csobj, Vcoding_system_hash_table);
1292
1293 return csobj;
1294 }
1295
1296 Lisp_Object
1297 make_internal_coding_system (Lisp_Object existing, Char_ASCII *prefix,
1298 Lisp_Object type, Lisp_Object description,
1299 Lisp_Object props)
1300 {
1301 return make_coding_system_1 (existing, prefix, type, description, props);
1302 }
686 1303
687 DEFUN ("make-coding-system", Fmake_coding_system, 2, 4, 0, /* 1304 DEFUN ("make-coding-system", Fmake_coding_system, 2, 4, 0, /*
688 Register symbol NAME as a coding system. 1305 Register symbol NAME as a coding system.
689 1306
690 TYPE describes the conversion method used and should be one of 1307 TYPE describes the conversion method used and should be one of
691 1308
692 nil or 'undecided 1309 nil or 'undecided
693 Automatic conversion. XEmacs attempts to detect the coding system 1310 Automatic conversion. XEmacs attempts to detect the coding system
694 used in the file. 1311 used in the file.
1312 'chain
1313 Chain two or more coding systems together to make a combination coding
1314 system.
695 'no-conversion 1315 'no-conversion
696 No conversion. Use this for binary files and such. On output, 1316 No conversion. Use this for binary files and such. On output,
697 graphic characters that are not in ASCII or Latin-1 will be 1317 graphic characters that are not in ASCII or Latin-1 will be
698 replaced by a ?. (For a no-conversion-encoded buffer, these 1318 replaced by a ?. (For a no-conversion-encoded buffer, these
699 characters will only be present if you explicitly insert them.) 1319 characters will only be present if you explicitly insert them.)
1320 'convert-eol
1321 Convert CRLF sequences or CR to LF.
700 'shift-jis 1322 'shift-jis
701 Shift-JIS (a Japanese encoding commonly used in PC operating systems). 1323 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
702 'ucs-4 1324 'unicode
703 ISO 10646 UCS-4 encoding. 1325 Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.).
704 'utf-8 1326 'mswindows-unicode-to-multibyte
705 ISO 10646 UTF-8 encoding. 1327 (MS Windows only) Converts from Windows Unicode to Windows Multibyte
1328 (any code page encoding) upon encoding, and the other way upon decoding.
1329 'mswindows-multibyte
1330 Converts to or from Windows Multibyte (any code page encoding).
1331 This is resolved into a chain of `mswindows-unicode' and
1332 `mswindows-unicode-to-multibyte'.
706 'iso2022 1333 'iso2022
707 Any ISO2022-compliant encoding. Among other things, this includes 1334 Any ISO2022-compliant encoding. Among other things, this includes
708 JIS (the Japanese encoding commonly used for e-mail), EUC (the 1335 JIS (the Japanese encoding commonly used for e-mail), EUC (the
709 standard Unix encoding for Japanese and other languages), and 1336 standard Unix encoding for Japanese and other languages), and
710 Compound Text (the encoding used in X11). You can specify more 1337 Compound Text (the encoding used in X11). You can specify more
713 Big5 (the encoding commonly used for Taiwanese). 1340 Big5 (the encoding commonly used for Taiwanese).
714 'ccl 1341 'ccl
715 The conversion is performed using a user-written pseudo-code 1342 The conversion is performed using a user-written pseudo-code
716 program. CCL (Code Conversion Language) is the name of this 1343 program. CCL (Code Conversion Language) is the name of this
717 pseudo-code. 1344 pseudo-code.
1345 'gzip
1346 GZIP compression format.
718 'internal 1347 'internal
719 Write out or read in the raw contents of the memory representing 1348 Write out or read in the raw contents of the memory representing
720 the buffer's text. This is primarily useful for debugging 1349 the buffer's text. This is primarily useful for debugging
721 purposes, and is only enabled when XEmacs has been compiled with 1350 purposes, and is only enabled when XEmacs has been compiled with
722 DEBUG_XEMACS defined (via the --debug configure option). 1351 DEBUG_XEMACS defined (via the --debug configure option).
724 in an internal inconsistency in the memory representing a 1353 in an internal inconsistency in the memory representing a
725 buffer's text, which will produce unpredictable results and may 1354 buffer's text, which will produce unpredictable results and may
726 cause XEmacs to crash. Under normal circumstances you should 1355 cause XEmacs to crash. Under normal circumstances you should
727 never use 'internal conversion. 1356 never use 'internal conversion.
728 1357
729 DOC-STRING is a string describing the coding system. 1358 DESCRIPTION is a short English phrase describing the coding system,
1359 suitable for use as a menu item. (See also the `documentation' property
1360 below.)
730 1361
731 PROPS is a property list, describing the specific nature of the 1362 PROPS is a property list, describing the specific nature of the
732 character set. Recognized properties are: 1363 character set. Recognized properties are:
733 1364
734 'mnemonic 1365 'mnemonic
735 String to be displayed in the modeline when this coding system is 1366 String to be displayed in the modeline when this coding system is
736 active. 1367 active.
1368
1369 'documentation
1370 Detailed documentation on the coding system.
737 1371
738 'eol-type 1372 'eol-type
739 End-of-line conversion to be used. It should be one of 1373 End-of-line conversion to be used. It should be one of
740 1374
741 nil 1375 nil
762 generate subsidiary coding systems. (This value is 1396 generate subsidiary coding systems. (This value is
763 converted to nil when stored internally, and 1397 converted to nil when stored internally, and
764 `coding-system-property' will return nil.) 1398 `coding-system-property' will return nil.)
765 1399
766 'post-read-conversion 1400 'post-read-conversion
767 Function called after a file has been read in, to perform the 1401 The value is a function to call after some text is inserted and
768 decoding. Called with two arguments, START and END, denoting 1402 decoded by the coding system itself and before any functions in
769 a region of the current buffer to be decoded. 1403 `after-change-functions' are called. (#### Not actually true in
1404 XEmacs. `after-change-functions' will be called twice if
1405 `post-read-conversion' changes something.) The argument of this
1406 function is the same as for a function in
1407 `after-insert-file-functions', i.e. LENGTH of the text inserted,
1408 with point at the head of the text to be decoded.
770 1409
771 'pre-write-conversion 1410 'pre-write-conversion
772 Function called before a file is written out, to perform the 1411 The value is a function to call after all functions in
773 encoding. Called with two arguments, START and END, denoting 1412 `write-region-annotate-functions' and `buffer-file-format' are
774 a region of the current buffer to be encoded. 1413 called, and before the text is encoded by the coding system itself.
1414 The arguments to this function are the same as those of a function
1415 in `write-region-annotate-functions', i.e. FROM and TO, specifying
1416 a region of text.
1417
1418
1419
1420 The following properties are allowed for FSF compatibility but currently
1421 ignored:
1422
1423 'translation-table-for-decode
1424 The value is a translation table to be applied on decoding. See
1425 the function `make-translation-table' for the format of translation
1426 table. This is not applicable to CCL-based coding systems.
1427
1428 'translation-table-for-encode
1429 The value is a translation table to be applied on encoding. This is
1430 not applicable to CCL-based coding systems.
1431
1432 'safe-chars
1433 The value is a char table. If a character has non-nil value in it,
1434 the character is safely supported by the coding system. This
1435 overrides the specification of safe-charsets.
1436
1437 'safe-charsets
1438 The value is a list of charsets safely supported by the coding
1439 system. The value t means that all charsets Emacs handles are
1440 supported. Even if some charset is not in this list, it doesn't
1441 mean that the charset can't be encoded in the coding system;
1442 it just means that some other receiver of text encoded
1443 in the coding system won't be able to handle that charset.
1444
1445 'mime-charset
1446 The value is a symbol of which name is `MIME-charset' parameter of
1447 the coding system.
1448
1449 'valid-codes (meaningful only for a coding system based on CCL)
1450 The value is a list to indicate valid byte ranges of the encoded
1451 file. Each element of the list is an integer or a cons of integer.
1452 In the former case, the integer value is a valid byte code. In the
1453 latter case, the integers specifies the range of valid byte codes.
1454
1455
1456
1457 The following additional property is recognized if TYPE is 'convert-eol:
1458
1459 'subtype
1460 One of `lf', `crlf', `cr' or `autodetect'. When decoding, the
1461 corresponding sequence will be converted to LF. When encoding, the
1462 opposite happens. This coding system converts characters to
1463 characters.
1464
775 1465
776 1466
777 The following additional properties are recognized if TYPE is 'iso2022: 1467 The following additional properties are recognized if TYPE is 'iso2022:
778 1468
779 'charset-g0 1469 'charset-g0
844 characters in one charset to another when encoding is performed. 1534 characters in one charset to another when encoding is performed.
845 The form of each specification is the same as for 1535 The form of each specification is the same as for
846 'input-charset-conversion. 1536 'input-charset-conversion.
847 1537
848 1538
1539
849 The following additional properties are recognized (and required) 1540 The following additional properties are recognized (and required)
850 if TYPE is 'ccl: 1541 if TYPE is 'ccl:
851 1542
852 'decode 1543 'decode
853 CCL program used for decoding (converting to internal format). 1544 CCL program used for decoding (converting to internal format).
854 1545
855 'encode 1546 'encode
856 CCL program used for encoding (converting to external format). 1547 CCL program used for encoding (converting to external format).
1548
1549
1550 The following additional properties are recognized if TYPE is 'chain:
1551
1552 'chain
1553 List of coding systems to be chained together, in decoding order.
1554
1555 'canonicalize-after-coding
1556 Coding system to be returned by the detector routines in place of
1557 this coding system.
1558
1559
1560
1561 The following additional properties are recognized if TYPE is 'unicode:
1562
1563 'type
1564 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not
1565 yet implemented). `utf-16' is the basic two-byte encoding;
1566 `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible
1567 variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using
1568 only characters that will safely pass through all mail gateways.
1569
1570 'little-endian
1571 If non-nil, `utf-16' and `ucs-4' will write out the groups of two
1572 or four bytes little-endian instead of big-endian. This is required,
1573 for example, under Windows.
1574
1575 'need-bom
1576 If non-nil, a byte order mark (BOM, or Unicode FFFE) should be
1577 written out at the beginning of the data. This serves both to
1578 identify the endianness of the following data and to mark the
1579 data as Unicode (at least, this is how Windows uses it).
1580
1581
1582
1583 The following additional properties are recognized if TYPE is
1584 'mswindows-multibyte:
1585
1586 'code-page
1587 Either a number (specifying a particular code page) or one of the
1588 symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI,
1589 OEM, Macintosh, or EBCDIC code page associated with a particular
1590 locale (given by the `locale' property). NOTE: EBCDIC code pages
1591 only exist in Windows 2000 and later.
1592
1593 'locale
1594 If `code-page' is a symbol, this specifies the locale whose code
1595 page of the corresponding type should be used. This should be
1596 one of the following: A cons of two strings, (LANGUAGE
1597 . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a
1598 language; SUBLANG_DEFAULT, i.e. the default sublanguage, is
1599 used); or one of the symbols `current', `user-default', or
1600 `system-default', corresponding to the values of
1601 `mswindows-current-locale', `mswindows-user-default-locale', or
1602 `mswindows-system-default-locale', respectively.
1603
1604
1605
1606 The following additional properties are recognized if TYPE is 'undecided:
1607
1608 'do-eol
1609 Do EOL detection.
1610
1611 'do-coding
1612 Do encoding detection.
1613
1614 'coding-system
1615 If encoding detection is not done, use the specified coding system
1616 to do decoding. This is used internally when implementing coding
1617 systems with an EOL type that specifies autodetection (the default),
1618 so that the detector routines return the proper subsidiary.
1619
1620
1621
1622 The following additional property is recognized if TYPE is 'gzip:
1623
1624 'level
1625 Compression level: 0 through 9, or `default' (currently 6).
1626
857 */ 1627 */
858 (name, type, doc_string, props)) 1628 (name, type, description, props))
859 { 1629 {
860 Lisp_Coding_System *codesys; 1630 return make_coding_system_1 (name, 0, type, description, props);
861 enum coding_system_type ty;
862 int need_to_setup_eol_systems = 1;
863
864 /* Convert type to constant */
865 if (NILP (type) || EQ (type, Qundecided))
866 { ty = CODESYS_AUTODETECT; }
867 #ifdef MULE
868 else if (EQ (type, Qshift_jis)) { ty = CODESYS_SHIFT_JIS; }
869 else if (EQ (type, Qiso2022)) { ty = CODESYS_ISO2022; }
870 else if (EQ (type, Qbig5)) { ty = CODESYS_BIG5; }
871 else if (EQ (type, Qucs4)) { ty = CODESYS_UCS4; }
872 else if (EQ (type, Qutf8)) { ty = CODESYS_UTF8; }
873 else if (EQ (type, Qccl)) { ty = CODESYS_CCL; }
874 #endif
875 else if (EQ (type, Qno_conversion)) { ty = CODESYS_NO_CONVERSION; }
876 #ifdef DEBUG_XEMACS
877 else if (EQ (type, Qinternal)) { ty = CODESYS_INTERNAL; }
878 #endif
879 else
880 invalid_constant ("Invalid coding system type", type);
881
882 CHECK_SYMBOL (name);
883
884 codesys = allocate_coding_system (ty, name);
885
886 if (NILP (doc_string))
887 doc_string = build_string ("");
888 else
889 CHECK_STRING (doc_string);
890 CODING_SYSTEM_DOC_STRING (codesys) = doc_string;
891
892 {
893 EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props)
894 {
895 if (EQ (key, Qmnemonic))
896 {
897 if (!NILP (value))
898 CHECK_STRING (value);
899 CODING_SYSTEM_MNEMONIC (codesys) = value;
900 }
901
902 else if (EQ (key, Qeol_type))
903 {
904 need_to_setup_eol_systems = NILP (value);
905 if (EQ (value, Qt))
906 value = Qnil;
907 CODING_SYSTEM_EOL_TYPE (codesys) = symbol_to_eol_type (value);
908 }
909
910 else if (EQ (key, Qpost_read_conversion)) CODING_SYSTEM_POST_READ_CONVERSION (codesys) = value;
911 else if (EQ (key, Qpre_write_conversion)) CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = value;
912 #ifdef MULE
913 else if (ty == CODESYS_ISO2022)
914 {
915 #define FROB_INITIAL_CHARSET(charset_num) \
916 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, charset_num) = \
917 ((EQ (value, Qt) || EQ (value, Qnil)) ? value : Fget_charset (value))
918
919 if (EQ (key, Qcharset_g0)) FROB_INITIAL_CHARSET (0);
920 else if (EQ (key, Qcharset_g1)) FROB_INITIAL_CHARSET (1);
921 else if (EQ (key, Qcharset_g2)) FROB_INITIAL_CHARSET (2);
922 else if (EQ (key, Qcharset_g3)) FROB_INITIAL_CHARSET (3);
923
924 #define FROB_FORCE_CHARSET(charset_num) \
925 CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (codesys, charset_num) = !NILP (value)
926
927 else if (EQ (key, Qforce_g0_on_output)) FROB_FORCE_CHARSET (0);
928 else if (EQ (key, Qforce_g1_on_output)) FROB_FORCE_CHARSET (1);
929 else if (EQ (key, Qforce_g2_on_output)) FROB_FORCE_CHARSET (2);
930 else if (EQ (key, Qforce_g3_on_output)) FROB_FORCE_CHARSET (3);
931
932 #define FROB_BOOLEAN_PROPERTY(prop) \
933 CODING_SYSTEM_ISO2022_##prop (codesys) = !NILP (value)
934
935 else if (EQ (key, Qshort)) FROB_BOOLEAN_PROPERTY (SHORT);
936 else if (EQ (key, Qno_ascii_eol)) FROB_BOOLEAN_PROPERTY (NO_ASCII_EOL);
937 else if (EQ (key, Qno_ascii_cntl)) FROB_BOOLEAN_PROPERTY (NO_ASCII_CNTL);
938 else if (EQ (key, Qseven)) FROB_BOOLEAN_PROPERTY (SEVEN);
939 else if (EQ (key, Qlock_shift)) FROB_BOOLEAN_PROPERTY (LOCK_SHIFT);
940 else if (EQ (key, Qno_iso6429)) FROB_BOOLEAN_PROPERTY (NO_ISO6429);
941 else if (EQ (key, Qescape_quoted)) FROB_BOOLEAN_PROPERTY (ESCAPE_QUOTED);
942
943 else if (EQ (key, Qinput_charset_conversion))
944 {
945 codesys->iso2022.input_conv =
946 Dynarr_new (charset_conversion_spec);
947 parse_charset_conversion_specs (codesys->iso2022.input_conv,
948 value);
949 }
950 else if (EQ (key, Qoutput_charset_conversion))
951 {
952 codesys->iso2022.output_conv =
953 Dynarr_new (charset_conversion_spec);
954 parse_charset_conversion_specs (codesys->iso2022.output_conv,
955 value);
956 }
957 else
958 invalid_constant ("Unrecognized property", key);
959 }
960 else if (EQ (type, Qccl))
961 {
962 Lisp_Object sym;
963 struct ccl_program test_ccl;
964 Extbyte *suffix;
965
966 /* Check key first. */
967 if (EQ (key, Qdecode))
968 suffix = "-ccl-decode";
969 else if (EQ (key, Qencode))
970 suffix = "-ccl-encode";
971 else
972 invalid_constant ("Unrecognized property", key);
973
974 /* If value is vector, register it as a ccl program
975 associated with an newly created symbol for
976 backward compatibility. */
977 if (VECTORP (value))
978 {
979 sym = Fintern (concat2 (Fsymbol_name (name),
980 build_string (suffix)),
981 Qnil);
982 Fregister_ccl_program (sym, value);
983 }
984 else
985 {
986 CHECK_SYMBOL (value);
987 sym = value;
988 }
989 /* check if the given ccl programs are valid. */
990 if (setup_ccl_program (&test_ccl, sym) < 0)
991 invalid_argument ("Invalid CCL program", value);
992
993 if (EQ (key, Qdecode))
994 CODING_SYSTEM_CCL_DECODE (codesys) = sym;
995 else if (EQ (key, Qencode))
996 CODING_SYSTEM_CCL_ENCODE (codesys) = sym;
997
998 }
999 #endif /* MULE */
1000 else
1001 invalid_constant ("Unrecognized property", key);
1002 }
1003 }
1004
1005 if (need_to_setup_eol_systems)
1006 setup_eol_coding_systems (codesys);
1007
1008 {
1009 Lisp_Object codesys_obj;
1010 XSETCODING_SYSTEM (codesys_obj, codesys);
1011 Fputhash (name, codesys_obj, Vcoding_system_hash_table);
1012 return codesys_obj;
1013 }
1014 } 1631 }
1015 1632
1016 DEFUN ("copy-coding-system", Fcopy_coding_system, 2, 2, 0, /* 1633 DEFUN ("copy-coding-system", Fcopy_coding_system, 2, 2, 0, /*
1017 Copy OLD-CODING-SYSTEM to NEW-NAME. 1634 Copy OLD-CODING-SYSTEM to NEW-NAME.
1018 If NEW-NAME does not name an existing coding system, a new one will 1635 If NEW-NAME does not name an existing coding system, a new one will
1019 be created. 1636 be created.
1637 If you are using this function to create an alias, think again:
1638 Use `define-coding-system-alias' instead.
1020 */ 1639 */
1021 (old_coding_system, new_name)) 1640 (old_coding_system, new_name))
1022 { 1641 {
1023 Lisp_Object new_coding_system; 1642 Lisp_Object new_coding_system;
1024 old_coding_system = Fget_coding_system (old_coding_system); 1643 old_coding_system = Fget_coding_system (old_coding_system);
1025 new_coding_system = Ffind_coding_system (new_name); 1644 new_coding_system =
1645 UNBOUNDP (new_name) ? Qnil : Ffind_coding_system (new_name);
1026 if (NILP (new_coding_system)) 1646 if (NILP (new_coding_system))
1027 { 1647 {
1028 XSETCODING_SYSTEM (new_coding_system, 1648 XSETCODING_SYSTEM
1029 allocate_coding_system 1649 (new_coding_system,
1030 (XCODING_SYSTEM_TYPE (old_coding_system), 1650 allocate_coding_system
1031 new_name)); 1651 (XCODING_SYSTEM (old_coding_system)->methods,
1032 Fputhash (new_name, new_coding_system, Vcoding_system_hash_table); 1652 XCODING_SYSTEM (old_coding_system)->methods->extra_data_size,
1033 } 1653 new_name));
1654 if (!UNBOUNDP (new_name))
1655 Fputhash (new_name, new_coding_system, Vcoding_system_hash_table);
1656 }
1657 else if (XCODING_SYSTEM (old_coding_system)->methods !=
1658 XCODING_SYSTEM (new_coding_system)->methods)
1659 invalid_operation_2 ("Coding systems not same type",
1660 old_coding_system, new_coding_system);
1034 1661
1035 { 1662 {
1036 Lisp_Coding_System *to = XCODING_SYSTEM (new_coding_system); 1663 Lisp_Coding_System *to = XCODING_SYSTEM (new_coding_system);
1037 Lisp_Coding_System *from = XCODING_SYSTEM (old_coding_system); 1664 Lisp_Coding_System *from = XCODING_SYSTEM (old_coding_system);
1038 memcpy (((char *) to ) + sizeof (to->header), 1665 copy_sized_lcrecord (to, from, sizeof_coding_system (from));
1039 ((char *) from) + sizeof (from->header),
1040 sizeof (*from) - sizeof (from->header));
1041 to->name = new_name; 1666 to->name = new_name;
1042 } 1667 }
1043 return new_coding_system; 1668 return new_coding_system;
1044 } 1669 }
1045 1670
1046 DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p, 1, 1, 0, /* 1671 DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p,
1672 1, 1, 0, /*
1047 Return t if OBJECT names a coding system, and is not a coding system alias. 1673 Return t if OBJECT names a coding system, and is not a coding system alias.
1048 */ 1674 */
1049 (object)) 1675 (object))
1050 { 1676 {
1051 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil)) 1677 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil))
1073 else 1699 else
1074 invalid_argument ("Symbol is not a coding system alias", alias); 1700 invalid_argument ("Symbol is not a coding system alias", alias);
1075 return Qnil; /* To keep the compiler happy */ 1701 return Qnil; /* To keep the compiler happy */
1076 } 1702 }
1077 1703
1078 static Lisp_Object
1079 append_suffix_to_symbol (Lisp_Object symbol, const char *ascii_string)
1080 {
1081 return Fintern (concat2 (Fsymbol_name (symbol), build_string (ascii_string)),
1082 Qnil);
1083 }
1084
1085 /* A maphash function, for removing dangling coding system aliases. */ 1704 /* A maphash function, for removing dangling coding system aliases. */
1086 static int 1705 static int
1087 dangling_coding_system_alias_p (Lisp_Object alias, 1706 dangling_coding_system_alias_p (Lisp_Object alias,
1088 Lisp_Object aliasee, 1707 Lisp_Object aliasee,
1089 void *dangling_aliases) 1708 void *dangling_aliases)
1127 ("Symbol is the canonical name of a coding system and cannot be redefined", 1746 ("Symbol is the canonical name of a coding system and cannot be redefined",
1128 alias); 1747 alias);
1129 1748
1130 if (NILP (aliasee)) 1749 if (NILP (aliasee))
1131 { 1750 {
1132 Lisp_Object subsidiary_unix = append_suffix_to_symbol (alias, "-unix"); 1751 Lisp_Object subsidiary_unix = add_suffix_to_symbol (alias, "-unix");
1133 Lisp_Object subsidiary_dos = append_suffix_to_symbol (alias, "-dos"); 1752 Lisp_Object subsidiary_dos = add_suffix_to_symbol (alias, "-dos");
1134 Lisp_Object subsidiary_mac = append_suffix_to_symbol (alias, "-mac"); 1753 Lisp_Object subsidiary_mac = add_suffix_to_symbol (alias, "-mac");
1135 1754
1136 Fremhash (alias, Vcoding_system_hash_table); 1755 Fremhash (alias, Vcoding_system_hash_table);
1137 1756
1138 /* Undefine subsidiary aliases, 1757 /* Undefine subsidiary aliases,
1139 presumably created by a previous call to this function */ 1758 presumably created by a previous call to this function */
1188 static const char *suffixes[] = { "-unix", "-dos", "-mac" }; 1807 static const char *suffixes[] = { "-unix", "-dos", "-mac" };
1189 int i; 1808 int i;
1190 for (i = 0; i < countof (suffixes); i++) 1809 for (i = 0; i < countof (suffixes); i++)
1191 { 1810 {
1192 Lisp_Object alias_subsidiary = 1811 Lisp_Object alias_subsidiary =
1193 append_suffix_to_symbol (alias, suffixes[i]); 1812 add_suffix_to_symbol (alias, suffixes[i]);
1194 Lisp_Object aliasee_subsidiary = 1813 Lisp_Object aliasee_subsidiary =
1195 append_suffix_to_symbol (aliasee, suffixes[i]); 1814 add_suffix_to_symbol (aliasee, suffixes[i]);
1196 1815
1197 if (! NILP (Ffind_coding_system (aliasee_subsidiary))) 1816 if (! NILP (Ffind_coding_system (aliasee_subsidiary)))
1198 Fdefine_coding_system_alias (alias_subsidiary, aliasee_subsidiary); 1817 Fdefine_coding_system_alias (alias_subsidiary, aliasee_subsidiary);
1199 } 1818 }
1200 } 1819 }
1203 meaningful or nothing at all. */ 1822 meaningful or nothing at all. */
1204 return Qnil; 1823 return Qnil;
1205 } 1824 }
1206 1825
1207 static Lisp_Object 1826 static Lisp_Object
1208 subsidiary_coding_system (Lisp_Object coding_system, eol_type_t type) 1827 subsidiary_coding_system (Lisp_Object coding_system, enum eol_type type)
1209 { 1828 {
1210 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); 1829 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system);
1211 Lisp_Object new_coding_system; 1830 Lisp_Object new_coding_system;
1212
1213 if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT)
1214 return coding_system;
1215 1831
1216 switch (type) 1832 switch (type)
1217 { 1833 {
1218 case EOL_AUTODETECT: return coding_system; 1834 case EOL_AUTODETECT: return coding_system;
1219 case EOL_LF: new_coding_system = CODING_SYSTEM_EOL_LF (cs); break; 1835 case EOL_LF: new_coding_system = CODING_SYSTEM_EOL_LF (cs); break;
1225 return NILP (new_coding_system) ? coding_system : new_coding_system; 1841 return NILP (new_coding_system) ? coding_system : new_coding_system;
1226 } 1842 }
1227 1843
1228 DEFUN ("subsidiary-coding-system", Fsubsidiary_coding_system, 2, 2, 0, /* 1844 DEFUN ("subsidiary-coding-system", Fsubsidiary_coding_system, 2, 2, 0, /*
1229 Return the subsidiary coding system of CODING-SYSTEM with eol type EOL-TYPE. 1845 Return the subsidiary coding system of CODING-SYSTEM with eol type EOL-TYPE.
1846 The logically opposite operation is `coding-system-base'.
1230 */ 1847 */
1231 (coding_system, eol_type)) 1848 (coding_system, eol_type))
1232 { 1849 {
1233 coding_system = Fget_coding_system (coding_system); 1850 coding_system = get_coding_system_for_text_file (coding_system, 0);
1234 1851
1235 return subsidiary_coding_system (coding_system, 1852 return subsidiary_coding_system (coding_system,
1236 symbol_to_eol_type (eol_type)); 1853 symbol_to_eol_type (eol_type));
1854 }
1855
1856 DEFUN ("coding-system-base", Fcoding_system_base,
1857 1, 1, 0, /*
1858 Return the base coding system of CODING-SYSTEM.
1859 If CODING-SYSTEM is a subsidiary, this returns its parent; otherwise, it
1860 returns CODING-SYSTEM.
1861 The logically opposite operation is `subsidiary-coding-system'.
1862 */
1863 (coding_system))
1864 {
1865 Lisp_Object base;
1866
1867 coding_system = Fget_coding_system (coding_system);
1868 if (EQ (XCODING_SYSTEM_NAME (coding_system), Qbinary))
1869 return Fget_coding_system (Qraw_text); /* hack! */
1870 base = XCODING_SYSTEM_SUBSIDIARY_PARENT (coding_system);
1871 if (!NILP (base))
1872 return base;
1873 return coding_system;
1874 }
1875
1876 DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io,
1877 1, 1, 0, /*
1878 Return the coding system actually used for I/O.
1879 In some cases (e.g. when a particular EOL type is specified) this won't be
1880 the coding system itself. This can be useful when trying to track down
1881 more closely how exactly data is decoded.
1882 */
1883 (coding_system))
1884 {
1885 Lisp_Object canon;
1886
1887 coding_system = Fget_coding_system (coding_system);
1888 canon = XCODING_SYSTEM_CANONICAL (coding_system);
1889 if (!NILP (canon))
1890 return canon;
1891 return coding_system;
1237 } 1892 }
1238 1893
1239 1894
1240 /************************************************************************/ 1895 /************************************************************************/
1241 /* Coding system accessors */ 1896 /* Coding system accessors */
1242 /************************************************************************/ 1897 /************************************************************************/
1243 1898
1244 DEFUN ("coding-system-doc-string", Fcoding_system_doc_string, 1, 1, 0, /* 1899 DEFUN ("coding-system-description", Fcoding_system_description, 1, 1, 0, /*
1245 Return the doc string for CODING-SYSTEM. 1900 Return the description for CODING-SYSTEM.
1901 The `description' of a coding system is a short English phrase giving the
1902 name rendered according to English punctuation rules, plus possibly some
1903 explanatory text (typically in the form of a parenthetical phrase). The
1904 description is intended to be short enough that it can appear as a menu item,
1905 and clear enough to be recognizable even to someone who is assumed to have
1906 some basic familiarity with different encodings but may not know all the
1907 technical names; thus, for `cn-gb-2312' is described as "Chinese EUC" and
1908 `hz-gb-2312' is described as "Hz/ZW (Chinese)", where the actual name of
1909 the encoding is given, followed by a note that this is a Chinese encoding,
1910 because the great majority of people encountering this would have no idea
1911 what it is, and giving the language indicates whether the encoding should
1912 just be ignored or (conceivably) investigated more thoroughly.
1246 */ 1913 */
1247 (coding_system)) 1914 (coding_system))
1248 { 1915 {
1249 coding_system = Fget_coding_system (coding_system); 1916 coding_system = Fget_coding_system (coding_system);
1250 return XCODING_SYSTEM_DOC_STRING (coding_system); 1917 return XCODING_SYSTEM_DESCRIPTION (coding_system);
1251 } 1918 }
1252 1919
1253 DEFUN ("coding-system-type", Fcoding_system_type, 1, 1, 0, /* 1920 DEFUN ("coding-system-type", Fcoding_system_type, 1, 1, 0, /*
1254 Return the type of CODING-SYSTEM. 1921 Return the type of CODING-SYSTEM.
1255 */ 1922 */
1256 (coding_system)) 1923 (coding_system))
1257 { 1924 {
1258 switch (XCODING_SYSTEM_TYPE (Fget_coding_system (coding_system))) 1925 coding_system = Fget_coding_system (coding_system);
1259 { 1926 return XCODING_SYSTEM_TYPE (coding_system);
1260 default: abort (); 1927 }
1261 case CODESYS_AUTODETECT: return Qundecided;
1262 #ifdef MULE
1263 case CODESYS_SHIFT_JIS: return Qshift_jis;
1264 case CODESYS_ISO2022: return Qiso2022;
1265 case CODESYS_BIG5: return Qbig5;
1266 case CODESYS_UCS4: return Qucs4;
1267 case CODESYS_UTF8: return Qutf8;
1268 case CODESYS_CCL: return Qccl;
1269 #endif
1270 case CODESYS_NO_CONVERSION: return Qno_conversion;
1271 #ifdef DEBUG_XEMACS
1272 case CODESYS_INTERNAL: return Qinternal;
1273 #endif
1274 }
1275 }
1276
1277 #ifdef MULE
1278 static
1279 Lisp_Object coding_system_charset (Lisp_Object coding_system, int gnum)
1280 {
1281 Lisp_Object cs
1282 = XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, gnum);
1283
1284 return CHARSETP (cs) ? XCHARSET_NAME (cs) : Qnil;
1285 }
1286 #endif /* MULE */
1287 1928
1288 DEFUN ("coding-system-property", Fcoding_system_property, 2, 2, 0, /* 1929 DEFUN ("coding-system-property", Fcoding_system_property, 2, 2, 0, /*
1289 Return the PROP property of CODING-SYSTEM. 1930 Return the PROP property of CODING-SYSTEM.
1290 */ 1931 */
1291 (coding_system, prop)) 1932 (coding_system, prop))
1292 { 1933 {
1293 int i, ok = 0;
1294 enum coding_system_type type;
1295
1296 coding_system = Fget_coding_system (coding_system); 1934 coding_system = Fget_coding_system (coding_system);
1297 CHECK_SYMBOL (prop); 1935 CHECK_SYMBOL (prop);
1298 type = XCODING_SYSTEM_TYPE (coding_system);
1299
1300 for (i = 0; !ok && i < Dynarr_length (the_codesys_prop_dynarr); i++)
1301 if (EQ (Dynarr_at (the_codesys_prop_dynarr, i).sym, prop))
1302 {
1303 ok = 1;
1304 switch (Dynarr_at (the_codesys_prop_dynarr, i).prop_type)
1305 {
1306 case CODESYS_PROP_ALL_OK:
1307 break;
1308 #ifdef MULE
1309 case CODESYS_PROP_ISO2022:
1310 if (type != CODESYS_ISO2022)
1311 invalid_argument
1312 ("Property only valid in ISO2022 coding systems",
1313 prop);
1314 break;
1315
1316 case CODESYS_PROP_CCL:
1317 if (type != CODESYS_CCL)
1318 invalid_argument
1319 ("Property only valid in CCL coding systems",
1320 prop);
1321 break;
1322 #endif /* MULE */
1323 default:
1324 abort ();
1325 }
1326 }
1327
1328 if (!ok)
1329 invalid_constant ("Unrecognized property", prop);
1330 1936
1331 if (EQ (prop, Qname)) 1937 if (EQ (prop, Qname))
1332 return XCODING_SYSTEM_NAME (coding_system); 1938 return XCODING_SYSTEM_NAME (coding_system);
1333 else if (EQ (prop, Qtype)) 1939 else if (EQ (prop, Qtype))
1334 return Fcoding_system_type (coding_system); 1940 return Fcoding_system_type (coding_system);
1335 else if (EQ (prop, Qdoc_string)) 1941 else if (EQ (prop, Qdescription))
1336 return XCODING_SYSTEM_DOC_STRING (coding_system); 1942 return XCODING_SYSTEM_DESCRIPTION (coding_system);
1337 else if (EQ (prop, Qmnemonic)) 1943 else if (EQ (prop, Qmnemonic))
1338 return XCODING_SYSTEM_MNEMONIC (coding_system); 1944 return XCODING_SYSTEM_MNEMONIC (coding_system);
1945 else if (EQ (prop, Qdocumentation))
1946 return XCODING_SYSTEM_DOCUMENTATION (coding_system);
1339 else if (EQ (prop, Qeol_type)) 1947 else if (EQ (prop, Qeol_type))
1340 return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE (coding_system)); 1948 return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE
1949 (coding_system));
1341 else if (EQ (prop, Qeol_lf)) 1950 else if (EQ (prop, Qeol_lf))
1342 return XCODING_SYSTEM_EOL_LF (coding_system); 1951 return XCODING_SYSTEM_EOL_LF (coding_system);
1343 else if (EQ (prop, Qeol_crlf)) 1952 else if (EQ (prop, Qeol_crlf))
1344 return XCODING_SYSTEM_EOL_CRLF (coding_system); 1953 return XCODING_SYSTEM_EOL_CRLF (coding_system);
1345 else if (EQ (prop, Qeol_cr)) 1954 else if (EQ (prop, Qeol_cr))
1346 return XCODING_SYSTEM_EOL_CR (coding_system); 1955 return XCODING_SYSTEM_EOL_CR (coding_system);
1347 else if (EQ (prop, Qpost_read_conversion)) 1956 else if (EQ (prop, Qpost_read_conversion))
1348 return XCODING_SYSTEM_POST_READ_CONVERSION (coding_system); 1957 return XCODING_SYSTEM_POST_READ_CONVERSION (coding_system);
1349 else if (EQ (prop, Qpre_write_conversion)) 1958 else if (EQ (prop, Qpre_write_conversion))
1350 return XCODING_SYSTEM_PRE_WRITE_CONVERSION (coding_system); 1959 return XCODING_SYSTEM_PRE_WRITE_CONVERSION (coding_system);
1351 #ifdef MULE
1352 else if (type == CODESYS_ISO2022)
1353 {
1354 if (EQ (prop, Qcharset_g0))
1355 return coding_system_charset (coding_system, 0);
1356 else if (EQ (prop, Qcharset_g1))
1357 return coding_system_charset (coding_system, 1);
1358 else if (EQ (prop, Qcharset_g2))
1359 return coding_system_charset (coding_system, 2);
1360 else if (EQ (prop, Qcharset_g3))
1361 return coding_system_charset (coding_system, 3);
1362
1363 #define FORCE_CHARSET(charset_num) \
1364 (XCODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT \
1365 (coding_system, charset_num) ? Qt : Qnil)
1366
1367 else if (EQ (prop, Qforce_g0_on_output)) return FORCE_CHARSET (0);
1368 else if (EQ (prop, Qforce_g1_on_output)) return FORCE_CHARSET (1);
1369 else if (EQ (prop, Qforce_g2_on_output)) return FORCE_CHARSET (2);
1370 else if (EQ (prop, Qforce_g3_on_output)) return FORCE_CHARSET (3);
1371
1372 #define LISP_BOOLEAN(prop) \
1373 (XCODING_SYSTEM_ISO2022_##prop (coding_system) ? Qt : Qnil)
1374
1375 else if (EQ (prop, Qshort)) return LISP_BOOLEAN (SHORT);
1376 else if (EQ (prop, Qno_ascii_eol)) return LISP_BOOLEAN (NO_ASCII_EOL);
1377 else if (EQ (prop, Qno_ascii_cntl)) return LISP_BOOLEAN (NO_ASCII_CNTL);
1378 else if (EQ (prop, Qseven)) return LISP_BOOLEAN (SEVEN);
1379 else if (EQ (prop, Qlock_shift)) return LISP_BOOLEAN (LOCK_SHIFT);
1380 else if (EQ (prop, Qno_iso6429)) return LISP_BOOLEAN (NO_ISO6429);
1381 else if (EQ (prop, Qescape_quoted)) return LISP_BOOLEAN (ESCAPE_QUOTED);
1382
1383 else if (EQ (prop, Qinput_charset_conversion))
1384 return
1385 unparse_charset_conversion_specs
1386 (XCODING_SYSTEM (coding_system)->iso2022.input_conv);
1387 else if (EQ (prop, Qoutput_charset_conversion))
1388 return
1389 unparse_charset_conversion_specs
1390 (XCODING_SYSTEM (coding_system)->iso2022.output_conv);
1391 else
1392 abort ();
1393 }
1394 else if (type == CODESYS_CCL)
1395 {
1396 if (EQ (prop, Qdecode))
1397 return XCODING_SYSTEM_CCL_DECODE (coding_system);
1398 else if (EQ (prop, Qencode))
1399 return XCODING_SYSTEM_CCL_ENCODE (coding_system);
1400 else
1401 abort ();
1402 }
1403 #endif /* MULE */
1404 else 1960 else
1405 abort (); 1961 {
1406 1962 Lisp_Object value = CODESYSMETH_OR_GIVEN (XCODING_SYSTEM (coding_system),
1407 return Qnil; /* not reached */ 1963 getprop,
1964 (coding_system, prop),
1965 Qunbound);
1966 if (UNBOUNDP (value))
1967 invalid_constant ("Unrecognized property", prop);
1968 return value;
1969 }
1408 } 1970 }
1409 1971
1410 1972
1411 /************************************************************************/ 1973 /************************************************************************/
1412 /* Coding category functions */ 1974 /* Coding stream functions */
1413 /************************************************************************/ 1975 /************************************************************************/
1414 1976
1977 /* A coding stream is a stream used for encoding or decoding text. The
1978 coding-stream object keeps track of the actual coding system, the stream
1979 that is at the other end, and data that needs to be persistent across
1980 the lifetime of the stream. */
1981
1982 DEFINE_LSTREAM_IMPLEMENTATION ("coding", coding);
1983
1984 /* Encoding and decoding are parallel operations, so we create just one
1985 stream for both. "Decoding" may involve the extra step of autodetection
1986 of the data format, but that's only because of the conventional
1987 definition of decoding as converting from external- to
1988 internal-formatted data.
1989
1990 #### We really need to abstract out the concept of "data formats" and
1991 define "converters" that convert from and to specified formats,
1992 eliminating the idea of decoding and encoding. When specifying a
1993 conversion process, we need to give the data formats themselves, not the
1994 conversion processes -- e.g. a coding system called "Unicode->multibyte"
1995 converts in both directions, and we could auto-detect the format of data
1996 at either end. */
1997
1998 static Bytecount
1999 coding_reader (Lstream *stream, unsigned char *data, Bytecount size)
2000 {
2001 unsigned char *orig_data = data;
2002 Bytecount read_size;
2003 int error_occurred = 0;
2004 struct coding_stream *str = CODING_STREAM_DATA (stream);
2005
2006 /* We need to interface to coding_{de,en}code_1(), which expects to take
2007 some amount of data and store the result into a Dynarr. We have
2008 coding_{de,en}code_1() store into c->runoff, and take data from there
2009 as necessary. */
2010
2011 /* We loop until we have enough data, reading chunks from the other
2012 end and converting it. */
2013 while (1)
2014 {
2015 /* Take data from convert_to if we can. Make sure to take at
2016 most SIZE bytes, and delete the data from convert_to. */
2017 if (Dynarr_length (str->convert_to) > 0)
2018 {
2019 Bytecount chunk =
2020 min (size, (Bytecount) Dynarr_length (str->convert_to));
2021 memcpy (data, Dynarr_atp (str->convert_to, 0), chunk);
2022 Dynarr_delete_many (str->convert_to, 0, chunk);
2023 data += chunk;
2024 size -= chunk;
2025 }
2026
2027 if (size == 0)
2028 break; /* No more room for data */
2029
2030 if (str->eof)
2031 break;
2032
2033 {
2034 /* Exhausted convert_to, so get some more. Read into convert_from,
2035 after existing "rejected" data from the last conversion. */
2036 Bytecount rejected = Dynarr_length (str->convert_from);
2037 /* #### 1024 is arbitrary; we really need to separate 0 from EOF,
2038 and when we get 0, keep taking more data until we don't get 0 --
2039 we don't know how much data the conversion routine might need
2040 before it can generate any data of its own */
2041 Bytecount readmore = max (size, 1024);
2042
2043 Dynarr_add_many (str->convert_from, 0, readmore);
2044 read_size = Lstream_read (str->other_end,
2045 Dynarr_atp (str->convert_from, rejected),
2046 readmore);
2047 /* Trim size down to how much we actually got */
2048 Dynarr_set_size (str->convert_from, rejected + max (0, read_size));
2049 }
2050
2051 if (read_size < 0) /* LSTREAM_ERROR */
2052 {
2053 error_occurred = 1;
2054 break;
2055 }
2056 if (read_size == 0) /* LSTREAM_EOF */
2057 /* There might be some more end data produced in the translation,
2058 so we set a flag and call the conversion method once more to
2059 output any final stuff it may be holding, any "go back to a sane
2060 state" escape sequences, etc. The conversion method is free to
2061 look at this flag, and we use it above to stop looping. */
2062 str->eof = 1;
2063 {
2064 Bytecount processed;
2065 Bytecount to_process = Dynarr_length (str->convert_from);
2066
2067 /* Convert the data, and save any rejected data in convert_from */
2068 processed =
2069 XCODESYSMETH (str->codesys, convert,
2070 (str, Dynarr_atp (str->convert_from, 0),
2071 str->convert_to, to_process));
2072 if (processed < 0)
2073 {
2074 error_occurred = 1;
2075 break;
2076 }
2077 assert (processed <= to_process);
2078 if (processed < to_process)
2079 memmove (Dynarr_atp (str->convert_from, 0),
2080 Dynarr_atp (str->convert_from, processed),
2081 to_process - processed);
2082 Dynarr_set_size (str->convert_from, to_process - processed);
2083 }
2084 }
2085
2086 if (data - orig_data == 0)
2087 return error_occurred ? -1 : 0;
2088 else
2089 return data - orig_data;
2090 }
2091
2092 static Bytecount
2093 coding_writer (Lstream *stream, const unsigned char *data, Bytecount size)
2094 {
2095 struct coding_stream *str = CODING_STREAM_DATA (stream);
2096
2097 /* Convert all our data into convert_to, and then attempt to write
2098 it all out to the other end. */
2099 Dynarr_reset (str->convert_to);
2100 size = XCODESYSMETH (str->codesys, convert,
2101 (str, data, str->convert_to, size));
2102 if (Lstream_write (str->other_end, Dynarr_atp (str->convert_to, 0),
2103 Dynarr_length (str->convert_to)) < 0)
2104 return -1;
2105 else
2106 /* The return value indicates how much of the incoming data was
2107 processed, not how many bytes were written. */
2108 return size;
2109 }
2110
1415 static int 2111 static int
1416 decode_coding_category (Lisp_Object symbol) 2112 encode_decode_source_sink_type_is_char (Lisp_Object cs,
1417 { 2113 enum source_or_sink sex,
2114 enum encode_decode direction)
2115 {
2116 return (direction == CODING_DECODE ?
2117 decoding_source_sink_type_is_char (cs, sex) :
2118 encoding_source_sink_type_is_char (cs, sex));
2119 }
2120
2121 /* Ensure that the convert methods only get full characters sent to them to
2122 convert if the source of that conversion is characters; and that no such
2123 full-character checking happens when the source is bytes. Keep in mind
2124 that (1) the conversion_end_type return values take the perspective of
2125 encoding; (2) the source for decoding is the same as the sink for
2126 encoding; (3) when writing, the data is given to us, and we set our own
2127 stream to be character mode or not; (4) when reading, the data comes
2128 from the other_end stream, and we set that one to be character mode or
2129 not. This is consistent with the comment above the prototype for
2130 Lstream_set_character_mode(), which lays out rules for who is allowed to
2131 modify the character type mode on a stream.
2132
2133 NOTE: We could potentially implement the full-character checking stuff
2134 ourselves, which might be a bit safer in case people mess up the
2135 character mode themselves. But people shouldn't be doing that -- don't
2136 hide bugs -- and there's no sense duplicating code. */
2137
2138 static void
2139 set_coding_character_mode (Lstream *stream)
2140 {
2141 struct coding_stream *str = CODING_STREAM_DATA (stream);
2142 Lstream *stream_to_set =
2143 stream->flags & LSTREAM_FL_WRITE ? stream : str->other_end;
2144 if (encode_decode_source_sink_type_is_char
2145 (str->codesys, CODING_SOURCE, str->direction))
2146 Lstream_set_character_mode (stream_to_set);
2147 else
2148 Lstream_unset_character_mode (stream_to_set);
2149 }
2150
2151 static Lisp_Object
2152 coding_marker (Lisp_Object stream)
2153 {
2154 struct coding_stream *str = CODING_STREAM_DATA (XLSTREAM (stream));
2155
2156 mark_object (str->orig_codesys);
2157 mark_object (str->codesys);
2158 MAYBE_XCODESYSMETH (str->codesys, mark_coding_stream, (str));
2159 return wrap_lstream (str->other_end);
2160 }
2161
2162 static int
2163 coding_rewinder (Lstream *stream)
2164 {
2165 struct coding_stream *str = CODING_STREAM_DATA (stream);
2166 MAYBE_XCODESYSMETH (str->codesys, rewind_coding_stream, (str));
2167
2168 str->ch = 0;
2169 Dynarr_reset (str->convert_to);
2170 Dynarr_reset (str->convert_from);
2171 return Lstream_rewind (str->other_end);
2172 }
2173
2174 static int
2175 coding_seekable_p (Lstream *stream)
2176 {
2177 struct coding_stream *str = CODING_STREAM_DATA (stream);
2178 return Lstream_seekable_p (str->other_end);
2179 }
2180
2181 static int
2182 coding_flusher (Lstream *stream)
2183 {
2184 struct coding_stream *str = CODING_STREAM_DATA (stream);
2185 return Lstream_flush (str->other_end);
2186 }
2187
2188 static int
2189 coding_closer (Lstream *stream)
2190 {
2191 struct coding_stream *str = CODING_STREAM_DATA (stream);
2192 if (stream->flags & LSTREAM_FL_WRITE)
2193 {
2194 str->eof = 1;
2195 coding_writer (stream, 0, 0);
2196 str->eof = 0;
2197 }
2198 /* It's safe to free the runoff dynarrs now because they are used only
2199 during conversion. We need to keep the type-specific data around,
2200 though, because of canonicalize_after_coding. */
2201 if (str->convert_to)
2202 {
2203 Dynarr_free (str->convert_to);
2204 str->convert_to = 0;
2205 }
2206 if (str->convert_from)
2207 {
2208 Dynarr_free (str->convert_from);
2209 str->convert_from = 0;
2210 }
2211
2212 return Lstream_close (str->other_end);
2213 }
2214
2215 static void
2216 coding_finalizer (Lstream *stream)
2217 {
2218 struct coding_stream *str = CODING_STREAM_DATA (stream);
2219
2220 assert (!str->finalized);
2221 MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str));
2222 if (str->data)
2223 {
2224 xfree (str->data);
2225 str->data = 0;
2226 }
2227 str->finalized = 1;
2228 }
2229
2230 static Lisp_Object
2231 coding_stream_canonicalize_after_coding (Lstream *stream)
2232 {
2233 struct coding_stream *str = CODING_STREAM_DATA (stream);
2234
2235 return XCODESYSMETH_OR_GIVEN (str->codesys, canonicalize_after_coding,
2236 (str), str->codesys);
2237 }
2238
2239 Lisp_Object
2240 coding_stream_detected_coding_system (Lstream *stream)
2241 {
2242 Lisp_Object codesys =
2243 coding_stream_canonicalize_after_coding (stream);
2244 if (NILP (codesys))
2245 return Fget_coding_system (Qidentity);
2246 return codesys;
2247 }
2248
2249 Lisp_Object
2250 coding_stream_coding_system (Lstream *stream)
2251 {
2252 return CODING_STREAM_DATA (stream)->codesys;
2253 }
2254
2255 /* Change the coding system associated with a stream. */
2256
2257 void
2258 set_coding_stream_coding_system (Lstream *lstr, Lisp_Object codesys)
2259 {
2260 struct coding_stream *str = CODING_STREAM_DATA (lstr);
2261 if (EQ (str->orig_codesys, codesys))
2262 return;
2263 /* We do the equivalent of closing the stream, destroying it, and
2264 reinitializing it. This includes flushing out the data and signalling
2265 EOF, if we're a writing stream; we also replace the type-specific data
2266 with the data appropriate for the new coding system. */
2267 if (!NILP (str->codesys))
2268 {
2269 if (lstr->flags & LSTREAM_FL_WRITE)
2270 {
2271 Lstream_flush (lstr);
2272 str->eof = 1;
2273 coding_writer (lstr, 0, 0);
2274 str->eof = 0;
2275 }
2276 MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str));
2277 }
2278 str->orig_codesys = codesys;
2279 str->codesys = coding_system_real_canonical (codesys);
2280
2281 if (str->data)
2282 {
2283 xfree (str->data);
2284 str->data = 0;
2285 }
2286 if (XCODING_SYSTEM_METHODS (str->codesys)->coding_data_size)
2287 str->data =
2288 xmalloc_and_zero (XCODING_SYSTEM_METHODS (str->codesys)->
2289 coding_data_size);
2290 MAYBE_XCODESYSMETH (str->codesys, init_coding_stream, (str));
2291 /* The new coding system may have different ideas regarding whether its
2292 ends are characters or bytes. */
2293 set_coding_character_mode (lstr);
2294 }
2295
2296 /* WARNING WARNING WARNING WARNING!!!!! If you open up a coding
2297 stream for writing, no automatic code detection will be performed.
2298 The reason for this is that automatic code detection requires a
2299 seekable input. Things will also fail if you open a coding
2300 stream for reading using a non-fully-specified coding system and
2301 a non-seekable input stream. */
2302
2303 static Lisp_Object
2304 make_coding_stream_1 (Lstream *stream, Lisp_Object codesys,
2305 const char *mode, enum encode_decode direction)
2306 {
2307 Lstream *lstr = Lstream_new (lstream_coding, mode);
2308 struct coding_stream *str = CODING_STREAM_DATA (lstr);
2309 Lisp_Object obj;
2310
2311 codesys = Fget_coding_system (codesys);
2312 xzero (*str);
2313 str->codesys = Qnil;
2314 str->orig_codesys = Qnil;
2315 str->us = lstr;
2316 str->other_end = stream;
2317 str->convert_to = Dynarr_new (unsigned_char);
2318 str->convert_from = Dynarr_new (unsigned_char);
2319 str->direction = direction;
2320 set_coding_stream_coding_system (lstr, codesys);
2321 XSETLSTREAM (obj, lstr);
2322 return obj;
2323 }
2324
2325 Lisp_Object
2326 make_coding_input_stream (Lstream *stream, Lisp_Object codesys,
2327 enum encode_decode direction)
2328 {
2329 return make_coding_stream_1 (stream, codesys, "r", direction);
2330 }
2331
2332 Lisp_Object
2333 make_coding_output_stream (Lstream *stream, Lisp_Object codesys,
2334 enum encode_decode direction)
2335 {
2336 return make_coding_stream_1 (stream, codesys, "w", direction);
2337 }
2338
2339 static Lisp_Object
2340 encode_decode_coding_region (Lisp_Object start, Lisp_Object end,
2341 Lisp_Object coding_system, Lisp_Object buffer,
2342 enum encode_decode direction)
2343 {
2344 Charbpos b, e;
2345 struct buffer *buf = decode_buffer (buffer, 0);
2346 Lisp_Object instream = Qnil, to_outstream = Qnil, outstream = Qnil;
2347 Lisp_Object from_outstream = Qnil, auto_outstream = Qnil;
2348 Lisp_Object lb_outstream = Qnil;
2349 Lisp_Object next;
2350 Lstream *istr, *ostr;
2351 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4, gcpro5;
2352 struct gcpro ngcpro1;
2353 int source_char, sink_char;
2354
2355 get_buffer_range_char (buf, start, end, &b, &e, 0);
2356 barf_if_buffer_read_only (buf, b, e);
2357
2358 GCPRO5 (instream, to_outstream, outstream, from_outstream, lb_outstream);
2359 NGCPRO1 (auto_outstream);
2360
2361 coding_system = Fget_coding_system (coding_system);
2362 source_char = encode_decode_source_sink_type_is_char (coding_system,
2363 CODING_SOURCE,
2364 direction);
2365 sink_char = encode_decode_source_sink_type_is_char (coding_system,
2366 CODING_SINK,
2367 direction);
2368
2369 /* Order is IN <---> [TO] -> OUT -> [FROM] -> [AUTODETECT-EOL] -> LB */
2370 instream = make_lisp_buffer_input_stream (buf, b, e, 0);
2371 next = lb_outstream = make_lisp_buffer_output_stream (buf, b, 0);
2372
2373 if (direction == CODING_DECODE &&
2374 XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT)
2375 next = auto_outstream =
2376 make_coding_output_stream
2377 (XLSTREAM (next), Fget_coding_system (Qconvert_eol_autodetect), CODING_DECODE);
2378
2379 if (!sink_char)
2380 next = from_outstream =
2381 make_coding_output_stream (XLSTREAM (next), Qbinary, CODING_DECODE);
2382 outstream = make_coding_output_stream (XLSTREAM (next), coding_system,
2383 direction);
2384 if (!source_char)
2385 {
2386 to_outstream =
2387 make_coding_output_stream (XLSTREAM (outstream),
2388 Qbinary, CODING_ENCODE);
2389 ostr = XLSTREAM (to_outstream);
2390 }
2391 else
2392 ostr = XLSTREAM (outstream);
2393 istr = XLSTREAM (instream);
2394
2395 /* The chain of streams looks like this:
2396
2397 [BUFFER] <----- send through
2398 ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is
2399 in bytes]
2400 ------> [ENCODE/DECODE AS SPECIFIED]
2401 ------> [BYTE->CHAR i.e. DECODE AS BINARY
2402 if sink is in bytes]
2403 ------> [AUTODETECT EOL if
2404 we're decoding and
2405 coding system calls
2406 for this]
2407 ------> [BUFFER]
2408 */
2409 while (1)
2410 {
2411 char tempbuf[1024]; /* some random amount */
2412 Charbpos newpos, even_newer_pos;
2413 Charbpos oldpos = lisp_buffer_stream_startpos (istr);
2414 Bytecount size_in_bytes =
2415 Lstream_read (istr, tempbuf, sizeof (tempbuf));
2416
2417 if (!size_in_bytes)
2418 break;
2419 newpos = lisp_buffer_stream_startpos (istr);
2420 Lstream_write (ostr, tempbuf, size_in_bytes);
2421 even_newer_pos = lisp_buffer_stream_startpos (istr);
2422 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos),
2423 even_newer_pos, 0);
2424 }
2425
2426 {
2427 Charcount retlen =
2428 lisp_buffer_stream_startpos (XLSTREAM (instream)) - b;
2429 Lstream_close (istr);
2430 Lstream_close (ostr);
2431 NUNGCPRO;
2432 UNGCPRO;
2433 Lstream_delete (istr);
2434 if (!NILP (from_outstream))
2435 Lstream_delete (XLSTREAM (from_outstream));
2436 Lstream_delete (XLSTREAM (outstream));
2437 if (!NILP (to_outstream))
2438 Lstream_delete (XLSTREAM (to_outstream));
2439 if (!NILP (auto_outstream))
2440 Lstream_delete (XLSTREAM (auto_outstream));
2441 Lstream_delete (XLSTREAM (lb_outstream));
2442 return make_int (retlen);
2443 }
2444 }
2445
2446 DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /*
2447 Decode the text between START and END which is encoded in CODING-SYSTEM.
2448 This is useful if you've read in encoded text from a file without decoding
2449 it (e.g. you read in a JIS-formatted file but used the `binary' or
2450 `no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B").
2451 Return length of decoded text.
2452 BUFFER defaults to the current buffer if unspecified.
2453 */
2454 (start, end, coding_system, buffer))
2455 {
2456 return encode_decode_coding_region (start, end, coding_system, buffer,
2457 CODING_DECODE);
2458 }
2459
2460 DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /*
2461 Encode the text between START and END using CODING-SYSTEM.
2462 This will, for example, convert Japanese characters into stuff such as
2463 "^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded
2464 text. BUFFER defaults to the current buffer if unspecified.
2465 */
2466 (start, end, coding_system, buffer))
2467 {
2468 return encode_decode_coding_region (start, end, coding_system, buffer,
2469 CODING_ENCODE);
2470 }
2471
2472
2473 /************************************************************************/
2474 /* Chain methods */
2475 /************************************************************************/
2476
2477 /* #### Need a way to create "opposite-direction" coding systems. */
2478
2479 /* Chain two or more coding systems together to make a combination coding
2480 system. */
2481 DEFINE_CODING_SYSTEM_TYPE (chain);
2482
2483 struct chain_coding_system
2484 {
2485 /* List of coding systems, in decode order */
2486 Lisp_Object *chain;
2487 /* Number of coding systems in list */
2488 int count;
2489 /* Coding system to return as a result of canonicalize-after-coding */
2490 Lisp_Object canonicalize_after_coding;
2491 };
2492
2493 struct chain_coding_stream
2494 {
2495 int initted;
2496 /* Lstreams for chain coding system */
2497 Lisp_Object *lstreams;
2498 int lstream_count;
2499 };
2500
2501 static const struct lrecord_description lo_description_1[] = {
2502 { XD_LISP_OBJECT, 0 },
2503 { XD_END }
2504 };
2505
2506 static const struct struct_description lo_description = {
2507 sizeof (Lisp_Object),
2508 lo_description_1
2509 };
2510
2511 static const struct lrecord_description chain_coding_system_description[] = {
2512 { XD_INT,
2513 coding_system_data_offset + offsetof (struct chain_coding_system,
2514 count) },
2515 { XD_STRUCT_PTR,
2516 coding_system_data_offset + offsetof (struct chain_coding_system,
2517 chain),
2518 XD_INDIRECT (0, 0), &lo_description },
2519 { XD_LISP_OBJECT,
2520 coding_system_data_offset + offsetof (struct chain_coding_system,
2521 canonicalize_after_coding) },
2522 { XD_END }
2523 };
2524
2525 static Lisp_Object
2526 chain_canonicalize (Lisp_Object codesys)
2527 {
2528 /* We make use of the fact that this method is called at init time, after
2529 properties have been parsed. init_method is called too early. */
2530 /* #### It's not clear we need this whole chain-canonicalize mechanism
2531 any more. */
2532 Lisp_Object chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (codesys),
2533 XCODING_SYSTEM_CHAIN_CHAIN (codesys));
2534 chain = Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (codesys),
2535 Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (codesys),
2536 chain));
2537 Fputhash (chain, codesys, Vchain_canonicalize_hash_table);
2538 return codesys;
2539 }
2540
2541 static Lisp_Object
2542 chain_canonicalize_after_coding (struct coding_stream *str)
2543 {
2544 Lisp_Object cac =
2545 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (str->codesys);
2546 if (!NILP (cac))
2547 return cac;
2548 return str->codesys;
2549 #if 0
2550 struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain);
2551 Lisp_Object us = str->codesys, codesys;
1418 int i; 2552 int i;
2553 Lisp_Object chain;
2554 Lisp_Object tail;
2555 int changed = 0;
2556
2557 /* #### It's not clear we need this whole chain-canonicalize mechanism
2558 any more. */
2559 if (str->direction == CODING_ENCODE || !data->initted)
2560 return us;
2561
2562 chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (us),
2563 XCODING_SYSTEM_CHAIN_CHAIN (us));
2564
2565 tail = chain;
2566 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (us); i++)
2567 {
2568 codesys = (coding_stream_canonicalize_after_coding
2569 (XLSTREAM (data->lstreams[i])));
2570 if (!EQ (codesys, XCAR (tail)))
2571 changed = 1;
2572 XCAR (tail) = codesys;
2573 tail = XCDR (tail);
2574 }
2575
2576 if (!changed)
2577 return us;
2578
2579 chain = delq_no_quit (Qnil, chain);
2580
2581 if (NILP (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us)) &&
2582 NILP (XCODING_SYSTEM_POST_READ_CONVERSION (us)))
2583 {
2584 if (NILP (chain))
2585 return Qnil;
2586 if (NILP (XCDR (chain)))
2587 return XCAR (chain);
2588 }
2589
2590 codesys = Fgethash (Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us),
2591 Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (us),
2592 chain)), Vchain_canonicalize_hash_table,
2593 Qnil);
2594 if (!NILP (codesys))
2595 return codesys;
2596 return make_internal_coding_system
2597 (us, "internal-chain-canonicalizer-wrapper",
2598 Qchain, Qunbound, list2 (Qchain, chain));
2599 #endif /* 0 */
2600 }
2601
2602 static void
2603 chain_init (Lisp_Object codesys)
2604 {
2605 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) = Qnil;
2606 }
2607
2608 static void
2609 chain_mark (Lisp_Object codesys)
2610 {
2611 int i;
2612
2613 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (codesys); i++)
2614 mark_object (XCODING_SYSTEM_CHAIN_CHAIN (codesys)[i]);
2615 mark_object (XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys));
2616 }
2617
2618 static void
2619 chain_mark_coding_stream_1 (struct chain_coding_stream *data)
2620 {
2621 int i;
2622
2623 for (i = 0; i < data->lstream_count; i++)
2624 mark_object (data->lstreams[i]);
2625 }
2626
2627 static void
2628 chain_mark_coding_stream (struct coding_stream *str)
2629 {
2630 chain_mark_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain));
2631 }
2632
2633 static void
2634 chain_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag)
2635 {
2636 int i;
2637
2638 write_c_string ("(", printcharfun);
2639 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (cs); i++)
2640 {
2641 write_c_string (i == 0 ? "" : "->", printcharfun);
2642 print_coding_system_in_print_method (XCODING_SYSTEM_CHAIN_CHAIN (cs)[i],
2643 printcharfun, escapeflag);
2644 }
2645 {
2646 Lisp_Object cac = XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (cs);
2647 if (!NILP (cac))
2648 {
2649 if (i > 0)
2650 write_c_string (" ", printcharfun);
2651 write_c_string ("canonicalize-after-coding=", printcharfun);
2652 print_coding_system_in_print_method (cac, printcharfun, escapeflag);
2653 }
2654 }
2655
2656 write_c_string (")", printcharfun);
2657 }
2658
2659 static void
2660 chain_rewind_coding_stream_1 (struct chain_coding_stream *data)
2661 {
2662 /* Each will rewind the next; there is always at least one stream (the
2663 dynarr stream at the end) if we're initted */
2664 if (data->initted)
2665 Lstream_rewind (XLSTREAM (data->lstreams[0]));
2666 }
2667
2668 static void
2669 chain_rewind_coding_stream (struct coding_stream *str)
2670 {
2671 chain_rewind_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain));
2672 }
2673
2674 static void
2675 chain_init_coding_streams_1 (struct chain_coding_stream *data,
2676 unsigned_char_dynarr *dst,
2677 int ncodesys, Lisp_Object *codesys,
2678 enum encode_decode direction)
2679 {
2680 int i;
2681 Lisp_Object lstream_out;
2682
2683 data->lstream_count = ncodesys + 1;
2684 data->lstreams = xnew_array (Lisp_Object, data->lstream_count);
2685
2686 lstream_out = make_dynarr_output_stream (dst);
2687 Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED, 0);
2688 data->lstreams[data->lstream_count - 1] = lstream_out;
2689
2690 for (i = ncodesys - 1; i >= 0; i--)
2691 {
2692 data->lstreams[i] =
2693 make_coding_output_stream
2694 (XLSTREAM (lstream_out),
2695 codesys[direction == CODING_ENCODE ? ncodesys - (i + 1) : i],
2696 direction);
2697 lstream_out = data->lstreams[i];
2698 Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED,
2699 0);
2700 }
2701 data->initted = 1;
2702 }
2703
2704 static Bytecount
2705 chain_convert (struct coding_stream *str, const UExtbyte *src,
2706 unsigned_char_dynarr *dst, Bytecount n)
2707 {
2708 struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain);
2709
2710 if (str->eof)
2711 {
2712 /* Each will close the next; there is always at least one stream (the
2713 dynarr stream at the end) if we're initted. We need to close now
2714 because more data may be generated. */
2715 if (data->initted)
2716 Lstream_close (XLSTREAM (data->lstreams[0]));
2717 return n;
2718 }
2719
2720 if (!data->initted)
2721 chain_init_coding_streams_1
2722 (data, dst, XCODING_SYSTEM_CHAIN_COUNT (str->codesys),
2723 XCODING_SYSTEM_CHAIN_CHAIN (str->codesys), str->direction);
2724
2725 if (Lstream_write (XLSTREAM (data->lstreams[0]), src, n) < 0)
2726 return -1;
2727 return n;
2728 }
2729
2730 static void
2731 chain_finalize_coding_stream_1 (struct chain_coding_stream *data)
2732 {
2733 if (data->lstreams)
2734 {
2735 /* Order of deletion is important here! Delete from the head of the
2736 chain and work your way towards the tail. In general, when you
2737 delete an object, there should be *NO* pointers to it anywhere.
2738 Deleting back-to-front would be a problem because there are
2739 pointers going forward. If there were pointers in both
2740 directions, you'd have to disconnect the pointers to a particular
2741 object before deleting it. */
2742 if (!gc_in_progress)
2743 {
2744 int i;
2745 /* During GC, these objects are unmarked, and are about to be
2746 freed. We do NOT want them on the free list, and that will
2747 cause lots of nastiness including crashes. Just let them be
2748 freed normally. */
2749 for (i = 0; i < data->lstream_count; i++)
2750 Lstream_delete (XLSTREAM ((data->lstreams)[i]));
2751 }
2752 xfree (data->lstreams);
2753 }
2754 }
2755
2756 static void
2757 chain_finalize_coding_stream (struct coding_stream *str)
2758 {
2759 chain_finalize_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain));
2760 }
2761
2762 static void
2763 chain_finalize (Lisp_Object c)
2764 {
2765 if (XCODING_SYSTEM_CHAIN_CHAIN (c))
2766 xfree (XCODING_SYSTEM_CHAIN_CHAIN (c));
2767 }
2768
2769 static int
2770 chain_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value)
2771 {
2772 if (EQ (key, Qchain))
2773 {
2774 Lisp_Object tail;
2775 Lisp_Object *cslist;
2776 int count = 0;
2777 int i;
2778
2779 EXTERNAL_LIST_LOOP (tail, value)
2780 {
2781 Fget_coding_system (XCAR (tail));
2782 count++;
2783 }
2784
2785 cslist = xnew_array (Lisp_Object, count);
2786 XCODING_SYSTEM_CHAIN_CHAIN (codesys) = cslist;
2787
2788 count = 0;
2789 EXTERNAL_LIST_LOOP (tail, value)
2790 {
2791 cslist[count] = Fget_coding_system (XCAR (tail));
2792 count++;
2793 }
2794
2795 XCODING_SYSTEM_CHAIN_COUNT (codesys) = count;
2796
2797 for (i = 0; i < count - 1; i++)
2798 {
2799 if (decoding_source_sink_type_is_char (cslist[i], CODING_SINK) !=
2800 decoding_source_sink_type_is_char (cslist[i + 1], CODING_SOURCE))
2801 invalid_argument_2 ("Sink of first must match source of second",
2802 cslist[i], cslist[i + 1]);
2803 }
2804 }
2805 else if (EQ (key, Qcanonicalize_after_coding))
2806 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) =
2807 Fget_coding_system (value);
2808 else
2809 return 0;
2810 return 1;
2811 }
2812
2813 static Lisp_Object
2814 chain_getprop (Lisp_Object coding_system, Lisp_Object prop)
2815 {
2816 if (EQ (prop, Qchain))
2817 {
2818 Lisp_Object result = Qnil;
2819 int i;
2820
2821 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (coding_system); i++)
2822 result = Fcons (XCODING_SYSTEM_CHAIN_CHAIN (coding_system)[i],
2823 result);
2824
2825 return Fnreverse (result);
2826 }
2827 else if (EQ (prop, Qcanonicalize_after_coding))
2828 return XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (coding_system);
2829 else
2830 return Qunbound;
2831 }
2832
2833 static enum source_sink_type
2834 chain_conversion_end_type (Lisp_Object codesys)
2835 {
2836 Lisp_Object *cslist = XCODING_SYSTEM_CHAIN_CHAIN (codesys);
2837 int n = XCODING_SYSTEM_CHAIN_COUNT (codesys);
2838 int charp_source, charp_sink;
2839
2840 if (n == 0)
2841 return DECODES_BYTE_TO_BYTE; /* arbitrary */
2842 charp_source = decoding_source_sink_type_is_char (cslist[0], CODING_SOURCE);
2843 charp_sink = decoding_source_sink_type_is_char (cslist[n - 1], CODING_SINK);
2844
2845 switch (charp_source * 2 + charp_sink)
2846 {
2847 case 0: return DECODES_BYTE_TO_BYTE;
2848 case 1: return DECODES_BYTE_TO_CHARACTER;
2849 case 2: return DECODES_CHARACTER_TO_BYTE;
2850 case 3: return DECODES_CHARACTER_TO_CHARACTER;
2851 }
2852
2853 abort ();
2854 return DECODES_BYTE_TO_BYTE;
2855 }
2856
2857
2858 /************************************************************************/
2859 /* No-conversion methods */
2860 /************************************************************************/
2861
2862 /* "No conversion"; used for binary files. We use quotes because there
2863 really is some conversion being applied (it does byte<->char
2864 conversion), but it appears to the user as if the text is read in
2865 without conversion. */
2866 DEFINE_CODING_SYSTEM_TYPE (no_conversion);
2867
2868 /* This is used when reading in "binary" files -- i.e. files that may
2869 contain all 256 possible byte values and that are not to be
2870 interpreted as being in any particular encoding. */
2871 static Bytecount
2872 no_conversion_convert (struct coding_stream *str,
2873 const UExtbyte *src,
2874 unsigned_char_dynarr *dst, Bytecount n)
2875 {
2876 UExtbyte c;
2877 unsigned int ch = str->ch;
2878 Bytecount orign = n;
2879
2880 if (str->direction == CODING_DECODE)
2881 {
2882 while (n--)
2883 {
2884 c = *src++;
2885
2886 DECODE_ADD_BINARY_CHAR (c, dst);
2887 }
2888
2889 if (str->eof)
2890 DECODE_OUTPUT_PARTIAL_CHAR (ch, dst);
2891 }
2892 else
2893 {
2894
2895 while (n--)
2896 {
2897 c = *src++;
2898 if (BYTE_ASCII_P (c))
2899 {
2900 assert (ch == 0);
2901 Dynarr_add (dst, c);
2902 }
2903 #ifdef MULE
2904 else if (INTBYTE_LEADING_BYTE_P (c))
2905 {
2906 assert (ch == 0);
2907 if (c == LEADING_BYTE_LATIN_ISO8859_1 ||
2908 c == LEADING_BYTE_CONTROL_1)
2909 ch = c;
2910 else
2911 Dynarr_add (dst, '~'); /* untranslatable character */
2912 }
2913 else
2914 {
2915 if (ch == LEADING_BYTE_LATIN_ISO8859_1)
2916 Dynarr_add (dst, c);
2917 else if (ch == LEADING_BYTE_CONTROL_1)
2918 {
2919 assert (c < 0xC0);
2920 Dynarr_add (dst, c - 0x20);
2921 }
2922 /* else it should be the second or third byte of an
2923 untranslatable character, so ignore it */
2924 ch = 0;
2925 }
2926 #endif /* MULE */
2927
2928 }
2929 }
2930
2931 str->ch = ch;
2932 return orign;
2933 }
2934
2935 DEFINE_DETECTOR (no_conversion);
2936 DEFINE_DETECTOR_CATEGORY (no_conversion, no_conversion);
2937
2938 struct no_conversion_detector
2939 {
2940 int dummy;
2941 };
2942
2943 static void
2944 no_conversion_detect (struct detection_state *st, const UExtbyte *src,
2945 Bytecount n)
2946 {
2947 /* Hack until we get better handling of this stuff! */
2948 DET_RESULT (st, no_conversion) = DET_SLIGHTLY_LIKELY;
2949 }
2950
2951
2952 /************************************************************************/
2953 /* Convert-eol methods */
2954 /************************************************************************/
2955
2956 /* This is used to handle end-of-line (EOL) differences. It is
2957 character-to-character, and works (when encoding) *BEFORE* sending
2958 data to the main encoding routine -- thus, that routine must handle
2959 different EOL types itself if it does line-oriented type processing.
2960 This is unavoidable because we don't know whether the output of the
2961 main encoding routine is ASCII compatible (Unicode is definitely not,
2962 for example).
2963
2964 There is one parameter: `subtype', either `cr', `lf', `crlf', or `autodetect'.
2965 */
2966
2967 DEFINE_CODING_SYSTEM_TYPE (convert_eol);
2968
2969 struct convert_eol_coding_system
2970 {
2971 enum eol_type subtype;
2972 };
2973
2974 #define CODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \
2975 (CODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype)
2976 #define XCODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \
2977 (XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype)
2978
2979 struct convert_eol_coding_stream
2980 {
2981 enum eol_type actual;
2982 };
2983
2984 static const struct lrecord_description
2985 convert_eol_coding_system_description[] = {
2986 { XD_END }
2987 };
2988
2989 static void
2990 convert_eol_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag)
2991 {
2992 struct convert_eol_coding_system *data =
2993 XCODING_SYSTEM_TYPE_DATA (cs, convert_eol);
2994
2995 write_fmt_string (printcharfun, "(%s)",
2996 data->subtype == EOL_LF ? "lf" :
2997 data->subtype == EOL_CRLF ? "crlf" :
2998 data->subtype == EOL_CR ? "cr" :
2999 data->subtype == EOL_AUTODETECT ? "autodetect" :
3000 (abort(), ""));
3001 }
3002
3003 static enum source_sink_type
3004 convert_eol_conversion_end_type (Lisp_Object codesys)
3005 {
3006 return DECODES_CHARACTER_TO_CHARACTER;
3007 }
3008
3009 static int
3010 convert_eol_putprop (Lisp_Object codesys,
3011 Lisp_Object key,
3012 Lisp_Object value)
3013 {
3014 struct convert_eol_coding_system *data =
3015 XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol);
3016
3017 if (EQ (key, Qsubtype))
3018 {
3019 if (EQ (value, Qlf) /* || EQ (value, Qunix) */)
3020 data->subtype = EOL_LF;
3021 else if (EQ (value, Qcrlf) /* || EQ (value, Qdos) */)
3022 data->subtype = EOL_CRLF;
3023 else if (EQ (value, Qcr) /* || EQ (value, Qmac) */)
3024 data->subtype = EOL_CR;
3025 else if (EQ (value, Qautodetect) /* || EQ (value, Qmac) */)
3026 data->subtype = EOL_AUTODETECT;
3027 else invalid_constant ("Unrecognized eol type", value);
3028 }
3029 else
3030 return 0;
3031 return 1;
3032 }
3033
3034 static Lisp_Object
3035 convert_eol_getprop (Lisp_Object coding_system, Lisp_Object prop)
3036 {
3037 struct convert_eol_coding_system *data =
3038 XCODING_SYSTEM_TYPE_DATA (coding_system, convert_eol);
3039
3040 if (EQ (prop, Qsubtype))
3041 {
3042 switch (data->subtype)
3043 {
3044 case EOL_LF: return Qlf;
3045 case EOL_CRLF: return Qcrlf;
3046 case EOL_CR: return Qcr;
3047 case EOL_AUTODETECT: return Qautodetect;
3048 default: abort ();
3049 }
3050 }
3051
3052 return Qunbound;
3053 }
3054
3055 static void
3056 convert_eol_init_coding_stream (struct coding_stream *str)
3057 {
3058 struct convert_eol_coding_stream *data =
3059 CODING_STREAM_TYPE_DATA (str, convert_eol);
3060 data->actual = XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys);
3061 }
3062
3063 static Bytecount
3064 convert_eol_convert (struct coding_stream *str, const Intbyte *src,
3065 unsigned_char_dynarr *dst, Bytecount n)
3066 {
3067 if (str->direction == CODING_DECODE)
3068 {
3069 struct convert_eol_coding_stream *data =
3070 CODING_STREAM_TYPE_DATA (str, convert_eol);
3071
3072 if (data->actual == EOL_AUTODETECT)
3073 {
3074 Bytecount n2 = n;
3075 const Intbyte *src2 = src;
3076
3077 for (; n2; n2--)
3078 {
3079 Intbyte c = *src2++;
3080 if (c == '\n')
3081 {
3082 data->actual = EOL_LF;
3083 break;
3084 }
3085 else if (c == '\r')
3086 {
3087 if (n2 == 1)
3088 {
3089 /* If we're seeing a '\r' at the end of the data, then
3090 reject the '\r' right now so it doesn't become an
3091 issue in the code below -- unless we're at the end of
3092 the stream, in which case we can't do that (because
3093 then the '\r' will never get written out), and in any
3094 case we should be recognizing it at EOL_CR format. */
3095 if (str->eof)
3096 data->actual = EOL_CR;
3097 else
3098 n--;
3099 break;
3100 }
3101 else if (*src2 == '\n')
3102 data->actual = EOL_CRLF;
3103 else
3104 data->actual = EOL_CR;
3105 break;
3106 }
3107 }
3108 }
3109
3110 /* str->eof is set, the caller reached EOF on the other end and has
3111 no new data to give us. The only data we get is the data we
3112 rejected from last time. */
3113 if (data->actual == EOL_LF || data->actual == EOL_AUTODETECT ||
3114 (str->eof))
3115 Dynarr_add_many (dst, src, n);
3116 else
3117 {
3118 const Intbyte *end = src + n;
3119 while (1)
3120 {
3121 /* Find the next section with no \r and add it. */
3122 const Intbyte *runstart = src;
3123 src = (Intbyte *) memchr (src, '\r', end - src);
3124 if (!src)
3125 src = end;
3126 Dynarr_add_many (dst, runstart, src - runstart);
3127 /* Stop if at end ... */
3128 if (src == end)
3129 break;
3130 /* ... else, translate as necessary. */
3131 src++;
3132 if (data->actual == EOL_CR)
3133 Dynarr_add (dst, '\n');
3134 /* We need to be careful here with CRLF. If we see a CR at the
3135 end of the data, we don't know if it's part of a CRLF, so we
3136 reject it. Otherwise: If it's part of a CRLF, eat it and
3137 loop; the following LF gets added next time around. If it's
3138 not part of a CRLF, add the CR and loop. The following
3139 character will be processed in the next loop iteration. This
3140 correctly handles a sequence like CR+CR+LF. */
3141 else if (src == end)
3142 return n - 1; /* reject the CR at the end; we'll get it again
3143 next time the convert method is called */
3144 else if (*src != '\n')
3145 Dynarr_add (dst, '\r');
3146 }
3147 }
3148
3149 return n;
3150 }
3151 else
3152 {
3153 enum eol_type subtype =
3154 XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys);
3155 const Intbyte *end = src + n;
3156
3157 /* We try to be relatively efficient here. */
3158 if (subtype == EOL_LF)
3159 Dynarr_add_many (dst, src, n);
3160 else
3161 {
3162 while (1)
3163 {
3164 /* Find the next section with no \n and add it. */
3165 const Intbyte *runstart = src;
3166 src = (Intbyte *) memchr (src, '\n', end - src);
3167 if (!src)
3168 src = end;
3169 Dynarr_add_many (dst, runstart, src - runstart);
3170 /* Stop if at end ... */
3171 if (src == end)
3172 break;
3173 /* ... else, skip over \n and add its translation. */
3174 src++;
3175 Dynarr_add (dst, '\r');
3176 if (subtype == EOL_CRLF)
3177 Dynarr_add (dst, '\n');
3178 }
3179 }
3180
3181 return n;
3182 }
3183 }
3184
3185 static Lisp_Object
3186 convert_eol_canonicalize_after_coding (struct coding_stream *str)
3187 {
3188 struct convert_eol_coding_stream *data =
3189 CODING_STREAM_TYPE_DATA (str, convert_eol);
3190
3191 if (str->direction == CODING_ENCODE)
3192 return str->codesys;
3193
3194 switch (data->actual)
3195 {
3196 case EOL_LF: return Fget_coding_system (Qconvert_eol_lf);
3197 case EOL_CRLF: return Fget_coding_system (Qconvert_eol_crlf);
3198 case EOL_CR: return Fget_coding_system (Qconvert_eol_cr);
3199 case EOL_AUTODETECT: return str->codesys;
3200 default: abort (); return Qnil;
3201 }
3202 }
3203
3204
3205 /************************************************************************/
3206 /* Undecided methods */
3207 /************************************************************************/
3208
3209 /* Do autodetection. We can autodetect the EOL type only, the coding
3210 system only, or both. We only do autodetection when decoding; when
3211 encoding, we just pass the data through.
3212
3213 When doing just EOL detection, a coding system can be specified; if so,
3214 we will decode this data through the coding system before doing EOL
3215 detection. The reason for specifying this is so that
3216 canonicalize-after-coding works: We will canonicalize the specified
3217 coding system into the appropriate EOL type. When doing both coding and
3218 EOL detection, we do similar canonicalization, and also catch situations
3219 where the EOL type is overspecified, i.e. the detected coding system
3220 specifies an EOL type, and either switch to the equivalent
3221 non-EOL-processing coding system (if possible), or terminate EOL
3222 detection and use the specified EOL type. This prevents data from being
3223 EOL-processed twice.
3224 */
3225
3226 DEFINE_CODING_SYSTEM_TYPE (undecided);
3227
3228 struct undecided_coding_system
3229 {
3230 int do_eol, do_coding;
3231 Lisp_Object cs;
3232 };
3233
3234 struct undecided_coding_stream
3235 {
3236 Lisp_Object actual;
3237 /* Either 2 or 3 lstreams here; see undecided_convert */
3238 struct chain_coding_stream c;
3239
3240 struct detection_state *st;
3241 };
3242
3243 static const struct lrecord_description
3244 undecided_coding_system_description[] = {
3245 { XD_LISP_OBJECT,
3246 coding_system_data_offset + offsetof (struct undecided_coding_system,
3247 cs) },
3248 { XD_END }
3249 };
3250
3251 static void
3252 undecided_init (Lisp_Object codesys)
3253 {
3254 struct undecided_coding_system *data =
3255 XCODING_SYSTEM_TYPE_DATA (codesys, undecided);
3256
3257 data->cs = Qnil;
3258 }
3259
3260 static void
3261 undecided_mark (Lisp_Object codesys)
3262 {
3263 struct undecided_coding_system *data =
3264 XCODING_SYSTEM_TYPE_DATA (codesys, undecided);
3265
3266 mark_object (data->cs);
3267 }
3268
3269 static void
3270 undecided_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag)
3271 {
3272 struct undecided_coding_system *data =
3273 XCODING_SYSTEM_TYPE_DATA (cs, undecided);
3274 int need_space = 0;
3275
3276 write_c_string ("(", printcharfun);
3277 if (data->do_eol)
3278 {
3279 write_c_string ("do-eol", printcharfun);
3280 need_space = 1;
3281 }
3282 if (data->do_coding)
3283 {
3284 if (need_space)
3285 write_c_string (" ", printcharfun);
3286 write_c_string ("do-coding", printcharfun);
3287 need_space = 1;
3288 }
3289 if (!NILP (data->cs))
3290 {
3291 if (need_space)
3292 write_c_string (" ", printcharfun);
3293 write_c_string ("coding-system=", printcharfun);
3294 print_coding_system_in_print_method (data->cs, printcharfun, escapeflag);
3295 }
3296 write_c_string (")", printcharfun);
3297 }
3298
3299 static void
3300 undecided_mark_coding_stream (struct coding_stream *str)
3301 {
3302 chain_mark_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c);
3303 }
3304
3305 static int
3306 undecided_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value)
3307 {
3308 struct undecided_coding_system *data =
3309 XCODING_SYSTEM_TYPE_DATA (codesys, undecided);
3310
3311 if (EQ (key, Qdo_eol))
3312 data->do_eol = 1;
3313 else if (EQ (key, Qdo_coding))
3314 data->do_coding = 1;
3315 else if (EQ (key, Qcoding_system))
3316 data->cs = get_coding_system_for_text_file (value, 0);
3317 else
3318 return 0;
3319 return 1;
3320 }
3321
3322 static Lisp_Object
3323 undecided_getprop (Lisp_Object codesys, Lisp_Object prop)
3324 {
3325 struct undecided_coding_system *data =
3326 XCODING_SYSTEM_TYPE_DATA (codesys, undecided);
3327
3328 if (EQ (prop, Qdo_eol))
3329 return data->do_eol ? Qt : Qnil;
3330 if (EQ (prop, Qdo_coding))
3331 return data->do_coding ? Qt : Qnil;
3332 if (EQ (prop, Qcoding_system))
3333 return data->cs;
3334 return Qunbound;
3335 }
3336
3337 static struct detection_state *
3338 allocate_detection_state (void)
3339 {
3340 int i;
3341 Bytecount size = MAX_ALIGN_SIZE (sizeof (struct detection_state));
3342 struct detection_state *block;
3343
3344 for (i = 0; i < coding_detector_count; i++)
3345 size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size);
3346
3347 block = (struct detection_state *) xmalloc_and_zero (size);
3348
3349 size = MAX_ALIGN_SIZE (sizeof (struct detection_state));
3350 for (i = 0; i < coding_detector_count; i++)
3351 {
3352 block->data_offset[i] = size;
3353 size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size);
3354 }
3355
3356 return block;
3357 }
3358
3359 static void
3360 free_detection_state (struct detection_state *st)
3361 {
3362 int i;
3363
3364 for (i = 0; i < coding_detector_count; i++)
3365 {
3366 if (Dynarr_at (all_coding_detectors, i).finalize_detection_state_method)
3367 Dynarr_at (all_coding_detectors, i).finalize_detection_state_method
3368 (st);
3369 }
3370
3371 xfree (st);
3372 }
3373
3374 static int
3375 coding_category_symbol_to_id (Lisp_Object symbol)
3376 {
3377 int i;
1419 3378
1420 CHECK_SYMBOL (symbol); 3379 CHECK_SYMBOL (symbol);
1421 for (i = 0; i < CODING_CATEGORY_LAST; i++) 3380 for (i = 0; i < coding_detector_count; i++)
1422 if (EQ (coding_category_symbol[i], symbol)) 3381 {
1423 return i; 3382 detector_category_dynarr *cats =
1424 3383 Dynarr_at (all_coding_detectors, i).cats;
3384 int j;
3385
3386 for (j = 0; j < Dynarr_length (cats); j++)
3387 if (EQ (Dynarr_at (cats, j).sym, symbol))
3388 return Dynarr_at (cats, j).id;
3389 }
3390
1425 invalid_constant ("Unrecognized coding category", symbol); 3391 invalid_constant ("Unrecognized coding category", symbol);
1426 return 0; /* not reached */ 3392 return 0; /* not reached */
1427 } 3393 }
1428 3394
1429 DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /* 3395 static Lisp_Object
1430 Return a list of all recognized coding categories. 3396 coding_category_id_to_symbol (int id)
1431 */
1432 ())
1433 { 3397 {
1434 int i; 3398 int i;
1435 Lisp_Object list = Qnil; 3399
1436 3400 for (i = 0; i < coding_detector_count; i++)
1437 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) 3401 {
1438 list = Fcons (coding_category_symbol[i], list); 3402 detector_category_dynarr *cats =
1439 return list; 3403 Dynarr_at (all_coding_detectors, i).cats;
1440 } 3404 int j;
1441 3405
1442 DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /* 3406 for (j = 0; j < Dynarr_length (cats); j++)
1443 Change the priority order of the coding categories. 3407 if (id == Dynarr_at (cats, j).id)
1444 LIST should be list of coding categories, in descending order of 3408 return Dynarr_at (cats, j).sym;
1445 priority. Unspecified coding categories will be lower in priority 3409 }
1446 than all specified ones, in the same relative order they were in 3410
1447 previously. 3411 abort ();
1448 */ 3412 return Qnil; /* (usually) not reached */
1449 (list)) 3413 }
1450 { 3414
1451 int category_to_priority[CODING_CATEGORY_LAST]; 3415 static Lisp_Object
1452 int i, j; 3416 detection_result_number_to_symbol (enum detection_result result)
1453 Lisp_Object rest; 3417 {
1454 3418 #define FROB(sym, num) if (result == num) return (sym)
1455 /* First generate a list that maps coding categories to priorities. */ 3419 FROB (Qnear_certainty, DET_NEAR_CERTAINTY);
1456 3420 FROB (Qquite_probable, DET_QUITE_PROBABLE);
1457 for (i = 0; i < CODING_CATEGORY_LAST; i++) 3421 FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY);
1458 category_to_priority[i] = -1; 3422 FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY);
1459 3423 FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY);
1460 /* Highest priority comes from the specified list. */ 3424 FROB (Qquite_improbable, DET_QUITE_IMPROBABLE);
1461 i = 0; 3425 FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE);
1462 EXTERNAL_LIST_LOOP (rest, list) 3426 #undef FROB
1463 { 3427
1464 int cat = decode_coding_category (XCAR (rest)); 3428 abort ();
1465 3429 return Qnil; /* (usually) not reached */
1466 if (category_to_priority[cat] >= 0) 3430 }
1467 sferror ("Duplicate coding category in list", XCAR (rest)); 3431
1468 category_to_priority[cat] = i++; 3432 static enum detection_result
1469 } 3433 detection_result_symbol_to_number (Lisp_Object symbol)
1470 3434 {
1471 /* Now go through the existing categories by priority to retrieve 3435 #define FROB(sym, num) if (EQ (symbol, sym)) return (num)
1472 the categories not yet specified and preserve their priority 3436 FROB (Qnear_certainty, DET_NEAR_CERTAINTY);
1473 order. */ 3437 FROB (Qquite_probable, DET_QUITE_PROBABLE);
1474 for (j = 0; j < CODING_CATEGORY_LAST; j++) 3438 FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY);
1475 { 3439 FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY);
1476 int cat = fcd->coding_category_by_priority[j]; 3440 FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY);
1477 if (category_to_priority[cat] < 0) 3441 FROB (Qquite_improbable, DET_QUITE_IMPROBABLE);
1478 category_to_priority[cat] = i++; 3442 FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE);
1479 } 3443 #undef FROB
1480 3444
1481 /* Now we need to construct the inverse of the mapping we just 3445 invalid_constant ("Unrecognized detection result", symbol);
1482 constructed. */ 3446 return ((enum detection_result) 0); /* not reached */
1483 3447 }
1484 for (i = 0; i < CODING_CATEGORY_LAST; i++) 3448
1485 fcd->coding_category_by_priority[category_to_priority[i]] = i; 3449 /* Set all detection results for a given detector to a specified value. */
1486 3450 void
1487 /* Phew! That was confusing. */ 3451 set_detection_results (struct detection_state *st, int detector, int given)
1488 return Qnil; 3452 {
1489 } 3453 detector_category_dynarr *cats =
1490 3454 Dynarr_at (all_coding_detectors, detector).cats;
1491 DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /*
1492 Return a list of coding categories in descending order of priority.
1493 */
1494 ())
1495 {
1496 int i; 3455 int i;
1497 Lisp_Object list = Qnil; 3456
1498 3457 for (i = 0; i < Dynarr_length (cats); i++)
1499 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) 3458 st->categories[Dynarr_at (cats, i).id] = given;
1500 list = Fcons (coding_category_symbol[fcd->coding_category_by_priority[i]], 3459 }
1501 list);
1502 return list;
1503 }
1504
1505 DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /*
1506 Change the coding system associated with a coding category.
1507 */
1508 (coding_category, coding_system))
1509 {
1510 int cat = decode_coding_category (coding_category);
1511
1512 coding_system = Fget_coding_system (coding_system);
1513 fcd->coding_category_system[cat] = coding_system;
1514 return Qnil;
1515 }
1516
1517 DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /*
1518 Return the coding system associated with a coding category.
1519 */
1520 (coding_category))
1521 {
1522 int cat = decode_coding_category (coding_category);
1523 Lisp_Object sys = fcd->coding_category_system[cat];
1524
1525 if (!NILP (sys))
1526 return XCODING_SYSTEM_NAME (sys);
1527 return Qnil;
1528 }
1529
1530
1531 /************************************************************************/
1532 /* Detecting the encoding of data */
1533 /************************************************************************/
1534
1535 struct detection_state
1536 {
1537 eol_type_t eol_type;
1538 int seen_non_ascii;
1539 int mask;
1540 #ifdef MULE
1541 struct
1542 {
1543 int mask;
1544 int in_second_byte;
1545 }
1546 big5;
1547
1548 struct
1549 {
1550 int mask;
1551 int in_second_byte;
1552 }
1553 shift_jis;
1554
1555 struct
1556 {
1557 int mask;
1558 int in_byte;
1559 }
1560 ucs4;
1561
1562 struct
1563 {
1564 int mask;
1565 int in_byte;
1566 }
1567 utf8;
1568
1569 struct
1570 {
1571 int mask;
1572 int initted;
1573 struct iso2022_decoder iso;
1574 unsigned int flags;
1575 int high_byte_count;
1576 unsigned int saw_single_shift:1;
1577 }
1578 iso2022;
1579 #endif
1580 struct
1581 {
1582 int seen_anything;
1583 int just_saw_cr;
1584 }
1585 eol;
1586 };
1587 3460
1588 static int 3461 static int
1589 acceptable_control_char_p (int c) 3462 acceptable_control_char_p (int c)
1590 { 3463 {
1591 switch (c) 3464 switch (c)
1605 default: 3478 default:
1606 return 0; 3479 return 0;
1607 } 3480 }
1608 } 3481 }
1609 3482
3483 #ifdef DEBUG_XEMACS
3484
3485 static UExtbyte
3486 hex_digit_to_char (int digit)
3487 {
3488 if (digit < 10)
3489 return digit + '0';
3490 else
3491 return digit - 10 + 'A';
3492 }
3493
3494 static void
3495 output_bytes_in_ascii_and_hex (const UExtbyte *src, Bytecount n)
3496 {
3497 UExtbyte *ascii = alloca_array (UExtbyte, n + 1);
3498 UExtbyte *hex = alloca_array (UExtbyte, 3 * n + 1);
3499 int i;
3500
3501 for (i = 0; i < n; i++)
3502 {
3503 UExtbyte c = src[i];
3504 if (c < 0x20)
3505 ascii[i] = '.';
3506 else
3507 ascii[i] = c;
3508 hex[3 * i] = hex_digit_to_char (c >> 4);
3509 hex[3 * i + 1] = hex_digit_to_char (c & 0xF);
3510 hex[3 * i + 2] = ' ';
3511 }
3512 ascii[i] = '\0';
3513 hex[3 * i - 1] = '\0';
3514 stderr_out ("%s %s", ascii, hex);
3515 }
3516
3517 #endif /* DEBUG_XEMACS */
3518
3519 /* Attempt to determine the encoding of the given text. Before calling
3520 this function for the first time, you must zero out the detection state.
3521
3522 Returns:
3523
3524 0 == keep going
3525 1 == stop
3526 */
3527
1610 static int 3528 static int
1611 mask_has_at_most_one_bit_p (int mask) 3529 detect_coding_type (struct detection_state *st, const UExtbyte *src,
1612 { 3530 Bytecount n)
1613 /* Perhaps the only thing useful you learn from intensive Microsoft 3531 {
1614 technical interviews */ 3532 Bytecount n2 = n;
1615 return (mask & (mask - 1)) == 0; 3533 const UExtbyte *src2 = src;
1616 } 3534 int i;
1617 3535
1618 static eol_type_t 3536 #ifdef DEBUG_XEMACS
1619 detect_eol_type (struct detection_state *st, const Extbyte *src, 3537 if (!NILP (Vdebug_coding_detection))
1620 Bytecount n) 3538 {
1621 { 3539 int bytes = min (16, n);
1622 while (n--) 3540 stderr_out ("detect_coding_type: processing %ld bytes\n", n);
1623 { 3541 stderr_out ("First %d: ", bytes);
1624 unsigned char c = *(unsigned char *)src++; 3542 output_bytes_in_ascii_and_hex (src, bytes);
1625 if (c == '\n') 3543 stderr_out ("\nLast %d: ", bytes);
3544 output_bytes_in_ascii_and_hex (src + n - bytes, bytes);
3545 stderr_out ("\n");
3546 }
3547 #endif /* DEBUG_XEMACS */
3548 if (!st->seen_non_ascii)
3549 {
3550 for (; n2; n2--, src2++)
1626 { 3551 {
1627 if (st->eol.just_saw_cr) 3552 UExtbyte c = *src2;
1628 return EOL_CRLF;
1629 else if (st->eol.seen_anything)
1630 return EOL_LF;
1631 }
1632 else if (st->eol.just_saw_cr)
1633 return EOL_CR;
1634 else if (c == '\r')
1635 st->eol.just_saw_cr = 1;
1636 else
1637 st->eol.just_saw_cr = 0;
1638 st->eol.seen_anything = 1;
1639 }
1640
1641 return EOL_AUTODETECT;
1642 }
1643
1644 /* Attempt to determine the encoding and EOL type of the given text.
1645 Before calling this function for the first type, you must initialize
1646 st->eol_type as appropriate and initialize st->mask to ~0.
1647
1648 st->eol_type holds the determined EOL type, or EOL_AUTODETECT if
1649 not yet known.
1650
1651 st->mask holds the determined coding category mask, or ~0 if only
1652 ASCII has been seen so far.
1653
1654 Returns:
1655
1656 0 == st->eol_type is EOL_AUTODETECT and/or more than coding category
1657 is present in st->mask
1658 1 == definitive answers are here for both st->eol_type and st->mask
1659 */
1660
1661 static int
1662 detect_coding_type (struct detection_state *st, const Extbyte *src,
1663 Bytecount n, int just_do_eol)
1664 {
1665 if (st->eol_type == EOL_AUTODETECT)
1666 st->eol_type = detect_eol_type (st, src, n);
1667
1668 if (just_do_eol)
1669 return st->eol_type != EOL_AUTODETECT;
1670
1671 if (!st->seen_non_ascii)
1672 {
1673 for (; n; n--, src++)
1674 {
1675 unsigned char c = *(unsigned char *) src;
1676 if ((c < 0x20 && !acceptable_control_char_p (c)) || c >= 0x80) 3553 if ((c < 0x20 && !acceptable_control_char_p (c)) || c >= 0x80)
1677 { 3554 {
1678 st->seen_non_ascii = 1; 3555 st->seen_non_ascii = 1;
1679 #ifdef MULE
1680 st->shift_jis.mask = ~0;
1681 st->big5.mask = ~0;
1682 st->ucs4.mask = ~0;
1683 st->utf8.mask = ~0;
1684 st->iso2022.mask = ~0;
1685 #endif
1686 break; 3556 break;
1687 } 3557 }
1688 } 3558 }
1689 } 3559 }
1690 3560
1691 if (!n) 3561 for (i = 0; i < coding_detector_count; i++)
1692 return 0; 3562 Dynarr_at (all_coding_detectors, i).detect_method (st, src, n);
1693 #ifdef MULE 3563
1694 if (!mask_has_at_most_one_bit_p (st->iso2022.mask)) 3564 st->bytes_seen += n;
1695 st->iso2022.mask = detect_coding_iso2022 (st, src, n); 3565
1696 if (!mask_has_at_most_one_bit_p (st->shift_jis.mask)) 3566 #ifdef DEBUG_XEMACS
1697 st->shift_jis.mask = detect_coding_sjis (st, src, n); 3567 if (!NILP (Vdebug_coding_detection))
1698 if (!mask_has_at_most_one_bit_p (st->big5.mask)) 3568 {
1699 st->big5.mask = detect_coding_big5 (st, src, n); 3569 stderr_out ("seen_non_ascii: %d\n", st->seen_non_ascii);
1700 if (!mask_has_at_most_one_bit_p (st->utf8.mask)) 3570 for (i = 0; i < coding_detector_category_count; i++)
1701 st->utf8.mask = detect_coding_utf8 (st, src, n); 3571 stderr_out_lisp
1702 if (!mask_has_at_most_one_bit_p (st->ucs4.mask)) 3572 ("%s: %s\n",
1703 st->ucs4.mask = detect_coding_ucs4 (st, src, n); 3573 2,
1704 3574 coding_category_id_to_symbol (i),
1705 st->mask 3575 detection_result_number_to_symbol ((enum detection_result)
1706 = st->iso2022.mask | st->shift_jis.mask | st->big5.mask 3576 st->categories[i]));
1707 | st->utf8.mask | st->ucs4.mask; 3577 }
3578 #endif /* DEBUG_XEMACS */
3579
3580 {
3581 int not_unlikely = 0;
3582 int retval;
3583
3584 for (i = 0; i < coding_detector_category_count; i++)
3585 if (st->categories[i] >= 0)
3586 not_unlikely++;
3587
3588 retval = (not_unlikely <= 1
3589 #if 0 /* this is bogus */
3590 || st->bytes_seen >= MAX_BYTES_PROCESSED_FOR_DETECTION
1708 #endif 3591 #endif
1709 { 3592 );
1710 int retval = mask_has_at_most_one_bit_p (st->mask); 3593
1711 st->mask |= CODING_CATEGORY_NO_CONVERSION_MASK; 3594 #ifdef DEBUG_XEMACS
1712 return retval && st->eol_type != EOL_AUTODETECT; 3595 if (!NILP (Vdebug_coding_detection))
3596 stderr_out ("detect_coding_type: returning %d (%s)\n",
3597 retval, retval ? "stop" : "keep going");
3598 #endif /* DEBUG_XEMACS */
3599
3600 return retval;
1713 } 3601 }
1714 } 3602 }
1715 3603
1716 static Lisp_Object 3604 static Lisp_Object
1717 coding_system_from_mask (int mask) 3605 detected_coding_system (struct detection_state *st)
1718 { 3606 {
1719 if (mask == ~0) 3607 int i;
3608 int even = 1;
3609
3610 if (st->seen_non_ascii)
3611 {
3612 for (i = 0; i < coding_detector_category_count; i++)
3613 if (st->categories[i] != DET_AS_LIKELY_AS_UNLIKELY)
3614 {
3615 even = 0;
3616 break;
3617 }
3618 }
3619
3620 /* #### Here we are ignoring the results of detection when it's all
3621 ASCII. This is obviously a bad thing. But we need to fix up the
3622 existing detection methods somewhat before we can switch. */
3623 if (even)
1720 { 3624 {
1721 /* If the file was entirely or basically ASCII, use the 3625 /* If the file was entirely or basically ASCII, use the
1722 default value of `buffer-file-coding-system'. */ 3626 default value of `buffer-file-coding-system'. */
1723 Lisp_Object retval = 3627 Lisp_Object retval =
1724 XBUFFER (Vbuffer_defaults)->buffer_file_coding_system; 3628 XBUFFER (Vbuffer_defaults)->buffer_file_coding_system;
1725 if (!NILP (retval)) 3629 if (!NILP (retval))
1726 { 3630 {
1727 retval = Ffind_coding_system (retval); 3631 retval = find_coding_system_for_text_file (retval, 0);
1728 if (NILP (retval)) 3632 if (NILP (retval))
1729 { 3633 {
1730 warn_when_safe 3634 warn_when_safe
1731 (Qbad_variable, Qwarning, 3635 (Qbad_variable, Qwarning,
1732 "Invalid `default-buffer-file-coding-system', set to nil"); 3636 "Invalid `default-buffer-file-coding-system', set to nil");
1737 retval = Fget_coding_system (Qraw_text); 3641 retval = Fget_coding_system (Qraw_text);
1738 return retval; 3642 return retval;
1739 } 3643 }
1740 else 3644 else
1741 { 3645 {
1742 int i; 3646 int likelihood;
1743 int cat = -1; 3647 Lisp_Object retval = Qnil;
1744 #ifdef MULE 3648
1745 mask = postprocess_iso2022_mask (mask); 3649 /* Look through the coding categories first by likelihood and then by
1746 #endif 3650 priority and find the first one that is allowed. */
1747 /* Look through the coding categories by priority and find 3651
1748 the first one that is allowed. */ 3652 for (likelihood = DET_HIGHEST; likelihood >= DET_LOWEST; likelihood--)
1749 for (i = 0; i < CODING_CATEGORY_LAST; i++)
1750 { 3653 {
1751 cat = fcd->coding_category_by_priority[i]; 3654 for (i = 0; i < coding_detector_category_count; i++)
1752 if ((mask & (1 << cat)) && 3655 {
1753 !NILP (fcd->coding_category_system[cat])) 3656 int cat = coding_category_by_priority[i];
1754 break; 3657 if (st->categories[cat] == likelihood &&
3658 !NILP (coding_category_system[cat]))
3659 {
3660 retval = (get_coding_system_for_text_file
3661 (coding_category_system[cat], 0));
3662 if (likelihood < DET_AS_LIKELY_AS_UNLIKELY)
3663 warn_when_safe_lispobj
3664 (intern ("detection"),
3665 Qerror,
3666 emacs_sprintf_string_lisp
3667 (
3668 "Detected coding %s is unlikely to be correct (likelihood == `%s')",
3669 Qnil, 2, XCODING_SYSTEM_NAME (retval),
3670 detection_result_number_to_symbol
3671 ((enum detection_result) likelihood)));
3672 return retval;
3673 }
3674 }
1755 } 3675 }
1756 if (cat >= 0) 3676
1757 return fcd->coding_category_system[cat]; 3677 return Fget_coding_system (Qraw_text);
1758 else
1759 return Fget_coding_system (Qraw_text);
1760 } 3678 }
1761 } 3679 }
1762 3680
1763 /* Given a seekable read stream and potential coding system and EOL type 3681 /* Given a seekable read stream and potential coding system and EOL type
1764 as specified, do any autodetection that is called for. If the 3682 as specified, do any autodetection that is called for. If the
1769 This function does not automatically fetch subsidiary coding systems; 3687 This function does not automatically fetch subsidiary coding systems;
1770 that should be unnecessary with the explicit eol-type argument. */ 3688 that should be unnecessary with the explicit eol-type argument. */
1771 3689
1772 #define LENGTH(string_constant) (sizeof (string_constant) - 1) 3690 #define LENGTH(string_constant) (sizeof (string_constant) - 1)
1773 3691
1774 void 3692 static Lisp_Object
1775 determine_real_coding_system (Lstream *stream, Lisp_Object *codesys_in_out, 3693 unwind_free_detection_state (Lisp_Object opaque)
1776 eol_type_t *eol_type_in_out) 3694 {
1777 { 3695 struct detection_state *st =
1778 struct detection_state decst; 3696 (struct detection_state *) get_opaque_ptr (opaque);
1779 3697 free_detection_state (st);
1780 if (*eol_type_in_out == EOL_AUTODETECT) 3698 free_opaque_ptr (opaque);
1781 *eol_type_in_out = XCODING_SYSTEM_EOL_TYPE (*codesys_in_out); 3699 return Qnil;
1782 3700 }
1783 xzero (decst); 3701
1784 decst.eol_type = *eol_type_in_out; 3702 static Lisp_Object
1785 decst.mask = ~0; 3703 look_for_coding_system_magic_cookie (const UExtbyte *data, Bytecount len)
1786 3704 {
1787 /* If autodetection is called for, do it now. */ 3705 Lisp_Object coding_system = Qnil;
1788 if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT 3706 const UExtbyte *p;
1789 || *eol_type_in_out == EOL_AUTODETECT) 3707 const UExtbyte *scan_end;
1790 { 3708
1791 Extbyte buf[4096]; 3709 /* Look for initial "-*-"; mode line prefix */
1792 Lisp_Object coding_system = Qnil; 3710 for (p = data,
1793 Extbyte *p; 3711 scan_end = data + len - LENGTH ("-*-coding:?-*-");
1794 Bytecount nread = Lstream_read (stream, buf, sizeof (buf)); 3712 p <= scan_end
1795 Extbyte *scan_end; 3713 && *p != '\n'
1796 3714 && *p != '\r';
1797 /* Look for initial "-*-"; mode line prefix */ 3715 p++)
1798 for (p = buf, 3716 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-')
1799 scan_end = buf + nread - LENGTH ("-*-coding:?-*-"); 3717 {
1800 p <= scan_end 3718 const UExtbyte *local_vars_beg = p + 3;
3719 /* Look for final "-*-"; mode line suffix */
3720 for (p = local_vars_beg,
3721 scan_end = data + len - LENGTH ("-*-");
3722 p <= scan_end
1801 && *p != '\n' 3723 && *p != '\n'
1802 && *p != '\r'; 3724 && *p != '\r';
1803 p++) 3725 p++)
1804 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') 3726 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-')
1805 { 3727 {
1806 Extbyte *local_vars_beg = p + 3; 3728 const UExtbyte *suffix = p;
1807 /* Look for final "-*-"; mode line suffix */ 3729 /* Look for "coding:" */
1808 for (p = local_vars_beg, 3730 for (p = local_vars_beg,
1809 scan_end = buf + nread - LENGTH ("-*-"); 3731 scan_end = suffix - LENGTH ("coding:?");
1810 p <= scan_end 3732 p <= scan_end;
1811 && *p != '\n' 3733 p++)
1812 && *p != '\r'; 3734 if (memcmp ("coding:", p, LENGTH ("coding:")) == 0
1813 p++) 3735 && (p == local_vars_beg
1814 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') 3736 || (*(p-1) == ' ' ||
3737 *(p-1) == '\t' ||
3738 *(p-1) == ';')))
3739 {
3740 Bytecount n;
3741 Intbyte *name;
3742
3743 p += LENGTH ("coding:");
3744 while (*p == ' ' || *p == '\t') p++;
3745 name = alloca_intbytes (suffix - p + 1);
3746 memcpy (name, p, suffix - p);
3747 name[suffix - p] = '\0';
3748
3749 /* Get coding system name */
3750 /* Characters valid in a MIME charset name (rfc 1521),
3751 and in a Lisp symbol name. */
3752 n = qxestrspn (name,
3753 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
3754 "abcdefghijklmnopqrstuvwxyz"
3755 "0123456789"
3756 "!$%&*+-.^_{|}~");
3757 if (n > 0)
3758 {
3759 name[n] = '\0';
3760 coding_system =
3761 find_coding_system_for_text_file (intern_int (name),
3762 0);
3763 }
3764 break;
3765 }
3766 break;
3767 }
3768 break;
3769 }
3770
3771 return coding_system;
3772 }
3773
3774 static Lisp_Object
3775 determine_real_coding_system (Lstream *stream)
3776 {
3777 struct detection_state *st = allocate_detection_state ();
3778 int depth = record_unwind_protect (unwind_free_detection_state,
3779 make_opaque_ptr (st));
3780 UExtbyte buf[4096];
3781 Bytecount nread = Lstream_read (stream, buf, sizeof (buf));
3782 Lisp_Object coding_system = look_for_coding_system_magic_cookie (buf, nread);
3783
3784 if (NILP (coding_system))
3785 {
3786 while (1)
3787 {
3788 if (detect_coding_type (st, buf, nread))
3789 break;
3790 nread = Lstream_read (stream, buf, sizeof (buf));
3791 if (nread == 0)
3792 break;
3793 }
3794
3795 coding_system = detected_coding_system (st);
3796 }
3797
3798 Lstream_rewind (stream);
3799
3800 unbind_to (depth);
3801 return coding_system;
3802 }
3803
3804 static void
3805 undecided_init_coding_stream (struct coding_stream *str)
3806 {
3807 struct undecided_coding_stream *data =
3808 CODING_STREAM_TYPE_DATA (str, undecided);
3809 struct undecided_coding_system *csdata =
3810 XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided);
3811
3812 data->actual = Qnil;
3813
3814 if (str->direction == CODING_DECODE)
3815 {
3816 Lstream *lst = str->other_end;
3817
3818 if ((lst->flags & LSTREAM_FL_READ) &&
3819 Lstream_seekable_p (lst) &&
3820 csdata->do_coding)
3821 /* We can determine the coding system now. */
3822 data->actual = determine_real_coding_system (lst);
3823 }
3824 }
3825
3826 static void
3827 undecided_rewind_coding_stream (struct coding_stream *str)
3828 {
3829 chain_rewind_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c);
3830 }
3831
3832 static void
3833 undecided_finalize_coding_stream (struct coding_stream *str)
3834 {
3835 struct undecided_coding_stream *data =
3836 CODING_STREAM_TYPE_DATA (str, undecided);
3837
3838 chain_finalize_coding_stream_1
3839 (&CODING_STREAM_TYPE_DATA (str, undecided)->c);
3840 if (data->st)
3841 free_detection_state (data->st);
3842 }
3843
3844 static Lisp_Object
3845 undecided_canonicalize (Lisp_Object codesys)
3846 {
3847 struct undecided_coding_system *csdata =
3848 XCODING_SYSTEM_TYPE_DATA (codesys, undecided);
3849 if (!csdata->do_eol && !csdata->do_coding)
3850 return NILP (csdata->cs) ? Fget_coding_system (Qbinary) : csdata->cs;
3851 if (csdata->do_eol && !csdata->do_coding && NILP (csdata->cs))
3852 return Fget_coding_system (Qconvert_eol_autodetect);
3853 return codesys;
3854 }
3855
3856 static Bytecount
3857 undecided_convert (struct coding_stream *str, const UExtbyte *src,
3858 unsigned_char_dynarr *dst, Bytecount n)
3859 {
3860 int first_time = 0;
3861
3862 if (str->direction == CODING_DECODE)
3863 {
3864 /* At this point, we have only the following possibilities:
3865
3866 do_eol && do_coding
3867 do_coding only
3868 do_eol only and a coding system was specified
3869
3870 Other possibilities are removed during undecided_canonicalize.
3871
3872 Therefore, our substreams are either
3873
3874 lstream_coding -> lstream_dynarr, or
3875 lstream_coding -> lstream_eol -> lstream_dynarr.
3876 */
3877 struct undecided_coding_system *csdata =
3878 XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided);
3879 struct undecided_coding_stream *data =
3880 CODING_STREAM_TYPE_DATA (str, undecided);
3881
3882 if (str->eof)
3883 {
3884 /* Each will close the next. We need to close now because more
3885 data may be generated. */
3886 if (data->c.initted)
3887 Lstream_close (XLSTREAM (data->c.lstreams[0]));
3888 return n;
3889 }
3890
3891 if (!data->c.initted)
3892 {
3893 data->c.lstream_count = csdata->do_eol ? 3 : 2;
3894 data->c.lstreams = xnew_array (Lisp_Object, data->c.lstream_count);
3895
3896 data->c.lstreams[data->c.lstream_count - 1] =
3897 make_dynarr_output_stream (dst);
3898 Lstream_set_buffering
3899 (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]),
3900 LSTREAM_UNBUFFERED, 0);
3901 if (csdata->do_eol)
3902 {
3903 data->c.lstreams[1] =
3904 make_coding_output_stream
3905 (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]),
3906 Fget_coding_system (Qconvert_eol_autodetect),
3907 CODING_DECODE);
3908 Lstream_set_buffering
3909 (XLSTREAM (data->c.lstreams[1]),
3910 LSTREAM_UNBUFFERED, 0);
3911 }
3912
3913 data->c.lstreams[0] =
3914 make_coding_output_stream
3915 (XLSTREAM (data->c.lstreams[1]),
3916 /* Substitute binary if we need to detect the encoding */
3917 csdata->do_coding ? Qbinary : csdata->cs,
3918 CODING_DECODE);
3919 Lstream_set_buffering (XLSTREAM (data->c.lstreams[0]),
3920 LSTREAM_UNBUFFERED, 0);
3921
3922 first_time = 1;
3923 data->c.initted = 1;
3924 }
3925
3926 /* If necessary, do encoding-detection now. We do this when we're a
3927 writing stream or a non-seekable reading stream, meaning that we
3928 can't just process the whole input, rewind, and start over. */
3929
3930 if (csdata->do_coding)
3931 {
3932 int actual_was_nil = NILP (data->actual);
3933 if (NILP (data->actual))
3934 {
3935 if (!data->st)
3936 data->st = allocate_detection_state ();
3937 if (first_time)
3938 /* #### This is cheesy. What we really ought to do is buffer
3939 up a certain minimum amount of data to get a better result.
3940 */
3941 data->actual = look_for_coding_system_magic_cookie (src, n);
3942 if (NILP (data->actual))
1815 { 3943 {
1816 Extbyte *suffix = p; 3944 /* #### This is cheesy. What we really ought to do is buffer
1817 /* Look for "coding:" */ 3945 up a certain minimum amount of data so as to get a less
1818 for (p = local_vars_beg, 3946 random result when doing subprocess detection. */
1819 scan_end = suffix - LENGTH ("coding:?"); 3947 detect_coding_type (data->st, src, n);
1820 p <= scan_end; 3948 data->actual = detected_coding_system (data->st);
1821 p++)
1822 if (memcmp ("coding:", p, LENGTH ("coding:")) == 0
1823 && (p == local_vars_beg
1824 || (*(p-1) == ' ' ||
1825 *(p-1) == '\t' ||
1826 *(p-1) == ';')))
1827 {
1828 Extbyte save;
1829 int n;
1830 p += LENGTH ("coding:");
1831 while (*p == ' ' || *p == '\t') p++;
1832
1833 /* Get coding system name */
1834 save = *suffix; *suffix = '\0';
1835 /* Characters valid in a MIME charset name (rfc 1521),
1836 and in a Lisp symbol name. */
1837 n = strspn ( (char *) p,
1838 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
1839 "abcdefghijklmnopqrstuvwxyz"
1840 "0123456789"
1841 "!$%&*+-.^_{|}~");
1842 *suffix = save;
1843 if (n > 0)
1844 {
1845 save = p[n]; p[n] = '\0';
1846 coding_system =
1847 Ffind_coding_system (intern ((char *) p));
1848 p[n] = save;
1849 }
1850 break;
1851 }
1852 break;
1853 } 3949 }
1854 break; 3950 }
1855 } 3951 /* We need to set the detected coding system if we actually have
1856 3952 such a coding system but didn't before. That is the case
1857 if (NILP (coding_system)) 3953 either when we just detected it in the previous code or when
1858 do 3954 it was detected during undecided_init_coding_stream(). We
1859 { 3955 can check for that using first_time. */
1860 if (detect_coding_type (&decst, buf, nread, 3956 if (!NILP (data->actual) && (actual_was_nil || first_time))
1861 XCODING_SYSTEM_TYPE (*codesys_in_out) 3957 {
1862 != CODESYS_AUTODETECT)) 3958 /* If the detected coding system doesn't allow for EOL
1863 break; 3959 autodetection, try to get the equivalent that does;
1864 nread = Lstream_read (stream, buf, sizeof (buf)); 3960 otherwise, disable EOL detection (overriding whatever
1865 if (nread == 0) 3961 may already have been detected). */
1866 break; 3962 if (XCODING_SYSTEM_EOL_TYPE (data->actual) != EOL_AUTODETECT)
1867 } 3963 {
1868 while (1); 3964 if (!NILP (XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual)))
1869 3965 data->actual =
1870 else if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT 3966 XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual);
1871 && XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT) 3967 else if (data->c.lstream_count == 3)
1872 do 3968 set_coding_stream_coding_system
1873 { 3969 (XLSTREAM (data->c.lstreams[1]),
1874 if (detect_coding_type (&decst, buf, nread, 1)) 3970 Fget_coding_system (Qidentity));
1875 break; 3971 }
1876 nread = Lstream_read (stream, buf, sizeof (buf)); 3972 set_coding_stream_coding_system
1877 if (!nread) 3973 (XLSTREAM (data->c.lstreams[0]), data->actual);
1878 break; 3974 }
1879 }
1880 while (1);
1881
1882 *eol_type_in_out = decst.eol_type;
1883 if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT)
1884 {
1885 if (NILP (coding_system))
1886 *codesys_in_out = coding_system_from_mask (decst.mask);
1887 else
1888 *codesys_in_out = coding_system;
1889 } 3975 }
1890 } 3976
1891 3977 if (Lstream_write (XLSTREAM (data->c.lstreams[0]), src, n) < 0)
1892 /* If we absolutely can't determine the EOL type, just assume LF. */ 3978 return -1;
1893 if (*eol_type_in_out == EOL_AUTODETECT) 3979 return n;
1894 *eol_type_in_out = EOL_LF; 3980 }
1895 3981 else
1896 Lstream_rewind (stream); 3982 return no_conversion_convert (str, src, dst, n);
3983 }
3984
3985 static Lisp_Object
3986 undecided_canonicalize_after_coding (struct coding_stream *str)
3987 {
3988 struct undecided_coding_stream *data =
3989 CODING_STREAM_TYPE_DATA (str, undecided);
3990 Lisp_Object ret, eolret;
3991
3992 if (str->direction == CODING_ENCODE)
3993 return str->codesys;
3994
3995 if (!data->c.initted)
3996 return Fget_coding_system (Qundecided);
3997
3998 ret = coding_stream_canonicalize_after_coding
3999 (XLSTREAM (data->c.lstreams[0]));
4000 if (NILP (ret))
4001 ret = Fget_coding_system (Qundecided);
4002 if (XCODING_SYSTEM_EOL_TYPE (ret) != EOL_AUTODETECT)
4003 return ret;
4004 eolret = coding_stream_canonicalize_after_coding
4005 (XLSTREAM (data->c.lstreams[1]));
4006 if (!EQ (XCODING_SYSTEM_TYPE (eolret), Qconvert_eol))
4007 return ret;
4008 return
4009 Fsubsidiary_coding_system (ret, Fcoding_system_property (eolret,
4010 Qsubtype));
4011 }
4012
4013
4014 /************************************************************************/
4015 /* Lisp interface: Coding category functions and detection */
4016 /************************************************************************/
4017
4018 DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /*
4019 Return a list of all recognized coding categories.
4020 */
4021 ())
4022 {
4023 int i;
4024 Lisp_Object list = Qnil;
4025
4026 for (i = 0; i < coding_detector_count; i++)
4027 {
4028 detector_category_dynarr *cats =
4029 Dynarr_at (all_coding_detectors, i).cats;
4030 int j;
4031
4032 for (j = 0; j < Dynarr_length (cats); j++)
4033 list = Fcons (Dynarr_at (cats, j).sym, list);
4034 }
4035
4036 return Fnreverse (list);
4037 }
4038
4039 DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /*
4040 Change the priority order of the coding categories.
4041 LIST should be list of coding categories, in descending order of
4042 priority. Unspecified coding categories will be lower in priority
4043 than all specified ones, in the same relative order they were in
4044 previously.
4045 */
4046 (list))
4047 {
4048 int *category_to_priority =
4049 alloca_array (int, coding_detector_category_count);
4050 int i, j;
4051 Lisp_Object rest;
4052
4053 /* First generate a list that maps coding categories to priorities. */
4054
4055 for (i = 0; i < coding_detector_category_count; i++)
4056 category_to_priority[i] = -1;
4057
4058 /* Highest priority comes from the specified list. */
4059 i = 0;
4060 EXTERNAL_LIST_LOOP (rest, list)
4061 {
4062 int cat = coding_category_symbol_to_id (XCAR (rest));
4063
4064 if (category_to_priority[cat] >= 0)
4065 sferror ("Duplicate coding category in list", XCAR (rest));
4066 category_to_priority[cat] = i++;
4067 }
4068
4069 /* Now go through the existing categories by priority to retrieve
4070 the categories not yet specified and preserve their priority
4071 order. */
4072 for (j = 0; j < coding_detector_category_count; j++)
4073 {
4074 int cat = coding_category_by_priority[j];
4075 if (category_to_priority[cat] < 0)
4076 category_to_priority[cat] = i++;
4077 }
4078
4079 /* Now we need to construct the inverse of the mapping we just
4080 constructed. */
4081
4082 for (i = 0; i < coding_detector_category_count; i++)
4083 coding_category_by_priority[category_to_priority[i]] = i;
4084
4085 /* Phew! That was confusing. */
4086 return Qnil;
4087 }
4088
4089 DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /*
4090 Return a list of coding categories in descending order of priority.
4091 */
4092 ())
4093 {
4094 int i;
4095 Lisp_Object list = Qnil;
4096
4097 for (i = 0; i < coding_detector_category_count; i++)
4098 list =
4099 Fcons (coding_category_id_to_symbol (coding_category_by_priority[i]),
4100 list);
4101 return Fnreverse (list);
4102 }
4103
4104 DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /*
4105 Change the coding system associated with a coding category.
4106 */
4107 (coding_category, coding_system))
4108 {
4109 coding_category_system[coding_category_symbol_to_id (coding_category)] =
4110 Fget_coding_system (coding_system);
4111 return Qnil;
4112 }
4113
4114 DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /*
4115 Return the coding system associated with a coding category.
4116 */
4117 (coding_category))
4118 {
4119 Lisp_Object sys =
4120 coding_category_system[coding_category_symbol_to_id (coding_category)];
4121
4122 if (!NILP (sys))
4123 return XCODING_SYSTEM_NAME (sys);
4124 return Qnil;
4125 }
4126
4127 Lisp_Object
4128 detect_coding_stream (Lisp_Object stream)
4129 {
4130 Lisp_Object val = Qnil;
4131 struct gcpro gcpro1, gcpro2, gcpro3;
4132 UExtbyte random_buffer[65536];
4133 Lisp_Object binary_instream =
4134 make_coding_input_stream
4135 (XLSTREAM (stream), Qbinary,
4136 CODING_ENCODE);
4137 Lisp_Object decstream =
4138 make_coding_input_stream
4139 (XLSTREAM (binary_instream),
4140 Qundecided, CODING_DECODE);
4141 Lstream *decstr = XLSTREAM (decstream);
4142
4143 GCPRO3 (decstream, stream, binary_instream);
4144 /* Read and discard all data; detection happens as a side effect of this,
4145 and we examine what was detected afterwards. */
4146 while (Lstream_read (decstr, random_buffer, sizeof (random_buffer)) > 0)
4147 ;
4148
4149 val = coding_stream_detected_coding_system (decstr);
4150 Lstream_close (decstr);
4151 Lstream_delete (decstr);
4152 Lstream_delete (XLSTREAM (binary_instream));
4153 UNGCPRO;
4154 return val;
1897 } 4155 }
1898 4156
1899 DEFUN ("detect-coding-region", Fdetect_coding_region, 2, 3, 0, /* 4157 DEFUN ("detect-coding-region", Fdetect_coding_region, 2, 3, 0, /*
1900 Detect coding system of the text in the region between START and END. 4158 Detect coding system of the text in the region between START and END.
1901 Return a list of possible coding systems ordered by priority. 4159 Return a list of possible coding systems ordered by priority.
1906 (start, end, buffer)) 4164 (start, end, buffer))
1907 { 4165 {
1908 Lisp_Object val = Qnil; 4166 Lisp_Object val = Qnil;
1909 struct buffer *buf = decode_buffer (buffer, 0); 4167 struct buffer *buf = decode_buffer (buffer, 0);
1910 Charbpos b, e; 4168 Charbpos b, e;
1911 Lisp_Object instream, lb_instream; 4169 Lisp_Object lb_instream;
1912 Lstream *istr, *lb_istr;
1913 struct detection_state decst;
1914 struct gcpro gcpro1, gcpro2;
1915 4170
1916 get_buffer_range_char (buf, start, end, &b, &e, 0); 4171 get_buffer_range_char (buf, start, end, &b, &e, 0);
1917 lb_instream = make_lisp_buffer_input_stream (buf, b, e, 0); 4172 lb_instream = make_lisp_buffer_input_stream (buf, b, e, 0);
1918 lb_istr = XLSTREAM (lb_instream); 4173
1919 instream = make_encoding_input_stream (lb_istr, Fget_coding_system (Qbinary)); 4174 val = detect_coding_stream (lb_instream);
1920 istr = XLSTREAM (instream); 4175 Lstream_delete (XLSTREAM (lb_instream));
1921 GCPRO2 (instream, lb_instream); 4176 return val;
1922 xzero (decst); 4177 }
1923 decst.eol_type = EOL_AUTODETECT; 4178
1924 decst.mask = ~0; 4179
1925 while (1) 4180
1926 { 4181 #ifdef DEBUG_XEMACS
1927 Extbyte random_buffer[4096]; 4182
1928 Bytecount nread = Lstream_read (istr, random_buffer, sizeof (random_buffer)); 4183 /************************************************************************/
1929 4184 /* Internal methods */
1930 if (!nread) 4185 /************************************************************************/
1931 break; 4186
1932 if (detect_coding_type (&decst, random_buffer, nread, 0)) 4187 /* Raw (internally-formatted) data. */
1933 break; 4188 DEFINE_CODING_SYSTEM_TYPE (internal);
1934 } 4189
1935 4190 static Bytecount
1936 if (decst.mask == ~0) 4191 internal_convert (struct coding_stream *str, const UExtbyte *src,
1937 val = subsidiary_coding_system (Fget_coding_system (Qundecided), 4192 unsigned_char_dynarr *dst, Bytecount n)
1938 decst.eol_type); 4193 {
4194 Bytecount orign = n;
4195 Dynarr_add_many (dst, src, n);
4196 return orign;
4197 }
4198
4199 #endif /* DEBUG_XEMACS */
4200
4201
4202
4203 #ifdef HAVE_ZLIB
4204
4205 /************************************************************************/
4206 /* Gzip methods */
4207 /************************************************************************/
4208
4209 DEFINE_CODING_SYSTEM_TYPE (gzip);
4210
4211 struct gzip_coding_system
4212 {
4213 int level; /* 0 through 9, or -1 for default */
4214 };
4215
4216 #define CODING_SYSTEM_GZIP_LEVEL(codesys) \
4217 (CODING_SYSTEM_TYPE_DATA (codesys, gzip)->level)
4218 #define XCODING_SYSTEM_GZIP_LEVEL(codesys) \
4219 (XCODING_SYSTEM_TYPE_DATA (codesys, gzip)->level)
4220
4221 struct gzip_coding_stream
4222 {
4223 z_stream stream;
4224 int stream_initted;
4225 int reached_eof; /* #### this should be handled by the caller, once we
4226 return LSTREAM_EOF */
4227 };
4228
4229 static const struct lrecord_description
4230 gzip_coding_system_description[] = {
4231 { XD_END }
4232 };
4233
4234 enum source_sink_type
4235 gzip_conversion_end_type (Lisp_Object codesys)
4236 {
4237 return DECODES_BYTE_TO_BYTE;
4238 }
4239
4240 static void
4241 gzip_init (Lisp_Object codesys)
4242 {
4243 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip);
4244 data->level = -1;
4245 }
4246
4247 static void
4248 gzip_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag)
4249 {
4250 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (cs, gzip);
4251
4252 write_c_string ("(", printcharfun);
4253 if (data->level == -1)
4254 write_c_string ("default", printcharfun);
1939 else 4255 else
1940 { 4256 print_internal (make_int (data->level), printcharfun, 0);
1941 int i; 4257 write_c_string (")", printcharfun);
1942
1943 val = Qnil;
1944 #ifdef MULE
1945 decst.mask = postprocess_iso2022_mask (decst.mask);
1946 #endif
1947 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--)
1948 {
1949 int sys = fcd->coding_category_by_priority[i];
1950 if (decst.mask & (1 << sys))
1951 {
1952 Lisp_Object codesys = fcd->coding_category_system[sys];
1953 if (!NILP (codesys))
1954 codesys = subsidiary_coding_system (codesys, decst.eol_type);
1955 val = Fcons (codesys, val);
1956 }
1957 }
1958 }
1959 Lstream_close (istr);
1960 UNGCPRO;
1961 Lstream_delete (istr);
1962 Lstream_delete (lb_istr);
1963 return val;
1964 }
1965
1966
1967 /************************************************************************/
1968 /* Converting to internal Mule format ("decoding") */
1969 /************************************************************************/
1970
1971 /* A decoding stream is a stream used for decoding text (i.e.
1972 converting from some external format to internal format).
1973 The decoding-stream object keeps track of the actual coding
1974 stream, the stream that is at the other end, and data that
1975 needs to be persistent across the lifetime of the stream. */
1976
1977 /* Handle the EOL stuff related to just-read-in character C.
1978 EOL_TYPE is the EOL type of the coding stream.
1979 FLAGS is the current value of FLAGS in the coding stream, and may
1980 be modified by this macro. (The macro only looks at the
1981 CODING_STATE_CR flag.) DST is the Dynarr to which the decoded
1982 bytes are to be written. You need to also define a local goto
1983 label "label_continue_loop" that is at the end of the main
1984 character-reading loop.
1985
1986 If C is a CR character, then this macro handles it entirely and
1987 jumps to label_continue_loop. Otherwise, this macro does not add
1988 anything to DST, and continues normally. You should continue
1989 processing C normally after this macro. */
1990
1991 #define DECODE_HANDLE_EOL_TYPE(eol_type, c, flags, dst) \
1992 do { \
1993 if (c == '\r') \
1994 { \
1995 if (eol_type == EOL_CR) \
1996 Dynarr_add (dst, '\n'); \
1997 else if (eol_type != EOL_CRLF || flags & CODING_STATE_CR) \
1998 Dynarr_add (dst, c); \
1999 else \
2000 flags |= CODING_STATE_CR; \
2001 goto label_continue_loop; \
2002 } \
2003 else if (flags & CODING_STATE_CR) \
2004 { /* eol_type == CODING_SYSTEM_EOL_CRLF */ \
2005 if (c != '\n') \
2006 Dynarr_add (dst, '\r'); \
2007 flags &= ~CODING_STATE_CR; \
2008 } \
2009 } while (0)
2010
2011 /* C should be a binary character in the range 0 - 255; convert
2012 to internal format and add to Dynarr DST. */
2013
2014 #define DECODE_ADD_BINARY_CHAR(c, dst) \
2015 do { \
2016 if (BYTE_ASCII_P (c)) \
2017 Dynarr_add (dst, c); \
2018 else if (BYTE_C1_P (c)) \
2019 { \
2020 Dynarr_add (dst, LEADING_BYTE_CONTROL_1); \
2021 Dynarr_add (dst, c + 0x20); \
2022 } \
2023 else \
2024 { \
2025 Dynarr_add (dst, LEADING_BYTE_LATIN_ISO8859_1); \
2026 Dynarr_add (dst, c); \
2027 } \
2028 } while (0)
2029
2030 #define DECODE_OUTPUT_PARTIAL_CHAR(ch) \
2031 do { \
2032 if (ch) \
2033 { \
2034 DECODE_ADD_BINARY_CHAR (ch, dst); \
2035 ch = 0; \
2036 } \
2037 } while (0)
2038
2039 #define DECODE_HANDLE_END_OF_CONVERSION(flags, ch, dst) \
2040 do { \
2041 if (flags & CODING_STATE_END) \
2042 { \
2043 DECODE_OUTPUT_PARTIAL_CHAR (ch); \
2044 if (flags & CODING_STATE_CR) \
2045 Dynarr_add (dst, '\r'); \
2046 } \
2047 } while (0)
2048
2049 #define DECODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, decoding)
2050
2051 struct decoding_stream
2052 {
2053 /* Coding system that governs the conversion. */
2054 Lisp_Coding_System *codesys;
2055
2056 /* Stream that we read the encoded data from or
2057 write the decoded data to. */
2058 Lstream *other_end;
2059
2060 /* If we are reading, then we can return only a fixed amount of
2061 data, so if the conversion resulted in too much data, we store it
2062 here for retrieval the next time around. */
2063 unsigned_char_dynarr *runoff;
2064
2065 /* FLAGS holds flags indicating the current state of the decoding.
2066 Some of these flags are dependent on the coding system. */
2067 unsigned int flags;
2068
2069 /* CH holds a partially built-up character. Since we only deal
2070 with one- and two-byte characters at the moment, we only use
2071 this to store the first byte of a two-byte character. */
2072 unsigned int ch;
2073
2074 /* EOL_TYPE specifies the type of end-of-line conversion that
2075 currently applies. We need to keep this separate from the
2076 EOL type stored in CODESYS because the latter might indicate
2077 automatic EOL-type detection while the former will always
2078 indicate a particular EOL type. */
2079 eol_type_t eol_type;
2080 #ifdef MULE
2081 /* Additional ISO2022 information. We define the structure above
2082 because it's also needed by the detection routines. */
2083 struct iso2022_decoder iso2022;
2084
2085 /* Additional information (the state of the running CCL program)
2086 used by the CCL decoder. */
2087 struct ccl_program ccl;
2088
2089 /* counter for UTF-8 or UCS-4 */
2090 unsigned char counter;
2091 #endif
2092 struct detection_state decst;
2093 };
2094
2095 static Bytecount decoding_reader (Lstream *stream,
2096 unsigned char *data, Bytecount size);
2097 static Bytecount decoding_writer (Lstream *stream,
2098 const unsigned char *data, Bytecount size);
2099 static int decoding_rewinder (Lstream *stream);
2100 static int decoding_seekable_p (Lstream *stream);
2101 static int decoding_flusher (Lstream *stream);
2102 static int decoding_closer (Lstream *stream);
2103
2104 static Lisp_Object decoding_marker (Lisp_Object stream);
2105
2106 DEFINE_LSTREAM_IMPLEMENTATION ("decoding", lstream_decoding,
2107 sizeof (struct decoding_stream));
2108
2109 static Lisp_Object
2110 decoding_marker (Lisp_Object stream)
2111 {
2112 Lstream *str = DECODING_STREAM_DATA (XLSTREAM (stream))->other_end;
2113 Lisp_Object str_obj;
2114
2115 /* We do not need to mark the coding systems or charsets stored
2116 within the stream because they are stored in a global list
2117 and automatically marked. */
2118
2119 XSETLSTREAM (str_obj, str);
2120 mark_object (str_obj);
2121 if (str->imp->marker)
2122 return (str->imp->marker) (str_obj);
2123 else
2124 return Qnil;
2125 }
2126
2127 /* Read SIZE bytes of data and store it into DATA. We are a decoding stream
2128 so we read data from the other end, decode it, and store it into DATA. */
2129
2130 static Bytecount
2131 decoding_reader (Lstream *stream, unsigned char *data, Bytecount size)
2132 {
2133 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2134 unsigned char *orig_data = data;
2135 Bytecount read_size;
2136 int error_occurred = 0;
2137
2138 /* We need to interface to mule_decode(), which expects to take some
2139 amount of data and store the result into a Dynarr. We have
2140 mule_decode() store into str->runoff, and take data from there
2141 as necessary. */
2142
2143 /* We loop until we have enough data, reading chunks from the other
2144 end and decoding it. */
2145 while (1)
2146 {
2147 /* Take data from the runoff if we can. Make sure to take at
2148 most SIZE bytes, and delete the data from the runoff. */
2149 if (Dynarr_length (str->runoff) > 0)
2150 {
2151 Bytecount chunk = min (size, (Bytecount) Dynarr_length (str->runoff));
2152 memcpy (data, Dynarr_atp (str->runoff, 0), chunk);
2153 Dynarr_delete_many (str->runoff, 0, chunk);
2154 data += chunk;
2155 size -= chunk;
2156 }
2157
2158 if (size == 0)
2159 break; /* No more room for data */
2160
2161 if (str->flags & CODING_STATE_END)
2162 /* This means that on the previous iteration, we hit the EOF on
2163 the other end. We loop once more so that mule_decode() can
2164 output any final stuff it may be holding, or any "go back
2165 to a sane state" escape sequences. (This latter makes sense
2166 during encoding.) */
2167 break;
2168
2169 /* Exhausted the runoff, so get some more. DATA has at least
2170 SIZE bytes left of storage in it, so it's OK to read directly
2171 into it. (We'll be overwriting above, after we've decoded it
2172 into the runoff.) */
2173 read_size = Lstream_read (str->other_end, data, size);
2174 if (read_size < 0)
2175 {
2176 error_occurred = 1;
2177 break;
2178 }
2179 if (read_size == 0)
2180 /* There might be some more end data produced in the translation.
2181 See the comment above. */
2182 str->flags |= CODING_STATE_END;
2183 mule_decode (stream, (Extbyte *) data, str->runoff, read_size);
2184 }
2185
2186 if (data - orig_data == 0)
2187 return error_occurred ? -1 : 0;
2188 else
2189 return data - orig_data;
2190 }
2191
2192 static Bytecount
2193 decoding_writer (Lstream *stream, const unsigned char *data, Bytecount size)
2194 {
2195 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2196 Bytecount retval;
2197
2198 /* Decode all our data into the runoff, and then attempt to write
2199 it all out to the other end. Remove whatever chunk we succeeded
2200 in writing. */
2201 mule_decode (stream, (Extbyte *) data, str->runoff, size);
2202 retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0),
2203 Dynarr_length (str->runoff));
2204 if (retval > 0)
2205 Dynarr_delete_many (str->runoff, 0, retval);
2206 /* Do NOT return retval. The return value indicates how much
2207 of the incoming data was written, not how many bytes were
2208 written. */
2209 return size;
2210 }
2211
2212 static void
2213 reset_decoding_stream (struct decoding_stream *str)
2214 {
2215 #ifdef MULE
2216 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_ISO2022)
2217 {
2218 Lisp_Object coding_system;
2219 XSETCODING_SYSTEM (coding_system, str->codesys);
2220 reset_iso2022 (coding_system, &str->iso2022);
2221 }
2222 else if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_CCL)
2223 {
2224 setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_DECODE (str->codesys));
2225 }
2226 str->counter = 0;
2227 #endif /* MULE */
2228 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT
2229 || CODING_SYSTEM_EOL_TYPE (str->codesys) == EOL_AUTODETECT)
2230 {
2231 xzero (str->decst);
2232 str->decst.eol_type = EOL_AUTODETECT;
2233 str->decst.mask = ~0;
2234 }
2235 str->flags = str->ch = 0;
2236 } 4258 }
2237 4259
2238 static int 4260 static int
2239 decoding_rewinder (Lstream *stream) 4261 gzip_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value)
2240 { 4262 {
2241 struct decoding_stream *str = DECODING_STREAM_DATA (stream); 4263 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip);
2242 reset_decoding_stream (str); 4264
2243 Dynarr_reset (str->runoff); 4265 if (EQ (key, Qlevel))
2244 return Lstream_rewind (str->other_end); 4266 {
2245 } 4267 if (EQ (value, Qdefault))
2246 4268 data->level = -1;
2247 static int
2248 decoding_seekable_p (Lstream *stream)
2249 {
2250 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2251 return Lstream_seekable_p (str->other_end);
2252 }
2253
2254 static int
2255 decoding_flusher (Lstream *stream)
2256 {
2257 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2258 return Lstream_flush (str->other_end);
2259 }
2260
2261 static int
2262 decoding_closer (Lstream *stream)
2263 {
2264 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2265 if (stream->flags & LSTREAM_FL_WRITE)
2266 {
2267 str->flags |= CODING_STATE_END;
2268 decoding_writer (stream, 0, 0);
2269 }
2270 Dynarr_free (str->runoff);
2271 #ifdef MULE
2272 #ifdef ENABLE_COMPOSITE_CHARS
2273 if (str->iso2022.composite_chars)
2274 Dynarr_free (str->iso2022.composite_chars);
2275 #endif
2276 #endif
2277 return Lstream_close (str->other_end);
2278 }
2279
2280 Lisp_Object
2281 decoding_stream_coding_system (Lstream *stream)
2282 {
2283 Lisp_Object coding_system;
2284 struct decoding_stream *str = DECODING_STREAM_DATA (stream);
2285
2286 XSETCODING_SYSTEM (coding_system, str->codesys);
2287 return subsidiary_coding_system (coding_system, str->eol_type);
2288 }
2289
2290 void
2291 set_decoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys)
2292 {
2293 Lisp_Coding_System *cs = XCODING_SYSTEM (codesys);
2294 struct decoding_stream *str = DECODING_STREAM_DATA (lstr);
2295 str->codesys = cs;
2296 if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT)
2297 str->eol_type = CODING_SYSTEM_EOL_TYPE (cs);
2298 reset_decoding_stream (str);
2299 }
2300
2301 /* WARNING WARNING WARNING WARNING!!!!! If you open up a decoding
2302 stream for writing, no automatic code detection will be performed.
2303 The reason for this is that automatic code detection requires a
2304 seekable input. Things will also fail if you open a decoding
2305 stream for reading using a non-fully-specified coding system and
2306 a non-seekable input stream. */
2307
2308 static Lisp_Object
2309 make_decoding_stream_1 (Lstream *stream, Lisp_Object codesys,
2310 const char *mode)
2311 {
2312 Lstream *lstr = Lstream_new (lstream_decoding, mode);
2313 struct decoding_stream *str = DECODING_STREAM_DATA (lstr);
2314 Lisp_Object obj;
2315
2316 xzero (*str);
2317 str->other_end = stream;
2318 str->runoff = (unsigned_char_dynarr *) Dynarr_new (unsigned_char);
2319 str->eol_type = EOL_AUTODETECT;
2320 if (!strcmp (mode, "r")
2321 && Lstream_seekable_p (stream))
2322 /* We can determine the coding system now. */
2323 determine_real_coding_system (stream, &codesys, &str->eol_type);
2324 set_decoding_stream_coding_system (lstr, codesys);
2325 str->decst.eol_type = str->eol_type;
2326 str->decst.mask = ~0;
2327 XSETLSTREAM (obj, lstr);
2328 return obj;
2329 }
2330
2331 Lisp_Object
2332 make_decoding_input_stream (Lstream *stream, Lisp_Object codesys)
2333 {
2334 return make_decoding_stream_1 (stream, codesys, "r");
2335 }
2336
2337 Lisp_Object
2338 make_decoding_output_stream (Lstream *stream, Lisp_Object codesys)
2339 {
2340 return make_decoding_stream_1 (stream, codesys, "w");
2341 }
2342
2343 /* Note: the decode_coding_* functions all take the same
2344 arguments as mule_decode(), which is to say some SRC data of
2345 size N, which is to be stored into dynamic array DST.
2346 DECODING is the stream within which the decoding is
2347 taking place, but no data is actually read from or
2348 written to that stream; that is handled in decoding_reader()
2349 or decoding_writer(). This allows the same functions to
2350 be used for both reading and writing. */
2351
2352 static void
2353 mule_decode (Lstream *decoding, const Extbyte *src,
2354 unsigned_char_dynarr *dst, Bytecount n)
2355 {
2356 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
2357
2358 /* If necessary, do encoding-detection now. We do this when
2359 we're a writing stream or a non-seekable reading stream,
2360 meaning that we can't just process the whole input,
2361 rewind, and start over. */
2362
2363 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT ||
2364 str->eol_type == EOL_AUTODETECT)
2365 {
2366 Lisp_Object codesys;
2367
2368 XSETCODING_SYSTEM (codesys, str->codesys);
2369 detect_coding_type (&str->decst, src, n,
2370 CODING_SYSTEM_TYPE (str->codesys) !=
2371 CODESYS_AUTODETECT);
2372 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT &&
2373 str->decst.mask != ~0)
2374 /* #### This is cheesy. What we really ought to do is
2375 buffer up a certain amount of data so as to get a
2376 less random result. */
2377 codesys = coding_system_from_mask (str->decst.mask);
2378 str->eol_type = str->decst.eol_type;
2379 if (XCODING_SYSTEM (codesys) != str->codesys)
2380 {
2381 /* Preserve the CODING_STATE_END flag in case it was set.
2382 If we erase it, bad things might happen. */
2383 int was_end = str->flags & CODING_STATE_END;
2384 set_decoding_stream_coding_system (decoding, codesys);
2385 if (was_end)
2386 str->flags |= CODING_STATE_END;
2387 }
2388 }
2389
2390 switch (CODING_SYSTEM_TYPE (str->codesys))
2391 {
2392 #ifdef DEBUG_XEMACS
2393 case CODESYS_INTERNAL:
2394 Dynarr_add_many (dst, src, n);
2395 break;
2396 #endif
2397 case CODESYS_AUTODETECT:
2398 /* If we got this far and still haven't decided on the coding
2399 system, then do no conversion. */
2400 case CODESYS_NO_CONVERSION:
2401 decode_coding_no_conversion (decoding, src, dst, n);
2402 break;
2403 #ifdef MULE
2404 case CODESYS_SHIFT_JIS:
2405 decode_coding_sjis (decoding, src, dst, n);
2406 break;
2407 case CODESYS_BIG5:
2408 decode_coding_big5 (decoding, src, dst, n);
2409 break;
2410 case CODESYS_UCS4:
2411 decode_coding_ucs4 (decoding, src, dst, n);
2412 break;
2413 case CODESYS_UTF8:
2414 decode_coding_utf8 (decoding, src, dst, n);
2415 break;
2416 case CODESYS_CCL:
2417 str->ccl.last_block = str->flags & CODING_STATE_END;
2418 /* When applying ccl program to stream, MUST NOT set NULL
2419 pointer to src. */
2420 ccl_driver (&str->ccl, (src ? (unsigned char *)src : (unsigned char*)""),
2421 dst, n, 0, CCL_MODE_DECODING);
2422 break;
2423 case CODESYS_ISO2022:
2424 decode_coding_iso2022 (decoding, src, dst, n);
2425 break;
2426 #endif /* MULE */
2427 default:
2428 abort ();
2429 }
2430 }
2431
2432 DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /*
2433 Decode the text between START and END which is encoded in CODING-SYSTEM.
2434 This is useful if you've read in encoded text from a file without decoding
2435 it (e.g. you read in a JIS-formatted file but used the `binary' or
2436 `no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B").
2437 Return length of decoded text.
2438 BUFFER defaults to the current buffer if unspecified.
2439 */
2440 (start, end, coding_system, buffer))
2441 {
2442 Charbpos b, e;
2443 struct buffer *buf = decode_buffer (buffer, 0);
2444 Lisp_Object instream, lb_outstream, de_outstream, outstream;
2445 Lstream *istr, *ostr;
2446 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4;
2447
2448 get_buffer_range_char (buf, start, end, &b, &e, 0);
2449
2450 barf_if_buffer_read_only (buf, b, e);
2451
2452 coding_system = Fget_coding_system (coding_system);
2453 instream = make_lisp_buffer_input_stream (buf, b, e, 0);
2454 lb_outstream = make_lisp_buffer_output_stream (buf, b, 0);
2455 de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream),
2456 coding_system);
2457 outstream = make_encoding_output_stream (XLSTREAM (de_outstream),
2458 Fget_coding_system (Qbinary));
2459 istr = XLSTREAM (instream);
2460 ostr = XLSTREAM (outstream);
2461 GCPRO4 (instream, lb_outstream, de_outstream, outstream);
2462
2463 /* The chain of streams looks like this:
2464
2465 [BUFFER] <----- send through
2466 ------> [ENCODE AS BINARY]
2467 ------> [DECODE AS SPECIFIED]
2468 ------> [BUFFER]
2469 */
2470
2471 while (1)
2472 {
2473 char tempbuf[1024]; /* some random amount */
2474 Charbpos newpos, even_newer_pos;
2475 Charbpos oldpos = lisp_buffer_stream_startpos (istr);
2476 Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf));
2477
2478 if (!size_in_bytes)
2479 break;
2480 newpos = lisp_buffer_stream_startpos (istr);
2481 Lstream_write (ostr, tempbuf, size_in_bytes);
2482 even_newer_pos = lisp_buffer_stream_startpos (istr);
2483 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos),
2484 even_newer_pos, 0);
2485 }
2486 Lstream_close (istr);
2487 Lstream_close (ostr);
2488 UNGCPRO;
2489 Lstream_delete (istr);
2490 Lstream_delete (ostr);
2491 Lstream_delete (XLSTREAM (de_outstream));
2492 Lstream_delete (XLSTREAM (lb_outstream));
2493 return Qnil;
2494 }
2495
2496
2497 /************************************************************************/
2498 /* Converting to an external encoding ("encoding") */
2499 /************************************************************************/
2500
2501 /* An encoding stream is an output stream. When you create the
2502 stream, you specify the coding system that governs the encoding
2503 and another stream that the resulting encoded data is to be
2504 sent to, and then start sending data to it. */
2505
2506 #define ENCODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, encoding)
2507
2508 struct encoding_stream
2509 {
2510 /* Coding system that governs the conversion. */
2511 Lisp_Coding_System *codesys;
2512
2513 /* Stream that we read the encoded data from or
2514 write the decoded data to. */
2515 Lstream *other_end;
2516
2517 /* If we are reading, then we can return only a fixed amount of
2518 data, so if the conversion resulted in too much data, we store it
2519 here for retrieval the next time around. */
2520 unsigned_char_dynarr *runoff;
2521
2522 /* FLAGS holds flags indicating the current state of the encoding.
2523 Some of these flags are dependent on the coding system. */
2524 unsigned int flags;
2525
2526 /* CH holds a partially built-up character. Since we only deal
2527 with one- and two-byte characters at the moment, we only use
2528 this to store the first byte of a two-byte character. */
2529 unsigned int ch;
2530 #ifdef MULE
2531 /* Additional information used by the ISO2022 encoder. */
2532 struct
2533 {
2534 /* CHARSET holds the character sets currently assigned to the G0
2535 through G3 registers. It is initialized from the array
2536 INITIAL_CHARSET in CODESYS. */
2537 Lisp_Object charset[4];
2538
2539 /* Which registers are currently invoked into the left (GL) and
2540 right (GR) halves of the 8-bit encoding space? */
2541 int register_left, register_right;
2542
2543 /* Whether we need to explicitly designate the charset in the
2544 G? register before using it. It is initialized from the
2545 array FORCE_CHARSET_ON_OUTPUT in CODESYS. */
2546 unsigned char force_charset_on_output[4];
2547
2548 /* Other state variables that need to be preserved across
2549 invocations. */
2550 Lisp_Object current_charset;
2551 int current_half;
2552 int current_char_boundary;
2553 } iso2022;
2554
2555 /* Additional information (the state of the running CCL program)
2556 used by the CCL encoder. */
2557 struct ccl_program ccl;
2558 #endif /* MULE */
2559 };
2560
2561 static Bytecount encoding_reader (Lstream *stream, unsigned char *data, Bytecount size);
2562 static Bytecount encoding_writer (Lstream *stream, const unsigned char *data,
2563 Bytecount size);
2564 static int encoding_rewinder (Lstream *stream);
2565 static int encoding_seekable_p (Lstream *stream);
2566 static int encoding_flusher (Lstream *stream);
2567 static int encoding_closer (Lstream *stream);
2568
2569 static Lisp_Object encoding_marker (Lisp_Object stream);
2570
2571 DEFINE_LSTREAM_IMPLEMENTATION ("encoding", lstream_encoding,
2572 sizeof (struct encoding_stream));
2573
2574 static Lisp_Object
2575 encoding_marker (Lisp_Object stream)
2576 {
2577 Lstream *str = ENCODING_STREAM_DATA (XLSTREAM (stream))->other_end;
2578 Lisp_Object str_obj;
2579
2580 /* We do not need to mark the coding systems or charsets stored
2581 within the stream because they are stored in a global list
2582 and automatically marked. */
2583
2584 XSETLSTREAM (str_obj, str);
2585 mark_object (str_obj);
2586 if (str->imp->marker)
2587 return (str->imp->marker) (str_obj);
2588 else
2589 return Qnil;
2590 }
2591
2592 /* Read SIZE bytes of data and store it into DATA. We are a encoding stream
2593 so we read data from the other end, encode it, and store it into DATA. */
2594
2595 static Bytecount
2596 encoding_reader (Lstream *stream, unsigned char *data, Bytecount size)
2597 {
2598 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2599 unsigned char *orig_data = data;
2600 Bytecount read_size;
2601 int error_occurred = 0;
2602
2603 /* We need to interface to mule_encode(), which expects to take some
2604 amount of data and store the result into a Dynarr. We have
2605 mule_encode() store into str->runoff, and take data from there
2606 as necessary. */
2607
2608 /* We loop until we have enough data, reading chunks from the other
2609 end and encoding it. */
2610 while (1)
2611 {
2612 /* Take data from the runoff if we can. Make sure to take at
2613 most SIZE bytes, and delete the data from the runoff. */
2614 if (Dynarr_length (str->runoff) > 0)
2615 {
2616 int chunk = min ((int) size, Dynarr_length (str->runoff));
2617 memcpy (data, Dynarr_atp (str->runoff, 0), chunk);
2618 Dynarr_delete_many (str->runoff, 0, chunk);
2619 data += chunk;
2620 size -= chunk;
2621 }
2622
2623 if (size == 0)
2624 break; /* No more room for data */
2625
2626 if (str->flags & CODING_STATE_END)
2627 /* This means that on the previous iteration, we hit the EOF on
2628 the other end. We loop once more so that mule_encode() can
2629 output any final stuff it may be holding, or any "go back
2630 to a sane state" escape sequences. (This latter makes sense
2631 during encoding.) */
2632 break;
2633
2634 /* Exhausted the runoff, so get some more. DATA at least SIZE bytes
2635 left of storage in it, so it's OK to read directly into it.
2636 (We'll be overwriting above, after we've encoded it into the
2637 runoff.) */
2638 read_size = Lstream_read (str->other_end, data, size);
2639 if (read_size < 0)
2640 {
2641 error_occurred = 1;
2642 break;
2643 }
2644 if (read_size == 0)
2645 /* There might be some more end data produced in the translation.
2646 See the comment above. */
2647 str->flags |= CODING_STATE_END;
2648 mule_encode (stream, data, str->runoff, read_size);
2649 }
2650
2651 if (data == orig_data)
2652 return error_occurred ? -1 : 0;
2653 else
2654 return data - orig_data;
2655 }
2656
2657 static Bytecount
2658 encoding_writer (Lstream *stream, const unsigned char *data, Bytecount size)
2659 {
2660 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2661 Bytecount retval;
2662
2663 /* Encode all our data into the runoff, and then attempt to write
2664 it all out to the other end. Remove whatever chunk we succeeded
2665 in writing. */
2666 mule_encode (stream, data, str->runoff, size);
2667 retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0),
2668 Dynarr_length (str->runoff));
2669 if (retval > 0)
2670 Dynarr_delete_many (str->runoff, 0, retval);
2671 /* Do NOT return retval. The return value indicates how much
2672 of the incoming data was written, not how many bytes were
2673 written. */
2674 return size;
2675 }
2676
2677 static void
2678 reset_encoding_stream (struct encoding_stream *str)
2679 {
2680 #ifdef MULE
2681 switch (CODING_SYSTEM_TYPE (str->codesys))
2682 {
2683 case CODESYS_ISO2022:
2684 {
2685 int i;
2686
2687 for (i = 0; i < 4; i++)
2688 {
2689 str->iso2022.charset[i] =
2690 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (str->codesys, i);
2691 str->iso2022.force_charset_on_output[i] =
2692 CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (str->codesys, i);
2693 }
2694 str->iso2022.register_left = 0;
2695 str->iso2022.register_right = 1;
2696 str->iso2022.current_charset = Qnil;
2697 str->iso2022.current_half = 0;
2698 str->iso2022.current_char_boundary = 1;
2699 break;
2700 }
2701 case CODESYS_CCL:
2702 setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_ENCODE (str->codesys));
2703 break;
2704 default:
2705 break;
2706 }
2707 #endif /* MULE */
2708
2709 str->flags = str->ch = 0;
2710 }
2711
2712 static int
2713 encoding_rewinder (Lstream *stream)
2714 {
2715 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2716 reset_encoding_stream (str);
2717 Dynarr_reset (str->runoff);
2718 return Lstream_rewind (str->other_end);
2719 }
2720
2721 static int
2722 encoding_seekable_p (Lstream *stream)
2723 {
2724 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2725 return Lstream_seekable_p (str->other_end);
2726 }
2727
2728 static int
2729 encoding_flusher (Lstream *stream)
2730 {
2731 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2732 return Lstream_flush (str->other_end);
2733 }
2734
2735 static int
2736 encoding_closer (Lstream *stream)
2737 {
2738 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2739 if (stream->flags & LSTREAM_FL_WRITE)
2740 {
2741 str->flags |= CODING_STATE_END;
2742 encoding_writer (stream, 0, 0);
2743 }
2744 Dynarr_free (str->runoff);
2745 return Lstream_close (str->other_end);
2746 }
2747
2748 Lisp_Object
2749 encoding_stream_coding_system (Lstream *stream)
2750 {
2751 Lisp_Object coding_system;
2752 struct encoding_stream *str = ENCODING_STREAM_DATA (stream);
2753
2754 XSETCODING_SYSTEM (coding_system, str->codesys);
2755 return coding_system;
2756 }
2757
2758 void
2759 set_encoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys)
2760 {
2761 Lisp_Coding_System *cs = XCODING_SYSTEM (codesys);
2762 struct encoding_stream *str = ENCODING_STREAM_DATA (lstr);
2763 str->codesys = cs;
2764 reset_encoding_stream (str);
2765 }
2766
2767 static Lisp_Object
2768 make_encoding_stream_1 (Lstream *stream, Lisp_Object codesys,
2769 const char *mode)
2770 {
2771 Lstream *lstr = Lstream_new (lstream_encoding, mode);
2772 struct encoding_stream *str = ENCODING_STREAM_DATA (lstr);
2773 Lisp_Object obj;
2774
2775 xzero (*str);
2776 str->runoff = Dynarr_new (unsigned_char);
2777 str->other_end = stream;
2778 set_encoding_stream_coding_system (lstr, codesys);
2779 XSETLSTREAM (obj, lstr);
2780 return obj;
2781 }
2782
2783 Lisp_Object
2784 make_encoding_input_stream (Lstream *stream, Lisp_Object codesys)
2785 {
2786 return make_encoding_stream_1 (stream, codesys, "r");
2787 }
2788
2789 Lisp_Object
2790 make_encoding_output_stream (Lstream *stream, Lisp_Object codesys)
2791 {
2792 return make_encoding_stream_1 (stream, codesys, "w");
2793 }
2794
2795 /* Convert N bytes of internally-formatted data stored in SRC to an
2796 external format, according to the encoding stream ENCODING.
2797 Store the encoded data into DST. */
2798
2799 static void
2800 mule_encode (Lstream *encoding, const Intbyte *src,
2801 unsigned_char_dynarr *dst, Bytecount n)
2802 {
2803 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
2804
2805 switch (CODING_SYSTEM_TYPE (str->codesys))
2806 {
2807 #ifdef DEBUG_XEMACS
2808 case CODESYS_INTERNAL:
2809 Dynarr_add_many (dst, src, n);
2810 break;
2811 #endif
2812 case CODESYS_AUTODETECT:
2813 /* If we got this far and still haven't decided on the coding
2814 system, then do no conversion. */
2815 case CODESYS_NO_CONVERSION:
2816 encode_coding_no_conversion (encoding, src, dst, n);
2817 break;
2818 #ifdef MULE
2819 case CODESYS_SHIFT_JIS:
2820 encode_coding_sjis (encoding, src, dst, n);
2821 break;
2822 case CODESYS_BIG5:
2823 encode_coding_big5 (encoding, src, dst, n);
2824 break;
2825 case CODESYS_UCS4:
2826 encode_coding_ucs4 (encoding, src, dst, n);
2827 break;
2828 case CODESYS_UTF8:
2829 encode_coding_utf8 (encoding, src, dst, n);
2830 break;
2831 case CODESYS_CCL:
2832 str->ccl.last_block = str->flags & CODING_STATE_END;
2833 /* When applying ccl program to stream, MUST NOT set NULL
2834 pointer to src. */
2835 ccl_driver (&str->ccl, ((src) ? src : (unsigned char*)""),
2836 dst, n, 0, CCL_MODE_ENCODING);
2837 break;
2838 case CODESYS_ISO2022:
2839 encode_coding_iso2022 (encoding, src, dst, n);
2840 break;
2841 #endif /* MULE */
2842 default:
2843 abort ();
2844 }
2845 }
2846
2847 DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /*
2848 Encode the text between START and END using CODING-SYSTEM.
2849 This will, for example, convert Japanese characters into stuff such as
2850 "^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded
2851 text. BUFFER defaults to the current buffer if unspecified.
2852 */
2853 (start, end, coding_system, buffer))
2854 {
2855 Charbpos b, e;
2856 struct buffer *buf = decode_buffer (buffer, 0);
2857 Lisp_Object instream, lb_outstream, de_outstream, outstream;
2858 Lstream *istr, *ostr;
2859 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4;
2860
2861 get_buffer_range_char (buf, start, end, &b, &e, 0);
2862
2863 barf_if_buffer_read_only (buf, b, e);
2864
2865 coding_system = Fget_coding_system (coding_system);
2866 instream = make_lisp_buffer_input_stream (buf, b, e, 0);
2867 lb_outstream = make_lisp_buffer_output_stream (buf, b, 0);
2868 de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream),
2869 Fget_coding_system (Qbinary));
2870 outstream = make_encoding_output_stream (XLSTREAM (de_outstream),
2871 coding_system);
2872 istr = XLSTREAM (instream);
2873 ostr = XLSTREAM (outstream);
2874 GCPRO4 (instream, outstream, de_outstream, lb_outstream);
2875 /* The chain of streams looks like this:
2876
2877 [BUFFER] <----- send through
2878 ------> [ENCODE AS SPECIFIED]
2879 ------> [DECODE AS BINARY]
2880 ------> [BUFFER]
2881 */
2882 while (1)
2883 {
2884 char tempbuf[1024]; /* some random amount */
2885 Charbpos newpos, even_newer_pos;
2886 Charbpos oldpos = lisp_buffer_stream_startpos (istr);
2887 Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf));
2888
2889 if (!size_in_bytes)
2890 break;
2891 newpos = lisp_buffer_stream_startpos (istr);
2892 Lstream_write (ostr, tempbuf, size_in_bytes);
2893 even_newer_pos = lisp_buffer_stream_startpos (istr);
2894 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos),
2895 even_newer_pos, 0);
2896 }
2897
2898 {
2899 Charcount retlen =
2900 lisp_buffer_stream_startpos (XLSTREAM (instream)) - b;
2901 Lstream_close (istr);
2902 Lstream_close (ostr);
2903 UNGCPRO;
2904 Lstream_delete (istr);
2905 Lstream_delete (ostr);
2906 Lstream_delete (XLSTREAM (de_outstream));
2907 Lstream_delete (XLSTREAM (lb_outstream));
2908 return make_int (retlen);
2909 }
2910 }
2911
2912 #ifdef MULE
2913
2914 /************************************************************************/
2915 /* Shift-JIS methods */
2916 /************************************************************************/
2917
2918 /* Shift-JIS is a coding system encoding three character sets: ASCII, right
2919 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded
2920 as is. A character of JISX0201-Kana (DIMENSION1_CHARS94 character set) is
2921 encoded by "position-code + 0x80". A character of JISX0208
2922 (DIMENSION2_CHARS94 character set) is encoded in 2-byte but two
2923 position-codes are divided and shifted so that it fit in the range
2924 below.
2925
2926 --- CODE RANGE of Shift-JIS ---
2927 (character set) (range)
2928 ASCII 0x00 .. 0x7F
2929 JISX0201-Kana 0xA0 .. 0xDF
2930 JISX0208 (1st byte) 0x80 .. 0x9F and 0xE0 .. 0xEF
2931 (2nd byte) 0x40 .. 0x7E and 0x80 .. 0xFC
2932 -------------------------------
2933
2934 */
2935
2936 /* Is this the first byte of a Shift-JIS two-byte char? */
2937
2938 #define BYTE_SJIS_TWO_BYTE_1_P(c) \
2939 (((c) >= 0x81 && (c) <= 0x9F) || ((c) >= 0xE0 && (c) <= 0xEF))
2940
2941 /* Is this the second byte of a Shift-JIS two-byte char? */
2942
2943 #define BYTE_SJIS_TWO_BYTE_2_P(c) \
2944 (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0x80 && (c) <= 0xFC))
2945
2946 #define BYTE_SJIS_KATAKANA_P(c) \
2947 ((c) >= 0xA1 && (c) <= 0xDF)
2948
2949 static int
2950 detect_coding_sjis (struct detection_state *st, const Extbyte *src, Bytecount n)
2951 {
2952 while (n--)
2953 {
2954 unsigned char c = *(unsigned char *)src++;
2955 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO)
2956 return 0;
2957 if (st->shift_jis.in_second_byte)
2958 {
2959 st->shift_jis.in_second_byte = 0;
2960 if (c < 0x40)
2961 return 0;
2962 }
2963 else if ((c >= 0x80 && c < 0xA0) || c >= 0xE0)
2964 st->shift_jis.in_second_byte = 1;
2965 }
2966 return CODING_CATEGORY_SHIFT_JIS_MASK;
2967 }
2968
2969 /* Convert Shift-JIS data to internal format. */
2970
2971 static void
2972 decode_coding_sjis (Lstream *decoding, const Extbyte *src,
2973 unsigned_char_dynarr *dst, Bytecount n)
2974 {
2975 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
2976 unsigned int flags = str->flags;
2977 unsigned int ch = str->ch;
2978 eol_type_t eol_type = str->eol_type;
2979
2980 while (n--)
2981 {
2982 unsigned char c = *(unsigned char *)src++;
2983
2984 if (ch)
2985 {
2986 /* Previous character was first byte of Shift-JIS Kanji char. */
2987 if (BYTE_SJIS_TWO_BYTE_2_P (c))
2988 {
2989 unsigned char e1, e2;
2990
2991 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208);
2992 DECODE_SJIS (ch, c, e1, e2);
2993 Dynarr_add (dst, e1);
2994 Dynarr_add (dst, e2);
2995 }
2996 else
2997 {
2998 DECODE_ADD_BINARY_CHAR (ch, dst);
2999 DECODE_ADD_BINARY_CHAR (c, dst);
3000 }
3001 ch = 0;
3002 }
3003 else 4269 else
3004 { 4270 {
3005 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); 4271 CHECK_INT (value);
3006 if (BYTE_SJIS_TWO_BYTE_1_P (c)) 4272 check_int_range (XINT (value), 0, 9);
3007 ch = c; 4273 data->level = XINT (value);
3008 else if (BYTE_SJIS_KATAKANA_P (c))
3009 {
3010 Dynarr_add (dst, LEADING_BYTE_KATAKANA_JISX0201);
3011 Dynarr_add (dst, c);
3012 }
3013 else
3014 DECODE_ADD_BINARY_CHAR (c, dst);
3015 } 4274 }
3016 label_continue_loop:; 4275 }
3017 } 4276 else
3018 4277 return 0;
3019 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); 4278 return 1;
3020 4279 }
3021 str->flags = flags; 4280
3022 str->ch = ch; 4281 static Lisp_Object
3023 } 4282 gzip_getprop (Lisp_Object coding_system, Lisp_Object prop)
3024 4283 {
3025 /* Convert internally-formatted data to Shift-JIS. */ 4284 struct gzip_coding_system *data =
4285 XCODING_SYSTEM_TYPE_DATA (coding_system, gzip);
4286
4287 if (EQ (prop, Qlevel))
4288 {
4289 if (data->level == -1)
4290 return Qdefault;
4291 return make_int (data->level);
4292 }
4293
4294 return Qunbound;
4295 }
3026 4296
3027 static void 4297 static void
3028 encode_coding_sjis (Lstream *encoding, const Intbyte *src, 4298 gzip_init_coding_stream (struct coding_stream *str)
3029 unsigned_char_dynarr *dst, Bytecount n) 4299 {
3030 { 4300 struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip);
3031 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); 4301 if (data->stream_initted)
3032 unsigned int flags = str->flags; 4302 {
3033 unsigned int ch = str->ch; 4303 if (str->direction == CODING_DECODE)
3034 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); 4304 inflateEnd (&data->stream);
3035 4305 else
3036 while (n--) 4306 deflateEnd (&data->stream);
3037 { 4307 data->stream_initted = 0;
3038 Intbyte c = *src++; 4308 }
3039 if (c == '\n') 4309 data->reached_eof = 0;
4310 }
4311
4312 static void
4313 gzip_rewind_coding_stream (struct coding_stream *str)
4314 {
4315 gzip_init_coding_stream (str);
4316 }
4317
4318 static Bytecount
4319 gzip_convert (struct coding_stream *str,
4320 const UExtbyte *src,
4321 unsigned_char_dynarr *dst, Bytecount n)
4322 {
4323 struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip);
4324 int zerr;
4325 if (str->direction == CODING_DECODE)
4326 {
4327 if (data->reached_eof)
4328 return n; /* eat the data */
4329
4330 if (!data->stream_initted)
3040 { 4331 {
3041 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) 4332 xzero (data->stream);
3042 Dynarr_add (dst, '\r'); 4333 if (inflateInit (&data->stream) != Z_OK)
3043 if (eol_type != EOL_CR) 4334 return LSTREAM_ERROR;
3044 Dynarr_add (dst, '\n'); 4335 data->stream_initted = 1;
3045 ch = 0;
3046 } 4336 }
3047 else if (BYTE_ASCII_P (c)) 4337
4338 data->stream.next_in = (Bytef *) src;
4339 data->stream.avail_in = n;
4340
4341 /* Normally we stop when we've fed all data to the decompressor; but
4342 if we're at the end of the input, and the decompressor hasn't
4343 reported EOF, we need to keep going, as there might be more output
4344 to generate. Z_OK from the decompressor means input was processed
4345 or output was generated; if neither, we break out of the loop.
4346 Other return values are:
4347
4348 Z_STREAM_END EOF from decompressor
4349 Z_DATA_ERROR Corrupted data
4350 Z_BUF_ERROR No progress possible (this should happen if
4351 we try to feed it an incomplete file)
4352 Z_MEM_ERROR Out of memory
4353 Z_STREAM_ERROR (should never happen)
4354 Z_NEED_DICT (#### when will this happen?)
4355 */
4356 while (data->stream.avail_in > 0 || str->eof)
3048 { 4357 {
3049 Dynarr_add (dst, c); 4358 /* Reserve an output buffer of the same size as the input buffer;
3050 ch = 0; 4359 if that's not enough, we keep reserving the same size. */
4360 Bytecount reserved = n;
4361 Dynarr_add_many (dst, 0, reserved);
4362 /* Careful here! Don't retrieve the pointer until after
4363 reserving the space, or it might be bogus */
4364 data->stream.next_out =
4365 Dynarr_atp (dst, Dynarr_length (dst) - reserved);
4366 data->stream.avail_out = reserved;
4367 zerr = inflate (&data->stream, Z_NO_FLUSH);
4368 /* Lop off the unused portion */
4369 Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out);
4370 if (zerr != Z_OK)
4371 break;
3051 } 4372 }
3052 else if (INTBYTE_LEADING_BYTE_P (c)) 4373
3053 ch = (c == LEADING_BYTE_KATAKANA_JISX0201 || 4374 if (zerr == Z_STREAM_END)
3054 c == LEADING_BYTE_JAPANESE_JISX0208_1978 || 4375 data->reached_eof = 1;
3055 c == LEADING_BYTE_JAPANESE_JISX0208) ? c : 0; 4376
3056 else if (ch) 4377 if ((Bytecount) data->stream.avail_in < n)
4378 return n - data->stream.avail_in;
4379
4380 if (zerr == Z_OK || zerr == Z_STREAM_END)
4381 return 0;
4382
4383 return LSTREAM_ERROR;
4384 }
4385 else
4386 {
4387 if (!data->stream_initted)
3057 { 4388 {
3058 if (ch == LEADING_BYTE_KATAKANA_JISX0201) 4389 int level = XCODING_SYSTEM_GZIP_LEVEL (str->codesys);
3059 { 4390 xzero (data->stream);
3060 Dynarr_add (dst, c); 4391 if (deflateInit (&data->stream,
3061 ch = 0; 4392 level == -1 ? Z_DEFAULT_COMPRESSION : level) !=
3062 } 4393 Z_OK)
3063 else if (ch == LEADING_BYTE_JAPANESE_JISX0208_1978 || 4394 return LSTREAM_ERROR;
3064 ch == LEADING_BYTE_JAPANESE_JISX0208) 4395 data->stream_initted = 1;
3065 ch = c;
3066 else
3067 {
3068 unsigned char j1, j2;
3069 ENCODE_SJIS (ch, c, j1, j2);
3070 Dynarr_add (dst, j1);
3071 Dynarr_add (dst, j2);
3072 ch = 0;
3073 }
3074 } 4396 }
3075 } 4397
3076 4398 data->stream.next_in = (Bytef *) src;
3077 str->flags = flags; 4399 data->stream.avail_in = n;
3078 str->ch = ch; 4400
3079 } 4401 /* Normally we stop when we've fed all data to the compressor; but if
3080 4402 we're at the end of the input, and the compressor hasn't reported
3081 DEFUN ("decode-shift-jis-char", Fdecode_shift_jis_char, 1, 1, 0, /* 4403 EOF, we need to keep going, as there might be more output to
3082 Decode a JISX0208 character of Shift-JIS coding-system. 4404 generate. (To signal EOF on our end, we set the FLUSH parameter
3083 CODE is the character code in Shift-JIS as a cons of type bytes. 4405 to Z_FINISH; when all data is output, Z_STREAM_END will be
3084 Return the corresponding character. 4406 returned.) Z_OK from the compressor means input was processed or
3085 */ 4407 output was generated; if neither, we break out of the loop. Other
3086 (code)) 4408 return values are:
3087 { 4409
3088 unsigned char c1, c2, s1, s2; 4410 Z_STREAM_END EOF from compressor
3089 4411 Z_BUF_ERROR No progress possible (should never happen)
3090 CHECK_CONS (code); 4412 Z_STREAM_ERROR (should never happen)
3091 CHECK_INT (XCAR (code)); 4413 */
3092 CHECK_INT (XCDR (code)); 4414 while (data->stream.avail_in > 0 || str->eof)
3093 s1 = XINT (XCAR (code)); 4415 {
3094 s2 = XINT (XCDR (code)); 4416 /* Reserve an output buffer of the same size as the input buffer;
3095 if (BYTE_SJIS_TWO_BYTE_1_P (s1) && 4417 if that's not enough, we keep reserving the same size. */
3096 BYTE_SJIS_TWO_BYTE_2_P (s2)) 4418 Bytecount reserved = n;
3097 { 4419 Dynarr_add_many (dst, 0, reserved);
3098 DECODE_SJIS (s1, s2, c1, c2); 4420 /* Careful here! Don't retrieve the pointer until after
3099 return make_char (MAKE_CHAR (Vcharset_japanese_jisx0208, 4421 reserving the space, or it might be bogus */
3100 c1 & 0x7F, c2 & 0x7F)); 4422 data->stream.next_out =
3101 } 4423 Dynarr_atp (dst, Dynarr_length (dst) - reserved);
3102 else 4424 data->stream.avail_out = reserved;
3103 return Qnil; 4425 zerr =
3104 } 4426 deflate (&data->stream,
3105 4427 str->eof ? Z_FINISH : Z_NO_FLUSH);
3106 DEFUN ("encode-shift-jis-char", Fencode_shift_jis_char, 1, 1, 0, /* 4428 /* Lop off the unused portion */
3107 Encode a JISX0208 character CHARACTER to SHIFT-JIS coding-system. 4429 Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out);
3108 Return the corresponding character code in SHIFT-JIS as a cons of two bytes. 4430 if (zerr != Z_OK)
3109 */ 4431 break;
3110 (character)) 4432 }
3111 { 4433
3112 Lisp_Object charset; 4434 if ((Bytecount) data->stream.avail_in < n)
3113 int c1, c2, s1, s2; 4435 return n - data->stream.avail_in;
3114 4436
3115 CHECK_CHAR_COERCE_INT (character); 4437 if (zerr == Z_OK || zerr == Z_STREAM_END)
3116 BREAKUP_CHAR (XCHAR (character), charset, c1, c2); 4438 return 0;
3117 if (EQ (charset, Vcharset_japanese_jisx0208)) 4439
3118 { 4440 return LSTREAM_ERROR;
3119 ENCODE_SJIS (c1 | 0x80, c2 | 0x80, s1, s2); 4441 }
3120 return Fcons (make_int (s1), make_int (s2)); 4442 }
3121 } 4443
3122 else 4444 #endif /* HAVE_ZLIB */
3123 return Qnil;
3124 }
3125 4445
3126 4446
3127 /************************************************************************/
3128 /* Big5 methods */
3129 /************************************************************************/
3130
3131 /* BIG5 is a coding system encoding two character sets: ASCII and
3132 Big5. An ASCII character is encoded as is. Big5 is a two-byte
3133 character set and is encoded in two-byte.
3134
3135 --- CODE RANGE of BIG5 ---
3136 (character set) (range)
3137 ASCII 0x00 .. 0x7F
3138 Big5 (1st byte) 0xA1 .. 0xFE
3139 (2nd byte) 0x40 .. 0x7E and 0xA1 .. 0xFE
3140 --------------------------
3141
3142 Since the number of characters in Big5 is larger than maximum
3143 characters in Emacs' charset (96x96), it can't be handled as one
3144 charset. So, in Emacs, Big5 is divided into two: `charset-big5-1'
3145 and `charset-big5-2'. Both <type>s are DIMENSION2_CHARS94. The former
3146 contains frequently used characters and the latter contains less
3147 frequently used characters. */
3148
3149 #define BYTE_BIG5_TWO_BYTE_1_P(c) \
3150 ((c) >= 0xA1 && (c) <= 0xFE)
3151
3152 /* Is this the second byte of a Shift-JIS two-byte char? */
3153
3154 #define BYTE_BIG5_TWO_BYTE_2_P(c) \
3155 (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0xA1 && (c) <= 0xFE))
3156
3157 /* Number of Big5 characters which have the same code in 1st byte. */
3158
3159 #define BIG5_SAME_ROW (0xFF - 0xA1 + 0x7F - 0x40)
3160
3161 /* Code conversion macros. These are macros because they are used in
3162 inner loops during code conversion.
3163
3164 Note that temporary variables in macros introduce the classic
3165 dynamic-scoping problems with variable names. We use capital-
3166 lettered variables in the assumption that XEmacs does not use
3167 capital letters in variables except in a very formalized way
3168 (e.g. Qstring). */
3169
3170 /* Convert Big5 code (b1, b2) into its internal string representation
3171 (lb, c1, c2). */
3172
3173 /* There is a much simpler way to split the Big5 charset into two.
3174 For the moment I'm going to leave the algorithm as-is because it
3175 claims to separate out the most-used characters into a single
3176 charset, which perhaps will lead to optimizations in various
3177 places.
3178
3179 The way the algorithm works is something like this:
3180
3181 Big5 can be viewed as a 94x157 charset, where the row is
3182 encoded into the bytes 0xA1 .. 0xFE and the column is encoded
3183 into the bytes 0x40 .. 0x7E and 0xA1 .. 0xFE. As for frequency,
3184 the split between low and high column numbers is apparently
3185 meaningless; ascending rows produce less and less frequent chars.
3186 Therefore, we assign the lower half of rows (0xA1 .. 0xC8) to
3187 the first charset, and the upper half (0xC9 .. 0xFE) to the
3188 second. To do the conversion, we convert the character into
3189 a single number where 0 .. 156 is the first row, 157 .. 313
3190 is the second, etc. That way, the characters are ordered by
3191 decreasing frequency. Then we just chop the space in two
3192 and coerce the result into a 94x94 space.
3193 */
3194
3195 #define DECODE_BIG5(b1, b2, lb, c1, c2) do \
3196 { \
3197 int B1 = b1, B2 = b2; \
3198 int I \
3199 = (B1 - 0xA1) * BIG5_SAME_ROW + B2 - (B2 < 0x7F ? 0x40 : 0x62); \
3200 \
3201 if (B1 < 0xC9) \
3202 { \
3203 lb = LEADING_BYTE_CHINESE_BIG5_1; \
3204 } \
3205 else \
3206 { \
3207 lb = LEADING_BYTE_CHINESE_BIG5_2; \
3208 I -= (BIG5_SAME_ROW) * (0xC9 - 0xA1); \
3209 } \
3210 c1 = I / (0xFF - 0xA1) + 0xA1; \
3211 c2 = I % (0xFF - 0xA1) + 0xA1; \
3212 } while (0)
3213
3214 /* Convert the internal string representation of a Big5 character
3215 (lb, c1, c2) into Big5 code (b1, b2). */
3216
3217 #define ENCODE_BIG5(lb, c1, c2, b1, b2) do \
3218 { \
3219 int I = ((c1) - 0xA1) * (0xFF - 0xA1) + ((c2) - 0xA1); \
3220 \
3221 if (lb == LEADING_BYTE_CHINESE_BIG5_2) \
3222 { \
3223 I += BIG5_SAME_ROW * (0xC9 - 0xA1); \
3224 } \
3225 b1 = I / BIG5_SAME_ROW + 0xA1; \
3226 b2 = I % BIG5_SAME_ROW; \
3227 b2 += b2 < 0x3F ? 0x40 : 0x62; \
3228 } while (0)
3229
3230 static int
3231 detect_coding_big5 (struct detection_state *st, const Extbyte *src, Bytecount n)
3232 {
3233 while (n--)
3234 {
3235 unsigned char c = *(unsigned char *)src++;
3236 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO ||
3237 (c >= 0x80 && c <= 0xA0))
3238 return 0;
3239 if (st->big5.in_second_byte)
3240 {
3241 st->big5.in_second_byte = 0;
3242 if (c < 0x40 || (c >= 0x80 && c <= 0xA0))
3243 return 0;
3244 }
3245 else if (c >= 0xA1)
3246 st->big5.in_second_byte = 1;
3247 }
3248 return CODING_CATEGORY_BIG5_MASK;
3249 }
3250
3251 /* Convert Big5 data to internal format. */
3252
3253 static void
3254 decode_coding_big5 (Lstream *decoding, const Extbyte *src,
3255 unsigned_char_dynarr *dst, Bytecount n)
3256 {
3257 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
3258 unsigned int flags = str->flags;
3259 unsigned int ch = str->ch;
3260 eol_type_t eol_type = str->eol_type;
3261
3262 while (n--)
3263 {
3264 unsigned char c = *(unsigned char *)src++;
3265 if (ch)
3266 {
3267 /* Previous character was first byte of Big5 char. */
3268 if (BYTE_BIG5_TWO_BYTE_2_P (c))
3269 {
3270 unsigned char b1, b2, b3;
3271 DECODE_BIG5 (ch, c, b1, b2, b3);
3272 Dynarr_add (dst, b1);
3273 Dynarr_add (dst, b2);
3274 Dynarr_add (dst, b3);
3275 }
3276 else
3277 {
3278 DECODE_ADD_BINARY_CHAR (ch, dst);
3279 DECODE_ADD_BINARY_CHAR (c, dst);
3280 }
3281 ch = 0;
3282 }
3283 else
3284 {
3285 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst);
3286 if (BYTE_BIG5_TWO_BYTE_1_P (c))
3287 ch = c;
3288 else
3289 DECODE_ADD_BINARY_CHAR (c, dst);
3290 }
3291 label_continue_loop:;
3292 }
3293
3294 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst);
3295
3296 str->flags = flags;
3297 str->ch = ch;
3298 }
3299
3300 /* Convert internally-formatted data to Big5. */
3301
3302 static void
3303 encode_coding_big5 (Lstream *encoding, const Intbyte *src,
3304 unsigned_char_dynarr *dst, Bytecount n)
3305 {
3306 unsigned char c;
3307 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
3308 unsigned int flags = str->flags;
3309 unsigned int ch = str->ch;
3310 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys);
3311
3312 while (n--)
3313 {
3314 c = *src++;
3315 if (c == '\n')
3316 {
3317 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT)
3318 Dynarr_add (dst, '\r');
3319 if (eol_type != EOL_CR)
3320 Dynarr_add (dst, '\n');
3321 }
3322 else if (BYTE_ASCII_P (c))
3323 {
3324 /* ASCII. */
3325 Dynarr_add (dst, c);
3326 }
3327 else if (INTBYTE_LEADING_BYTE_P (c))
3328 {
3329 if (c == LEADING_BYTE_CHINESE_BIG5_1 ||
3330 c == LEADING_BYTE_CHINESE_BIG5_2)
3331 {
3332 /* A recognized leading byte. */
3333 ch = c;
3334 continue; /* not done with this character. */
3335 }
3336 /* otherwise just ignore this character. */
3337 }
3338 else if (ch == LEADING_BYTE_CHINESE_BIG5_1 ||
3339 ch == LEADING_BYTE_CHINESE_BIG5_2)
3340 {
3341 /* Previous char was a recognized leading byte. */
3342 ch = (ch << 8) | c;
3343 continue; /* not done with this character. */
3344 }
3345 else if (ch)
3346 {
3347 /* Encountering second byte of a Big5 character. */
3348 unsigned char b1, b2;
3349
3350 ENCODE_BIG5 (ch >> 8, ch & 0xFF, c, b1, b2);
3351 Dynarr_add (dst, b1);
3352 Dynarr_add (dst, b2);
3353 }
3354
3355 ch = 0;
3356 }
3357
3358 str->flags = flags;
3359 str->ch = ch;
3360 }
3361
3362
3363 DEFUN ("decode-big5-char", Fdecode_big5_char, 1, 1, 0, /*
3364 Decode a Big5 character CODE of BIG5 coding-system.
3365 CODE is the character code in BIG5, a cons of two integers.
3366 Return the corresponding character.
3367 */
3368 (code))
3369 {
3370 unsigned char c1, c2, b1, b2;
3371
3372 CHECK_CONS (code);
3373 CHECK_INT (XCAR (code));
3374 CHECK_INT (XCDR (code));
3375 b1 = XINT (XCAR (code));
3376 b2 = XINT (XCDR (code));
3377 if (BYTE_BIG5_TWO_BYTE_1_P (b1) &&
3378 BYTE_BIG5_TWO_BYTE_2_P (b2))
3379 {
3380 int leading_byte;
3381 Lisp_Object charset;
3382 DECODE_BIG5 (b1, b2, leading_byte, c1, c2);
3383 charset = CHARSET_BY_LEADING_BYTE (leading_byte);
3384 return make_char (MAKE_CHAR (charset, c1 & 0x7F, c2 & 0x7F));
3385 }
3386 else
3387 return Qnil;
3388 }
3389
3390 DEFUN ("encode-big5-char", Fencode_big5_char, 1, 1, 0, /*
3391 Encode the Big5 character CHARACTER in the BIG5 coding-system.
3392 Return the corresponding character code in Big5.
3393 */
3394 (character))
3395 {
3396 Lisp_Object charset;
3397 int c1, c2, b1, b2;
3398
3399 CHECK_CHAR_COERCE_INT (character);
3400 BREAKUP_CHAR (XCHAR (character), charset, c1, c2);
3401 if (EQ (charset, Vcharset_chinese_big5_1) ||
3402 EQ (charset, Vcharset_chinese_big5_2))
3403 {
3404 ENCODE_BIG5 (XCHARSET_LEADING_BYTE (charset), c1 | 0x80, c2 | 0x80,
3405 b1, b2);
3406 return Fcons (make_int (b1), make_int (b2));
3407 }
3408 else
3409 return Qnil;
3410 }
3411
3412
3413 /************************************************************************/
3414 /* UCS-4 methods */
3415 /* */
3416 /* UCS-4 character codes are implemented as nonnegative integers. */
3417 /* */
3418 /************************************************************************/
3419
3420
3421 DEFUN ("set-ucs-char", Fset_ucs_char, 2, 2, 0, /*
3422 Map UCS-4 code CODE to Mule character CHARACTER.
3423
3424 Return T on success, NIL on failure.
3425 */
3426 (code, character))
3427 {
3428 EMACS_INT c;
3429
3430 CHECK_CHAR (character);
3431 CHECK_NATNUM (code);
3432 c = XINT (code);
3433
3434 if (c < countof (fcd->ucs_to_mule_table))
3435 {
3436 fcd->ucs_to_mule_table[c] = character;
3437 return Qt;
3438 }
3439 else
3440 return Qnil;
3441 }
3442
3443 static Lisp_Object
3444 ucs_to_char (unsigned long code)
3445 {
3446 if (code < countof (fcd->ucs_to_mule_table))
3447 {
3448 return fcd->ucs_to_mule_table[code];
3449 }
3450 else if ((0xe00000 <= code) && (code <= 0xe00000 + 94 * 94 * 14))
3451 {
3452 unsigned int c;
3453
3454 code -= 0xe00000;
3455 c = code % (94 * 94);
3456 return make_char
3457 (MAKE_CHAR (CHARSET_BY_ATTRIBUTES
3458 (CHARSET_TYPE_94X94, code / (94 * 94) + '@',
3459 CHARSET_LEFT_TO_RIGHT),
3460 c / 94 + 33, c % 94 + 33));
3461 }
3462 else
3463 return Qnil;
3464 }
3465
3466 DEFUN ("ucs-char", Fucs_char, 1, 1, 0, /*
3467 Return Mule character corresponding to UCS code CODE (a positive integer).
3468 */
3469 (code))
3470 {
3471 CHECK_NATNUM (code);
3472 return ucs_to_char (XINT (code));
3473 }
3474
3475 DEFUN ("set-char-ucs", Fset_char_ucs, 2, 2, 0, /*
3476 Map Mule character CHARACTER to UCS code CODE (a positive integer).
3477 */
3478 (character, code))
3479 {
3480 /* #### Isn't this gilding the lily? Fput_char_table checks its args.
3481 Fset_char_ucs is more restrictive on index arg, but should
3482 check code arg in a char_table method. */
3483 CHECK_CHAR (character);
3484 CHECK_NATNUM (code);
3485 return Fput_char_table (character, code, mule_to_ucs_table);
3486 }
3487
3488 DEFUN ("char-ucs", Fchar_ucs, 1, 1, 0, /*
3489 Return the UCS code (a positive integer) corresponding to CHARACTER.
3490 */
3491 (character))
3492 {
3493 return Fget_char_table (character, mule_to_ucs_table);
3494 }
3495
3496 /* Decode a UCS-4 character into a buffer. If the lookup fails, use
3497 <GETA MARK> (U+3013) of JIS X 0208, which means correct character
3498 is not found, instead.
3499 #### do something more appropriate (use blob?)
3500 Danger, Will Robinson! Data loss. Should we signal user? */
3501 static void
3502 decode_ucs4 (unsigned long ch, unsigned_char_dynarr *dst)
3503 {
3504 Lisp_Object chr = ucs_to_char (ch);
3505
3506 if (! NILP (chr))
3507 {
3508 Intbyte work[MAX_EMCHAR_LEN];
3509 int len;
3510
3511 ch = XCHAR (chr);
3512 len = (ch < 128) ?
3513 simple_set_charptr_emchar (work, ch) :
3514 non_ascii_set_charptr_emchar (work, ch);
3515 Dynarr_add_many (dst, work, len);
3516 }
3517 else
3518 {
3519 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208);
3520 Dynarr_add (dst, 34 + 128);
3521 Dynarr_add (dst, 46 + 128);
3522 }
3523 }
3524
3525 static unsigned long
3526 mule_char_to_ucs4 (Lisp_Object charset,
3527 unsigned char h, unsigned char l)
3528 {
3529 Lisp_Object code
3530 = Fget_char_table (make_char (MAKE_CHAR (charset, h & 127, l & 127)),
3531 mule_to_ucs_table);
3532
3533 if (INTP (code))
3534 {
3535 return XINT (code);
3536 }
3537 else if ( (XCHARSET_DIMENSION (charset) == 2) &&
3538 (XCHARSET_CHARS (charset) == 94) )
3539 {
3540 unsigned char final = XCHARSET_FINAL (charset);
3541
3542 if ( ('@' <= final) && (final < 0x7f) )
3543 {
3544 return 0xe00000 + (final - '@') * 94 * 94
3545 + ((h & 127) - 33) * 94 + (l & 127) - 33;
3546 }
3547 else
3548 {
3549 return '?';
3550 }
3551 }
3552 else
3553 {
3554 return '?';
3555 }
3556 }
3557
3558 static void
3559 encode_ucs4 (Lisp_Object charset,
3560 unsigned char h, unsigned char l, unsigned_char_dynarr *dst)
3561 {
3562 unsigned long code = mule_char_to_ucs4 (charset, h, l);
3563 Dynarr_add (dst, code >> 24);
3564 Dynarr_add (dst, (code >> 16) & 255);
3565 Dynarr_add (dst, (code >> 8) & 255);
3566 Dynarr_add (dst, code & 255);
3567 }
3568
3569 static int
3570 detect_coding_ucs4 (struct detection_state *st, const Extbyte *src, Bytecount n)
3571 {
3572 while (n--)
3573 {
3574 unsigned char c = *(unsigned char *)src++;
3575 switch (st->ucs4.in_byte)
3576 {
3577 case 0:
3578 if (c >= 128)
3579 return 0;
3580 else
3581 st->ucs4.in_byte++;
3582 break;
3583 case 3:
3584 st->ucs4.in_byte = 0;
3585 break;
3586 default:
3587 st->ucs4.in_byte++;
3588 }
3589 }
3590 return CODING_CATEGORY_UCS4_MASK;
3591 }
3592
3593 static void
3594 decode_coding_ucs4 (Lstream *decoding, const Extbyte *src,
3595 unsigned_char_dynarr *dst, Bytecount n)
3596 {
3597 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
3598 unsigned int flags = str->flags;
3599 unsigned int ch = str->ch;
3600 unsigned char counter = str->counter;
3601
3602 while (n--)
3603 {
3604 unsigned char c = *(unsigned char *)src++;
3605 switch (counter)
3606 {
3607 case 0:
3608 ch = c;
3609 counter = 3;
3610 break;
3611 case 1:
3612 decode_ucs4 ( ( ch << 8 ) | c, dst);
3613 ch = 0;
3614 counter = 0;
3615 break;
3616 default:
3617 ch = ( ch << 8 ) | c;
3618 counter--;
3619 }
3620 }
3621 if (counter & CODING_STATE_END)
3622 DECODE_OUTPUT_PARTIAL_CHAR (ch);
3623
3624 str->flags = flags;
3625 str->ch = ch;
3626 str->counter = counter;
3627 }
3628
3629 static void
3630 encode_coding_ucs4 (Lstream *encoding, const Intbyte *src,
3631 unsigned_char_dynarr *dst, Bytecount n)
3632 {
3633 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
3634 unsigned int flags = str->flags;
3635 unsigned int ch = str->ch;
3636 unsigned char char_boundary = str->iso2022.current_char_boundary;
3637 Lisp_Object charset = str->iso2022.current_charset;
3638
3639 #ifdef ENABLE_COMPOSITE_CHARS
3640 /* flags for handling composite chars. We do a little switcharoo
3641 on the source while we're outputting the composite char. */
3642 Bytecount saved_n = 0;
3643 const unsigned char *saved_src = NULL;
3644 int in_composite = 0;
3645
3646 back_to_square_n:
3647 #endif
3648
3649 while (n--)
3650 {
3651 unsigned char c = *src++;
3652
3653 if (BYTE_ASCII_P (c))
3654 { /* Processing ASCII character */
3655 ch = 0;
3656 encode_ucs4 (Vcharset_ascii, c, 0, dst);
3657 char_boundary = 1;
3658 }
3659 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch))
3660 { /* Processing Leading Byte */
3661 ch = 0;
3662 charset = CHARSET_BY_LEADING_BYTE (c);
3663 if (LEADING_BYTE_PREFIX_P(c))
3664 ch = c;
3665 char_boundary = 0;
3666 }
3667 else
3668 { /* Processing Non-ASCII character */
3669 char_boundary = 1;
3670 if (EQ (charset, Vcharset_control_1))
3671 {
3672 encode_ucs4 (Vcharset_control_1, c, 0, dst);
3673 }
3674 else
3675 {
3676 switch (XCHARSET_REP_BYTES (charset))
3677 {
3678 case 2:
3679 encode_ucs4 (charset, c, 0, dst);
3680 break;
3681 case 3:
3682 if (XCHARSET_PRIVATE_P (charset))
3683 {
3684 encode_ucs4 (charset, c, 0, dst);
3685 ch = 0;
3686 }
3687 else if (ch)
3688 {
3689 #ifdef ENABLE_COMPOSITE_CHARS
3690 if (EQ (charset, Vcharset_composite))
3691 {
3692 if (in_composite)
3693 {
3694 /* #### Bother! We don't know how to
3695 handle this yet. */
3696 Dynarr_add (dst, '\0');
3697 Dynarr_add (dst, '\0');
3698 Dynarr_add (dst, '\0');
3699 Dynarr_add (dst, '~');
3700 }
3701 else
3702 {
3703 Emchar emch = MAKE_CHAR (Vcharset_composite,
3704 ch & 0x7F, c & 0x7F);
3705 Lisp_Object lstr = composite_char_string (emch);
3706 saved_n = n;
3707 saved_src = src;
3708 in_composite = 1;
3709 src = XSTRING_DATA (lstr);
3710 n = XSTRING_LENGTH (lstr);
3711 }
3712 }
3713 else
3714 #endif /* ENABLE_COMPOSITE_CHARS */
3715 {
3716 encode_ucs4(charset, ch, c, dst);
3717 }
3718 ch = 0;
3719 }
3720 else
3721 {
3722 ch = c;
3723 char_boundary = 0;
3724 }
3725 break;
3726 case 4:
3727 if (ch)
3728 {
3729 encode_ucs4 (charset, ch, c, dst);
3730 ch = 0;
3731 }
3732 else
3733 {
3734 ch = c;
3735 char_boundary = 0;
3736 }
3737 break;
3738 default:
3739 abort ();
3740 }
3741 }
3742 }
3743 }
3744
3745 #ifdef ENABLE_COMPOSITE_CHARS
3746 if (in_composite)
3747 {
3748 n = saved_n;
3749 src = saved_src;
3750 in_composite = 0;
3751 goto back_to_square_n; /* Wheeeeeeeee ..... */
3752 }
3753 #endif /* ENABLE_COMPOSITE_CHARS */
3754
3755 str->flags = flags;
3756 str->ch = ch;
3757 str->iso2022.current_char_boundary = char_boundary;
3758 str->iso2022.current_charset = charset;
3759
3760 /* Verbum caro factum est! */
3761 }
3762
3763
3764 /************************************************************************/
3765 /* UTF-8 methods */
3766 /************************************************************************/
3767
3768 static int
3769 detect_coding_utf8 (struct detection_state *st, const Extbyte *src, Bytecount n)
3770 {
3771 while (n--)
3772 {
3773 unsigned char c = *(unsigned char *)src++;
3774 switch (st->utf8.in_byte)
3775 {
3776 case 0:
3777 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO)
3778 return 0;
3779 else if (c >= 0xfc)
3780 st->utf8.in_byte = 5;
3781 else if (c >= 0xf8)
3782 st->utf8.in_byte = 4;
3783 else if (c >= 0xf0)
3784 st->utf8.in_byte = 3;
3785 else if (c >= 0xe0)
3786 st->utf8.in_byte = 2;
3787 else if (c >= 0xc0)
3788 st->utf8.in_byte = 1;
3789 else if (c >= 0x80)
3790 return 0;
3791 break;
3792 default:
3793 if ((c & 0xc0) != 0x80)
3794 return 0;
3795 else
3796 st->utf8.in_byte--;
3797 }
3798 }
3799 return CODING_CATEGORY_UTF8_MASK;
3800 }
3801
3802 static void
3803 decode_coding_utf8 (Lstream *decoding, const Extbyte *src,
3804 unsigned_char_dynarr *dst, Bytecount n)
3805 {
3806 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
3807 unsigned int flags = str->flags;
3808 unsigned int ch = str->ch;
3809 eol_type_t eol_type = str->eol_type;
3810 unsigned char counter = str->counter;
3811
3812 while (n--)
3813 {
3814 unsigned char c = *(unsigned char *)src++;
3815 switch (counter)
3816 {
3817 case 0:
3818 if ( c >= 0xfc )
3819 {
3820 ch = c & 0x01;
3821 counter = 5;
3822 }
3823 else if ( c >= 0xf8 )
3824 {
3825 ch = c & 0x03;
3826 counter = 4;
3827 }
3828 else if ( c >= 0xf0 )
3829 {
3830 ch = c & 0x07;
3831 counter = 3;
3832 }
3833 else if ( c >= 0xe0 )
3834 {
3835 ch = c & 0x0f;
3836 counter = 2;
3837 }
3838 else if ( c >= 0xc0 )
3839 {
3840 ch = c & 0x1f;
3841 counter = 1;
3842 }
3843 else
3844 {
3845 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst);
3846 decode_ucs4 (c, dst);
3847 }
3848 break;
3849 case 1:
3850 ch = ( ch << 6 ) | ( c & 0x3f );
3851 decode_ucs4 (ch, dst);
3852 ch = 0;
3853 counter = 0;
3854 break;
3855 default:
3856 ch = ( ch << 6 ) | ( c & 0x3f );
3857 counter--;
3858 }
3859 label_continue_loop:;
3860 }
3861
3862 if (flags & CODING_STATE_END)
3863 DECODE_OUTPUT_PARTIAL_CHAR (ch);
3864
3865 str->flags = flags;
3866 str->ch = ch;
3867 str->counter = counter;
3868 }
3869
3870 static void
3871 encode_utf8 (Lisp_Object charset,
3872 unsigned char h, unsigned char l, unsigned_char_dynarr *dst)
3873 {
3874 unsigned long code = mule_char_to_ucs4 (charset, h, l);
3875 if ( code <= 0x7f )
3876 {
3877 Dynarr_add (dst, code);
3878 }
3879 else if ( code <= 0x7ff )
3880 {
3881 Dynarr_add (dst, (code >> 6) | 0xc0);
3882 Dynarr_add (dst, (code & 0x3f) | 0x80);
3883 }
3884 else if ( code <= 0xffff )
3885 {
3886 Dynarr_add (dst, (code >> 12) | 0xe0);
3887 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80);
3888 Dynarr_add (dst, (code & 0x3f) | 0x80);
3889 }
3890 else if ( code <= 0x1fffff )
3891 {
3892 Dynarr_add (dst, (code >> 18) | 0xf0);
3893 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80);
3894 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80);
3895 Dynarr_add (dst, (code & 0x3f) | 0x80);
3896 }
3897 else if ( code <= 0x3ffffff )
3898 {
3899 Dynarr_add (dst, (code >> 24) | 0xf8);
3900 Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80);
3901 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80);
3902 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80);
3903 Dynarr_add (dst, (code & 0x3f) | 0x80);
3904 }
3905 else
3906 {
3907 Dynarr_add (dst, (code >> 30) | 0xfc);
3908 Dynarr_add (dst, ((code >> 24) & 0x3f) | 0x80);
3909 Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80);
3910 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80);
3911 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80);
3912 Dynarr_add (dst, (code & 0x3f) | 0x80);
3913 }
3914 }
3915
3916 static void
3917 encode_coding_utf8 (Lstream *encoding, const Intbyte *src,
3918 unsigned_char_dynarr *dst, Bytecount n)
3919 {
3920 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
3921 unsigned int flags = str->flags;
3922 unsigned int ch = str->ch;
3923 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys);
3924 unsigned char char_boundary = str->iso2022.current_char_boundary;
3925 Lisp_Object charset = str->iso2022.current_charset;
3926
3927 #ifdef ENABLE_COMPOSITE_CHARS
3928 /* flags for handling composite chars. We do a little switcharoo
3929 on the source while we're outputting the composite char. */
3930 Bytecount saved_n = 0;
3931 const unsigned char *saved_src = NULL;
3932 int in_composite = 0;
3933
3934 back_to_square_n:
3935 #endif /* ENABLE_COMPOSITE_CHARS */
3936
3937 while (n--)
3938 {
3939 unsigned char c = *src++;
3940
3941 if (BYTE_ASCII_P (c))
3942 { /* Processing ASCII character */
3943 ch = 0;
3944 if (c == '\n')
3945 {
3946 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT)
3947 Dynarr_add (dst, '\r');
3948 if (eol_type != EOL_CR)
3949 Dynarr_add (dst, c);
3950 }
3951 else
3952 encode_utf8 (Vcharset_ascii, c, 0, dst);
3953 char_boundary = 1;
3954 }
3955 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch))
3956 { /* Processing Leading Byte */
3957 ch = 0;
3958 charset = CHARSET_BY_LEADING_BYTE (c);
3959 if (LEADING_BYTE_PREFIX_P(c))
3960 ch = c;
3961 char_boundary = 0;
3962 }
3963 else
3964 { /* Processing Non-ASCII character */
3965 char_boundary = 1;
3966 if (EQ (charset, Vcharset_control_1))
3967 {
3968 encode_utf8 (Vcharset_control_1, c, 0, dst);
3969 }
3970 else
3971 {
3972 switch (XCHARSET_REP_BYTES (charset))
3973 {
3974 case 2:
3975 encode_utf8 (charset, c, 0, dst);
3976 break;
3977 case 3:
3978 if (XCHARSET_PRIVATE_P (charset))
3979 {
3980 encode_utf8 (charset, c, 0, dst);
3981 ch = 0;
3982 }
3983 else if (ch)
3984 {
3985 #ifdef ENABLE_COMPOSITE_CHARS
3986 if (EQ (charset, Vcharset_composite))
3987 {
3988 if (in_composite)
3989 {
3990 /* #### Bother! We don't know how to
3991 handle this yet. */
3992 encode_utf8 (Vcharset_ascii, '~', 0, dst);
3993 }
3994 else
3995 {
3996 Emchar emch = MAKE_CHAR (Vcharset_composite,
3997 ch & 0x7F, c & 0x7F);
3998 Lisp_Object lstr = composite_char_string (emch);
3999 saved_n = n;
4000 saved_src = src;
4001 in_composite = 1;
4002 src = XSTRING_DATA (lstr);
4003 n = XSTRING_LENGTH (lstr);
4004 }
4005 }
4006 else
4007 #endif /* ENABLE_COMPOSITE_CHARS */
4008 {
4009 encode_utf8 (charset, ch, c, dst);
4010 }
4011 ch = 0;
4012 }
4013 else
4014 {
4015 ch = c;
4016 char_boundary = 0;
4017 }
4018 break;
4019 case 4:
4020 if (ch)
4021 {
4022 encode_utf8 (charset, ch, c, dst);
4023 ch = 0;
4024 }
4025 else
4026 {
4027 ch = c;
4028 char_boundary = 0;
4029 }
4030 break;
4031 default:
4032 abort ();
4033 }
4034 }
4035 }
4036 }
4037
4038 #ifdef ENABLE_COMPOSITE_CHARS
4039 if (in_composite)
4040 {
4041 n = saved_n;
4042 src = saved_src;
4043 in_composite = 0;
4044 goto back_to_square_n; /* Wheeeeeeeee ..... */
4045 }
4046 #endif
4047
4048 str->flags = flags;
4049 str->ch = ch;
4050 str->iso2022.current_char_boundary = char_boundary;
4051 str->iso2022.current_charset = charset;
4052
4053 /* Verbum caro factum est! */
4054 }
4055
4056
4057 /************************************************************************/
4058 /* ISO2022 methods */
4059 /************************************************************************/
4060
4061 /* The following note describes the coding system ISO2022 briefly.
4062 Since the intention of this note is to help understand the
4063 functions in this file, some parts are NOT ACCURATE or OVERLY
4064 SIMPLIFIED. For thorough understanding, please refer to the
4065 original document of ISO2022.
4066
4067 ISO2022 provides many mechanisms to encode several character sets
4068 in 7-bit and 8-bit environments. For 7-bit environments, all text
4069 is encoded using bytes less than 128. This may make the encoded
4070 text a little bit longer, but the text passes more easily through
4071 several gateways, some of which strip off MSB (Most Signigant Bit).
4072
4073 There are two kinds of character sets: control character set and
4074 graphic character set. The former contains control characters such
4075 as `newline' and `escape' to provide control functions (control
4076 functions are also provided by escape sequences). The latter
4077 contains graphic characters such as 'A' and '-'. Emacs recognizes
4078 two control character sets and many graphic character sets.
4079
4080 Graphic character sets are classified into one of the following
4081 four classes, according to the number of bytes (DIMENSION) and
4082 number of characters in one dimension (CHARS) of the set:
4083 - DIMENSION1_CHARS94
4084 - DIMENSION1_CHARS96
4085 - DIMENSION2_CHARS94
4086 - DIMENSION2_CHARS96
4087
4088 In addition, each character set is assigned an identification tag,
4089 unique for each set, called "final character" (denoted as <F>
4090 hereafter). The <F> of each character set is decided by ECMA(*)
4091 when it is registered in ISO. The code range of <F> is 0x30..0x7F
4092 (0x30..0x3F are for private use only).
4093
4094 Note (*): ECMA = European Computer Manufacturers Association
4095
4096 Here are examples of graphic character set [NAME(<F>)]:
4097 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ...
4098 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ...
4099 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
4100 o DIMENSION2_CHARS96 -- none for the moment
4101
4102 A code area (1 byte = 8 bits) is divided into 4 areas, C0, GL, C1, and GR.
4103 C0 [0x00..0x1F] -- control character plane 0
4104 GL [0x20..0x7F] -- graphic character plane 0
4105 C1 [0x80..0x9F] -- control character plane 1
4106 GR [0xA0..0xFF] -- graphic character plane 1
4107
4108 A control character set is directly designated and invoked to C0 or
4109 C1 by an escape sequence. The most common case is that:
4110 - ISO646's control character set is designated/invoked to C0, and
4111 - ISO6429's control character set is designated/invoked to C1,
4112 and usually these designations/invocations are omitted in encoded
4113 text. In a 7-bit environment, only C0 can be used, and a control
4114 character for C1 is encoded by an appropriate escape sequence to
4115 fit into the environment. All control characters for C1 are
4116 defined to have corresponding escape sequences.
4117
4118 A graphic character set is at first designated to one of four
4119 graphic registers (G0 through G3), then these graphic registers are
4120 invoked to GL or GR. These designations and invocations can be
4121 done independently. The most common case is that G0 is invoked to
4122 GL, G1 is invoked to GR, and ASCII is designated to G0. Usually
4123 these invocations and designations are omitted in encoded text.
4124 In a 7-bit environment, only GL can be used.
4125
4126 When a graphic character set of CHARS94 is invoked to GL, codes
4127 0x20 and 0x7F of the GL area work as control characters SPACE and
4128 DEL respectively, and codes 0xA0 and 0xFF of the GR area should not
4129 be used.
4130
4131 There are two ways of invocation: locking-shift and single-shift.
4132 With locking-shift, the invocation lasts until the next different
4133 invocation, whereas with single-shift, the invocation affects the
4134 following character only and doesn't affect the locking-shift
4135 state. Invocations are done by the following control characters or
4136 escape sequences:
4137
4138 ----------------------------------------------------------------------
4139 abbrev function cntrl escape seq description
4140 ----------------------------------------------------------------------
4141 SI/LS0 (shift-in) 0x0F none invoke G0 into GL
4142 SO/LS1 (shift-out) 0x0E none invoke G1 into GL
4143 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL
4144 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL
4145 LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*)
4146 LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*)
4147 LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*)
4148 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char
4149 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char
4150 ----------------------------------------------------------------------
4151 (*) These are not used by any known coding system.
4152
4153 Control characters for these functions are defined by macros
4154 ISO_CODE_XXX in `coding.h'.
4155
4156 Designations are done by the following escape sequences:
4157 ----------------------------------------------------------------------
4158 escape sequence description
4159 ----------------------------------------------------------------------
4160 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0
4161 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1
4162 ESC '*' <F> designate DIMENSION1_CHARS94<F> to G2
4163 ESC '+' <F> designate DIMENSION1_CHARS94<F> to G3
4164 ESC ',' <F> designate DIMENSION1_CHARS96<F> to G0 (*)
4165 ESC '-' <F> designate DIMENSION1_CHARS96<F> to G1
4166 ESC '.' <F> designate DIMENSION1_CHARS96<F> to G2
4167 ESC '/' <F> designate DIMENSION1_CHARS96<F> to G3
4168 ESC '$' '(' <F> designate DIMENSION2_CHARS94<F> to G0 (**)
4169 ESC '$' ')' <F> designate DIMENSION2_CHARS94<F> to G1
4170 ESC '$' '*' <F> designate DIMENSION2_CHARS94<F> to G2
4171 ESC '$' '+' <F> designate DIMENSION2_CHARS94<F> to G3
4172 ESC '$' ',' <F> designate DIMENSION2_CHARS96<F> to G0 (*)
4173 ESC '$' '-' <F> designate DIMENSION2_CHARS96<F> to G1
4174 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2
4175 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3
4176 ----------------------------------------------------------------------
4177
4178 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set
4179 of dimension 1, chars 94, and final character <F>, etc...
4180
4181 Note (*): Although these designations are not allowed in ISO2022,
4182 Emacs accepts them on decoding, and produces them on encoding
4183 CHARS96 character sets in a coding system which is characterized as
4184 7-bit environment, non-locking-shift, and non-single-shift.
4185
4186 Note (**): If <F> is '@', 'A', or 'B', the intermediate character
4187 '(' can be omitted. We refer to this as "short-form" hereafter.
4188
4189 Now you may notice that there are a lot of ways for encoding the
4190 same multilingual text in ISO2022. Actually, there exist many
4191 coding systems such as Compound Text (used in X11's inter client
4192 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR
4193 (used in Korean internet), EUC (Extended UNIX Code, used in Asian
4194 localized platforms), and all of these are variants of ISO2022.
4195
4196 In addition to the above, Emacs handles two more kinds of escape
4197 sequences: ISO6429's direction specification and Emacs' private
4198 sequence for specifying character composition.
4199
4200 ISO6429's direction specification takes the following form:
4201 o CSI ']' -- end of the current direction
4202 o CSI '0' ']' -- end of the current direction
4203 o CSI '1' ']' -- start of left-to-right text
4204 o CSI '2' ']' -- start of right-to-left text
4205 The control character CSI (0x9B: control sequence introducer) is
4206 abbreviated to the escape sequence ESC '[' in a 7-bit environment.
4207
4208 Character composition specification takes the following form:
4209 o ESC '0' -- start character composition
4210 o ESC '1' -- end character composition
4211 Since these are not standard escape sequences of any ISO standard,
4212 their use with these meanings is restricted to Emacs only. */
4213
4214 static void
4215 reset_iso2022 (Lisp_Object coding_system, struct iso2022_decoder *iso)
4216 {
4217 int i;
4218
4219 for (i = 0; i < 4; i++)
4220 {
4221 if (!NILP (coding_system))
4222 iso->charset[i] =
4223 XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, i);
4224 else
4225 iso->charset[i] = Qt;
4226 iso->invalid_designated[i] = 0;
4227 }
4228 iso->esc = ISO_ESC_NOTHING;
4229 iso->esc_bytes_index = 0;
4230 iso->register_left = 0;
4231 iso->register_right = 1;
4232 iso->switched_dir_and_no_valid_charset_yet = 0;
4233 iso->invalid_switch_dir = 0;
4234 iso->output_direction_sequence = 0;
4235 iso->output_literally = 0;
4236 #ifdef ENABLE_COMPOSITE_CHARS
4237 if (iso->composite_chars)
4238 Dynarr_reset (iso->composite_chars);
4239 #endif
4240 }
4241
4242 static int
4243 fit_to_be_escape_quoted (unsigned char c)
4244 {
4245 switch (c)
4246 {
4247 case ISO_CODE_ESC:
4248 case ISO_CODE_CSI:
4249 case ISO_CODE_SS2:
4250 case ISO_CODE_SS3:
4251 case ISO_CODE_SO:
4252 case ISO_CODE_SI:
4253 return 1;
4254
4255 default:
4256 return 0;
4257 }
4258 }
4259
4260 /* Parse one byte of an ISO2022 escape sequence.
4261 If the result is an invalid escape sequence, return 0 and
4262 do not change anything in STR. Otherwise, if the result is
4263 an incomplete escape sequence, update ISO2022.ESC and
4264 ISO2022.ESC_BYTES and return -1. Otherwise, update
4265 all the state variables (but not ISO2022.ESC_BYTES) and
4266 return 1.
4267
4268 If CHECK_INVALID_CHARSETS is non-zero, check for designation
4269 or invocation of an invalid character set and treat that as
4270 an unrecognized escape sequence.
4271
4272 ********************************************************************
4273
4274 #### Strategies for error annotation and coding orthogonalization
4275
4276 We really want to separate out a number of things. Conceptually,
4277 there is a nested syntax.
4278
4279 At the top level is the ISO 2022 extension syntax, including charset
4280 designation and invocation, and certain auxiliary controls such as the
4281 ISO 6429 direction specification. These are octet-oriented, with the
4282 single exception (AFAIK) of the "exit Unicode" sequence which uses the
4283 UTF's natural width (1 byte for UTF-7 and UTF-8, 2 bytes for UCS-2 and
4284 UTF-16, and 4 bytes for UCS-4 and UTF-32). This will be treated as a
4285 (deprecated) special case in Unicode processing.
4286
4287 The middle layer is ISO 2022 character interpretation. This will depend
4288 on the current state of the ISO 2022 registers, and assembles octets
4289 into the character's internal representation.
4290
4291 The lowest level is translating system control conventions. At present
4292 this is restricted to newline translation, but one could imagine doing
4293 tab conversion or line wrapping here. "Escape from Unicode" processing
4294 would be done at this level.
4295
4296 At each level the parser will verify the syntax. In the case of a
4297 syntax error or warning (such as a redundant escape sequence that affects
4298 no characters), the parser will take some action, typically inserting the
4299 erroneous octets directly into the output and creating an annotation
4300 which can be used by higher level I/O to mark the affected region.
4301
4302 This should make it possible to do something sensible about separating
4303 newline convention processing from character construction, and about
4304 preventing ISO 2022 escape sequences from being recognized
4305 inappropriately.
4306
4307 The basic strategy will be to have octet classification tables, and
4308 switch processing according to the table entry.
4309
4310 It's possible that, by doing the processing with tables of functions or
4311 the like, the parser can be used for both detection and translation. */
4312
4313 static int
4314 parse_iso2022_esc (Lisp_Object codesys, struct iso2022_decoder *iso,
4315 unsigned char c, unsigned int *flags,
4316 int check_invalid_charsets)
4317 {
4318 /* (1) If we're at the end of a designation sequence, CS is the
4319 charset being designated and REG is the register to designate
4320 it to.
4321
4322 (2) If we're at the end of a locking-shift sequence, REG is
4323 the register to invoke and HALF (0 == left, 1 == right) is
4324 the half to invoke it into.
4325
4326 (3) If we're at the end of a single-shift sequence, REG is
4327 the register to invoke. */
4328 Lisp_Object cs = Qnil;
4329 int reg, half;
4330
4331 /* NOTE: This code does goto's all over the fucking place.
4332 The reason for this is that we're basically implementing
4333 a state machine here, and hierarchical languages like C
4334 don't really provide a clean way of doing this. */
4335
4336 if (! (*flags & CODING_STATE_ESCAPE))
4337 /* At beginning of escape sequence; we need to reset our
4338 escape-state variables. */
4339 iso->esc = ISO_ESC_NOTHING;
4340
4341 iso->output_literally = 0;
4342 iso->output_direction_sequence = 0;
4343
4344 switch (iso->esc)
4345 {
4346 case ISO_ESC_NOTHING:
4347 iso->esc_bytes_index = 0;
4348 switch (c)
4349 {
4350 case ISO_CODE_ESC: /* Start escape sequence */
4351 *flags |= CODING_STATE_ESCAPE;
4352 iso->esc = ISO_ESC;
4353 goto not_done;
4354
4355 case ISO_CODE_CSI: /* ISO6429 (specifying directionality) */
4356 *flags |= CODING_STATE_ESCAPE;
4357 iso->esc = ISO_ESC_5_11;
4358 goto not_done;
4359
4360 case ISO_CODE_SO: /* locking shift 1 */
4361 reg = 1; half = 0;
4362 goto locking_shift;
4363 case ISO_CODE_SI: /* locking shift 0 */
4364 reg = 0; half = 0;
4365 goto locking_shift;
4366
4367 case ISO_CODE_SS2: /* single shift */
4368 reg = 2;
4369 goto single_shift;
4370 case ISO_CODE_SS3: /* single shift */
4371 reg = 3;
4372 goto single_shift;
4373
4374 default: /* Other control characters */
4375 return 0;
4376 }
4377
4378 case ISO_ESC:
4379 switch (c)
4380 {
4381 /**** single shift ****/
4382
4383 case 'N': /* single shift 2 */
4384 reg = 2;
4385 goto single_shift;
4386 case 'O': /* single shift 3 */
4387 reg = 3;
4388 goto single_shift;
4389
4390 /**** locking shift ****/
4391
4392 case '~': /* locking shift 1 right */
4393 reg = 1; half = 1;
4394 goto locking_shift;
4395 case 'n': /* locking shift 2 */
4396 reg = 2; half = 0;
4397 goto locking_shift;
4398 case '}': /* locking shift 2 right */
4399 reg = 2; half = 1;
4400 goto locking_shift;
4401 case 'o': /* locking shift 3 */
4402 reg = 3; half = 0;
4403 goto locking_shift;
4404 case '|': /* locking shift 3 right */
4405 reg = 3; half = 1;
4406 goto locking_shift;
4407
4408 #ifdef ENABLE_COMPOSITE_CHARS
4409 /**** composite ****/
4410
4411 case '0':
4412 iso->esc = ISO_ESC_START_COMPOSITE;
4413 *flags = (*flags & CODING_STATE_ISO2022_LOCK) |
4414 CODING_STATE_COMPOSITE;
4415 return 1;
4416
4417 case '1':
4418 iso->esc = ISO_ESC_END_COMPOSITE;
4419 *flags = (*flags & CODING_STATE_ISO2022_LOCK) &
4420 ~CODING_STATE_COMPOSITE;
4421 return 1;
4422 #endif /* ENABLE_COMPOSITE_CHARS */
4423
4424 /**** directionality ****/
4425
4426 case '[':
4427 iso->esc = ISO_ESC_5_11;
4428 goto not_done;
4429
4430 /**** designation ****/
4431
4432 case '$': /* multibyte charset prefix */
4433 iso->esc = ISO_ESC_2_4;
4434 goto not_done;
4435
4436 default:
4437 if (0x28 <= c && c <= 0x2F)
4438 {
4439 iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_8);
4440 goto not_done;
4441 }
4442
4443 /* This function is called with CODESYS equal to nil when
4444 doing coding-system detection. */
4445 if (!NILP (codesys)
4446 && XCODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys)
4447 && fit_to_be_escape_quoted (c))
4448 {
4449 iso->esc = ISO_ESC_LITERAL;
4450 *flags &= CODING_STATE_ISO2022_LOCK;
4451 return 1;
4452 }
4453
4454 /* bzzzt! */
4455 return 0;
4456 }
4457
4458
4459
4460 /**** directionality ****/
4461
4462 case ISO_ESC_5_11: /* ISO6429 direction control */
4463 if (c == ']')
4464 {
4465 *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L);
4466 goto directionality;
4467 }
4468 if (c == '0') iso->esc = ISO_ESC_5_11_0;
4469 else if (c == '1') iso->esc = ISO_ESC_5_11_1;
4470 else if (c == '2') iso->esc = ISO_ESC_5_11_2;
4471 else return 0;
4472 goto not_done;
4473
4474 case ISO_ESC_5_11_0:
4475 if (c == ']')
4476 {
4477 *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L);
4478 goto directionality;
4479 }
4480 return 0;
4481
4482 case ISO_ESC_5_11_1:
4483 if (c == ']')
4484 {
4485 *flags = (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L);
4486 goto directionality;
4487 }
4488 return 0;
4489
4490 case ISO_ESC_5_11_2:
4491 if (c == ']')
4492 {
4493 *flags = (*flags & CODING_STATE_ISO2022_LOCK) | CODING_STATE_R2L;
4494 goto directionality;
4495 }
4496 return 0;
4497
4498 directionality:
4499 iso->esc = ISO_ESC_DIRECTIONALITY;
4500 /* Various junk here to attempt to preserve the direction sequences
4501 literally in the text if they would otherwise be swallowed due
4502 to invalid designations that don't show up as actual charset
4503 changes in the text. */
4504 if (iso->invalid_switch_dir)
4505 {
4506 /* We already inserted a direction switch literally into the
4507 text. We assume (#### this may not be right) that the
4508 next direction switch is the one going the other way,
4509 and we need to output that literally as well. */
4510 iso->output_literally = 1;
4511 iso->invalid_switch_dir = 0;
4512 }
4513 else
4514 {
4515 int jj;
4516
4517 /* If we are in the thrall of an invalid designation,
4518 then stick the directionality sequence literally into the
4519 output stream so it ends up in the original text again. */
4520 for (jj = 0; jj < 4; jj++)
4521 if (iso->invalid_designated[jj])
4522 break;
4523 if (jj < 4)
4524 {
4525 iso->output_literally = 1;
4526 iso->invalid_switch_dir = 1;
4527 }
4528 else
4529 /* Indicate that we haven't yet seen a valid designation,
4530 so that if a switch-dir is directly followed by an
4531 invalid designation, both get inserted literally. */
4532 iso->switched_dir_and_no_valid_charset_yet = 1;
4533 }
4534 return 1;
4535
4536
4537 /**** designation ****/
4538
4539 case ISO_ESC_2_4:
4540 if (0x28 <= c && c <= 0x2F)
4541 {
4542 iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_4_8);
4543 goto not_done;
4544 }
4545 if (0x40 <= c && c <= 0x42)
4546 {
4547 cs = CHARSET_BY_ATTRIBUTES (CHARSET_TYPE_94X94, c,
4548 *flags & CODING_STATE_R2L ?
4549 CHARSET_RIGHT_TO_LEFT :
4550 CHARSET_LEFT_TO_RIGHT);
4551 reg = 0;
4552 goto designated;
4553 }
4554 return 0;
4555
4556 default:
4557 {
4558 int type =-1;
4559
4560 if (c < '0' || c > '~')
4561 return 0; /* bad final byte */
4562
4563 if (iso->esc >= ISO_ESC_2_8 &&
4564 iso->esc <= ISO_ESC_2_15)
4565 {
4566 type = ((iso->esc >= ISO_ESC_2_12) ?
4567 CHARSET_TYPE_96 : CHARSET_TYPE_94);
4568 reg = (iso->esc - ISO_ESC_2_8) & 3;
4569 }
4570 else if (iso->esc >= ISO_ESC_2_4_8 &&
4571 iso->esc <= ISO_ESC_2_4_15)
4572 {
4573 type = ((iso->esc >= ISO_ESC_2_4_12) ?
4574 CHARSET_TYPE_96X96 : CHARSET_TYPE_94X94);
4575 reg = (iso->esc - ISO_ESC_2_4_8) & 3;
4576 }
4577 else
4578 {
4579 /* Can this ever be reached? -slb */
4580 abort();
4581 return 0;
4582 }
4583
4584 cs = CHARSET_BY_ATTRIBUTES (type, c,
4585 *flags & CODING_STATE_R2L ?
4586 CHARSET_RIGHT_TO_LEFT :
4587 CHARSET_LEFT_TO_RIGHT);
4588 goto designated;
4589 }
4590 }
4591
4592 not_done:
4593 iso->esc_bytes[iso->esc_bytes_index++] = (unsigned char) c;
4594 return -1;
4595
4596 single_shift:
4597 if (check_invalid_charsets && !CHARSETP (iso->charset[reg]))
4598 /* can't invoke something that ain't there. */
4599 return 0;
4600 iso->esc = ISO_ESC_SINGLE_SHIFT;
4601 *flags &= CODING_STATE_ISO2022_LOCK;
4602 if (reg == 2)
4603 *flags |= CODING_STATE_SS2;
4604 else
4605 *flags |= CODING_STATE_SS3;
4606 return 1;
4607
4608 locking_shift:
4609 if (check_invalid_charsets &&
4610 !CHARSETP (iso->charset[reg]))
4611 /* can't invoke something that ain't there. */
4612 return 0;
4613 if (half)
4614 iso->register_right = reg;
4615 else
4616 iso->register_left = reg;
4617 *flags &= CODING_STATE_ISO2022_LOCK;
4618 iso->esc = ISO_ESC_LOCKING_SHIFT;
4619 return 1;
4620
4621 designated:
4622 if (NILP (cs) && check_invalid_charsets)
4623 {
4624 iso->invalid_designated[reg] = 1;
4625 iso->charset[reg] = Vcharset_ascii;
4626 iso->esc = ISO_ESC_DESIGNATE;
4627 *flags &= CODING_STATE_ISO2022_LOCK;
4628 iso->output_literally = 1;
4629 if (iso->switched_dir_and_no_valid_charset_yet)
4630 {
4631 /* We encountered a switch-direction followed by an
4632 invalid designation. Ensure that the switch-direction
4633 gets outputted; otherwise it will probably get eaten
4634 when the text is written out again. */
4635 iso->switched_dir_and_no_valid_charset_yet = 0;
4636 iso->output_direction_sequence = 1;
4637 /* And make sure that the switch-dir going the other
4638 way gets outputted, as well. */
4639 iso->invalid_switch_dir = 1;
4640 }
4641 return 1;
4642 }
4643 /* This function is called with CODESYS equal to nil when
4644 doing coding-system detection. */
4645 if (!NILP (codesys))
4646 {
4647 charset_conversion_spec_dynarr *dyn =
4648 XCODING_SYSTEM (codesys)->iso2022.input_conv;
4649
4650 if (dyn)
4651 {
4652 int i;
4653
4654 for (i = 0; i < Dynarr_length (dyn); i++)
4655 {
4656 struct charset_conversion_spec *spec = Dynarr_atp (dyn, i);
4657 if (EQ (cs, spec->from_charset))
4658 cs = spec->to_charset;
4659 }
4660 }
4661 }
4662
4663 iso->charset[reg] = cs;
4664 iso->esc = ISO_ESC_DESIGNATE;
4665 *flags &= CODING_STATE_ISO2022_LOCK;
4666 if (iso->invalid_designated[reg])
4667 {
4668 iso->invalid_designated[reg] = 0;
4669 iso->output_literally = 1;
4670 }
4671 if (iso->switched_dir_and_no_valid_charset_yet)
4672 iso->switched_dir_and_no_valid_charset_yet = 0;
4673 return 1;
4674 }
4675
4676 static int
4677 detect_coding_iso2022 (struct detection_state *st, const Extbyte *src, Bytecount n)
4678 {
4679 int mask;
4680
4681 /* #### There are serious deficiencies in the recognition mechanism
4682 here. This needs to be much smarter if it's going to cut it.
4683 The sequence "\xff\x0f" is currently detected as LOCK_SHIFT while
4684 it should be detected as Latin-1.
4685 All the ISO2022 stuff in this file should be synced up with the
4686 code from FSF Emacs-20.4, in which Mule should be more or less stable.
4687 Perhaps we should wait till R2L works in FSF Emacs? */
4688
4689 if (!st->iso2022.initted)
4690 {
4691 reset_iso2022 (Qnil, &st->iso2022.iso);
4692 st->iso2022.mask = (CODING_CATEGORY_ISO_7_MASK |
4693 CODING_CATEGORY_ISO_8_DESIGNATE_MASK |
4694 CODING_CATEGORY_ISO_8_1_MASK |
4695 CODING_CATEGORY_ISO_8_2_MASK |
4696 CODING_CATEGORY_ISO_LOCK_SHIFT_MASK);
4697 st->iso2022.flags = 0;
4698 st->iso2022.high_byte_count = 0;
4699 st->iso2022.saw_single_shift = 0;
4700 st->iso2022.initted = 1;
4701 }
4702
4703 mask = st->iso2022.mask;
4704
4705 while (n--)
4706 {
4707 unsigned char c = *(unsigned char *)src++;
4708 if (c >= 0xA0)
4709 {
4710 mask &= ~CODING_CATEGORY_ISO_7_MASK;
4711 st->iso2022.high_byte_count++;
4712 }
4713 else
4714 {
4715 if (st->iso2022.high_byte_count && !st->iso2022.saw_single_shift)
4716 {
4717 if (st->iso2022.high_byte_count & 1)
4718 /* odd number of high bytes; assume not iso-8-2 */
4719 mask &= ~CODING_CATEGORY_ISO_8_2_MASK;
4720 }
4721 st->iso2022.high_byte_count = 0;
4722 st->iso2022.saw_single_shift = 0;
4723 if (c > 0x80)
4724 mask &= ~CODING_CATEGORY_ISO_7_MASK;
4725 }
4726 if (!(st->iso2022.flags & CODING_STATE_ESCAPE)
4727 && (BYTE_C0_P (c) || BYTE_C1_P (c)))
4728 { /* control chars */
4729 switch (c)
4730 {
4731 /* Allow and ignore control characters that you might
4732 reasonably see in a text file */
4733 case '\r':
4734 case '\n':
4735 case '\t':
4736 case 7: /* bell */
4737 case 8: /* backspace */
4738 case 11: /* vertical tab */
4739 case 12: /* form feed */
4740 case 26: /* MS-DOS C-z junk */
4741 case 31: /* '^_' -- for info */
4742 goto label_continue_loop;
4743
4744 default:
4745 break;
4746 }
4747 }
4748
4749 if ((st->iso2022.flags & CODING_STATE_ESCAPE) || BYTE_C0_P (c)
4750 || BYTE_C1_P (c))
4751 {
4752 if (parse_iso2022_esc (Qnil, &st->iso2022.iso, c,
4753 &st->iso2022.flags, 0))
4754 {
4755 switch (st->iso2022.iso.esc)
4756 {
4757 case ISO_ESC_DESIGNATE:
4758 mask &= ~CODING_CATEGORY_ISO_8_1_MASK;
4759 mask &= ~CODING_CATEGORY_ISO_8_2_MASK;
4760 break;
4761 case ISO_ESC_LOCKING_SHIFT:
4762 mask = CODING_CATEGORY_ISO_LOCK_SHIFT_MASK;
4763 goto ran_out_of_chars;
4764 case ISO_ESC_SINGLE_SHIFT:
4765 mask &= ~CODING_CATEGORY_ISO_8_DESIGNATE_MASK;
4766 st->iso2022.saw_single_shift = 1;
4767 break;
4768 default:
4769 break;
4770 }
4771 }
4772 else
4773 {
4774 mask = 0;
4775 goto ran_out_of_chars;
4776 }
4777 }
4778 label_continue_loop:;
4779 }
4780
4781 ran_out_of_chars:
4782
4783 return mask;
4784 }
4785
4786 static int
4787 postprocess_iso2022_mask (int mask)
4788 {
4789 /* #### kind of cheesy */
4790 /* If seven-bit ISO is allowed, then assume that the encoding is
4791 entirely seven-bit and turn off the eight-bit ones. */
4792 if (mask & CODING_CATEGORY_ISO_7_MASK)
4793 mask &= ~ (CODING_CATEGORY_ISO_8_DESIGNATE_MASK |
4794 CODING_CATEGORY_ISO_8_1_MASK |
4795 CODING_CATEGORY_ISO_8_2_MASK);
4796 return mask;
4797 }
4798
4799 /* If FLAGS is a null pointer or specifies right-to-left motion,
4800 output a switch-dir-to-left-to-right sequence to DST.
4801 Also update FLAGS if it is not a null pointer.
4802 If INTERNAL_P is set, we are outputting in internal format and
4803 need to handle the CSI differently. */
4804
4805 static void
4806 restore_left_to_right_direction (Lisp_Coding_System *codesys,
4807 unsigned_char_dynarr *dst,
4808 unsigned int *flags,
4809 int internal_p)
4810 {
4811 if (!flags || (*flags & CODING_STATE_R2L))
4812 {
4813 if (CODING_SYSTEM_ISO2022_SEVEN (codesys))
4814 {
4815 Dynarr_add (dst, ISO_CODE_ESC);
4816 Dynarr_add (dst, '[');
4817 }
4818 else if (internal_p)
4819 DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst);
4820 else
4821 Dynarr_add (dst, ISO_CODE_CSI);
4822 Dynarr_add (dst, '0');
4823 Dynarr_add (dst, ']');
4824 if (flags)
4825 *flags &= ~CODING_STATE_R2L;
4826 }
4827 }
4828
4829 /* If FLAGS is a null pointer or specifies a direction different from
4830 DIRECTION (which should be either CHARSET_RIGHT_TO_LEFT or
4831 CHARSET_LEFT_TO_RIGHT), output the appropriate switch-dir escape
4832 sequence to DST. Also update FLAGS if it is not a null pointer.
4833 If INTERNAL_P is set, we are outputting in internal format and
4834 need to handle the CSI differently. */
4835
4836 static void
4837 ensure_correct_direction (int direction, Lisp_Coding_System *codesys,
4838 unsigned_char_dynarr *dst, unsigned int *flags,
4839 int internal_p)
4840 {
4841 if ((!flags || (*flags & CODING_STATE_R2L)) &&
4842 direction == CHARSET_LEFT_TO_RIGHT)
4843 restore_left_to_right_direction (codesys, dst, flags, internal_p);
4844 else if (!CODING_SYSTEM_ISO2022_NO_ISO6429 (codesys)
4845 && (!flags || !(*flags & CODING_STATE_R2L)) &&
4846 direction == CHARSET_RIGHT_TO_LEFT)
4847 {
4848 if (CODING_SYSTEM_ISO2022_SEVEN (codesys))
4849 {
4850 Dynarr_add (dst, ISO_CODE_ESC);
4851 Dynarr_add (dst, '[');
4852 }
4853 else if (internal_p)
4854 DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst);
4855 else
4856 Dynarr_add (dst, ISO_CODE_CSI);
4857 Dynarr_add (dst, '2');
4858 Dynarr_add (dst, ']');
4859 if (flags)
4860 *flags |= CODING_STATE_R2L;
4861 }
4862 }
4863
4864 /* Convert ISO2022-format data to internal format. */
4865
4866 static void
4867 decode_coding_iso2022 (Lstream *decoding, const Extbyte *src,
4868 unsigned_char_dynarr *dst, Bytecount n)
4869 {
4870 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
4871 unsigned int flags = str->flags;
4872 unsigned int ch = str->ch;
4873 eol_type_t eol_type = str->eol_type;
4874 #ifdef ENABLE_COMPOSITE_CHARS
4875 unsigned_char_dynarr *real_dst = dst;
4876 #endif
4877 Lisp_Object coding_system;
4878
4879 XSETCODING_SYSTEM (coding_system, str->codesys);
4880
4881 #ifdef ENABLE_COMPOSITE_CHARS
4882 if (flags & CODING_STATE_COMPOSITE)
4883 dst = str->iso2022.composite_chars;
4884 #endif /* ENABLE_COMPOSITE_CHARS */
4885
4886 while (n--)
4887 {
4888 unsigned char c = *(unsigned char *)src++;
4889 if (flags & CODING_STATE_ESCAPE)
4890 { /* Within ESC sequence */
4891 int retval = parse_iso2022_esc (coding_system, &str->iso2022,
4892 c, &flags, 1);
4893
4894 if (retval)
4895 {
4896 switch (str->iso2022.esc)
4897 {
4898 #ifdef ENABLE_COMPOSITE_CHARS
4899 case ISO_ESC_START_COMPOSITE:
4900 if (str->iso2022.composite_chars)
4901 Dynarr_reset (str->iso2022.composite_chars);
4902 else
4903 str->iso2022.composite_chars = Dynarr_new (unsigned_char);
4904 dst = str->iso2022.composite_chars;
4905 break;
4906 case ISO_ESC_END_COMPOSITE:
4907 {
4908 Intbyte comstr[MAX_EMCHAR_LEN];
4909 Bytecount len;
4910 Emchar emch = lookup_composite_char (Dynarr_atp (dst, 0),
4911 Dynarr_length (dst));
4912 dst = real_dst;
4913 len = set_charptr_emchar (comstr, emch);
4914 Dynarr_add_many (dst, comstr, len);
4915 break;
4916 }
4917 #endif /* ENABLE_COMPOSITE_CHARS */
4918
4919 case ISO_ESC_LITERAL:
4920 DECODE_ADD_BINARY_CHAR (c, dst);
4921 break;
4922
4923 default:
4924 /* Everything else handled already */
4925 break;
4926 }
4927 }
4928
4929 /* Attempted error recovery. */
4930 if (str->iso2022.output_direction_sequence)
4931 ensure_correct_direction (flags & CODING_STATE_R2L ?
4932 CHARSET_RIGHT_TO_LEFT :
4933 CHARSET_LEFT_TO_RIGHT,
4934 str->codesys, dst, 0, 1);
4935 /* More error recovery. */
4936 if (!retval || str->iso2022.output_literally)
4937 {
4938 /* Output the (possibly invalid) sequence */
4939 int i;
4940 for (i = 0; i < str->iso2022.esc_bytes_index; i++)
4941 DECODE_ADD_BINARY_CHAR (str->iso2022.esc_bytes[i], dst);
4942 flags &= CODING_STATE_ISO2022_LOCK;
4943 if (!retval)
4944 n++, src--;/* Repeat the loop with the same character. */
4945 else
4946 {
4947 /* No sense in reprocessing the final byte of the
4948 escape sequence; it could mess things up anyway.
4949 Just add it now. */
4950 DECODE_ADD_BINARY_CHAR (c, dst);
4951 }
4952 }
4953 ch = 0;
4954 }
4955 else if (BYTE_C0_P (c) || BYTE_C1_P (c))
4956 { /* Control characters */
4957
4958 /***** Error-handling *****/
4959
4960 /* If we were in the middle of a character, dump out the
4961 partial character. */
4962 DECODE_OUTPUT_PARTIAL_CHAR (ch);
4963
4964 /* If we just saw a single-shift character, dump it out.
4965 This may dump out the wrong sort of single-shift character,
4966 but least it will give an indication that something went
4967 wrong. */
4968 if (flags & CODING_STATE_SS2)
4969 {
4970 DECODE_ADD_BINARY_CHAR (ISO_CODE_SS2, dst);
4971 flags &= ~CODING_STATE_SS2;
4972 }
4973 if (flags & CODING_STATE_SS3)
4974 {
4975 DECODE_ADD_BINARY_CHAR (ISO_CODE_SS3, dst);
4976 flags &= ~CODING_STATE_SS3;
4977 }
4978
4979 /***** Now handle the control characters. *****/
4980
4981 /* Handle CR/LF */
4982 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst);
4983
4984 flags &= CODING_STATE_ISO2022_LOCK;
4985
4986 if (!parse_iso2022_esc (coding_system, &str->iso2022, c, &flags, 1))
4987 DECODE_ADD_BINARY_CHAR (c, dst);
4988 }
4989 else
4990 { /* Graphic characters */
4991 Lisp_Object charset;
4992 int lb;
4993 int reg;
4994
4995 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst);
4996
4997 /* Now determine the charset. */
4998 reg = ((flags & CODING_STATE_SS2) ? 2
4999 : (flags & CODING_STATE_SS3) ? 3
5000 : !BYTE_ASCII_P (c) ? str->iso2022.register_right
5001 : str->iso2022.register_left);
5002 charset = str->iso2022.charset[reg];
5003
5004 /* Error checking: */
5005 if (! CHARSETP (charset)
5006 || str->iso2022.invalid_designated[reg]
5007 || (((c & 0x7F) == ' ' || (c & 0x7F) == ISO_CODE_DEL)
5008 && XCHARSET_CHARS (charset) == 94))
5009 /* Mrmph. We are trying to invoke a register that has no
5010 or an invalid charset in it, or trying to add a character
5011 outside the range of the charset. Insert that char literally
5012 to preserve it for the output. */
5013 {
5014 DECODE_OUTPUT_PARTIAL_CHAR (ch);
5015 DECODE_ADD_BINARY_CHAR (c, dst);
5016 }
5017
5018 else
5019 {
5020 /* Things are probably hunky-dorey. */
5021
5022 /* Fetch reverse charset, maybe. */
5023 if (((flags & CODING_STATE_R2L) &&
5024 XCHARSET_DIRECTION (charset) == CHARSET_LEFT_TO_RIGHT)
5025 ||
5026 (!(flags & CODING_STATE_R2L) &&
5027 XCHARSET_DIRECTION (charset) == CHARSET_RIGHT_TO_LEFT))
5028 {
5029 Lisp_Object new_charset =
5030 XCHARSET_REVERSE_DIRECTION_CHARSET (charset);
5031 if (!NILP (new_charset))
5032 charset = new_charset;
5033 }
5034
5035 lb = XCHARSET_LEADING_BYTE (charset);
5036 switch (XCHARSET_REP_BYTES (charset))
5037 {
5038 case 1: /* ASCII */
5039 DECODE_OUTPUT_PARTIAL_CHAR (ch);
5040 Dynarr_add (dst, c & 0x7F);
5041 break;
5042
5043 case 2: /* one-byte official */
5044 DECODE_OUTPUT_PARTIAL_CHAR (ch);
5045 Dynarr_add (dst, lb);
5046 Dynarr_add (dst, c | 0x80);
5047 break;
5048
5049 case 3: /* one-byte private or two-byte official */
5050 if (XCHARSET_PRIVATE_P (charset))
5051 {
5052 DECODE_OUTPUT_PARTIAL_CHAR (ch);
5053 Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_1);
5054 Dynarr_add (dst, lb);
5055 Dynarr_add (dst, c | 0x80);
5056 }
5057 else
5058 {
5059 if (ch)
5060 {
5061 Dynarr_add (dst, lb);
5062 Dynarr_add (dst, ch | 0x80);
5063 Dynarr_add (dst, c | 0x80);
5064 ch = 0;
5065 }
5066 else
5067 ch = c;
5068 }
5069 break;
5070
5071 default: /* two-byte private */
5072 if (ch)
5073 {
5074 Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_2);
5075 Dynarr_add (dst, lb);
5076 Dynarr_add (dst, ch | 0x80);
5077 Dynarr_add (dst, c | 0x80);
5078 ch = 0;
5079 }
5080 else
5081 ch = c;
5082 }
5083 }
5084
5085 if (!ch)
5086 flags &= CODING_STATE_ISO2022_LOCK;
5087 }
5088
5089 label_continue_loop:;
5090 }
5091
5092 if (flags & CODING_STATE_END)
5093 DECODE_OUTPUT_PARTIAL_CHAR (ch);
5094
5095 str->flags = flags;
5096 str->ch = ch;
5097 }
5098
5099
5100 /***** ISO2022 encoder *****/
5101
5102 /* Designate CHARSET into register REG. */
5103
5104 static void
5105 iso2022_designate (Lisp_Object charset, unsigned char reg,
5106 struct encoding_stream *str, unsigned_char_dynarr *dst)
5107 {
5108 static const char inter94[] = "()*+";
5109 static const char inter96[] = ",-./";
5110 int type;
5111 unsigned char final;
5112 Lisp_Object old_charset = str->iso2022.charset[reg];
5113
5114 str->iso2022.charset[reg] = charset;
5115 if (!CHARSETP (charset))
5116 /* charset might be an initial nil or t. */
5117 return;
5118 type = XCHARSET_TYPE (charset);
5119 final = XCHARSET_FINAL (charset);
5120 if (!str->iso2022.force_charset_on_output[reg] &&
5121 CHARSETP (old_charset) &&
5122 XCHARSET_TYPE (old_charset) == type &&
5123 XCHARSET_FINAL (old_charset) == final)
5124 return;
5125
5126 str->iso2022.force_charset_on_output[reg] = 0;
5127
5128 {
5129 charset_conversion_spec_dynarr *dyn =
5130 str->codesys->iso2022.output_conv;
5131
5132 if (dyn)
5133 {
5134 int i;
5135
5136 for (i = 0; i < Dynarr_length (dyn); i++)
5137 {
5138 struct charset_conversion_spec *spec = Dynarr_atp (dyn, i);
5139 if (EQ (charset, spec->from_charset))
5140 charset = spec->to_charset;
5141 }
5142 }
5143 }
5144
5145 Dynarr_add (dst, ISO_CODE_ESC);
5146 switch (type)
5147 {
5148 case CHARSET_TYPE_94:
5149 Dynarr_add (dst, inter94[reg]);
5150 break;
5151 case CHARSET_TYPE_96:
5152 Dynarr_add (dst, inter96[reg]);
5153 break;
5154 case CHARSET_TYPE_94X94:
5155 Dynarr_add (dst, '$');
5156 if (reg != 0
5157 || !(CODING_SYSTEM_ISO2022_SHORT (str->codesys))
5158 || final < '@'
5159 || final > 'B')
5160 Dynarr_add (dst, inter94[reg]);
5161 break;
5162 case CHARSET_TYPE_96X96:
5163 Dynarr_add (dst, '$');
5164 Dynarr_add (dst, inter96[reg]);
5165 break;
5166 }
5167 Dynarr_add (dst, final);
5168 }
5169
5170 static void
5171 ensure_normal_shift (struct encoding_stream *str, unsigned_char_dynarr *dst)
5172 {
5173 if (str->iso2022.register_left != 0)
5174 {
5175 Dynarr_add (dst, ISO_CODE_SI);
5176 str->iso2022.register_left = 0;
5177 }
5178 }
5179
5180 static void
5181 ensure_shift_out (struct encoding_stream *str, unsigned_char_dynarr *dst)
5182 {
5183 if (str->iso2022.register_left != 1)
5184 {
5185 Dynarr_add (dst, ISO_CODE_SO);
5186 str->iso2022.register_left = 1;
5187 }
5188 }
5189
5190 /* Convert internally-formatted data to ISO2022 format. */
5191
5192 static void
5193 encode_coding_iso2022 (Lstream *encoding, const Intbyte *src,
5194 unsigned_char_dynarr *dst, Bytecount n)
5195 {
5196 unsigned char charmask, c;
5197 unsigned char char_boundary;
5198 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
5199 unsigned int flags = str->flags;
5200 unsigned int ch = str->ch;
5201 Lisp_Coding_System *codesys = str->codesys;
5202 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys);
5203 int i;
5204 Lisp_Object charset;
5205 int half;
5206
5207 #ifdef ENABLE_COMPOSITE_CHARS
5208 /* flags for handling composite chars. We do a little switcharoo
5209 on the source while we're outputting the composite char. */
5210 Bytecount saved_n = 0;
5211 const unsigned char *saved_src = NULL;
5212 int in_composite = 0;
5213 #endif /* ENABLE_COMPOSITE_CHARS */
5214
5215 char_boundary = str->iso2022.current_char_boundary;
5216 charset = str->iso2022.current_charset;
5217 half = str->iso2022.current_half;
5218
5219 #ifdef ENABLE_COMPOSITE_CHARS
5220 back_to_square_n:
5221 #endif
5222 while (n--)
5223 {
5224 c = *src++;
5225
5226 if (BYTE_ASCII_P (c))
5227 { /* Processing ASCII character */
5228 ch = 0;
5229
5230 restore_left_to_right_direction (codesys, dst, &flags, 0);
5231
5232 /* Make sure G0 contains ASCII */
5233 if ((c > ' ' && c < ISO_CODE_DEL) ||
5234 !CODING_SYSTEM_ISO2022_NO_ASCII_CNTL (codesys))
5235 {
5236 ensure_normal_shift (str, dst);
5237 iso2022_designate (Vcharset_ascii, 0, str, dst);
5238 }
5239
5240 /* If necessary, restore everything to the default state
5241 at end-of-line */
5242 if (c == '\n' &&
5243 !(CODING_SYSTEM_ISO2022_NO_ASCII_EOL (codesys)))
5244 {
5245 restore_left_to_right_direction (codesys, dst, &flags, 0);
5246
5247 ensure_normal_shift (str, dst);
5248
5249 for (i = 0; i < 4; i++)
5250 {
5251 Lisp_Object initial_charset =
5252 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i);
5253 iso2022_designate (initial_charset, i, str, dst);
5254 }
5255 }
5256 if (c == '\n')
5257 {
5258 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT)
5259 Dynarr_add (dst, '\r');
5260 if (eol_type != EOL_CR)
5261 Dynarr_add (dst, c);
5262 }
5263 else
5264 {
5265 if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys)
5266 && fit_to_be_escape_quoted (c))
5267 Dynarr_add (dst, ISO_CODE_ESC);
5268 Dynarr_add (dst, c);
5269 }
5270 char_boundary = 1;
5271 }
5272
5273 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch))
5274 { /* Processing Leading Byte */
5275 ch = 0;
5276 charset = CHARSET_BY_LEADING_BYTE (c);
5277 if (LEADING_BYTE_PREFIX_P(c))
5278 ch = c;
5279 else if (!EQ (charset, Vcharset_control_1)
5280 #ifdef ENABLE_COMPOSITE_CHARS
5281 && !EQ (charset, Vcharset_composite)
5282 #endif
5283 )
5284 {
5285 int reg;
5286
5287 ensure_correct_direction (XCHARSET_DIRECTION (charset),
5288 codesys, dst, &flags, 0);
5289
5290 /* Now determine which register to use. */
5291 reg = -1;
5292 for (i = 0; i < 4; i++)
5293 {
5294 if (EQ (charset, str->iso2022.charset[i]) ||
5295 EQ (charset,
5296 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i)))
5297 {
5298 reg = i;
5299 break;
5300 }
5301 }
5302
5303 if (reg == -1)
5304 {
5305 if (XCHARSET_GRAPHIC (charset) != 0)
5306 {
5307 if (!NILP (str->iso2022.charset[1]) &&
5308 (!CODING_SYSTEM_ISO2022_SEVEN (codesys) ||
5309 CODING_SYSTEM_ISO2022_LOCK_SHIFT (codesys)))
5310 reg = 1;
5311 else if (!NILP (str->iso2022.charset[2]))
5312 reg = 2;
5313 else if (!NILP (str->iso2022.charset[3]))
5314 reg = 3;
5315 else
5316 reg = 0;
5317 }
5318 else
5319 reg = 0;
5320 }
5321
5322 iso2022_designate (charset, reg, str, dst);
5323
5324 /* Now invoke that register. */
5325 switch (reg)
5326 {
5327 case 0:
5328 ensure_normal_shift (str, dst);
5329 half = 0;
5330 break;
5331
5332 case 1:
5333 if (CODING_SYSTEM_ISO2022_SEVEN (codesys))
5334 {
5335 ensure_shift_out (str, dst);
5336 half = 0;
5337 }
5338 else
5339 half = 1;
5340 break;
5341
5342 case 2:
5343 if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys))
5344 {
5345 Dynarr_add (dst, ISO_CODE_ESC);
5346 Dynarr_add (dst, 'N');
5347 half = 0;
5348 }
5349 else
5350 {
5351 Dynarr_add (dst, ISO_CODE_SS2);
5352 half = 1;
5353 }
5354 break;
5355
5356 case 3:
5357 if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys))
5358 {
5359 Dynarr_add (dst, ISO_CODE_ESC);
5360 Dynarr_add (dst, 'O');
5361 half = 0;
5362 }
5363 else
5364 {
5365 Dynarr_add (dst, ISO_CODE_SS3);
5366 half = 1;
5367 }
5368 break;
5369
5370 default:
5371 abort ();
5372 }
5373 }
5374 char_boundary = 0;
5375 }
5376 else
5377 { /* Processing Non-ASCII character */
5378 charmask = (half == 0 ? 0x7F : 0xFF);
5379 char_boundary = 1;
5380 if (EQ (charset, Vcharset_control_1))
5381 {
5382 if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys)
5383 && fit_to_be_escape_quoted (c))
5384 Dynarr_add (dst, ISO_CODE_ESC);
5385 /* you asked for it ... */
5386 Dynarr_add (dst, c - 0x20);
5387 }
5388 else
5389 {
5390 switch (XCHARSET_REP_BYTES (charset))
5391 {
5392 case 2:
5393 Dynarr_add (dst, c & charmask);
5394 break;
5395 case 3:
5396 if (XCHARSET_PRIVATE_P (charset))
5397 {
5398 Dynarr_add (dst, c & charmask);
5399 ch = 0;
5400 }
5401 else if (ch)
5402 {
5403 #ifdef ENABLE_COMPOSITE_CHARS
5404 if (EQ (charset, Vcharset_composite))
5405 {
5406 if (in_composite)
5407 {
5408 /* #### Bother! We don't know how to
5409 handle this yet. */
5410 Dynarr_add (dst, '~');
5411 }
5412 else
5413 {
5414 Emchar emch = MAKE_CHAR (Vcharset_composite,
5415 ch & 0x7F, c & 0x7F);
5416 Lisp_Object lstr = composite_char_string (emch);
5417 saved_n = n;
5418 saved_src = src;
5419 in_composite = 1;
5420 src = XSTRING_DATA (lstr);
5421 n = XSTRING_LENGTH (lstr);
5422 Dynarr_add (dst, ISO_CODE_ESC);
5423 Dynarr_add (dst, '0'); /* start composing */
5424 }
5425 }
5426 else
5427 #endif /* ENABLE_COMPOSITE_CHARS */
5428 {
5429 Dynarr_add (dst, ch & charmask);
5430 Dynarr_add (dst, c & charmask);
5431 }
5432 ch = 0;
5433 }
5434 else
5435 {
5436 ch = c;
5437 char_boundary = 0;
5438 }
5439 break;
5440 case 4:
5441 if (ch)
5442 {
5443 Dynarr_add (dst, ch & charmask);
5444 Dynarr_add (dst, c & charmask);
5445 ch = 0;
5446 }
5447 else
5448 {
5449 ch = c;
5450 char_boundary = 0;
5451 }
5452 break;
5453 default:
5454 abort ();
5455 }
5456 }
5457 }
5458 }
5459
5460 #ifdef ENABLE_COMPOSITE_CHARS
5461 if (in_composite)
5462 {
5463 n = saved_n;
5464 src = saved_src;
5465 in_composite = 0;
5466 Dynarr_add (dst, ISO_CODE_ESC);
5467 Dynarr_add (dst, '1'); /* end composing */
5468 goto back_to_square_n; /* Wheeeeeeeee ..... */
5469 }
5470 #endif /* ENABLE_COMPOSITE_CHARS */
5471
5472 if (char_boundary && flags & CODING_STATE_END)
5473 {
5474 restore_left_to_right_direction (codesys, dst, &flags, 0);
5475 ensure_normal_shift (str, dst);
5476 for (i = 0; i < 4; i++)
5477 {
5478 Lisp_Object initial_charset =
5479 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i);
5480 iso2022_designate (initial_charset, i, str, dst);
5481 }
5482 }
5483
5484 str->flags = flags;
5485 str->ch = ch;
5486 str->iso2022.current_char_boundary = char_boundary;
5487 str->iso2022.current_charset = charset;
5488 str->iso2022.current_half = half;
5489
5490 /* Verbum caro factum est! */
5491 }
5492 #endif /* MULE */
5493
5494 /************************************************************************/
5495 /* No-conversion methods */
5496 /************************************************************************/
5497
5498 /* This is used when reading in "binary" files -- i.e. files that may
5499 contain all 256 possible byte values and that are not to be
5500 interpreted as being in any particular decoding. */
5501 static void
5502 decode_coding_no_conversion (Lstream *decoding, const Extbyte *src,
5503 unsigned_char_dynarr *dst, Bytecount n)
5504 {
5505 struct decoding_stream *str = DECODING_STREAM_DATA (decoding);
5506 unsigned int flags = str->flags;
5507 unsigned int ch = str->ch;
5508 eol_type_t eol_type = str->eol_type;
5509
5510 while (n--)
5511 {
5512 unsigned char c = *(unsigned char *)src++;
5513
5514 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst);
5515 DECODE_ADD_BINARY_CHAR (c, dst);
5516 label_continue_loop:;
5517 }
5518
5519 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst);
5520
5521 str->flags = flags;
5522 str->ch = ch;
5523 }
5524
5525 static void
5526 encode_coding_no_conversion (Lstream *encoding, const Intbyte *src,
5527 unsigned_char_dynarr *dst, Bytecount n)
5528 {
5529 unsigned char c;
5530 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding);
5531 unsigned int flags = str->flags;
5532 unsigned int ch = str->ch;
5533 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys);
5534
5535 while (n--)
5536 {
5537 c = *src++;
5538 if (c == '\n')
5539 {
5540 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT)
5541 Dynarr_add (dst, '\r');
5542 if (eol_type != EOL_CR)
5543 Dynarr_add (dst, '\n');
5544 ch = 0;
5545 }
5546 else if (BYTE_ASCII_P (c))
5547 {
5548 assert (ch == 0);
5549 Dynarr_add (dst, c);
5550 }
5551 else if (INTBYTE_LEADING_BYTE_P (c))
5552 {
5553 assert (ch == 0);
5554 if (c == LEADING_BYTE_LATIN_ISO8859_1 ||
5555 c == LEADING_BYTE_CONTROL_1)
5556 ch = c;
5557 else
5558 Dynarr_add (dst, '~'); /* untranslatable character */
5559 }
5560 else
5561 {
5562 if (ch == LEADING_BYTE_LATIN_ISO8859_1)
5563 Dynarr_add (dst, c);
5564 else if (ch == LEADING_BYTE_CONTROL_1)
5565 {
5566 assert (c < 0xC0);
5567 Dynarr_add (dst, c - 0x20);
5568 }
5569 /* else it should be the second or third byte of an
5570 untranslatable character, so ignore it */
5571 ch = 0;
5572 }
5573 }
5574
5575 str->flags = flags;
5576 str->ch = ch;
5577 }
5578
5579
5580
5581 /************************************************************************/ 4447 /************************************************************************/
5582 /* Initialization */ 4448 /* Initialization */
5583 /************************************************************************/ 4449 /************************************************************************/
5584 4450
5585 void 4451 void
5586 syms_of_file_coding (void) 4452 syms_of_file_coding (void)
5587 { 4453 {
5588 INIT_LRECORD_IMPLEMENTATION (coding_system); 4454 INIT_LRECORD_IMPLEMENTATION (coding_system);
5589 4455
4456 DEFSUBR (Fvalid_coding_system_type_p);
4457 DEFSUBR (Fcoding_system_type_list);
5590 DEFSUBR (Fcoding_system_p); 4458 DEFSUBR (Fcoding_system_p);
5591 DEFSUBR (Ffind_coding_system); 4459 DEFSUBR (Ffind_coding_system);
5592 DEFSUBR (Fget_coding_system); 4460 DEFSUBR (Fget_coding_system);
5593 DEFSUBR (Fcoding_system_list); 4461 DEFSUBR (Fcoding_system_list);
5594 DEFSUBR (Fcoding_system_name); 4462 DEFSUBR (Fcoding_system_name);
5597 DEFSUBR (Fcoding_system_canonical_name_p); 4465 DEFSUBR (Fcoding_system_canonical_name_p);
5598 DEFSUBR (Fcoding_system_alias_p); 4466 DEFSUBR (Fcoding_system_alias_p);
5599 DEFSUBR (Fcoding_system_aliasee); 4467 DEFSUBR (Fcoding_system_aliasee);
5600 DEFSUBR (Fdefine_coding_system_alias); 4468 DEFSUBR (Fdefine_coding_system_alias);
5601 DEFSUBR (Fsubsidiary_coding_system); 4469 DEFSUBR (Fsubsidiary_coding_system);
4470 DEFSUBR (Fcoding_system_base);
4471 DEFSUBR (Fcoding_system_used_for_io);
5602 4472
5603 DEFSUBR (Fcoding_system_type); 4473 DEFSUBR (Fcoding_system_type);
5604 DEFSUBR (Fcoding_system_doc_string); 4474 DEFSUBR (Fcoding_system_description);
5605 DEFSUBR (Fcoding_system_property); 4475 DEFSUBR (Fcoding_system_property);
5606 4476
5607 DEFSUBR (Fcoding_category_list); 4477 DEFSUBR (Fcoding_category_list);
5608 DEFSUBR (Fset_coding_priority_list); 4478 DEFSUBR (Fset_coding_priority_list);
5609 DEFSUBR (Fcoding_priority_list); 4479 DEFSUBR (Fcoding_priority_list);
5611 DEFSUBR (Fcoding_category_system); 4481 DEFSUBR (Fcoding_category_system);
5612 4482
5613 DEFSUBR (Fdetect_coding_region); 4483 DEFSUBR (Fdetect_coding_region);
5614 DEFSUBR (Fdecode_coding_region); 4484 DEFSUBR (Fdecode_coding_region);
5615 DEFSUBR (Fencode_coding_region); 4485 DEFSUBR (Fencode_coding_region);
5616 #ifdef MULE
5617 DEFSUBR (Fdecode_shift_jis_char);
5618 DEFSUBR (Fencode_shift_jis_char);
5619 DEFSUBR (Fdecode_big5_char);
5620 DEFSUBR (Fencode_big5_char);
5621 DEFSUBR (Fset_ucs_char);
5622 DEFSUBR (Fucs_char);
5623 DEFSUBR (Fset_char_ucs);
5624 DEFSUBR (Fchar_ucs);
5625 #endif /* MULE */
5626 DEFSYMBOL_MULTIWORD_PREDICATE (Qcoding_systemp); 4486 DEFSYMBOL_MULTIWORD_PREDICATE (Qcoding_systemp);
5627 DEFSYMBOL (Qno_conversion); 4487 DEFSYMBOL (Qno_conversion);
4488 DEFSYMBOL (Qconvert_eol);
4489 DEFSYMBOL (Qconvert_eol_autodetect);
4490 DEFSYMBOL (Qconvert_eol_lf);
4491 DEFSYMBOL (Qconvert_eol_cr);
4492 DEFSYMBOL (Qconvert_eol_crlf);
5628 DEFSYMBOL (Qraw_text); 4493 DEFSYMBOL (Qraw_text);
5629 #ifdef MULE 4494
5630 DEFSYMBOL (Qbig5);
5631 DEFSYMBOL (Qshift_jis);
5632 defsymbol (&Qucs4, "ucs-4");
5633 defsymbol (&Qutf8, "utf-8");
5634 DEFSYMBOL (Qccl);
5635 DEFSYMBOL (Qiso2022);
5636 #endif /* MULE */
5637 DEFSYMBOL (Qmnemonic); 4495 DEFSYMBOL (Qmnemonic);
5638 DEFSYMBOL (Qeol_type); 4496 DEFSYMBOL (Qeol_type);
5639 DEFSYMBOL (Qpost_read_conversion); 4497 DEFSYMBOL (Qpost_read_conversion);
5640 DEFSYMBOL (Qpre_write_conversion); 4498 DEFSYMBOL (Qpre_write_conversion);
4499
4500 DEFSYMBOL (Qtranslation_table_for_decode);
4501 DEFSYMBOL (Qtranslation_table_for_encode);
4502 DEFSYMBOL (Qsafe_chars);
4503 DEFSYMBOL (Qsafe_charsets);
4504 DEFSYMBOL (Qmime_charset);
4505 DEFSYMBOL (Qvalid_codes);
5641 4506
5642 DEFSYMBOL (Qcr); 4507 DEFSYMBOL (Qcr);
5643 DEFSYMBOL (Qlf); 4508 DEFSYMBOL (Qlf);
5644 DEFSYMBOL (Qcrlf); 4509 DEFSYMBOL (Qcrlf);
5645 DEFSYMBOL (Qeol_cr); 4510 DEFSYMBOL (Qeol_cr);
5646 DEFSYMBOL (Qeol_lf); 4511 DEFSYMBOL (Qeol_lf);
5647 DEFSYMBOL (Qeol_crlf); 4512 DEFSYMBOL (Qeol_crlf);
5648 #ifdef MULE
5649 DEFSYMBOL (Qcharset_g0);
5650 DEFSYMBOL (Qcharset_g1);
5651 DEFSYMBOL (Qcharset_g2);
5652 DEFSYMBOL (Qcharset_g3);
5653 DEFSYMBOL (Qforce_g0_on_output);
5654 DEFSYMBOL (Qforce_g1_on_output);
5655 DEFSYMBOL (Qforce_g2_on_output);
5656 DEFSYMBOL (Qforce_g3_on_output);
5657 DEFSYMBOL (Qno_iso6429);
5658 DEFSYMBOL (Qinput_charset_conversion);
5659 DEFSYMBOL (Qoutput_charset_conversion);
5660
5661 DEFSYMBOL (Qshort);
5662 DEFSYMBOL (Qno_ascii_eol);
5663 DEFSYMBOL (Qno_ascii_cntl);
5664 DEFSYMBOL (Qseven);
5665 DEFSYMBOL (Qlock_shift);
5666 DEFSYMBOL (Qescape_quoted);
5667 #endif /* MULE */
5668 DEFSYMBOL (Qencode); 4513 DEFSYMBOL (Qencode);
5669 DEFSYMBOL (Qdecode); 4514 DEFSYMBOL (Qdecode);
5670 4515
5671 #ifdef MULE 4516 DEFSYMBOL (Qnear_certainty);
5672 defsymbol (&coding_category_symbol[CODING_CATEGORY_SHIFT_JIS], 4517 DEFSYMBOL (Qquite_probable);
5673 "shift-jis"); 4518 DEFSYMBOL (Qsomewhat_likely);
5674 defsymbol (&coding_category_symbol[CODING_CATEGORY_BIG5], 4519 DEFSYMBOL (Qas_likely_as_unlikely);
5675 "big5"); 4520 DEFSYMBOL (Qsomewhat_unlikely);
5676 defsymbol (&coding_category_symbol[CODING_CATEGORY_UCS4], 4521 DEFSYMBOL (Qquite_improbable);
5677 "ucs-4"); 4522 DEFSYMBOL (Qnearly_impossible);
5678 defsymbol (&coding_category_symbol[CODING_CATEGORY_UTF8], 4523
5679 "utf-8"); 4524 DEFSYMBOL (Qdo_eol);
5680 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_7], 4525 DEFSYMBOL (Qdo_coding);
5681 "iso-7"); 4526
5682 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_DESIGNATE], 4527 DEFSYMBOL (Qcanonicalize_after_coding);
5683 "iso-8-designate"); 4528
5684 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_1], 4529 DEFSYMBOL (Qescape_quoted);
5685 "iso-8-1"); 4530
5686 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_2], 4531 #ifdef HAVE_ZLIB
5687 "iso-8-2"); 4532 DEFSYMBOL (Qgzip);
5688 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_LOCK_SHIFT], 4533 #endif
5689 "iso-lock-shift"); 4534
5690 #endif /* MULE */ 4535 /* WARNING: The existing categories are intimately tied to the function
5691 defsymbol (&coding_category_symbol[CODING_CATEGORY_NO_CONVERSION], 4536 `coding-system-category' in coding.el. If you change a category, or
5692 "no-conversion"); 4537 change the layout of any coding system associated with a category, you
4538 need to check that function and make sure it's written properly. */
4539
4540 #ifdef HAVE_DEFAULT_EOL_DETECTION
4541 Fprovide (intern ("unix-default-eol-detection"));
4542 #endif
5693 } 4543 }
5694 4544
5695 void 4545 void
5696 lstream_type_create_file_coding (void) 4546 lstream_type_create_file_coding (void)
5697 { 4547 {
5698 LSTREAM_HAS_METHOD (decoding, reader); 4548 LSTREAM_HAS_METHOD (coding, reader);
5699 LSTREAM_HAS_METHOD (decoding, writer); 4549 LSTREAM_HAS_METHOD (coding, writer);
5700 LSTREAM_HAS_METHOD (decoding, rewinder); 4550 LSTREAM_HAS_METHOD (coding, rewinder);
5701 LSTREAM_HAS_METHOD (decoding, seekable_p); 4551 LSTREAM_HAS_METHOD (coding, seekable_p);
5702 LSTREAM_HAS_METHOD (decoding, flusher); 4552 LSTREAM_HAS_METHOD (coding, marker);
5703 LSTREAM_HAS_METHOD (decoding, closer); 4553 LSTREAM_HAS_METHOD (coding, flusher);
5704 LSTREAM_HAS_METHOD (decoding, marker); 4554 LSTREAM_HAS_METHOD (coding, closer);
5705 4555 LSTREAM_HAS_METHOD (coding, finalizer);
5706 LSTREAM_HAS_METHOD (encoding, reader); 4556 }
5707 LSTREAM_HAS_METHOD (encoding, writer); 4557
5708 LSTREAM_HAS_METHOD (encoding, rewinder); 4558 void
5709 LSTREAM_HAS_METHOD (encoding, seekable_p); 4559 coding_system_type_create (void)
5710 LSTREAM_HAS_METHOD (encoding, flusher); 4560 {
5711 LSTREAM_HAS_METHOD (encoding, closer); 4561 int i;
5712 LSTREAM_HAS_METHOD (encoding, marker); 4562
4563 staticpro (&Vcoding_system_hash_table);
4564 Vcoding_system_hash_table =
4565 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ);
4566
4567 the_coding_system_type_entry_dynarr = Dynarr_new (coding_system_type_entry);
4568 dump_add_root_struct_ptr (&the_coding_system_type_entry_dynarr,
4569 &csted_description);
4570
4571 Vcoding_system_type_list = Qnil;
4572 staticpro (&Vcoding_system_type_list);
4573
4574 /* Initialize to something reasonable ... */
4575 for (i = 0; i < MAX_DETECTOR_CATEGORIES; i++)
4576 {
4577 coding_category_system[i] = Qnil;
4578 dump_add_root_object (&coding_category_system[i]);
4579 coding_category_by_priority[i] = i;
4580 }
4581
4582 dump_add_opaque (coding_category_by_priority,
4583 sizeof (coding_category_by_priority));
4584
4585 all_coding_detectors = Dynarr_new2 (detector_dynarr, struct detector);
4586 dump_add_root_struct_ptr (&all_coding_detectors,
4587 &detector_dynarr_description);
4588
4589 dump_add_opaque_int (&coding_system_tick);
4590 dump_add_opaque_int (&coding_detector_count);
4591 dump_add_opaque_int (&coding_detector_category_count);
4592
4593 INITIALIZE_CODING_SYSTEM_TYPE (no_conversion,
4594 "no-conversion-coding-system-p");
4595 CODING_SYSTEM_HAS_METHOD (no_conversion, convert);
4596
4597 INITIALIZE_DETECTOR (no_conversion);
4598 DETECTOR_HAS_METHOD (no_conversion, detect);
4599 INITIALIZE_DETECTOR_CATEGORY (no_conversion, no_conversion);
4600
4601 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (convert_eol,
4602 "convert-eol-coding-system-p");
4603 CODING_SYSTEM_HAS_METHOD (convert_eol, print);
4604 CODING_SYSTEM_HAS_METHOD (convert_eol, convert);
4605 CODING_SYSTEM_HAS_METHOD (convert_eol, getprop);
4606 CODING_SYSTEM_HAS_METHOD (convert_eol, putprop);
4607 CODING_SYSTEM_HAS_METHOD (convert_eol, conversion_end_type);
4608 CODING_SYSTEM_HAS_METHOD (convert_eol, canonicalize_after_coding);
4609 CODING_SYSTEM_HAS_METHOD (convert_eol, init_coding_stream);
4610
4611 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (undecided,
4612 "undecided-coding-system-p");
4613 CODING_SYSTEM_HAS_METHOD (undecided, init);
4614 CODING_SYSTEM_HAS_METHOD (undecided, mark);
4615 CODING_SYSTEM_HAS_METHOD (undecided, print);
4616 CODING_SYSTEM_HAS_METHOD (undecided, convert);
4617 CODING_SYSTEM_HAS_METHOD (undecided, putprop);
4618 CODING_SYSTEM_HAS_METHOD (undecided, getprop);
4619 CODING_SYSTEM_HAS_METHOD (undecided, init_coding_stream);
4620 CODING_SYSTEM_HAS_METHOD (undecided, rewind_coding_stream);
4621 CODING_SYSTEM_HAS_METHOD (undecided, finalize_coding_stream);
4622 CODING_SYSTEM_HAS_METHOD (undecided, mark_coding_stream);
4623 CODING_SYSTEM_HAS_METHOD (undecided, canonicalize);
4624 CODING_SYSTEM_HAS_METHOD (undecided, canonicalize_after_coding);
4625
4626 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (chain, "chain-coding-system-p");
4627
4628 CODING_SYSTEM_HAS_METHOD (chain, print);
4629 CODING_SYSTEM_HAS_METHOD (chain, canonicalize);
4630 CODING_SYSTEM_HAS_METHOD (chain, init);
4631 CODING_SYSTEM_HAS_METHOD (chain, mark);
4632 CODING_SYSTEM_HAS_METHOD (chain, mark_coding_stream);
4633 CODING_SYSTEM_HAS_METHOD (chain, convert);
4634 CODING_SYSTEM_HAS_METHOD (chain, rewind_coding_stream);
4635 CODING_SYSTEM_HAS_METHOD (chain, finalize_coding_stream);
4636 CODING_SYSTEM_HAS_METHOD (chain, finalize);
4637 CODING_SYSTEM_HAS_METHOD (chain, putprop);
4638 CODING_SYSTEM_HAS_METHOD (chain, getprop);
4639 CODING_SYSTEM_HAS_METHOD (chain, conversion_end_type);
4640 CODING_SYSTEM_HAS_METHOD (chain, canonicalize_after_coding);
4641
4642 #ifdef DEBUG_XEMACS
4643 INITIALIZE_CODING_SYSTEM_TYPE (internal, "internal-coding-system-p");
4644 CODING_SYSTEM_HAS_METHOD (internal, convert);
4645 #endif
4646
4647 #ifdef HAVE_ZLIB
4648 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (gzip, "gzip-coding-system-p");
4649 CODING_SYSTEM_HAS_METHOD (gzip, conversion_end_type);
4650 CODING_SYSTEM_HAS_METHOD (gzip, convert);
4651 CODING_SYSTEM_HAS_METHOD (gzip, init);
4652 CODING_SYSTEM_HAS_METHOD (gzip, print);
4653 CODING_SYSTEM_HAS_METHOD (gzip, init_coding_stream);
4654 CODING_SYSTEM_HAS_METHOD (gzip, rewind_coding_stream);
4655 CODING_SYSTEM_HAS_METHOD (gzip, putprop);
4656 CODING_SYSTEM_HAS_METHOD (gzip, getprop);
4657 #endif
4658 }
4659
4660 void
4661 reinit_coding_system_type_create (void)
4662 {
4663 REINITIALIZE_CODING_SYSTEM_TYPE (no_conversion);
4664 REINITIALIZE_CODING_SYSTEM_TYPE (convert_eol);
4665 REINITIALIZE_CODING_SYSTEM_TYPE (undecided);
4666 REINITIALIZE_CODING_SYSTEM_TYPE (chain);
4667 #if 0
4668 REINITIALIZE_CODING_SYSTEM_TYPE (text_file_wrapper);
4669 #endif /* 0 */
4670 #ifdef DEBUG_XEMACS
4671 REINITIALIZE_CODING_SYSTEM_TYPE (internal);
4672 #endif
4673 #ifdef HAVE_ZLIB
4674 REINITIALIZE_CODING_SYSTEM_TYPE (gzip);
4675 #endif
4676 }
4677
4678 void
4679 reinit_vars_of_file_coding (void)
4680 {
5713 } 4681 }
5714 4682
5715 void 4683 void
5716 vars_of_file_coding (void) 4684 vars_of_file_coding (void)
5717 { 4685 {
5718 int i; 4686 reinit_vars_of_file_coding ();
5719 4687
5720 fcd = xnew (struct file_coding_dump); 4688 /* We always have file-coding support */
5721 dump_add_root_struct_ptr (&fcd, &fcd_description);
5722
5723 /* Initialize to something reasonable ... */
5724 for (i = 0; i < CODING_CATEGORY_LAST; i++)
5725 {
5726 fcd->coding_category_system[i] = Qnil;
5727 fcd->coding_category_by_priority[i] = i;
5728 }
5729
5730 Fprovide (intern ("file-coding")); 4689 Fprovide (intern ("file-coding"));
5731 4690
5732 DEFVAR_LISP ("keyboard-coding-system", &Vkeyboard_coding_system /* 4691 DEFVAR_LISP ("keyboard-coding-system", &Vkeyboard_coding_system /*
5733 Coding system used for TTY keyboard input. 4692 Coding system used for TTY keyboard input.
5734 Not used under a windowing system. 4693 Not used under a windowing system.
5766 Coding system used to convert pathnames when accessing files. 4725 Coding system used to convert pathnames when accessing files.
5767 */ ); 4726 */ );
5768 Vfile_name_coding_system = Qnil; 4727 Vfile_name_coding_system = Qnil;
5769 4728
5770 DEFVAR_BOOL ("enable-multibyte-characters", &enable_multibyte_characters /* 4729 DEFVAR_BOOL ("enable-multibyte-characters", &enable_multibyte_characters /*
5771 Non-nil means the buffer contents are regarded as multi-byte form 4730 Setting this has no effect. It is purely for FSF compatibility.
5772 of characters, not a binary code. This affects the display, file I/O,
5773 and behaviors of various editing commands.
5774
5775 Setting this to nil does not do anything.
5776 */ ); 4731 */ );
5777 enable_multibyte_characters = 1; 4732 enable_multibyte_characters = 1;
4733
4734 Vchain_canonicalize_hash_table =
4735 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQUAL);
4736 staticpro (&Vchain_canonicalize_hash_table);
4737
4738 #ifdef DEBUG_XEMACS
4739 DEFVAR_LISP ("debug-coding-detection", &Vdebug_coding_detection /*
4740 If non-nil, display debug information about detection operations in progress.
4741 Information is displayed on stderr.
4742 */ );
4743 Vdebug_coding_detection = Qnil;
4744 #endif
5778 } 4745 }
5779 4746
5780 void 4747 void
5781 complex_vars_of_file_coding (void) 4748 complex_vars_of_file_coding (void)
5782 { 4749 {
5783 staticpro (&Vcoding_system_hash_table); 4750 Fmake_coding_system
5784 Vcoding_system_hash_table = 4751 (Qconvert_eol_cr, Qconvert_eol,
5785 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ); 4752 build_msg_string ("Convert CR to LF"),
5786 4753 nconc2 (list6 (Qdocumentation,
5787 the_codesys_prop_dynarr = Dynarr_new (codesys_prop); 4754 build_msg_string (
5788 dump_add_root_struct_ptr (&the_codesys_prop_dynarr, &codesys_prop_dynarr_description); 4755 "Converts CR (used to mark the end of a line on Macintosh systems) to LF\n"
5789 4756 "(used internally and under Unix to mark the end of a line)."),
5790 #define DEFINE_CODESYS_PROP(Prop_Type, Sym) do \ 4757 Qmnemonic, build_string ("CR->LF"),
5791 { \ 4758 Qsubtype, Qcr),
5792 struct codesys_prop csp; \ 4759 /* VERY IMPORTANT! Tell make-coding-system not to generate
5793 csp.sym = (Sym); \ 4760 subsidiaries -- it needs the coding systems we're creating
5794 csp.prop_type = (Prop_Type); \ 4761 to do so! */
5795 Dynarr_add (the_codesys_prop_dynarr, csp); \ 4762 list2 (Qeol_type, Qlf)));
5796 } while (0) 4763
5797 4764 Fmake_coding_system
5798 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qmnemonic); 4765 (Qconvert_eol_lf, Qconvert_eol,
5799 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_type); 4766 build_msg_string ("Convert LF to LF (do nothing)"),
5800 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_cr); 4767 nconc2 (list6 (Qdocumentation,
5801 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_crlf); 4768 build_msg_string (
5802 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_lf); 4769 "Do nothing."),
5803 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpost_read_conversion); 4770 Qmnemonic, build_string ("LF->LF"),
5804 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpre_write_conversion); 4771 Qsubtype, Qlf),
5805 #ifdef MULE 4772 /* VERY IMPORTANT! Tell make-coding-system not to generate
5806 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g0); 4773 subsidiaries -- it needs the coding systems we're creating
5807 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g1); 4774 to do so! */
5808 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g2); 4775 list2 (Qeol_type, Qlf)));
5809 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g3); 4776
5810 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g0_on_output); 4777 Fmake_coding_system
5811 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g1_on_output); 4778 (Qconvert_eol_crlf, Qconvert_eol,
5812 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g2_on_output); 4779 build_msg_string ("Convert CRLF to LF"),
5813 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g3_on_output); 4780 nconc2 (list6 (Qdocumentation,
5814 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qshort); 4781 build_msg_string (
5815 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_eol); 4782 "Converts CR+LF (used to mark the end of a line on Macintosh systems) to LF\n"
5816 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_cntl); 4783 "(used internally and under Unix to mark the end of a line)."),
5817 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qseven); 4784 Qmnemonic, build_string ("CRLF->LF"),
5818 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qlock_shift); 4785 Qsubtype, Qcrlf),
5819 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_iso6429); 4786 /* VERY IMPORTANT! Tell make-coding-system not to generate
5820 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qescape_quoted); 4787 subsidiaries -- it needs the coding systems we're creating
5821 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qinput_charset_conversion); 4788 to do so! */
5822 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qoutput_charset_conversion); 4789 list2 (Qeol_type, Qlf)));
5823 4790
5824 DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qencode); 4791 Fmake_coding_system
5825 DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qdecode); 4792 (Qconvert_eol_autodetect, Qconvert_eol,
5826 #endif /* MULE */ 4793 build_msg_string ("Autodetect EOL type"),
4794 nconc2 (list6 (Qdocumentation,
4795 build_msg_string (
4796 "Autodetect the end-of-line type."),
4797 Qmnemonic, build_string ("Auto-EOL"),
4798 Qsubtype, Qautodetect),
4799 /* VERY IMPORTANT! Tell make-coding-system not to generate
4800 subsidiaries -- it needs the coding systems we're creating
4801 to do so! */
4802 list2 (Qeol_type, Qlf)));
4803
4804 Fmake_coding_system
4805 (Qundecided, Qundecided,
4806 build_msg_string ("Undecided (auto-detect)"),
4807 nconc2 (list4 (Qdocumentation,
4808 build_msg_string
4809 ("Automatically detects the correct encoding."),
4810 Qmnemonic, build_string ("Auto")),
4811 list6 (Qdo_eol, Qt, Qdo_coding, Qt,
4812 /* We do EOL detection ourselves so we don't need to be
4813 wrapped in an EOL detector. (It doesn't actually hurt,
4814 though, I don't think.) */
4815 Qeol_type, Qlf)));
4816
4817 Fmake_coding_system
4818 (intern ("undecided-dos"), Qundecided,
4819 build_msg_string ("Undecided (auto-detect) (CRLF)"),
4820 nconc2 (list4 (Qdocumentation,
4821 build_msg_string
4822 ("Automatically detects the correct encoding; EOL type of CRLF forced."),
4823 Qmnemonic, build_string ("Auto")),
4824 list4 (Qdo_coding, Qt,
4825 Qeol_type, Qcrlf)));
4826
4827 Fmake_coding_system
4828 (intern ("undecided-unix"), Qundecided,
4829 build_msg_string ("Undecided (auto-detect) (LF)"),
4830 nconc2 (list4 (Qdocumentation,
4831 build_msg_string
4832 ("Automatically detects the correct encoding; EOL type of LF forced."),
4833 Qmnemonic, build_string ("Auto")),
4834 list4 (Qdo_coding, Qt,
4835 Qeol_type, Qlf)));
4836
4837 Fmake_coding_system
4838 (intern ("undecided-mac"), Qundecided,
4839 build_msg_string ("Undecided (auto-detect) (CR)"),
4840 nconc2 (list4 (Qdocumentation,
4841 build_msg_string
4842 ("Automatically detects the correct encoding; EOL type of CR forced."),
4843 Qmnemonic, build_string ("Auto")),
4844 list4 (Qdo_coding, Qt,
4845 Qeol_type, Qcr)));
4846
5827 /* Need to create this here or we're really screwed. */ 4847 /* Need to create this here or we're really screwed. */
5828 Fmake_coding_system 4848 Fmake_coding_system
5829 (Qraw_text, Qno_conversion, 4849 (Qraw_text, Qno_conversion,
5830 build_string ("Raw text, which means it converts only line-break-codes."), 4850 build_msg_string ("Raw Text"),
5831 list2 (Qmnemonic, build_string ("Raw"))); 4851 list4 (Qdocumentation,
4852 build_msg_string ("Raw text converts only line-break codes, and acts otherwise like `binary'."),
4853 Qmnemonic, build_string ("Raw")));
5832 4854
5833 Fmake_coding_system 4855 Fmake_coding_system
5834 (Qbinary, Qno_conversion, 4856 (Qbinary, Qno_conversion,
5835 build_string ("Binary, which means it does not convert anything."), 4857 build_msg_string ("Binary"),
5836 list4 (Qeol_type, Qlf, 4858 list6 (Qdocumentation,
4859 build_msg_string (
4860 "This coding system is as close as it comes to doing no conversion.\n"
4861 "On input, each byte is converted directly into the character\n"
4862 "with the corresponding code -- i.e. from the `ascii', `control-1',\n"
4863 "or `latin-1' character sets. On output, these characters are\n"
4864 "converted back to the corresponding bytes, and other characters\n"
4865 "are converted to the default character, i.e. `~'."),
4866 Qeol_type, Qlf,
5837 Qmnemonic, build_string ("Binary"))); 4867 Qmnemonic, build_string ("Binary")));
5838 4868
5839 Fdefine_coding_system_alias (Qno_conversion, Qraw_text); 4869 /* Formerly aliased to raw-text! Completely bogus and not even the same
5840 4870 as FSF Emacs. */
4871 Fdefine_coding_system_alias (Qno_conversion, Qbinary);
4872 Fdefine_coding_system_alias (intern ("no-conversion-unix"),
4873 intern ("raw-text-unix"));
4874 Fdefine_coding_system_alias (intern ("no-conversion-dos"),
4875 intern ("raw-text-dos"));
4876 Fdefine_coding_system_alias (intern ("no-conversion-mac"),
4877 intern ("raw-text-mac"));
4878
4879 /* These four below will get their defaults set correctly in
4880 code-init.el. We init them now so we can handle stuff at dump
4881 time before we get to code-init.el. */
5841 Fdefine_coding_system_alias (Qfile_name, Qbinary); 4882 Fdefine_coding_system_alias (Qfile_name, Qbinary);
4883 Fdefine_coding_system_alias (Qnative, Qfile_name);
5842 4884
5843 Fdefine_coding_system_alias (Qterminal, Qbinary); 4885 Fdefine_coding_system_alias (Qterminal, Qbinary);
5844 Fdefine_coding_system_alias (Qkeyboard, Qbinary); 4886 Fdefine_coding_system_alias (Qkeyboard, Qbinary);
5845 4887
4888 Fdefine_coding_system_alias (Qidentity, Qconvert_eol_lf);
4889
5846 /* Need this for bootstrapping */ 4890 /* Need this for bootstrapping */
5847 fcd->coding_category_system[CODING_CATEGORY_NO_CONVERSION] = 4891 coding_category_system[detector_category_no_conversion] =
5848 Fget_coding_system (Qraw_text); 4892 Fget_coding_system (Qraw_text);
5849 4893 }
5850 #ifdef MULE
5851 {
5852 int i;
5853
5854 for (i = 0; i < countof (fcd->ucs_to_mule_table); i++)
5855 fcd->ucs_to_mule_table[i] = Qnil;
5856 }
5857 staticpro (&mule_to_ucs_table);
5858 mule_to_ucs_table = Fmake_char_table (Qgeneric);
5859 #endif /* MULE */
5860 }