Mercurial > hg > xemacs-beta
comparison src/file-coding.c @ 771:943eaba38521
[xemacs-hg @ 2002-03-13 08:51:24 by ben]
The big ben-mule-21-5 check-in!
Various files were added and deleted. See CHANGES-ben-mule.
There are still some test suite failures. No crashes, though.
Many of the failures have to do with problems in the test suite itself
rather than in the actual code. I'll be addressing these in the next
day or so -- none of the test suite failures are at all critical.
Meanwhile I'll be trying to address the biggest issues -- i.e. build
or run failures, which will almost certainly happen on various platforms.
All comments should be sent to ben@xemacs.org -- use a Cc: if necessary
when sending to mailing lists. There will be pre- and post- tags,
something like
pre-ben-mule-21-5-merge-in, and
post-ben-mule-21-5-merge-in.
author | ben |
---|---|
date | Wed, 13 Mar 2002 08:54:06 +0000 |
parents | fdefd0186b75 |
children | 2923009caf47 |
comparison
equal
deleted
inserted
replaced
770:336a418893b5 | 771:943eaba38521 |
---|---|
1 /* Code conversion functions. | 1 /* Text encoding conversion functions; coding-system object. |
2 #### rename me to coding-system.c or coding.c | |
2 Copyright (C) 1991, 1995 Free Software Foundation, Inc. | 3 Copyright (C) 1991, 1995 Free Software Foundation, Inc. |
3 Copyright (C) 1995 Sun Microsystems, Inc. | 4 Copyright (C) 1995 Sun Microsystems, Inc. |
5 Copyright (C) 2000, 2001, 2002 Ben Wing. | |
4 | 6 |
5 This file is part of XEmacs. | 7 This file is part of XEmacs. |
6 | 8 |
7 XEmacs is free software; you can redistribute it and/or modify it | 9 XEmacs is free software; you can redistribute it and/or modify it |
8 under the terms of the GNU General Public License as published by the | 10 under the terms of the GNU General Public License as published by the |
17 You should have received a copy of the GNU General Public License | 19 You should have received a copy of the GNU General Public License |
18 along with XEmacs; see the file COPYING. If not, write to | 20 along with XEmacs; see the file COPYING. If not, write to |
19 the Free Software Foundation, Inc., 59 Temple Place - Suite 330, | 21 the Free Software Foundation, Inc., 59 Temple Place - Suite 330, |
20 Boston, MA 02111-1307, USA. */ | 22 Boston, MA 02111-1307, USA. */ |
21 | 23 |
22 /* Synched up with: Mule 2.3. Not in FSF. */ | 24 /* Synched up with: Not in FSF. */ |
23 | 25 |
24 /* Rewritten by Ben Wing <ben@xemacs.org>. */ | 26 /* Authorship: |
27 | |
28 Current primary author: Ben Wing <ben@xemacs.org> | |
29 | |
30 Rewritten by Ben Wing <ben@xemacs.org>, based originally on coding.c | |
31 from Mule 2.? but probably does not share one line of code with that | |
32 original source. Rewriting work started around Dec. 1994. or Jan. 1995. | |
33 Proceeded in earnest till Nov. 1995. | |
34 | |
35 Around Feb. 17, 1998, Andy Piper renamed what was then mule-coding.c to | |
36 file-coding.c, with the intention of using it to do end-of-line conversion | |
37 on non-MULE machines (specifically, on Windows machines). He separated | |
38 out the MULE stuff from non-MULE using ifdef's, and searched throughout | |
39 the rest of the source tree looking for coding-system-related code that | |
40 was ifdef MULE but should be ifdef HAVE_CODING_SYSTEMS. | |
41 | |
42 Sept. 4 - 8, 1998, Tomohiko Morioka added the UCS_4 and UTF_8 coding system | |
43 types, providing a primitive means of decoding and encoding externally- | |
44 formatted Unicode/UCS_4 and Unicode/UTF_8 data. | |
45 | |
46 January 25, 2000, Martin Buchholz redid and fleshed out the coding | |
47 system alias handling that was first added in prototype form by | |
48 Hrjove Niksic, April 15, 1999. | |
49 | |
50 April to May 2000, Ben Wing: More major reorganization. Adding features | |
51 needed for MS Windows (multibyte, unicode, unicode-to-multibyte), the | |
52 "chain" coding system for chaining two together, and doing a lot of | |
53 reorganization in preparation for properly abstracting out the different | |
54 coding system types. | |
55 | |
56 June 2001, Ben Wing: Added Unicode support. Eliminated previous | |
57 junky Unicode translation support. | |
58 | |
59 August 2001, Ben Wing: Moved Unicode support to unicode.c. Finished | |
60 abstracting everything except detection, which is hard to abstract (see | |
61 just below). | |
62 | |
63 September 2001, Ben Wing: Moved Mule code to mule-coding.c, Windows code | |
64 to intl-win32.c. Lots more rewriting; very little code is untouched | |
65 from before April 2000. Abstracted the detection code, added multiple | |
66 levels of likelihood to increase the reliability of the algorithm. | |
67 | |
68 October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined. | |
69 Removed the conditionals. | |
70 */ | |
71 | |
72 /* Comments about future work | |
73 | |
74 ------------------------------------------------------------------ | |
75 ABOUT DETECTION | |
76 ------------------------------------------------------------------ | |
77 | |
78 however, in general the detection code has major problems and needs lots | |
79 of work: | |
80 | |
81 -- instead of merely "yes" or "no" for particular categories, we need a | |
82 more flexible system, with various levels of likelihood. Currently | |
83 I've created a system with six levels, as follows: | |
84 | |
85 [see file-coding.h] | |
86 | |
87 Let's consider what this might mean for an ASCII text detector. (In | |
88 order to have accurate detection, especially given the iteration I | |
89 proposed below, we need active detectors for *all* types of data we | |
90 might reasonably encounter, such as ASCII text files, binary files, | |
91 and possibly other sorts of ASCII files, and not assume that simply | |
92 "falling back to no detection" will work at all well.) | |
93 | |
94 An ASCII text detector DOES NOT report ASCII text as level 0, since | |
95 that's what the detector is looking for. Such a detector ideally | |
96 wants all bytes in the range 0x20 - 0x7E (no high bytes!), except for | |
97 whitespace control chars and perhaps a few others; LF, CR, or CRLF | |
98 sequences at regular intervals (where "regular" might mean an average | |
99 < 100 chars and 99% < 300 for code and other stuff of the "text file | |
100 w/line breaks" variety, but for the "text file w/o line breaks" | |
101 variety, excluding blank lines, averages could easily be 600 or more | |
102 with 2000-3000 char "lines" not so uncommon); similar statistical | |
103 variance between odds and evens (not Unicode); frequent occurrences of | |
104 the space character; letters more common than non-letters; etc. Also | |
105 checking for too little variability between frequencies of characters | |
106 and for exclusion of particular characters based on character ranges | |
107 can catch ASCII encodings like base-64, UUEncode, UTF-7, etc. | |
108 Granted, this doesn't even apply to everything called "ASCII", and we | |
109 could potentially distinguish off ASCII for code, ASCII for text, | |
110 etc. as separate categories. However, it does give us a lot to work | |
111 off of, in deciding what likelihood to choose -- and it shows there's | |
112 in fact a lot of detectable patterns to look for even in something | |
113 seemingly so generic as ASCII. The detector would report most text | |
114 files in level 1 or level 2. EUC encodings, Shift-JIS, etc. probably | |
115 go to level -1 because they also pass the EOL test and all other tests | |
116 for the ASCII part of the text, but have lots of high bytes, which in | |
117 essence turn them into binary. Aberrant text files like something in | |
118 BASE64 encoding might get placed in level 0, because they pass most | |
119 tests but fail dramatically the frequency test; but they should not be | |
120 reported as any lower, because that would cause explicit prompting, | |
121 and the user should be able any valid text file without prompting. | |
122 The escape sequences and the base-64-type checks might send 7-bit | |
123 iso2022 to 0, but probably not -1, for similar reasons. | |
124 | |
125 -- The assumed algorithm for the above detection levels is to in essence | |
126 sort categories first by detection level and then by priority. | |
127 Perhaps, however, we would want smarter algorithms, or at least | |
128 something user-controllable -- in particular, when (other than no | |
129 category at level 0 or greater) do we prompt the user to pick a | |
130 category? | |
131 | |
132 -- Improvements in how the detection algorithm works: we want to handle | |
133 lots of different ways something could be encoded, including multiple | |
134 stacked encodings. trying to specify a series of detection levels | |
135 (check for base64 first, then check for gzip, then check for an i18n | |
136 decoding, then for crlf) won't generally work. for example, what | |
137 about the same encoding appearing more than once? for example, take | |
138 euc-jp, base64'd, then gzip'd, then base64'd again: this could well | |
139 happen, and you could specify the encodings specifically as | |
140 base64|gzip|base64|euc-jp, but we'd like to autodetect it without | |
141 worrying about exactly what order these things appear in. we should | |
142 allow for iterating over detection/decoding cycles until we reach | |
143 some maximum (we got stuck in a loop, due to incorrect category | |
144 tables or detection algorithms), have no reported detection levels | |
145 over -1, or we end up with no change after a decoding pass (i.e. the | |
146 coding system associated with a chosen category was `no-conversion' | |
147 or something equivalent). it might make sense to divide things into | |
148 two phases (internal and external), where the internal phase has a | |
149 separate category list and would probably mostly end up handling EOL | |
150 detection; but the i think about it, the more i disagree. with | |
151 properly written detectors, and properly organized tables (in | |
152 general, those decodings that are more "distinctive" and thus | |
153 detectable with greater certainty go lower on the list), we shouldn't | |
154 need two phases. for example, let's say the example above was also | |
155 in CRLF format. The EOL detector (which really detects *plain text* | |
156 with a particular EOL type) would return at most level 0 for all | |
157 results until the text file is reached, whereas the base64, gzip or | |
158 euc-jp decoders will return higher. Once the text file is reached, | |
159 the EOL detector will return 0 or higher for the CRLF encoding, and | |
160 all other decoders will return 0 or lower; thus, we will successfully | |
161 proceed through CRLF decoding, or at worst prompt the user. (The only | |
162 external-vs-internal distinction that might make sense here is to | |
163 favor coding systems of the correct source type over those that | |
164 require conversion between external and internal; if done right, this | |
165 could allow the CRLF detector to return level 1 for all CRLF-encoded | |
166 text files, even those that look like Base-64 or similar encoding, so | |
167 that CRLF encoding will always get decoded without prompting, but not | |
168 interfere with other decoders. On the other hand, this | |
169 external-vs-internal distinction may not matter at all -- with | |
170 automatic internal-external conversion, CRLF decoding can occur | |
171 before or after decoding of euc-jp, base64, iso2022, or similar, | |
172 without any difference in the final results.) | |
173 | |
174 -- There need to be two priority lists and two | |
175 category->coding-system lists. Once is general, the other | |
176 category->langenv-specific. The user sets the former, the langenv | |
177 category->the latter. The langenv-specific entries take precedence | |
178 category->over the others. This works similarly to the | |
179 category->category->Unicode charset priority list. | |
180 | |
181 -- The simple list of coding categories per detectors is not enough. | |
182 Instead of coding categories, we need parameters. For example, | |
183 Unicode might have separate detectors for UTF-8, UTF-7, UTF-16, | |
184 and perhaps UCS-4; or UTF-16/UCS-4 would be one detection type. | |
185 UTF-16 would have parameters such as "little-endian" and "needs BOM", | |
186 and possibly another one like "collapse/expand/leave alone composite | |
187 sequences" once we add this support. Usually these parameters | |
188 correspond directly to a coding system parameter. Different | |
189 likelihood values can be specified for each parameter as well as for | |
190 the detection type as a whole. The user can specify particular | |
191 coding systems for a particular combination of detection type and | |
192 parameters, or can give "default parameters" associated with a | |
193 detection type. In the latter case, we create a new coding system as | |
194 necessary that corresponds to the detected type and parameters. | |
195 | |
196 -- a better means of presentation. rather than just coming up | |
197 with the new file decoded according to the detected coding | |
198 system, allow the user to browse through the file and | |
199 conveniently reject it if it looks wrong; then detection | |
200 starts again, but with that possibility removed. in cases where | |
201 certainty is low and thus more than one possibility is presented, | |
202 the user can browse each one and select one or reject them all. | |
203 | |
204 -- fail-safe: even after the user has made a choice, if they | |
205 later on realize they have the wrong coding system, they can | |
206 go back, and we've squirreled away the original data so they | |
207 can start the process over. this may be tricky. | |
208 | |
209 -- using a larger buffer for detection. we use just a small | |
210 piece, which can give quite random results. we may need to | |
211 buffer up all the data we look through because we can't | |
212 necessarily rewind. the idea is we proceed until we get a | |
213 result that's at least at a certain level of certainty | |
214 (e.g. "probable") or we reached a maximum limit of how much | |
215 we want to buffer. | |
216 | |
217 -- dealing with interactive systems. we might need to go ahead | |
218 and present the data before we've finished detection, and | |
219 then re-decode it, perhaps multiple times, as we get better | |
220 detection results. | |
221 | |
222 -- Clearly some of these are more important than others. at the | |
223 very least, the "better means of presentation" should be | |
224 implementation as soon as possibl, along with a very simple means | |
225 of fail-safe whenever the data is readibly available, e.g. it's | |
226 coming from a file, which is the most common scenario. | |
227 | |
228 | |
229 ------------------------------------------------------------------ | |
230 ABOUT FORMATS | |
231 ------------------------------------------------------------------ | |
232 | |
233 when calling make-coding-system, the name can be a cons of (format1 . | |
234 format2), specifying that it decodes format1->format2 and encodes the other | |
235 way. if only one name is given, that is assumed to be format1, and the | |
236 other is either `external' or `internal' depending on the end type. | |
237 normally the user when decoding gives the decoding order in formats, but | |
238 can leave off the last one, `internal', which is assumed. a multichain | |
239 might look like gzip|multibyte|unicode, using the coding systems named | |
240 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works | |
241 is by searching for gzip->multibyte; if not found, look for gzip->external | |
242 or gzip->internal. (In general we automatically do conversion between | |
243 internal and external as necessary: thus gzip|crlf does the expected, and | |
244 maps to gzip->external, external->internal, crlf->internal, which when | |
245 fully specified would be gzip|external:external|internal:crlf|internal -- | |
246 see below.) To forcibly fit together two converters that have explicitly | |
247 specified and incompatible names (say you have unicode->multibyte and | |
248 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this | |
249 case are compatible), you can force-cast using :, like this: | |
250 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between | |
251 internal and external formats, the conversion happens automatically.) | |
252 | |
253 -------------------------------------------------------------------------- | |
254 ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS | |
255 -------------------------------------------------------------------------- | |
256 | |
257 -- there's the problem that XEmacs can't be run in a directory with | |
258 non-ASCII/Latin-1 chars in it, since it will be doing Unicode | |
259 processing before we've had a chance to load the tables. In fact, | |
260 even finding the tables in such a situation is problematic using | |
261 the normal commands. my idea is to eventually load the stuff | |
262 extremely extremely early, at the same time as the pdump data gets | |
263 loaded. in fact, the unicode table data (stored in an efficient | |
264 binary format) can even be stuck into the pdump file (which would | |
265 mean as a resource to the executable, for windows). we'd need to | |
266 extend pdump a bit: to allow for attaching extra data to the pdump | |
267 file. (something like pdump_attach_extra_data (addr, length) | |
268 returns a number of some sort, an index into the file, which you | |
269 can then retrieve with pdump_load_extra_data(), which returns an | |
270 addr (mmap()ed or loaded), and later you pdump_unload_extra_data() | |
271 when finished. we'd probably also need | |
272 pdump_attach_extra_data_append(), which appends data to the data | |
273 just written out with pdump_attach_extra_data(). this way, | |
274 multiple tables in memory can be written out into one contiguous | |
275 table. (we'd use the tar-like trick of allowing new blocks to be | |
276 written without going back to change the old blocks -- we just rely | |
277 on the end of file/end of memory.) this same mechanism could be | |
278 extracted out of pdump and used to handle the non-pdump situation | |
279 (or alternatively, we could just dump either the memory image of | |
280 the tables themselves or the compressed binary version). in the | |
281 case of extra unicode tables not known about at compile time that | |
282 get loaded before dumping, we either just dump them into the image | |
283 (pdump and all) or extract them into the compressed binary format, | |
284 free the original tables, and treat them like all other tables. | |
285 | |
286 -------------------------------------------------------------------------- | |
287 HANDLING WRITING A FILE SAFELY, WITHOUT DATA LOSS | |
288 -------------------------------------------------------------------------- | |
289 | |
290 -- When writing a file, we need error detection; otherwise somebody | |
291 will create a Unicode file without realizing the coding system | |
292 of the buffer is Raw, and then lose all the non-ASCII/Latin-1 | |
293 text when it's written out. We need two levels | |
294 | |
295 1. first, a "safe-charset" level that checks before any actual | |
296 encoding to see if all characters in the document can safely | |
297 be represented using the given coding system. FSF has a | |
298 "safe-charset" property of coding systems, but it's stupid | |
299 because this information can be automatically derived from | |
300 the coding system, at least the vast majority of the time. | |
301 What we need is some sort of | |
302 alternative-coding-system-precedence-list, langenv-specific, | |
303 where everything on it can be checked for safe charsets and | |
304 then the user given a list of possibilities. When the user | |
305 does "save with specified encoding", they should see the same | |
306 precedence list. Again like with other precedence lists, | |
307 there's also a global one, and presumably all coding systems | |
308 not on other list get appended to the end (and perhaps not | |
309 checked at all when doing safe-checking?). safe-checking | |
310 should work something like this: compile a list of all | |
311 charsets used in the buffer, along with a count of chars | |
312 used. that way, "slightly unsafe" charsets can perhaps be | |
313 presented at the end, which will lose only a few characters | |
314 and are perhaps what the users were looking for. | |
315 | |
316 2. when actually writing out, we need error checking in case an | |
317 individual char in a charset can't be written even though the | |
318 charsets are safe. again, the user gets the choice of other | |
319 reasonable coding systems. | |
320 | |
321 3. same thing (error checking, list of alternatives, etc.) needs | |
322 to happen when reading! all of this will be a lot of work! | |
323 | |
324 | |
325 --ben | |
326 */ | |
25 | 327 |
26 #include <config.h> | 328 #include <config.h> |
27 #include "lisp.h" | 329 #include "lisp.h" |
28 | 330 |
29 #include "buffer.h" | 331 #include "buffer.h" |
30 #include "elhash.h" | 332 #include "elhash.h" |
31 #include "insdel.h" | 333 #include "insdel.h" |
32 #include "lstream.h" | 334 #include "lstream.h" |
33 #include "opaque.h" | 335 #include "opaque.h" |
34 #ifdef MULE | 336 #include "file-coding.h" |
35 #include "mule-ccl.h" | 337 |
36 #include "chartab.h" | 338 #ifdef HAVE_ZLIB |
339 #include "zlib.h" | |
37 #endif | 340 #endif |
38 #include "file-coding.h" | |
39 | 341 |
40 Lisp_Object Vkeyboard_coding_system; | 342 Lisp_Object Vkeyboard_coding_system; |
41 Lisp_Object Vterminal_coding_system; | 343 Lisp_Object Vterminal_coding_system; |
42 Lisp_Object Vcoding_system_for_read; | 344 Lisp_Object Vcoding_system_for_read; |
43 Lisp_Object Vcoding_system_for_write; | 345 Lisp_Object Vcoding_system_for_write; |
44 Lisp_Object Vfile_name_coding_system; | 346 Lisp_Object Vfile_name_coding_system; |
45 | 347 |
46 /* Table of symbols identifying each coding category. */ | 348 #ifdef DEBUG_XEMACS |
47 Lisp_Object coding_category_symbol[CODING_CATEGORY_LAST]; | 349 Lisp_Object Vdebug_coding_detection; |
48 | |
49 | |
50 | |
51 struct file_coding_dump { | |
52 /* Coding system currently associated with each coding category. */ | |
53 Lisp_Object coding_category_system[CODING_CATEGORY_LAST]; | |
54 | |
55 /* Table of all coding categories in decreasing order of priority. | |
56 This describes a permutation of the possible coding categories. */ | |
57 int coding_category_by_priority[CODING_CATEGORY_LAST]; | |
58 | |
59 #ifdef MULE | |
60 Lisp_Object ucs_to_mule_table[65536]; | |
61 #endif | 350 #endif |
62 } *fcd; | 351 |
63 | 352 typedef struct coding_system_type_entry |
64 static const struct lrecord_description fcd_description_1[] = { | 353 { |
65 { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, coding_category_system), CODING_CATEGORY_LAST }, | 354 struct coding_system_methods *meths; |
66 #ifdef MULE | 355 } coding_system_type_entry; |
67 { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, ucs_to_mule_table), countof (fcd->ucs_to_mule_table) }, | 356 |
68 #endif | 357 typedef struct |
358 { | |
359 Dynarr_declare (coding_system_type_entry); | |
360 } coding_system_type_entry_dynarr; | |
361 | |
362 static coding_system_type_entry_dynarr *the_coding_system_type_entry_dynarr; | |
363 | |
364 static const struct lrecord_description cste_description_1[] = { | |
365 { XD_STRUCT_PTR, offsetof (coding_system_type_entry, meths), 1, &coding_system_methods_description }, | |
69 { XD_END } | 366 { XD_END } |
70 }; | 367 }; |
71 | 368 |
72 static const struct struct_description fcd_description = { | 369 static const struct struct_description cste_description = { |
73 sizeof (struct file_coding_dump), | 370 sizeof (coding_system_type_entry), |
74 fcd_description_1 | 371 cste_description_1 |
75 }; | 372 }; |
76 | 373 |
77 Lisp_Object mule_to_ucs_table; | 374 static const struct lrecord_description csted_description_1[] = { |
375 XD_DYNARR_DESC (coding_system_type_entry_dynarr, &cste_description), | |
376 { XD_END } | |
377 }; | |
378 | |
379 static const struct struct_description csted_description = { | |
380 sizeof (coding_system_type_entry_dynarr), | |
381 csted_description_1 | |
382 }; | |
383 | |
384 static Lisp_Object Vcoding_system_type_list; | |
385 | |
386 /* Coding system currently associated with each coding category. */ | |
387 Lisp_Object coding_category_system[MAX_DETECTOR_CATEGORIES]; | |
388 | |
389 /* Table of all coding categories in decreasing order of priority. | |
390 This describes a permutation of the possible coding categories. */ | |
391 int coding_category_by_priority[MAX_DETECTOR_CATEGORIES]; | |
392 | |
393 /* Value used with to give a unique name to nameless coding systems */ | |
394 int coding_system_tick; | |
395 | |
396 int coding_detector_count; | |
397 int coding_detector_category_count; | |
398 | |
399 detector_dynarr *all_coding_detectors; | |
400 | |
401 static const struct lrecord_description struct_detector_category_description_1[] | |
402 = | |
403 { | |
404 { XD_LISP_OBJECT, offsetof (struct detector_category, sym) }, | |
405 { XD_END } | |
406 }; | |
407 | |
408 static const struct struct_description struct_detector_category_description = | |
409 { | |
410 sizeof (struct detector_category), | |
411 struct_detector_category_description_1 | |
412 }; | |
413 | |
414 static const struct lrecord_description detector_category_dynarr_description_1[] = | |
415 { | |
416 XD_DYNARR_DESC (detector_category_dynarr, | |
417 &struct_detector_category_description), | |
418 { XD_END } | |
419 }; | |
420 | |
421 static const struct struct_description detector_category_dynarr_description = { | |
422 sizeof (detector_category_dynarr), | |
423 detector_category_dynarr_description_1 | |
424 }; | |
425 | |
426 static const struct lrecord_description struct_detector_description_1[] | |
427 = | |
428 { | |
429 { XD_STRUCT_PTR, offsetof (struct detector, cats), 1, | |
430 &detector_category_dynarr_description }, | |
431 { XD_END } | |
432 }; | |
433 | |
434 static const struct struct_description struct_detector_description = | |
435 { | |
436 sizeof (struct detector), | |
437 struct_detector_description_1 | |
438 }; | |
439 | |
440 static const struct lrecord_description detector_dynarr_description_1[] = | |
441 { | |
442 XD_DYNARR_DESC (detector_dynarr, &struct_detector_description), | |
443 { XD_END } | |
444 }; | |
445 | |
446 static const struct struct_description detector_dynarr_description = { | |
447 sizeof (detector_dynarr), | |
448 detector_dynarr_description_1 | |
449 }; | |
78 | 450 |
79 Lisp_Object Qcoding_systemp; | 451 Lisp_Object Qcoding_systemp; |
80 | 452 |
81 Lisp_Object Qraw_text, Qno_conversion, Qccl, Qiso2022; | 453 Lisp_Object Qraw_text; |
82 /* Qinternal in general.c */ | |
83 | 454 |
84 Lisp_Object Qmnemonic, Qeol_type; | 455 Lisp_Object Qmnemonic, Qeol_type; |
85 Lisp_Object Qcr, Qcrlf, Qlf; | 456 Lisp_Object Qcr, Qcrlf, Qlf; |
86 Lisp_Object Qeol_cr, Qeol_crlf, Qeol_lf; | 457 Lisp_Object Qeol_cr, Qeol_crlf, Qeol_lf; |
87 Lisp_Object Qpost_read_conversion; | 458 Lisp_Object Qpost_read_conversion; |
88 Lisp_Object Qpre_write_conversion; | 459 Lisp_Object Qpre_write_conversion; |
89 | 460 |
90 #ifdef MULE | 461 Lisp_Object Qtranslation_table_for_decode; |
91 Lisp_Object Qucs4, Qutf8; | 462 Lisp_Object Qtranslation_table_for_encode; |
92 Lisp_Object Qbig5, Qshift_jis; | 463 Lisp_Object Qsafe_chars; |
93 Lisp_Object Qcharset_g0, Qcharset_g1, Qcharset_g2, Qcharset_g3; | 464 Lisp_Object Qsafe_charsets; |
94 Lisp_Object Qforce_g0_on_output, Qforce_g1_on_output; | 465 Lisp_Object Qmime_charset; |
95 Lisp_Object Qforce_g2_on_output, Qforce_g3_on_output; | 466 Lisp_Object Qvalid_codes; |
96 Lisp_Object Qno_iso6429; | 467 |
97 Lisp_Object Qinput_charset_conversion, Qoutput_charset_conversion; | 468 Lisp_Object Qno_conversion; |
469 Lisp_Object Qconvert_eol; | |
98 Lisp_Object Qescape_quoted; | 470 Lisp_Object Qescape_quoted; |
99 Lisp_Object Qshort, Qno_ascii_eol, Qno_ascii_cntl, Qseven, Qlock_shift; | 471 Lisp_Object Qencode, Qdecode; |
472 | |
473 Lisp_Object Qconvert_eol_lf, Qconvert_eol_cr, Qconvert_eol_crlf; | |
474 Lisp_Object Qconvert_eol_autodetect; | |
475 | |
476 Lisp_Object Qnear_certainty, Qquite_probable, Qsomewhat_likely; | |
477 Lisp_Object Qas_likely_as_unlikely, Qsomewhat_unlikely, Qquite_improbable; | |
478 Lisp_Object Qnearly_impossible; | |
479 | |
480 Lisp_Object Qdo_eol, Qdo_coding; | |
481 | |
482 Lisp_Object Qcanonicalize_after_coding; | |
483 | |
484 /* This is used to convert autodetected coding systems into existing | |
485 systems. For example, the chain undecided->convert-eol-autodetect may | |
486 have its separate parts detected as mswindows-multibyte and | |
487 convert-eol-crlf, and the result needs to be mapped to | |
488 mswindows-multibyte-dos. */ | |
489 /* #### It's not clear we need this whole chain-canonicalize mechanism | |
490 any more. */ | |
491 static Lisp_Object Vchain_canonicalize_hash_table; | |
492 | |
493 #ifdef HAVE_ZLIB | |
494 Lisp_Object Qgzip; | |
100 #endif | 495 #endif |
101 Lisp_Object Qencode, Qdecode; | 496 |
102 | 497 /* Maps coding system names to either coding system objects or (for |
103 Lisp_Object Vcoding_system_hash_table; | 498 aliases) other names. */ |
499 static Lisp_Object Vcoding_system_hash_table; | |
104 | 500 |
105 int enable_multibyte_characters; | 501 int enable_multibyte_characters; |
106 | 502 |
107 #ifdef MULE | |
108 /* Additional information used by the ISO2022 decoder and detector. */ | |
109 struct iso2022_decoder | |
110 { | |
111 /* CHARSET holds the character sets currently assigned to the G0 | |
112 through G3 variables. It is initialized from the array | |
113 INITIAL_CHARSET in CODESYS. */ | |
114 Lisp_Object charset[4]; | |
115 | |
116 /* Which registers are currently invoked into the left (GL) and | |
117 right (GR) halves of the 8-bit encoding space? */ | |
118 int register_left, register_right; | |
119 | |
120 /* ISO_ESC holds a value indicating part of an escape sequence | |
121 that has already been seen. */ | |
122 enum iso_esc_flag esc; | |
123 | |
124 /* This records the bytes we've seen so far in an escape sequence, | |
125 in case the sequence is invalid (we spit out the bytes unchanged). */ | |
126 unsigned char esc_bytes[8]; | |
127 | |
128 /* Index for next byte to store in ISO escape sequence. */ | |
129 int esc_bytes_index; | |
130 | |
131 #ifdef ENABLE_COMPOSITE_CHARS | |
132 /* Stuff seen so far when composing a string. */ | |
133 unsigned_char_dynarr *composite_chars; | |
134 #endif | |
135 | |
136 /* If we saw an invalid designation sequence for a particular | |
137 register, we flag it here and switch to ASCII. The next time we | |
138 see a valid designation for this register, we turn off the flag | |
139 and do the designation normally, but pretend the sequence was | |
140 invalid. The effect of all this is that (most of the time) the | |
141 escape sequences for both the switch to the unknown charset, and | |
142 the switch back to the known charset, get inserted literally into | |
143 the buffer and saved out as such. The hope is that we can | |
144 preserve the escape sequences so that the resulting written out | |
145 file makes sense. If we don't do any of this, the designation | |
146 to the invalid charset will be preserved but that switch back | |
147 to the known charset will probably get eaten because it was | |
148 the same charset that was already present in the register. */ | |
149 unsigned char invalid_designated[4]; | |
150 | |
151 /* We try to do similar things as above for direction-switching | |
152 sequences. If we encountered a direction switch while an | |
153 invalid designation was present, or an invalid designation | |
154 just after a direction switch (i.e. no valid designation | |
155 encountered yet), we insert the direction-switch escape | |
156 sequence literally into the output stream, and later on | |
157 insert the corresponding direction-restoring escape sequence | |
158 literally also. */ | |
159 unsigned int switched_dir_and_no_valid_charset_yet :1; | |
160 unsigned int invalid_switch_dir :1; | |
161 | |
162 /* Tells the decoder to output the escape sequence literally | |
163 even though it was valid. Used in the games we play to | |
164 avoid lossage when we encounter invalid designations. */ | |
165 unsigned int output_literally :1; | |
166 /* We encountered a direction switch followed by an invalid | |
167 designation. We didn't output the direction switch | |
168 literally because we didn't know about the invalid designation; | |
169 but we have to do so now. */ | |
170 unsigned int output_direction_sequence :1; | |
171 }; | |
172 #endif /* MULE */ | |
173 EXFUN (Fcopy_coding_system, 2); | 503 EXFUN (Fcopy_coding_system, 2); |
174 #ifdef MULE | |
175 struct detection_state; | |
176 static int detect_coding_sjis (struct detection_state *st, | |
177 const Extbyte *src, Bytecount n); | |
178 static void decode_coding_sjis (Lstream *decoding, const Extbyte *src, | |
179 unsigned_char_dynarr *dst, Bytecount n); | |
180 static void encode_coding_sjis (Lstream *encoding, const Intbyte *src, | |
181 unsigned_char_dynarr *dst, Bytecount n); | |
182 static int detect_coding_big5 (struct detection_state *st, | |
183 const Extbyte *src, Bytecount n); | |
184 static void decode_coding_big5 (Lstream *decoding, const Extbyte *src, | |
185 unsigned_char_dynarr *dst, Bytecount n); | |
186 static void encode_coding_big5 (Lstream *encoding, const Intbyte *src, | |
187 unsigned_char_dynarr *dst, Bytecount n); | |
188 static int detect_coding_ucs4 (struct detection_state *st, | |
189 const Extbyte *src, Bytecount n); | |
190 static void decode_coding_ucs4 (Lstream *decoding, const Extbyte *src, | |
191 unsigned_char_dynarr *dst, Bytecount n); | |
192 static void encode_coding_ucs4 (Lstream *encoding, const Intbyte *src, | |
193 unsigned_char_dynarr *dst, Bytecount n); | |
194 static int detect_coding_utf8 (struct detection_state *st, | |
195 const Extbyte *src, Bytecount n); | |
196 static void decode_coding_utf8 (Lstream *decoding, const Extbyte *src, | |
197 unsigned_char_dynarr *dst, Bytecount n); | |
198 static void encode_coding_utf8 (Lstream *encoding, const Intbyte *src, | |
199 unsigned_char_dynarr *dst, Bytecount n); | |
200 static int postprocess_iso2022_mask (int mask); | |
201 static void reset_iso2022 (Lisp_Object coding_system, | |
202 struct iso2022_decoder *iso); | |
203 static int detect_coding_iso2022 (struct detection_state *st, | |
204 const Extbyte *src, Bytecount n); | |
205 static void decode_coding_iso2022 (Lstream *decoding, const Extbyte *src, | |
206 unsigned_char_dynarr *dst, Bytecount n); | |
207 static void encode_coding_iso2022 (Lstream *encoding, const Intbyte *src, | |
208 unsigned_char_dynarr *dst, Bytecount n); | |
209 #endif /* MULE */ | |
210 static void decode_coding_no_conversion (Lstream *decoding, const Extbyte *src, | |
211 unsigned_char_dynarr *dst, Bytecount n); | |
212 static void encode_coding_no_conversion (Lstream *encoding, const Intbyte *src, | |
213 unsigned_char_dynarr *dst, Bytecount n); | |
214 static void mule_decode (Lstream *decoding, const Extbyte *src, | |
215 unsigned_char_dynarr *dst, Bytecount n); | |
216 static void mule_encode (Lstream *encoding, const Intbyte *src, | |
217 unsigned_char_dynarr *dst, Bytecount n); | |
218 | |
219 typedef struct codesys_prop codesys_prop; | |
220 struct codesys_prop | |
221 { | |
222 Lisp_Object sym; | |
223 int prop_type; | |
224 }; | |
225 | |
226 typedef struct | |
227 { | |
228 Dynarr_declare (codesys_prop); | |
229 } codesys_prop_dynarr; | |
230 | |
231 static const struct lrecord_description codesys_prop_description_1[] = { | |
232 { XD_LISP_OBJECT, offsetof (codesys_prop, sym) }, | |
233 { XD_END } | |
234 }; | |
235 | |
236 static const struct struct_description codesys_prop_description = { | |
237 sizeof (codesys_prop), | |
238 codesys_prop_description_1 | |
239 }; | |
240 | |
241 static const struct lrecord_description codesys_prop_dynarr_description_1[] = { | |
242 XD_DYNARR_DESC (codesys_prop_dynarr, &codesys_prop_description), | |
243 { XD_END } | |
244 }; | |
245 | |
246 static const struct struct_description codesys_prop_dynarr_description = { | |
247 sizeof (codesys_prop_dynarr), | |
248 codesys_prop_dynarr_description_1 | |
249 }; | |
250 | |
251 codesys_prop_dynarr *the_codesys_prop_dynarr; | |
252 | |
253 enum codesys_prop_enum | |
254 { | |
255 CODESYS_PROP_ALL_OK, | |
256 CODESYS_PROP_ISO2022, | |
257 CODESYS_PROP_CCL | |
258 }; | |
259 | 504 |
260 | 505 |
261 /************************************************************************/ | 506 /************************************************************************/ |
262 /* Coding system functions */ | 507 /* Coding system object methods */ |
263 /************************************************************************/ | 508 /************************************************************************/ |
264 | |
265 static Lisp_Object mark_coding_system (Lisp_Object); | |
266 static void print_coding_system (Lisp_Object, Lisp_Object, int); | |
267 static void finalize_coding_system (void *header, int for_disksave); | |
268 | |
269 #ifdef MULE | |
270 static const struct lrecord_description ccs_description_1[] = { | |
271 { XD_LISP_OBJECT, offsetof (charset_conversion_spec, from_charset) }, | |
272 { XD_LISP_OBJECT, offsetof (charset_conversion_spec, to_charset) }, | |
273 { XD_END } | |
274 }; | |
275 | |
276 static const struct struct_description ccs_description = { | |
277 sizeof (charset_conversion_spec), | |
278 ccs_description_1 | |
279 }; | |
280 | |
281 static const struct lrecord_description ccsd_description_1[] = { | |
282 XD_DYNARR_DESC (charset_conversion_spec_dynarr, &ccs_description), | |
283 { XD_END } | |
284 }; | |
285 | |
286 static const struct struct_description ccsd_description = { | |
287 sizeof (charset_conversion_spec_dynarr), | |
288 ccsd_description_1 | |
289 }; | |
290 #endif | |
291 | |
292 static const struct lrecord_description coding_system_description[] = { | |
293 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) }, | |
294 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, doc_string) }, | |
295 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) }, | |
296 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) }, | |
297 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) }, | |
298 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_lf) }, | |
299 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_crlf) }, | |
300 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_cr) }, | |
301 #ifdef MULE | |
302 { XD_LISP_OBJECT_ARRAY, offsetof (Lisp_Coding_System, iso2022.initial_charset), 4 }, | |
303 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.input_conv), 1, &ccsd_description }, | |
304 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.output_conv), 1, &ccsd_description }, | |
305 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.decode) }, | |
306 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.encode) }, | |
307 #endif | |
308 { XD_END } | |
309 }; | |
310 | |
311 DEFINE_LRECORD_IMPLEMENTATION ("coding-system", coding_system, | |
312 mark_coding_system, print_coding_system, | |
313 finalize_coding_system, | |
314 0, 0, coding_system_description, | |
315 Lisp_Coding_System); | |
316 | 509 |
317 static Lisp_Object | 510 static Lisp_Object |
318 mark_coding_system (Lisp_Object obj) | 511 mark_coding_system (Lisp_Object obj) |
319 { | 512 { |
320 Lisp_Coding_System *codesys = XCODING_SYSTEM (obj); | 513 Lisp_Coding_System *codesys = XCODING_SYSTEM (obj); |
321 | 514 |
322 mark_object (CODING_SYSTEM_NAME (codesys)); | 515 mark_object (CODING_SYSTEM_NAME (codesys)); |
323 mark_object (CODING_SYSTEM_DOC_STRING (codesys)); | 516 mark_object (CODING_SYSTEM_DESCRIPTION (codesys)); |
324 mark_object (CODING_SYSTEM_MNEMONIC (codesys)); | 517 mark_object (CODING_SYSTEM_MNEMONIC (codesys)); |
518 mark_object (CODING_SYSTEM_DOCUMENTATION (codesys)); | |
325 mark_object (CODING_SYSTEM_EOL_LF (codesys)); | 519 mark_object (CODING_SYSTEM_EOL_LF (codesys)); |
326 mark_object (CODING_SYSTEM_EOL_CRLF (codesys)); | 520 mark_object (CODING_SYSTEM_EOL_CRLF (codesys)); |
327 mark_object (CODING_SYSTEM_EOL_CR (codesys)); | 521 mark_object (CODING_SYSTEM_EOL_CR (codesys)); |
328 | 522 mark_object (CODING_SYSTEM_SUBSIDIARY_PARENT (codesys)); |
329 switch (CODING_SYSTEM_TYPE (codesys)) | 523 mark_object (CODING_SYSTEM_CANONICAL (codesys)); |
330 { | 524 |
331 #ifdef MULE | 525 MAYBE_CODESYSMETH (codesys, mark, (obj)); |
332 int i; | |
333 case CODESYS_ISO2022: | |
334 for (i = 0; i < 4; i++) | |
335 mark_object (CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i)); | |
336 if (codesys->iso2022.input_conv) | |
337 { | |
338 for (i = 0; i < Dynarr_length (codesys->iso2022.input_conv); i++) | |
339 { | |
340 struct charset_conversion_spec *ccs = | |
341 Dynarr_atp (codesys->iso2022.input_conv, i); | |
342 mark_object (ccs->from_charset); | |
343 mark_object (ccs->to_charset); | |
344 } | |
345 } | |
346 if (codesys->iso2022.output_conv) | |
347 { | |
348 for (i = 0; i < Dynarr_length (codesys->iso2022.output_conv); i++) | |
349 { | |
350 struct charset_conversion_spec *ccs = | |
351 Dynarr_atp (codesys->iso2022.output_conv, i); | |
352 mark_object (ccs->from_charset); | |
353 mark_object (ccs->to_charset); | |
354 } | |
355 } | |
356 break; | |
357 | |
358 case CODESYS_CCL: | |
359 mark_object (CODING_SYSTEM_CCL_DECODE (codesys)); | |
360 mark_object (CODING_SYSTEM_CCL_ENCODE (codesys)); | |
361 break; | |
362 #endif /* MULE */ | |
363 default: | |
364 break; | |
365 } | |
366 | 526 |
367 mark_object (CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys)); | 527 mark_object (CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys)); |
368 return CODING_SYSTEM_POST_READ_CONVERSION (codesys); | 528 return CODING_SYSTEM_POST_READ_CONVERSION (codesys); |
529 } | |
530 | |
531 static void | |
532 print_coding_system_properties (Lisp_Object obj, Lisp_Object printcharfun) | |
533 { | |
534 Lisp_Coding_System *c = XCODING_SYSTEM (obj); | |
535 print_internal (c->methods->type, printcharfun, 1); | |
536 MAYBE_CODESYSMETH (c, print, (obj, printcharfun, 1)); | |
537 if (CODING_SYSTEM_EOL_TYPE (c) != EOL_AUTODETECT) | |
538 write_fmt_string_lisp (printcharfun, " eol-type=%s", | |
539 1, Fcoding_system_property (obj, Qeol_type)); | |
369 } | 540 } |
370 | 541 |
371 static void | 542 static void |
372 print_coding_system (Lisp_Object obj, Lisp_Object printcharfun, | 543 print_coding_system (Lisp_Object obj, Lisp_Object printcharfun, |
373 int escapeflag) | 544 int escapeflag) |
374 { | 545 { |
375 Lisp_Coding_System *c = XCODING_SYSTEM (obj); | 546 Lisp_Coding_System *c = XCODING_SYSTEM (obj); |
376 if (print_readably) | 547 if (print_readably) |
377 printing_unreadable_object ("#<coding_system 0x%x>", c->header.uid); | 548 printing_unreadable_object |
378 | 549 ("printing unreadable object #<coding-system 0x%x>", c->header.uid); |
379 write_c_string ("#<coding_system ", printcharfun); | 550 |
380 print_internal (c->name, printcharfun, 1); | 551 write_fmt_string_lisp (printcharfun, "#<coding-system %s ", 1, c->name); |
552 print_coding_system_properties (obj, printcharfun); | |
381 write_c_string (">", printcharfun); | 553 write_c_string (">", printcharfun); |
554 } | |
555 | |
556 /* Print an abbreviated version of a coding system (but still containing | |
557 all the information), for use within a coding system print method. */ | |
558 | |
559 static void | |
560 print_coding_system_in_print_method (Lisp_Object cs, Lisp_Object printcharfun, | |
561 int escapeflag) | |
562 { | |
563 print_internal (XCODING_SYSTEM_NAME (cs), printcharfun, 0); | |
564 write_c_string ("[", printcharfun); | |
565 print_coding_system_properties (cs, printcharfun); | |
566 write_c_string ("]", printcharfun); | |
382 } | 567 } |
383 | 568 |
384 static void | 569 static void |
385 finalize_coding_system (void *header, int for_disksave) | 570 finalize_coding_system (void *header, int for_disksave) |
386 { | 571 { |
387 Lisp_Coding_System *c = (Lisp_Coding_System *) header; | 572 Lisp_Object cs = wrap_coding_system ((Lisp_Coding_System *) header); |
388 /* Since coding systems never go away, this function is not | 573 /* Since coding systems never go away, this function is not |
389 necessary. But it would be necessary if we changed things | 574 necessary. But it would be necessary if we changed things |
390 so that coding systems could go away. */ | 575 so that coding systems could go away. */ |
391 if (!for_disksave) /* see comment in lstream.c */ | 576 if (!for_disksave) /* see comment in lstream.c */ |
392 { | 577 MAYBE_XCODESYSMETH (cs, finalize, (cs)); |
393 switch (CODING_SYSTEM_TYPE (c)) | 578 } |
394 { | 579 |
395 #ifdef MULE | 580 static Bytecount |
396 case CODESYS_ISO2022: | 581 sizeof_coding_system (const void *header) |
397 if (c->iso2022.input_conv) | 582 { |
398 { | 583 const Lisp_Coding_System *p = (const Lisp_Coding_System *) header; |
399 Dynarr_free (c->iso2022.input_conv); | 584 return offsetof (Lisp_Coding_System, data) + p->methods->extra_data_size; |
400 c->iso2022.input_conv = 0; | 585 } |
401 } | 586 |
402 if (c->iso2022.output_conv) | 587 static const struct lrecord_description coding_system_methods_description_1[] |
403 { | 588 = { |
404 Dynarr_free (c->iso2022.output_conv); | 589 { XD_LISP_OBJECT, |
405 c->iso2022.output_conv = 0; | 590 offsetof (struct coding_system_methods, type) }, |
406 } | 591 { XD_LISP_OBJECT, |
407 break; | 592 offsetof (struct coding_system_methods, predicate_symbol) }, |
408 #endif /* MULE */ | 593 { XD_END } |
409 default: | 594 }; |
410 break; | 595 |
411 } | 596 const struct struct_description coding_system_methods_description = { |
412 } | 597 sizeof (struct coding_system_methods), |
413 } | 598 coding_system_methods_description_1 |
414 | 599 }; |
415 static eol_type_t | 600 |
416 symbol_to_eol_type (Lisp_Object symbol) | 601 const struct lrecord_description coding_system_empty_extra_description[] = { |
417 { | 602 { XD_END } |
418 CHECK_SYMBOL (symbol); | 603 }; |
419 if (NILP (symbol)) return EOL_AUTODETECT; | 604 |
420 if (EQ (symbol, Qlf)) return EOL_LF; | 605 static const struct lrecord_description coding_system_description[] = |
421 if (EQ (symbol, Qcrlf)) return EOL_CRLF; | 606 { |
422 if (EQ (symbol, Qcr)) return EOL_CR; | 607 { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, methods), 1, |
423 | 608 &coding_system_methods_description }, |
424 invalid_constant ("Unrecognized eol type", symbol); | 609 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) }, |
425 return EOL_AUTODETECT; /* not reached */ | 610 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, description) }, |
426 } | 611 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) }, |
427 | 612 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, documentation) }, |
428 static Lisp_Object | 613 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) }, |
429 eol_type_to_symbol (eol_type_t type) | 614 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) }, |
430 { | 615 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, text_file_wrapper) }, |
431 switch (type) | 616 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, auto_eol_wrapper) }, |
432 { | 617 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[0]) }, |
433 default: abort (); | 618 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[1]) }, |
434 case EOL_LF: return Qlf; | 619 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[2]) }, |
435 case EOL_CRLF: return Qcrlf; | 620 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, subsidiary_parent) }, |
436 case EOL_CR: return Qcr; | 621 { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, canonical) }, |
437 case EOL_AUTODETECT: return Qnil; | 622 { XD_CODING_SYSTEM_END } |
438 } | 623 }; |
439 } | 624 |
440 | 625 DEFINE_LRECORD_SEQUENCE_IMPLEMENTATION ("coding-system", coding_system, |
441 static void | 626 mark_coding_system, |
442 setup_eol_coding_systems (Lisp_Coding_System *codesys) | 627 print_coding_system, |
443 { | 628 finalize_coding_system, |
444 Lisp_Object codesys_obj; | 629 0, 0, coding_system_description, |
445 int len = string_length (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name); | 630 sizeof_coding_system, |
446 char *codesys_name = (char *) alloca (len + 7); | 631 Lisp_Coding_System); |
447 int mlen = -1; | 632 |
448 char *codesys_mnemonic=0; | 633 |
449 | 634 /************************************************************************/ |
450 Lisp_Object codesys_name_sym, sub_codesys_obj; | 635 /* Creating coding systems */ |
451 | 636 /************************************************************************/ |
452 /* kludge */ | 637 |
453 | 638 static struct coding_system_methods * |
454 XSETCODING_SYSTEM (codesys_obj, codesys); | 639 decode_coding_system_type (Lisp_Object type, Error_Behavior errb) |
455 | 640 { |
456 memcpy (codesys_name, | 641 int i; |
457 string_data (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name), len); | 642 |
458 | 643 for (i = 0; i < Dynarr_length (the_coding_system_type_entry_dynarr); i++) |
459 if (STRINGP (CODING_SYSTEM_MNEMONIC (codesys))) | 644 { |
460 { | 645 if (EQ (type, |
461 mlen = XSTRING_LENGTH (CODING_SYSTEM_MNEMONIC (codesys)); | 646 Dynarr_at (the_coding_system_type_entry_dynarr, i).meths->type)) |
462 codesys_mnemonic = (char *) alloca (mlen + 7); | 647 return Dynarr_at (the_coding_system_type_entry_dynarr, i).meths; |
463 memcpy (codesys_mnemonic, | 648 } |
464 XSTRING_DATA (CODING_SYSTEM_MNEMONIC (codesys)), mlen); | 649 |
465 } | 650 maybe_invalid_constant ("Invalid coding system type", type, |
466 | 651 Qcoding_system, errb); |
467 #define DEFINE_SUB_CODESYS(op_sys, op_sys_abbr, Type) do { \ | 652 |
468 strcpy (codesys_name + len, "-" op_sys); \ | 653 return 0; |
469 if (mlen != -1) \ | 654 } |
470 strcpy (codesys_mnemonic + mlen, op_sys_abbr); \ | 655 |
471 codesys_name_sym = intern (codesys_name); \ | 656 static int |
472 sub_codesys_obj = Fcopy_coding_system (codesys_obj, codesys_name_sym); \ | 657 valid_coding_system_type_p (Lisp_Object type) |
473 XCODING_SYSTEM_EOL_TYPE (sub_codesys_obj) = Type; \ | 658 { |
474 if (mlen != -1) \ | 659 return decode_coding_system_type (type, ERROR_ME_NOT) != 0; |
475 XCODING_SYSTEM_MNEMONIC(sub_codesys_obj) = \ | 660 } |
476 build_string (codesys_mnemonic); \ | 661 |
477 CODING_SYSTEM_##Type (codesys) = sub_codesys_obj; \ | 662 DEFUN ("valid-coding-system-type-p", Fvalid_coding_system_type_p, 1, 1, 0, /* |
478 } while (0) | 663 Given a CODING-SYSTEM-TYPE, return non-nil if it is valid. |
479 | 664 Valid types depend on how XEmacs was compiled but may include |
480 DEFINE_SUB_CODESYS("unix", "", EOL_LF); | 665 'undecided, 'chain, 'integer, 'ccl, 'iso2022, 'big5, 'shift-jis, |
481 DEFINE_SUB_CODESYS("dos", ":T", EOL_CRLF); | 666 'utf-16, 'ucs-4, 'utf-8, etc. |
482 DEFINE_SUB_CODESYS("mac", ":t", EOL_CR); | 667 */ |
668 (coding_system_type)) | |
669 { | |
670 return valid_coding_system_type_p (coding_system_type) ? Qt : Qnil; | |
671 } | |
672 | |
673 DEFUN ("coding-system-type-list", Fcoding_system_type_list, 0, 0, 0, /* | |
674 Return a list of valid coding system types. | |
675 */ | |
676 ()) | |
677 { | |
678 return Fcopy_sequence (Vcoding_system_type_list); | |
679 } | |
680 | |
681 void | |
682 add_entry_to_coding_system_type_list (struct coding_system_methods *meths) | |
683 { | |
684 struct coding_system_type_entry entry; | |
685 | |
686 entry.meths = meths; | |
687 Dynarr_add (the_coding_system_type_entry_dynarr, entry); | |
688 Vcoding_system_type_list = Fcons (meths->type, Vcoding_system_type_list); | |
483 } | 689 } |
484 | 690 |
485 DEFUN ("coding-system-p", Fcoding_system_p, 1, 1, 0, /* | 691 DEFUN ("coding-system-p", Fcoding_system_p, 1, 1, 0, /* |
486 Return t if OBJECT is a coding system. | 692 Return t if OBJECT is a coding system. |
487 A coding system is an object that defines how text containing multiple | 693 A coding system is an object that defines how text containing multiple |
527 while (1) | 733 while (1) |
528 { | 734 { |
529 coding_system_or_name = | 735 coding_system_or_name = |
530 Fgethash (coding_system_or_name, Vcoding_system_hash_table, Qnil); | 736 Fgethash (coding_system_or_name, Vcoding_system_hash_table, Qnil); |
531 | 737 |
532 if (CODING_SYSTEMP (coding_system_or_name) || NILP (coding_system_or_name)) | 738 if (CODING_SYSTEMP (coding_system_or_name) |
739 || NILP (coding_system_or_name)) | |
533 return coding_system_or_name; | 740 return coding_system_or_name; |
534 } | 741 } |
535 } | 742 } |
536 | 743 |
537 DEFUN ("get-coding-system", Fget_coding_system, 1, 1, 0, /* | 744 DEFUN ("get-coding-system", Fget_coding_system, 1, 1, 0, /* |
546 if (NILP (coding_system)) | 753 if (NILP (coding_system)) |
547 invalid_argument ("No such coding system", name); | 754 invalid_argument ("No such coding system", name); |
548 return coding_system; | 755 return coding_system; |
549 } | 756 } |
550 | 757 |
551 /* We store the coding systems in hash tables with the names as the key and the | 758 int |
552 actual coding system object as the value. Occasionally we need to use them | 759 coding_system_is_binary (Lisp_Object coding_system) |
553 in a list format. These routines provide us with that. */ | 760 { |
761 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); | |
762 return | |
763 (EQ (CODING_SYSTEM_TYPE (cs), Qno_conversion) && | |
764 CODING_SYSTEM_EOL_TYPE (cs) == EOL_LF && | |
765 EQ (CODING_SYSTEM_POST_READ_CONVERSION (cs), Qnil) && | |
766 EQ (CODING_SYSTEM_PRE_WRITE_CONVERSION (cs), Qnil)); | |
767 } | |
768 | |
769 static Lisp_Object | |
770 coding_system_real_canonical (Lisp_Object cs) | |
771 { | |
772 if (!NILP (XCODING_SYSTEM_CANONICAL (cs))) | |
773 return XCODING_SYSTEM_CANONICAL (cs); | |
774 return cs; | |
775 } | |
776 | |
777 /* Return true if coding system is of the "standard" type that decodes | |
778 bytes into characters (suitable for decoding a text file). */ | |
779 int | |
780 coding_system_is_for_text_file (Lisp_Object coding_system) | |
781 { | |
782 return (XCODESYSMETH_OR_GIVEN | |
783 (coding_system, conversion_end_type, | |
784 (coding_system_real_canonical (coding_system)), | |
785 DECODES_BYTE_TO_CHARACTER) == | |
786 DECODES_BYTE_TO_CHARACTER); | |
787 } | |
788 | |
789 static int | |
790 decoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex) | |
791 { | |
792 enum source_sink_type type = | |
793 XCODESYSMETH_OR_GIVEN (cs, conversion_end_type, | |
794 (coding_system_real_canonical (cs)), | |
795 DECODES_BYTE_TO_CHARACTER); | |
796 if (sex == CODING_SOURCE) | |
797 return (type == DECODES_CHARACTER_TO_CHARACTER || | |
798 type == DECODES_CHARACTER_TO_BYTE); | |
799 else | |
800 return (type == DECODES_CHARACTER_TO_CHARACTER || | |
801 type == DECODES_BYTE_TO_CHARACTER); | |
802 } | |
803 | |
804 static int | |
805 encoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex) | |
806 { | |
807 return decoding_source_sink_type_is_char (cs, | |
808 /* Sex change */ | |
809 sex == CODING_SOURCE ? | |
810 CODING_SINK : CODING_SOURCE); | |
811 } | |
812 | |
813 /* Like Ffind_coding_system() but check that the coding system is of the | |
814 "standard" type that decodes bytes into characters (suitable for | |
815 decoding a text file), and if not, returns an appropriate wrapper that | |
816 does. Also, if EOL_WRAP is non-zero, check whether this coding system | |
817 wants EOL auto-detection, and if so, wrap with a convert-eol coding | |
818 system to do this. */ | |
819 | |
820 Lisp_Object | |
821 find_coding_system_for_text_file (Lisp_Object name, int eol_wrap) | |
822 { | |
823 Lisp_Object coding_system = Ffind_coding_system (name); | |
824 Lisp_Object wrapper = coding_system; | |
825 | |
826 if (NILP (coding_system)) | |
827 return Qnil; | |
828 if (!coding_system_is_for_text_file (coding_system)) | |
829 { | |
830 wrapper = XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system); | |
831 if (NILP (wrapper)) | |
832 { | |
833 Lisp_Object chain; | |
834 if (!decoding_source_sink_type_is_char (coding_system, CODING_SINK)) | |
835 chain = list2 (coding_system, Qbinary); | |
836 else | |
837 chain = list1 (coding_system); | |
838 if (decoding_source_sink_type_is_char (coding_system, CODING_SOURCE)) | |
839 chain = Fcons (Qbinary, chain); | |
840 wrapper = | |
841 make_internal_coding_system | |
842 (coding_system, | |
843 "internal-text-file-wrapper", | |
844 Qchain, | |
845 Qunbound, list4 (Qchain, chain, | |
846 Qcanonicalize_after_coding, coding_system)); | |
847 XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system) = wrapper; | |
848 } | |
849 } | |
850 | |
851 if (!eol_wrap || XCODING_SYSTEM_EOL_TYPE (coding_system) != EOL_AUTODETECT) | |
852 return wrapper; | |
853 | |
854 coding_system = wrapper; | |
855 wrapper = XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system); | |
856 if (!NILP (wrapper)) | |
857 return wrapper; | |
858 wrapper = | |
859 make_internal_coding_system | |
860 (coding_system, | |
861 "internal-auto-eol-wrapper", | |
862 Qundecided, Qunbound, | |
863 list4 (Qcoding_system, coding_system, | |
864 Qdo_eol, Qt)); | |
865 XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system) = wrapper; | |
866 return wrapper; | |
867 } | |
868 | |
869 /* Like Fget_coding_system() but verify that the coding system is of the | |
870 "standard" type that decodes bytes into characters (suitable for | |
871 decoding a text file), and if not, returns an appropriate wrapper that | |
872 does. Also, if EOL_WRAP is non-zero, check whether this coding system | |
873 wants EOL auto-detection, and if so, wrap with a convert-eol coding | |
874 system to do this. */ | |
875 | |
876 Lisp_Object | |
877 get_coding_system_for_text_file (Lisp_Object name, int eol_wrap) | |
878 { | |
879 Lisp_Object coding_system = find_coding_system_for_text_file (name, | |
880 eol_wrap); | |
881 if (NILP (coding_system)) | |
882 invalid_argument ("No such coding system", name); | |
883 return coding_system; | |
884 } | |
885 | |
886 /* We store the coding systems in hash tables with the names as the | |
887 key and the actual coding system object as the value. Occasionally | |
888 we need to use them in a list format. These routines provide us | |
889 with that. */ | |
554 struct coding_system_list_closure | 890 struct coding_system_list_closure |
555 { | 891 { |
556 Lisp_Object *coding_system_list; | 892 Lisp_Object *coding_system_list; |
893 int normal; | |
894 int internal; | |
557 }; | 895 }; |
558 | 896 |
559 static int | 897 static int |
560 add_coding_system_to_list_mapper (Lisp_Object key, Lisp_Object value, | 898 add_coding_system_to_list_mapper (Lisp_Object key, Lisp_Object value, |
561 void *coding_system_list_closure) | 899 void *coding_system_list_closure) |
563 /* This function can GC */ | 901 /* This function can GC */ |
564 struct coding_system_list_closure *cscl = | 902 struct coding_system_list_closure *cscl = |
565 (struct coding_system_list_closure *) coding_system_list_closure; | 903 (struct coding_system_list_closure *) coding_system_list_closure; |
566 Lisp_Object *coding_system_list = cscl->coding_system_list; | 904 Lisp_Object *coding_system_list = cscl->coding_system_list; |
567 | 905 |
568 *coding_system_list = Fcons (key, *coding_system_list); | 906 /* We can't just use VALUE because KEY might be an alias, and we need |
907 the real coding system object. */ | |
908 if (XCODING_SYSTEM (Ffind_coding_system (key))->internal_p ? | |
909 cscl->internal : cscl->normal) | |
910 *coding_system_list = Fcons (key, *coding_system_list); | |
569 return 0; | 911 return 0; |
570 } | 912 } |
571 | 913 |
572 DEFUN ("coding-system-list", Fcoding_system_list, 0, 0, 0, /* | 914 DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /* |
573 Return a list of the names of all defined coding systems. | 915 Return a list of the names of all defined coding systems. |
916 If INTERNAL is nil, only the normal (non-internal) coding systems are | |
917 included. (Internal coding systems are created for various internal | |
918 purposes, such as implementing EOL types of CRLF and CR; generally, you do | |
919 not want to see these.) If it is t, only the internal coding systems are | |
920 included. If it is any other non-nil value both normal and internal are | |
921 included. | |
574 */ | 922 */ |
575 ()) | 923 (internal)) |
576 { | 924 { |
577 Lisp_Object coding_system_list = Qnil; | 925 Lisp_Object coding_system_list = Qnil; |
578 struct gcpro gcpro1; | 926 struct gcpro gcpro1; |
579 struct coding_system_list_closure coding_system_list_closure; | 927 struct coding_system_list_closure coding_system_list_closure; |
580 | 928 |
581 GCPRO1 (coding_system_list); | 929 GCPRO1 (coding_system_list); |
582 coding_system_list_closure.coding_system_list = &coding_system_list; | 930 coding_system_list_closure.coding_system_list = &coding_system_list; |
931 coding_system_list_closure.normal = !EQ (internal, Qt); | |
932 coding_system_list_closure.internal = !NILP (internal); | |
583 elisp_maphash (add_coding_system_to_list_mapper, Vcoding_system_hash_table, | 933 elisp_maphash (add_coding_system_to_list_mapper, Vcoding_system_hash_table, |
584 &coding_system_list_closure); | 934 &coding_system_list_closure); |
585 UNGCPRO; | 935 UNGCPRO; |
586 | 936 |
587 return coding_system_list; | 937 return coding_system_list; |
595 coding_system = Fget_coding_system (coding_system); | 945 coding_system = Fget_coding_system (coding_system); |
596 return XCODING_SYSTEM_NAME (coding_system); | 946 return XCODING_SYSTEM_NAME (coding_system); |
597 } | 947 } |
598 | 948 |
599 static Lisp_Coding_System * | 949 static Lisp_Coding_System * |
600 allocate_coding_system (enum coding_system_type type, Lisp_Object name) | 950 allocate_coding_system (struct coding_system_methods *codesys_meths, |
601 { | 951 Bytecount data_size, |
952 Lisp_Object name) | |
953 { | |
954 Bytecount total_size = offsetof (Lisp_Coding_System, data) + data_size; | |
602 Lisp_Coding_System *codesys = | 955 Lisp_Coding_System *codesys = |
603 alloc_lcrecord_type (Lisp_Coding_System, &lrecord_coding_system); | 956 (Lisp_Coding_System *) alloc_lcrecord (total_size, &lrecord_coding_system); |
604 | 957 |
605 zero_lcrecord (codesys); | 958 zero_sized_lcrecord (codesys, total_size); |
959 codesys->methods = codesys_meths; | |
606 CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = Qnil; | 960 CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = Qnil; |
607 CODING_SYSTEM_POST_READ_CONVERSION (codesys) = Qnil; | 961 CODING_SYSTEM_POST_READ_CONVERSION (codesys) = Qnil; |
608 CODING_SYSTEM_EOL_TYPE (codesys) = EOL_AUTODETECT; | 962 CODING_SYSTEM_EOL_TYPE (codesys) = EOL_LF; |
609 CODING_SYSTEM_EOL_CRLF (codesys) = Qnil; | 963 CODING_SYSTEM_EOL_CRLF (codesys) = Qnil; |
610 CODING_SYSTEM_EOL_CR (codesys) = Qnil; | 964 CODING_SYSTEM_EOL_CR (codesys) = Qnil; |
611 CODING_SYSTEM_EOL_LF (codesys) = Qnil; | 965 CODING_SYSTEM_EOL_LF (codesys) = Qnil; |
612 CODING_SYSTEM_TYPE (codesys) = type; | 966 CODING_SYSTEM_SUBSIDIARY_PARENT (codesys) = Qnil; |
967 CODING_SYSTEM_CANONICAL (codesys) = Qnil; | |
613 CODING_SYSTEM_MNEMONIC (codesys) = Qnil; | 968 CODING_SYSTEM_MNEMONIC (codesys) = Qnil; |
614 #ifdef MULE | 969 CODING_SYSTEM_DOCUMENTATION (codesys) = Qnil; |
615 if (type == CODESYS_ISO2022) | 970 CODING_SYSTEM_TEXT_FILE_WRAPPER (codesys) = Qnil; |
616 { | 971 CODING_SYSTEM_AUTO_EOL_WRAPPER (codesys) = Qnil; |
617 int i; | 972 CODING_SYSTEM_NAME (codesys) = name; |
618 for (i = 0; i < 4; i++) | 973 |
619 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i) = Qnil; | 974 MAYBE_CODESYSMETH (codesys, init, (wrap_coding_system (codesys))); |
620 } | |
621 else if (type == CODESYS_CCL) | |
622 { | |
623 CODING_SYSTEM_CCL_DECODE (codesys) = Qnil; | |
624 CODING_SYSTEM_CCL_ENCODE (codesys) = Qnil; | |
625 } | |
626 #endif /* MULE */ | |
627 CODING_SYSTEM_NAME (codesys) = name; | |
628 | 975 |
629 return codesys; | 976 return codesys; |
630 } | 977 } |
631 | 978 |
632 #ifdef MULE | 979 static enum eol_type |
633 /* Given a list of charset conversion specs as specified in a Lisp | 980 symbol_to_eol_type (Lisp_Object symbol) |
634 program, parse it into STORE_HERE. */ | 981 { |
635 | 982 CHECK_SYMBOL (symbol); |
983 if (NILP (symbol)) return EOL_AUTODETECT; | |
984 if (EQ (symbol, Qlf)) return EOL_LF; | |
985 if (EQ (symbol, Qcrlf)) return EOL_CRLF; | |
986 if (EQ (symbol, Qcr)) return EOL_CR; | |
987 | |
988 invalid_constant ("Unrecognized eol type", symbol); | |
989 return EOL_AUTODETECT; /* not reached */ | |
990 } | |
991 | |
992 static Lisp_Object | |
993 eol_type_to_symbol (enum eol_type type) | |
994 { | |
995 switch (type) | |
996 { | |
997 default: abort (); | |
998 case EOL_LF: return Qlf; | |
999 case EOL_CRLF: return Qcrlf; | |
1000 case EOL_CR: return Qcr; | |
1001 case EOL_AUTODETECT: return Qnil; | |
1002 } | |
1003 } | |
1004 | |
1005 struct subsidiary_type | |
1006 { | |
1007 Char_ASCII *extension; | |
1008 Char_ASCII *mnemonic_ext; | |
1009 enum eol_type eol; | |
1010 }; | |
1011 | |
1012 static struct subsidiary_type coding_subsidiary_list[] = | |
1013 { { "-unix", "", EOL_LF }, | |
1014 { "-dos", ":T", EOL_CRLF }, | |
1015 { "-mac", ":t", EOL_CR } }; | |
1016 | |
1017 /* kludge */ | |
636 static void | 1018 static void |
637 parse_charset_conversion_specs (charset_conversion_spec_dynarr *store_here, | 1019 setup_eol_coding_systems (Lisp_Object codesys) |
638 Lisp_Object spec_list) | 1020 { |
639 { | 1021 int len = string_length (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name); |
640 Lisp_Object rest; | 1022 Intbyte *codesys_name = (Intbyte *) alloca (len + 7); |
641 | 1023 int mlen = -1; |
642 EXTERNAL_LIST_LOOP (rest, spec_list) | 1024 Intbyte *codesys_mnemonic = 0; |
643 { | 1025 Lisp_Object codesys_name_sym, sub_codesys; |
644 Lisp_Object car = XCAR (rest); | 1026 int i; |
645 Lisp_Object from, to; | 1027 |
646 struct charset_conversion_spec spec; | 1028 memcpy (codesys_name, |
647 | 1029 string_data (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name), len); |
648 if (!CONSP (car) || !CONSP (XCDR (car)) || !NILP (XCDR (XCDR (car)))) | 1030 |
649 invalid_argument ("Invalid charset conversion spec", car); | 1031 if (STRINGP (XCODING_SYSTEM_MNEMONIC (codesys))) |
650 from = Fget_charset (XCAR (car)); | 1032 { |
651 to = Fget_charset (XCAR (XCDR (car))); | 1033 mlen = XSTRING_LENGTH (XCODING_SYSTEM_MNEMONIC (codesys)); |
652 if (XCHARSET_TYPE (from) != XCHARSET_TYPE (to)) | 1034 codesys_mnemonic = (Intbyte *) alloca (mlen + 7); |
653 invalid_operation_2 | 1035 memcpy (codesys_mnemonic, |
654 ("Attempted conversion between different charset types", | 1036 XSTRING_DATA (XCODING_SYSTEM_MNEMONIC (codesys)), mlen); |
655 from, to); | 1037 } |
656 spec.from_charset = from; | 1038 |
657 spec.to_charset = to; | 1039 /* Create three "subsidiary" coding systems, decoding data encoded using |
658 | 1040 each of the three EOL types. We do this for each subsidiary by |
659 Dynarr_add (store_here, spec); | 1041 copying the original coding system, setting the EOL type |
660 } | 1042 appropriately, and setting the CANONICAL member of the new coding |
661 } | 1043 system to be a chain consisting of the original coding system followed |
662 | 1044 by a convert-eol coding system to do the EOL decoding. For EOL type |
663 /* Given a dynarr LOAD_HERE of internally-stored charset conversion | 1045 LF, however, we don't need any decoding, so we skip creating a |
664 specs, return the equivalent as the Lisp programmer would see it. | 1046 CANONICAL. |
665 | 1047 |
666 If LOAD_HERE is 0, return Qnil. */ | 1048 If the original coding system is not a text-type coding system |
1049 (decodes byte->char), we need to coerce it to one by the appropriate | |
1050 wrapping in CANONICAL. */ | |
1051 | |
1052 for (i = 0; i < countof (coding_subsidiary_list); i++) | |
1053 { | |
1054 Char_ASCII *extension = coding_subsidiary_list[i].extension; | |
1055 Char_ASCII *mnemonic_ext = coding_subsidiary_list[i].mnemonic_ext; | |
1056 enum eol_type eol = coding_subsidiary_list[i].eol; | |
1057 | |
1058 qxestrcpy_c (codesys_name + len, extension); | |
1059 codesys_name_sym = intern_int (codesys_name); | |
1060 if (mlen != -1) | |
1061 qxestrcpy_c (codesys_mnemonic + mlen, mnemonic_ext); | |
1062 | |
1063 sub_codesys = Fcopy_coding_system (codesys, codesys_name_sym); | |
1064 if (mlen != -1) | |
1065 XCODING_SYSTEM_MNEMONIC (sub_codesys) = | |
1066 build_intstring (codesys_mnemonic); | |
1067 | |
1068 if (eol != EOL_LF) | |
1069 { | |
1070 Lisp_Object chain = list2 (get_coding_system_for_text_file | |
1071 (codesys, 0), | |
1072 eol == EOL_CR ? Qconvert_eol_cr : | |
1073 Qconvert_eol_crlf); | |
1074 Lisp_Object canon = | |
1075 make_internal_coding_system | |
1076 (sub_codesys, "internal-subsidiary-eol-wrapper", | |
1077 Qchain, Qunbound, | |
1078 mlen != -1 ? | |
1079 list6 (Qmnemonic, build_intstring (codesys_mnemonic), | |
1080 Qchain, chain, | |
1081 Qcanonicalize_after_coding, sub_codesys) : | |
1082 list4 (Qchain, chain, | |
1083 Qcanonicalize_after_coding, sub_codesys)); | |
1084 XCODING_SYSTEM_CANONICAL (sub_codesys) = canon; | |
1085 } | |
1086 XCODING_SYSTEM_EOL_TYPE (sub_codesys) = eol; | |
1087 XCODING_SYSTEM_SUBSIDIARY_PARENT (sub_codesys) = codesys; | |
1088 XCODING_SYSTEM (codesys)->eol[eol] = sub_codesys; | |
1089 } | |
1090 } | |
1091 | |
1092 /* Basic function to create new coding systems. For `make-coding-system', | |
1093 NAME-OR-EXISTING is the NAME argument, PREFIX is null, and TYPE, | |
1094 DESCRIPTION, and PROPS are the same. All created coding systems are put | |
1095 in a hash table indexed by NAME. | |
1096 | |
1097 If PREFIX is a string, NAME-OR-EXISTING should specify an existing | |
1098 coding system (or nil), and an internal coding system will be created. | |
1099 The name of the coding system will be constructed by combining PREFIX | |
1100 with the name of the existing coding system (if given), and a number | |
1101 will be appended to insure uniqueness. In such a case, if Qunbound is | |
1102 given for DESCRIPTION, the description gets created based on the | |
1103 generated name. Also, if no mnemonic is given in the properties list, a | |
1104 mnemonic is created based on the generated name. | |
1105 | |
1106 For internal coding systems, the coding system is marked as internal | |
1107 (see `coding-system-list'), and no subsidiaries will be created or | |
1108 eol-wrapping will happen. Otherwise: | |
1109 | |
1110 -- if the eol-type property is `lf' or t, the coding system is merely | |
1111 created and returned. (For t, the coding system will be wrapped with | |
1112 an EOL autodetector when it's used to read a file.) | |
1113 | |
1114 -- if eol-type is `crlf' or `cr', after the coding system object is | |
1115 created, it will be wrapped in a chain with the appropriate | |
1116 convert-eol coding system (either `convert-eol-crlf' or | |
1117 `convert-eol-cr'), so that CRLF->LF or CR->LF conversion is done at | |
1118 decoding time, and the opposite at encoding time. The resulting | |
1119 chain becomes the CANONICAL field of the coding system object. | |
1120 | |
1121 -- if eol-type is nil or omitted, "subsidiaries" are generated: Three | |
1122 coding systems where the original coding system (before wrapping with | |
1123 convert-eol-autodetect) is either unwrapped or wrapped with | |
1124 convert-eol-crlf or convert-eol-cr, respectively, so that coding systems | |
1125 to handle LF, CRLF, and CR end-of-line indicators are created. (This | |
1126 crazy crap is based on existing behavior in other Mule versions, | |
1127 including FSF Emacs.) | |
1128 */ | |
667 | 1129 |
668 static Lisp_Object | 1130 static Lisp_Object |
669 unparse_charset_conversion_specs (charset_conversion_spec_dynarr *load_here) | 1131 make_coding_system_1 (Lisp_Object name_or_existing, Char_ASCII *prefix, |
670 { | 1132 Lisp_Object type, Lisp_Object description, |
671 int i; | 1133 Lisp_Object props) |
672 Lisp_Object result; | 1134 { |
673 | 1135 Lisp_Coding_System *cs; |
674 if (!load_here) | 1136 int need_to_setup_eol_systems = 1; |
675 return Qnil; | 1137 enum eol_type eol_wrapper = EOL_AUTODETECT; |
676 for (i = 0, result = Qnil; i < Dynarr_length (load_here); i++) | 1138 struct coding_system_methods *meths; |
677 { | 1139 Lisp_Object csobj; |
678 struct charset_conversion_spec *ccs = Dynarr_atp (load_here, i); | 1140 Lisp_Object defmnem = Qnil; |
679 result = Fcons (list2 (ccs->from_charset, ccs->to_charset), result); | 1141 |
680 } | 1142 if (NILP (type)) |
681 | 1143 type = Qundecided; |
682 return Fnreverse (result); | 1144 meths = decode_coding_system_type (type, ERROR_ME); |
683 } | 1145 |
684 | 1146 if (prefix) |
685 #endif /* MULE */ | 1147 { |
1148 Intbyte *newname = | |
1149 emacs_sprintf_malloc (NULL, "%s-%s-%d", | |
1150 prefix, | |
1151 NILP (name_or_existing) ? (Intbyte *) "nil" : | |
1152 XSTRING_DATA (Fsymbol_name (XCODING_SYSTEM_NAME | |
1153 (name_or_existing))), | |
1154 ++coding_system_tick); | |
1155 name_or_existing = intern_int (newname); | |
1156 xfree (newname); | |
1157 | |
1158 if (UNBOUNDP (description)) | |
1159 { | |
1160 newname = | |
1161 emacs_sprintf_malloc | |
1162 (NULL, "For Internal Use (%s)", | |
1163 XSTRING_DATA (Fsymbol_name (name_or_existing))); | |
1164 description = build_intstring (newname); | |
1165 xfree (newname); | |
1166 } | |
1167 | |
1168 newname = emacs_sprintf_malloc (NULL, "Int%d", coding_system_tick); | |
1169 defmnem = build_intstring (newname); | |
1170 } | |
1171 else | |
1172 CHECK_SYMBOL (name_or_existing); | |
1173 | |
1174 if (!NILP (Ffind_coding_system (name_or_existing))) | |
1175 invalid_operation ("Cannot redefine existing coding system", | |
1176 name_or_existing); | |
1177 | |
1178 cs = allocate_coding_system (meths, meths->extra_data_size, | |
1179 name_or_existing); | |
1180 XSETCODING_SYSTEM (csobj, cs); | |
1181 | |
1182 cs->internal_p = !!prefix; | |
1183 | |
1184 if (NILP (description)) | |
1185 description = build_string (""); | |
1186 else | |
1187 CHECK_STRING (description); | |
1188 CODING_SYSTEM_DESCRIPTION (cs) = description; | |
1189 | |
1190 if (!NILP (defmnem)) | |
1191 CODING_SYSTEM_MNEMONIC (cs) = defmnem; | |
1192 | |
1193 { | |
1194 EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props) | |
1195 { | |
1196 int recognized = 1; | |
1197 | |
1198 if (EQ (key, Qmnemonic)) | |
1199 { | |
1200 if (!NILP (value)) | |
1201 CHECK_STRING (value); | |
1202 CODING_SYSTEM_MNEMONIC (cs) = value; | |
1203 } | |
1204 | |
1205 else if (EQ (key, Qdocumentation)) | |
1206 { | |
1207 if (!NILP (value)) | |
1208 CHECK_STRING (value); | |
1209 CODING_SYSTEM_DOCUMENTATION (cs) = value; | |
1210 } | |
1211 | |
1212 else if (EQ (key, Qeol_type)) | |
1213 { | |
1214 need_to_setup_eol_systems = NILP (value); | |
1215 if (EQ (value, Qt)) | |
1216 value = Qnil; | |
1217 eol_wrapper = symbol_to_eol_type (value); | |
1218 } | |
1219 | |
1220 else if (EQ (key, Qpost_read_conversion)) | |
1221 CODING_SYSTEM_POST_READ_CONVERSION (cs) = value; | |
1222 else if (EQ (key, Qpre_write_conversion)) | |
1223 CODING_SYSTEM_PRE_WRITE_CONVERSION (cs) = value; | |
1224 /* FSF compatibility */ | |
1225 else if (EQ (key, Qtranslation_table_for_decode)) | |
1226 ; | |
1227 else if (EQ (key, Qtranslation_table_for_encode)) | |
1228 ; | |
1229 else if (EQ (key, Qsafe_chars)) | |
1230 ; | |
1231 else if (EQ (key, Qsafe_charsets)) | |
1232 ; | |
1233 else if (EQ (key, Qmime_charset)) | |
1234 ; | |
1235 else if (EQ (key, Qvalid_codes)) | |
1236 ; | |
1237 else | |
1238 recognized = CODESYSMETH_OR_GIVEN (cs, putprop, | |
1239 (csobj, key, value), 0); | |
1240 | |
1241 if (!recognized) | |
1242 invalid_constant ("Unrecognized property", key); | |
1243 } | |
1244 } | |
1245 | |
1246 { | |
1247 XCODING_SYSTEM_CANONICAL (csobj) = | |
1248 CODESYSMETH_OR_GIVEN (cs, canonicalize, (csobj), Qnil); | |
1249 XCODING_SYSTEM_EOL_TYPE (csobj) = EOL_AUTODETECT; /* for copy-coding-system | |
1250 below */ | |
1251 | |
1252 if (need_to_setup_eol_systems && !cs->internal_p) | |
1253 setup_eol_coding_systems (csobj); | |
1254 else if (eol_wrapper == EOL_CR || eol_wrapper == EOL_CRLF) | |
1255 { | |
1256 /* If a specific eol-type (other than LF) was specified, we handle | |
1257 this by converting the coding system into a chain that wraps the | |
1258 coding system along with a convert-eol system after it, in | |
1259 exactly that same switcheroo fashion that the normal | |
1260 canonicalize method works -- BUT we will run into a problem if | |
1261 we do it the obvious way, because when `chain' creates its | |
1262 substreams, the substream containing the coding system we're | |
1263 creating will have canonicalization expansion done on it, | |
1264 leading to infinite recursion. So we have to generate a new, | |
1265 internal coding system with the previous value of CANONICAL. */ | |
1266 Intbyte *newname = | |
1267 emacs_sprintf_malloc | |
1268 (NULL, "internal-eol-copy-%s-%d", | |
1269 XSTRING_DATA (Fsymbol_name (name_or_existing)), | |
1270 ++coding_system_tick); | |
1271 Lisp_Object newnamesym = intern_int (newname); | |
1272 Lisp_Object copied = Fcopy_coding_system (csobj, newnamesym); | |
1273 xfree (newname); | |
1274 | |
1275 XCODING_SYSTEM_CANONICAL (csobj) = | |
1276 make_internal_coding_system | |
1277 (csobj, | |
1278 "internal-eol-wrapper", | |
1279 Qchain, Qunbound, | |
1280 list4 (Qchain, | |
1281 list2 (copied, | |
1282 eol_wrapper == EOL_CR ? | |
1283 Qconvert_eol_cr : | |
1284 Qconvert_eol_crlf), | |
1285 Qcanonicalize_after_coding, | |
1286 csobj)); | |
1287 } | |
1288 XCODING_SYSTEM_EOL_TYPE (csobj) = eol_wrapper; | |
1289 } | |
1290 | |
1291 Fputhash (name_or_existing, csobj, Vcoding_system_hash_table); | |
1292 | |
1293 return csobj; | |
1294 } | |
1295 | |
1296 Lisp_Object | |
1297 make_internal_coding_system (Lisp_Object existing, Char_ASCII *prefix, | |
1298 Lisp_Object type, Lisp_Object description, | |
1299 Lisp_Object props) | |
1300 { | |
1301 return make_coding_system_1 (existing, prefix, type, description, props); | |
1302 } | |
686 | 1303 |
687 DEFUN ("make-coding-system", Fmake_coding_system, 2, 4, 0, /* | 1304 DEFUN ("make-coding-system", Fmake_coding_system, 2, 4, 0, /* |
688 Register symbol NAME as a coding system. | 1305 Register symbol NAME as a coding system. |
689 | 1306 |
690 TYPE describes the conversion method used and should be one of | 1307 TYPE describes the conversion method used and should be one of |
691 | 1308 |
692 nil or 'undecided | 1309 nil or 'undecided |
693 Automatic conversion. XEmacs attempts to detect the coding system | 1310 Automatic conversion. XEmacs attempts to detect the coding system |
694 used in the file. | 1311 used in the file. |
1312 'chain | |
1313 Chain two or more coding systems together to make a combination coding | |
1314 system. | |
695 'no-conversion | 1315 'no-conversion |
696 No conversion. Use this for binary files and such. On output, | 1316 No conversion. Use this for binary files and such. On output, |
697 graphic characters that are not in ASCII or Latin-1 will be | 1317 graphic characters that are not in ASCII or Latin-1 will be |
698 replaced by a ?. (For a no-conversion-encoded buffer, these | 1318 replaced by a ?. (For a no-conversion-encoded buffer, these |
699 characters will only be present if you explicitly insert them.) | 1319 characters will only be present if you explicitly insert them.) |
1320 'convert-eol | |
1321 Convert CRLF sequences or CR to LF. | |
700 'shift-jis | 1322 'shift-jis |
701 Shift-JIS (a Japanese encoding commonly used in PC operating systems). | 1323 Shift-JIS (a Japanese encoding commonly used in PC operating systems). |
702 'ucs-4 | 1324 'unicode |
703 ISO 10646 UCS-4 encoding. | 1325 Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.). |
704 'utf-8 | 1326 'mswindows-unicode-to-multibyte |
705 ISO 10646 UTF-8 encoding. | 1327 (MS Windows only) Converts from Windows Unicode to Windows Multibyte |
1328 (any code page encoding) upon encoding, and the other way upon decoding. | |
1329 'mswindows-multibyte | |
1330 Converts to or from Windows Multibyte (any code page encoding). | |
1331 This is resolved into a chain of `mswindows-unicode' and | |
1332 `mswindows-unicode-to-multibyte'. | |
706 'iso2022 | 1333 'iso2022 |
707 Any ISO2022-compliant encoding. Among other things, this includes | 1334 Any ISO2022-compliant encoding. Among other things, this includes |
708 JIS (the Japanese encoding commonly used for e-mail), EUC (the | 1335 JIS (the Japanese encoding commonly used for e-mail), EUC (the |
709 standard Unix encoding for Japanese and other languages), and | 1336 standard Unix encoding for Japanese and other languages), and |
710 Compound Text (the encoding used in X11). You can specify more | 1337 Compound Text (the encoding used in X11). You can specify more |
713 Big5 (the encoding commonly used for Taiwanese). | 1340 Big5 (the encoding commonly used for Taiwanese). |
714 'ccl | 1341 'ccl |
715 The conversion is performed using a user-written pseudo-code | 1342 The conversion is performed using a user-written pseudo-code |
716 program. CCL (Code Conversion Language) is the name of this | 1343 program. CCL (Code Conversion Language) is the name of this |
717 pseudo-code. | 1344 pseudo-code. |
1345 'gzip | |
1346 GZIP compression format. | |
718 'internal | 1347 'internal |
719 Write out or read in the raw contents of the memory representing | 1348 Write out or read in the raw contents of the memory representing |
720 the buffer's text. This is primarily useful for debugging | 1349 the buffer's text. This is primarily useful for debugging |
721 purposes, and is only enabled when XEmacs has been compiled with | 1350 purposes, and is only enabled when XEmacs has been compiled with |
722 DEBUG_XEMACS defined (via the --debug configure option). | 1351 DEBUG_XEMACS defined (via the --debug configure option). |
724 in an internal inconsistency in the memory representing a | 1353 in an internal inconsistency in the memory representing a |
725 buffer's text, which will produce unpredictable results and may | 1354 buffer's text, which will produce unpredictable results and may |
726 cause XEmacs to crash. Under normal circumstances you should | 1355 cause XEmacs to crash. Under normal circumstances you should |
727 never use 'internal conversion. | 1356 never use 'internal conversion. |
728 | 1357 |
729 DOC-STRING is a string describing the coding system. | 1358 DESCRIPTION is a short English phrase describing the coding system, |
1359 suitable for use as a menu item. (See also the `documentation' property | |
1360 below.) | |
730 | 1361 |
731 PROPS is a property list, describing the specific nature of the | 1362 PROPS is a property list, describing the specific nature of the |
732 character set. Recognized properties are: | 1363 character set. Recognized properties are: |
733 | 1364 |
734 'mnemonic | 1365 'mnemonic |
735 String to be displayed in the modeline when this coding system is | 1366 String to be displayed in the modeline when this coding system is |
736 active. | 1367 active. |
1368 | |
1369 'documentation | |
1370 Detailed documentation on the coding system. | |
737 | 1371 |
738 'eol-type | 1372 'eol-type |
739 End-of-line conversion to be used. It should be one of | 1373 End-of-line conversion to be used. It should be one of |
740 | 1374 |
741 nil | 1375 nil |
762 generate subsidiary coding systems. (This value is | 1396 generate subsidiary coding systems. (This value is |
763 converted to nil when stored internally, and | 1397 converted to nil when stored internally, and |
764 `coding-system-property' will return nil.) | 1398 `coding-system-property' will return nil.) |
765 | 1399 |
766 'post-read-conversion | 1400 'post-read-conversion |
767 Function called after a file has been read in, to perform the | 1401 The value is a function to call after some text is inserted and |
768 decoding. Called with two arguments, START and END, denoting | 1402 decoded by the coding system itself and before any functions in |
769 a region of the current buffer to be decoded. | 1403 `after-change-functions' are called. (#### Not actually true in |
1404 XEmacs. `after-change-functions' will be called twice if | |
1405 `post-read-conversion' changes something.) The argument of this | |
1406 function is the same as for a function in | |
1407 `after-insert-file-functions', i.e. LENGTH of the text inserted, | |
1408 with point at the head of the text to be decoded. | |
770 | 1409 |
771 'pre-write-conversion | 1410 'pre-write-conversion |
772 Function called before a file is written out, to perform the | 1411 The value is a function to call after all functions in |
773 encoding. Called with two arguments, START and END, denoting | 1412 `write-region-annotate-functions' and `buffer-file-format' are |
774 a region of the current buffer to be encoded. | 1413 called, and before the text is encoded by the coding system itself. |
1414 The arguments to this function are the same as those of a function | |
1415 in `write-region-annotate-functions', i.e. FROM and TO, specifying | |
1416 a region of text. | |
1417 | |
1418 | |
1419 | |
1420 The following properties are allowed for FSF compatibility but currently | |
1421 ignored: | |
1422 | |
1423 'translation-table-for-decode | |
1424 The value is a translation table to be applied on decoding. See | |
1425 the function `make-translation-table' for the format of translation | |
1426 table. This is not applicable to CCL-based coding systems. | |
1427 | |
1428 'translation-table-for-encode | |
1429 The value is a translation table to be applied on encoding. This is | |
1430 not applicable to CCL-based coding systems. | |
1431 | |
1432 'safe-chars | |
1433 The value is a char table. If a character has non-nil value in it, | |
1434 the character is safely supported by the coding system. This | |
1435 overrides the specification of safe-charsets. | |
1436 | |
1437 'safe-charsets | |
1438 The value is a list of charsets safely supported by the coding | |
1439 system. The value t means that all charsets Emacs handles are | |
1440 supported. Even if some charset is not in this list, it doesn't | |
1441 mean that the charset can't be encoded in the coding system; | |
1442 it just means that some other receiver of text encoded | |
1443 in the coding system won't be able to handle that charset. | |
1444 | |
1445 'mime-charset | |
1446 The value is a symbol of which name is `MIME-charset' parameter of | |
1447 the coding system. | |
1448 | |
1449 'valid-codes (meaningful only for a coding system based on CCL) | |
1450 The value is a list to indicate valid byte ranges of the encoded | |
1451 file. Each element of the list is an integer or a cons of integer. | |
1452 In the former case, the integer value is a valid byte code. In the | |
1453 latter case, the integers specifies the range of valid byte codes. | |
1454 | |
1455 | |
1456 | |
1457 The following additional property is recognized if TYPE is 'convert-eol: | |
1458 | |
1459 'subtype | |
1460 One of `lf', `crlf', `cr' or `autodetect'. When decoding, the | |
1461 corresponding sequence will be converted to LF. When encoding, the | |
1462 opposite happens. This coding system converts characters to | |
1463 characters. | |
1464 | |
775 | 1465 |
776 | 1466 |
777 The following additional properties are recognized if TYPE is 'iso2022: | 1467 The following additional properties are recognized if TYPE is 'iso2022: |
778 | 1468 |
779 'charset-g0 | 1469 'charset-g0 |
844 characters in one charset to another when encoding is performed. | 1534 characters in one charset to another when encoding is performed. |
845 The form of each specification is the same as for | 1535 The form of each specification is the same as for |
846 'input-charset-conversion. | 1536 'input-charset-conversion. |
847 | 1537 |
848 | 1538 |
1539 | |
849 The following additional properties are recognized (and required) | 1540 The following additional properties are recognized (and required) |
850 if TYPE is 'ccl: | 1541 if TYPE is 'ccl: |
851 | 1542 |
852 'decode | 1543 'decode |
853 CCL program used for decoding (converting to internal format). | 1544 CCL program used for decoding (converting to internal format). |
854 | 1545 |
855 'encode | 1546 'encode |
856 CCL program used for encoding (converting to external format). | 1547 CCL program used for encoding (converting to external format). |
1548 | |
1549 | |
1550 The following additional properties are recognized if TYPE is 'chain: | |
1551 | |
1552 'chain | |
1553 List of coding systems to be chained together, in decoding order. | |
1554 | |
1555 'canonicalize-after-coding | |
1556 Coding system to be returned by the detector routines in place of | |
1557 this coding system. | |
1558 | |
1559 | |
1560 | |
1561 The following additional properties are recognized if TYPE is 'unicode: | |
1562 | |
1563 'type | |
1564 One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not | |
1565 yet implemented). `utf-16' is the basic two-byte encoding; | |
1566 `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible | |
1567 variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using | |
1568 only characters that will safely pass through all mail gateways. | |
1569 | |
1570 'little-endian | |
1571 If non-nil, `utf-16' and `ucs-4' will write out the groups of two | |
1572 or four bytes little-endian instead of big-endian. This is required, | |
1573 for example, under Windows. | |
1574 | |
1575 'need-bom | |
1576 If non-nil, a byte order mark (BOM, or Unicode FFFE) should be | |
1577 written out at the beginning of the data. This serves both to | |
1578 identify the endianness of the following data and to mark the | |
1579 data as Unicode (at least, this is how Windows uses it). | |
1580 | |
1581 | |
1582 | |
1583 The following additional properties are recognized if TYPE is | |
1584 'mswindows-multibyte: | |
1585 | |
1586 'code-page | |
1587 Either a number (specifying a particular code page) or one of the | |
1588 symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI, | |
1589 OEM, Macintosh, or EBCDIC code page associated with a particular | |
1590 locale (given by the `locale' property). NOTE: EBCDIC code pages | |
1591 only exist in Windows 2000 and later. | |
1592 | |
1593 'locale | |
1594 If `code-page' is a symbol, this specifies the locale whose code | |
1595 page of the corresponding type should be used. This should be | |
1596 one of the following: A cons of two strings, (LANGUAGE | |
1597 . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a | |
1598 language; SUBLANG_DEFAULT, i.e. the default sublanguage, is | |
1599 used); or one of the symbols `current', `user-default', or | |
1600 `system-default', corresponding to the values of | |
1601 `mswindows-current-locale', `mswindows-user-default-locale', or | |
1602 `mswindows-system-default-locale', respectively. | |
1603 | |
1604 | |
1605 | |
1606 The following additional properties are recognized if TYPE is 'undecided: | |
1607 | |
1608 'do-eol | |
1609 Do EOL detection. | |
1610 | |
1611 'do-coding | |
1612 Do encoding detection. | |
1613 | |
1614 'coding-system | |
1615 If encoding detection is not done, use the specified coding system | |
1616 to do decoding. This is used internally when implementing coding | |
1617 systems with an EOL type that specifies autodetection (the default), | |
1618 so that the detector routines return the proper subsidiary. | |
1619 | |
1620 | |
1621 | |
1622 The following additional property is recognized if TYPE is 'gzip: | |
1623 | |
1624 'level | |
1625 Compression level: 0 through 9, or `default' (currently 6). | |
1626 | |
857 */ | 1627 */ |
858 (name, type, doc_string, props)) | 1628 (name, type, description, props)) |
859 { | 1629 { |
860 Lisp_Coding_System *codesys; | 1630 return make_coding_system_1 (name, 0, type, description, props); |
861 enum coding_system_type ty; | |
862 int need_to_setup_eol_systems = 1; | |
863 | |
864 /* Convert type to constant */ | |
865 if (NILP (type) || EQ (type, Qundecided)) | |
866 { ty = CODESYS_AUTODETECT; } | |
867 #ifdef MULE | |
868 else if (EQ (type, Qshift_jis)) { ty = CODESYS_SHIFT_JIS; } | |
869 else if (EQ (type, Qiso2022)) { ty = CODESYS_ISO2022; } | |
870 else if (EQ (type, Qbig5)) { ty = CODESYS_BIG5; } | |
871 else if (EQ (type, Qucs4)) { ty = CODESYS_UCS4; } | |
872 else if (EQ (type, Qutf8)) { ty = CODESYS_UTF8; } | |
873 else if (EQ (type, Qccl)) { ty = CODESYS_CCL; } | |
874 #endif | |
875 else if (EQ (type, Qno_conversion)) { ty = CODESYS_NO_CONVERSION; } | |
876 #ifdef DEBUG_XEMACS | |
877 else if (EQ (type, Qinternal)) { ty = CODESYS_INTERNAL; } | |
878 #endif | |
879 else | |
880 invalid_constant ("Invalid coding system type", type); | |
881 | |
882 CHECK_SYMBOL (name); | |
883 | |
884 codesys = allocate_coding_system (ty, name); | |
885 | |
886 if (NILP (doc_string)) | |
887 doc_string = build_string (""); | |
888 else | |
889 CHECK_STRING (doc_string); | |
890 CODING_SYSTEM_DOC_STRING (codesys) = doc_string; | |
891 | |
892 { | |
893 EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props) | |
894 { | |
895 if (EQ (key, Qmnemonic)) | |
896 { | |
897 if (!NILP (value)) | |
898 CHECK_STRING (value); | |
899 CODING_SYSTEM_MNEMONIC (codesys) = value; | |
900 } | |
901 | |
902 else if (EQ (key, Qeol_type)) | |
903 { | |
904 need_to_setup_eol_systems = NILP (value); | |
905 if (EQ (value, Qt)) | |
906 value = Qnil; | |
907 CODING_SYSTEM_EOL_TYPE (codesys) = symbol_to_eol_type (value); | |
908 } | |
909 | |
910 else if (EQ (key, Qpost_read_conversion)) CODING_SYSTEM_POST_READ_CONVERSION (codesys) = value; | |
911 else if (EQ (key, Qpre_write_conversion)) CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = value; | |
912 #ifdef MULE | |
913 else if (ty == CODESYS_ISO2022) | |
914 { | |
915 #define FROB_INITIAL_CHARSET(charset_num) \ | |
916 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, charset_num) = \ | |
917 ((EQ (value, Qt) || EQ (value, Qnil)) ? value : Fget_charset (value)) | |
918 | |
919 if (EQ (key, Qcharset_g0)) FROB_INITIAL_CHARSET (0); | |
920 else if (EQ (key, Qcharset_g1)) FROB_INITIAL_CHARSET (1); | |
921 else if (EQ (key, Qcharset_g2)) FROB_INITIAL_CHARSET (2); | |
922 else if (EQ (key, Qcharset_g3)) FROB_INITIAL_CHARSET (3); | |
923 | |
924 #define FROB_FORCE_CHARSET(charset_num) \ | |
925 CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (codesys, charset_num) = !NILP (value) | |
926 | |
927 else if (EQ (key, Qforce_g0_on_output)) FROB_FORCE_CHARSET (0); | |
928 else if (EQ (key, Qforce_g1_on_output)) FROB_FORCE_CHARSET (1); | |
929 else if (EQ (key, Qforce_g2_on_output)) FROB_FORCE_CHARSET (2); | |
930 else if (EQ (key, Qforce_g3_on_output)) FROB_FORCE_CHARSET (3); | |
931 | |
932 #define FROB_BOOLEAN_PROPERTY(prop) \ | |
933 CODING_SYSTEM_ISO2022_##prop (codesys) = !NILP (value) | |
934 | |
935 else if (EQ (key, Qshort)) FROB_BOOLEAN_PROPERTY (SHORT); | |
936 else if (EQ (key, Qno_ascii_eol)) FROB_BOOLEAN_PROPERTY (NO_ASCII_EOL); | |
937 else if (EQ (key, Qno_ascii_cntl)) FROB_BOOLEAN_PROPERTY (NO_ASCII_CNTL); | |
938 else if (EQ (key, Qseven)) FROB_BOOLEAN_PROPERTY (SEVEN); | |
939 else if (EQ (key, Qlock_shift)) FROB_BOOLEAN_PROPERTY (LOCK_SHIFT); | |
940 else if (EQ (key, Qno_iso6429)) FROB_BOOLEAN_PROPERTY (NO_ISO6429); | |
941 else if (EQ (key, Qescape_quoted)) FROB_BOOLEAN_PROPERTY (ESCAPE_QUOTED); | |
942 | |
943 else if (EQ (key, Qinput_charset_conversion)) | |
944 { | |
945 codesys->iso2022.input_conv = | |
946 Dynarr_new (charset_conversion_spec); | |
947 parse_charset_conversion_specs (codesys->iso2022.input_conv, | |
948 value); | |
949 } | |
950 else if (EQ (key, Qoutput_charset_conversion)) | |
951 { | |
952 codesys->iso2022.output_conv = | |
953 Dynarr_new (charset_conversion_spec); | |
954 parse_charset_conversion_specs (codesys->iso2022.output_conv, | |
955 value); | |
956 } | |
957 else | |
958 invalid_constant ("Unrecognized property", key); | |
959 } | |
960 else if (EQ (type, Qccl)) | |
961 { | |
962 Lisp_Object sym; | |
963 struct ccl_program test_ccl; | |
964 Extbyte *suffix; | |
965 | |
966 /* Check key first. */ | |
967 if (EQ (key, Qdecode)) | |
968 suffix = "-ccl-decode"; | |
969 else if (EQ (key, Qencode)) | |
970 suffix = "-ccl-encode"; | |
971 else | |
972 invalid_constant ("Unrecognized property", key); | |
973 | |
974 /* If value is vector, register it as a ccl program | |
975 associated with an newly created symbol for | |
976 backward compatibility. */ | |
977 if (VECTORP (value)) | |
978 { | |
979 sym = Fintern (concat2 (Fsymbol_name (name), | |
980 build_string (suffix)), | |
981 Qnil); | |
982 Fregister_ccl_program (sym, value); | |
983 } | |
984 else | |
985 { | |
986 CHECK_SYMBOL (value); | |
987 sym = value; | |
988 } | |
989 /* check if the given ccl programs are valid. */ | |
990 if (setup_ccl_program (&test_ccl, sym) < 0) | |
991 invalid_argument ("Invalid CCL program", value); | |
992 | |
993 if (EQ (key, Qdecode)) | |
994 CODING_SYSTEM_CCL_DECODE (codesys) = sym; | |
995 else if (EQ (key, Qencode)) | |
996 CODING_SYSTEM_CCL_ENCODE (codesys) = sym; | |
997 | |
998 } | |
999 #endif /* MULE */ | |
1000 else | |
1001 invalid_constant ("Unrecognized property", key); | |
1002 } | |
1003 } | |
1004 | |
1005 if (need_to_setup_eol_systems) | |
1006 setup_eol_coding_systems (codesys); | |
1007 | |
1008 { | |
1009 Lisp_Object codesys_obj; | |
1010 XSETCODING_SYSTEM (codesys_obj, codesys); | |
1011 Fputhash (name, codesys_obj, Vcoding_system_hash_table); | |
1012 return codesys_obj; | |
1013 } | |
1014 } | 1631 } |
1015 | 1632 |
1016 DEFUN ("copy-coding-system", Fcopy_coding_system, 2, 2, 0, /* | 1633 DEFUN ("copy-coding-system", Fcopy_coding_system, 2, 2, 0, /* |
1017 Copy OLD-CODING-SYSTEM to NEW-NAME. | 1634 Copy OLD-CODING-SYSTEM to NEW-NAME. |
1018 If NEW-NAME does not name an existing coding system, a new one will | 1635 If NEW-NAME does not name an existing coding system, a new one will |
1019 be created. | 1636 be created. |
1637 If you are using this function to create an alias, think again: | |
1638 Use `define-coding-system-alias' instead. | |
1020 */ | 1639 */ |
1021 (old_coding_system, new_name)) | 1640 (old_coding_system, new_name)) |
1022 { | 1641 { |
1023 Lisp_Object new_coding_system; | 1642 Lisp_Object new_coding_system; |
1024 old_coding_system = Fget_coding_system (old_coding_system); | 1643 old_coding_system = Fget_coding_system (old_coding_system); |
1025 new_coding_system = Ffind_coding_system (new_name); | 1644 new_coding_system = |
1645 UNBOUNDP (new_name) ? Qnil : Ffind_coding_system (new_name); | |
1026 if (NILP (new_coding_system)) | 1646 if (NILP (new_coding_system)) |
1027 { | 1647 { |
1028 XSETCODING_SYSTEM (new_coding_system, | 1648 XSETCODING_SYSTEM |
1029 allocate_coding_system | 1649 (new_coding_system, |
1030 (XCODING_SYSTEM_TYPE (old_coding_system), | 1650 allocate_coding_system |
1031 new_name)); | 1651 (XCODING_SYSTEM (old_coding_system)->methods, |
1032 Fputhash (new_name, new_coding_system, Vcoding_system_hash_table); | 1652 XCODING_SYSTEM (old_coding_system)->methods->extra_data_size, |
1033 } | 1653 new_name)); |
1654 if (!UNBOUNDP (new_name)) | |
1655 Fputhash (new_name, new_coding_system, Vcoding_system_hash_table); | |
1656 } | |
1657 else if (XCODING_SYSTEM (old_coding_system)->methods != | |
1658 XCODING_SYSTEM (new_coding_system)->methods) | |
1659 invalid_operation_2 ("Coding systems not same type", | |
1660 old_coding_system, new_coding_system); | |
1034 | 1661 |
1035 { | 1662 { |
1036 Lisp_Coding_System *to = XCODING_SYSTEM (new_coding_system); | 1663 Lisp_Coding_System *to = XCODING_SYSTEM (new_coding_system); |
1037 Lisp_Coding_System *from = XCODING_SYSTEM (old_coding_system); | 1664 Lisp_Coding_System *from = XCODING_SYSTEM (old_coding_system); |
1038 memcpy (((char *) to ) + sizeof (to->header), | 1665 copy_sized_lcrecord (to, from, sizeof_coding_system (from)); |
1039 ((char *) from) + sizeof (from->header), | |
1040 sizeof (*from) - sizeof (from->header)); | |
1041 to->name = new_name; | 1666 to->name = new_name; |
1042 } | 1667 } |
1043 return new_coding_system; | 1668 return new_coding_system; |
1044 } | 1669 } |
1045 | 1670 |
1046 DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p, 1, 1, 0, /* | 1671 DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p, |
1672 1, 1, 0, /* | |
1047 Return t if OBJECT names a coding system, and is not a coding system alias. | 1673 Return t if OBJECT names a coding system, and is not a coding system alias. |
1048 */ | 1674 */ |
1049 (object)) | 1675 (object)) |
1050 { | 1676 { |
1051 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil)) | 1677 return CODING_SYSTEMP (Fgethash (object, Vcoding_system_hash_table, Qnil)) |
1073 else | 1699 else |
1074 invalid_argument ("Symbol is not a coding system alias", alias); | 1700 invalid_argument ("Symbol is not a coding system alias", alias); |
1075 return Qnil; /* To keep the compiler happy */ | 1701 return Qnil; /* To keep the compiler happy */ |
1076 } | 1702 } |
1077 | 1703 |
1078 static Lisp_Object | |
1079 append_suffix_to_symbol (Lisp_Object symbol, const char *ascii_string) | |
1080 { | |
1081 return Fintern (concat2 (Fsymbol_name (symbol), build_string (ascii_string)), | |
1082 Qnil); | |
1083 } | |
1084 | |
1085 /* A maphash function, for removing dangling coding system aliases. */ | 1704 /* A maphash function, for removing dangling coding system aliases. */ |
1086 static int | 1705 static int |
1087 dangling_coding_system_alias_p (Lisp_Object alias, | 1706 dangling_coding_system_alias_p (Lisp_Object alias, |
1088 Lisp_Object aliasee, | 1707 Lisp_Object aliasee, |
1089 void *dangling_aliases) | 1708 void *dangling_aliases) |
1127 ("Symbol is the canonical name of a coding system and cannot be redefined", | 1746 ("Symbol is the canonical name of a coding system and cannot be redefined", |
1128 alias); | 1747 alias); |
1129 | 1748 |
1130 if (NILP (aliasee)) | 1749 if (NILP (aliasee)) |
1131 { | 1750 { |
1132 Lisp_Object subsidiary_unix = append_suffix_to_symbol (alias, "-unix"); | 1751 Lisp_Object subsidiary_unix = add_suffix_to_symbol (alias, "-unix"); |
1133 Lisp_Object subsidiary_dos = append_suffix_to_symbol (alias, "-dos"); | 1752 Lisp_Object subsidiary_dos = add_suffix_to_symbol (alias, "-dos"); |
1134 Lisp_Object subsidiary_mac = append_suffix_to_symbol (alias, "-mac"); | 1753 Lisp_Object subsidiary_mac = add_suffix_to_symbol (alias, "-mac"); |
1135 | 1754 |
1136 Fremhash (alias, Vcoding_system_hash_table); | 1755 Fremhash (alias, Vcoding_system_hash_table); |
1137 | 1756 |
1138 /* Undefine subsidiary aliases, | 1757 /* Undefine subsidiary aliases, |
1139 presumably created by a previous call to this function */ | 1758 presumably created by a previous call to this function */ |
1188 static const char *suffixes[] = { "-unix", "-dos", "-mac" }; | 1807 static const char *suffixes[] = { "-unix", "-dos", "-mac" }; |
1189 int i; | 1808 int i; |
1190 for (i = 0; i < countof (suffixes); i++) | 1809 for (i = 0; i < countof (suffixes); i++) |
1191 { | 1810 { |
1192 Lisp_Object alias_subsidiary = | 1811 Lisp_Object alias_subsidiary = |
1193 append_suffix_to_symbol (alias, suffixes[i]); | 1812 add_suffix_to_symbol (alias, suffixes[i]); |
1194 Lisp_Object aliasee_subsidiary = | 1813 Lisp_Object aliasee_subsidiary = |
1195 append_suffix_to_symbol (aliasee, suffixes[i]); | 1814 add_suffix_to_symbol (aliasee, suffixes[i]); |
1196 | 1815 |
1197 if (! NILP (Ffind_coding_system (aliasee_subsidiary))) | 1816 if (! NILP (Ffind_coding_system (aliasee_subsidiary))) |
1198 Fdefine_coding_system_alias (alias_subsidiary, aliasee_subsidiary); | 1817 Fdefine_coding_system_alias (alias_subsidiary, aliasee_subsidiary); |
1199 } | 1818 } |
1200 } | 1819 } |
1203 meaningful or nothing at all. */ | 1822 meaningful or nothing at all. */ |
1204 return Qnil; | 1823 return Qnil; |
1205 } | 1824 } |
1206 | 1825 |
1207 static Lisp_Object | 1826 static Lisp_Object |
1208 subsidiary_coding_system (Lisp_Object coding_system, eol_type_t type) | 1827 subsidiary_coding_system (Lisp_Object coding_system, enum eol_type type) |
1209 { | 1828 { |
1210 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); | 1829 Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); |
1211 Lisp_Object new_coding_system; | 1830 Lisp_Object new_coding_system; |
1212 | |
1213 if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT) | |
1214 return coding_system; | |
1215 | 1831 |
1216 switch (type) | 1832 switch (type) |
1217 { | 1833 { |
1218 case EOL_AUTODETECT: return coding_system; | 1834 case EOL_AUTODETECT: return coding_system; |
1219 case EOL_LF: new_coding_system = CODING_SYSTEM_EOL_LF (cs); break; | 1835 case EOL_LF: new_coding_system = CODING_SYSTEM_EOL_LF (cs); break; |
1225 return NILP (new_coding_system) ? coding_system : new_coding_system; | 1841 return NILP (new_coding_system) ? coding_system : new_coding_system; |
1226 } | 1842 } |
1227 | 1843 |
1228 DEFUN ("subsidiary-coding-system", Fsubsidiary_coding_system, 2, 2, 0, /* | 1844 DEFUN ("subsidiary-coding-system", Fsubsidiary_coding_system, 2, 2, 0, /* |
1229 Return the subsidiary coding system of CODING-SYSTEM with eol type EOL-TYPE. | 1845 Return the subsidiary coding system of CODING-SYSTEM with eol type EOL-TYPE. |
1846 The logically opposite operation is `coding-system-base'. | |
1230 */ | 1847 */ |
1231 (coding_system, eol_type)) | 1848 (coding_system, eol_type)) |
1232 { | 1849 { |
1233 coding_system = Fget_coding_system (coding_system); | 1850 coding_system = get_coding_system_for_text_file (coding_system, 0); |
1234 | 1851 |
1235 return subsidiary_coding_system (coding_system, | 1852 return subsidiary_coding_system (coding_system, |
1236 symbol_to_eol_type (eol_type)); | 1853 symbol_to_eol_type (eol_type)); |
1854 } | |
1855 | |
1856 DEFUN ("coding-system-base", Fcoding_system_base, | |
1857 1, 1, 0, /* | |
1858 Return the base coding system of CODING-SYSTEM. | |
1859 If CODING-SYSTEM is a subsidiary, this returns its parent; otherwise, it | |
1860 returns CODING-SYSTEM. | |
1861 The logically opposite operation is `subsidiary-coding-system'. | |
1862 */ | |
1863 (coding_system)) | |
1864 { | |
1865 Lisp_Object base; | |
1866 | |
1867 coding_system = Fget_coding_system (coding_system); | |
1868 if (EQ (XCODING_SYSTEM_NAME (coding_system), Qbinary)) | |
1869 return Fget_coding_system (Qraw_text); /* hack! */ | |
1870 base = XCODING_SYSTEM_SUBSIDIARY_PARENT (coding_system); | |
1871 if (!NILP (base)) | |
1872 return base; | |
1873 return coding_system; | |
1874 } | |
1875 | |
1876 DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io, | |
1877 1, 1, 0, /* | |
1878 Return the coding system actually used for I/O. | |
1879 In some cases (e.g. when a particular EOL type is specified) this won't be | |
1880 the coding system itself. This can be useful when trying to track down | |
1881 more closely how exactly data is decoded. | |
1882 */ | |
1883 (coding_system)) | |
1884 { | |
1885 Lisp_Object canon; | |
1886 | |
1887 coding_system = Fget_coding_system (coding_system); | |
1888 canon = XCODING_SYSTEM_CANONICAL (coding_system); | |
1889 if (!NILP (canon)) | |
1890 return canon; | |
1891 return coding_system; | |
1237 } | 1892 } |
1238 | 1893 |
1239 | 1894 |
1240 /************************************************************************/ | 1895 /************************************************************************/ |
1241 /* Coding system accessors */ | 1896 /* Coding system accessors */ |
1242 /************************************************************************/ | 1897 /************************************************************************/ |
1243 | 1898 |
1244 DEFUN ("coding-system-doc-string", Fcoding_system_doc_string, 1, 1, 0, /* | 1899 DEFUN ("coding-system-description", Fcoding_system_description, 1, 1, 0, /* |
1245 Return the doc string for CODING-SYSTEM. | 1900 Return the description for CODING-SYSTEM. |
1901 The `description' of a coding system is a short English phrase giving the | |
1902 name rendered according to English punctuation rules, plus possibly some | |
1903 explanatory text (typically in the form of a parenthetical phrase). The | |
1904 description is intended to be short enough that it can appear as a menu item, | |
1905 and clear enough to be recognizable even to someone who is assumed to have | |
1906 some basic familiarity with different encodings but may not know all the | |
1907 technical names; thus, for `cn-gb-2312' is described as "Chinese EUC" and | |
1908 `hz-gb-2312' is described as "Hz/ZW (Chinese)", where the actual name of | |
1909 the encoding is given, followed by a note that this is a Chinese encoding, | |
1910 because the great majority of people encountering this would have no idea | |
1911 what it is, and giving the language indicates whether the encoding should | |
1912 just be ignored or (conceivably) investigated more thoroughly. | |
1246 */ | 1913 */ |
1247 (coding_system)) | 1914 (coding_system)) |
1248 { | 1915 { |
1249 coding_system = Fget_coding_system (coding_system); | 1916 coding_system = Fget_coding_system (coding_system); |
1250 return XCODING_SYSTEM_DOC_STRING (coding_system); | 1917 return XCODING_SYSTEM_DESCRIPTION (coding_system); |
1251 } | 1918 } |
1252 | 1919 |
1253 DEFUN ("coding-system-type", Fcoding_system_type, 1, 1, 0, /* | 1920 DEFUN ("coding-system-type", Fcoding_system_type, 1, 1, 0, /* |
1254 Return the type of CODING-SYSTEM. | 1921 Return the type of CODING-SYSTEM. |
1255 */ | 1922 */ |
1256 (coding_system)) | 1923 (coding_system)) |
1257 { | 1924 { |
1258 switch (XCODING_SYSTEM_TYPE (Fget_coding_system (coding_system))) | 1925 coding_system = Fget_coding_system (coding_system); |
1259 { | 1926 return XCODING_SYSTEM_TYPE (coding_system); |
1260 default: abort (); | 1927 } |
1261 case CODESYS_AUTODETECT: return Qundecided; | |
1262 #ifdef MULE | |
1263 case CODESYS_SHIFT_JIS: return Qshift_jis; | |
1264 case CODESYS_ISO2022: return Qiso2022; | |
1265 case CODESYS_BIG5: return Qbig5; | |
1266 case CODESYS_UCS4: return Qucs4; | |
1267 case CODESYS_UTF8: return Qutf8; | |
1268 case CODESYS_CCL: return Qccl; | |
1269 #endif | |
1270 case CODESYS_NO_CONVERSION: return Qno_conversion; | |
1271 #ifdef DEBUG_XEMACS | |
1272 case CODESYS_INTERNAL: return Qinternal; | |
1273 #endif | |
1274 } | |
1275 } | |
1276 | |
1277 #ifdef MULE | |
1278 static | |
1279 Lisp_Object coding_system_charset (Lisp_Object coding_system, int gnum) | |
1280 { | |
1281 Lisp_Object cs | |
1282 = XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, gnum); | |
1283 | |
1284 return CHARSETP (cs) ? XCHARSET_NAME (cs) : Qnil; | |
1285 } | |
1286 #endif /* MULE */ | |
1287 | 1928 |
1288 DEFUN ("coding-system-property", Fcoding_system_property, 2, 2, 0, /* | 1929 DEFUN ("coding-system-property", Fcoding_system_property, 2, 2, 0, /* |
1289 Return the PROP property of CODING-SYSTEM. | 1930 Return the PROP property of CODING-SYSTEM. |
1290 */ | 1931 */ |
1291 (coding_system, prop)) | 1932 (coding_system, prop)) |
1292 { | 1933 { |
1293 int i, ok = 0; | |
1294 enum coding_system_type type; | |
1295 | |
1296 coding_system = Fget_coding_system (coding_system); | 1934 coding_system = Fget_coding_system (coding_system); |
1297 CHECK_SYMBOL (prop); | 1935 CHECK_SYMBOL (prop); |
1298 type = XCODING_SYSTEM_TYPE (coding_system); | |
1299 | |
1300 for (i = 0; !ok && i < Dynarr_length (the_codesys_prop_dynarr); i++) | |
1301 if (EQ (Dynarr_at (the_codesys_prop_dynarr, i).sym, prop)) | |
1302 { | |
1303 ok = 1; | |
1304 switch (Dynarr_at (the_codesys_prop_dynarr, i).prop_type) | |
1305 { | |
1306 case CODESYS_PROP_ALL_OK: | |
1307 break; | |
1308 #ifdef MULE | |
1309 case CODESYS_PROP_ISO2022: | |
1310 if (type != CODESYS_ISO2022) | |
1311 invalid_argument | |
1312 ("Property only valid in ISO2022 coding systems", | |
1313 prop); | |
1314 break; | |
1315 | |
1316 case CODESYS_PROP_CCL: | |
1317 if (type != CODESYS_CCL) | |
1318 invalid_argument | |
1319 ("Property only valid in CCL coding systems", | |
1320 prop); | |
1321 break; | |
1322 #endif /* MULE */ | |
1323 default: | |
1324 abort (); | |
1325 } | |
1326 } | |
1327 | |
1328 if (!ok) | |
1329 invalid_constant ("Unrecognized property", prop); | |
1330 | 1936 |
1331 if (EQ (prop, Qname)) | 1937 if (EQ (prop, Qname)) |
1332 return XCODING_SYSTEM_NAME (coding_system); | 1938 return XCODING_SYSTEM_NAME (coding_system); |
1333 else if (EQ (prop, Qtype)) | 1939 else if (EQ (prop, Qtype)) |
1334 return Fcoding_system_type (coding_system); | 1940 return Fcoding_system_type (coding_system); |
1335 else if (EQ (prop, Qdoc_string)) | 1941 else if (EQ (prop, Qdescription)) |
1336 return XCODING_SYSTEM_DOC_STRING (coding_system); | 1942 return XCODING_SYSTEM_DESCRIPTION (coding_system); |
1337 else if (EQ (prop, Qmnemonic)) | 1943 else if (EQ (prop, Qmnemonic)) |
1338 return XCODING_SYSTEM_MNEMONIC (coding_system); | 1944 return XCODING_SYSTEM_MNEMONIC (coding_system); |
1945 else if (EQ (prop, Qdocumentation)) | |
1946 return XCODING_SYSTEM_DOCUMENTATION (coding_system); | |
1339 else if (EQ (prop, Qeol_type)) | 1947 else if (EQ (prop, Qeol_type)) |
1340 return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE (coding_system)); | 1948 return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE |
1949 (coding_system)); | |
1341 else if (EQ (prop, Qeol_lf)) | 1950 else if (EQ (prop, Qeol_lf)) |
1342 return XCODING_SYSTEM_EOL_LF (coding_system); | 1951 return XCODING_SYSTEM_EOL_LF (coding_system); |
1343 else if (EQ (prop, Qeol_crlf)) | 1952 else if (EQ (prop, Qeol_crlf)) |
1344 return XCODING_SYSTEM_EOL_CRLF (coding_system); | 1953 return XCODING_SYSTEM_EOL_CRLF (coding_system); |
1345 else if (EQ (prop, Qeol_cr)) | 1954 else if (EQ (prop, Qeol_cr)) |
1346 return XCODING_SYSTEM_EOL_CR (coding_system); | 1955 return XCODING_SYSTEM_EOL_CR (coding_system); |
1347 else if (EQ (prop, Qpost_read_conversion)) | 1956 else if (EQ (prop, Qpost_read_conversion)) |
1348 return XCODING_SYSTEM_POST_READ_CONVERSION (coding_system); | 1957 return XCODING_SYSTEM_POST_READ_CONVERSION (coding_system); |
1349 else if (EQ (prop, Qpre_write_conversion)) | 1958 else if (EQ (prop, Qpre_write_conversion)) |
1350 return XCODING_SYSTEM_PRE_WRITE_CONVERSION (coding_system); | 1959 return XCODING_SYSTEM_PRE_WRITE_CONVERSION (coding_system); |
1351 #ifdef MULE | |
1352 else if (type == CODESYS_ISO2022) | |
1353 { | |
1354 if (EQ (prop, Qcharset_g0)) | |
1355 return coding_system_charset (coding_system, 0); | |
1356 else if (EQ (prop, Qcharset_g1)) | |
1357 return coding_system_charset (coding_system, 1); | |
1358 else if (EQ (prop, Qcharset_g2)) | |
1359 return coding_system_charset (coding_system, 2); | |
1360 else if (EQ (prop, Qcharset_g3)) | |
1361 return coding_system_charset (coding_system, 3); | |
1362 | |
1363 #define FORCE_CHARSET(charset_num) \ | |
1364 (XCODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT \ | |
1365 (coding_system, charset_num) ? Qt : Qnil) | |
1366 | |
1367 else if (EQ (prop, Qforce_g0_on_output)) return FORCE_CHARSET (0); | |
1368 else if (EQ (prop, Qforce_g1_on_output)) return FORCE_CHARSET (1); | |
1369 else if (EQ (prop, Qforce_g2_on_output)) return FORCE_CHARSET (2); | |
1370 else if (EQ (prop, Qforce_g3_on_output)) return FORCE_CHARSET (3); | |
1371 | |
1372 #define LISP_BOOLEAN(prop) \ | |
1373 (XCODING_SYSTEM_ISO2022_##prop (coding_system) ? Qt : Qnil) | |
1374 | |
1375 else if (EQ (prop, Qshort)) return LISP_BOOLEAN (SHORT); | |
1376 else if (EQ (prop, Qno_ascii_eol)) return LISP_BOOLEAN (NO_ASCII_EOL); | |
1377 else if (EQ (prop, Qno_ascii_cntl)) return LISP_BOOLEAN (NO_ASCII_CNTL); | |
1378 else if (EQ (prop, Qseven)) return LISP_BOOLEAN (SEVEN); | |
1379 else if (EQ (prop, Qlock_shift)) return LISP_BOOLEAN (LOCK_SHIFT); | |
1380 else if (EQ (prop, Qno_iso6429)) return LISP_BOOLEAN (NO_ISO6429); | |
1381 else if (EQ (prop, Qescape_quoted)) return LISP_BOOLEAN (ESCAPE_QUOTED); | |
1382 | |
1383 else if (EQ (prop, Qinput_charset_conversion)) | |
1384 return | |
1385 unparse_charset_conversion_specs | |
1386 (XCODING_SYSTEM (coding_system)->iso2022.input_conv); | |
1387 else if (EQ (prop, Qoutput_charset_conversion)) | |
1388 return | |
1389 unparse_charset_conversion_specs | |
1390 (XCODING_SYSTEM (coding_system)->iso2022.output_conv); | |
1391 else | |
1392 abort (); | |
1393 } | |
1394 else if (type == CODESYS_CCL) | |
1395 { | |
1396 if (EQ (prop, Qdecode)) | |
1397 return XCODING_SYSTEM_CCL_DECODE (coding_system); | |
1398 else if (EQ (prop, Qencode)) | |
1399 return XCODING_SYSTEM_CCL_ENCODE (coding_system); | |
1400 else | |
1401 abort (); | |
1402 } | |
1403 #endif /* MULE */ | |
1404 else | 1960 else |
1405 abort (); | 1961 { |
1406 | 1962 Lisp_Object value = CODESYSMETH_OR_GIVEN (XCODING_SYSTEM (coding_system), |
1407 return Qnil; /* not reached */ | 1963 getprop, |
1964 (coding_system, prop), | |
1965 Qunbound); | |
1966 if (UNBOUNDP (value)) | |
1967 invalid_constant ("Unrecognized property", prop); | |
1968 return value; | |
1969 } | |
1408 } | 1970 } |
1409 | 1971 |
1410 | 1972 |
1411 /************************************************************************/ | 1973 /************************************************************************/ |
1412 /* Coding category functions */ | 1974 /* Coding stream functions */ |
1413 /************************************************************************/ | 1975 /************************************************************************/ |
1414 | 1976 |
1977 /* A coding stream is a stream used for encoding or decoding text. The | |
1978 coding-stream object keeps track of the actual coding system, the stream | |
1979 that is at the other end, and data that needs to be persistent across | |
1980 the lifetime of the stream. */ | |
1981 | |
1982 DEFINE_LSTREAM_IMPLEMENTATION ("coding", coding); | |
1983 | |
1984 /* Encoding and decoding are parallel operations, so we create just one | |
1985 stream for both. "Decoding" may involve the extra step of autodetection | |
1986 of the data format, but that's only because of the conventional | |
1987 definition of decoding as converting from external- to | |
1988 internal-formatted data. | |
1989 | |
1990 #### We really need to abstract out the concept of "data formats" and | |
1991 define "converters" that convert from and to specified formats, | |
1992 eliminating the idea of decoding and encoding. When specifying a | |
1993 conversion process, we need to give the data formats themselves, not the | |
1994 conversion processes -- e.g. a coding system called "Unicode->multibyte" | |
1995 converts in both directions, and we could auto-detect the format of data | |
1996 at either end. */ | |
1997 | |
1998 static Bytecount | |
1999 coding_reader (Lstream *stream, unsigned char *data, Bytecount size) | |
2000 { | |
2001 unsigned char *orig_data = data; | |
2002 Bytecount read_size; | |
2003 int error_occurred = 0; | |
2004 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2005 | |
2006 /* We need to interface to coding_{de,en}code_1(), which expects to take | |
2007 some amount of data and store the result into a Dynarr. We have | |
2008 coding_{de,en}code_1() store into c->runoff, and take data from there | |
2009 as necessary. */ | |
2010 | |
2011 /* We loop until we have enough data, reading chunks from the other | |
2012 end and converting it. */ | |
2013 while (1) | |
2014 { | |
2015 /* Take data from convert_to if we can. Make sure to take at | |
2016 most SIZE bytes, and delete the data from convert_to. */ | |
2017 if (Dynarr_length (str->convert_to) > 0) | |
2018 { | |
2019 Bytecount chunk = | |
2020 min (size, (Bytecount) Dynarr_length (str->convert_to)); | |
2021 memcpy (data, Dynarr_atp (str->convert_to, 0), chunk); | |
2022 Dynarr_delete_many (str->convert_to, 0, chunk); | |
2023 data += chunk; | |
2024 size -= chunk; | |
2025 } | |
2026 | |
2027 if (size == 0) | |
2028 break; /* No more room for data */ | |
2029 | |
2030 if (str->eof) | |
2031 break; | |
2032 | |
2033 { | |
2034 /* Exhausted convert_to, so get some more. Read into convert_from, | |
2035 after existing "rejected" data from the last conversion. */ | |
2036 Bytecount rejected = Dynarr_length (str->convert_from); | |
2037 /* #### 1024 is arbitrary; we really need to separate 0 from EOF, | |
2038 and when we get 0, keep taking more data until we don't get 0 -- | |
2039 we don't know how much data the conversion routine might need | |
2040 before it can generate any data of its own */ | |
2041 Bytecount readmore = max (size, 1024); | |
2042 | |
2043 Dynarr_add_many (str->convert_from, 0, readmore); | |
2044 read_size = Lstream_read (str->other_end, | |
2045 Dynarr_atp (str->convert_from, rejected), | |
2046 readmore); | |
2047 /* Trim size down to how much we actually got */ | |
2048 Dynarr_set_size (str->convert_from, rejected + max (0, read_size)); | |
2049 } | |
2050 | |
2051 if (read_size < 0) /* LSTREAM_ERROR */ | |
2052 { | |
2053 error_occurred = 1; | |
2054 break; | |
2055 } | |
2056 if (read_size == 0) /* LSTREAM_EOF */ | |
2057 /* There might be some more end data produced in the translation, | |
2058 so we set a flag and call the conversion method once more to | |
2059 output any final stuff it may be holding, any "go back to a sane | |
2060 state" escape sequences, etc. The conversion method is free to | |
2061 look at this flag, and we use it above to stop looping. */ | |
2062 str->eof = 1; | |
2063 { | |
2064 Bytecount processed; | |
2065 Bytecount to_process = Dynarr_length (str->convert_from); | |
2066 | |
2067 /* Convert the data, and save any rejected data in convert_from */ | |
2068 processed = | |
2069 XCODESYSMETH (str->codesys, convert, | |
2070 (str, Dynarr_atp (str->convert_from, 0), | |
2071 str->convert_to, to_process)); | |
2072 if (processed < 0) | |
2073 { | |
2074 error_occurred = 1; | |
2075 break; | |
2076 } | |
2077 assert (processed <= to_process); | |
2078 if (processed < to_process) | |
2079 memmove (Dynarr_atp (str->convert_from, 0), | |
2080 Dynarr_atp (str->convert_from, processed), | |
2081 to_process - processed); | |
2082 Dynarr_set_size (str->convert_from, to_process - processed); | |
2083 } | |
2084 } | |
2085 | |
2086 if (data - orig_data == 0) | |
2087 return error_occurred ? -1 : 0; | |
2088 else | |
2089 return data - orig_data; | |
2090 } | |
2091 | |
2092 static Bytecount | |
2093 coding_writer (Lstream *stream, const unsigned char *data, Bytecount size) | |
2094 { | |
2095 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2096 | |
2097 /* Convert all our data into convert_to, and then attempt to write | |
2098 it all out to the other end. */ | |
2099 Dynarr_reset (str->convert_to); | |
2100 size = XCODESYSMETH (str->codesys, convert, | |
2101 (str, data, str->convert_to, size)); | |
2102 if (Lstream_write (str->other_end, Dynarr_atp (str->convert_to, 0), | |
2103 Dynarr_length (str->convert_to)) < 0) | |
2104 return -1; | |
2105 else | |
2106 /* The return value indicates how much of the incoming data was | |
2107 processed, not how many bytes were written. */ | |
2108 return size; | |
2109 } | |
2110 | |
1415 static int | 2111 static int |
1416 decode_coding_category (Lisp_Object symbol) | 2112 encode_decode_source_sink_type_is_char (Lisp_Object cs, |
1417 { | 2113 enum source_or_sink sex, |
2114 enum encode_decode direction) | |
2115 { | |
2116 return (direction == CODING_DECODE ? | |
2117 decoding_source_sink_type_is_char (cs, sex) : | |
2118 encoding_source_sink_type_is_char (cs, sex)); | |
2119 } | |
2120 | |
2121 /* Ensure that the convert methods only get full characters sent to them to | |
2122 convert if the source of that conversion is characters; and that no such | |
2123 full-character checking happens when the source is bytes. Keep in mind | |
2124 that (1) the conversion_end_type return values take the perspective of | |
2125 encoding; (2) the source for decoding is the same as the sink for | |
2126 encoding; (3) when writing, the data is given to us, and we set our own | |
2127 stream to be character mode or not; (4) when reading, the data comes | |
2128 from the other_end stream, and we set that one to be character mode or | |
2129 not. This is consistent with the comment above the prototype for | |
2130 Lstream_set_character_mode(), which lays out rules for who is allowed to | |
2131 modify the character type mode on a stream. | |
2132 | |
2133 NOTE: We could potentially implement the full-character checking stuff | |
2134 ourselves, which might be a bit safer in case people mess up the | |
2135 character mode themselves. But people shouldn't be doing that -- don't | |
2136 hide bugs -- and there's no sense duplicating code. */ | |
2137 | |
2138 static void | |
2139 set_coding_character_mode (Lstream *stream) | |
2140 { | |
2141 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2142 Lstream *stream_to_set = | |
2143 stream->flags & LSTREAM_FL_WRITE ? stream : str->other_end; | |
2144 if (encode_decode_source_sink_type_is_char | |
2145 (str->codesys, CODING_SOURCE, str->direction)) | |
2146 Lstream_set_character_mode (stream_to_set); | |
2147 else | |
2148 Lstream_unset_character_mode (stream_to_set); | |
2149 } | |
2150 | |
2151 static Lisp_Object | |
2152 coding_marker (Lisp_Object stream) | |
2153 { | |
2154 struct coding_stream *str = CODING_STREAM_DATA (XLSTREAM (stream)); | |
2155 | |
2156 mark_object (str->orig_codesys); | |
2157 mark_object (str->codesys); | |
2158 MAYBE_XCODESYSMETH (str->codesys, mark_coding_stream, (str)); | |
2159 return wrap_lstream (str->other_end); | |
2160 } | |
2161 | |
2162 static int | |
2163 coding_rewinder (Lstream *stream) | |
2164 { | |
2165 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2166 MAYBE_XCODESYSMETH (str->codesys, rewind_coding_stream, (str)); | |
2167 | |
2168 str->ch = 0; | |
2169 Dynarr_reset (str->convert_to); | |
2170 Dynarr_reset (str->convert_from); | |
2171 return Lstream_rewind (str->other_end); | |
2172 } | |
2173 | |
2174 static int | |
2175 coding_seekable_p (Lstream *stream) | |
2176 { | |
2177 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2178 return Lstream_seekable_p (str->other_end); | |
2179 } | |
2180 | |
2181 static int | |
2182 coding_flusher (Lstream *stream) | |
2183 { | |
2184 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2185 return Lstream_flush (str->other_end); | |
2186 } | |
2187 | |
2188 static int | |
2189 coding_closer (Lstream *stream) | |
2190 { | |
2191 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2192 if (stream->flags & LSTREAM_FL_WRITE) | |
2193 { | |
2194 str->eof = 1; | |
2195 coding_writer (stream, 0, 0); | |
2196 str->eof = 0; | |
2197 } | |
2198 /* It's safe to free the runoff dynarrs now because they are used only | |
2199 during conversion. We need to keep the type-specific data around, | |
2200 though, because of canonicalize_after_coding. */ | |
2201 if (str->convert_to) | |
2202 { | |
2203 Dynarr_free (str->convert_to); | |
2204 str->convert_to = 0; | |
2205 } | |
2206 if (str->convert_from) | |
2207 { | |
2208 Dynarr_free (str->convert_from); | |
2209 str->convert_from = 0; | |
2210 } | |
2211 | |
2212 return Lstream_close (str->other_end); | |
2213 } | |
2214 | |
2215 static void | |
2216 coding_finalizer (Lstream *stream) | |
2217 { | |
2218 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2219 | |
2220 assert (!str->finalized); | |
2221 MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str)); | |
2222 if (str->data) | |
2223 { | |
2224 xfree (str->data); | |
2225 str->data = 0; | |
2226 } | |
2227 str->finalized = 1; | |
2228 } | |
2229 | |
2230 static Lisp_Object | |
2231 coding_stream_canonicalize_after_coding (Lstream *stream) | |
2232 { | |
2233 struct coding_stream *str = CODING_STREAM_DATA (stream); | |
2234 | |
2235 return XCODESYSMETH_OR_GIVEN (str->codesys, canonicalize_after_coding, | |
2236 (str), str->codesys); | |
2237 } | |
2238 | |
2239 Lisp_Object | |
2240 coding_stream_detected_coding_system (Lstream *stream) | |
2241 { | |
2242 Lisp_Object codesys = | |
2243 coding_stream_canonicalize_after_coding (stream); | |
2244 if (NILP (codesys)) | |
2245 return Fget_coding_system (Qidentity); | |
2246 return codesys; | |
2247 } | |
2248 | |
2249 Lisp_Object | |
2250 coding_stream_coding_system (Lstream *stream) | |
2251 { | |
2252 return CODING_STREAM_DATA (stream)->codesys; | |
2253 } | |
2254 | |
2255 /* Change the coding system associated with a stream. */ | |
2256 | |
2257 void | |
2258 set_coding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) | |
2259 { | |
2260 struct coding_stream *str = CODING_STREAM_DATA (lstr); | |
2261 if (EQ (str->orig_codesys, codesys)) | |
2262 return; | |
2263 /* We do the equivalent of closing the stream, destroying it, and | |
2264 reinitializing it. This includes flushing out the data and signalling | |
2265 EOF, if we're a writing stream; we also replace the type-specific data | |
2266 with the data appropriate for the new coding system. */ | |
2267 if (!NILP (str->codesys)) | |
2268 { | |
2269 if (lstr->flags & LSTREAM_FL_WRITE) | |
2270 { | |
2271 Lstream_flush (lstr); | |
2272 str->eof = 1; | |
2273 coding_writer (lstr, 0, 0); | |
2274 str->eof = 0; | |
2275 } | |
2276 MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str)); | |
2277 } | |
2278 str->orig_codesys = codesys; | |
2279 str->codesys = coding_system_real_canonical (codesys); | |
2280 | |
2281 if (str->data) | |
2282 { | |
2283 xfree (str->data); | |
2284 str->data = 0; | |
2285 } | |
2286 if (XCODING_SYSTEM_METHODS (str->codesys)->coding_data_size) | |
2287 str->data = | |
2288 xmalloc_and_zero (XCODING_SYSTEM_METHODS (str->codesys)-> | |
2289 coding_data_size); | |
2290 MAYBE_XCODESYSMETH (str->codesys, init_coding_stream, (str)); | |
2291 /* The new coding system may have different ideas regarding whether its | |
2292 ends are characters or bytes. */ | |
2293 set_coding_character_mode (lstr); | |
2294 } | |
2295 | |
2296 /* WARNING WARNING WARNING WARNING!!!!! If you open up a coding | |
2297 stream for writing, no automatic code detection will be performed. | |
2298 The reason for this is that automatic code detection requires a | |
2299 seekable input. Things will also fail if you open a coding | |
2300 stream for reading using a non-fully-specified coding system and | |
2301 a non-seekable input stream. */ | |
2302 | |
2303 static Lisp_Object | |
2304 make_coding_stream_1 (Lstream *stream, Lisp_Object codesys, | |
2305 const char *mode, enum encode_decode direction) | |
2306 { | |
2307 Lstream *lstr = Lstream_new (lstream_coding, mode); | |
2308 struct coding_stream *str = CODING_STREAM_DATA (lstr); | |
2309 Lisp_Object obj; | |
2310 | |
2311 codesys = Fget_coding_system (codesys); | |
2312 xzero (*str); | |
2313 str->codesys = Qnil; | |
2314 str->orig_codesys = Qnil; | |
2315 str->us = lstr; | |
2316 str->other_end = stream; | |
2317 str->convert_to = Dynarr_new (unsigned_char); | |
2318 str->convert_from = Dynarr_new (unsigned_char); | |
2319 str->direction = direction; | |
2320 set_coding_stream_coding_system (lstr, codesys); | |
2321 XSETLSTREAM (obj, lstr); | |
2322 return obj; | |
2323 } | |
2324 | |
2325 Lisp_Object | |
2326 make_coding_input_stream (Lstream *stream, Lisp_Object codesys, | |
2327 enum encode_decode direction) | |
2328 { | |
2329 return make_coding_stream_1 (stream, codesys, "r", direction); | |
2330 } | |
2331 | |
2332 Lisp_Object | |
2333 make_coding_output_stream (Lstream *stream, Lisp_Object codesys, | |
2334 enum encode_decode direction) | |
2335 { | |
2336 return make_coding_stream_1 (stream, codesys, "w", direction); | |
2337 } | |
2338 | |
2339 static Lisp_Object | |
2340 encode_decode_coding_region (Lisp_Object start, Lisp_Object end, | |
2341 Lisp_Object coding_system, Lisp_Object buffer, | |
2342 enum encode_decode direction) | |
2343 { | |
2344 Charbpos b, e; | |
2345 struct buffer *buf = decode_buffer (buffer, 0); | |
2346 Lisp_Object instream = Qnil, to_outstream = Qnil, outstream = Qnil; | |
2347 Lisp_Object from_outstream = Qnil, auto_outstream = Qnil; | |
2348 Lisp_Object lb_outstream = Qnil; | |
2349 Lisp_Object next; | |
2350 Lstream *istr, *ostr; | |
2351 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4, gcpro5; | |
2352 struct gcpro ngcpro1; | |
2353 int source_char, sink_char; | |
2354 | |
2355 get_buffer_range_char (buf, start, end, &b, &e, 0); | |
2356 barf_if_buffer_read_only (buf, b, e); | |
2357 | |
2358 GCPRO5 (instream, to_outstream, outstream, from_outstream, lb_outstream); | |
2359 NGCPRO1 (auto_outstream); | |
2360 | |
2361 coding_system = Fget_coding_system (coding_system); | |
2362 source_char = encode_decode_source_sink_type_is_char (coding_system, | |
2363 CODING_SOURCE, | |
2364 direction); | |
2365 sink_char = encode_decode_source_sink_type_is_char (coding_system, | |
2366 CODING_SINK, | |
2367 direction); | |
2368 | |
2369 /* Order is IN <---> [TO] -> OUT -> [FROM] -> [AUTODETECT-EOL] -> LB */ | |
2370 instream = make_lisp_buffer_input_stream (buf, b, e, 0); | |
2371 next = lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); | |
2372 | |
2373 if (direction == CODING_DECODE && | |
2374 XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT) | |
2375 next = auto_outstream = | |
2376 make_coding_output_stream | |
2377 (XLSTREAM (next), Fget_coding_system (Qconvert_eol_autodetect), CODING_DECODE); | |
2378 | |
2379 if (!sink_char) | |
2380 next = from_outstream = | |
2381 make_coding_output_stream (XLSTREAM (next), Qbinary, CODING_DECODE); | |
2382 outstream = make_coding_output_stream (XLSTREAM (next), coding_system, | |
2383 direction); | |
2384 if (!source_char) | |
2385 { | |
2386 to_outstream = | |
2387 make_coding_output_stream (XLSTREAM (outstream), | |
2388 Qbinary, CODING_ENCODE); | |
2389 ostr = XLSTREAM (to_outstream); | |
2390 } | |
2391 else | |
2392 ostr = XLSTREAM (outstream); | |
2393 istr = XLSTREAM (instream); | |
2394 | |
2395 /* The chain of streams looks like this: | |
2396 | |
2397 [BUFFER] <----- send through | |
2398 ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is | |
2399 in bytes] | |
2400 ------> [ENCODE/DECODE AS SPECIFIED] | |
2401 ------> [BYTE->CHAR i.e. DECODE AS BINARY | |
2402 if sink is in bytes] | |
2403 ------> [AUTODETECT EOL if | |
2404 we're decoding and | |
2405 coding system calls | |
2406 for this] | |
2407 ------> [BUFFER] | |
2408 */ | |
2409 while (1) | |
2410 { | |
2411 char tempbuf[1024]; /* some random amount */ | |
2412 Charbpos newpos, even_newer_pos; | |
2413 Charbpos oldpos = lisp_buffer_stream_startpos (istr); | |
2414 Bytecount size_in_bytes = | |
2415 Lstream_read (istr, tempbuf, sizeof (tempbuf)); | |
2416 | |
2417 if (!size_in_bytes) | |
2418 break; | |
2419 newpos = lisp_buffer_stream_startpos (istr); | |
2420 Lstream_write (ostr, tempbuf, size_in_bytes); | |
2421 even_newer_pos = lisp_buffer_stream_startpos (istr); | |
2422 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), | |
2423 even_newer_pos, 0); | |
2424 } | |
2425 | |
2426 { | |
2427 Charcount retlen = | |
2428 lisp_buffer_stream_startpos (XLSTREAM (instream)) - b; | |
2429 Lstream_close (istr); | |
2430 Lstream_close (ostr); | |
2431 NUNGCPRO; | |
2432 UNGCPRO; | |
2433 Lstream_delete (istr); | |
2434 if (!NILP (from_outstream)) | |
2435 Lstream_delete (XLSTREAM (from_outstream)); | |
2436 Lstream_delete (XLSTREAM (outstream)); | |
2437 if (!NILP (to_outstream)) | |
2438 Lstream_delete (XLSTREAM (to_outstream)); | |
2439 if (!NILP (auto_outstream)) | |
2440 Lstream_delete (XLSTREAM (auto_outstream)); | |
2441 Lstream_delete (XLSTREAM (lb_outstream)); | |
2442 return make_int (retlen); | |
2443 } | |
2444 } | |
2445 | |
2446 DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /* | |
2447 Decode the text between START and END which is encoded in CODING-SYSTEM. | |
2448 This is useful if you've read in encoded text from a file without decoding | |
2449 it (e.g. you read in a JIS-formatted file but used the `binary' or | |
2450 `no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B"). | |
2451 Return length of decoded text. | |
2452 BUFFER defaults to the current buffer if unspecified. | |
2453 */ | |
2454 (start, end, coding_system, buffer)) | |
2455 { | |
2456 return encode_decode_coding_region (start, end, coding_system, buffer, | |
2457 CODING_DECODE); | |
2458 } | |
2459 | |
2460 DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /* | |
2461 Encode the text between START and END using CODING-SYSTEM. | |
2462 This will, for example, convert Japanese characters into stuff such as | |
2463 "^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded | |
2464 text. BUFFER defaults to the current buffer if unspecified. | |
2465 */ | |
2466 (start, end, coding_system, buffer)) | |
2467 { | |
2468 return encode_decode_coding_region (start, end, coding_system, buffer, | |
2469 CODING_ENCODE); | |
2470 } | |
2471 | |
2472 | |
2473 /************************************************************************/ | |
2474 /* Chain methods */ | |
2475 /************************************************************************/ | |
2476 | |
2477 /* #### Need a way to create "opposite-direction" coding systems. */ | |
2478 | |
2479 /* Chain two or more coding systems together to make a combination coding | |
2480 system. */ | |
2481 DEFINE_CODING_SYSTEM_TYPE (chain); | |
2482 | |
2483 struct chain_coding_system | |
2484 { | |
2485 /* List of coding systems, in decode order */ | |
2486 Lisp_Object *chain; | |
2487 /* Number of coding systems in list */ | |
2488 int count; | |
2489 /* Coding system to return as a result of canonicalize-after-coding */ | |
2490 Lisp_Object canonicalize_after_coding; | |
2491 }; | |
2492 | |
2493 struct chain_coding_stream | |
2494 { | |
2495 int initted; | |
2496 /* Lstreams for chain coding system */ | |
2497 Lisp_Object *lstreams; | |
2498 int lstream_count; | |
2499 }; | |
2500 | |
2501 static const struct lrecord_description lo_description_1[] = { | |
2502 { XD_LISP_OBJECT, 0 }, | |
2503 { XD_END } | |
2504 }; | |
2505 | |
2506 static const struct struct_description lo_description = { | |
2507 sizeof (Lisp_Object), | |
2508 lo_description_1 | |
2509 }; | |
2510 | |
2511 static const struct lrecord_description chain_coding_system_description[] = { | |
2512 { XD_INT, | |
2513 coding_system_data_offset + offsetof (struct chain_coding_system, | |
2514 count) }, | |
2515 { XD_STRUCT_PTR, | |
2516 coding_system_data_offset + offsetof (struct chain_coding_system, | |
2517 chain), | |
2518 XD_INDIRECT (0, 0), &lo_description }, | |
2519 { XD_LISP_OBJECT, | |
2520 coding_system_data_offset + offsetof (struct chain_coding_system, | |
2521 canonicalize_after_coding) }, | |
2522 { XD_END } | |
2523 }; | |
2524 | |
2525 static Lisp_Object | |
2526 chain_canonicalize (Lisp_Object codesys) | |
2527 { | |
2528 /* We make use of the fact that this method is called at init time, after | |
2529 properties have been parsed. init_method is called too early. */ | |
2530 /* #### It's not clear we need this whole chain-canonicalize mechanism | |
2531 any more. */ | |
2532 Lisp_Object chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (codesys), | |
2533 XCODING_SYSTEM_CHAIN_CHAIN (codesys)); | |
2534 chain = Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (codesys), | |
2535 Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (codesys), | |
2536 chain)); | |
2537 Fputhash (chain, codesys, Vchain_canonicalize_hash_table); | |
2538 return codesys; | |
2539 } | |
2540 | |
2541 static Lisp_Object | |
2542 chain_canonicalize_after_coding (struct coding_stream *str) | |
2543 { | |
2544 Lisp_Object cac = | |
2545 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (str->codesys); | |
2546 if (!NILP (cac)) | |
2547 return cac; | |
2548 return str->codesys; | |
2549 #if 0 | |
2550 struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain); | |
2551 Lisp_Object us = str->codesys, codesys; | |
1418 int i; | 2552 int i; |
2553 Lisp_Object chain; | |
2554 Lisp_Object tail; | |
2555 int changed = 0; | |
2556 | |
2557 /* #### It's not clear we need this whole chain-canonicalize mechanism | |
2558 any more. */ | |
2559 if (str->direction == CODING_ENCODE || !data->initted) | |
2560 return us; | |
2561 | |
2562 chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (us), | |
2563 XCODING_SYSTEM_CHAIN_CHAIN (us)); | |
2564 | |
2565 tail = chain; | |
2566 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (us); i++) | |
2567 { | |
2568 codesys = (coding_stream_canonicalize_after_coding | |
2569 (XLSTREAM (data->lstreams[i]))); | |
2570 if (!EQ (codesys, XCAR (tail))) | |
2571 changed = 1; | |
2572 XCAR (tail) = codesys; | |
2573 tail = XCDR (tail); | |
2574 } | |
2575 | |
2576 if (!changed) | |
2577 return us; | |
2578 | |
2579 chain = delq_no_quit (Qnil, chain); | |
2580 | |
2581 if (NILP (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us)) && | |
2582 NILP (XCODING_SYSTEM_POST_READ_CONVERSION (us))) | |
2583 { | |
2584 if (NILP (chain)) | |
2585 return Qnil; | |
2586 if (NILP (XCDR (chain))) | |
2587 return XCAR (chain); | |
2588 } | |
2589 | |
2590 codesys = Fgethash (Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us), | |
2591 Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (us), | |
2592 chain)), Vchain_canonicalize_hash_table, | |
2593 Qnil); | |
2594 if (!NILP (codesys)) | |
2595 return codesys; | |
2596 return make_internal_coding_system | |
2597 (us, "internal-chain-canonicalizer-wrapper", | |
2598 Qchain, Qunbound, list2 (Qchain, chain)); | |
2599 #endif /* 0 */ | |
2600 } | |
2601 | |
2602 static void | |
2603 chain_init (Lisp_Object codesys) | |
2604 { | |
2605 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) = Qnil; | |
2606 } | |
2607 | |
2608 static void | |
2609 chain_mark (Lisp_Object codesys) | |
2610 { | |
2611 int i; | |
2612 | |
2613 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (codesys); i++) | |
2614 mark_object (XCODING_SYSTEM_CHAIN_CHAIN (codesys)[i]); | |
2615 mark_object (XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys)); | |
2616 } | |
2617 | |
2618 static void | |
2619 chain_mark_coding_stream_1 (struct chain_coding_stream *data) | |
2620 { | |
2621 int i; | |
2622 | |
2623 for (i = 0; i < data->lstream_count; i++) | |
2624 mark_object (data->lstreams[i]); | |
2625 } | |
2626 | |
2627 static void | |
2628 chain_mark_coding_stream (struct coding_stream *str) | |
2629 { | |
2630 chain_mark_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); | |
2631 } | |
2632 | |
2633 static void | |
2634 chain_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) | |
2635 { | |
2636 int i; | |
2637 | |
2638 write_c_string ("(", printcharfun); | |
2639 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (cs); i++) | |
2640 { | |
2641 write_c_string (i == 0 ? "" : "->", printcharfun); | |
2642 print_coding_system_in_print_method (XCODING_SYSTEM_CHAIN_CHAIN (cs)[i], | |
2643 printcharfun, escapeflag); | |
2644 } | |
2645 { | |
2646 Lisp_Object cac = XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (cs); | |
2647 if (!NILP (cac)) | |
2648 { | |
2649 if (i > 0) | |
2650 write_c_string (" ", printcharfun); | |
2651 write_c_string ("canonicalize-after-coding=", printcharfun); | |
2652 print_coding_system_in_print_method (cac, printcharfun, escapeflag); | |
2653 } | |
2654 } | |
2655 | |
2656 write_c_string (")", printcharfun); | |
2657 } | |
2658 | |
2659 static void | |
2660 chain_rewind_coding_stream_1 (struct chain_coding_stream *data) | |
2661 { | |
2662 /* Each will rewind the next; there is always at least one stream (the | |
2663 dynarr stream at the end) if we're initted */ | |
2664 if (data->initted) | |
2665 Lstream_rewind (XLSTREAM (data->lstreams[0])); | |
2666 } | |
2667 | |
2668 static void | |
2669 chain_rewind_coding_stream (struct coding_stream *str) | |
2670 { | |
2671 chain_rewind_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); | |
2672 } | |
2673 | |
2674 static void | |
2675 chain_init_coding_streams_1 (struct chain_coding_stream *data, | |
2676 unsigned_char_dynarr *dst, | |
2677 int ncodesys, Lisp_Object *codesys, | |
2678 enum encode_decode direction) | |
2679 { | |
2680 int i; | |
2681 Lisp_Object lstream_out; | |
2682 | |
2683 data->lstream_count = ncodesys + 1; | |
2684 data->lstreams = xnew_array (Lisp_Object, data->lstream_count); | |
2685 | |
2686 lstream_out = make_dynarr_output_stream (dst); | |
2687 Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED, 0); | |
2688 data->lstreams[data->lstream_count - 1] = lstream_out; | |
2689 | |
2690 for (i = ncodesys - 1; i >= 0; i--) | |
2691 { | |
2692 data->lstreams[i] = | |
2693 make_coding_output_stream | |
2694 (XLSTREAM (lstream_out), | |
2695 codesys[direction == CODING_ENCODE ? ncodesys - (i + 1) : i], | |
2696 direction); | |
2697 lstream_out = data->lstreams[i]; | |
2698 Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED, | |
2699 0); | |
2700 } | |
2701 data->initted = 1; | |
2702 } | |
2703 | |
2704 static Bytecount | |
2705 chain_convert (struct coding_stream *str, const UExtbyte *src, | |
2706 unsigned_char_dynarr *dst, Bytecount n) | |
2707 { | |
2708 struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain); | |
2709 | |
2710 if (str->eof) | |
2711 { | |
2712 /* Each will close the next; there is always at least one stream (the | |
2713 dynarr stream at the end) if we're initted. We need to close now | |
2714 because more data may be generated. */ | |
2715 if (data->initted) | |
2716 Lstream_close (XLSTREAM (data->lstreams[0])); | |
2717 return n; | |
2718 } | |
2719 | |
2720 if (!data->initted) | |
2721 chain_init_coding_streams_1 | |
2722 (data, dst, XCODING_SYSTEM_CHAIN_COUNT (str->codesys), | |
2723 XCODING_SYSTEM_CHAIN_CHAIN (str->codesys), str->direction); | |
2724 | |
2725 if (Lstream_write (XLSTREAM (data->lstreams[0]), src, n) < 0) | |
2726 return -1; | |
2727 return n; | |
2728 } | |
2729 | |
2730 static void | |
2731 chain_finalize_coding_stream_1 (struct chain_coding_stream *data) | |
2732 { | |
2733 if (data->lstreams) | |
2734 { | |
2735 /* Order of deletion is important here! Delete from the head of the | |
2736 chain and work your way towards the tail. In general, when you | |
2737 delete an object, there should be *NO* pointers to it anywhere. | |
2738 Deleting back-to-front would be a problem because there are | |
2739 pointers going forward. If there were pointers in both | |
2740 directions, you'd have to disconnect the pointers to a particular | |
2741 object before deleting it. */ | |
2742 if (!gc_in_progress) | |
2743 { | |
2744 int i; | |
2745 /* During GC, these objects are unmarked, and are about to be | |
2746 freed. We do NOT want them on the free list, and that will | |
2747 cause lots of nastiness including crashes. Just let them be | |
2748 freed normally. */ | |
2749 for (i = 0; i < data->lstream_count; i++) | |
2750 Lstream_delete (XLSTREAM ((data->lstreams)[i])); | |
2751 } | |
2752 xfree (data->lstreams); | |
2753 } | |
2754 } | |
2755 | |
2756 static void | |
2757 chain_finalize_coding_stream (struct coding_stream *str) | |
2758 { | |
2759 chain_finalize_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); | |
2760 } | |
2761 | |
2762 static void | |
2763 chain_finalize (Lisp_Object c) | |
2764 { | |
2765 if (XCODING_SYSTEM_CHAIN_CHAIN (c)) | |
2766 xfree (XCODING_SYSTEM_CHAIN_CHAIN (c)); | |
2767 } | |
2768 | |
2769 static int | |
2770 chain_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) | |
2771 { | |
2772 if (EQ (key, Qchain)) | |
2773 { | |
2774 Lisp_Object tail; | |
2775 Lisp_Object *cslist; | |
2776 int count = 0; | |
2777 int i; | |
2778 | |
2779 EXTERNAL_LIST_LOOP (tail, value) | |
2780 { | |
2781 Fget_coding_system (XCAR (tail)); | |
2782 count++; | |
2783 } | |
2784 | |
2785 cslist = xnew_array (Lisp_Object, count); | |
2786 XCODING_SYSTEM_CHAIN_CHAIN (codesys) = cslist; | |
2787 | |
2788 count = 0; | |
2789 EXTERNAL_LIST_LOOP (tail, value) | |
2790 { | |
2791 cslist[count] = Fget_coding_system (XCAR (tail)); | |
2792 count++; | |
2793 } | |
2794 | |
2795 XCODING_SYSTEM_CHAIN_COUNT (codesys) = count; | |
2796 | |
2797 for (i = 0; i < count - 1; i++) | |
2798 { | |
2799 if (decoding_source_sink_type_is_char (cslist[i], CODING_SINK) != | |
2800 decoding_source_sink_type_is_char (cslist[i + 1], CODING_SOURCE)) | |
2801 invalid_argument_2 ("Sink of first must match source of second", | |
2802 cslist[i], cslist[i + 1]); | |
2803 } | |
2804 } | |
2805 else if (EQ (key, Qcanonicalize_after_coding)) | |
2806 XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) = | |
2807 Fget_coding_system (value); | |
2808 else | |
2809 return 0; | |
2810 return 1; | |
2811 } | |
2812 | |
2813 static Lisp_Object | |
2814 chain_getprop (Lisp_Object coding_system, Lisp_Object prop) | |
2815 { | |
2816 if (EQ (prop, Qchain)) | |
2817 { | |
2818 Lisp_Object result = Qnil; | |
2819 int i; | |
2820 | |
2821 for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (coding_system); i++) | |
2822 result = Fcons (XCODING_SYSTEM_CHAIN_CHAIN (coding_system)[i], | |
2823 result); | |
2824 | |
2825 return Fnreverse (result); | |
2826 } | |
2827 else if (EQ (prop, Qcanonicalize_after_coding)) | |
2828 return XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (coding_system); | |
2829 else | |
2830 return Qunbound; | |
2831 } | |
2832 | |
2833 static enum source_sink_type | |
2834 chain_conversion_end_type (Lisp_Object codesys) | |
2835 { | |
2836 Lisp_Object *cslist = XCODING_SYSTEM_CHAIN_CHAIN (codesys); | |
2837 int n = XCODING_SYSTEM_CHAIN_COUNT (codesys); | |
2838 int charp_source, charp_sink; | |
2839 | |
2840 if (n == 0) | |
2841 return DECODES_BYTE_TO_BYTE; /* arbitrary */ | |
2842 charp_source = decoding_source_sink_type_is_char (cslist[0], CODING_SOURCE); | |
2843 charp_sink = decoding_source_sink_type_is_char (cslist[n - 1], CODING_SINK); | |
2844 | |
2845 switch (charp_source * 2 + charp_sink) | |
2846 { | |
2847 case 0: return DECODES_BYTE_TO_BYTE; | |
2848 case 1: return DECODES_BYTE_TO_CHARACTER; | |
2849 case 2: return DECODES_CHARACTER_TO_BYTE; | |
2850 case 3: return DECODES_CHARACTER_TO_CHARACTER; | |
2851 } | |
2852 | |
2853 abort (); | |
2854 return DECODES_BYTE_TO_BYTE; | |
2855 } | |
2856 | |
2857 | |
2858 /************************************************************************/ | |
2859 /* No-conversion methods */ | |
2860 /************************************************************************/ | |
2861 | |
2862 /* "No conversion"; used for binary files. We use quotes because there | |
2863 really is some conversion being applied (it does byte<->char | |
2864 conversion), but it appears to the user as if the text is read in | |
2865 without conversion. */ | |
2866 DEFINE_CODING_SYSTEM_TYPE (no_conversion); | |
2867 | |
2868 /* This is used when reading in "binary" files -- i.e. files that may | |
2869 contain all 256 possible byte values and that are not to be | |
2870 interpreted as being in any particular encoding. */ | |
2871 static Bytecount | |
2872 no_conversion_convert (struct coding_stream *str, | |
2873 const UExtbyte *src, | |
2874 unsigned_char_dynarr *dst, Bytecount n) | |
2875 { | |
2876 UExtbyte c; | |
2877 unsigned int ch = str->ch; | |
2878 Bytecount orign = n; | |
2879 | |
2880 if (str->direction == CODING_DECODE) | |
2881 { | |
2882 while (n--) | |
2883 { | |
2884 c = *src++; | |
2885 | |
2886 DECODE_ADD_BINARY_CHAR (c, dst); | |
2887 } | |
2888 | |
2889 if (str->eof) | |
2890 DECODE_OUTPUT_PARTIAL_CHAR (ch, dst); | |
2891 } | |
2892 else | |
2893 { | |
2894 | |
2895 while (n--) | |
2896 { | |
2897 c = *src++; | |
2898 if (BYTE_ASCII_P (c)) | |
2899 { | |
2900 assert (ch == 0); | |
2901 Dynarr_add (dst, c); | |
2902 } | |
2903 #ifdef MULE | |
2904 else if (INTBYTE_LEADING_BYTE_P (c)) | |
2905 { | |
2906 assert (ch == 0); | |
2907 if (c == LEADING_BYTE_LATIN_ISO8859_1 || | |
2908 c == LEADING_BYTE_CONTROL_1) | |
2909 ch = c; | |
2910 else | |
2911 Dynarr_add (dst, '~'); /* untranslatable character */ | |
2912 } | |
2913 else | |
2914 { | |
2915 if (ch == LEADING_BYTE_LATIN_ISO8859_1) | |
2916 Dynarr_add (dst, c); | |
2917 else if (ch == LEADING_BYTE_CONTROL_1) | |
2918 { | |
2919 assert (c < 0xC0); | |
2920 Dynarr_add (dst, c - 0x20); | |
2921 } | |
2922 /* else it should be the second or third byte of an | |
2923 untranslatable character, so ignore it */ | |
2924 ch = 0; | |
2925 } | |
2926 #endif /* MULE */ | |
2927 | |
2928 } | |
2929 } | |
2930 | |
2931 str->ch = ch; | |
2932 return orign; | |
2933 } | |
2934 | |
2935 DEFINE_DETECTOR (no_conversion); | |
2936 DEFINE_DETECTOR_CATEGORY (no_conversion, no_conversion); | |
2937 | |
2938 struct no_conversion_detector | |
2939 { | |
2940 int dummy; | |
2941 }; | |
2942 | |
2943 static void | |
2944 no_conversion_detect (struct detection_state *st, const UExtbyte *src, | |
2945 Bytecount n) | |
2946 { | |
2947 /* Hack until we get better handling of this stuff! */ | |
2948 DET_RESULT (st, no_conversion) = DET_SLIGHTLY_LIKELY; | |
2949 } | |
2950 | |
2951 | |
2952 /************************************************************************/ | |
2953 /* Convert-eol methods */ | |
2954 /************************************************************************/ | |
2955 | |
2956 /* This is used to handle end-of-line (EOL) differences. It is | |
2957 character-to-character, and works (when encoding) *BEFORE* sending | |
2958 data to the main encoding routine -- thus, that routine must handle | |
2959 different EOL types itself if it does line-oriented type processing. | |
2960 This is unavoidable because we don't know whether the output of the | |
2961 main encoding routine is ASCII compatible (Unicode is definitely not, | |
2962 for example). | |
2963 | |
2964 There is one parameter: `subtype', either `cr', `lf', `crlf', or `autodetect'. | |
2965 */ | |
2966 | |
2967 DEFINE_CODING_SYSTEM_TYPE (convert_eol); | |
2968 | |
2969 struct convert_eol_coding_system | |
2970 { | |
2971 enum eol_type subtype; | |
2972 }; | |
2973 | |
2974 #define CODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \ | |
2975 (CODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype) | |
2976 #define XCODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \ | |
2977 (XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype) | |
2978 | |
2979 struct convert_eol_coding_stream | |
2980 { | |
2981 enum eol_type actual; | |
2982 }; | |
2983 | |
2984 static const struct lrecord_description | |
2985 convert_eol_coding_system_description[] = { | |
2986 { XD_END } | |
2987 }; | |
2988 | |
2989 static void | |
2990 convert_eol_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) | |
2991 { | |
2992 struct convert_eol_coding_system *data = | |
2993 XCODING_SYSTEM_TYPE_DATA (cs, convert_eol); | |
2994 | |
2995 write_fmt_string (printcharfun, "(%s)", | |
2996 data->subtype == EOL_LF ? "lf" : | |
2997 data->subtype == EOL_CRLF ? "crlf" : | |
2998 data->subtype == EOL_CR ? "cr" : | |
2999 data->subtype == EOL_AUTODETECT ? "autodetect" : | |
3000 (abort(), "")); | |
3001 } | |
3002 | |
3003 static enum source_sink_type | |
3004 convert_eol_conversion_end_type (Lisp_Object codesys) | |
3005 { | |
3006 return DECODES_CHARACTER_TO_CHARACTER; | |
3007 } | |
3008 | |
3009 static int | |
3010 convert_eol_putprop (Lisp_Object codesys, | |
3011 Lisp_Object key, | |
3012 Lisp_Object value) | |
3013 { | |
3014 struct convert_eol_coding_system *data = | |
3015 XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol); | |
3016 | |
3017 if (EQ (key, Qsubtype)) | |
3018 { | |
3019 if (EQ (value, Qlf) /* || EQ (value, Qunix) */) | |
3020 data->subtype = EOL_LF; | |
3021 else if (EQ (value, Qcrlf) /* || EQ (value, Qdos) */) | |
3022 data->subtype = EOL_CRLF; | |
3023 else if (EQ (value, Qcr) /* || EQ (value, Qmac) */) | |
3024 data->subtype = EOL_CR; | |
3025 else if (EQ (value, Qautodetect) /* || EQ (value, Qmac) */) | |
3026 data->subtype = EOL_AUTODETECT; | |
3027 else invalid_constant ("Unrecognized eol type", value); | |
3028 } | |
3029 else | |
3030 return 0; | |
3031 return 1; | |
3032 } | |
3033 | |
3034 static Lisp_Object | |
3035 convert_eol_getprop (Lisp_Object coding_system, Lisp_Object prop) | |
3036 { | |
3037 struct convert_eol_coding_system *data = | |
3038 XCODING_SYSTEM_TYPE_DATA (coding_system, convert_eol); | |
3039 | |
3040 if (EQ (prop, Qsubtype)) | |
3041 { | |
3042 switch (data->subtype) | |
3043 { | |
3044 case EOL_LF: return Qlf; | |
3045 case EOL_CRLF: return Qcrlf; | |
3046 case EOL_CR: return Qcr; | |
3047 case EOL_AUTODETECT: return Qautodetect; | |
3048 default: abort (); | |
3049 } | |
3050 } | |
3051 | |
3052 return Qunbound; | |
3053 } | |
3054 | |
3055 static void | |
3056 convert_eol_init_coding_stream (struct coding_stream *str) | |
3057 { | |
3058 struct convert_eol_coding_stream *data = | |
3059 CODING_STREAM_TYPE_DATA (str, convert_eol); | |
3060 data->actual = XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys); | |
3061 } | |
3062 | |
3063 static Bytecount | |
3064 convert_eol_convert (struct coding_stream *str, const Intbyte *src, | |
3065 unsigned_char_dynarr *dst, Bytecount n) | |
3066 { | |
3067 if (str->direction == CODING_DECODE) | |
3068 { | |
3069 struct convert_eol_coding_stream *data = | |
3070 CODING_STREAM_TYPE_DATA (str, convert_eol); | |
3071 | |
3072 if (data->actual == EOL_AUTODETECT) | |
3073 { | |
3074 Bytecount n2 = n; | |
3075 const Intbyte *src2 = src; | |
3076 | |
3077 for (; n2; n2--) | |
3078 { | |
3079 Intbyte c = *src2++; | |
3080 if (c == '\n') | |
3081 { | |
3082 data->actual = EOL_LF; | |
3083 break; | |
3084 } | |
3085 else if (c == '\r') | |
3086 { | |
3087 if (n2 == 1) | |
3088 { | |
3089 /* If we're seeing a '\r' at the end of the data, then | |
3090 reject the '\r' right now so it doesn't become an | |
3091 issue in the code below -- unless we're at the end of | |
3092 the stream, in which case we can't do that (because | |
3093 then the '\r' will never get written out), and in any | |
3094 case we should be recognizing it at EOL_CR format. */ | |
3095 if (str->eof) | |
3096 data->actual = EOL_CR; | |
3097 else | |
3098 n--; | |
3099 break; | |
3100 } | |
3101 else if (*src2 == '\n') | |
3102 data->actual = EOL_CRLF; | |
3103 else | |
3104 data->actual = EOL_CR; | |
3105 break; | |
3106 } | |
3107 } | |
3108 } | |
3109 | |
3110 /* str->eof is set, the caller reached EOF on the other end and has | |
3111 no new data to give us. The only data we get is the data we | |
3112 rejected from last time. */ | |
3113 if (data->actual == EOL_LF || data->actual == EOL_AUTODETECT || | |
3114 (str->eof)) | |
3115 Dynarr_add_many (dst, src, n); | |
3116 else | |
3117 { | |
3118 const Intbyte *end = src + n; | |
3119 while (1) | |
3120 { | |
3121 /* Find the next section with no \r and add it. */ | |
3122 const Intbyte *runstart = src; | |
3123 src = (Intbyte *) memchr (src, '\r', end - src); | |
3124 if (!src) | |
3125 src = end; | |
3126 Dynarr_add_many (dst, runstart, src - runstart); | |
3127 /* Stop if at end ... */ | |
3128 if (src == end) | |
3129 break; | |
3130 /* ... else, translate as necessary. */ | |
3131 src++; | |
3132 if (data->actual == EOL_CR) | |
3133 Dynarr_add (dst, '\n'); | |
3134 /* We need to be careful here with CRLF. If we see a CR at the | |
3135 end of the data, we don't know if it's part of a CRLF, so we | |
3136 reject it. Otherwise: If it's part of a CRLF, eat it and | |
3137 loop; the following LF gets added next time around. If it's | |
3138 not part of a CRLF, add the CR and loop. The following | |
3139 character will be processed in the next loop iteration. This | |
3140 correctly handles a sequence like CR+CR+LF. */ | |
3141 else if (src == end) | |
3142 return n - 1; /* reject the CR at the end; we'll get it again | |
3143 next time the convert method is called */ | |
3144 else if (*src != '\n') | |
3145 Dynarr_add (dst, '\r'); | |
3146 } | |
3147 } | |
3148 | |
3149 return n; | |
3150 } | |
3151 else | |
3152 { | |
3153 enum eol_type subtype = | |
3154 XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys); | |
3155 const Intbyte *end = src + n; | |
3156 | |
3157 /* We try to be relatively efficient here. */ | |
3158 if (subtype == EOL_LF) | |
3159 Dynarr_add_many (dst, src, n); | |
3160 else | |
3161 { | |
3162 while (1) | |
3163 { | |
3164 /* Find the next section with no \n and add it. */ | |
3165 const Intbyte *runstart = src; | |
3166 src = (Intbyte *) memchr (src, '\n', end - src); | |
3167 if (!src) | |
3168 src = end; | |
3169 Dynarr_add_many (dst, runstart, src - runstart); | |
3170 /* Stop if at end ... */ | |
3171 if (src == end) | |
3172 break; | |
3173 /* ... else, skip over \n and add its translation. */ | |
3174 src++; | |
3175 Dynarr_add (dst, '\r'); | |
3176 if (subtype == EOL_CRLF) | |
3177 Dynarr_add (dst, '\n'); | |
3178 } | |
3179 } | |
3180 | |
3181 return n; | |
3182 } | |
3183 } | |
3184 | |
3185 static Lisp_Object | |
3186 convert_eol_canonicalize_after_coding (struct coding_stream *str) | |
3187 { | |
3188 struct convert_eol_coding_stream *data = | |
3189 CODING_STREAM_TYPE_DATA (str, convert_eol); | |
3190 | |
3191 if (str->direction == CODING_ENCODE) | |
3192 return str->codesys; | |
3193 | |
3194 switch (data->actual) | |
3195 { | |
3196 case EOL_LF: return Fget_coding_system (Qconvert_eol_lf); | |
3197 case EOL_CRLF: return Fget_coding_system (Qconvert_eol_crlf); | |
3198 case EOL_CR: return Fget_coding_system (Qconvert_eol_cr); | |
3199 case EOL_AUTODETECT: return str->codesys; | |
3200 default: abort (); return Qnil; | |
3201 } | |
3202 } | |
3203 | |
3204 | |
3205 /************************************************************************/ | |
3206 /* Undecided methods */ | |
3207 /************************************************************************/ | |
3208 | |
3209 /* Do autodetection. We can autodetect the EOL type only, the coding | |
3210 system only, or both. We only do autodetection when decoding; when | |
3211 encoding, we just pass the data through. | |
3212 | |
3213 When doing just EOL detection, a coding system can be specified; if so, | |
3214 we will decode this data through the coding system before doing EOL | |
3215 detection. The reason for specifying this is so that | |
3216 canonicalize-after-coding works: We will canonicalize the specified | |
3217 coding system into the appropriate EOL type. When doing both coding and | |
3218 EOL detection, we do similar canonicalization, and also catch situations | |
3219 where the EOL type is overspecified, i.e. the detected coding system | |
3220 specifies an EOL type, and either switch to the equivalent | |
3221 non-EOL-processing coding system (if possible), or terminate EOL | |
3222 detection and use the specified EOL type. This prevents data from being | |
3223 EOL-processed twice. | |
3224 */ | |
3225 | |
3226 DEFINE_CODING_SYSTEM_TYPE (undecided); | |
3227 | |
3228 struct undecided_coding_system | |
3229 { | |
3230 int do_eol, do_coding; | |
3231 Lisp_Object cs; | |
3232 }; | |
3233 | |
3234 struct undecided_coding_stream | |
3235 { | |
3236 Lisp_Object actual; | |
3237 /* Either 2 or 3 lstreams here; see undecided_convert */ | |
3238 struct chain_coding_stream c; | |
3239 | |
3240 struct detection_state *st; | |
3241 }; | |
3242 | |
3243 static const struct lrecord_description | |
3244 undecided_coding_system_description[] = { | |
3245 { XD_LISP_OBJECT, | |
3246 coding_system_data_offset + offsetof (struct undecided_coding_system, | |
3247 cs) }, | |
3248 { XD_END } | |
3249 }; | |
3250 | |
3251 static void | |
3252 undecided_init (Lisp_Object codesys) | |
3253 { | |
3254 struct undecided_coding_system *data = | |
3255 XCODING_SYSTEM_TYPE_DATA (codesys, undecided); | |
3256 | |
3257 data->cs = Qnil; | |
3258 } | |
3259 | |
3260 static void | |
3261 undecided_mark (Lisp_Object codesys) | |
3262 { | |
3263 struct undecided_coding_system *data = | |
3264 XCODING_SYSTEM_TYPE_DATA (codesys, undecided); | |
3265 | |
3266 mark_object (data->cs); | |
3267 } | |
3268 | |
3269 static void | |
3270 undecided_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) | |
3271 { | |
3272 struct undecided_coding_system *data = | |
3273 XCODING_SYSTEM_TYPE_DATA (cs, undecided); | |
3274 int need_space = 0; | |
3275 | |
3276 write_c_string ("(", printcharfun); | |
3277 if (data->do_eol) | |
3278 { | |
3279 write_c_string ("do-eol", printcharfun); | |
3280 need_space = 1; | |
3281 } | |
3282 if (data->do_coding) | |
3283 { | |
3284 if (need_space) | |
3285 write_c_string (" ", printcharfun); | |
3286 write_c_string ("do-coding", printcharfun); | |
3287 need_space = 1; | |
3288 } | |
3289 if (!NILP (data->cs)) | |
3290 { | |
3291 if (need_space) | |
3292 write_c_string (" ", printcharfun); | |
3293 write_c_string ("coding-system=", printcharfun); | |
3294 print_coding_system_in_print_method (data->cs, printcharfun, escapeflag); | |
3295 } | |
3296 write_c_string (")", printcharfun); | |
3297 } | |
3298 | |
3299 static void | |
3300 undecided_mark_coding_stream (struct coding_stream *str) | |
3301 { | |
3302 chain_mark_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c); | |
3303 } | |
3304 | |
3305 static int | |
3306 undecided_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) | |
3307 { | |
3308 struct undecided_coding_system *data = | |
3309 XCODING_SYSTEM_TYPE_DATA (codesys, undecided); | |
3310 | |
3311 if (EQ (key, Qdo_eol)) | |
3312 data->do_eol = 1; | |
3313 else if (EQ (key, Qdo_coding)) | |
3314 data->do_coding = 1; | |
3315 else if (EQ (key, Qcoding_system)) | |
3316 data->cs = get_coding_system_for_text_file (value, 0); | |
3317 else | |
3318 return 0; | |
3319 return 1; | |
3320 } | |
3321 | |
3322 static Lisp_Object | |
3323 undecided_getprop (Lisp_Object codesys, Lisp_Object prop) | |
3324 { | |
3325 struct undecided_coding_system *data = | |
3326 XCODING_SYSTEM_TYPE_DATA (codesys, undecided); | |
3327 | |
3328 if (EQ (prop, Qdo_eol)) | |
3329 return data->do_eol ? Qt : Qnil; | |
3330 if (EQ (prop, Qdo_coding)) | |
3331 return data->do_coding ? Qt : Qnil; | |
3332 if (EQ (prop, Qcoding_system)) | |
3333 return data->cs; | |
3334 return Qunbound; | |
3335 } | |
3336 | |
3337 static struct detection_state * | |
3338 allocate_detection_state (void) | |
3339 { | |
3340 int i; | |
3341 Bytecount size = MAX_ALIGN_SIZE (sizeof (struct detection_state)); | |
3342 struct detection_state *block; | |
3343 | |
3344 for (i = 0; i < coding_detector_count; i++) | |
3345 size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size); | |
3346 | |
3347 block = (struct detection_state *) xmalloc_and_zero (size); | |
3348 | |
3349 size = MAX_ALIGN_SIZE (sizeof (struct detection_state)); | |
3350 for (i = 0; i < coding_detector_count; i++) | |
3351 { | |
3352 block->data_offset[i] = size; | |
3353 size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size); | |
3354 } | |
3355 | |
3356 return block; | |
3357 } | |
3358 | |
3359 static void | |
3360 free_detection_state (struct detection_state *st) | |
3361 { | |
3362 int i; | |
3363 | |
3364 for (i = 0; i < coding_detector_count; i++) | |
3365 { | |
3366 if (Dynarr_at (all_coding_detectors, i).finalize_detection_state_method) | |
3367 Dynarr_at (all_coding_detectors, i).finalize_detection_state_method | |
3368 (st); | |
3369 } | |
3370 | |
3371 xfree (st); | |
3372 } | |
3373 | |
3374 static int | |
3375 coding_category_symbol_to_id (Lisp_Object symbol) | |
3376 { | |
3377 int i; | |
1419 | 3378 |
1420 CHECK_SYMBOL (symbol); | 3379 CHECK_SYMBOL (symbol); |
1421 for (i = 0; i < CODING_CATEGORY_LAST; i++) | 3380 for (i = 0; i < coding_detector_count; i++) |
1422 if (EQ (coding_category_symbol[i], symbol)) | 3381 { |
1423 return i; | 3382 detector_category_dynarr *cats = |
1424 | 3383 Dynarr_at (all_coding_detectors, i).cats; |
3384 int j; | |
3385 | |
3386 for (j = 0; j < Dynarr_length (cats); j++) | |
3387 if (EQ (Dynarr_at (cats, j).sym, symbol)) | |
3388 return Dynarr_at (cats, j).id; | |
3389 } | |
3390 | |
1425 invalid_constant ("Unrecognized coding category", symbol); | 3391 invalid_constant ("Unrecognized coding category", symbol); |
1426 return 0; /* not reached */ | 3392 return 0; /* not reached */ |
1427 } | 3393 } |
1428 | 3394 |
1429 DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /* | 3395 static Lisp_Object |
1430 Return a list of all recognized coding categories. | 3396 coding_category_id_to_symbol (int id) |
1431 */ | |
1432 ()) | |
1433 { | 3397 { |
1434 int i; | 3398 int i; |
1435 Lisp_Object list = Qnil; | 3399 |
1436 | 3400 for (i = 0; i < coding_detector_count; i++) |
1437 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) | 3401 { |
1438 list = Fcons (coding_category_symbol[i], list); | 3402 detector_category_dynarr *cats = |
1439 return list; | 3403 Dynarr_at (all_coding_detectors, i).cats; |
1440 } | 3404 int j; |
1441 | 3405 |
1442 DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /* | 3406 for (j = 0; j < Dynarr_length (cats); j++) |
1443 Change the priority order of the coding categories. | 3407 if (id == Dynarr_at (cats, j).id) |
1444 LIST should be list of coding categories, in descending order of | 3408 return Dynarr_at (cats, j).sym; |
1445 priority. Unspecified coding categories will be lower in priority | 3409 } |
1446 than all specified ones, in the same relative order they were in | 3410 |
1447 previously. | 3411 abort (); |
1448 */ | 3412 return Qnil; /* (usually) not reached */ |
1449 (list)) | 3413 } |
1450 { | 3414 |
1451 int category_to_priority[CODING_CATEGORY_LAST]; | 3415 static Lisp_Object |
1452 int i, j; | 3416 detection_result_number_to_symbol (enum detection_result result) |
1453 Lisp_Object rest; | 3417 { |
1454 | 3418 #define FROB(sym, num) if (result == num) return (sym) |
1455 /* First generate a list that maps coding categories to priorities. */ | 3419 FROB (Qnear_certainty, DET_NEAR_CERTAINTY); |
1456 | 3420 FROB (Qquite_probable, DET_QUITE_PROBABLE); |
1457 for (i = 0; i < CODING_CATEGORY_LAST; i++) | 3421 FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY); |
1458 category_to_priority[i] = -1; | 3422 FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY); |
1459 | 3423 FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY); |
1460 /* Highest priority comes from the specified list. */ | 3424 FROB (Qquite_improbable, DET_QUITE_IMPROBABLE); |
1461 i = 0; | 3425 FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE); |
1462 EXTERNAL_LIST_LOOP (rest, list) | 3426 #undef FROB |
1463 { | 3427 |
1464 int cat = decode_coding_category (XCAR (rest)); | 3428 abort (); |
1465 | 3429 return Qnil; /* (usually) not reached */ |
1466 if (category_to_priority[cat] >= 0) | 3430 } |
1467 sferror ("Duplicate coding category in list", XCAR (rest)); | 3431 |
1468 category_to_priority[cat] = i++; | 3432 static enum detection_result |
1469 } | 3433 detection_result_symbol_to_number (Lisp_Object symbol) |
1470 | 3434 { |
1471 /* Now go through the existing categories by priority to retrieve | 3435 #define FROB(sym, num) if (EQ (symbol, sym)) return (num) |
1472 the categories not yet specified and preserve their priority | 3436 FROB (Qnear_certainty, DET_NEAR_CERTAINTY); |
1473 order. */ | 3437 FROB (Qquite_probable, DET_QUITE_PROBABLE); |
1474 for (j = 0; j < CODING_CATEGORY_LAST; j++) | 3438 FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY); |
1475 { | 3439 FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY); |
1476 int cat = fcd->coding_category_by_priority[j]; | 3440 FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY); |
1477 if (category_to_priority[cat] < 0) | 3441 FROB (Qquite_improbable, DET_QUITE_IMPROBABLE); |
1478 category_to_priority[cat] = i++; | 3442 FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE); |
1479 } | 3443 #undef FROB |
1480 | 3444 |
1481 /* Now we need to construct the inverse of the mapping we just | 3445 invalid_constant ("Unrecognized detection result", symbol); |
1482 constructed. */ | 3446 return ((enum detection_result) 0); /* not reached */ |
1483 | 3447 } |
1484 for (i = 0; i < CODING_CATEGORY_LAST; i++) | 3448 |
1485 fcd->coding_category_by_priority[category_to_priority[i]] = i; | 3449 /* Set all detection results for a given detector to a specified value. */ |
1486 | 3450 void |
1487 /* Phew! That was confusing. */ | 3451 set_detection_results (struct detection_state *st, int detector, int given) |
1488 return Qnil; | 3452 { |
1489 } | 3453 detector_category_dynarr *cats = |
1490 | 3454 Dynarr_at (all_coding_detectors, detector).cats; |
1491 DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /* | |
1492 Return a list of coding categories in descending order of priority. | |
1493 */ | |
1494 ()) | |
1495 { | |
1496 int i; | 3455 int i; |
1497 Lisp_Object list = Qnil; | 3456 |
1498 | 3457 for (i = 0; i < Dynarr_length (cats); i++) |
1499 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) | 3458 st->categories[Dynarr_at (cats, i).id] = given; |
1500 list = Fcons (coding_category_symbol[fcd->coding_category_by_priority[i]], | 3459 } |
1501 list); | |
1502 return list; | |
1503 } | |
1504 | |
1505 DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /* | |
1506 Change the coding system associated with a coding category. | |
1507 */ | |
1508 (coding_category, coding_system)) | |
1509 { | |
1510 int cat = decode_coding_category (coding_category); | |
1511 | |
1512 coding_system = Fget_coding_system (coding_system); | |
1513 fcd->coding_category_system[cat] = coding_system; | |
1514 return Qnil; | |
1515 } | |
1516 | |
1517 DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /* | |
1518 Return the coding system associated with a coding category. | |
1519 */ | |
1520 (coding_category)) | |
1521 { | |
1522 int cat = decode_coding_category (coding_category); | |
1523 Lisp_Object sys = fcd->coding_category_system[cat]; | |
1524 | |
1525 if (!NILP (sys)) | |
1526 return XCODING_SYSTEM_NAME (sys); | |
1527 return Qnil; | |
1528 } | |
1529 | |
1530 | |
1531 /************************************************************************/ | |
1532 /* Detecting the encoding of data */ | |
1533 /************************************************************************/ | |
1534 | |
1535 struct detection_state | |
1536 { | |
1537 eol_type_t eol_type; | |
1538 int seen_non_ascii; | |
1539 int mask; | |
1540 #ifdef MULE | |
1541 struct | |
1542 { | |
1543 int mask; | |
1544 int in_second_byte; | |
1545 } | |
1546 big5; | |
1547 | |
1548 struct | |
1549 { | |
1550 int mask; | |
1551 int in_second_byte; | |
1552 } | |
1553 shift_jis; | |
1554 | |
1555 struct | |
1556 { | |
1557 int mask; | |
1558 int in_byte; | |
1559 } | |
1560 ucs4; | |
1561 | |
1562 struct | |
1563 { | |
1564 int mask; | |
1565 int in_byte; | |
1566 } | |
1567 utf8; | |
1568 | |
1569 struct | |
1570 { | |
1571 int mask; | |
1572 int initted; | |
1573 struct iso2022_decoder iso; | |
1574 unsigned int flags; | |
1575 int high_byte_count; | |
1576 unsigned int saw_single_shift:1; | |
1577 } | |
1578 iso2022; | |
1579 #endif | |
1580 struct | |
1581 { | |
1582 int seen_anything; | |
1583 int just_saw_cr; | |
1584 } | |
1585 eol; | |
1586 }; | |
1587 | 3460 |
1588 static int | 3461 static int |
1589 acceptable_control_char_p (int c) | 3462 acceptable_control_char_p (int c) |
1590 { | 3463 { |
1591 switch (c) | 3464 switch (c) |
1605 default: | 3478 default: |
1606 return 0; | 3479 return 0; |
1607 } | 3480 } |
1608 } | 3481 } |
1609 | 3482 |
3483 #ifdef DEBUG_XEMACS | |
3484 | |
3485 static UExtbyte | |
3486 hex_digit_to_char (int digit) | |
3487 { | |
3488 if (digit < 10) | |
3489 return digit + '0'; | |
3490 else | |
3491 return digit - 10 + 'A'; | |
3492 } | |
3493 | |
3494 static void | |
3495 output_bytes_in_ascii_and_hex (const UExtbyte *src, Bytecount n) | |
3496 { | |
3497 UExtbyte *ascii = alloca_array (UExtbyte, n + 1); | |
3498 UExtbyte *hex = alloca_array (UExtbyte, 3 * n + 1); | |
3499 int i; | |
3500 | |
3501 for (i = 0; i < n; i++) | |
3502 { | |
3503 UExtbyte c = src[i]; | |
3504 if (c < 0x20) | |
3505 ascii[i] = '.'; | |
3506 else | |
3507 ascii[i] = c; | |
3508 hex[3 * i] = hex_digit_to_char (c >> 4); | |
3509 hex[3 * i + 1] = hex_digit_to_char (c & 0xF); | |
3510 hex[3 * i + 2] = ' '; | |
3511 } | |
3512 ascii[i] = '\0'; | |
3513 hex[3 * i - 1] = '\0'; | |
3514 stderr_out ("%s %s", ascii, hex); | |
3515 } | |
3516 | |
3517 #endif /* DEBUG_XEMACS */ | |
3518 | |
3519 /* Attempt to determine the encoding of the given text. Before calling | |
3520 this function for the first time, you must zero out the detection state. | |
3521 | |
3522 Returns: | |
3523 | |
3524 0 == keep going | |
3525 1 == stop | |
3526 */ | |
3527 | |
1610 static int | 3528 static int |
1611 mask_has_at_most_one_bit_p (int mask) | 3529 detect_coding_type (struct detection_state *st, const UExtbyte *src, |
1612 { | 3530 Bytecount n) |
1613 /* Perhaps the only thing useful you learn from intensive Microsoft | 3531 { |
1614 technical interviews */ | 3532 Bytecount n2 = n; |
1615 return (mask & (mask - 1)) == 0; | 3533 const UExtbyte *src2 = src; |
1616 } | 3534 int i; |
1617 | 3535 |
1618 static eol_type_t | 3536 #ifdef DEBUG_XEMACS |
1619 detect_eol_type (struct detection_state *st, const Extbyte *src, | 3537 if (!NILP (Vdebug_coding_detection)) |
1620 Bytecount n) | 3538 { |
1621 { | 3539 int bytes = min (16, n); |
1622 while (n--) | 3540 stderr_out ("detect_coding_type: processing %ld bytes\n", n); |
1623 { | 3541 stderr_out ("First %d: ", bytes); |
1624 unsigned char c = *(unsigned char *)src++; | 3542 output_bytes_in_ascii_and_hex (src, bytes); |
1625 if (c == '\n') | 3543 stderr_out ("\nLast %d: ", bytes); |
3544 output_bytes_in_ascii_and_hex (src + n - bytes, bytes); | |
3545 stderr_out ("\n"); | |
3546 } | |
3547 #endif /* DEBUG_XEMACS */ | |
3548 if (!st->seen_non_ascii) | |
3549 { | |
3550 for (; n2; n2--, src2++) | |
1626 { | 3551 { |
1627 if (st->eol.just_saw_cr) | 3552 UExtbyte c = *src2; |
1628 return EOL_CRLF; | |
1629 else if (st->eol.seen_anything) | |
1630 return EOL_LF; | |
1631 } | |
1632 else if (st->eol.just_saw_cr) | |
1633 return EOL_CR; | |
1634 else if (c == '\r') | |
1635 st->eol.just_saw_cr = 1; | |
1636 else | |
1637 st->eol.just_saw_cr = 0; | |
1638 st->eol.seen_anything = 1; | |
1639 } | |
1640 | |
1641 return EOL_AUTODETECT; | |
1642 } | |
1643 | |
1644 /* Attempt to determine the encoding and EOL type of the given text. | |
1645 Before calling this function for the first type, you must initialize | |
1646 st->eol_type as appropriate and initialize st->mask to ~0. | |
1647 | |
1648 st->eol_type holds the determined EOL type, or EOL_AUTODETECT if | |
1649 not yet known. | |
1650 | |
1651 st->mask holds the determined coding category mask, or ~0 if only | |
1652 ASCII has been seen so far. | |
1653 | |
1654 Returns: | |
1655 | |
1656 0 == st->eol_type is EOL_AUTODETECT and/or more than coding category | |
1657 is present in st->mask | |
1658 1 == definitive answers are here for both st->eol_type and st->mask | |
1659 */ | |
1660 | |
1661 static int | |
1662 detect_coding_type (struct detection_state *st, const Extbyte *src, | |
1663 Bytecount n, int just_do_eol) | |
1664 { | |
1665 if (st->eol_type == EOL_AUTODETECT) | |
1666 st->eol_type = detect_eol_type (st, src, n); | |
1667 | |
1668 if (just_do_eol) | |
1669 return st->eol_type != EOL_AUTODETECT; | |
1670 | |
1671 if (!st->seen_non_ascii) | |
1672 { | |
1673 for (; n; n--, src++) | |
1674 { | |
1675 unsigned char c = *(unsigned char *) src; | |
1676 if ((c < 0x20 && !acceptable_control_char_p (c)) || c >= 0x80) | 3553 if ((c < 0x20 && !acceptable_control_char_p (c)) || c >= 0x80) |
1677 { | 3554 { |
1678 st->seen_non_ascii = 1; | 3555 st->seen_non_ascii = 1; |
1679 #ifdef MULE | |
1680 st->shift_jis.mask = ~0; | |
1681 st->big5.mask = ~0; | |
1682 st->ucs4.mask = ~0; | |
1683 st->utf8.mask = ~0; | |
1684 st->iso2022.mask = ~0; | |
1685 #endif | |
1686 break; | 3556 break; |
1687 } | 3557 } |
1688 } | 3558 } |
1689 } | 3559 } |
1690 | 3560 |
1691 if (!n) | 3561 for (i = 0; i < coding_detector_count; i++) |
1692 return 0; | 3562 Dynarr_at (all_coding_detectors, i).detect_method (st, src, n); |
1693 #ifdef MULE | 3563 |
1694 if (!mask_has_at_most_one_bit_p (st->iso2022.mask)) | 3564 st->bytes_seen += n; |
1695 st->iso2022.mask = detect_coding_iso2022 (st, src, n); | 3565 |
1696 if (!mask_has_at_most_one_bit_p (st->shift_jis.mask)) | 3566 #ifdef DEBUG_XEMACS |
1697 st->shift_jis.mask = detect_coding_sjis (st, src, n); | 3567 if (!NILP (Vdebug_coding_detection)) |
1698 if (!mask_has_at_most_one_bit_p (st->big5.mask)) | 3568 { |
1699 st->big5.mask = detect_coding_big5 (st, src, n); | 3569 stderr_out ("seen_non_ascii: %d\n", st->seen_non_ascii); |
1700 if (!mask_has_at_most_one_bit_p (st->utf8.mask)) | 3570 for (i = 0; i < coding_detector_category_count; i++) |
1701 st->utf8.mask = detect_coding_utf8 (st, src, n); | 3571 stderr_out_lisp |
1702 if (!mask_has_at_most_one_bit_p (st->ucs4.mask)) | 3572 ("%s: %s\n", |
1703 st->ucs4.mask = detect_coding_ucs4 (st, src, n); | 3573 2, |
1704 | 3574 coding_category_id_to_symbol (i), |
1705 st->mask | 3575 detection_result_number_to_symbol ((enum detection_result) |
1706 = st->iso2022.mask | st->shift_jis.mask | st->big5.mask | 3576 st->categories[i])); |
1707 | st->utf8.mask | st->ucs4.mask; | 3577 } |
3578 #endif /* DEBUG_XEMACS */ | |
3579 | |
3580 { | |
3581 int not_unlikely = 0; | |
3582 int retval; | |
3583 | |
3584 for (i = 0; i < coding_detector_category_count; i++) | |
3585 if (st->categories[i] >= 0) | |
3586 not_unlikely++; | |
3587 | |
3588 retval = (not_unlikely <= 1 | |
3589 #if 0 /* this is bogus */ | |
3590 || st->bytes_seen >= MAX_BYTES_PROCESSED_FOR_DETECTION | |
1708 #endif | 3591 #endif |
1709 { | 3592 ); |
1710 int retval = mask_has_at_most_one_bit_p (st->mask); | 3593 |
1711 st->mask |= CODING_CATEGORY_NO_CONVERSION_MASK; | 3594 #ifdef DEBUG_XEMACS |
1712 return retval && st->eol_type != EOL_AUTODETECT; | 3595 if (!NILP (Vdebug_coding_detection)) |
3596 stderr_out ("detect_coding_type: returning %d (%s)\n", | |
3597 retval, retval ? "stop" : "keep going"); | |
3598 #endif /* DEBUG_XEMACS */ | |
3599 | |
3600 return retval; | |
1713 } | 3601 } |
1714 } | 3602 } |
1715 | 3603 |
1716 static Lisp_Object | 3604 static Lisp_Object |
1717 coding_system_from_mask (int mask) | 3605 detected_coding_system (struct detection_state *st) |
1718 { | 3606 { |
1719 if (mask == ~0) | 3607 int i; |
3608 int even = 1; | |
3609 | |
3610 if (st->seen_non_ascii) | |
3611 { | |
3612 for (i = 0; i < coding_detector_category_count; i++) | |
3613 if (st->categories[i] != DET_AS_LIKELY_AS_UNLIKELY) | |
3614 { | |
3615 even = 0; | |
3616 break; | |
3617 } | |
3618 } | |
3619 | |
3620 /* #### Here we are ignoring the results of detection when it's all | |
3621 ASCII. This is obviously a bad thing. But we need to fix up the | |
3622 existing detection methods somewhat before we can switch. */ | |
3623 if (even) | |
1720 { | 3624 { |
1721 /* If the file was entirely or basically ASCII, use the | 3625 /* If the file was entirely or basically ASCII, use the |
1722 default value of `buffer-file-coding-system'. */ | 3626 default value of `buffer-file-coding-system'. */ |
1723 Lisp_Object retval = | 3627 Lisp_Object retval = |
1724 XBUFFER (Vbuffer_defaults)->buffer_file_coding_system; | 3628 XBUFFER (Vbuffer_defaults)->buffer_file_coding_system; |
1725 if (!NILP (retval)) | 3629 if (!NILP (retval)) |
1726 { | 3630 { |
1727 retval = Ffind_coding_system (retval); | 3631 retval = find_coding_system_for_text_file (retval, 0); |
1728 if (NILP (retval)) | 3632 if (NILP (retval)) |
1729 { | 3633 { |
1730 warn_when_safe | 3634 warn_when_safe |
1731 (Qbad_variable, Qwarning, | 3635 (Qbad_variable, Qwarning, |
1732 "Invalid `default-buffer-file-coding-system', set to nil"); | 3636 "Invalid `default-buffer-file-coding-system', set to nil"); |
1737 retval = Fget_coding_system (Qraw_text); | 3641 retval = Fget_coding_system (Qraw_text); |
1738 return retval; | 3642 return retval; |
1739 } | 3643 } |
1740 else | 3644 else |
1741 { | 3645 { |
1742 int i; | 3646 int likelihood; |
1743 int cat = -1; | 3647 Lisp_Object retval = Qnil; |
1744 #ifdef MULE | 3648 |
1745 mask = postprocess_iso2022_mask (mask); | 3649 /* Look through the coding categories first by likelihood and then by |
1746 #endif | 3650 priority and find the first one that is allowed. */ |
1747 /* Look through the coding categories by priority and find | 3651 |
1748 the first one that is allowed. */ | 3652 for (likelihood = DET_HIGHEST; likelihood >= DET_LOWEST; likelihood--) |
1749 for (i = 0; i < CODING_CATEGORY_LAST; i++) | |
1750 { | 3653 { |
1751 cat = fcd->coding_category_by_priority[i]; | 3654 for (i = 0; i < coding_detector_category_count; i++) |
1752 if ((mask & (1 << cat)) && | 3655 { |
1753 !NILP (fcd->coding_category_system[cat])) | 3656 int cat = coding_category_by_priority[i]; |
1754 break; | 3657 if (st->categories[cat] == likelihood && |
3658 !NILP (coding_category_system[cat])) | |
3659 { | |
3660 retval = (get_coding_system_for_text_file | |
3661 (coding_category_system[cat], 0)); | |
3662 if (likelihood < DET_AS_LIKELY_AS_UNLIKELY) | |
3663 warn_when_safe_lispobj | |
3664 (intern ("detection"), | |
3665 Qerror, | |
3666 emacs_sprintf_string_lisp | |
3667 ( | |
3668 "Detected coding %s is unlikely to be correct (likelihood == `%s')", | |
3669 Qnil, 2, XCODING_SYSTEM_NAME (retval), | |
3670 detection_result_number_to_symbol | |
3671 ((enum detection_result) likelihood))); | |
3672 return retval; | |
3673 } | |
3674 } | |
1755 } | 3675 } |
1756 if (cat >= 0) | 3676 |
1757 return fcd->coding_category_system[cat]; | 3677 return Fget_coding_system (Qraw_text); |
1758 else | |
1759 return Fget_coding_system (Qraw_text); | |
1760 } | 3678 } |
1761 } | 3679 } |
1762 | 3680 |
1763 /* Given a seekable read stream and potential coding system and EOL type | 3681 /* Given a seekable read stream and potential coding system and EOL type |
1764 as specified, do any autodetection that is called for. If the | 3682 as specified, do any autodetection that is called for. If the |
1769 This function does not automatically fetch subsidiary coding systems; | 3687 This function does not automatically fetch subsidiary coding systems; |
1770 that should be unnecessary with the explicit eol-type argument. */ | 3688 that should be unnecessary with the explicit eol-type argument. */ |
1771 | 3689 |
1772 #define LENGTH(string_constant) (sizeof (string_constant) - 1) | 3690 #define LENGTH(string_constant) (sizeof (string_constant) - 1) |
1773 | 3691 |
1774 void | 3692 static Lisp_Object |
1775 determine_real_coding_system (Lstream *stream, Lisp_Object *codesys_in_out, | 3693 unwind_free_detection_state (Lisp_Object opaque) |
1776 eol_type_t *eol_type_in_out) | 3694 { |
1777 { | 3695 struct detection_state *st = |
1778 struct detection_state decst; | 3696 (struct detection_state *) get_opaque_ptr (opaque); |
1779 | 3697 free_detection_state (st); |
1780 if (*eol_type_in_out == EOL_AUTODETECT) | 3698 free_opaque_ptr (opaque); |
1781 *eol_type_in_out = XCODING_SYSTEM_EOL_TYPE (*codesys_in_out); | 3699 return Qnil; |
1782 | 3700 } |
1783 xzero (decst); | 3701 |
1784 decst.eol_type = *eol_type_in_out; | 3702 static Lisp_Object |
1785 decst.mask = ~0; | 3703 look_for_coding_system_magic_cookie (const UExtbyte *data, Bytecount len) |
1786 | 3704 { |
1787 /* If autodetection is called for, do it now. */ | 3705 Lisp_Object coding_system = Qnil; |
1788 if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT | 3706 const UExtbyte *p; |
1789 || *eol_type_in_out == EOL_AUTODETECT) | 3707 const UExtbyte *scan_end; |
1790 { | 3708 |
1791 Extbyte buf[4096]; | 3709 /* Look for initial "-*-"; mode line prefix */ |
1792 Lisp_Object coding_system = Qnil; | 3710 for (p = data, |
1793 Extbyte *p; | 3711 scan_end = data + len - LENGTH ("-*-coding:?-*-"); |
1794 Bytecount nread = Lstream_read (stream, buf, sizeof (buf)); | 3712 p <= scan_end |
1795 Extbyte *scan_end; | 3713 && *p != '\n' |
1796 | 3714 && *p != '\r'; |
1797 /* Look for initial "-*-"; mode line prefix */ | 3715 p++) |
1798 for (p = buf, | 3716 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') |
1799 scan_end = buf + nread - LENGTH ("-*-coding:?-*-"); | 3717 { |
1800 p <= scan_end | 3718 const UExtbyte *local_vars_beg = p + 3; |
3719 /* Look for final "-*-"; mode line suffix */ | |
3720 for (p = local_vars_beg, | |
3721 scan_end = data + len - LENGTH ("-*-"); | |
3722 p <= scan_end | |
1801 && *p != '\n' | 3723 && *p != '\n' |
1802 && *p != '\r'; | 3724 && *p != '\r'; |
1803 p++) | 3725 p++) |
1804 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') | 3726 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') |
1805 { | 3727 { |
1806 Extbyte *local_vars_beg = p + 3; | 3728 const UExtbyte *suffix = p; |
1807 /* Look for final "-*-"; mode line suffix */ | 3729 /* Look for "coding:" */ |
1808 for (p = local_vars_beg, | 3730 for (p = local_vars_beg, |
1809 scan_end = buf + nread - LENGTH ("-*-"); | 3731 scan_end = suffix - LENGTH ("coding:?"); |
1810 p <= scan_end | 3732 p <= scan_end; |
1811 && *p != '\n' | 3733 p++) |
1812 && *p != '\r'; | 3734 if (memcmp ("coding:", p, LENGTH ("coding:")) == 0 |
1813 p++) | 3735 && (p == local_vars_beg |
1814 if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') | 3736 || (*(p-1) == ' ' || |
3737 *(p-1) == '\t' || | |
3738 *(p-1) == ';'))) | |
3739 { | |
3740 Bytecount n; | |
3741 Intbyte *name; | |
3742 | |
3743 p += LENGTH ("coding:"); | |
3744 while (*p == ' ' || *p == '\t') p++; | |
3745 name = alloca_intbytes (suffix - p + 1); | |
3746 memcpy (name, p, suffix - p); | |
3747 name[suffix - p] = '\0'; | |
3748 | |
3749 /* Get coding system name */ | |
3750 /* Characters valid in a MIME charset name (rfc 1521), | |
3751 and in a Lisp symbol name. */ | |
3752 n = qxestrspn (name, | |
3753 "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | |
3754 "abcdefghijklmnopqrstuvwxyz" | |
3755 "0123456789" | |
3756 "!$%&*+-.^_{|}~"); | |
3757 if (n > 0) | |
3758 { | |
3759 name[n] = '\0'; | |
3760 coding_system = | |
3761 find_coding_system_for_text_file (intern_int (name), | |
3762 0); | |
3763 } | |
3764 break; | |
3765 } | |
3766 break; | |
3767 } | |
3768 break; | |
3769 } | |
3770 | |
3771 return coding_system; | |
3772 } | |
3773 | |
3774 static Lisp_Object | |
3775 determine_real_coding_system (Lstream *stream) | |
3776 { | |
3777 struct detection_state *st = allocate_detection_state (); | |
3778 int depth = record_unwind_protect (unwind_free_detection_state, | |
3779 make_opaque_ptr (st)); | |
3780 UExtbyte buf[4096]; | |
3781 Bytecount nread = Lstream_read (stream, buf, sizeof (buf)); | |
3782 Lisp_Object coding_system = look_for_coding_system_magic_cookie (buf, nread); | |
3783 | |
3784 if (NILP (coding_system)) | |
3785 { | |
3786 while (1) | |
3787 { | |
3788 if (detect_coding_type (st, buf, nread)) | |
3789 break; | |
3790 nread = Lstream_read (stream, buf, sizeof (buf)); | |
3791 if (nread == 0) | |
3792 break; | |
3793 } | |
3794 | |
3795 coding_system = detected_coding_system (st); | |
3796 } | |
3797 | |
3798 Lstream_rewind (stream); | |
3799 | |
3800 unbind_to (depth); | |
3801 return coding_system; | |
3802 } | |
3803 | |
3804 static void | |
3805 undecided_init_coding_stream (struct coding_stream *str) | |
3806 { | |
3807 struct undecided_coding_stream *data = | |
3808 CODING_STREAM_TYPE_DATA (str, undecided); | |
3809 struct undecided_coding_system *csdata = | |
3810 XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided); | |
3811 | |
3812 data->actual = Qnil; | |
3813 | |
3814 if (str->direction == CODING_DECODE) | |
3815 { | |
3816 Lstream *lst = str->other_end; | |
3817 | |
3818 if ((lst->flags & LSTREAM_FL_READ) && | |
3819 Lstream_seekable_p (lst) && | |
3820 csdata->do_coding) | |
3821 /* We can determine the coding system now. */ | |
3822 data->actual = determine_real_coding_system (lst); | |
3823 } | |
3824 } | |
3825 | |
3826 static void | |
3827 undecided_rewind_coding_stream (struct coding_stream *str) | |
3828 { | |
3829 chain_rewind_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c); | |
3830 } | |
3831 | |
3832 static void | |
3833 undecided_finalize_coding_stream (struct coding_stream *str) | |
3834 { | |
3835 struct undecided_coding_stream *data = | |
3836 CODING_STREAM_TYPE_DATA (str, undecided); | |
3837 | |
3838 chain_finalize_coding_stream_1 | |
3839 (&CODING_STREAM_TYPE_DATA (str, undecided)->c); | |
3840 if (data->st) | |
3841 free_detection_state (data->st); | |
3842 } | |
3843 | |
3844 static Lisp_Object | |
3845 undecided_canonicalize (Lisp_Object codesys) | |
3846 { | |
3847 struct undecided_coding_system *csdata = | |
3848 XCODING_SYSTEM_TYPE_DATA (codesys, undecided); | |
3849 if (!csdata->do_eol && !csdata->do_coding) | |
3850 return NILP (csdata->cs) ? Fget_coding_system (Qbinary) : csdata->cs; | |
3851 if (csdata->do_eol && !csdata->do_coding && NILP (csdata->cs)) | |
3852 return Fget_coding_system (Qconvert_eol_autodetect); | |
3853 return codesys; | |
3854 } | |
3855 | |
3856 static Bytecount | |
3857 undecided_convert (struct coding_stream *str, const UExtbyte *src, | |
3858 unsigned_char_dynarr *dst, Bytecount n) | |
3859 { | |
3860 int first_time = 0; | |
3861 | |
3862 if (str->direction == CODING_DECODE) | |
3863 { | |
3864 /* At this point, we have only the following possibilities: | |
3865 | |
3866 do_eol && do_coding | |
3867 do_coding only | |
3868 do_eol only and a coding system was specified | |
3869 | |
3870 Other possibilities are removed during undecided_canonicalize. | |
3871 | |
3872 Therefore, our substreams are either | |
3873 | |
3874 lstream_coding -> lstream_dynarr, or | |
3875 lstream_coding -> lstream_eol -> lstream_dynarr. | |
3876 */ | |
3877 struct undecided_coding_system *csdata = | |
3878 XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided); | |
3879 struct undecided_coding_stream *data = | |
3880 CODING_STREAM_TYPE_DATA (str, undecided); | |
3881 | |
3882 if (str->eof) | |
3883 { | |
3884 /* Each will close the next. We need to close now because more | |
3885 data may be generated. */ | |
3886 if (data->c.initted) | |
3887 Lstream_close (XLSTREAM (data->c.lstreams[0])); | |
3888 return n; | |
3889 } | |
3890 | |
3891 if (!data->c.initted) | |
3892 { | |
3893 data->c.lstream_count = csdata->do_eol ? 3 : 2; | |
3894 data->c.lstreams = xnew_array (Lisp_Object, data->c.lstream_count); | |
3895 | |
3896 data->c.lstreams[data->c.lstream_count - 1] = | |
3897 make_dynarr_output_stream (dst); | |
3898 Lstream_set_buffering | |
3899 (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]), | |
3900 LSTREAM_UNBUFFERED, 0); | |
3901 if (csdata->do_eol) | |
3902 { | |
3903 data->c.lstreams[1] = | |
3904 make_coding_output_stream | |
3905 (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]), | |
3906 Fget_coding_system (Qconvert_eol_autodetect), | |
3907 CODING_DECODE); | |
3908 Lstream_set_buffering | |
3909 (XLSTREAM (data->c.lstreams[1]), | |
3910 LSTREAM_UNBUFFERED, 0); | |
3911 } | |
3912 | |
3913 data->c.lstreams[0] = | |
3914 make_coding_output_stream | |
3915 (XLSTREAM (data->c.lstreams[1]), | |
3916 /* Substitute binary if we need to detect the encoding */ | |
3917 csdata->do_coding ? Qbinary : csdata->cs, | |
3918 CODING_DECODE); | |
3919 Lstream_set_buffering (XLSTREAM (data->c.lstreams[0]), | |
3920 LSTREAM_UNBUFFERED, 0); | |
3921 | |
3922 first_time = 1; | |
3923 data->c.initted = 1; | |
3924 } | |
3925 | |
3926 /* If necessary, do encoding-detection now. We do this when we're a | |
3927 writing stream or a non-seekable reading stream, meaning that we | |
3928 can't just process the whole input, rewind, and start over. */ | |
3929 | |
3930 if (csdata->do_coding) | |
3931 { | |
3932 int actual_was_nil = NILP (data->actual); | |
3933 if (NILP (data->actual)) | |
3934 { | |
3935 if (!data->st) | |
3936 data->st = allocate_detection_state (); | |
3937 if (first_time) | |
3938 /* #### This is cheesy. What we really ought to do is buffer | |
3939 up a certain minimum amount of data to get a better result. | |
3940 */ | |
3941 data->actual = look_for_coding_system_magic_cookie (src, n); | |
3942 if (NILP (data->actual)) | |
1815 { | 3943 { |
1816 Extbyte *suffix = p; | 3944 /* #### This is cheesy. What we really ought to do is buffer |
1817 /* Look for "coding:" */ | 3945 up a certain minimum amount of data so as to get a less |
1818 for (p = local_vars_beg, | 3946 random result when doing subprocess detection. */ |
1819 scan_end = suffix - LENGTH ("coding:?"); | 3947 detect_coding_type (data->st, src, n); |
1820 p <= scan_end; | 3948 data->actual = detected_coding_system (data->st); |
1821 p++) | |
1822 if (memcmp ("coding:", p, LENGTH ("coding:")) == 0 | |
1823 && (p == local_vars_beg | |
1824 || (*(p-1) == ' ' || | |
1825 *(p-1) == '\t' || | |
1826 *(p-1) == ';'))) | |
1827 { | |
1828 Extbyte save; | |
1829 int n; | |
1830 p += LENGTH ("coding:"); | |
1831 while (*p == ' ' || *p == '\t') p++; | |
1832 | |
1833 /* Get coding system name */ | |
1834 save = *suffix; *suffix = '\0'; | |
1835 /* Characters valid in a MIME charset name (rfc 1521), | |
1836 and in a Lisp symbol name. */ | |
1837 n = strspn ( (char *) p, | |
1838 "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | |
1839 "abcdefghijklmnopqrstuvwxyz" | |
1840 "0123456789" | |
1841 "!$%&*+-.^_{|}~"); | |
1842 *suffix = save; | |
1843 if (n > 0) | |
1844 { | |
1845 save = p[n]; p[n] = '\0'; | |
1846 coding_system = | |
1847 Ffind_coding_system (intern ((char *) p)); | |
1848 p[n] = save; | |
1849 } | |
1850 break; | |
1851 } | |
1852 break; | |
1853 } | 3949 } |
1854 break; | 3950 } |
1855 } | 3951 /* We need to set the detected coding system if we actually have |
1856 | 3952 such a coding system but didn't before. That is the case |
1857 if (NILP (coding_system)) | 3953 either when we just detected it in the previous code or when |
1858 do | 3954 it was detected during undecided_init_coding_stream(). We |
1859 { | 3955 can check for that using first_time. */ |
1860 if (detect_coding_type (&decst, buf, nread, | 3956 if (!NILP (data->actual) && (actual_was_nil || first_time)) |
1861 XCODING_SYSTEM_TYPE (*codesys_in_out) | 3957 { |
1862 != CODESYS_AUTODETECT)) | 3958 /* If the detected coding system doesn't allow for EOL |
1863 break; | 3959 autodetection, try to get the equivalent that does; |
1864 nread = Lstream_read (stream, buf, sizeof (buf)); | 3960 otherwise, disable EOL detection (overriding whatever |
1865 if (nread == 0) | 3961 may already have been detected). */ |
1866 break; | 3962 if (XCODING_SYSTEM_EOL_TYPE (data->actual) != EOL_AUTODETECT) |
1867 } | 3963 { |
1868 while (1); | 3964 if (!NILP (XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual))) |
1869 | 3965 data->actual = |
1870 else if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT | 3966 XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual); |
1871 && XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT) | 3967 else if (data->c.lstream_count == 3) |
1872 do | 3968 set_coding_stream_coding_system |
1873 { | 3969 (XLSTREAM (data->c.lstreams[1]), |
1874 if (detect_coding_type (&decst, buf, nread, 1)) | 3970 Fget_coding_system (Qidentity)); |
1875 break; | 3971 } |
1876 nread = Lstream_read (stream, buf, sizeof (buf)); | 3972 set_coding_stream_coding_system |
1877 if (!nread) | 3973 (XLSTREAM (data->c.lstreams[0]), data->actual); |
1878 break; | 3974 } |
1879 } | |
1880 while (1); | |
1881 | |
1882 *eol_type_in_out = decst.eol_type; | |
1883 if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT) | |
1884 { | |
1885 if (NILP (coding_system)) | |
1886 *codesys_in_out = coding_system_from_mask (decst.mask); | |
1887 else | |
1888 *codesys_in_out = coding_system; | |
1889 } | 3975 } |
1890 } | 3976 |
1891 | 3977 if (Lstream_write (XLSTREAM (data->c.lstreams[0]), src, n) < 0) |
1892 /* If we absolutely can't determine the EOL type, just assume LF. */ | 3978 return -1; |
1893 if (*eol_type_in_out == EOL_AUTODETECT) | 3979 return n; |
1894 *eol_type_in_out = EOL_LF; | 3980 } |
1895 | 3981 else |
1896 Lstream_rewind (stream); | 3982 return no_conversion_convert (str, src, dst, n); |
3983 } | |
3984 | |
3985 static Lisp_Object | |
3986 undecided_canonicalize_after_coding (struct coding_stream *str) | |
3987 { | |
3988 struct undecided_coding_stream *data = | |
3989 CODING_STREAM_TYPE_DATA (str, undecided); | |
3990 Lisp_Object ret, eolret; | |
3991 | |
3992 if (str->direction == CODING_ENCODE) | |
3993 return str->codesys; | |
3994 | |
3995 if (!data->c.initted) | |
3996 return Fget_coding_system (Qundecided); | |
3997 | |
3998 ret = coding_stream_canonicalize_after_coding | |
3999 (XLSTREAM (data->c.lstreams[0])); | |
4000 if (NILP (ret)) | |
4001 ret = Fget_coding_system (Qundecided); | |
4002 if (XCODING_SYSTEM_EOL_TYPE (ret) != EOL_AUTODETECT) | |
4003 return ret; | |
4004 eolret = coding_stream_canonicalize_after_coding | |
4005 (XLSTREAM (data->c.lstreams[1])); | |
4006 if (!EQ (XCODING_SYSTEM_TYPE (eolret), Qconvert_eol)) | |
4007 return ret; | |
4008 return | |
4009 Fsubsidiary_coding_system (ret, Fcoding_system_property (eolret, | |
4010 Qsubtype)); | |
4011 } | |
4012 | |
4013 | |
4014 /************************************************************************/ | |
4015 /* Lisp interface: Coding category functions and detection */ | |
4016 /************************************************************************/ | |
4017 | |
4018 DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /* | |
4019 Return a list of all recognized coding categories. | |
4020 */ | |
4021 ()) | |
4022 { | |
4023 int i; | |
4024 Lisp_Object list = Qnil; | |
4025 | |
4026 for (i = 0; i < coding_detector_count; i++) | |
4027 { | |
4028 detector_category_dynarr *cats = | |
4029 Dynarr_at (all_coding_detectors, i).cats; | |
4030 int j; | |
4031 | |
4032 for (j = 0; j < Dynarr_length (cats); j++) | |
4033 list = Fcons (Dynarr_at (cats, j).sym, list); | |
4034 } | |
4035 | |
4036 return Fnreverse (list); | |
4037 } | |
4038 | |
4039 DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /* | |
4040 Change the priority order of the coding categories. | |
4041 LIST should be list of coding categories, in descending order of | |
4042 priority. Unspecified coding categories will be lower in priority | |
4043 than all specified ones, in the same relative order they were in | |
4044 previously. | |
4045 */ | |
4046 (list)) | |
4047 { | |
4048 int *category_to_priority = | |
4049 alloca_array (int, coding_detector_category_count); | |
4050 int i, j; | |
4051 Lisp_Object rest; | |
4052 | |
4053 /* First generate a list that maps coding categories to priorities. */ | |
4054 | |
4055 for (i = 0; i < coding_detector_category_count; i++) | |
4056 category_to_priority[i] = -1; | |
4057 | |
4058 /* Highest priority comes from the specified list. */ | |
4059 i = 0; | |
4060 EXTERNAL_LIST_LOOP (rest, list) | |
4061 { | |
4062 int cat = coding_category_symbol_to_id (XCAR (rest)); | |
4063 | |
4064 if (category_to_priority[cat] >= 0) | |
4065 sferror ("Duplicate coding category in list", XCAR (rest)); | |
4066 category_to_priority[cat] = i++; | |
4067 } | |
4068 | |
4069 /* Now go through the existing categories by priority to retrieve | |
4070 the categories not yet specified and preserve their priority | |
4071 order. */ | |
4072 for (j = 0; j < coding_detector_category_count; j++) | |
4073 { | |
4074 int cat = coding_category_by_priority[j]; | |
4075 if (category_to_priority[cat] < 0) | |
4076 category_to_priority[cat] = i++; | |
4077 } | |
4078 | |
4079 /* Now we need to construct the inverse of the mapping we just | |
4080 constructed. */ | |
4081 | |
4082 for (i = 0; i < coding_detector_category_count; i++) | |
4083 coding_category_by_priority[category_to_priority[i]] = i; | |
4084 | |
4085 /* Phew! That was confusing. */ | |
4086 return Qnil; | |
4087 } | |
4088 | |
4089 DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /* | |
4090 Return a list of coding categories in descending order of priority. | |
4091 */ | |
4092 ()) | |
4093 { | |
4094 int i; | |
4095 Lisp_Object list = Qnil; | |
4096 | |
4097 for (i = 0; i < coding_detector_category_count; i++) | |
4098 list = | |
4099 Fcons (coding_category_id_to_symbol (coding_category_by_priority[i]), | |
4100 list); | |
4101 return Fnreverse (list); | |
4102 } | |
4103 | |
4104 DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /* | |
4105 Change the coding system associated with a coding category. | |
4106 */ | |
4107 (coding_category, coding_system)) | |
4108 { | |
4109 coding_category_system[coding_category_symbol_to_id (coding_category)] = | |
4110 Fget_coding_system (coding_system); | |
4111 return Qnil; | |
4112 } | |
4113 | |
4114 DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /* | |
4115 Return the coding system associated with a coding category. | |
4116 */ | |
4117 (coding_category)) | |
4118 { | |
4119 Lisp_Object sys = | |
4120 coding_category_system[coding_category_symbol_to_id (coding_category)]; | |
4121 | |
4122 if (!NILP (sys)) | |
4123 return XCODING_SYSTEM_NAME (sys); | |
4124 return Qnil; | |
4125 } | |
4126 | |
4127 Lisp_Object | |
4128 detect_coding_stream (Lisp_Object stream) | |
4129 { | |
4130 Lisp_Object val = Qnil; | |
4131 struct gcpro gcpro1, gcpro2, gcpro3; | |
4132 UExtbyte random_buffer[65536]; | |
4133 Lisp_Object binary_instream = | |
4134 make_coding_input_stream | |
4135 (XLSTREAM (stream), Qbinary, | |
4136 CODING_ENCODE); | |
4137 Lisp_Object decstream = | |
4138 make_coding_input_stream | |
4139 (XLSTREAM (binary_instream), | |
4140 Qundecided, CODING_DECODE); | |
4141 Lstream *decstr = XLSTREAM (decstream); | |
4142 | |
4143 GCPRO3 (decstream, stream, binary_instream); | |
4144 /* Read and discard all data; detection happens as a side effect of this, | |
4145 and we examine what was detected afterwards. */ | |
4146 while (Lstream_read (decstr, random_buffer, sizeof (random_buffer)) > 0) | |
4147 ; | |
4148 | |
4149 val = coding_stream_detected_coding_system (decstr); | |
4150 Lstream_close (decstr); | |
4151 Lstream_delete (decstr); | |
4152 Lstream_delete (XLSTREAM (binary_instream)); | |
4153 UNGCPRO; | |
4154 return val; | |
1897 } | 4155 } |
1898 | 4156 |
1899 DEFUN ("detect-coding-region", Fdetect_coding_region, 2, 3, 0, /* | 4157 DEFUN ("detect-coding-region", Fdetect_coding_region, 2, 3, 0, /* |
1900 Detect coding system of the text in the region between START and END. | 4158 Detect coding system of the text in the region between START and END. |
1901 Return a list of possible coding systems ordered by priority. | 4159 Return a list of possible coding systems ordered by priority. |
1906 (start, end, buffer)) | 4164 (start, end, buffer)) |
1907 { | 4165 { |
1908 Lisp_Object val = Qnil; | 4166 Lisp_Object val = Qnil; |
1909 struct buffer *buf = decode_buffer (buffer, 0); | 4167 struct buffer *buf = decode_buffer (buffer, 0); |
1910 Charbpos b, e; | 4168 Charbpos b, e; |
1911 Lisp_Object instream, lb_instream; | 4169 Lisp_Object lb_instream; |
1912 Lstream *istr, *lb_istr; | |
1913 struct detection_state decst; | |
1914 struct gcpro gcpro1, gcpro2; | |
1915 | 4170 |
1916 get_buffer_range_char (buf, start, end, &b, &e, 0); | 4171 get_buffer_range_char (buf, start, end, &b, &e, 0); |
1917 lb_instream = make_lisp_buffer_input_stream (buf, b, e, 0); | 4172 lb_instream = make_lisp_buffer_input_stream (buf, b, e, 0); |
1918 lb_istr = XLSTREAM (lb_instream); | 4173 |
1919 instream = make_encoding_input_stream (lb_istr, Fget_coding_system (Qbinary)); | 4174 val = detect_coding_stream (lb_instream); |
1920 istr = XLSTREAM (instream); | 4175 Lstream_delete (XLSTREAM (lb_instream)); |
1921 GCPRO2 (instream, lb_instream); | 4176 return val; |
1922 xzero (decst); | 4177 } |
1923 decst.eol_type = EOL_AUTODETECT; | 4178 |
1924 decst.mask = ~0; | 4179 |
1925 while (1) | 4180 |
1926 { | 4181 #ifdef DEBUG_XEMACS |
1927 Extbyte random_buffer[4096]; | 4182 |
1928 Bytecount nread = Lstream_read (istr, random_buffer, sizeof (random_buffer)); | 4183 /************************************************************************/ |
1929 | 4184 /* Internal methods */ |
1930 if (!nread) | 4185 /************************************************************************/ |
1931 break; | 4186 |
1932 if (detect_coding_type (&decst, random_buffer, nread, 0)) | 4187 /* Raw (internally-formatted) data. */ |
1933 break; | 4188 DEFINE_CODING_SYSTEM_TYPE (internal); |
1934 } | 4189 |
1935 | 4190 static Bytecount |
1936 if (decst.mask == ~0) | 4191 internal_convert (struct coding_stream *str, const UExtbyte *src, |
1937 val = subsidiary_coding_system (Fget_coding_system (Qundecided), | 4192 unsigned_char_dynarr *dst, Bytecount n) |
1938 decst.eol_type); | 4193 { |
4194 Bytecount orign = n; | |
4195 Dynarr_add_many (dst, src, n); | |
4196 return orign; | |
4197 } | |
4198 | |
4199 #endif /* DEBUG_XEMACS */ | |
4200 | |
4201 | |
4202 | |
4203 #ifdef HAVE_ZLIB | |
4204 | |
4205 /************************************************************************/ | |
4206 /* Gzip methods */ | |
4207 /************************************************************************/ | |
4208 | |
4209 DEFINE_CODING_SYSTEM_TYPE (gzip); | |
4210 | |
4211 struct gzip_coding_system | |
4212 { | |
4213 int level; /* 0 through 9, or -1 for default */ | |
4214 }; | |
4215 | |
4216 #define CODING_SYSTEM_GZIP_LEVEL(codesys) \ | |
4217 (CODING_SYSTEM_TYPE_DATA (codesys, gzip)->level) | |
4218 #define XCODING_SYSTEM_GZIP_LEVEL(codesys) \ | |
4219 (XCODING_SYSTEM_TYPE_DATA (codesys, gzip)->level) | |
4220 | |
4221 struct gzip_coding_stream | |
4222 { | |
4223 z_stream stream; | |
4224 int stream_initted; | |
4225 int reached_eof; /* #### this should be handled by the caller, once we | |
4226 return LSTREAM_EOF */ | |
4227 }; | |
4228 | |
4229 static const struct lrecord_description | |
4230 gzip_coding_system_description[] = { | |
4231 { XD_END } | |
4232 }; | |
4233 | |
4234 enum source_sink_type | |
4235 gzip_conversion_end_type (Lisp_Object codesys) | |
4236 { | |
4237 return DECODES_BYTE_TO_BYTE; | |
4238 } | |
4239 | |
4240 static void | |
4241 gzip_init (Lisp_Object codesys) | |
4242 { | |
4243 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip); | |
4244 data->level = -1; | |
4245 } | |
4246 | |
4247 static void | |
4248 gzip_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) | |
4249 { | |
4250 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (cs, gzip); | |
4251 | |
4252 write_c_string ("(", printcharfun); | |
4253 if (data->level == -1) | |
4254 write_c_string ("default", printcharfun); | |
1939 else | 4255 else |
1940 { | 4256 print_internal (make_int (data->level), printcharfun, 0); |
1941 int i; | 4257 write_c_string (")", printcharfun); |
1942 | |
1943 val = Qnil; | |
1944 #ifdef MULE | |
1945 decst.mask = postprocess_iso2022_mask (decst.mask); | |
1946 #endif | |
1947 for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) | |
1948 { | |
1949 int sys = fcd->coding_category_by_priority[i]; | |
1950 if (decst.mask & (1 << sys)) | |
1951 { | |
1952 Lisp_Object codesys = fcd->coding_category_system[sys]; | |
1953 if (!NILP (codesys)) | |
1954 codesys = subsidiary_coding_system (codesys, decst.eol_type); | |
1955 val = Fcons (codesys, val); | |
1956 } | |
1957 } | |
1958 } | |
1959 Lstream_close (istr); | |
1960 UNGCPRO; | |
1961 Lstream_delete (istr); | |
1962 Lstream_delete (lb_istr); | |
1963 return val; | |
1964 } | |
1965 | |
1966 | |
1967 /************************************************************************/ | |
1968 /* Converting to internal Mule format ("decoding") */ | |
1969 /************************************************************************/ | |
1970 | |
1971 /* A decoding stream is a stream used for decoding text (i.e. | |
1972 converting from some external format to internal format). | |
1973 The decoding-stream object keeps track of the actual coding | |
1974 stream, the stream that is at the other end, and data that | |
1975 needs to be persistent across the lifetime of the stream. */ | |
1976 | |
1977 /* Handle the EOL stuff related to just-read-in character C. | |
1978 EOL_TYPE is the EOL type of the coding stream. | |
1979 FLAGS is the current value of FLAGS in the coding stream, and may | |
1980 be modified by this macro. (The macro only looks at the | |
1981 CODING_STATE_CR flag.) DST is the Dynarr to which the decoded | |
1982 bytes are to be written. You need to also define a local goto | |
1983 label "label_continue_loop" that is at the end of the main | |
1984 character-reading loop. | |
1985 | |
1986 If C is a CR character, then this macro handles it entirely and | |
1987 jumps to label_continue_loop. Otherwise, this macro does not add | |
1988 anything to DST, and continues normally. You should continue | |
1989 processing C normally after this macro. */ | |
1990 | |
1991 #define DECODE_HANDLE_EOL_TYPE(eol_type, c, flags, dst) \ | |
1992 do { \ | |
1993 if (c == '\r') \ | |
1994 { \ | |
1995 if (eol_type == EOL_CR) \ | |
1996 Dynarr_add (dst, '\n'); \ | |
1997 else if (eol_type != EOL_CRLF || flags & CODING_STATE_CR) \ | |
1998 Dynarr_add (dst, c); \ | |
1999 else \ | |
2000 flags |= CODING_STATE_CR; \ | |
2001 goto label_continue_loop; \ | |
2002 } \ | |
2003 else if (flags & CODING_STATE_CR) \ | |
2004 { /* eol_type == CODING_SYSTEM_EOL_CRLF */ \ | |
2005 if (c != '\n') \ | |
2006 Dynarr_add (dst, '\r'); \ | |
2007 flags &= ~CODING_STATE_CR; \ | |
2008 } \ | |
2009 } while (0) | |
2010 | |
2011 /* C should be a binary character in the range 0 - 255; convert | |
2012 to internal format and add to Dynarr DST. */ | |
2013 | |
2014 #define DECODE_ADD_BINARY_CHAR(c, dst) \ | |
2015 do { \ | |
2016 if (BYTE_ASCII_P (c)) \ | |
2017 Dynarr_add (dst, c); \ | |
2018 else if (BYTE_C1_P (c)) \ | |
2019 { \ | |
2020 Dynarr_add (dst, LEADING_BYTE_CONTROL_1); \ | |
2021 Dynarr_add (dst, c + 0x20); \ | |
2022 } \ | |
2023 else \ | |
2024 { \ | |
2025 Dynarr_add (dst, LEADING_BYTE_LATIN_ISO8859_1); \ | |
2026 Dynarr_add (dst, c); \ | |
2027 } \ | |
2028 } while (0) | |
2029 | |
2030 #define DECODE_OUTPUT_PARTIAL_CHAR(ch) \ | |
2031 do { \ | |
2032 if (ch) \ | |
2033 { \ | |
2034 DECODE_ADD_BINARY_CHAR (ch, dst); \ | |
2035 ch = 0; \ | |
2036 } \ | |
2037 } while (0) | |
2038 | |
2039 #define DECODE_HANDLE_END_OF_CONVERSION(flags, ch, dst) \ | |
2040 do { \ | |
2041 if (flags & CODING_STATE_END) \ | |
2042 { \ | |
2043 DECODE_OUTPUT_PARTIAL_CHAR (ch); \ | |
2044 if (flags & CODING_STATE_CR) \ | |
2045 Dynarr_add (dst, '\r'); \ | |
2046 } \ | |
2047 } while (0) | |
2048 | |
2049 #define DECODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, decoding) | |
2050 | |
2051 struct decoding_stream | |
2052 { | |
2053 /* Coding system that governs the conversion. */ | |
2054 Lisp_Coding_System *codesys; | |
2055 | |
2056 /* Stream that we read the encoded data from or | |
2057 write the decoded data to. */ | |
2058 Lstream *other_end; | |
2059 | |
2060 /* If we are reading, then we can return only a fixed amount of | |
2061 data, so if the conversion resulted in too much data, we store it | |
2062 here for retrieval the next time around. */ | |
2063 unsigned_char_dynarr *runoff; | |
2064 | |
2065 /* FLAGS holds flags indicating the current state of the decoding. | |
2066 Some of these flags are dependent on the coding system. */ | |
2067 unsigned int flags; | |
2068 | |
2069 /* CH holds a partially built-up character. Since we only deal | |
2070 with one- and two-byte characters at the moment, we only use | |
2071 this to store the first byte of a two-byte character. */ | |
2072 unsigned int ch; | |
2073 | |
2074 /* EOL_TYPE specifies the type of end-of-line conversion that | |
2075 currently applies. We need to keep this separate from the | |
2076 EOL type stored in CODESYS because the latter might indicate | |
2077 automatic EOL-type detection while the former will always | |
2078 indicate a particular EOL type. */ | |
2079 eol_type_t eol_type; | |
2080 #ifdef MULE | |
2081 /* Additional ISO2022 information. We define the structure above | |
2082 because it's also needed by the detection routines. */ | |
2083 struct iso2022_decoder iso2022; | |
2084 | |
2085 /* Additional information (the state of the running CCL program) | |
2086 used by the CCL decoder. */ | |
2087 struct ccl_program ccl; | |
2088 | |
2089 /* counter for UTF-8 or UCS-4 */ | |
2090 unsigned char counter; | |
2091 #endif | |
2092 struct detection_state decst; | |
2093 }; | |
2094 | |
2095 static Bytecount decoding_reader (Lstream *stream, | |
2096 unsigned char *data, Bytecount size); | |
2097 static Bytecount decoding_writer (Lstream *stream, | |
2098 const unsigned char *data, Bytecount size); | |
2099 static int decoding_rewinder (Lstream *stream); | |
2100 static int decoding_seekable_p (Lstream *stream); | |
2101 static int decoding_flusher (Lstream *stream); | |
2102 static int decoding_closer (Lstream *stream); | |
2103 | |
2104 static Lisp_Object decoding_marker (Lisp_Object stream); | |
2105 | |
2106 DEFINE_LSTREAM_IMPLEMENTATION ("decoding", lstream_decoding, | |
2107 sizeof (struct decoding_stream)); | |
2108 | |
2109 static Lisp_Object | |
2110 decoding_marker (Lisp_Object stream) | |
2111 { | |
2112 Lstream *str = DECODING_STREAM_DATA (XLSTREAM (stream))->other_end; | |
2113 Lisp_Object str_obj; | |
2114 | |
2115 /* We do not need to mark the coding systems or charsets stored | |
2116 within the stream because they are stored in a global list | |
2117 and automatically marked. */ | |
2118 | |
2119 XSETLSTREAM (str_obj, str); | |
2120 mark_object (str_obj); | |
2121 if (str->imp->marker) | |
2122 return (str->imp->marker) (str_obj); | |
2123 else | |
2124 return Qnil; | |
2125 } | |
2126 | |
2127 /* Read SIZE bytes of data and store it into DATA. We are a decoding stream | |
2128 so we read data from the other end, decode it, and store it into DATA. */ | |
2129 | |
2130 static Bytecount | |
2131 decoding_reader (Lstream *stream, unsigned char *data, Bytecount size) | |
2132 { | |
2133 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2134 unsigned char *orig_data = data; | |
2135 Bytecount read_size; | |
2136 int error_occurred = 0; | |
2137 | |
2138 /* We need to interface to mule_decode(), which expects to take some | |
2139 amount of data and store the result into a Dynarr. We have | |
2140 mule_decode() store into str->runoff, and take data from there | |
2141 as necessary. */ | |
2142 | |
2143 /* We loop until we have enough data, reading chunks from the other | |
2144 end and decoding it. */ | |
2145 while (1) | |
2146 { | |
2147 /* Take data from the runoff if we can. Make sure to take at | |
2148 most SIZE bytes, and delete the data from the runoff. */ | |
2149 if (Dynarr_length (str->runoff) > 0) | |
2150 { | |
2151 Bytecount chunk = min (size, (Bytecount) Dynarr_length (str->runoff)); | |
2152 memcpy (data, Dynarr_atp (str->runoff, 0), chunk); | |
2153 Dynarr_delete_many (str->runoff, 0, chunk); | |
2154 data += chunk; | |
2155 size -= chunk; | |
2156 } | |
2157 | |
2158 if (size == 0) | |
2159 break; /* No more room for data */ | |
2160 | |
2161 if (str->flags & CODING_STATE_END) | |
2162 /* This means that on the previous iteration, we hit the EOF on | |
2163 the other end. We loop once more so that mule_decode() can | |
2164 output any final stuff it may be holding, or any "go back | |
2165 to a sane state" escape sequences. (This latter makes sense | |
2166 during encoding.) */ | |
2167 break; | |
2168 | |
2169 /* Exhausted the runoff, so get some more. DATA has at least | |
2170 SIZE bytes left of storage in it, so it's OK to read directly | |
2171 into it. (We'll be overwriting above, after we've decoded it | |
2172 into the runoff.) */ | |
2173 read_size = Lstream_read (str->other_end, data, size); | |
2174 if (read_size < 0) | |
2175 { | |
2176 error_occurred = 1; | |
2177 break; | |
2178 } | |
2179 if (read_size == 0) | |
2180 /* There might be some more end data produced in the translation. | |
2181 See the comment above. */ | |
2182 str->flags |= CODING_STATE_END; | |
2183 mule_decode (stream, (Extbyte *) data, str->runoff, read_size); | |
2184 } | |
2185 | |
2186 if (data - orig_data == 0) | |
2187 return error_occurred ? -1 : 0; | |
2188 else | |
2189 return data - orig_data; | |
2190 } | |
2191 | |
2192 static Bytecount | |
2193 decoding_writer (Lstream *stream, const unsigned char *data, Bytecount size) | |
2194 { | |
2195 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2196 Bytecount retval; | |
2197 | |
2198 /* Decode all our data into the runoff, and then attempt to write | |
2199 it all out to the other end. Remove whatever chunk we succeeded | |
2200 in writing. */ | |
2201 mule_decode (stream, (Extbyte *) data, str->runoff, size); | |
2202 retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0), | |
2203 Dynarr_length (str->runoff)); | |
2204 if (retval > 0) | |
2205 Dynarr_delete_many (str->runoff, 0, retval); | |
2206 /* Do NOT return retval. The return value indicates how much | |
2207 of the incoming data was written, not how many bytes were | |
2208 written. */ | |
2209 return size; | |
2210 } | |
2211 | |
2212 static void | |
2213 reset_decoding_stream (struct decoding_stream *str) | |
2214 { | |
2215 #ifdef MULE | |
2216 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_ISO2022) | |
2217 { | |
2218 Lisp_Object coding_system; | |
2219 XSETCODING_SYSTEM (coding_system, str->codesys); | |
2220 reset_iso2022 (coding_system, &str->iso2022); | |
2221 } | |
2222 else if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_CCL) | |
2223 { | |
2224 setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_DECODE (str->codesys)); | |
2225 } | |
2226 str->counter = 0; | |
2227 #endif /* MULE */ | |
2228 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT | |
2229 || CODING_SYSTEM_EOL_TYPE (str->codesys) == EOL_AUTODETECT) | |
2230 { | |
2231 xzero (str->decst); | |
2232 str->decst.eol_type = EOL_AUTODETECT; | |
2233 str->decst.mask = ~0; | |
2234 } | |
2235 str->flags = str->ch = 0; | |
2236 } | 4258 } |
2237 | 4259 |
2238 static int | 4260 static int |
2239 decoding_rewinder (Lstream *stream) | 4261 gzip_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) |
2240 { | 4262 { |
2241 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | 4263 struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip); |
2242 reset_decoding_stream (str); | 4264 |
2243 Dynarr_reset (str->runoff); | 4265 if (EQ (key, Qlevel)) |
2244 return Lstream_rewind (str->other_end); | 4266 { |
2245 } | 4267 if (EQ (value, Qdefault)) |
2246 | 4268 data->level = -1; |
2247 static int | |
2248 decoding_seekable_p (Lstream *stream) | |
2249 { | |
2250 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2251 return Lstream_seekable_p (str->other_end); | |
2252 } | |
2253 | |
2254 static int | |
2255 decoding_flusher (Lstream *stream) | |
2256 { | |
2257 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2258 return Lstream_flush (str->other_end); | |
2259 } | |
2260 | |
2261 static int | |
2262 decoding_closer (Lstream *stream) | |
2263 { | |
2264 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2265 if (stream->flags & LSTREAM_FL_WRITE) | |
2266 { | |
2267 str->flags |= CODING_STATE_END; | |
2268 decoding_writer (stream, 0, 0); | |
2269 } | |
2270 Dynarr_free (str->runoff); | |
2271 #ifdef MULE | |
2272 #ifdef ENABLE_COMPOSITE_CHARS | |
2273 if (str->iso2022.composite_chars) | |
2274 Dynarr_free (str->iso2022.composite_chars); | |
2275 #endif | |
2276 #endif | |
2277 return Lstream_close (str->other_end); | |
2278 } | |
2279 | |
2280 Lisp_Object | |
2281 decoding_stream_coding_system (Lstream *stream) | |
2282 { | |
2283 Lisp_Object coding_system; | |
2284 struct decoding_stream *str = DECODING_STREAM_DATA (stream); | |
2285 | |
2286 XSETCODING_SYSTEM (coding_system, str->codesys); | |
2287 return subsidiary_coding_system (coding_system, str->eol_type); | |
2288 } | |
2289 | |
2290 void | |
2291 set_decoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) | |
2292 { | |
2293 Lisp_Coding_System *cs = XCODING_SYSTEM (codesys); | |
2294 struct decoding_stream *str = DECODING_STREAM_DATA (lstr); | |
2295 str->codesys = cs; | |
2296 if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT) | |
2297 str->eol_type = CODING_SYSTEM_EOL_TYPE (cs); | |
2298 reset_decoding_stream (str); | |
2299 } | |
2300 | |
2301 /* WARNING WARNING WARNING WARNING!!!!! If you open up a decoding | |
2302 stream for writing, no automatic code detection will be performed. | |
2303 The reason for this is that automatic code detection requires a | |
2304 seekable input. Things will also fail if you open a decoding | |
2305 stream for reading using a non-fully-specified coding system and | |
2306 a non-seekable input stream. */ | |
2307 | |
2308 static Lisp_Object | |
2309 make_decoding_stream_1 (Lstream *stream, Lisp_Object codesys, | |
2310 const char *mode) | |
2311 { | |
2312 Lstream *lstr = Lstream_new (lstream_decoding, mode); | |
2313 struct decoding_stream *str = DECODING_STREAM_DATA (lstr); | |
2314 Lisp_Object obj; | |
2315 | |
2316 xzero (*str); | |
2317 str->other_end = stream; | |
2318 str->runoff = (unsigned_char_dynarr *) Dynarr_new (unsigned_char); | |
2319 str->eol_type = EOL_AUTODETECT; | |
2320 if (!strcmp (mode, "r") | |
2321 && Lstream_seekable_p (stream)) | |
2322 /* We can determine the coding system now. */ | |
2323 determine_real_coding_system (stream, &codesys, &str->eol_type); | |
2324 set_decoding_stream_coding_system (lstr, codesys); | |
2325 str->decst.eol_type = str->eol_type; | |
2326 str->decst.mask = ~0; | |
2327 XSETLSTREAM (obj, lstr); | |
2328 return obj; | |
2329 } | |
2330 | |
2331 Lisp_Object | |
2332 make_decoding_input_stream (Lstream *stream, Lisp_Object codesys) | |
2333 { | |
2334 return make_decoding_stream_1 (stream, codesys, "r"); | |
2335 } | |
2336 | |
2337 Lisp_Object | |
2338 make_decoding_output_stream (Lstream *stream, Lisp_Object codesys) | |
2339 { | |
2340 return make_decoding_stream_1 (stream, codesys, "w"); | |
2341 } | |
2342 | |
2343 /* Note: the decode_coding_* functions all take the same | |
2344 arguments as mule_decode(), which is to say some SRC data of | |
2345 size N, which is to be stored into dynamic array DST. | |
2346 DECODING is the stream within which the decoding is | |
2347 taking place, but no data is actually read from or | |
2348 written to that stream; that is handled in decoding_reader() | |
2349 or decoding_writer(). This allows the same functions to | |
2350 be used for both reading and writing. */ | |
2351 | |
2352 static void | |
2353 mule_decode (Lstream *decoding, const Extbyte *src, | |
2354 unsigned_char_dynarr *dst, Bytecount n) | |
2355 { | |
2356 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
2357 | |
2358 /* If necessary, do encoding-detection now. We do this when | |
2359 we're a writing stream or a non-seekable reading stream, | |
2360 meaning that we can't just process the whole input, | |
2361 rewind, and start over. */ | |
2362 | |
2363 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT || | |
2364 str->eol_type == EOL_AUTODETECT) | |
2365 { | |
2366 Lisp_Object codesys; | |
2367 | |
2368 XSETCODING_SYSTEM (codesys, str->codesys); | |
2369 detect_coding_type (&str->decst, src, n, | |
2370 CODING_SYSTEM_TYPE (str->codesys) != | |
2371 CODESYS_AUTODETECT); | |
2372 if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT && | |
2373 str->decst.mask != ~0) | |
2374 /* #### This is cheesy. What we really ought to do is | |
2375 buffer up a certain amount of data so as to get a | |
2376 less random result. */ | |
2377 codesys = coding_system_from_mask (str->decst.mask); | |
2378 str->eol_type = str->decst.eol_type; | |
2379 if (XCODING_SYSTEM (codesys) != str->codesys) | |
2380 { | |
2381 /* Preserve the CODING_STATE_END flag in case it was set. | |
2382 If we erase it, bad things might happen. */ | |
2383 int was_end = str->flags & CODING_STATE_END; | |
2384 set_decoding_stream_coding_system (decoding, codesys); | |
2385 if (was_end) | |
2386 str->flags |= CODING_STATE_END; | |
2387 } | |
2388 } | |
2389 | |
2390 switch (CODING_SYSTEM_TYPE (str->codesys)) | |
2391 { | |
2392 #ifdef DEBUG_XEMACS | |
2393 case CODESYS_INTERNAL: | |
2394 Dynarr_add_many (dst, src, n); | |
2395 break; | |
2396 #endif | |
2397 case CODESYS_AUTODETECT: | |
2398 /* If we got this far and still haven't decided on the coding | |
2399 system, then do no conversion. */ | |
2400 case CODESYS_NO_CONVERSION: | |
2401 decode_coding_no_conversion (decoding, src, dst, n); | |
2402 break; | |
2403 #ifdef MULE | |
2404 case CODESYS_SHIFT_JIS: | |
2405 decode_coding_sjis (decoding, src, dst, n); | |
2406 break; | |
2407 case CODESYS_BIG5: | |
2408 decode_coding_big5 (decoding, src, dst, n); | |
2409 break; | |
2410 case CODESYS_UCS4: | |
2411 decode_coding_ucs4 (decoding, src, dst, n); | |
2412 break; | |
2413 case CODESYS_UTF8: | |
2414 decode_coding_utf8 (decoding, src, dst, n); | |
2415 break; | |
2416 case CODESYS_CCL: | |
2417 str->ccl.last_block = str->flags & CODING_STATE_END; | |
2418 /* When applying ccl program to stream, MUST NOT set NULL | |
2419 pointer to src. */ | |
2420 ccl_driver (&str->ccl, (src ? (unsigned char *)src : (unsigned char*)""), | |
2421 dst, n, 0, CCL_MODE_DECODING); | |
2422 break; | |
2423 case CODESYS_ISO2022: | |
2424 decode_coding_iso2022 (decoding, src, dst, n); | |
2425 break; | |
2426 #endif /* MULE */ | |
2427 default: | |
2428 abort (); | |
2429 } | |
2430 } | |
2431 | |
2432 DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /* | |
2433 Decode the text between START and END which is encoded in CODING-SYSTEM. | |
2434 This is useful if you've read in encoded text from a file without decoding | |
2435 it (e.g. you read in a JIS-formatted file but used the `binary' or | |
2436 `no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B"). | |
2437 Return length of decoded text. | |
2438 BUFFER defaults to the current buffer if unspecified. | |
2439 */ | |
2440 (start, end, coding_system, buffer)) | |
2441 { | |
2442 Charbpos b, e; | |
2443 struct buffer *buf = decode_buffer (buffer, 0); | |
2444 Lisp_Object instream, lb_outstream, de_outstream, outstream; | |
2445 Lstream *istr, *ostr; | |
2446 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4; | |
2447 | |
2448 get_buffer_range_char (buf, start, end, &b, &e, 0); | |
2449 | |
2450 barf_if_buffer_read_only (buf, b, e); | |
2451 | |
2452 coding_system = Fget_coding_system (coding_system); | |
2453 instream = make_lisp_buffer_input_stream (buf, b, e, 0); | |
2454 lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); | |
2455 de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream), | |
2456 coding_system); | |
2457 outstream = make_encoding_output_stream (XLSTREAM (de_outstream), | |
2458 Fget_coding_system (Qbinary)); | |
2459 istr = XLSTREAM (instream); | |
2460 ostr = XLSTREAM (outstream); | |
2461 GCPRO4 (instream, lb_outstream, de_outstream, outstream); | |
2462 | |
2463 /* The chain of streams looks like this: | |
2464 | |
2465 [BUFFER] <----- send through | |
2466 ------> [ENCODE AS BINARY] | |
2467 ------> [DECODE AS SPECIFIED] | |
2468 ------> [BUFFER] | |
2469 */ | |
2470 | |
2471 while (1) | |
2472 { | |
2473 char tempbuf[1024]; /* some random amount */ | |
2474 Charbpos newpos, even_newer_pos; | |
2475 Charbpos oldpos = lisp_buffer_stream_startpos (istr); | |
2476 Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf)); | |
2477 | |
2478 if (!size_in_bytes) | |
2479 break; | |
2480 newpos = lisp_buffer_stream_startpos (istr); | |
2481 Lstream_write (ostr, tempbuf, size_in_bytes); | |
2482 even_newer_pos = lisp_buffer_stream_startpos (istr); | |
2483 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), | |
2484 even_newer_pos, 0); | |
2485 } | |
2486 Lstream_close (istr); | |
2487 Lstream_close (ostr); | |
2488 UNGCPRO; | |
2489 Lstream_delete (istr); | |
2490 Lstream_delete (ostr); | |
2491 Lstream_delete (XLSTREAM (de_outstream)); | |
2492 Lstream_delete (XLSTREAM (lb_outstream)); | |
2493 return Qnil; | |
2494 } | |
2495 | |
2496 | |
2497 /************************************************************************/ | |
2498 /* Converting to an external encoding ("encoding") */ | |
2499 /************************************************************************/ | |
2500 | |
2501 /* An encoding stream is an output stream. When you create the | |
2502 stream, you specify the coding system that governs the encoding | |
2503 and another stream that the resulting encoded data is to be | |
2504 sent to, and then start sending data to it. */ | |
2505 | |
2506 #define ENCODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, encoding) | |
2507 | |
2508 struct encoding_stream | |
2509 { | |
2510 /* Coding system that governs the conversion. */ | |
2511 Lisp_Coding_System *codesys; | |
2512 | |
2513 /* Stream that we read the encoded data from or | |
2514 write the decoded data to. */ | |
2515 Lstream *other_end; | |
2516 | |
2517 /* If we are reading, then we can return only a fixed amount of | |
2518 data, so if the conversion resulted in too much data, we store it | |
2519 here for retrieval the next time around. */ | |
2520 unsigned_char_dynarr *runoff; | |
2521 | |
2522 /* FLAGS holds flags indicating the current state of the encoding. | |
2523 Some of these flags are dependent on the coding system. */ | |
2524 unsigned int flags; | |
2525 | |
2526 /* CH holds a partially built-up character. Since we only deal | |
2527 with one- and two-byte characters at the moment, we only use | |
2528 this to store the first byte of a two-byte character. */ | |
2529 unsigned int ch; | |
2530 #ifdef MULE | |
2531 /* Additional information used by the ISO2022 encoder. */ | |
2532 struct | |
2533 { | |
2534 /* CHARSET holds the character sets currently assigned to the G0 | |
2535 through G3 registers. It is initialized from the array | |
2536 INITIAL_CHARSET in CODESYS. */ | |
2537 Lisp_Object charset[4]; | |
2538 | |
2539 /* Which registers are currently invoked into the left (GL) and | |
2540 right (GR) halves of the 8-bit encoding space? */ | |
2541 int register_left, register_right; | |
2542 | |
2543 /* Whether we need to explicitly designate the charset in the | |
2544 G? register before using it. It is initialized from the | |
2545 array FORCE_CHARSET_ON_OUTPUT in CODESYS. */ | |
2546 unsigned char force_charset_on_output[4]; | |
2547 | |
2548 /* Other state variables that need to be preserved across | |
2549 invocations. */ | |
2550 Lisp_Object current_charset; | |
2551 int current_half; | |
2552 int current_char_boundary; | |
2553 } iso2022; | |
2554 | |
2555 /* Additional information (the state of the running CCL program) | |
2556 used by the CCL encoder. */ | |
2557 struct ccl_program ccl; | |
2558 #endif /* MULE */ | |
2559 }; | |
2560 | |
2561 static Bytecount encoding_reader (Lstream *stream, unsigned char *data, Bytecount size); | |
2562 static Bytecount encoding_writer (Lstream *stream, const unsigned char *data, | |
2563 Bytecount size); | |
2564 static int encoding_rewinder (Lstream *stream); | |
2565 static int encoding_seekable_p (Lstream *stream); | |
2566 static int encoding_flusher (Lstream *stream); | |
2567 static int encoding_closer (Lstream *stream); | |
2568 | |
2569 static Lisp_Object encoding_marker (Lisp_Object stream); | |
2570 | |
2571 DEFINE_LSTREAM_IMPLEMENTATION ("encoding", lstream_encoding, | |
2572 sizeof (struct encoding_stream)); | |
2573 | |
2574 static Lisp_Object | |
2575 encoding_marker (Lisp_Object stream) | |
2576 { | |
2577 Lstream *str = ENCODING_STREAM_DATA (XLSTREAM (stream))->other_end; | |
2578 Lisp_Object str_obj; | |
2579 | |
2580 /* We do not need to mark the coding systems or charsets stored | |
2581 within the stream because they are stored in a global list | |
2582 and automatically marked. */ | |
2583 | |
2584 XSETLSTREAM (str_obj, str); | |
2585 mark_object (str_obj); | |
2586 if (str->imp->marker) | |
2587 return (str->imp->marker) (str_obj); | |
2588 else | |
2589 return Qnil; | |
2590 } | |
2591 | |
2592 /* Read SIZE bytes of data and store it into DATA. We are a encoding stream | |
2593 so we read data from the other end, encode it, and store it into DATA. */ | |
2594 | |
2595 static Bytecount | |
2596 encoding_reader (Lstream *stream, unsigned char *data, Bytecount size) | |
2597 { | |
2598 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2599 unsigned char *orig_data = data; | |
2600 Bytecount read_size; | |
2601 int error_occurred = 0; | |
2602 | |
2603 /* We need to interface to mule_encode(), which expects to take some | |
2604 amount of data and store the result into a Dynarr. We have | |
2605 mule_encode() store into str->runoff, and take data from there | |
2606 as necessary. */ | |
2607 | |
2608 /* We loop until we have enough data, reading chunks from the other | |
2609 end and encoding it. */ | |
2610 while (1) | |
2611 { | |
2612 /* Take data from the runoff if we can. Make sure to take at | |
2613 most SIZE bytes, and delete the data from the runoff. */ | |
2614 if (Dynarr_length (str->runoff) > 0) | |
2615 { | |
2616 int chunk = min ((int) size, Dynarr_length (str->runoff)); | |
2617 memcpy (data, Dynarr_atp (str->runoff, 0), chunk); | |
2618 Dynarr_delete_many (str->runoff, 0, chunk); | |
2619 data += chunk; | |
2620 size -= chunk; | |
2621 } | |
2622 | |
2623 if (size == 0) | |
2624 break; /* No more room for data */ | |
2625 | |
2626 if (str->flags & CODING_STATE_END) | |
2627 /* This means that on the previous iteration, we hit the EOF on | |
2628 the other end. We loop once more so that mule_encode() can | |
2629 output any final stuff it may be holding, or any "go back | |
2630 to a sane state" escape sequences. (This latter makes sense | |
2631 during encoding.) */ | |
2632 break; | |
2633 | |
2634 /* Exhausted the runoff, so get some more. DATA at least SIZE bytes | |
2635 left of storage in it, so it's OK to read directly into it. | |
2636 (We'll be overwriting above, after we've encoded it into the | |
2637 runoff.) */ | |
2638 read_size = Lstream_read (str->other_end, data, size); | |
2639 if (read_size < 0) | |
2640 { | |
2641 error_occurred = 1; | |
2642 break; | |
2643 } | |
2644 if (read_size == 0) | |
2645 /* There might be some more end data produced in the translation. | |
2646 See the comment above. */ | |
2647 str->flags |= CODING_STATE_END; | |
2648 mule_encode (stream, data, str->runoff, read_size); | |
2649 } | |
2650 | |
2651 if (data == orig_data) | |
2652 return error_occurred ? -1 : 0; | |
2653 else | |
2654 return data - orig_data; | |
2655 } | |
2656 | |
2657 static Bytecount | |
2658 encoding_writer (Lstream *stream, const unsigned char *data, Bytecount size) | |
2659 { | |
2660 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2661 Bytecount retval; | |
2662 | |
2663 /* Encode all our data into the runoff, and then attempt to write | |
2664 it all out to the other end. Remove whatever chunk we succeeded | |
2665 in writing. */ | |
2666 mule_encode (stream, data, str->runoff, size); | |
2667 retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0), | |
2668 Dynarr_length (str->runoff)); | |
2669 if (retval > 0) | |
2670 Dynarr_delete_many (str->runoff, 0, retval); | |
2671 /* Do NOT return retval. The return value indicates how much | |
2672 of the incoming data was written, not how many bytes were | |
2673 written. */ | |
2674 return size; | |
2675 } | |
2676 | |
2677 static void | |
2678 reset_encoding_stream (struct encoding_stream *str) | |
2679 { | |
2680 #ifdef MULE | |
2681 switch (CODING_SYSTEM_TYPE (str->codesys)) | |
2682 { | |
2683 case CODESYS_ISO2022: | |
2684 { | |
2685 int i; | |
2686 | |
2687 for (i = 0; i < 4; i++) | |
2688 { | |
2689 str->iso2022.charset[i] = | |
2690 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (str->codesys, i); | |
2691 str->iso2022.force_charset_on_output[i] = | |
2692 CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (str->codesys, i); | |
2693 } | |
2694 str->iso2022.register_left = 0; | |
2695 str->iso2022.register_right = 1; | |
2696 str->iso2022.current_charset = Qnil; | |
2697 str->iso2022.current_half = 0; | |
2698 str->iso2022.current_char_boundary = 1; | |
2699 break; | |
2700 } | |
2701 case CODESYS_CCL: | |
2702 setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_ENCODE (str->codesys)); | |
2703 break; | |
2704 default: | |
2705 break; | |
2706 } | |
2707 #endif /* MULE */ | |
2708 | |
2709 str->flags = str->ch = 0; | |
2710 } | |
2711 | |
2712 static int | |
2713 encoding_rewinder (Lstream *stream) | |
2714 { | |
2715 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2716 reset_encoding_stream (str); | |
2717 Dynarr_reset (str->runoff); | |
2718 return Lstream_rewind (str->other_end); | |
2719 } | |
2720 | |
2721 static int | |
2722 encoding_seekable_p (Lstream *stream) | |
2723 { | |
2724 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2725 return Lstream_seekable_p (str->other_end); | |
2726 } | |
2727 | |
2728 static int | |
2729 encoding_flusher (Lstream *stream) | |
2730 { | |
2731 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2732 return Lstream_flush (str->other_end); | |
2733 } | |
2734 | |
2735 static int | |
2736 encoding_closer (Lstream *stream) | |
2737 { | |
2738 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2739 if (stream->flags & LSTREAM_FL_WRITE) | |
2740 { | |
2741 str->flags |= CODING_STATE_END; | |
2742 encoding_writer (stream, 0, 0); | |
2743 } | |
2744 Dynarr_free (str->runoff); | |
2745 return Lstream_close (str->other_end); | |
2746 } | |
2747 | |
2748 Lisp_Object | |
2749 encoding_stream_coding_system (Lstream *stream) | |
2750 { | |
2751 Lisp_Object coding_system; | |
2752 struct encoding_stream *str = ENCODING_STREAM_DATA (stream); | |
2753 | |
2754 XSETCODING_SYSTEM (coding_system, str->codesys); | |
2755 return coding_system; | |
2756 } | |
2757 | |
2758 void | |
2759 set_encoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) | |
2760 { | |
2761 Lisp_Coding_System *cs = XCODING_SYSTEM (codesys); | |
2762 struct encoding_stream *str = ENCODING_STREAM_DATA (lstr); | |
2763 str->codesys = cs; | |
2764 reset_encoding_stream (str); | |
2765 } | |
2766 | |
2767 static Lisp_Object | |
2768 make_encoding_stream_1 (Lstream *stream, Lisp_Object codesys, | |
2769 const char *mode) | |
2770 { | |
2771 Lstream *lstr = Lstream_new (lstream_encoding, mode); | |
2772 struct encoding_stream *str = ENCODING_STREAM_DATA (lstr); | |
2773 Lisp_Object obj; | |
2774 | |
2775 xzero (*str); | |
2776 str->runoff = Dynarr_new (unsigned_char); | |
2777 str->other_end = stream; | |
2778 set_encoding_stream_coding_system (lstr, codesys); | |
2779 XSETLSTREAM (obj, lstr); | |
2780 return obj; | |
2781 } | |
2782 | |
2783 Lisp_Object | |
2784 make_encoding_input_stream (Lstream *stream, Lisp_Object codesys) | |
2785 { | |
2786 return make_encoding_stream_1 (stream, codesys, "r"); | |
2787 } | |
2788 | |
2789 Lisp_Object | |
2790 make_encoding_output_stream (Lstream *stream, Lisp_Object codesys) | |
2791 { | |
2792 return make_encoding_stream_1 (stream, codesys, "w"); | |
2793 } | |
2794 | |
2795 /* Convert N bytes of internally-formatted data stored in SRC to an | |
2796 external format, according to the encoding stream ENCODING. | |
2797 Store the encoded data into DST. */ | |
2798 | |
2799 static void | |
2800 mule_encode (Lstream *encoding, const Intbyte *src, | |
2801 unsigned_char_dynarr *dst, Bytecount n) | |
2802 { | |
2803 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
2804 | |
2805 switch (CODING_SYSTEM_TYPE (str->codesys)) | |
2806 { | |
2807 #ifdef DEBUG_XEMACS | |
2808 case CODESYS_INTERNAL: | |
2809 Dynarr_add_many (dst, src, n); | |
2810 break; | |
2811 #endif | |
2812 case CODESYS_AUTODETECT: | |
2813 /* If we got this far and still haven't decided on the coding | |
2814 system, then do no conversion. */ | |
2815 case CODESYS_NO_CONVERSION: | |
2816 encode_coding_no_conversion (encoding, src, dst, n); | |
2817 break; | |
2818 #ifdef MULE | |
2819 case CODESYS_SHIFT_JIS: | |
2820 encode_coding_sjis (encoding, src, dst, n); | |
2821 break; | |
2822 case CODESYS_BIG5: | |
2823 encode_coding_big5 (encoding, src, dst, n); | |
2824 break; | |
2825 case CODESYS_UCS4: | |
2826 encode_coding_ucs4 (encoding, src, dst, n); | |
2827 break; | |
2828 case CODESYS_UTF8: | |
2829 encode_coding_utf8 (encoding, src, dst, n); | |
2830 break; | |
2831 case CODESYS_CCL: | |
2832 str->ccl.last_block = str->flags & CODING_STATE_END; | |
2833 /* When applying ccl program to stream, MUST NOT set NULL | |
2834 pointer to src. */ | |
2835 ccl_driver (&str->ccl, ((src) ? src : (unsigned char*)""), | |
2836 dst, n, 0, CCL_MODE_ENCODING); | |
2837 break; | |
2838 case CODESYS_ISO2022: | |
2839 encode_coding_iso2022 (encoding, src, dst, n); | |
2840 break; | |
2841 #endif /* MULE */ | |
2842 default: | |
2843 abort (); | |
2844 } | |
2845 } | |
2846 | |
2847 DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /* | |
2848 Encode the text between START and END using CODING-SYSTEM. | |
2849 This will, for example, convert Japanese characters into stuff such as | |
2850 "^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded | |
2851 text. BUFFER defaults to the current buffer if unspecified. | |
2852 */ | |
2853 (start, end, coding_system, buffer)) | |
2854 { | |
2855 Charbpos b, e; | |
2856 struct buffer *buf = decode_buffer (buffer, 0); | |
2857 Lisp_Object instream, lb_outstream, de_outstream, outstream; | |
2858 Lstream *istr, *ostr; | |
2859 struct gcpro gcpro1, gcpro2, gcpro3, gcpro4; | |
2860 | |
2861 get_buffer_range_char (buf, start, end, &b, &e, 0); | |
2862 | |
2863 barf_if_buffer_read_only (buf, b, e); | |
2864 | |
2865 coding_system = Fget_coding_system (coding_system); | |
2866 instream = make_lisp_buffer_input_stream (buf, b, e, 0); | |
2867 lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); | |
2868 de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream), | |
2869 Fget_coding_system (Qbinary)); | |
2870 outstream = make_encoding_output_stream (XLSTREAM (de_outstream), | |
2871 coding_system); | |
2872 istr = XLSTREAM (instream); | |
2873 ostr = XLSTREAM (outstream); | |
2874 GCPRO4 (instream, outstream, de_outstream, lb_outstream); | |
2875 /* The chain of streams looks like this: | |
2876 | |
2877 [BUFFER] <----- send through | |
2878 ------> [ENCODE AS SPECIFIED] | |
2879 ------> [DECODE AS BINARY] | |
2880 ------> [BUFFER] | |
2881 */ | |
2882 while (1) | |
2883 { | |
2884 char tempbuf[1024]; /* some random amount */ | |
2885 Charbpos newpos, even_newer_pos; | |
2886 Charbpos oldpos = lisp_buffer_stream_startpos (istr); | |
2887 Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf)); | |
2888 | |
2889 if (!size_in_bytes) | |
2890 break; | |
2891 newpos = lisp_buffer_stream_startpos (istr); | |
2892 Lstream_write (ostr, tempbuf, size_in_bytes); | |
2893 even_newer_pos = lisp_buffer_stream_startpos (istr); | |
2894 buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), | |
2895 even_newer_pos, 0); | |
2896 } | |
2897 | |
2898 { | |
2899 Charcount retlen = | |
2900 lisp_buffer_stream_startpos (XLSTREAM (instream)) - b; | |
2901 Lstream_close (istr); | |
2902 Lstream_close (ostr); | |
2903 UNGCPRO; | |
2904 Lstream_delete (istr); | |
2905 Lstream_delete (ostr); | |
2906 Lstream_delete (XLSTREAM (de_outstream)); | |
2907 Lstream_delete (XLSTREAM (lb_outstream)); | |
2908 return make_int (retlen); | |
2909 } | |
2910 } | |
2911 | |
2912 #ifdef MULE | |
2913 | |
2914 /************************************************************************/ | |
2915 /* Shift-JIS methods */ | |
2916 /************************************************************************/ | |
2917 | |
2918 /* Shift-JIS is a coding system encoding three character sets: ASCII, right | |
2919 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded | |
2920 as is. A character of JISX0201-Kana (DIMENSION1_CHARS94 character set) is | |
2921 encoded by "position-code + 0x80". A character of JISX0208 | |
2922 (DIMENSION2_CHARS94 character set) is encoded in 2-byte but two | |
2923 position-codes are divided and shifted so that it fit in the range | |
2924 below. | |
2925 | |
2926 --- CODE RANGE of Shift-JIS --- | |
2927 (character set) (range) | |
2928 ASCII 0x00 .. 0x7F | |
2929 JISX0201-Kana 0xA0 .. 0xDF | |
2930 JISX0208 (1st byte) 0x80 .. 0x9F and 0xE0 .. 0xEF | |
2931 (2nd byte) 0x40 .. 0x7E and 0x80 .. 0xFC | |
2932 ------------------------------- | |
2933 | |
2934 */ | |
2935 | |
2936 /* Is this the first byte of a Shift-JIS two-byte char? */ | |
2937 | |
2938 #define BYTE_SJIS_TWO_BYTE_1_P(c) \ | |
2939 (((c) >= 0x81 && (c) <= 0x9F) || ((c) >= 0xE0 && (c) <= 0xEF)) | |
2940 | |
2941 /* Is this the second byte of a Shift-JIS two-byte char? */ | |
2942 | |
2943 #define BYTE_SJIS_TWO_BYTE_2_P(c) \ | |
2944 (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0x80 && (c) <= 0xFC)) | |
2945 | |
2946 #define BYTE_SJIS_KATAKANA_P(c) \ | |
2947 ((c) >= 0xA1 && (c) <= 0xDF) | |
2948 | |
2949 static int | |
2950 detect_coding_sjis (struct detection_state *st, const Extbyte *src, Bytecount n) | |
2951 { | |
2952 while (n--) | |
2953 { | |
2954 unsigned char c = *(unsigned char *)src++; | |
2955 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO) | |
2956 return 0; | |
2957 if (st->shift_jis.in_second_byte) | |
2958 { | |
2959 st->shift_jis.in_second_byte = 0; | |
2960 if (c < 0x40) | |
2961 return 0; | |
2962 } | |
2963 else if ((c >= 0x80 && c < 0xA0) || c >= 0xE0) | |
2964 st->shift_jis.in_second_byte = 1; | |
2965 } | |
2966 return CODING_CATEGORY_SHIFT_JIS_MASK; | |
2967 } | |
2968 | |
2969 /* Convert Shift-JIS data to internal format. */ | |
2970 | |
2971 static void | |
2972 decode_coding_sjis (Lstream *decoding, const Extbyte *src, | |
2973 unsigned_char_dynarr *dst, Bytecount n) | |
2974 { | |
2975 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
2976 unsigned int flags = str->flags; | |
2977 unsigned int ch = str->ch; | |
2978 eol_type_t eol_type = str->eol_type; | |
2979 | |
2980 while (n--) | |
2981 { | |
2982 unsigned char c = *(unsigned char *)src++; | |
2983 | |
2984 if (ch) | |
2985 { | |
2986 /* Previous character was first byte of Shift-JIS Kanji char. */ | |
2987 if (BYTE_SJIS_TWO_BYTE_2_P (c)) | |
2988 { | |
2989 unsigned char e1, e2; | |
2990 | |
2991 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); | |
2992 DECODE_SJIS (ch, c, e1, e2); | |
2993 Dynarr_add (dst, e1); | |
2994 Dynarr_add (dst, e2); | |
2995 } | |
2996 else | |
2997 { | |
2998 DECODE_ADD_BINARY_CHAR (ch, dst); | |
2999 DECODE_ADD_BINARY_CHAR (c, dst); | |
3000 } | |
3001 ch = 0; | |
3002 } | |
3003 else | 4269 else |
3004 { | 4270 { |
3005 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | 4271 CHECK_INT (value); |
3006 if (BYTE_SJIS_TWO_BYTE_1_P (c)) | 4272 check_int_range (XINT (value), 0, 9); |
3007 ch = c; | 4273 data->level = XINT (value); |
3008 else if (BYTE_SJIS_KATAKANA_P (c)) | |
3009 { | |
3010 Dynarr_add (dst, LEADING_BYTE_KATAKANA_JISX0201); | |
3011 Dynarr_add (dst, c); | |
3012 } | |
3013 else | |
3014 DECODE_ADD_BINARY_CHAR (c, dst); | |
3015 } | 4274 } |
3016 label_continue_loop:; | 4275 } |
3017 } | 4276 else |
3018 | 4277 return 0; |
3019 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); | 4278 return 1; |
3020 | 4279 } |
3021 str->flags = flags; | 4280 |
3022 str->ch = ch; | 4281 static Lisp_Object |
3023 } | 4282 gzip_getprop (Lisp_Object coding_system, Lisp_Object prop) |
3024 | 4283 { |
3025 /* Convert internally-formatted data to Shift-JIS. */ | 4284 struct gzip_coding_system *data = |
4285 XCODING_SYSTEM_TYPE_DATA (coding_system, gzip); | |
4286 | |
4287 if (EQ (prop, Qlevel)) | |
4288 { | |
4289 if (data->level == -1) | |
4290 return Qdefault; | |
4291 return make_int (data->level); | |
4292 } | |
4293 | |
4294 return Qunbound; | |
4295 } | |
3026 | 4296 |
3027 static void | 4297 static void |
3028 encode_coding_sjis (Lstream *encoding, const Intbyte *src, | 4298 gzip_init_coding_stream (struct coding_stream *str) |
3029 unsigned_char_dynarr *dst, Bytecount n) | 4299 { |
3030 { | 4300 struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip); |
3031 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | 4301 if (data->stream_initted) |
3032 unsigned int flags = str->flags; | 4302 { |
3033 unsigned int ch = str->ch; | 4303 if (str->direction == CODING_DECODE) |
3034 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); | 4304 inflateEnd (&data->stream); |
3035 | 4305 else |
3036 while (n--) | 4306 deflateEnd (&data->stream); |
3037 { | 4307 data->stream_initted = 0; |
3038 Intbyte c = *src++; | 4308 } |
3039 if (c == '\n') | 4309 data->reached_eof = 0; |
4310 } | |
4311 | |
4312 static void | |
4313 gzip_rewind_coding_stream (struct coding_stream *str) | |
4314 { | |
4315 gzip_init_coding_stream (str); | |
4316 } | |
4317 | |
4318 static Bytecount | |
4319 gzip_convert (struct coding_stream *str, | |
4320 const UExtbyte *src, | |
4321 unsigned_char_dynarr *dst, Bytecount n) | |
4322 { | |
4323 struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip); | |
4324 int zerr; | |
4325 if (str->direction == CODING_DECODE) | |
4326 { | |
4327 if (data->reached_eof) | |
4328 return n; /* eat the data */ | |
4329 | |
4330 if (!data->stream_initted) | |
3040 { | 4331 { |
3041 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) | 4332 xzero (data->stream); |
3042 Dynarr_add (dst, '\r'); | 4333 if (inflateInit (&data->stream) != Z_OK) |
3043 if (eol_type != EOL_CR) | 4334 return LSTREAM_ERROR; |
3044 Dynarr_add (dst, '\n'); | 4335 data->stream_initted = 1; |
3045 ch = 0; | |
3046 } | 4336 } |
3047 else if (BYTE_ASCII_P (c)) | 4337 |
4338 data->stream.next_in = (Bytef *) src; | |
4339 data->stream.avail_in = n; | |
4340 | |
4341 /* Normally we stop when we've fed all data to the decompressor; but | |
4342 if we're at the end of the input, and the decompressor hasn't | |
4343 reported EOF, we need to keep going, as there might be more output | |
4344 to generate. Z_OK from the decompressor means input was processed | |
4345 or output was generated; if neither, we break out of the loop. | |
4346 Other return values are: | |
4347 | |
4348 Z_STREAM_END EOF from decompressor | |
4349 Z_DATA_ERROR Corrupted data | |
4350 Z_BUF_ERROR No progress possible (this should happen if | |
4351 we try to feed it an incomplete file) | |
4352 Z_MEM_ERROR Out of memory | |
4353 Z_STREAM_ERROR (should never happen) | |
4354 Z_NEED_DICT (#### when will this happen?) | |
4355 */ | |
4356 while (data->stream.avail_in > 0 || str->eof) | |
3048 { | 4357 { |
3049 Dynarr_add (dst, c); | 4358 /* Reserve an output buffer of the same size as the input buffer; |
3050 ch = 0; | 4359 if that's not enough, we keep reserving the same size. */ |
4360 Bytecount reserved = n; | |
4361 Dynarr_add_many (dst, 0, reserved); | |
4362 /* Careful here! Don't retrieve the pointer until after | |
4363 reserving the space, or it might be bogus */ | |
4364 data->stream.next_out = | |
4365 Dynarr_atp (dst, Dynarr_length (dst) - reserved); | |
4366 data->stream.avail_out = reserved; | |
4367 zerr = inflate (&data->stream, Z_NO_FLUSH); | |
4368 /* Lop off the unused portion */ | |
4369 Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out); | |
4370 if (zerr != Z_OK) | |
4371 break; | |
3051 } | 4372 } |
3052 else if (INTBYTE_LEADING_BYTE_P (c)) | 4373 |
3053 ch = (c == LEADING_BYTE_KATAKANA_JISX0201 || | 4374 if (zerr == Z_STREAM_END) |
3054 c == LEADING_BYTE_JAPANESE_JISX0208_1978 || | 4375 data->reached_eof = 1; |
3055 c == LEADING_BYTE_JAPANESE_JISX0208) ? c : 0; | 4376 |
3056 else if (ch) | 4377 if ((Bytecount) data->stream.avail_in < n) |
4378 return n - data->stream.avail_in; | |
4379 | |
4380 if (zerr == Z_OK || zerr == Z_STREAM_END) | |
4381 return 0; | |
4382 | |
4383 return LSTREAM_ERROR; | |
4384 } | |
4385 else | |
4386 { | |
4387 if (!data->stream_initted) | |
3057 { | 4388 { |
3058 if (ch == LEADING_BYTE_KATAKANA_JISX0201) | 4389 int level = XCODING_SYSTEM_GZIP_LEVEL (str->codesys); |
3059 { | 4390 xzero (data->stream); |
3060 Dynarr_add (dst, c); | 4391 if (deflateInit (&data->stream, |
3061 ch = 0; | 4392 level == -1 ? Z_DEFAULT_COMPRESSION : level) != |
3062 } | 4393 Z_OK) |
3063 else if (ch == LEADING_BYTE_JAPANESE_JISX0208_1978 || | 4394 return LSTREAM_ERROR; |
3064 ch == LEADING_BYTE_JAPANESE_JISX0208) | 4395 data->stream_initted = 1; |
3065 ch = c; | |
3066 else | |
3067 { | |
3068 unsigned char j1, j2; | |
3069 ENCODE_SJIS (ch, c, j1, j2); | |
3070 Dynarr_add (dst, j1); | |
3071 Dynarr_add (dst, j2); | |
3072 ch = 0; | |
3073 } | |
3074 } | 4396 } |
3075 } | 4397 |
3076 | 4398 data->stream.next_in = (Bytef *) src; |
3077 str->flags = flags; | 4399 data->stream.avail_in = n; |
3078 str->ch = ch; | 4400 |
3079 } | 4401 /* Normally we stop when we've fed all data to the compressor; but if |
3080 | 4402 we're at the end of the input, and the compressor hasn't reported |
3081 DEFUN ("decode-shift-jis-char", Fdecode_shift_jis_char, 1, 1, 0, /* | 4403 EOF, we need to keep going, as there might be more output to |
3082 Decode a JISX0208 character of Shift-JIS coding-system. | 4404 generate. (To signal EOF on our end, we set the FLUSH parameter |
3083 CODE is the character code in Shift-JIS as a cons of type bytes. | 4405 to Z_FINISH; when all data is output, Z_STREAM_END will be |
3084 Return the corresponding character. | 4406 returned.) Z_OK from the compressor means input was processed or |
3085 */ | 4407 output was generated; if neither, we break out of the loop. Other |
3086 (code)) | 4408 return values are: |
3087 { | 4409 |
3088 unsigned char c1, c2, s1, s2; | 4410 Z_STREAM_END EOF from compressor |
3089 | 4411 Z_BUF_ERROR No progress possible (should never happen) |
3090 CHECK_CONS (code); | 4412 Z_STREAM_ERROR (should never happen) |
3091 CHECK_INT (XCAR (code)); | 4413 */ |
3092 CHECK_INT (XCDR (code)); | 4414 while (data->stream.avail_in > 0 || str->eof) |
3093 s1 = XINT (XCAR (code)); | 4415 { |
3094 s2 = XINT (XCDR (code)); | 4416 /* Reserve an output buffer of the same size as the input buffer; |
3095 if (BYTE_SJIS_TWO_BYTE_1_P (s1) && | 4417 if that's not enough, we keep reserving the same size. */ |
3096 BYTE_SJIS_TWO_BYTE_2_P (s2)) | 4418 Bytecount reserved = n; |
3097 { | 4419 Dynarr_add_many (dst, 0, reserved); |
3098 DECODE_SJIS (s1, s2, c1, c2); | 4420 /* Careful here! Don't retrieve the pointer until after |
3099 return make_char (MAKE_CHAR (Vcharset_japanese_jisx0208, | 4421 reserving the space, or it might be bogus */ |
3100 c1 & 0x7F, c2 & 0x7F)); | 4422 data->stream.next_out = |
3101 } | 4423 Dynarr_atp (dst, Dynarr_length (dst) - reserved); |
3102 else | 4424 data->stream.avail_out = reserved; |
3103 return Qnil; | 4425 zerr = |
3104 } | 4426 deflate (&data->stream, |
3105 | 4427 str->eof ? Z_FINISH : Z_NO_FLUSH); |
3106 DEFUN ("encode-shift-jis-char", Fencode_shift_jis_char, 1, 1, 0, /* | 4428 /* Lop off the unused portion */ |
3107 Encode a JISX0208 character CHARACTER to SHIFT-JIS coding-system. | 4429 Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out); |
3108 Return the corresponding character code in SHIFT-JIS as a cons of two bytes. | 4430 if (zerr != Z_OK) |
3109 */ | 4431 break; |
3110 (character)) | 4432 } |
3111 { | 4433 |
3112 Lisp_Object charset; | 4434 if ((Bytecount) data->stream.avail_in < n) |
3113 int c1, c2, s1, s2; | 4435 return n - data->stream.avail_in; |
3114 | 4436 |
3115 CHECK_CHAR_COERCE_INT (character); | 4437 if (zerr == Z_OK || zerr == Z_STREAM_END) |
3116 BREAKUP_CHAR (XCHAR (character), charset, c1, c2); | 4438 return 0; |
3117 if (EQ (charset, Vcharset_japanese_jisx0208)) | 4439 |
3118 { | 4440 return LSTREAM_ERROR; |
3119 ENCODE_SJIS (c1 | 0x80, c2 | 0x80, s1, s2); | 4441 } |
3120 return Fcons (make_int (s1), make_int (s2)); | 4442 } |
3121 } | 4443 |
3122 else | 4444 #endif /* HAVE_ZLIB */ |
3123 return Qnil; | |
3124 } | |
3125 | 4445 |
3126 | 4446 |
3127 /************************************************************************/ | |
3128 /* Big5 methods */ | |
3129 /************************************************************************/ | |
3130 | |
3131 /* BIG5 is a coding system encoding two character sets: ASCII and | |
3132 Big5. An ASCII character is encoded as is. Big5 is a two-byte | |
3133 character set and is encoded in two-byte. | |
3134 | |
3135 --- CODE RANGE of BIG5 --- | |
3136 (character set) (range) | |
3137 ASCII 0x00 .. 0x7F | |
3138 Big5 (1st byte) 0xA1 .. 0xFE | |
3139 (2nd byte) 0x40 .. 0x7E and 0xA1 .. 0xFE | |
3140 -------------------------- | |
3141 | |
3142 Since the number of characters in Big5 is larger than maximum | |
3143 characters in Emacs' charset (96x96), it can't be handled as one | |
3144 charset. So, in Emacs, Big5 is divided into two: `charset-big5-1' | |
3145 and `charset-big5-2'. Both <type>s are DIMENSION2_CHARS94. The former | |
3146 contains frequently used characters and the latter contains less | |
3147 frequently used characters. */ | |
3148 | |
3149 #define BYTE_BIG5_TWO_BYTE_1_P(c) \ | |
3150 ((c) >= 0xA1 && (c) <= 0xFE) | |
3151 | |
3152 /* Is this the second byte of a Shift-JIS two-byte char? */ | |
3153 | |
3154 #define BYTE_BIG5_TWO_BYTE_2_P(c) \ | |
3155 (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0xA1 && (c) <= 0xFE)) | |
3156 | |
3157 /* Number of Big5 characters which have the same code in 1st byte. */ | |
3158 | |
3159 #define BIG5_SAME_ROW (0xFF - 0xA1 + 0x7F - 0x40) | |
3160 | |
3161 /* Code conversion macros. These are macros because they are used in | |
3162 inner loops during code conversion. | |
3163 | |
3164 Note that temporary variables in macros introduce the classic | |
3165 dynamic-scoping problems with variable names. We use capital- | |
3166 lettered variables in the assumption that XEmacs does not use | |
3167 capital letters in variables except in a very formalized way | |
3168 (e.g. Qstring). */ | |
3169 | |
3170 /* Convert Big5 code (b1, b2) into its internal string representation | |
3171 (lb, c1, c2). */ | |
3172 | |
3173 /* There is a much simpler way to split the Big5 charset into two. | |
3174 For the moment I'm going to leave the algorithm as-is because it | |
3175 claims to separate out the most-used characters into a single | |
3176 charset, which perhaps will lead to optimizations in various | |
3177 places. | |
3178 | |
3179 The way the algorithm works is something like this: | |
3180 | |
3181 Big5 can be viewed as a 94x157 charset, where the row is | |
3182 encoded into the bytes 0xA1 .. 0xFE and the column is encoded | |
3183 into the bytes 0x40 .. 0x7E and 0xA1 .. 0xFE. As for frequency, | |
3184 the split between low and high column numbers is apparently | |
3185 meaningless; ascending rows produce less and less frequent chars. | |
3186 Therefore, we assign the lower half of rows (0xA1 .. 0xC8) to | |
3187 the first charset, and the upper half (0xC9 .. 0xFE) to the | |
3188 second. To do the conversion, we convert the character into | |
3189 a single number where 0 .. 156 is the first row, 157 .. 313 | |
3190 is the second, etc. That way, the characters are ordered by | |
3191 decreasing frequency. Then we just chop the space in two | |
3192 and coerce the result into a 94x94 space. | |
3193 */ | |
3194 | |
3195 #define DECODE_BIG5(b1, b2, lb, c1, c2) do \ | |
3196 { \ | |
3197 int B1 = b1, B2 = b2; \ | |
3198 int I \ | |
3199 = (B1 - 0xA1) * BIG5_SAME_ROW + B2 - (B2 < 0x7F ? 0x40 : 0x62); \ | |
3200 \ | |
3201 if (B1 < 0xC9) \ | |
3202 { \ | |
3203 lb = LEADING_BYTE_CHINESE_BIG5_1; \ | |
3204 } \ | |
3205 else \ | |
3206 { \ | |
3207 lb = LEADING_BYTE_CHINESE_BIG5_2; \ | |
3208 I -= (BIG5_SAME_ROW) * (0xC9 - 0xA1); \ | |
3209 } \ | |
3210 c1 = I / (0xFF - 0xA1) + 0xA1; \ | |
3211 c2 = I % (0xFF - 0xA1) + 0xA1; \ | |
3212 } while (0) | |
3213 | |
3214 /* Convert the internal string representation of a Big5 character | |
3215 (lb, c1, c2) into Big5 code (b1, b2). */ | |
3216 | |
3217 #define ENCODE_BIG5(lb, c1, c2, b1, b2) do \ | |
3218 { \ | |
3219 int I = ((c1) - 0xA1) * (0xFF - 0xA1) + ((c2) - 0xA1); \ | |
3220 \ | |
3221 if (lb == LEADING_BYTE_CHINESE_BIG5_2) \ | |
3222 { \ | |
3223 I += BIG5_SAME_ROW * (0xC9 - 0xA1); \ | |
3224 } \ | |
3225 b1 = I / BIG5_SAME_ROW + 0xA1; \ | |
3226 b2 = I % BIG5_SAME_ROW; \ | |
3227 b2 += b2 < 0x3F ? 0x40 : 0x62; \ | |
3228 } while (0) | |
3229 | |
3230 static int | |
3231 detect_coding_big5 (struct detection_state *st, const Extbyte *src, Bytecount n) | |
3232 { | |
3233 while (n--) | |
3234 { | |
3235 unsigned char c = *(unsigned char *)src++; | |
3236 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO || | |
3237 (c >= 0x80 && c <= 0xA0)) | |
3238 return 0; | |
3239 if (st->big5.in_second_byte) | |
3240 { | |
3241 st->big5.in_second_byte = 0; | |
3242 if (c < 0x40 || (c >= 0x80 && c <= 0xA0)) | |
3243 return 0; | |
3244 } | |
3245 else if (c >= 0xA1) | |
3246 st->big5.in_second_byte = 1; | |
3247 } | |
3248 return CODING_CATEGORY_BIG5_MASK; | |
3249 } | |
3250 | |
3251 /* Convert Big5 data to internal format. */ | |
3252 | |
3253 static void | |
3254 decode_coding_big5 (Lstream *decoding, const Extbyte *src, | |
3255 unsigned_char_dynarr *dst, Bytecount n) | |
3256 { | |
3257 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
3258 unsigned int flags = str->flags; | |
3259 unsigned int ch = str->ch; | |
3260 eol_type_t eol_type = str->eol_type; | |
3261 | |
3262 while (n--) | |
3263 { | |
3264 unsigned char c = *(unsigned char *)src++; | |
3265 if (ch) | |
3266 { | |
3267 /* Previous character was first byte of Big5 char. */ | |
3268 if (BYTE_BIG5_TWO_BYTE_2_P (c)) | |
3269 { | |
3270 unsigned char b1, b2, b3; | |
3271 DECODE_BIG5 (ch, c, b1, b2, b3); | |
3272 Dynarr_add (dst, b1); | |
3273 Dynarr_add (dst, b2); | |
3274 Dynarr_add (dst, b3); | |
3275 } | |
3276 else | |
3277 { | |
3278 DECODE_ADD_BINARY_CHAR (ch, dst); | |
3279 DECODE_ADD_BINARY_CHAR (c, dst); | |
3280 } | |
3281 ch = 0; | |
3282 } | |
3283 else | |
3284 { | |
3285 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | |
3286 if (BYTE_BIG5_TWO_BYTE_1_P (c)) | |
3287 ch = c; | |
3288 else | |
3289 DECODE_ADD_BINARY_CHAR (c, dst); | |
3290 } | |
3291 label_continue_loop:; | |
3292 } | |
3293 | |
3294 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); | |
3295 | |
3296 str->flags = flags; | |
3297 str->ch = ch; | |
3298 } | |
3299 | |
3300 /* Convert internally-formatted data to Big5. */ | |
3301 | |
3302 static void | |
3303 encode_coding_big5 (Lstream *encoding, const Intbyte *src, | |
3304 unsigned_char_dynarr *dst, Bytecount n) | |
3305 { | |
3306 unsigned char c; | |
3307 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
3308 unsigned int flags = str->flags; | |
3309 unsigned int ch = str->ch; | |
3310 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); | |
3311 | |
3312 while (n--) | |
3313 { | |
3314 c = *src++; | |
3315 if (c == '\n') | |
3316 { | |
3317 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) | |
3318 Dynarr_add (dst, '\r'); | |
3319 if (eol_type != EOL_CR) | |
3320 Dynarr_add (dst, '\n'); | |
3321 } | |
3322 else if (BYTE_ASCII_P (c)) | |
3323 { | |
3324 /* ASCII. */ | |
3325 Dynarr_add (dst, c); | |
3326 } | |
3327 else if (INTBYTE_LEADING_BYTE_P (c)) | |
3328 { | |
3329 if (c == LEADING_BYTE_CHINESE_BIG5_1 || | |
3330 c == LEADING_BYTE_CHINESE_BIG5_2) | |
3331 { | |
3332 /* A recognized leading byte. */ | |
3333 ch = c; | |
3334 continue; /* not done with this character. */ | |
3335 } | |
3336 /* otherwise just ignore this character. */ | |
3337 } | |
3338 else if (ch == LEADING_BYTE_CHINESE_BIG5_1 || | |
3339 ch == LEADING_BYTE_CHINESE_BIG5_2) | |
3340 { | |
3341 /* Previous char was a recognized leading byte. */ | |
3342 ch = (ch << 8) | c; | |
3343 continue; /* not done with this character. */ | |
3344 } | |
3345 else if (ch) | |
3346 { | |
3347 /* Encountering second byte of a Big5 character. */ | |
3348 unsigned char b1, b2; | |
3349 | |
3350 ENCODE_BIG5 (ch >> 8, ch & 0xFF, c, b1, b2); | |
3351 Dynarr_add (dst, b1); | |
3352 Dynarr_add (dst, b2); | |
3353 } | |
3354 | |
3355 ch = 0; | |
3356 } | |
3357 | |
3358 str->flags = flags; | |
3359 str->ch = ch; | |
3360 } | |
3361 | |
3362 | |
3363 DEFUN ("decode-big5-char", Fdecode_big5_char, 1, 1, 0, /* | |
3364 Decode a Big5 character CODE of BIG5 coding-system. | |
3365 CODE is the character code in BIG5, a cons of two integers. | |
3366 Return the corresponding character. | |
3367 */ | |
3368 (code)) | |
3369 { | |
3370 unsigned char c1, c2, b1, b2; | |
3371 | |
3372 CHECK_CONS (code); | |
3373 CHECK_INT (XCAR (code)); | |
3374 CHECK_INT (XCDR (code)); | |
3375 b1 = XINT (XCAR (code)); | |
3376 b2 = XINT (XCDR (code)); | |
3377 if (BYTE_BIG5_TWO_BYTE_1_P (b1) && | |
3378 BYTE_BIG5_TWO_BYTE_2_P (b2)) | |
3379 { | |
3380 int leading_byte; | |
3381 Lisp_Object charset; | |
3382 DECODE_BIG5 (b1, b2, leading_byte, c1, c2); | |
3383 charset = CHARSET_BY_LEADING_BYTE (leading_byte); | |
3384 return make_char (MAKE_CHAR (charset, c1 & 0x7F, c2 & 0x7F)); | |
3385 } | |
3386 else | |
3387 return Qnil; | |
3388 } | |
3389 | |
3390 DEFUN ("encode-big5-char", Fencode_big5_char, 1, 1, 0, /* | |
3391 Encode the Big5 character CHARACTER in the BIG5 coding-system. | |
3392 Return the corresponding character code in Big5. | |
3393 */ | |
3394 (character)) | |
3395 { | |
3396 Lisp_Object charset; | |
3397 int c1, c2, b1, b2; | |
3398 | |
3399 CHECK_CHAR_COERCE_INT (character); | |
3400 BREAKUP_CHAR (XCHAR (character), charset, c1, c2); | |
3401 if (EQ (charset, Vcharset_chinese_big5_1) || | |
3402 EQ (charset, Vcharset_chinese_big5_2)) | |
3403 { | |
3404 ENCODE_BIG5 (XCHARSET_LEADING_BYTE (charset), c1 | 0x80, c2 | 0x80, | |
3405 b1, b2); | |
3406 return Fcons (make_int (b1), make_int (b2)); | |
3407 } | |
3408 else | |
3409 return Qnil; | |
3410 } | |
3411 | |
3412 | |
3413 /************************************************************************/ | |
3414 /* UCS-4 methods */ | |
3415 /* */ | |
3416 /* UCS-4 character codes are implemented as nonnegative integers. */ | |
3417 /* */ | |
3418 /************************************************************************/ | |
3419 | |
3420 | |
3421 DEFUN ("set-ucs-char", Fset_ucs_char, 2, 2, 0, /* | |
3422 Map UCS-4 code CODE to Mule character CHARACTER. | |
3423 | |
3424 Return T on success, NIL on failure. | |
3425 */ | |
3426 (code, character)) | |
3427 { | |
3428 EMACS_INT c; | |
3429 | |
3430 CHECK_CHAR (character); | |
3431 CHECK_NATNUM (code); | |
3432 c = XINT (code); | |
3433 | |
3434 if (c < countof (fcd->ucs_to_mule_table)) | |
3435 { | |
3436 fcd->ucs_to_mule_table[c] = character; | |
3437 return Qt; | |
3438 } | |
3439 else | |
3440 return Qnil; | |
3441 } | |
3442 | |
3443 static Lisp_Object | |
3444 ucs_to_char (unsigned long code) | |
3445 { | |
3446 if (code < countof (fcd->ucs_to_mule_table)) | |
3447 { | |
3448 return fcd->ucs_to_mule_table[code]; | |
3449 } | |
3450 else if ((0xe00000 <= code) && (code <= 0xe00000 + 94 * 94 * 14)) | |
3451 { | |
3452 unsigned int c; | |
3453 | |
3454 code -= 0xe00000; | |
3455 c = code % (94 * 94); | |
3456 return make_char | |
3457 (MAKE_CHAR (CHARSET_BY_ATTRIBUTES | |
3458 (CHARSET_TYPE_94X94, code / (94 * 94) + '@', | |
3459 CHARSET_LEFT_TO_RIGHT), | |
3460 c / 94 + 33, c % 94 + 33)); | |
3461 } | |
3462 else | |
3463 return Qnil; | |
3464 } | |
3465 | |
3466 DEFUN ("ucs-char", Fucs_char, 1, 1, 0, /* | |
3467 Return Mule character corresponding to UCS code CODE (a positive integer). | |
3468 */ | |
3469 (code)) | |
3470 { | |
3471 CHECK_NATNUM (code); | |
3472 return ucs_to_char (XINT (code)); | |
3473 } | |
3474 | |
3475 DEFUN ("set-char-ucs", Fset_char_ucs, 2, 2, 0, /* | |
3476 Map Mule character CHARACTER to UCS code CODE (a positive integer). | |
3477 */ | |
3478 (character, code)) | |
3479 { | |
3480 /* #### Isn't this gilding the lily? Fput_char_table checks its args. | |
3481 Fset_char_ucs is more restrictive on index arg, but should | |
3482 check code arg in a char_table method. */ | |
3483 CHECK_CHAR (character); | |
3484 CHECK_NATNUM (code); | |
3485 return Fput_char_table (character, code, mule_to_ucs_table); | |
3486 } | |
3487 | |
3488 DEFUN ("char-ucs", Fchar_ucs, 1, 1, 0, /* | |
3489 Return the UCS code (a positive integer) corresponding to CHARACTER. | |
3490 */ | |
3491 (character)) | |
3492 { | |
3493 return Fget_char_table (character, mule_to_ucs_table); | |
3494 } | |
3495 | |
3496 /* Decode a UCS-4 character into a buffer. If the lookup fails, use | |
3497 <GETA MARK> (U+3013) of JIS X 0208, which means correct character | |
3498 is not found, instead. | |
3499 #### do something more appropriate (use blob?) | |
3500 Danger, Will Robinson! Data loss. Should we signal user? */ | |
3501 static void | |
3502 decode_ucs4 (unsigned long ch, unsigned_char_dynarr *dst) | |
3503 { | |
3504 Lisp_Object chr = ucs_to_char (ch); | |
3505 | |
3506 if (! NILP (chr)) | |
3507 { | |
3508 Intbyte work[MAX_EMCHAR_LEN]; | |
3509 int len; | |
3510 | |
3511 ch = XCHAR (chr); | |
3512 len = (ch < 128) ? | |
3513 simple_set_charptr_emchar (work, ch) : | |
3514 non_ascii_set_charptr_emchar (work, ch); | |
3515 Dynarr_add_many (dst, work, len); | |
3516 } | |
3517 else | |
3518 { | |
3519 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); | |
3520 Dynarr_add (dst, 34 + 128); | |
3521 Dynarr_add (dst, 46 + 128); | |
3522 } | |
3523 } | |
3524 | |
3525 static unsigned long | |
3526 mule_char_to_ucs4 (Lisp_Object charset, | |
3527 unsigned char h, unsigned char l) | |
3528 { | |
3529 Lisp_Object code | |
3530 = Fget_char_table (make_char (MAKE_CHAR (charset, h & 127, l & 127)), | |
3531 mule_to_ucs_table); | |
3532 | |
3533 if (INTP (code)) | |
3534 { | |
3535 return XINT (code); | |
3536 } | |
3537 else if ( (XCHARSET_DIMENSION (charset) == 2) && | |
3538 (XCHARSET_CHARS (charset) == 94) ) | |
3539 { | |
3540 unsigned char final = XCHARSET_FINAL (charset); | |
3541 | |
3542 if ( ('@' <= final) && (final < 0x7f) ) | |
3543 { | |
3544 return 0xe00000 + (final - '@') * 94 * 94 | |
3545 + ((h & 127) - 33) * 94 + (l & 127) - 33; | |
3546 } | |
3547 else | |
3548 { | |
3549 return '?'; | |
3550 } | |
3551 } | |
3552 else | |
3553 { | |
3554 return '?'; | |
3555 } | |
3556 } | |
3557 | |
3558 static void | |
3559 encode_ucs4 (Lisp_Object charset, | |
3560 unsigned char h, unsigned char l, unsigned_char_dynarr *dst) | |
3561 { | |
3562 unsigned long code = mule_char_to_ucs4 (charset, h, l); | |
3563 Dynarr_add (dst, code >> 24); | |
3564 Dynarr_add (dst, (code >> 16) & 255); | |
3565 Dynarr_add (dst, (code >> 8) & 255); | |
3566 Dynarr_add (dst, code & 255); | |
3567 } | |
3568 | |
3569 static int | |
3570 detect_coding_ucs4 (struct detection_state *st, const Extbyte *src, Bytecount n) | |
3571 { | |
3572 while (n--) | |
3573 { | |
3574 unsigned char c = *(unsigned char *)src++; | |
3575 switch (st->ucs4.in_byte) | |
3576 { | |
3577 case 0: | |
3578 if (c >= 128) | |
3579 return 0; | |
3580 else | |
3581 st->ucs4.in_byte++; | |
3582 break; | |
3583 case 3: | |
3584 st->ucs4.in_byte = 0; | |
3585 break; | |
3586 default: | |
3587 st->ucs4.in_byte++; | |
3588 } | |
3589 } | |
3590 return CODING_CATEGORY_UCS4_MASK; | |
3591 } | |
3592 | |
3593 static void | |
3594 decode_coding_ucs4 (Lstream *decoding, const Extbyte *src, | |
3595 unsigned_char_dynarr *dst, Bytecount n) | |
3596 { | |
3597 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
3598 unsigned int flags = str->flags; | |
3599 unsigned int ch = str->ch; | |
3600 unsigned char counter = str->counter; | |
3601 | |
3602 while (n--) | |
3603 { | |
3604 unsigned char c = *(unsigned char *)src++; | |
3605 switch (counter) | |
3606 { | |
3607 case 0: | |
3608 ch = c; | |
3609 counter = 3; | |
3610 break; | |
3611 case 1: | |
3612 decode_ucs4 ( ( ch << 8 ) | c, dst); | |
3613 ch = 0; | |
3614 counter = 0; | |
3615 break; | |
3616 default: | |
3617 ch = ( ch << 8 ) | c; | |
3618 counter--; | |
3619 } | |
3620 } | |
3621 if (counter & CODING_STATE_END) | |
3622 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
3623 | |
3624 str->flags = flags; | |
3625 str->ch = ch; | |
3626 str->counter = counter; | |
3627 } | |
3628 | |
3629 static void | |
3630 encode_coding_ucs4 (Lstream *encoding, const Intbyte *src, | |
3631 unsigned_char_dynarr *dst, Bytecount n) | |
3632 { | |
3633 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
3634 unsigned int flags = str->flags; | |
3635 unsigned int ch = str->ch; | |
3636 unsigned char char_boundary = str->iso2022.current_char_boundary; | |
3637 Lisp_Object charset = str->iso2022.current_charset; | |
3638 | |
3639 #ifdef ENABLE_COMPOSITE_CHARS | |
3640 /* flags for handling composite chars. We do a little switcharoo | |
3641 on the source while we're outputting the composite char. */ | |
3642 Bytecount saved_n = 0; | |
3643 const unsigned char *saved_src = NULL; | |
3644 int in_composite = 0; | |
3645 | |
3646 back_to_square_n: | |
3647 #endif | |
3648 | |
3649 while (n--) | |
3650 { | |
3651 unsigned char c = *src++; | |
3652 | |
3653 if (BYTE_ASCII_P (c)) | |
3654 { /* Processing ASCII character */ | |
3655 ch = 0; | |
3656 encode_ucs4 (Vcharset_ascii, c, 0, dst); | |
3657 char_boundary = 1; | |
3658 } | |
3659 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) | |
3660 { /* Processing Leading Byte */ | |
3661 ch = 0; | |
3662 charset = CHARSET_BY_LEADING_BYTE (c); | |
3663 if (LEADING_BYTE_PREFIX_P(c)) | |
3664 ch = c; | |
3665 char_boundary = 0; | |
3666 } | |
3667 else | |
3668 { /* Processing Non-ASCII character */ | |
3669 char_boundary = 1; | |
3670 if (EQ (charset, Vcharset_control_1)) | |
3671 { | |
3672 encode_ucs4 (Vcharset_control_1, c, 0, dst); | |
3673 } | |
3674 else | |
3675 { | |
3676 switch (XCHARSET_REP_BYTES (charset)) | |
3677 { | |
3678 case 2: | |
3679 encode_ucs4 (charset, c, 0, dst); | |
3680 break; | |
3681 case 3: | |
3682 if (XCHARSET_PRIVATE_P (charset)) | |
3683 { | |
3684 encode_ucs4 (charset, c, 0, dst); | |
3685 ch = 0; | |
3686 } | |
3687 else if (ch) | |
3688 { | |
3689 #ifdef ENABLE_COMPOSITE_CHARS | |
3690 if (EQ (charset, Vcharset_composite)) | |
3691 { | |
3692 if (in_composite) | |
3693 { | |
3694 /* #### Bother! We don't know how to | |
3695 handle this yet. */ | |
3696 Dynarr_add (dst, '\0'); | |
3697 Dynarr_add (dst, '\0'); | |
3698 Dynarr_add (dst, '\0'); | |
3699 Dynarr_add (dst, '~'); | |
3700 } | |
3701 else | |
3702 { | |
3703 Emchar emch = MAKE_CHAR (Vcharset_composite, | |
3704 ch & 0x7F, c & 0x7F); | |
3705 Lisp_Object lstr = composite_char_string (emch); | |
3706 saved_n = n; | |
3707 saved_src = src; | |
3708 in_composite = 1; | |
3709 src = XSTRING_DATA (lstr); | |
3710 n = XSTRING_LENGTH (lstr); | |
3711 } | |
3712 } | |
3713 else | |
3714 #endif /* ENABLE_COMPOSITE_CHARS */ | |
3715 { | |
3716 encode_ucs4(charset, ch, c, dst); | |
3717 } | |
3718 ch = 0; | |
3719 } | |
3720 else | |
3721 { | |
3722 ch = c; | |
3723 char_boundary = 0; | |
3724 } | |
3725 break; | |
3726 case 4: | |
3727 if (ch) | |
3728 { | |
3729 encode_ucs4 (charset, ch, c, dst); | |
3730 ch = 0; | |
3731 } | |
3732 else | |
3733 { | |
3734 ch = c; | |
3735 char_boundary = 0; | |
3736 } | |
3737 break; | |
3738 default: | |
3739 abort (); | |
3740 } | |
3741 } | |
3742 } | |
3743 } | |
3744 | |
3745 #ifdef ENABLE_COMPOSITE_CHARS | |
3746 if (in_composite) | |
3747 { | |
3748 n = saved_n; | |
3749 src = saved_src; | |
3750 in_composite = 0; | |
3751 goto back_to_square_n; /* Wheeeeeeeee ..... */ | |
3752 } | |
3753 #endif /* ENABLE_COMPOSITE_CHARS */ | |
3754 | |
3755 str->flags = flags; | |
3756 str->ch = ch; | |
3757 str->iso2022.current_char_boundary = char_boundary; | |
3758 str->iso2022.current_charset = charset; | |
3759 | |
3760 /* Verbum caro factum est! */ | |
3761 } | |
3762 | |
3763 | |
3764 /************************************************************************/ | |
3765 /* UTF-8 methods */ | |
3766 /************************************************************************/ | |
3767 | |
3768 static int | |
3769 detect_coding_utf8 (struct detection_state *st, const Extbyte *src, Bytecount n) | |
3770 { | |
3771 while (n--) | |
3772 { | |
3773 unsigned char c = *(unsigned char *)src++; | |
3774 switch (st->utf8.in_byte) | |
3775 { | |
3776 case 0: | |
3777 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO) | |
3778 return 0; | |
3779 else if (c >= 0xfc) | |
3780 st->utf8.in_byte = 5; | |
3781 else if (c >= 0xf8) | |
3782 st->utf8.in_byte = 4; | |
3783 else if (c >= 0xf0) | |
3784 st->utf8.in_byte = 3; | |
3785 else if (c >= 0xe0) | |
3786 st->utf8.in_byte = 2; | |
3787 else if (c >= 0xc0) | |
3788 st->utf8.in_byte = 1; | |
3789 else if (c >= 0x80) | |
3790 return 0; | |
3791 break; | |
3792 default: | |
3793 if ((c & 0xc0) != 0x80) | |
3794 return 0; | |
3795 else | |
3796 st->utf8.in_byte--; | |
3797 } | |
3798 } | |
3799 return CODING_CATEGORY_UTF8_MASK; | |
3800 } | |
3801 | |
3802 static void | |
3803 decode_coding_utf8 (Lstream *decoding, const Extbyte *src, | |
3804 unsigned_char_dynarr *dst, Bytecount n) | |
3805 { | |
3806 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
3807 unsigned int flags = str->flags; | |
3808 unsigned int ch = str->ch; | |
3809 eol_type_t eol_type = str->eol_type; | |
3810 unsigned char counter = str->counter; | |
3811 | |
3812 while (n--) | |
3813 { | |
3814 unsigned char c = *(unsigned char *)src++; | |
3815 switch (counter) | |
3816 { | |
3817 case 0: | |
3818 if ( c >= 0xfc ) | |
3819 { | |
3820 ch = c & 0x01; | |
3821 counter = 5; | |
3822 } | |
3823 else if ( c >= 0xf8 ) | |
3824 { | |
3825 ch = c & 0x03; | |
3826 counter = 4; | |
3827 } | |
3828 else if ( c >= 0xf0 ) | |
3829 { | |
3830 ch = c & 0x07; | |
3831 counter = 3; | |
3832 } | |
3833 else if ( c >= 0xe0 ) | |
3834 { | |
3835 ch = c & 0x0f; | |
3836 counter = 2; | |
3837 } | |
3838 else if ( c >= 0xc0 ) | |
3839 { | |
3840 ch = c & 0x1f; | |
3841 counter = 1; | |
3842 } | |
3843 else | |
3844 { | |
3845 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | |
3846 decode_ucs4 (c, dst); | |
3847 } | |
3848 break; | |
3849 case 1: | |
3850 ch = ( ch << 6 ) | ( c & 0x3f ); | |
3851 decode_ucs4 (ch, dst); | |
3852 ch = 0; | |
3853 counter = 0; | |
3854 break; | |
3855 default: | |
3856 ch = ( ch << 6 ) | ( c & 0x3f ); | |
3857 counter--; | |
3858 } | |
3859 label_continue_loop:; | |
3860 } | |
3861 | |
3862 if (flags & CODING_STATE_END) | |
3863 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
3864 | |
3865 str->flags = flags; | |
3866 str->ch = ch; | |
3867 str->counter = counter; | |
3868 } | |
3869 | |
3870 static void | |
3871 encode_utf8 (Lisp_Object charset, | |
3872 unsigned char h, unsigned char l, unsigned_char_dynarr *dst) | |
3873 { | |
3874 unsigned long code = mule_char_to_ucs4 (charset, h, l); | |
3875 if ( code <= 0x7f ) | |
3876 { | |
3877 Dynarr_add (dst, code); | |
3878 } | |
3879 else if ( code <= 0x7ff ) | |
3880 { | |
3881 Dynarr_add (dst, (code >> 6) | 0xc0); | |
3882 Dynarr_add (dst, (code & 0x3f) | 0x80); | |
3883 } | |
3884 else if ( code <= 0xffff ) | |
3885 { | |
3886 Dynarr_add (dst, (code >> 12) | 0xe0); | |
3887 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); | |
3888 Dynarr_add (dst, (code & 0x3f) | 0x80); | |
3889 } | |
3890 else if ( code <= 0x1fffff ) | |
3891 { | |
3892 Dynarr_add (dst, (code >> 18) | 0xf0); | |
3893 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); | |
3894 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); | |
3895 Dynarr_add (dst, (code & 0x3f) | 0x80); | |
3896 } | |
3897 else if ( code <= 0x3ffffff ) | |
3898 { | |
3899 Dynarr_add (dst, (code >> 24) | 0xf8); | |
3900 Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80); | |
3901 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); | |
3902 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); | |
3903 Dynarr_add (dst, (code & 0x3f) | 0x80); | |
3904 } | |
3905 else | |
3906 { | |
3907 Dynarr_add (dst, (code >> 30) | 0xfc); | |
3908 Dynarr_add (dst, ((code >> 24) & 0x3f) | 0x80); | |
3909 Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80); | |
3910 Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); | |
3911 Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); | |
3912 Dynarr_add (dst, (code & 0x3f) | 0x80); | |
3913 } | |
3914 } | |
3915 | |
3916 static void | |
3917 encode_coding_utf8 (Lstream *encoding, const Intbyte *src, | |
3918 unsigned_char_dynarr *dst, Bytecount n) | |
3919 { | |
3920 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
3921 unsigned int flags = str->flags; | |
3922 unsigned int ch = str->ch; | |
3923 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); | |
3924 unsigned char char_boundary = str->iso2022.current_char_boundary; | |
3925 Lisp_Object charset = str->iso2022.current_charset; | |
3926 | |
3927 #ifdef ENABLE_COMPOSITE_CHARS | |
3928 /* flags for handling composite chars. We do a little switcharoo | |
3929 on the source while we're outputting the composite char. */ | |
3930 Bytecount saved_n = 0; | |
3931 const unsigned char *saved_src = NULL; | |
3932 int in_composite = 0; | |
3933 | |
3934 back_to_square_n: | |
3935 #endif /* ENABLE_COMPOSITE_CHARS */ | |
3936 | |
3937 while (n--) | |
3938 { | |
3939 unsigned char c = *src++; | |
3940 | |
3941 if (BYTE_ASCII_P (c)) | |
3942 { /* Processing ASCII character */ | |
3943 ch = 0; | |
3944 if (c == '\n') | |
3945 { | |
3946 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) | |
3947 Dynarr_add (dst, '\r'); | |
3948 if (eol_type != EOL_CR) | |
3949 Dynarr_add (dst, c); | |
3950 } | |
3951 else | |
3952 encode_utf8 (Vcharset_ascii, c, 0, dst); | |
3953 char_boundary = 1; | |
3954 } | |
3955 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) | |
3956 { /* Processing Leading Byte */ | |
3957 ch = 0; | |
3958 charset = CHARSET_BY_LEADING_BYTE (c); | |
3959 if (LEADING_BYTE_PREFIX_P(c)) | |
3960 ch = c; | |
3961 char_boundary = 0; | |
3962 } | |
3963 else | |
3964 { /* Processing Non-ASCII character */ | |
3965 char_boundary = 1; | |
3966 if (EQ (charset, Vcharset_control_1)) | |
3967 { | |
3968 encode_utf8 (Vcharset_control_1, c, 0, dst); | |
3969 } | |
3970 else | |
3971 { | |
3972 switch (XCHARSET_REP_BYTES (charset)) | |
3973 { | |
3974 case 2: | |
3975 encode_utf8 (charset, c, 0, dst); | |
3976 break; | |
3977 case 3: | |
3978 if (XCHARSET_PRIVATE_P (charset)) | |
3979 { | |
3980 encode_utf8 (charset, c, 0, dst); | |
3981 ch = 0; | |
3982 } | |
3983 else if (ch) | |
3984 { | |
3985 #ifdef ENABLE_COMPOSITE_CHARS | |
3986 if (EQ (charset, Vcharset_composite)) | |
3987 { | |
3988 if (in_composite) | |
3989 { | |
3990 /* #### Bother! We don't know how to | |
3991 handle this yet. */ | |
3992 encode_utf8 (Vcharset_ascii, '~', 0, dst); | |
3993 } | |
3994 else | |
3995 { | |
3996 Emchar emch = MAKE_CHAR (Vcharset_composite, | |
3997 ch & 0x7F, c & 0x7F); | |
3998 Lisp_Object lstr = composite_char_string (emch); | |
3999 saved_n = n; | |
4000 saved_src = src; | |
4001 in_composite = 1; | |
4002 src = XSTRING_DATA (lstr); | |
4003 n = XSTRING_LENGTH (lstr); | |
4004 } | |
4005 } | |
4006 else | |
4007 #endif /* ENABLE_COMPOSITE_CHARS */ | |
4008 { | |
4009 encode_utf8 (charset, ch, c, dst); | |
4010 } | |
4011 ch = 0; | |
4012 } | |
4013 else | |
4014 { | |
4015 ch = c; | |
4016 char_boundary = 0; | |
4017 } | |
4018 break; | |
4019 case 4: | |
4020 if (ch) | |
4021 { | |
4022 encode_utf8 (charset, ch, c, dst); | |
4023 ch = 0; | |
4024 } | |
4025 else | |
4026 { | |
4027 ch = c; | |
4028 char_boundary = 0; | |
4029 } | |
4030 break; | |
4031 default: | |
4032 abort (); | |
4033 } | |
4034 } | |
4035 } | |
4036 } | |
4037 | |
4038 #ifdef ENABLE_COMPOSITE_CHARS | |
4039 if (in_composite) | |
4040 { | |
4041 n = saved_n; | |
4042 src = saved_src; | |
4043 in_composite = 0; | |
4044 goto back_to_square_n; /* Wheeeeeeeee ..... */ | |
4045 } | |
4046 #endif | |
4047 | |
4048 str->flags = flags; | |
4049 str->ch = ch; | |
4050 str->iso2022.current_char_boundary = char_boundary; | |
4051 str->iso2022.current_charset = charset; | |
4052 | |
4053 /* Verbum caro factum est! */ | |
4054 } | |
4055 | |
4056 | |
4057 /************************************************************************/ | |
4058 /* ISO2022 methods */ | |
4059 /************************************************************************/ | |
4060 | |
4061 /* The following note describes the coding system ISO2022 briefly. | |
4062 Since the intention of this note is to help understand the | |
4063 functions in this file, some parts are NOT ACCURATE or OVERLY | |
4064 SIMPLIFIED. For thorough understanding, please refer to the | |
4065 original document of ISO2022. | |
4066 | |
4067 ISO2022 provides many mechanisms to encode several character sets | |
4068 in 7-bit and 8-bit environments. For 7-bit environments, all text | |
4069 is encoded using bytes less than 128. This may make the encoded | |
4070 text a little bit longer, but the text passes more easily through | |
4071 several gateways, some of which strip off MSB (Most Signigant Bit). | |
4072 | |
4073 There are two kinds of character sets: control character set and | |
4074 graphic character set. The former contains control characters such | |
4075 as `newline' and `escape' to provide control functions (control | |
4076 functions are also provided by escape sequences). The latter | |
4077 contains graphic characters such as 'A' and '-'. Emacs recognizes | |
4078 two control character sets and many graphic character sets. | |
4079 | |
4080 Graphic character sets are classified into one of the following | |
4081 four classes, according to the number of bytes (DIMENSION) and | |
4082 number of characters in one dimension (CHARS) of the set: | |
4083 - DIMENSION1_CHARS94 | |
4084 - DIMENSION1_CHARS96 | |
4085 - DIMENSION2_CHARS94 | |
4086 - DIMENSION2_CHARS96 | |
4087 | |
4088 In addition, each character set is assigned an identification tag, | |
4089 unique for each set, called "final character" (denoted as <F> | |
4090 hereafter). The <F> of each character set is decided by ECMA(*) | |
4091 when it is registered in ISO. The code range of <F> is 0x30..0x7F | |
4092 (0x30..0x3F are for private use only). | |
4093 | |
4094 Note (*): ECMA = European Computer Manufacturers Association | |
4095 | |
4096 Here are examples of graphic character set [NAME(<F>)]: | |
4097 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | |
4098 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | |
4099 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | |
4100 o DIMENSION2_CHARS96 -- none for the moment | |
4101 | |
4102 A code area (1 byte = 8 bits) is divided into 4 areas, C0, GL, C1, and GR. | |
4103 C0 [0x00..0x1F] -- control character plane 0 | |
4104 GL [0x20..0x7F] -- graphic character plane 0 | |
4105 C1 [0x80..0x9F] -- control character plane 1 | |
4106 GR [0xA0..0xFF] -- graphic character plane 1 | |
4107 | |
4108 A control character set is directly designated and invoked to C0 or | |
4109 C1 by an escape sequence. The most common case is that: | |
4110 - ISO646's control character set is designated/invoked to C0, and | |
4111 - ISO6429's control character set is designated/invoked to C1, | |
4112 and usually these designations/invocations are omitted in encoded | |
4113 text. In a 7-bit environment, only C0 can be used, and a control | |
4114 character for C1 is encoded by an appropriate escape sequence to | |
4115 fit into the environment. All control characters for C1 are | |
4116 defined to have corresponding escape sequences. | |
4117 | |
4118 A graphic character set is at first designated to one of four | |
4119 graphic registers (G0 through G3), then these graphic registers are | |
4120 invoked to GL or GR. These designations and invocations can be | |
4121 done independently. The most common case is that G0 is invoked to | |
4122 GL, G1 is invoked to GR, and ASCII is designated to G0. Usually | |
4123 these invocations and designations are omitted in encoded text. | |
4124 In a 7-bit environment, only GL can be used. | |
4125 | |
4126 When a graphic character set of CHARS94 is invoked to GL, codes | |
4127 0x20 and 0x7F of the GL area work as control characters SPACE and | |
4128 DEL respectively, and codes 0xA0 and 0xFF of the GR area should not | |
4129 be used. | |
4130 | |
4131 There are two ways of invocation: locking-shift and single-shift. | |
4132 With locking-shift, the invocation lasts until the next different | |
4133 invocation, whereas with single-shift, the invocation affects the | |
4134 following character only and doesn't affect the locking-shift | |
4135 state. Invocations are done by the following control characters or | |
4136 escape sequences: | |
4137 | |
4138 ---------------------------------------------------------------------- | |
4139 abbrev function cntrl escape seq description | |
4140 ---------------------------------------------------------------------- | |
4141 SI/LS0 (shift-in) 0x0F none invoke G0 into GL | |
4142 SO/LS1 (shift-out) 0x0E none invoke G1 into GL | |
4143 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL | |
4144 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL | |
4145 LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*) | |
4146 LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*) | |
4147 LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*) | |
4148 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char | |
4149 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char | |
4150 ---------------------------------------------------------------------- | |
4151 (*) These are not used by any known coding system. | |
4152 | |
4153 Control characters for these functions are defined by macros | |
4154 ISO_CODE_XXX in `coding.h'. | |
4155 | |
4156 Designations are done by the following escape sequences: | |
4157 ---------------------------------------------------------------------- | |
4158 escape sequence description | |
4159 ---------------------------------------------------------------------- | |
4160 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0 | |
4161 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1 | |
4162 ESC '*' <F> designate DIMENSION1_CHARS94<F> to G2 | |
4163 ESC '+' <F> designate DIMENSION1_CHARS94<F> to G3 | |
4164 ESC ',' <F> designate DIMENSION1_CHARS96<F> to G0 (*) | |
4165 ESC '-' <F> designate DIMENSION1_CHARS96<F> to G1 | |
4166 ESC '.' <F> designate DIMENSION1_CHARS96<F> to G2 | |
4167 ESC '/' <F> designate DIMENSION1_CHARS96<F> to G3 | |
4168 ESC '$' '(' <F> designate DIMENSION2_CHARS94<F> to G0 (**) | |
4169 ESC '$' ')' <F> designate DIMENSION2_CHARS94<F> to G1 | |
4170 ESC '$' '*' <F> designate DIMENSION2_CHARS94<F> to G2 | |
4171 ESC '$' '+' <F> designate DIMENSION2_CHARS94<F> to G3 | |
4172 ESC '$' ',' <F> designate DIMENSION2_CHARS96<F> to G0 (*) | |
4173 ESC '$' '-' <F> designate DIMENSION2_CHARS96<F> to G1 | |
4174 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2 | |
4175 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3 | |
4176 ---------------------------------------------------------------------- | |
4177 | |
4178 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set | |
4179 of dimension 1, chars 94, and final character <F>, etc... | |
4180 | |
4181 Note (*): Although these designations are not allowed in ISO2022, | |
4182 Emacs accepts them on decoding, and produces them on encoding | |
4183 CHARS96 character sets in a coding system which is characterized as | |
4184 7-bit environment, non-locking-shift, and non-single-shift. | |
4185 | |
4186 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | |
4187 '(' can be omitted. We refer to this as "short-form" hereafter. | |
4188 | |
4189 Now you may notice that there are a lot of ways for encoding the | |
4190 same multilingual text in ISO2022. Actually, there exist many | |
4191 coding systems such as Compound Text (used in X11's inter client | |
4192 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR | |
4193 (used in Korean internet), EUC (Extended UNIX Code, used in Asian | |
4194 localized platforms), and all of these are variants of ISO2022. | |
4195 | |
4196 In addition to the above, Emacs handles two more kinds of escape | |
4197 sequences: ISO6429's direction specification and Emacs' private | |
4198 sequence for specifying character composition. | |
4199 | |
4200 ISO6429's direction specification takes the following form: | |
4201 o CSI ']' -- end of the current direction | |
4202 o CSI '0' ']' -- end of the current direction | |
4203 o CSI '1' ']' -- start of left-to-right text | |
4204 o CSI '2' ']' -- start of right-to-left text | |
4205 The control character CSI (0x9B: control sequence introducer) is | |
4206 abbreviated to the escape sequence ESC '[' in a 7-bit environment. | |
4207 | |
4208 Character composition specification takes the following form: | |
4209 o ESC '0' -- start character composition | |
4210 o ESC '1' -- end character composition | |
4211 Since these are not standard escape sequences of any ISO standard, | |
4212 their use with these meanings is restricted to Emacs only. */ | |
4213 | |
4214 static void | |
4215 reset_iso2022 (Lisp_Object coding_system, struct iso2022_decoder *iso) | |
4216 { | |
4217 int i; | |
4218 | |
4219 for (i = 0; i < 4; i++) | |
4220 { | |
4221 if (!NILP (coding_system)) | |
4222 iso->charset[i] = | |
4223 XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, i); | |
4224 else | |
4225 iso->charset[i] = Qt; | |
4226 iso->invalid_designated[i] = 0; | |
4227 } | |
4228 iso->esc = ISO_ESC_NOTHING; | |
4229 iso->esc_bytes_index = 0; | |
4230 iso->register_left = 0; | |
4231 iso->register_right = 1; | |
4232 iso->switched_dir_and_no_valid_charset_yet = 0; | |
4233 iso->invalid_switch_dir = 0; | |
4234 iso->output_direction_sequence = 0; | |
4235 iso->output_literally = 0; | |
4236 #ifdef ENABLE_COMPOSITE_CHARS | |
4237 if (iso->composite_chars) | |
4238 Dynarr_reset (iso->composite_chars); | |
4239 #endif | |
4240 } | |
4241 | |
4242 static int | |
4243 fit_to_be_escape_quoted (unsigned char c) | |
4244 { | |
4245 switch (c) | |
4246 { | |
4247 case ISO_CODE_ESC: | |
4248 case ISO_CODE_CSI: | |
4249 case ISO_CODE_SS2: | |
4250 case ISO_CODE_SS3: | |
4251 case ISO_CODE_SO: | |
4252 case ISO_CODE_SI: | |
4253 return 1; | |
4254 | |
4255 default: | |
4256 return 0; | |
4257 } | |
4258 } | |
4259 | |
4260 /* Parse one byte of an ISO2022 escape sequence. | |
4261 If the result is an invalid escape sequence, return 0 and | |
4262 do not change anything in STR. Otherwise, if the result is | |
4263 an incomplete escape sequence, update ISO2022.ESC and | |
4264 ISO2022.ESC_BYTES and return -1. Otherwise, update | |
4265 all the state variables (but not ISO2022.ESC_BYTES) and | |
4266 return 1. | |
4267 | |
4268 If CHECK_INVALID_CHARSETS is non-zero, check for designation | |
4269 or invocation of an invalid character set and treat that as | |
4270 an unrecognized escape sequence. | |
4271 | |
4272 ******************************************************************** | |
4273 | |
4274 #### Strategies for error annotation and coding orthogonalization | |
4275 | |
4276 We really want to separate out a number of things. Conceptually, | |
4277 there is a nested syntax. | |
4278 | |
4279 At the top level is the ISO 2022 extension syntax, including charset | |
4280 designation and invocation, and certain auxiliary controls such as the | |
4281 ISO 6429 direction specification. These are octet-oriented, with the | |
4282 single exception (AFAIK) of the "exit Unicode" sequence which uses the | |
4283 UTF's natural width (1 byte for UTF-7 and UTF-8, 2 bytes for UCS-2 and | |
4284 UTF-16, and 4 bytes for UCS-4 and UTF-32). This will be treated as a | |
4285 (deprecated) special case in Unicode processing. | |
4286 | |
4287 The middle layer is ISO 2022 character interpretation. This will depend | |
4288 on the current state of the ISO 2022 registers, and assembles octets | |
4289 into the character's internal representation. | |
4290 | |
4291 The lowest level is translating system control conventions. At present | |
4292 this is restricted to newline translation, but one could imagine doing | |
4293 tab conversion or line wrapping here. "Escape from Unicode" processing | |
4294 would be done at this level. | |
4295 | |
4296 At each level the parser will verify the syntax. In the case of a | |
4297 syntax error or warning (such as a redundant escape sequence that affects | |
4298 no characters), the parser will take some action, typically inserting the | |
4299 erroneous octets directly into the output and creating an annotation | |
4300 which can be used by higher level I/O to mark the affected region. | |
4301 | |
4302 This should make it possible to do something sensible about separating | |
4303 newline convention processing from character construction, and about | |
4304 preventing ISO 2022 escape sequences from being recognized | |
4305 inappropriately. | |
4306 | |
4307 The basic strategy will be to have octet classification tables, and | |
4308 switch processing according to the table entry. | |
4309 | |
4310 It's possible that, by doing the processing with tables of functions or | |
4311 the like, the parser can be used for both detection and translation. */ | |
4312 | |
4313 static int | |
4314 parse_iso2022_esc (Lisp_Object codesys, struct iso2022_decoder *iso, | |
4315 unsigned char c, unsigned int *flags, | |
4316 int check_invalid_charsets) | |
4317 { | |
4318 /* (1) If we're at the end of a designation sequence, CS is the | |
4319 charset being designated and REG is the register to designate | |
4320 it to. | |
4321 | |
4322 (2) If we're at the end of a locking-shift sequence, REG is | |
4323 the register to invoke and HALF (0 == left, 1 == right) is | |
4324 the half to invoke it into. | |
4325 | |
4326 (3) If we're at the end of a single-shift sequence, REG is | |
4327 the register to invoke. */ | |
4328 Lisp_Object cs = Qnil; | |
4329 int reg, half; | |
4330 | |
4331 /* NOTE: This code does goto's all over the fucking place. | |
4332 The reason for this is that we're basically implementing | |
4333 a state machine here, and hierarchical languages like C | |
4334 don't really provide a clean way of doing this. */ | |
4335 | |
4336 if (! (*flags & CODING_STATE_ESCAPE)) | |
4337 /* At beginning of escape sequence; we need to reset our | |
4338 escape-state variables. */ | |
4339 iso->esc = ISO_ESC_NOTHING; | |
4340 | |
4341 iso->output_literally = 0; | |
4342 iso->output_direction_sequence = 0; | |
4343 | |
4344 switch (iso->esc) | |
4345 { | |
4346 case ISO_ESC_NOTHING: | |
4347 iso->esc_bytes_index = 0; | |
4348 switch (c) | |
4349 { | |
4350 case ISO_CODE_ESC: /* Start escape sequence */ | |
4351 *flags |= CODING_STATE_ESCAPE; | |
4352 iso->esc = ISO_ESC; | |
4353 goto not_done; | |
4354 | |
4355 case ISO_CODE_CSI: /* ISO6429 (specifying directionality) */ | |
4356 *flags |= CODING_STATE_ESCAPE; | |
4357 iso->esc = ISO_ESC_5_11; | |
4358 goto not_done; | |
4359 | |
4360 case ISO_CODE_SO: /* locking shift 1 */ | |
4361 reg = 1; half = 0; | |
4362 goto locking_shift; | |
4363 case ISO_CODE_SI: /* locking shift 0 */ | |
4364 reg = 0; half = 0; | |
4365 goto locking_shift; | |
4366 | |
4367 case ISO_CODE_SS2: /* single shift */ | |
4368 reg = 2; | |
4369 goto single_shift; | |
4370 case ISO_CODE_SS3: /* single shift */ | |
4371 reg = 3; | |
4372 goto single_shift; | |
4373 | |
4374 default: /* Other control characters */ | |
4375 return 0; | |
4376 } | |
4377 | |
4378 case ISO_ESC: | |
4379 switch (c) | |
4380 { | |
4381 /**** single shift ****/ | |
4382 | |
4383 case 'N': /* single shift 2 */ | |
4384 reg = 2; | |
4385 goto single_shift; | |
4386 case 'O': /* single shift 3 */ | |
4387 reg = 3; | |
4388 goto single_shift; | |
4389 | |
4390 /**** locking shift ****/ | |
4391 | |
4392 case '~': /* locking shift 1 right */ | |
4393 reg = 1; half = 1; | |
4394 goto locking_shift; | |
4395 case 'n': /* locking shift 2 */ | |
4396 reg = 2; half = 0; | |
4397 goto locking_shift; | |
4398 case '}': /* locking shift 2 right */ | |
4399 reg = 2; half = 1; | |
4400 goto locking_shift; | |
4401 case 'o': /* locking shift 3 */ | |
4402 reg = 3; half = 0; | |
4403 goto locking_shift; | |
4404 case '|': /* locking shift 3 right */ | |
4405 reg = 3; half = 1; | |
4406 goto locking_shift; | |
4407 | |
4408 #ifdef ENABLE_COMPOSITE_CHARS | |
4409 /**** composite ****/ | |
4410 | |
4411 case '0': | |
4412 iso->esc = ISO_ESC_START_COMPOSITE; | |
4413 *flags = (*flags & CODING_STATE_ISO2022_LOCK) | | |
4414 CODING_STATE_COMPOSITE; | |
4415 return 1; | |
4416 | |
4417 case '1': | |
4418 iso->esc = ISO_ESC_END_COMPOSITE; | |
4419 *flags = (*flags & CODING_STATE_ISO2022_LOCK) & | |
4420 ~CODING_STATE_COMPOSITE; | |
4421 return 1; | |
4422 #endif /* ENABLE_COMPOSITE_CHARS */ | |
4423 | |
4424 /**** directionality ****/ | |
4425 | |
4426 case '[': | |
4427 iso->esc = ISO_ESC_5_11; | |
4428 goto not_done; | |
4429 | |
4430 /**** designation ****/ | |
4431 | |
4432 case '$': /* multibyte charset prefix */ | |
4433 iso->esc = ISO_ESC_2_4; | |
4434 goto not_done; | |
4435 | |
4436 default: | |
4437 if (0x28 <= c && c <= 0x2F) | |
4438 { | |
4439 iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_8); | |
4440 goto not_done; | |
4441 } | |
4442 | |
4443 /* This function is called with CODESYS equal to nil when | |
4444 doing coding-system detection. */ | |
4445 if (!NILP (codesys) | |
4446 && XCODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) | |
4447 && fit_to_be_escape_quoted (c)) | |
4448 { | |
4449 iso->esc = ISO_ESC_LITERAL; | |
4450 *flags &= CODING_STATE_ISO2022_LOCK; | |
4451 return 1; | |
4452 } | |
4453 | |
4454 /* bzzzt! */ | |
4455 return 0; | |
4456 } | |
4457 | |
4458 | |
4459 | |
4460 /**** directionality ****/ | |
4461 | |
4462 case ISO_ESC_5_11: /* ISO6429 direction control */ | |
4463 if (c == ']') | |
4464 { | |
4465 *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); | |
4466 goto directionality; | |
4467 } | |
4468 if (c == '0') iso->esc = ISO_ESC_5_11_0; | |
4469 else if (c == '1') iso->esc = ISO_ESC_5_11_1; | |
4470 else if (c == '2') iso->esc = ISO_ESC_5_11_2; | |
4471 else return 0; | |
4472 goto not_done; | |
4473 | |
4474 case ISO_ESC_5_11_0: | |
4475 if (c == ']') | |
4476 { | |
4477 *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); | |
4478 goto directionality; | |
4479 } | |
4480 return 0; | |
4481 | |
4482 case ISO_ESC_5_11_1: | |
4483 if (c == ']') | |
4484 { | |
4485 *flags = (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); | |
4486 goto directionality; | |
4487 } | |
4488 return 0; | |
4489 | |
4490 case ISO_ESC_5_11_2: | |
4491 if (c == ']') | |
4492 { | |
4493 *flags = (*flags & CODING_STATE_ISO2022_LOCK) | CODING_STATE_R2L; | |
4494 goto directionality; | |
4495 } | |
4496 return 0; | |
4497 | |
4498 directionality: | |
4499 iso->esc = ISO_ESC_DIRECTIONALITY; | |
4500 /* Various junk here to attempt to preserve the direction sequences | |
4501 literally in the text if they would otherwise be swallowed due | |
4502 to invalid designations that don't show up as actual charset | |
4503 changes in the text. */ | |
4504 if (iso->invalid_switch_dir) | |
4505 { | |
4506 /* We already inserted a direction switch literally into the | |
4507 text. We assume (#### this may not be right) that the | |
4508 next direction switch is the one going the other way, | |
4509 and we need to output that literally as well. */ | |
4510 iso->output_literally = 1; | |
4511 iso->invalid_switch_dir = 0; | |
4512 } | |
4513 else | |
4514 { | |
4515 int jj; | |
4516 | |
4517 /* If we are in the thrall of an invalid designation, | |
4518 then stick the directionality sequence literally into the | |
4519 output stream so it ends up in the original text again. */ | |
4520 for (jj = 0; jj < 4; jj++) | |
4521 if (iso->invalid_designated[jj]) | |
4522 break; | |
4523 if (jj < 4) | |
4524 { | |
4525 iso->output_literally = 1; | |
4526 iso->invalid_switch_dir = 1; | |
4527 } | |
4528 else | |
4529 /* Indicate that we haven't yet seen a valid designation, | |
4530 so that if a switch-dir is directly followed by an | |
4531 invalid designation, both get inserted literally. */ | |
4532 iso->switched_dir_and_no_valid_charset_yet = 1; | |
4533 } | |
4534 return 1; | |
4535 | |
4536 | |
4537 /**** designation ****/ | |
4538 | |
4539 case ISO_ESC_2_4: | |
4540 if (0x28 <= c && c <= 0x2F) | |
4541 { | |
4542 iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_4_8); | |
4543 goto not_done; | |
4544 } | |
4545 if (0x40 <= c && c <= 0x42) | |
4546 { | |
4547 cs = CHARSET_BY_ATTRIBUTES (CHARSET_TYPE_94X94, c, | |
4548 *flags & CODING_STATE_R2L ? | |
4549 CHARSET_RIGHT_TO_LEFT : | |
4550 CHARSET_LEFT_TO_RIGHT); | |
4551 reg = 0; | |
4552 goto designated; | |
4553 } | |
4554 return 0; | |
4555 | |
4556 default: | |
4557 { | |
4558 int type =-1; | |
4559 | |
4560 if (c < '0' || c > '~') | |
4561 return 0; /* bad final byte */ | |
4562 | |
4563 if (iso->esc >= ISO_ESC_2_8 && | |
4564 iso->esc <= ISO_ESC_2_15) | |
4565 { | |
4566 type = ((iso->esc >= ISO_ESC_2_12) ? | |
4567 CHARSET_TYPE_96 : CHARSET_TYPE_94); | |
4568 reg = (iso->esc - ISO_ESC_2_8) & 3; | |
4569 } | |
4570 else if (iso->esc >= ISO_ESC_2_4_8 && | |
4571 iso->esc <= ISO_ESC_2_4_15) | |
4572 { | |
4573 type = ((iso->esc >= ISO_ESC_2_4_12) ? | |
4574 CHARSET_TYPE_96X96 : CHARSET_TYPE_94X94); | |
4575 reg = (iso->esc - ISO_ESC_2_4_8) & 3; | |
4576 } | |
4577 else | |
4578 { | |
4579 /* Can this ever be reached? -slb */ | |
4580 abort(); | |
4581 return 0; | |
4582 } | |
4583 | |
4584 cs = CHARSET_BY_ATTRIBUTES (type, c, | |
4585 *flags & CODING_STATE_R2L ? | |
4586 CHARSET_RIGHT_TO_LEFT : | |
4587 CHARSET_LEFT_TO_RIGHT); | |
4588 goto designated; | |
4589 } | |
4590 } | |
4591 | |
4592 not_done: | |
4593 iso->esc_bytes[iso->esc_bytes_index++] = (unsigned char) c; | |
4594 return -1; | |
4595 | |
4596 single_shift: | |
4597 if (check_invalid_charsets && !CHARSETP (iso->charset[reg])) | |
4598 /* can't invoke something that ain't there. */ | |
4599 return 0; | |
4600 iso->esc = ISO_ESC_SINGLE_SHIFT; | |
4601 *flags &= CODING_STATE_ISO2022_LOCK; | |
4602 if (reg == 2) | |
4603 *flags |= CODING_STATE_SS2; | |
4604 else | |
4605 *flags |= CODING_STATE_SS3; | |
4606 return 1; | |
4607 | |
4608 locking_shift: | |
4609 if (check_invalid_charsets && | |
4610 !CHARSETP (iso->charset[reg])) | |
4611 /* can't invoke something that ain't there. */ | |
4612 return 0; | |
4613 if (half) | |
4614 iso->register_right = reg; | |
4615 else | |
4616 iso->register_left = reg; | |
4617 *flags &= CODING_STATE_ISO2022_LOCK; | |
4618 iso->esc = ISO_ESC_LOCKING_SHIFT; | |
4619 return 1; | |
4620 | |
4621 designated: | |
4622 if (NILP (cs) && check_invalid_charsets) | |
4623 { | |
4624 iso->invalid_designated[reg] = 1; | |
4625 iso->charset[reg] = Vcharset_ascii; | |
4626 iso->esc = ISO_ESC_DESIGNATE; | |
4627 *flags &= CODING_STATE_ISO2022_LOCK; | |
4628 iso->output_literally = 1; | |
4629 if (iso->switched_dir_and_no_valid_charset_yet) | |
4630 { | |
4631 /* We encountered a switch-direction followed by an | |
4632 invalid designation. Ensure that the switch-direction | |
4633 gets outputted; otherwise it will probably get eaten | |
4634 when the text is written out again. */ | |
4635 iso->switched_dir_and_no_valid_charset_yet = 0; | |
4636 iso->output_direction_sequence = 1; | |
4637 /* And make sure that the switch-dir going the other | |
4638 way gets outputted, as well. */ | |
4639 iso->invalid_switch_dir = 1; | |
4640 } | |
4641 return 1; | |
4642 } | |
4643 /* This function is called with CODESYS equal to nil when | |
4644 doing coding-system detection. */ | |
4645 if (!NILP (codesys)) | |
4646 { | |
4647 charset_conversion_spec_dynarr *dyn = | |
4648 XCODING_SYSTEM (codesys)->iso2022.input_conv; | |
4649 | |
4650 if (dyn) | |
4651 { | |
4652 int i; | |
4653 | |
4654 for (i = 0; i < Dynarr_length (dyn); i++) | |
4655 { | |
4656 struct charset_conversion_spec *spec = Dynarr_atp (dyn, i); | |
4657 if (EQ (cs, spec->from_charset)) | |
4658 cs = spec->to_charset; | |
4659 } | |
4660 } | |
4661 } | |
4662 | |
4663 iso->charset[reg] = cs; | |
4664 iso->esc = ISO_ESC_DESIGNATE; | |
4665 *flags &= CODING_STATE_ISO2022_LOCK; | |
4666 if (iso->invalid_designated[reg]) | |
4667 { | |
4668 iso->invalid_designated[reg] = 0; | |
4669 iso->output_literally = 1; | |
4670 } | |
4671 if (iso->switched_dir_and_no_valid_charset_yet) | |
4672 iso->switched_dir_and_no_valid_charset_yet = 0; | |
4673 return 1; | |
4674 } | |
4675 | |
4676 static int | |
4677 detect_coding_iso2022 (struct detection_state *st, const Extbyte *src, Bytecount n) | |
4678 { | |
4679 int mask; | |
4680 | |
4681 /* #### There are serious deficiencies in the recognition mechanism | |
4682 here. This needs to be much smarter if it's going to cut it. | |
4683 The sequence "\xff\x0f" is currently detected as LOCK_SHIFT while | |
4684 it should be detected as Latin-1. | |
4685 All the ISO2022 stuff in this file should be synced up with the | |
4686 code from FSF Emacs-20.4, in which Mule should be more or less stable. | |
4687 Perhaps we should wait till R2L works in FSF Emacs? */ | |
4688 | |
4689 if (!st->iso2022.initted) | |
4690 { | |
4691 reset_iso2022 (Qnil, &st->iso2022.iso); | |
4692 st->iso2022.mask = (CODING_CATEGORY_ISO_7_MASK | | |
4693 CODING_CATEGORY_ISO_8_DESIGNATE_MASK | | |
4694 CODING_CATEGORY_ISO_8_1_MASK | | |
4695 CODING_CATEGORY_ISO_8_2_MASK | | |
4696 CODING_CATEGORY_ISO_LOCK_SHIFT_MASK); | |
4697 st->iso2022.flags = 0; | |
4698 st->iso2022.high_byte_count = 0; | |
4699 st->iso2022.saw_single_shift = 0; | |
4700 st->iso2022.initted = 1; | |
4701 } | |
4702 | |
4703 mask = st->iso2022.mask; | |
4704 | |
4705 while (n--) | |
4706 { | |
4707 unsigned char c = *(unsigned char *)src++; | |
4708 if (c >= 0xA0) | |
4709 { | |
4710 mask &= ~CODING_CATEGORY_ISO_7_MASK; | |
4711 st->iso2022.high_byte_count++; | |
4712 } | |
4713 else | |
4714 { | |
4715 if (st->iso2022.high_byte_count && !st->iso2022.saw_single_shift) | |
4716 { | |
4717 if (st->iso2022.high_byte_count & 1) | |
4718 /* odd number of high bytes; assume not iso-8-2 */ | |
4719 mask &= ~CODING_CATEGORY_ISO_8_2_MASK; | |
4720 } | |
4721 st->iso2022.high_byte_count = 0; | |
4722 st->iso2022.saw_single_shift = 0; | |
4723 if (c > 0x80) | |
4724 mask &= ~CODING_CATEGORY_ISO_7_MASK; | |
4725 } | |
4726 if (!(st->iso2022.flags & CODING_STATE_ESCAPE) | |
4727 && (BYTE_C0_P (c) || BYTE_C1_P (c))) | |
4728 { /* control chars */ | |
4729 switch (c) | |
4730 { | |
4731 /* Allow and ignore control characters that you might | |
4732 reasonably see in a text file */ | |
4733 case '\r': | |
4734 case '\n': | |
4735 case '\t': | |
4736 case 7: /* bell */ | |
4737 case 8: /* backspace */ | |
4738 case 11: /* vertical tab */ | |
4739 case 12: /* form feed */ | |
4740 case 26: /* MS-DOS C-z junk */ | |
4741 case 31: /* '^_' -- for info */ | |
4742 goto label_continue_loop; | |
4743 | |
4744 default: | |
4745 break; | |
4746 } | |
4747 } | |
4748 | |
4749 if ((st->iso2022.flags & CODING_STATE_ESCAPE) || BYTE_C0_P (c) | |
4750 || BYTE_C1_P (c)) | |
4751 { | |
4752 if (parse_iso2022_esc (Qnil, &st->iso2022.iso, c, | |
4753 &st->iso2022.flags, 0)) | |
4754 { | |
4755 switch (st->iso2022.iso.esc) | |
4756 { | |
4757 case ISO_ESC_DESIGNATE: | |
4758 mask &= ~CODING_CATEGORY_ISO_8_1_MASK; | |
4759 mask &= ~CODING_CATEGORY_ISO_8_2_MASK; | |
4760 break; | |
4761 case ISO_ESC_LOCKING_SHIFT: | |
4762 mask = CODING_CATEGORY_ISO_LOCK_SHIFT_MASK; | |
4763 goto ran_out_of_chars; | |
4764 case ISO_ESC_SINGLE_SHIFT: | |
4765 mask &= ~CODING_CATEGORY_ISO_8_DESIGNATE_MASK; | |
4766 st->iso2022.saw_single_shift = 1; | |
4767 break; | |
4768 default: | |
4769 break; | |
4770 } | |
4771 } | |
4772 else | |
4773 { | |
4774 mask = 0; | |
4775 goto ran_out_of_chars; | |
4776 } | |
4777 } | |
4778 label_continue_loop:; | |
4779 } | |
4780 | |
4781 ran_out_of_chars: | |
4782 | |
4783 return mask; | |
4784 } | |
4785 | |
4786 static int | |
4787 postprocess_iso2022_mask (int mask) | |
4788 { | |
4789 /* #### kind of cheesy */ | |
4790 /* If seven-bit ISO is allowed, then assume that the encoding is | |
4791 entirely seven-bit and turn off the eight-bit ones. */ | |
4792 if (mask & CODING_CATEGORY_ISO_7_MASK) | |
4793 mask &= ~ (CODING_CATEGORY_ISO_8_DESIGNATE_MASK | | |
4794 CODING_CATEGORY_ISO_8_1_MASK | | |
4795 CODING_CATEGORY_ISO_8_2_MASK); | |
4796 return mask; | |
4797 } | |
4798 | |
4799 /* If FLAGS is a null pointer or specifies right-to-left motion, | |
4800 output a switch-dir-to-left-to-right sequence to DST. | |
4801 Also update FLAGS if it is not a null pointer. | |
4802 If INTERNAL_P is set, we are outputting in internal format and | |
4803 need to handle the CSI differently. */ | |
4804 | |
4805 static void | |
4806 restore_left_to_right_direction (Lisp_Coding_System *codesys, | |
4807 unsigned_char_dynarr *dst, | |
4808 unsigned int *flags, | |
4809 int internal_p) | |
4810 { | |
4811 if (!flags || (*flags & CODING_STATE_R2L)) | |
4812 { | |
4813 if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) | |
4814 { | |
4815 Dynarr_add (dst, ISO_CODE_ESC); | |
4816 Dynarr_add (dst, '['); | |
4817 } | |
4818 else if (internal_p) | |
4819 DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst); | |
4820 else | |
4821 Dynarr_add (dst, ISO_CODE_CSI); | |
4822 Dynarr_add (dst, '0'); | |
4823 Dynarr_add (dst, ']'); | |
4824 if (flags) | |
4825 *flags &= ~CODING_STATE_R2L; | |
4826 } | |
4827 } | |
4828 | |
4829 /* If FLAGS is a null pointer or specifies a direction different from | |
4830 DIRECTION (which should be either CHARSET_RIGHT_TO_LEFT or | |
4831 CHARSET_LEFT_TO_RIGHT), output the appropriate switch-dir escape | |
4832 sequence to DST. Also update FLAGS if it is not a null pointer. | |
4833 If INTERNAL_P is set, we are outputting in internal format and | |
4834 need to handle the CSI differently. */ | |
4835 | |
4836 static void | |
4837 ensure_correct_direction (int direction, Lisp_Coding_System *codesys, | |
4838 unsigned_char_dynarr *dst, unsigned int *flags, | |
4839 int internal_p) | |
4840 { | |
4841 if ((!flags || (*flags & CODING_STATE_R2L)) && | |
4842 direction == CHARSET_LEFT_TO_RIGHT) | |
4843 restore_left_to_right_direction (codesys, dst, flags, internal_p); | |
4844 else if (!CODING_SYSTEM_ISO2022_NO_ISO6429 (codesys) | |
4845 && (!flags || !(*flags & CODING_STATE_R2L)) && | |
4846 direction == CHARSET_RIGHT_TO_LEFT) | |
4847 { | |
4848 if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) | |
4849 { | |
4850 Dynarr_add (dst, ISO_CODE_ESC); | |
4851 Dynarr_add (dst, '['); | |
4852 } | |
4853 else if (internal_p) | |
4854 DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst); | |
4855 else | |
4856 Dynarr_add (dst, ISO_CODE_CSI); | |
4857 Dynarr_add (dst, '2'); | |
4858 Dynarr_add (dst, ']'); | |
4859 if (flags) | |
4860 *flags |= CODING_STATE_R2L; | |
4861 } | |
4862 } | |
4863 | |
4864 /* Convert ISO2022-format data to internal format. */ | |
4865 | |
4866 static void | |
4867 decode_coding_iso2022 (Lstream *decoding, const Extbyte *src, | |
4868 unsigned_char_dynarr *dst, Bytecount n) | |
4869 { | |
4870 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
4871 unsigned int flags = str->flags; | |
4872 unsigned int ch = str->ch; | |
4873 eol_type_t eol_type = str->eol_type; | |
4874 #ifdef ENABLE_COMPOSITE_CHARS | |
4875 unsigned_char_dynarr *real_dst = dst; | |
4876 #endif | |
4877 Lisp_Object coding_system; | |
4878 | |
4879 XSETCODING_SYSTEM (coding_system, str->codesys); | |
4880 | |
4881 #ifdef ENABLE_COMPOSITE_CHARS | |
4882 if (flags & CODING_STATE_COMPOSITE) | |
4883 dst = str->iso2022.composite_chars; | |
4884 #endif /* ENABLE_COMPOSITE_CHARS */ | |
4885 | |
4886 while (n--) | |
4887 { | |
4888 unsigned char c = *(unsigned char *)src++; | |
4889 if (flags & CODING_STATE_ESCAPE) | |
4890 { /* Within ESC sequence */ | |
4891 int retval = parse_iso2022_esc (coding_system, &str->iso2022, | |
4892 c, &flags, 1); | |
4893 | |
4894 if (retval) | |
4895 { | |
4896 switch (str->iso2022.esc) | |
4897 { | |
4898 #ifdef ENABLE_COMPOSITE_CHARS | |
4899 case ISO_ESC_START_COMPOSITE: | |
4900 if (str->iso2022.composite_chars) | |
4901 Dynarr_reset (str->iso2022.composite_chars); | |
4902 else | |
4903 str->iso2022.composite_chars = Dynarr_new (unsigned_char); | |
4904 dst = str->iso2022.composite_chars; | |
4905 break; | |
4906 case ISO_ESC_END_COMPOSITE: | |
4907 { | |
4908 Intbyte comstr[MAX_EMCHAR_LEN]; | |
4909 Bytecount len; | |
4910 Emchar emch = lookup_composite_char (Dynarr_atp (dst, 0), | |
4911 Dynarr_length (dst)); | |
4912 dst = real_dst; | |
4913 len = set_charptr_emchar (comstr, emch); | |
4914 Dynarr_add_many (dst, comstr, len); | |
4915 break; | |
4916 } | |
4917 #endif /* ENABLE_COMPOSITE_CHARS */ | |
4918 | |
4919 case ISO_ESC_LITERAL: | |
4920 DECODE_ADD_BINARY_CHAR (c, dst); | |
4921 break; | |
4922 | |
4923 default: | |
4924 /* Everything else handled already */ | |
4925 break; | |
4926 } | |
4927 } | |
4928 | |
4929 /* Attempted error recovery. */ | |
4930 if (str->iso2022.output_direction_sequence) | |
4931 ensure_correct_direction (flags & CODING_STATE_R2L ? | |
4932 CHARSET_RIGHT_TO_LEFT : | |
4933 CHARSET_LEFT_TO_RIGHT, | |
4934 str->codesys, dst, 0, 1); | |
4935 /* More error recovery. */ | |
4936 if (!retval || str->iso2022.output_literally) | |
4937 { | |
4938 /* Output the (possibly invalid) sequence */ | |
4939 int i; | |
4940 for (i = 0; i < str->iso2022.esc_bytes_index; i++) | |
4941 DECODE_ADD_BINARY_CHAR (str->iso2022.esc_bytes[i], dst); | |
4942 flags &= CODING_STATE_ISO2022_LOCK; | |
4943 if (!retval) | |
4944 n++, src--;/* Repeat the loop with the same character. */ | |
4945 else | |
4946 { | |
4947 /* No sense in reprocessing the final byte of the | |
4948 escape sequence; it could mess things up anyway. | |
4949 Just add it now. */ | |
4950 DECODE_ADD_BINARY_CHAR (c, dst); | |
4951 } | |
4952 } | |
4953 ch = 0; | |
4954 } | |
4955 else if (BYTE_C0_P (c) || BYTE_C1_P (c)) | |
4956 { /* Control characters */ | |
4957 | |
4958 /***** Error-handling *****/ | |
4959 | |
4960 /* If we were in the middle of a character, dump out the | |
4961 partial character. */ | |
4962 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
4963 | |
4964 /* If we just saw a single-shift character, dump it out. | |
4965 This may dump out the wrong sort of single-shift character, | |
4966 but least it will give an indication that something went | |
4967 wrong. */ | |
4968 if (flags & CODING_STATE_SS2) | |
4969 { | |
4970 DECODE_ADD_BINARY_CHAR (ISO_CODE_SS2, dst); | |
4971 flags &= ~CODING_STATE_SS2; | |
4972 } | |
4973 if (flags & CODING_STATE_SS3) | |
4974 { | |
4975 DECODE_ADD_BINARY_CHAR (ISO_CODE_SS3, dst); | |
4976 flags &= ~CODING_STATE_SS3; | |
4977 } | |
4978 | |
4979 /***** Now handle the control characters. *****/ | |
4980 | |
4981 /* Handle CR/LF */ | |
4982 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | |
4983 | |
4984 flags &= CODING_STATE_ISO2022_LOCK; | |
4985 | |
4986 if (!parse_iso2022_esc (coding_system, &str->iso2022, c, &flags, 1)) | |
4987 DECODE_ADD_BINARY_CHAR (c, dst); | |
4988 } | |
4989 else | |
4990 { /* Graphic characters */ | |
4991 Lisp_Object charset; | |
4992 int lb; | |
4993 int reg; | |
4994 | |
4995 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | |
4996 | |
4997 /* Now determine the charset. */ | |
4998 reg = ((flags & CODING_STATE_SS2) ? 2 | |
4999 : (flags & CODING_STATE_SS3) ? 3 | |
5000 : !BYTE_ASCII_P (c) ? str->iso2022.register_right | |
5001 : str->iso2022.register_left); | |
5002 charset = str->iso2022.charset[reg]; | |
5003 | |
5004 /* Error checking: */ | |
5005 if (! CHARSETP (charset) | |
5006 || str->iso2022.invalid_designated[reg] | |
5007 || (((c & 0x7F) == ' ' || (c & 0x7F) == ISO_CODE_DEL) | |
5008 && XCHARSET_CHARS (charset) == 94)) | |
5009 /* Mrmph. We are trying to invoke a register that has no | |
5010 or an invalid charset in it, or trying to add a character | |
5011 outside the range of the charset. Insert that char literally | |
5012 to preserve it for the output. */ | |
5013 { | |
5014 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
5015 DECODE_ADD_BINARY_CHAR (c, dst); | |
5016 } | |
5017 | |
5018 else | |
5019 { | |
5020 /* Things are probably hunky-dorey. */ | |
5021 | |
5022 /* Fetch reverse charset, maybe. */ | |
5023 if (((flags & CODING_STATE_R2L) && | |
5024 XCHARSET_DIRECTION (charset) == CHARSET_LEFT_TO_RIGHT) | |
5025 || | |
5026 (!(flags & CODING_STATE_R2L) && | |
5027 XCHARSET_DIRECTION (charset) == CHARSET_RIGHT_TO_LEFT)) | |
5028 { | |
5029 Lisp_Object new_charset = | |
5030 XCHARSET_REVERSE_DIRECTION_CHARSET (charset); | |
5031 if (!NILP (new_charset)) | |
5032 charset = new_charset; | |
5033 } | |
5034 | |
5035 lb = XCHARSET_LEADING_BYTE (charset); | |
5036 switch (XCHARSET_REP_BYTES (charset)) | |
5037 { | |
5038 case 1: /* ASCII */ | |
5039 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
5040 Dynarr_add (dst, c & 0x7F); | |
5041 break; | |
5042 | |
5043 case 2: /* one-byte official */ | |
5044 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
5045 Dynarr_add (dst, lb); | |
5046 Dynarr_add (dst, c | 0x80); | |
5047 break; | |
5048 | |
5049 case 3: /* one-byte private or two-byte official */ | |
5050 if (XCHARSET_PRIVATE_P (charset)) | |
5051 { | |
5052 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
5053 Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_1); | |
5054 Dynarr_add (dst, lb); | |
5055 Dynarr_add (dst, c | 0x80); | |
5056 } | |
5057 else | |
5058 { | |
5059 if (ch) | |
5060 { | |
5061 Dynarr_add (dst, lb); | |
5062 Dynarr_add (dst, ch | 0x80); | |
5063 Dynarr_add (dst, c | 0x80); | |
5064 ch = 0; | |
5065 } | |
5066 else | |
5067 ch = c; | |
5068 } | |
5069 break; | |
5070 | |
5071 default: /* two-byte private */ | |
5072 if (ch) | |
5073 { | |
5074 Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_2); | |
5075 Dynarr_add (dst, lb); | |
5076 Dynarr_add (dst, ch | 0x80); | |
5077 Dynarr_add (dst, c | 0x80); | |
5078 ch = 0; | |
5079 } | |
5080 else | |
5081 ch = c; | |
5082 } | |
5083 } | |
5084 | |
5085 if (!ch) | |
5086 flags &= CODING_STATE_ISO2022_LOCK; | |
5087 } | |
5088 | |
5089 label_continue_loop:; | |
5090 } | |
5091 | |
5092 if (flags & CODING_STATE_END) | |
5093 DECODE_OUTPUT_PARTIAL_CHAR (ch); | |
5094 | |
5095 str->flags = flags; | |
5096 str->ch = ch; | |
5097 } | |
5098 | |
5099 | |
5100 /***** ISO2022 encoder *****/ | |
5101 | |
5102 /* Designate CHARSET into register REG. */ | |
5103 | |
5104 static void | |
5105 iso2022_designate (Lisp_Object charset, unsigned char reg, | |
5106 struct encoding_stream *str, unsigned_char_dynarr *dst) | |
5107 { | |
5108 static const char inter94[] = "()*+"; | |
5109 static const char inter96[] = ",-./"; | |
5110 int type; | |
5111 unsigned char final; | |
5112 Lisp_Object old_charset = str->iso2022.charset[reg]; | |
5113 | |
5114 str->iso2022.charset[reg] = charset; | |
5115 if (!CHARSETP (charset)) | |
5116 /* charset might be an initial nil or t. */ | |
5117 return; | |
5118 type = XCHARSET_TYPE (charset); | |
5119 final = XCHARSET_FINAL (charset); | |
5120 if (!str->iso2022.force_charset_on_output[reg] && | |
5121 CHARSETP (old_charset) && | |
5122 XCHARSET_TYPE (old_charset) == type && | |
5123 XCHARSET_FINAL (old_charset) == final) | |
5124 return; | |
5125 | |
5126 str->iso2022.force_charset_on_output[reg] = 0; | |
5127 | |
5128 { | |
5129 charset_conversion_spec_dynarr *dyn = | |
5130 str->codesys->iso2022.output_conv; | |
5131 | |
5132 if (dyn) | |
5133 { | |
5134 int i; | |
5135 | |
5136 for (i = 0; i < Dynarr_length (dyn); i++) | |
5137 { | |
5138 struct charset_conversion_spec *spec = Dynarr_atp (dyn, i); | |
5139 if (EQ (charset, spec->from_charset)) | |
5140 charset = spec->to_charset; | |
5141 } | |
5142 } | |
5143 } | |
5144 | |
5145 Dynarr_add (dst, ISO_CODE_ESC); | |
5146 switch (type) | |
5147 { | |
5148 case CHARSET_TYPE_94: | |
5149 Dynarr_add (dst, inter94[reg]); | |
5150 break; | |
5151 case CHARSET_TYPE_96: | |
5152 Dynarr_add (dst, inter96[reg]); | |
5153 break; | |
5154 case CHARSET_TYPE_94X94: | |
5155 Dynarr_add (dst, '$'); | |
5156 if (reg != 0 | |
5157 || !(CODING_SYSTEM_ISO2022_SHORT (str->codesys)) | |
5158 || final < '@' | |
5159 || final > 'B') | |
5160 Dynarr_add (dst, inter94[reg]); | |
5161 break; | |
5162 case CHARSET_TYPE_96X96: | |
5163 Dynarr_add (dst, '$'); | |
5164 Dynarr_add (dst, inter96[reg]); | |
5165 break; | |
5166 } | |
5167 Dynarr_add (dst, final); | |
5168 } | |
5169 | |
5170 static void | |
5171 ensure_normal_shift (struct encoding_stream *str, unsigned_char_dynarr *dst) | |
5172 { | |
5173 if (str->iso2022.register_left != 0) | |
5174 { | |
5175 Dynarr_add (dst, ISO_CODE_SI); | |
5176 str->iso2022.register_left = 0; | |
5177 } | |
5178 } | |
5179 | |
5180 static void | |
5181 ensure_shift_out (struct encoding_stream *str, unsigned_char_dynarr *dst) | |
5182 { | |
5183 if (str->iso2022.register_left != 1) | |
5184 { | |
5185 Dynarr_add (dst, ISO_CODE_SO); | |
5186 str->iso2022.register_left = 1; | |
5187 } | |
5188 } | |
5189 | |
5190 /* Convert internally-formatted data to ISO2022 format. */ | |
5191 | |
5192 static void | |
5193 encode_coding_iso2022 (Lstream *encoding, const Intbyte *src, | |
5194 unsigned_char_dynarr *dst, Bytecount n) | |
5195 { | |
5196 unsigned char charmask, c; | |
5197 unsigned char char_boundary; | |
5198 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
5199 unsigned int flags = str->flags; | |
5200 unsigned int ch = str->ch; | |
5201 Lisp_Coding_System *codesys = str->codesys; | |
5202 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); | |
5203 int i; | |
5204 Lisp_Object charset; | |
5205 int half; | |
5206 | |
5207 #ifdef ENABLE_COMPOSITE_CHARS | |
5208 /* flags for handling composite chars. We do a little switcharoo | |
5209 on the source while we're outputting the composite char. */ | |
5210 Bytecount saved_n = 0; | |
5211 const unsigned char *saved_src = NULL; | |
5212 int in_composite = 0; | |
5213 #endif /* ENABLE_COMPOSITE_CHARS */ | |
5214 | |
5215 char_boundary = str->iso2022.current_char_boundary; | |
5216 charset = str->iso2022.current_charset; | |
5217 half = str->iso2022.current_half; | |
5218 | |
5219 #ifdef ENABLE_COMPOSITE_CHARS | |
5220 back_to_square_n: | |
5221 #endif | |
5222 while (n--) | |
5223 { | |
5224 c = *src++; | |
5225 | |
5226 if (BYTE_ASCII_P (c)) | |
5227 { /* Processing ASCII character */ | |
5228 ch = 0; | |
5229 | |
5230 restore_left_to_right_direction (codesys, dst, &flags, 0); | |
5231 | |
5232 /* Make sure G0 contains ASCII */ | |
5233 if ((c > ' ' && c < ISO_CODE_DEL) || | |
5234 !CODING_SYSTEM_ISO2022_NO_ASCII_CNTL (codesys)) | |
5235 { | |
5236 ensure_normal_shift (str, dst); | |
5237 iso2022_designate (Vcharset_ascii, 0, str, dst); | |
5238 } | |
5239 | |
5240 /* If necessary, restore everything to the default state | |
5241 at end-of-line */ | |
5242 if (c == '\n' && | |
5243 !(CODING_SYSTEM_ISO2022_NO_ASCII_EOL (codesys))) | |
5244 { | |
5245 restore_left_to_right_direction (codesys, dst, &flags, 0); | |
5246 | |
5247 ensure_normal_shift (str, dst); | |
5248 | |
5249 for (i = 0; i < 4; i++) | |
5250 { | |
5251 Lisp_Object initial_charset = | |
5252 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i); | |
5253 iso2022_designate (initial_charset, i, str, dst); | |
5254 } | |
5255 } | |
5256 if (c == '\n') | |
5257 { | |
5258 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) | |
5259 Dynarr_add (dst, '\r'); | |
5260 if (eol_type != EOL_CR) | |
5261 Dynarr_add (dst, c); | |
5262 } | |
5263 else | |
5264 { | |
5265 if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) | |
5266 && fit_to_be_escape_quoted (c)) | |
5267 Dynarr_add (dst, ISO_CODE_ESC); | |
5268 Dynarr_add (dst, c); | |
5269 } | |
5270 char_boundary = 1; | |
5271 } | |
5272 | |
5273 else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) | |
5274 { /* Processing Leading Byte */ | |
5275 ch = 0; | |
5276 charset = CHARSET_BY_LEADING_BYTE (c); | |
5277 if (LEADING_BYTE_PREFIX_P(c)) | |
5278 ch = c; | |
5279 else if (!EQ (charset, Vcharset_control_1) | |
5280 #ifdef ENABLE_COMPOSITE_CHARS | |
5281 && !EQ (charset, Vcharset_composite) | |
5282 #endif | |
5283 ) | |
5284 { | |
5285 int reg; | |
5286 | |
5287 ensure_correct_direction (XCHARSET_DIRECTION (charset), | |
5288 codesys, dst, &flags, 0); | |
5289 | |
5290 /* Now determine which register to use. */ | |
5291 reg = -1; | |
5292 for (i = 0; i < 4; i++) | |
5293 { | |
5294 if (EQ (charset, str->iso2022.charset[i]) || | |
5295 EQ (charset, | |
5296 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i))) | |
5297 { | |
5298 reg = i; | |
5299 break; | |
5300 } | |
5301 } | |
5302 | |
5303 if (reg == -1) | |
5304 { | |
5305 if (XCHARSET_GRAPHIC (charset) != 0) | |
5306 { | |
5307 if (!NILP (str->iso2022.charset[1]) && | |
5308 (!CODING_SYSTEM_ISO2022_SEVEN (codesys) || | |
5309 CODING_SYSTEM_ISO2022_LOCK_SHIFT (codesys))) | |
5310 reg = 1; | |
5311 else if (!NILP (str->iso2022.charset[2])) | |
5312 reg = 2; | |
5313 else if (!NILP (str->iso2022.charset[3])) | |
5314 reg = 3; | |
5315 else | |
5316 reg = 0; | |
5317 } | |
5318 else | |
5319 reg = 0; | |
5320 } | |
5321 | |
5322 iso2022_designate (charset, reg, str, dst); | |
5323 | |
5324 /* Now invoke that register. */ | |
5325 switch (reg) | |
5326 { | |
5327 case 0: | |
5328 ensure_normal_shift (str, dst); | |
5329 half = 0; | |
5330 break; | |
5331 | |
5332 case 1: | |
5333 if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) | |
5334 { | |
5335 ensure_shift_out (str, dst); | |
5336 half = 0; | |
5337 } | |
5338 else | |
5339 half = 1; | |
5340 break; | |
5341 | |
5342 case 2: | |
5343 if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys)) | |
5344 { | |
5345 Dynarr_add (dst, ISO_CODE_ESC); | |
5346 Dynarr_add (dst, 'N'); | |
5347 half = 0; | |
5348 } | |
5349 else | |
5350 { | |
5351 Dynarr_add (dst, ISO_CODE_SS2); | |
5352 half = 1; | |
5353 } | |
5354 break; | |
5355 | |
5356 case 3: | |
5357 if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys)) | |
5358 { | |
5359 Dynarr_add (dst, ISO_CODE_ESC); | |
5360 Dynarr_add (dst, 'O'); | |
5361 half = 0; | |
5362 } | |
5363 else | |
5364 { | |
5365 Dynarr_add (dst, ISO_CODE_SS3); | |
5366 half = 1; | |
5367 } | |
5368 break; | |
5369 | |
5370 default: | |
5371 abort (); | |
5372 } | |
5373 } | |
5374 char_boundary = 0; | |
5375 } | |
5376 else | |
5377 { /* Processing Non-ASCII character */ | |
5378 charmask = (half == 0 ? 0x7F : 0xFF); | |
5379 char_boundary = 1; | |
5380 if (EQ (charset, Vcharset_control_1)) | |
5381 { | |
5382 if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) | |
5383 && fit_to_be_escape_quoted (c)) | |
5384 Dynarr_add (dst, ISO_CODE_ESC); | |
5385 /* you asked for it ... */ | |
5386 Dynarr_add (dst, c - 0x20); | |
5387 } | |
5388 else | |
5389 { | |
5390 switch (XCHARSET_REP_BYTES (charset)) | |
5391 { | |
5392 case 2: | |
5393 Dynarr_add (dst, c & charmask); | |
5394 break; | |
5395 case 3: | |
5396 if (XCHARSET_PRIVATE_P (charset)) | |
5397 { | |
5398 Dynarr_add (dst, c & charmask); | |
5399 ch = 0; | |
5400 } | |
5401 else if (ch) | |
5402 { | |
5403 #ifdef ENABLE_COMPOSITE_CHARS | |
5404 if (EQ (charset, Vcharset_composite)) | |
5405 { | |
5406 if (in_composite) | |
5407 { | |
5408 /* #### Bother! We don't know how to | |
5409 handle this yet. */ | |
5410 Dynarr_add (dst, '~'); | |
5411 } | |
5412 else | |
5413 { | |
5414 Emchar emch = MAKE_CHAR (Vcharset_composite, | |
5415 ch & 0x7F, c & 0x7F); | |
5416 Lisp_Object lstr = composite_char_string (emch); | |
5417 saved_n = n; | |
5418 saved_src = src; | |
5419 in_composite = 1; | |
5420 src = XSTRING_DATA (lstr); | |
5421 n = XSTRING_LENGTH (lstr); | |
5422 Dynarr_add (dst, ISO_CODE_ESC); | |
5423 Dynarr_add (dst, '0'); /* start composing */ | |
5424 } | |
5425 } | |
5426 else | |
5427 #endif /* ENABLE_COMPOSITE_CHARS */ | |
5428 { | |
5429 Dynarr_add (dst, ch & charmask); | |
5430 Dynarr_add (dst, c & charmask); | |
5431 } | |
5432 ch = 0; | |
5433 } | |
5434 else | |
5435 { | |
5436 ch = c; | |
5437 char_boundary = 0; | |
5438 } | |
5439 break; | |
5440 case 4: | |
5441 if (ch) | |
5442 { | |
5443 Dynarr_add (dst, ch & charmask); | |
5444 Dynarr_add (dst, c & charmask); | |
5445 ch = 0; | |
5446 } | |
5447 else | |
5448 { | |
5449 ch = c; | |
5450 char_boundary = 0; | |
5451 } | |
5452 break; | |
5453 default: | |
5454 abort (); | |
5455 } | |
5456 } | |
5457 } | |
5458 } | |
5459 | |
5460 #ifdef ENABLE_COMPOSITE_CHARS | |
5461 if (in_composite) | |
5462 { | |
5463 n = saved_n; | |
5464 src = saved_src; | |
5465 in_composite = 0; | |
5466 Dynarr_add (dst, ISO_CODE_ESC); | |
5467 Dynarr_add (dst, '1'); /* end composing */ | |
5468 goto back_to_square_n; /* Wheeeeeeeee ..... */ | |
5469 } | |
5470 #endif /* ENABLE_COMPOSITE_CHARS */ | |
5471 | |
5472 if (char_boundary && flags & CODING_STATE_END) | |
5473 { | |
5474 restore_left_to_right_direction (codesys, dst, &flags, 0); | |
5475 ensure_normal_shift (str, dst); | |
5476 for (i = 0; i < 4; i++) | |
5477 { | |
5478 Lisp_Object initial_charset = | |
5479 CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i); | |
5480 iso2022_designate (initial_charset, i, str, dst); | |
5481 } | |
5482 } | |
5483 | |
5484 str->flags = flags; | |
5485 str->ch = ch; | |
5486 str->iso2022.current_char_boundary = char_boundary; | |
5487 str->iso2022.current_charset = charset; | |
5488 str->iso2022.current_half = half; | |
5489 | |
5490 /* Verbum caro factum est! */ | |
5491 } | |
5492 #endif /* MULE */ | |
5493 | |
5494 /************************************************************************/ | |
5495 /* No-conversion methods */ | |
5496 /************************************************************************/ | |
5497 | |
5498 /* This is used when reading in "binary" files -- i.e. files that may | |
5499 contain all 256 possible byte values and that are not to be | |
5500 interpreted as being in any particular decoding. */ | |
5501 static void | |
5502 decode_coding_no_conversion (Lstream *decoding, const Extbyte *src, | |
5503 unsigned_char_dynarr *dst, Bytecount n) | |
5504 { | |
5505 struct decoding_stream *str = DECODING_STREAM_DATA (decoding); | |
5506 unsigned int flags = str->flags; | |
5507 unsigned int ch = str->ch; | |
5508 eol_type_t eol_type = str->eol_type; | |
5509 | |
5510 while (n--) | |
5511 { | |
5512 unsigned char c = *(unsigned char *)src++; | |
5513 | |
5514 DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); | |
5515 DECODE_ADD_BINARY_CHAR (c, dst); | |
5516 label_continue_loop:; | |
5517 } | |
5518 | |
5519 DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); | |
5520 | |
5521 str->flags = flags; | |
5522 str->ch = ch; | |
5523 } | |
5524 | |
5525 static void | |
5526 encode_coding_no_conversion (Lstream *encoding, const Intbyte *src, | |
5527 unsigned_char_dynarr *dst, Bytecount n) | |
5528 { | |
5529 unsigned char c; | |
5530 struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); | |
5531 unsigned int flags = str->flags; | |
5532 unsigned int ch = str->ch; | |
5533 eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); | |
5534 | |
5535 while (n--) | |
5536 { | |
5537 c = *src++; | |
5538 if (c == '\n') | |
5539 { | |
5540 if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) | |
5541 Dynarr_add (dst, '\r'); | |
5542 if (eol_type != EOL_CR) | |
5543 Dynarr_add (dst, '\n'); | |
5544 ch = 0; | |
5545 } | |
5546 else if (BYTE_ASCII_P (c)) | |
5547 { | |
5548 assert (ch == 0); | |
5549 Dynarr_add (dst, c); | |
5550 } | |
5551 else if (INTBYTE_LEADING_BYTE_P (c)) | |
5552 { | |
5553 assert (ch == 0); | |
5554 if (c == LEADING_BYTE_LATIN_ISO8859_1 || | |
5555 c == LEADING_BYTE_CONTROL_1) | |
5556 ch = c; | |
5557 else | |
5558 Dynarr_add (dst, '~'); /* untranslatable character */ | |
5559 } | |
5560 else | |
5561 { | |
5562 if (ch == LEADING_BYTE_LATIN_ISO8859_1) | |
5563 Dynarr_add (dst, c); | |
5564 else if (ch == LEADING_BYTE_CONTROL_1) | |
5565 { | |
5566 assert (c < 0xC0); | |
5567 Dynarr_add (dst, c - 0x20); | |
5568 } | |
5569 /* else it should be the second or third byte of an | |
5570 untranslatable character, so ignore it */ | |
5571 ch = 0; | |
5572 } | |
5573 } | |
5574 | |
5575 str->flags = flags; | |
5576 str->ch = ch; | |
5577 } | |
5578 | |
5579 | |
5580 | |
5581 /************************************************************************/ | 4447 /************************************************************************/ |
5582 /* Initialization */ | 4448 /* Initialization */ |
5583 /************************************************************************/ | 4449 /************************************************************************/ |
5584 | 4450 |
5585 void | 4451 void |
5586 syms_of_file_coding (void) | 4452 syms_of_file_coding (void) |
5587 { | 4453 { |
5588 INIT_LRECORD_IMPLEMENTATION (coding_system); | 4454 INIT_LRECORD_IMPLEMENTATION (coding_system); |
5589 | 4455 |
4456 DEFSUBR (Fvalid_coding_system_type_p); | |
4457 DEFSUBR (Fcoding_system_type_list); | |
5590 DEFSUBR (Fcoding_system_p); | 4458 DEFSUBR (Fcoding_system_p); |
5591 DEFSUBR (Ffind_coding_system); | 4459 DEFSUBR (Ffind_coding_system); |
5592 DEFSUBR (Fget_coding_system); | 4460 DEFSUBR (Fget_coding_system); |
5593 DEFSUBR (Fcoding_system_list); | 4461 DEFSUBR (Fcoding_system_list); |
5594 DEFSUBR (Fcoding_system_name); | 4462 DEFSUBR (Fcoding_system_name); |
5597 DEFSUBR (Fcoding_system_canonical_name_p); | 4465 DEFSUBR (Fcoding_system_canonical_name_p); |
5598 DEFSUBR (Fcoding_system_alias_p); | 4466 DEFSUBR (Fcoding_system_alias_p); |
5599 DEFSUBR (Fcoding_system_aliasee); | 4467 DEFSUBR (Fcoding_system_aliasee); |
5600 DEFSUBR (Fdefine_coding_system_alias); | 4468 DEFSUBR (Fdefine_coding_system_alias); |
5601 DEFSUBR (Fsubsidiary_coding_system); | 4469 DEFSUBR (Fsubsidiary_coding_system); |
4470 DEFSUBR (Fcoding_system_base); | |
4471 DEFSUBR (Fcoding_system_used_for_io); | |
5602 | 4472 |
5603 DEFSUBR (Fcoding_system_type); | 4473 DEFSUBR (Fcoding_system_type); |
5604 DEFSUBR (Fcoding_system_doc_string); | 4474 DEFSUBR (Fcoding_system_description); |
5605 DEFSUBR (Fcoding_system_property); | 4475 DEFSUBR (Fcoding_system_property); |
5606 | 4476 |
5607 DEFSUBR (Fcoding_category_list); | 4477 DEFSUBR (Fcoding_category_list); |
5608 DEFSUBR (Fset_coding_priority_list); | 4478 DEFSUBR (Fset_coding_priority_list); |
5609 DEFSUBR (Fcoding_priority_list); | 4479 DEFSUBR (Fcoding_priority_list); |
5611 DEFSUBR (Fcoding_category_system); | 4481 DEFSUBR (Fcoding_category_system); |
5612 | 4482 |
5613 DEFSUBR (Fdetect_coding_region); | 4483 DEFSUBR (Fdetect_coding_region); |
5614 DEFSUBR (Fdecode_coding_region); | 4484 DEFSUBR (Fdecode_coding_region); |
5615 DEFSUBR (Fencode_coding_region); | 4485 DEFSUBR (Fencode_coding_region); |
5616 #ifdef MULE | |
5617 DEFSUBR (Fdecode_shift_jis_char); | |
5618 DEFSUBR (Fencode_shift_jis_char); | |
5619 DEFSUBR (Fdecode_big5_char); | |
5620 DEFSUBR (Fencode_big5_char); | |
5621 DEFSUBR (Fset_ucs_char); | |
5622 DEFSUBR (Fucs_char); | |
5623 DEFSUBR (Fset_char_ucs); | |
5624 DEFSUBR (Fchar_ucs); | |
5625 #endif /* MULE */ | |
5626 DEFSYMBOL_MULTIWORD_PREDICATE (Qcoding_systemp); | 4486 DEFSYMBOL_MULTIWORD_PREDICATE (Qcoding_systemp); |
5627 DEFSYMBOL (Qno_conversion); | 4487 DEFSYMBOL (Qno_conversion); |
4488 DEFSYMBOL (Qconvert_eol); | |
4489 DEFSYMBOL (Qconvert_eol_autodetect); | |
4490 DEFSYMBOL (Qconvert_eol_lf); | |
4491 DEFSYMBOL (Qconvert_eol_cr); | |
4492 DEFSYMBOL (Qconvert_eol_crlf); | |
5628 DEFSYMBOL (Qraw_text); | 4493 DEFSYMBOL (Qraw_text); |
5629 #ifdef MULE | 4494 |
5630 DEFSYMBOL (Qbig5); | |
5631 DEFSYMBOL (Qshift_jis); | |
5632 defsymbol (&Qucs4, "ucs-4"); | |
5633 defsymbol (&Qutf8, "utf-8"); | |
5634 DEFSYMBOL (Qccl); | |
5635 DEFSYMBOL (Qiso2022); | |
5636 #endif /* MULE */ | |
5637 DEFSYMBOL (Qmnemonic); | 4495 DEFSYMBOL (Qmnemonic); |
5638 DEFSYMBOL (Qeol_type); | 4496 DEFSYMBOL (Qeol_type); |
5639 DEFSYMBOL (Qpost_read_conversion); | 4497 DEFSYMBOL (Qpost_read_conversion); |
5640 DEFSYMBOL (Qpre_write_conversion); | 4498 DEFSYMBOL (Qpre_write_conversion); |
4499 | |
4500 DEFSYMBOL (Qtranslation_table_for_decode); | |
4501 DEFSYMBOL (Qtranslation_table_for_encode); | |
4502 DEFSYMBOL (Qsafe_chars); | |
4503 DEFSYMBOL (Qsafe_charsets); | |
4504 DEFSYMBOL (Qmime_charset); | |
4505 DEFSYMBOL (Qvalid_codes); | |
5641 | 4506 |
5642 DEFSYMBOL (Qcr); | 4507 DEFSYMBOL (Qcr); |
5643 DEFSYMBOL (Qlf); | 4508 DEFSYMBOL (Qlf); |
5644 DEFSYMBOL (Qcrlf); | 4509 DEFSYMBOL (Qcrlf); |
5645 DEFSYMBOL (Qeol_cr); | 4510 DEFSYMBOL (Qeol_cr); |
5646 DEFSYMBOL (Qeol_lf); | 4511 DEFSYMBOL (Qeol_lf); |
5647 DEFSYMBOL (Qeol_crlf); | 4512 DEFSYMBOL (Qeol_crlf); |
5648 #ifdef MULE | |
5649 DEFSYMBOL (Qcharset_g0); | |
5650 DEFSYMBOL (Qcharset_g1); | |
5651 DEFSYMBOL (Qcharset_g2); | |
5652 DEFSYMBOL (Qcharset_g3); | |
5653 DEFSYMBOL (Qforce_g0_on_output); | |
5654 DEFSYMBOL (Qforce_g1_on_output); | |
5655 DEFSYMBOL (Qforce_g2_on_output); | |
5656 DEFSYMBOL (Qforce_g3_on_output); | |
5657 DEFSYMBOL (Qno_iso6429); | |
5658 DEFSYMBOL (Qinput_charset_conversion); | |
5659 DEFSYMBOL (Qoutput_charset_conversion); | |
5660 | |
5661 DEFSYMBOL (Qshort); | |
5662 DEFSYMBOL (Qno_ascii_eol); | |
5663 DEFSYMBOL (Qno_ascii_cntl); | |
5664 DEFSYMBOL (Qseven); | |
5665 DEFSYMBOL (Qlock_shift); | |
5666 DEFSYMBOL (Qescape_quoted); | |
5667 #endif /* MULE */ | |
5668 DEFSYMBOL (Qencode); | 4513 DEFSYMBOL (Qencode); |
5669 DEFSYMBOL (Qdecode); | 4514 DEFSYMBOL (Qdecode); |
5670 | 4515 |
5671 #ifdef MULE | 4516 DEFSYMBOL (Qnear_certainty); |
5672 defsymbol (&coding_category_symbol[CODING_CATEGORY_SHIFT_JIS], | 4517 DEFSYMBOL (Qquite_probable); |
5673 "shift-jis"); | 4518 DEFSYMBOL (Qsomewhat_likely); |
5674 defsymbol (&coding_category_symbol[CODING_CATEGORY_BIG5], | 4519 DEFSYMBOL (Qas_likely_as_unlikely); |
5675 "big5"); | 4520 DEFSYMBOL (Qsomewhat_unlikely); |
5676 defsymbol (&coding_category_symbol[CODING_CATEGORY_UCS4], | 4521 DEFSYMBOL (Qquite_improbable); |
5677 "ucs-4"); | 4522 DEFSYMBOL (Qnearly_impossible); |
5678 defsymbol (&coding_category_symbol[CODING_CATEGORY_UTF8], | 4523 |
5679 "utf-8"); | 4524 DEFSYMBOL (Qdo_eol); |
5680 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_7], | 4525 DEFSYMBOL (Qdo_coding); |
5681 "iso-7"); | 4526 |
5682 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_DESIGNATE], | 4527 DEFSYMBOL (Qcanonicalize_after_coding); |
5683 "iso-8-designate"); | 4528 |
5684 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_1], | 4529 DEFSYMBOL (Qescape_quoted); |
5685 "iso-8-1"); | 4530 |
5686 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_2], | 4531 #ifdef HAVE_ZLIB |
5687 "iso-8-2"); | 4532 DEFSYMBOL (Qgzip); |
5688 defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_LOCK_SHIFT], | 4533 #endif |
5689 "iso-lock-shift"); | 4534 |
5690 #endif /* MULE */ | 4535 /* WARNING: The existing categories are intimately tied to the function |
5691 defsymbol (&coding_category_symbol[CODING_CATEGORY_NO_CONVERSION], | 4536 `coding-system-category' in coding.el. If you change a category, or |
5692 "no-conversion"); | 4537 change the layout of any coding system associated with a category, you |
4538 need to check that function and make sure it's written properly. */ | |
4539 | |
4540 #ifdef HAVE_DEFAULT_EOL_DETECTION | |
4541 Fprovide (intern ("unix-default-eol-detection")); | |
4542 #endif | |
5693 } | 4543 } |
5694 | 4544 |
5695 void | 4545 void |
5696 lstream_type_create_file_coding (void) | 4546 lstream_type_create_file_coding (void) |
5697 { | 4547 { |
5698 LSTREAM_HAS_METHOD (decoding, reader); | 4548 LSTREAM_HAS_METHOD (coding, reader); |
5699 LSTREAM_HAS_METHOD (decoding, writer); | 4549 LSTREAM_HAS_METHOD (coding, writer); |
5700 LSTREAM_HAS_METHOD (decoding, rewinder); | 4550 LSTREAM_HAS_METHOD (coding, rewinder); |
5701 LSTREAM_HAS_METHOD (decoding, seekable_p); | 4551 LSTREAM_HAS_METHOD (coding, seekable_p); |
5702 LSTREAM_HAS_METHOD (decoding, flusher); | 4552 LSTREAM_HAS_METHOD (coding, marker); |
5703 LSTREAM_HAS_METHOD (decoding, closer); | 4553 LSTREAM_HAS_METHOD (coding, flusher); |
5704 LSTREAM_HAS_METHOD (decoding, marker); | 4554 LSTREAM_HAS_METHOD (coding, closer); |
5705 | 4555 LSTREAM_HAS_METHOD (coding, finalizer); |
5706 LSTREAM_HAS_METHOD (encoding, reader); | 4556 } |
5707 LSTREAM_HAS_METHOD (encoding, writer); | 4557 |
5708 LSTREAM_HAS_METHOD (encoding, rewinder); | 4558 void |
5709 LSTREAM_HAS_METHOD (encoding, seekable_p); | 4559 coding_system_type_create (void) |
5710 LSTREAM_HAS_METHOD (encoding, flusher); | 4560 { |
5711 LSTREAM_HAS_METHOD (encoding, closer); | 4561 int i; |
5712 LSTREAM_HAS_METHOD (encoding, marker); | 4562 |
4563 staticpro (&Vcoding_system_hash_table); | |
4564 Vcoding_system_hash_table = | |
4565 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ); | |
4566 | |
4567 the_coding_system_type_entry_dynarr = Dynarr_new (coding_system_type_entry); | |
4568 dump_add_root_struct_ptr (&the_coding_system_type_entry_dynarr, | |
4569 &csted_description); | |
4570 | |
4571 Vcoding_system_type_list = Qnil; | |
4572 staticpro (&Vcoding_system_type_list); | |
4573 | |
4574 /* Initialize to something reasonable ... */ | |
4575 for (i = 0; i < MAX_DETECTOR_CATEGORIES; i++) | |
4576 { | |
4577 coding_category_system[i] = Qnil; | |
4578 dump_add_root_object (&coding_category_system[i]); | |
4579 coding_category_by_priority[i] = i; | |
4580 } | |
4581 | |
4582 dump_add_opaque (coding_category_by_priority, | |
4583 sizeof (coding_category_by_priority)); | |
4584 | |
4585 all_coding_detectors = Dynarr_new2 (detector_dynarr, struct detector); | |
4586 dump_add_root_struct_ptr (&all_coding_detectors, | |
4587 &detector_dynarr_description); | |
4588 | |
4589 dump_add_opaque_int (&coding_system_tick); | |
4590 dump_add_opaque_int (&coding_detector_count); | |
4591 dump_add_opaque_int (&coding_detector_category_count); | |
4592 | |
4593 INITIALIZE_CODING_SYSTEM_TYPE (no_conversion, | |
4594 "no-conversion-coding-system-p"); | |
4595 CODING_SYSTEM_HAS_METHOD (no_conversion, convert); | |
4596 | |
4597 INITIALIZE_DETECTOR (no_conversion); | |
4598 DETECTOR_HAS_METHOD (no_conversion, detect); | |
4599 INITIALIZE_DETECTOR_CATEGORY (no_conversion, no_conversion); | |
4600 | |
4601 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (convert_eol, | |
4602 "convert-eol-coding-system-p"); | |
4603 CODING_SYSTEM_HAS_METHOD (convert_eol, print); | |
4604 CODING_SYSTEM_HAS_METHOD (convert_eol, convert); | |
4605 CODING_SYSTEM_HAS_METHOD (convert_eol, getprop); | |
4606 CODING_SYSTEM_HAS_METHOD (convert_eol, putprop); | |
4607 CODING_SYSTEM_HAS_METHOD (convert_eol, conversion_end_type); | |
4608 CODING_SYSTEM_HAS_METHOD (convert_eol, canonicalize_after_coding); | |
4609 CODING_SYSTEM_HAS_METHOD (convert_eol, init_coding_stream); | |
4610 | |
4611 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (undecided, | |
4612 "undecided-coding-system-p"); | |
4613 CODING_SYSTEM_HAS_METHOD (undecided, init); | |
4614 CODING_SYSTEM_HAS_METHOD (undecided, mark); | |
4615 CODING_SYSTEM_HAS_METHOD (undecided, print); | |
4616 CODING_SYSTEM_HAS_METHOD (undecided, convert); | |
4617 CODING_SYSTEM_HAS_METHOD (undecided, putprop); | |
4618 CODING_SYSTEM_HAS_METHOD (undecided, getprop); | |
4619 CODING_SYSTEM_HAS_METHOD (undecided, init_coding_stream); | |
4620 CODING_SYSTEM_HAS_METHOD (undecided, rewind_coding_stream); | |
4621 CODING_SYSTEM_HAS_METHOD (undecided, finalize_coding_stream); | |
4622 CODING_SYSTEM_HAS_METHOD (undecided, mark_coding_stream); | |
4623 CODING_SYSTEM_HAS_METHOD (undecided, canonicalize); | |
4624 CODING_SYSTEM_HAS_METHOD (undecided, canonicalize_after_coding); | |
4625 | |
4626 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (chain, "chain-coding-system-p"); | |
4627 | |
4628 CODING_SYSTEM_HAS_METHOD (chain, print); | |
4629 CODING_SYSTEM_HAS_METHOD (chain, canonicalize); | |
4630 CODING_SYSTEM_HAS_METHOD (chain, init); | |
4631 CODING_SYSTEM_HAS_METHOD (chain, mark); | |
4632 CODING_SYSTEM_HAS_METHOD (chain, mark_coding_stream); | |
4633 CODING_SYSTEM_HAS_METHOD (chain, convert); | |
4634 CODING_SYSTEM_HAS_METHOD (chain, rewind_coding_stream); | |
4635 CODING_SYSTEM_HAS_METHOD (chain, finalize_coding_stream); | |
4636 CODING_SYSTEM_HAS_METHOD (chain, finalize); | |
4637 CODING_SYSTEM_HAS_METHOD (chain, putprop); | |
4638 CODING_SYSTEM_HAS_METHOD (chain, getprop); | |
4639 CODING_SYSTEM_HAS_METHOD (chain, conversion_end_type); | |
4640 CODING_SYSTEM_HAS_METHOD (chain, canonicalize_after_coding); | |
4641 | |
4642 #ifdef DEBUG_XEMACS | |
4643 INITIALIZE_CODING_SYSTEM_TYPE (internal, "internal-coding-system-p"); | |
4644 CODING_SYSTEM_HAS_METHOD (internal, convert); | |
4645 #endif | |
4646 | |
4647 #ifdef HAVE_ZLIB | |
4648 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (gzip, "gzip-coding-system-p"); | |
4649 CODING_SYSTEM_HAS_METHOD (gzip, conversion_end_type); | |
4650 CODING_SYSTEM_HAS_METHOD (gzip, convert); | |
4651 CODING_SYSTEM_HAS_METHOD (gzip, init); | |
4652 CODING_SYSTEM_HAS_METHOD (gzip, print); | |
4653 CODING_SYSTEM_HAS_METHOD (gzip, init_coding_stream); | |
4654 CODING_SYSTEM_HAS_METHOD (gzip, rewind_coding_stream); | |
4655 CODING_SYSTEM_HAS_METHOD (gzip, putprop); | |
4656 CODING_SYSTEM_HAS_METHOD (gzip, getprop); | |
4657 #endif | |
4658 } | |
4659 | |
4660 void | |
4661 reinit_coding_system_type_create (void) | |
4662 { | |
4663 REINITIALIZE_CODING_SYSTEM_TYPE (no_conversion); | |
4664 REINITIALIZE_CODING_SYSTEM_TYPE (convert_eol); | |
4665 REINITIALIZE_CODING_SYSTEM_TYPE (undecided); | |
4666 REINITIALIZE_CODING_SYSTEM_TYPE (chain); | |
4667 #if 0 | |
4668 REINITIALIZE_CODING_SYSTEM_TYPE (text_file_wrapper); | |
4669 #endif /* 0 */ | |
4670 #ifdef DEBUG_XEMACS | |
4671 REINITIALIZE_CODING_SYSTEM_TYPE (internal); | |
4672 #endif | |
4673 #ifdef HAVE_ZLIB | |
4674 REINITIALIZE_CODING_SYSTEM_TYPE (gzip); | |
4675 #endif | |
4676 } | |
4677 | |
4678 void | |
4679 reinit_vars_of_file_coding (void) | |
4680 { | |
5713 } | 4681 } |
5714 | 4682 |
5715 void | 4683 void |
5716 vars_of_file_coding (void) | 4684 vars_of_file_coding (void) |
5717 { | 4685 { |
5718 int i; | 4686 reinit_vars_of_file_coding (); |
5719 | 4687 |
5720 fcd = xnew (struct file_coding_dump); | 4688 /* We always have file-coding support */ |
5721 dump_add_root_struct_ptr (&fcd, &fcd_description); | |
5722 | |
5723 /* Initialize to something reasonable ... */ | |
5724 for (i = 0; i < CODING_CATEGORY_LAST; i++) | |
5725 { | |
5726 fcd->coding_category_system[i] = Qnil; | |
5727 fcd->coding_category_by_priority[i] = i; | |
5728 } | |
5729 | |
5730 Fprovide (intern ("file-coding")); | 4689 Fprovide (intern ("file-coding")); |
5731 | 4690 |
5732 DEFVAR_LISP ("keyboard-coding-system", &Vkeyboard_coding_system /* | 4691 DEFVAR_LISP ("keyboard-coding-system", &Vkeyboard_coding_system /* |
5733 Coding system used for TTY keyboard input. | 4692 Coding system used for TTY keyboard input. |
5734 Not used under a windowing system. | 4693 Not used under a windowing system. |
5766 Coding system used to convert pathnames when accessing files. | 4725 Coding system used to convert pathnames when accessing files. |
5767 */ ); | 4726 */ ); |
5768 Vfile_name_coding_system = Qnil; | 4727 Vfile_name_coding_system = Qnil; |
5769 | 4728 |
5770 DEFVAR_BOOL ("enable-multibyte-characters", &enable_multibyte_characters /* | 4729 DEFVAR_BOOL ("enable-multibyte-characters", &enable_multibyte_characters /* |
5771 Non-nil means the buffer contents are regarded as multi-byte form | 4730 Setting this has no effect. It is purely for FSF compatibility. |
5772 of characters, not a binary code. This affects the display, file I/O, | |
5773 and behaviors of various editing commands. | |
5774 | |
5775 Setting this to nil does not do anything. | |
5776 */ ); | 4731 */ ); |
5777 enable_multibyte_characters = 1; | 4732 enable_multibyte_characters = 1; |
4733 | |
4734 Vchain_canonicalize_hash_table = | |
4735 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQUAL); | |
4736 staticpro (&Vchain_canonicalize_hash_table); | |
4737 | |
4738 #ifdef DEBUG_XEMACS | |
4739 DEFVAR_LISP ("debug-coding-detection", &Vdebug_coding_detection /* | |
4740 If non-nil, display debug information about detection operations in progress. | |
4741 Information is displayed on stderr. | |
4742 */ ); | |
4743 Vdebug_coding_detection = Qnil; | |
4744 #endif | |
5778 } | 4745 } |
5779 | 4746 |
5780 void | 4747 void |
5781 complex_vars_of_file_coding (void) | 4748 complex_vars_of_file_coding (void) |
5782 { | 4749 { |
5783 staticpro (&Vcoding_system_hash_table); | 4750 Fmake_coding_system |
5784 Vcoding_system_hash_table = | 4751 (Qconvert_eol_cr, Qconvert_eol, |
5785 make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ); | 4752 build_msg_string ("Convert CR to LF"), |
5786 | 4753 nconc2 (list6 (Qdocumentation, |
5787 the_codesys_prop_dynarr = Dynarr_new (codesys_prop); | 4754 build_msg_string ( |
5788 dump_add_root_struct_ptr (&the_codesys_prop_dynarr, &codesys_prop_dynarr_description); | 4755 "Converts CR (used to mark the end of a line on Macintosh systems) to LF\n" |
5789 | 4756 "(used internally and under Unix to mark the end of a line)."), |
5790 #define DEFINE_CODESYS_PROP(Prop_Type, Sym) do \ | 4757 Qmnemonic, build_string ("CR->LF"), |
5791 { \ | 4758 Qsubtype, Qcr), |
5792 struct codesys_prop csp; \ | 4759 /* VERY IMPORTANT! Tell make-coding-system not to generate |
5793 csp.sym = (Sym); \ | 4760 subsidiaries -- it needs the coding systems we're creating |
5794 csp.prop_type = (Prop_Type); \ | 4761 to do so! */ |
5795 Dynarr_add (the_codesys_prop_dynarr, csp); \ | 4762 list2 (Qeol_type, Qlf))); |
5796 } while (0) | 4763 |
5797 | 4764 Fmake_coding_system |
5798 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qmnemonic); | 4765 (Qconvert_eol_lf, Qconvert_eol, |
5799 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_type); | 4766 build_msg_string ("Convert LF to LF (do nothing)"), |
5800 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_cr); | 4767 nconc2 (list6 (Qdocumentation, |
5801 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_crlf); | 4768 build_msg_string ( |
5802 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_lf); | 4769 "Do nothing."), |
5803 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpost_read_conversion); | 4770 Qmnemonic, build_string ("LF->LF"), |
5804 DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpre_write_conversion); | 4771 Qsubtype, Qlf), |
5805 #ifdef MULE | 4772 /* VERY IMPORTANT! Tell make-coding-system not to generate |
5806 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g0); | 4773 subsidiaries -- it needs the coding systems we're creating |
5807 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g1); | 4774 to do so! */ |
5808 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g2); | 4775 list2 (Qeol_type, Qlf))); |
5809 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g3); | 4776 |
5810 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g0_on_output); | 4777 Fmake_coding_system |
5811 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g1_on_output); | 4778 (Qconvert_eol_crlf, Qconvert_eol, |
5812 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g2_on_output); | 4779 build_msg_string ("Convert CRLF to LF"), |
5813 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g3_on_output); | 4780 nconc2 (list6 (Qdocumentation, |
5814 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qshort); | 4781 build_msg_string ( |
5815 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_eol); | 4782 "Converts CR+LF (used to mark the end of a line on Macintosh systems) to LF\n" |
5816 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_cntl); | 4783 "(used internally and under Unix to mark the end of a line)."), |
5817 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qseven); | 4784 Qmnemonic, build_string ("CRLF->LF"), |
5818 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qlock_shift); | 4785 Qsubtype, Qcrlf), |
5819 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_iso6429); | 4786 /* VERY IMPORTANT! Tell make-coding-system not to generate |
5820 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qescape_quoted); | 4787 subsidiaries -- it needs the coding systems we're creating |
5821 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qinput_charset_conversion); | 4788 to do so! */ |
5822 DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qoutput_charset_conversion); | 4789 list2 (Qeol_type, Qlf))); |
5823 | 4790 |
5824 DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qencode); | 4791 Fmake_coding_system |
5825 DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qdecode); | 4792 (Qconvert_eol_autodetect, Qconvert_eol, |
5826 #endif /* MULE */ | 4793 build_msg_string ("Autodetect EOL type"), |
4794 nconc2 (list6 (Qdocumentation, | |
4795 build_msg_string ( | |
4796 "Autodetect the end-of-line type."), | |
4797 Qmnemonic, build_string ("Auto-EOL"), | |
4798 Qsubtype, Qautodetect), | |
4799 /* VERY IMPORTANT! Tell make-coding-system not to generate | |
4800 subsidiaries -- it needs the coding systems we're creating | |
4801 to do so! */ | |
4802 list2 (Qeol_type, Qlf))); | |
4803 | |
4804 Fmake_coding_system | |
4805 (Qundecided, Qundecided, | |
4806 build_msg_string ("Undecided (auto-detect)"), | |
4807 nconc2 (list4 (Qdocumentation, | |
4808 build_msg_string | |
4809 ("Automatically detects the correct encoding."), | |
4810 Qmnemonic, build_string ("Auto")), | |
4811 list6 (Qdo_eol, Qt, Qdo_coding, Qt, | |
4812 /* We do EOL detection ourselves so we don't need to be | |
4813 wrapped in an EOL detector. (It doesn't actually hurt, | |
4814 though, I don't think.) */ | |
4815 Qeol_type, Qlf))); | |
4816 | |
4817 Fmake_coding_system | |
4818 (intern ("undecided-dos"), Qundecided, | |
4819 build_msg_string ("Undecided (auto-detect) (CRLF)"), | |
4820 nconc2 (list4 (Qdocumentation, | |
4821 build_msg_string | |
4822 ("Automatically detects the correct encoding; EOL type of CRLF forced."), | |
4823 Qmnemonic, build_string ("Auto")), | |
4824 list4 (Qdo_coding, Qt, | |
4825 Qeol_type, Qcrlf))); | |
4826 | |
4827 Fmake_coding_system | |
4828 (intern ("undecided-unix"), Qundecided, | |
4829 build_msg_string ("Undecided (auto-detect) (LF)"), | |
4830 nconc2 (list4 (Qdocumentation, | |
4831 build_msg_string | |
4832 ("Automatically detects the correct encoding; EOL type of LF forced."), | |
4833 Qmnemonic, build_string ("Auto")), | |
4834 list4 (Qdo_coding, Qt, | |
4835 Qeol_type, Qlf))); | |
4836 | |
4837 Fmake_coding_system | |
4838 (intern ("undecided-mac"), Qundecided, | |
4839 build_msg_string ("Undecided (auto-detect) (CR)"), | |
4840 nconc2 (list4 (Qdocumentation, | |
4841 build_msg_string | |
4842 ("Automatically detects the correct encoding; EOL type of CR forced."), | |
4843 Qmnemonic, build_string ("Auto")), | |
4844 list4 (Qdo_coding, Qt, | |
4845 Qeol_type, Qcr))); | |
4846 | |
5827 /* Need to create this here or we're really screwed. */ | 4847 /* Need to create this here or we're really screwed. */ |
5828 Fmake_coding_system | 4848 Fmake_coding_system |
5829 (Qraw_text, Qno_conversion, | 4849 (Qraw_text, Qno_conversion, |
5830 build_string ("Raw text, which means it converts only line-break-codes."), | 4850 build_msg_string ("Raw Text"), |
5831 list2 (Qmnemonic, build_string ("Raw"))); | 4851 list4 (Qdocumentation, |
4852 build_msg_string ("Raw text converts only line-break codes, and acts otherwise like `binary'."), | |
4853 Qmnemonic, build_string ("Raw"))); | |
5832 | 4854 |
5833 Fmake_coding_system | 4855 Fmake_coding_system |
5834 (Qbinary, Qno_conversion, | 4856 (Qbinary, Qno_conversion, |
5835 build_string ("Binary, which means it does not convert anything."), | 4857 build_msg_string ("Binary"), |
5836 list4 (Qeol_type, Qlf, | 4858 list6 (Qdocumentation, |
4859 build_msg_string ( | |
4860 "This coding system is as close as it comes to doing no conversion.\n" | |
4861 "On input, each byte is converted directly into the character\n" | |
4862 "with the corresponding code -- i.e. from the `ascii', `control-1',\n" | |
4863 "or `latin-1' character sets. On output, these characters are\n" | |
4864 "converted back to the corresponding bytes, and other characters\n" | |
4865 "are converted to the default character, i.e. `~'."), | |
4866 Qeol_type, Qlf, | |
5837 Qmnemonic, build_string ("Binary"))); | 4867 Qmnemonic, build_string ("Binary"))); |
5838 | 4868 |
5839 Fdefine_coding_system_alias (Qno_conversion, Qraw_text); | 4869 /* Formerly aliased to raw-text! Completely bogus and not even the same |
5840 | 4870 as FSF Emacs. */ |
4871 Fdefine_coding_system_alias (Qno_conversion, Qbinary); | |
4872 Fdefine_coding_system_alias (intern ("no-conversion-unix"), | |
4873 intern ("raw-text-unix")); | |
4874 Fdefine_coding_system_alias (intern ("no-conversion-dos"), | |
4875 intern ("raw-text-dos")); | |
4876 Fdefine_coding_system_alias (intern ("no-conversion-mac"), | |
4877 intern ("raw-text-mac")); | |
4878 | |
4879 /* These four below will get their defaults set correctly in | |
4880 code-init.el. We init them now so we can handle stuff at dump | |
4881 time before we get to code-init.el. */ | |
5841 Fdefine_coding_system_alias (Qfile_name, Qbinary); | 4882 Fdefine_coding_system_alias (Qfile_name, Qbinary); |
4883 Fdefine_coding_system_alias (Qnative, Qfile_name); | |
5842 | 4884 |
5843 Fdefine_coding_system_alias (Qterminal, Qbinary); | 4885 Fdefine_coding_system_alias (Qterminal, Qbinary); |
5844 Fdefine_coding_system_alias (Qkeyboard, Qbinary); | 4886 Fdefine_coding_system_alias (Qkeyboard, Qbinary); |
5845 | 4887 |
4888 Fdefine_coding_system_alias (Qidentity, Qconvert_eol_lf); | |
4889 | |
5846 /* Need this for bootstrapping */ | 4890 /* Need this for bootstrapping */ |
5847 fcd->coding_category_system[CODING_CATEGORY_NO_CONVERSION] = | 4891 coding_category_system[detector_category_no_conversion] = |
5848 Fget_coding_system (Qraw_text); | 4892 Fget_coding_system (Qraw_text); |
5849 | 4893 } |
5850 #ifdef MULE | |
5851 { | |
5852 int i; | |
5853 | |
5854 for (i = 0; i < countof (fcd->ucs_to_mule_table); i++) | |
5855 fcd->ucs_to_mule_table[i] = Qnil; | |
5856 } | |
5857 staticpro (&mule_to_ucs_table); | |
5858 mule_to_ucs_table = Fmake_char_table (Qgeneric); | |
5859 #endif /* MULE */ | |
5860 } |