Mercurial > hg > xemacs-beta
diff src/file-coding.c @ 771:943eaba38521
[xemacs-hg @ 2002-03-13 08:51:24 by ben]
The big ben-mule-21-5 check-in!
Various files were added and deleted. See CHANGES-ben-mule.
There are still some test suite failures. No crashes, though.
Many of the failures have to do with problems in the test suite itself
rather than in the actual code. I'll be addressing these in the next
day or so -- none of the test suite failures are at all critical.
Meanwhile I'll be trying to address the biggest issues -- i.e. build
or run failures, which will almost certainly happen on various platforms.
All comments should be sent to ben@xemacs.org -- use a Cc: if necessary
when sending to mailing lists. There will be pre- and post- tags,
something like
pre-ben-mule-21-5-merge-in, and
post-ben-mule-21-5-merge-in.
author | ben |
---|---|
date | Wed, 13 Mar 2002 08:54:06 +0000 |
parents | fdefd0186b75 |
children | 2923009caf47 |
line wrap: on
line diff
--- a/src/file-coding.c Fri Mar 08 13:33:14 2002 +0000 +++ b/src/file-coding.c Wed Mar 13 08:54:06 2002 +0000 @@ -1,6 +1,8 @@ -/* Code conversion functions. +/* Text encoding conversion functions; coding-system object. + #### rename me to coding-system.c or coding.c Copyright (C) 1991, 1995 Free Software Foundation, Inc. Copyright (C) 1995 Sun Microsystems, Inc. + Copyright (C) 2000, 2001, 2002 Ben Wing. This file is part of XEmacs. @@ -19,9 +21,309 @@ the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ -/* Synched up with: Mule 2.3. Not in FSF. */ - -/* Rewritten by Ben Wing <ben@xemacs.org>. */ +/* Synched up with: Not in FSF. */ + +/* Authorship: + + Current primary author: Ben Wing <ben@xemacs.org> + + Rewritten by Ben Wing <ben@xemacs.org>, based originally on coding.c + from Mule 2.? but probably does not share one line of code with that + original source. Rewriting work started around Dec. 1994. or Jan. 1995. + Proceeded in earnest till Nov. 1995. + + Around Feb. 17, 1998, Andy Piper renamed what was then mule-coding.c to + file-coding.c, with the intention of using it to do end-of-line conversion + on non-MULE machines (specifically, on Windows machines). He separated + out the MULE stuff from non-MULE using ifdef's, and searched throughout + the rest of the source tree looking for coding-system-related code that + was ifdef MULE but should be ifdef HAVE_CODING_SYSTEMS. + + Sept. 4 - 8, 1998, Tomohiko Morioka added the UCS_4 and UTF_8 coding system + types, providing a primitive means of decoding and encoding externally- + formatted Unicode/UCS_4 and Unicode/UTF_8 data. + + January 25, 2000, Martin Buchholz redid and fleshed out the coding + system alias handling that was first added in prototype form by + Hrjove Niksic, April 15, 1999. + + April to May 2000, Ben Wing: More major reorganization. Adding features + needed for MS Windows (multibyte, unicode, unicode-to-multibyte), the + "chain" coding system for chaining two together, and doing a lot of + reorganization in preparation for properly abstracting out the different + coding system types. + + June 2001, Ben Wing: Added Unicode support. Eliminated previous + junky Unicode translation support. + + August 2001, Ben Wing: Moved Unicode support to unicode.c. Finished + abstracting everything except detection, which is hard to abstract (see + just below). + + September 2001, Ben Wing: Moved Mule code to mule-coding.c, Windows code + to intl-win32.c. Lots more rewriting; very little code is untouched + from before April 2000. Abstracted the detection code, added multiple + levels of likelihood to increase the reliability of the algorithm. + + October 2001, Ben Wing: HAVE_CODING_SYSTEMS is always now defined. + Removed the conditionals. + */ + +/* Comments about future work + +------------------------------------------------------------------ + ABOUT DETECTION +------------------------------------------------------------------ + + however, in general the detection code has major problems and needs lots + of work: + + -- instead of merely "yes" or "no" for particular categories, we need a + more flexible system, with various levels of likelihood. Currently + I've created a system with six levels, as follows: + + [see file-coding.h] + + Let's consider what this might mean for an ASCII text detector. (In + order to have accurate detection, especially given the iteration I + proposed below, we need active detectors for *all* types of data we + might reasonably encounter, such as ASCII text files, binary files, + and possibly other sorts of ASCII files, and not assume that simply + "falling back to no detection" will work at all well.) + + An ASCII text detector DOES NOT report ASCII text as level 0, since + that's what the detector is looking for. Such a detector ideally + wants all bytes in the range 0x20 - 0x7E (no high bytes!), except for + whitespace control chars and perhaps a few others; LF, CR, or CRLF + sequences at regular intervals (where "regular" might mean an average + < 100 chars and 99% < 300 for code and other stuff of the "text file + w/line breaks" variety, but for the "text file w/o line breaks" + variety, excluding blank lines, averages could easily be 600 or more + with 2000-3000 char "lines" not so uncommon); similar statistical + variance between odds and evens (not Unicode); frequent occurrences of + the space character; letters more common than non-letters; etc. Also + checking for too little variability between frequencies of characters + and for exclusion of particular characters based on character ranges + can catch ASCII encodings like base-64, UUEncode, UTF-7, etc. + Granted, this doesn't even apply to everything called "ASCII", and we + could potentially distinguish off ASCII for code, ASCII for text, + etc. as separate categories. However, it does give us a lot to work + off of, in deciding what likelihood to choose -- and it shows there's + in fact a lot of detectable patterns to look for even in something + seemingly so generic as ASCII. The detector would report most text + files in level 1 or level 2. EUC encodings, Shift-JIS, etc. probably + go to level -1 because they also pass the EOL test and all other tests + for the ASCII part of the text, but have lots of high bytes, which in + essence turn them into binary. Aberrant text files like something in + BASE64 encoding might get placed in level 0, because they pass most + tests but fail dramatically the frequency test; but they should not be + reported as any lower, because that would cause explicit prompting, + and the user should be able any valid text file without prompting. + The escape sequences and the base-64-type checks might send 7-bit + iso2022 to 0, but probably not -1, for similar reasons. + + -- The assumed algorithm for the above detection levels is to in essence + sort categories first by detection level and then by priority. + Perhaps, however, we would want smarter algorithms, or at least + something user-controllable -- in particular, when (other than no + category at level 0 or greater) do we prompt the user to pick a + category? + + -- Improvements in how the detection algorithm works: we want to handle + lots of different ways something could be encoded, including multiple + stacked encodings. trying to specify a series of detection levels + (check for base64 first, then check for gzip, then check for an i18n + decoding, then for crlf) won't generally work. for example, what + about the same encoding appearing more than once? for example, take + euc-jp, base64'd, then gzip'd, then base64'd again: this could well + happen, and you could specify the encodings specifically as + base64|gzip|base64|euc-jp, but we'd like to autodetect it without + worrying about exactly what order these things appear in. we should + allow for iterating over detection/decoding cycles until we reach + some maximum (we got stuck in a loop, due to incorrect category + tables or detection algorithms), have no reported detection levels + over -1, or we end up with no change after a decoding pass (i.e. the + coding system associated with a chosen category was `no-conversion' + or something equivalent). it might make sense to divide things into + two phases (internal and external), where the internal phase has a + separate category list and would probably mostly end up handling EOL + detection; but the i think about it, the more i disagree. with + properly written detectors, and properly organized tables (in + general, those decodings that are more "distinctive" and thus + detectable with greater certainty go lower on the list), we shouldn't + need two phases. for example, let's say the example above was also + in CRLF format. The EOL detector (which really detects *plain text* + with a particular EOL type) would return at most level 0 for all + results until the text file is reached, whereas the base64, gzip or + euc-jp decoders will return higher. Once the text file is reached, + the EOL detector will return 0 or higher for the CRLF encoding, and + all other decoders will return 0 or lower; thus, we will successfully + proceed through CRLF decoding, or at worst prompt the user. (The only + external-vs-internal distinction that might make sense here is to + favor coding systems of the correct source type over those that + require conversion between external and internal; if done right, this + could allow the CRLF detector to return level 1 for all CRLF-encoded + text files, even those that look like Base-64 or similar encoding, so + that CRLF encoding will always get decoded without prompting, but not + interfere with other decoders. On the other hand, this + external-vs-internal distinction may not matter at all -- with + automatic internal-external conversion, CRLF decoding can occur + before or after decoding of euc-jp, base64, iso2022, or similar, + without any difference in the final results.) + + -- There need to be two priority lists and two + category->coding-system lists. Once is general, the other + category->langenv-specific. The user sets the former, the langenv + category->the latter. The langenv-specific entries take precedence + category->over the others. This works similarly to the + category->category->Unicode charset priority list. + + -- The simple list of coding categories per detectors is not enough. + Instead of coding categories, we need parameters. For example, + Unicode might have separate detectors for UTF-8, UTF-7, UTF-16, + and perhaps UCS-4; or UTF-16/UCS-4 would be one detection type. + UTF-16 would have parameters such as "little-endian" and "needs BOM", + and possibly another one like "collapse/expand/leave alone composite + sequences" once we add this support. Usually these parameters + correspond directly to a coding system parameter. Different + likelihood values can be specified for each parameter as well as for + the detection type as a whole. The user can specify particular + coding systems for a particular combination of detection type and + parameters, or can give "default parameters" associated with a + detection type. In the latter case, we create a new coding system as + necessary that corresponds to the detected type and parameters. + + -- a better means of presentation. rather than just coming up + with the new file decoded according to the detected coding + system, allow the user to browse through the file and + conveniently reject it if it looks wrong; then detection + starts again, but with that possibility removed. in cases where + certainty is low and thus more than one possibility is presented, + the user can browse each one and select one or reject them all. + + -- fail-safe: even after the user has made a choice, if they + later on realize they have the wrong coding system, they can + go back, and we've squirreled away the original data so they + can start the process over. this may be tricky. + + -- using a larger buffer for detection. we use just a small + piece, which can give quite random results. we may need to + buffer up all the data we look through because we can't + necessarily rewind. the idea is we proceed until we get a + result that's at least at a certain level of certainty + (e.g. "probable") or we reached a maximum limit of how much + we want to buffer. + + -- dealing with interactive systems. we might need to go ahead + and present the data before we've finished detection, and + then re-decode it, perhaps multiple times, as we get better + detection results. + + -- Clearly some of these are more important than others. at the + very least, the "better means of presentation" should be + implementation as soon as possibl, along with a very simple means + of fail-safe whenever the data is readibly available, e.g. it's + coming from a file, which is the most common scenario. + + +------------------------------------------------------------------ + ABOUT FORMATS +------------------------------------------------------------------ + +when calling make-coding-system, the name can be a cons of (format1 . +format2), specifying that it decodes format1->format2 and encodes the other +way. if only one name is given, that is assumed to be format1, and the +other is either `external' or `internal' depending on the end type. +normally the user when decoding gives the decoding order in formats, but +can leave off the last one, `internal', which is assumed. a multichain +might look like gzip|multibyte|unicode, using the coding systems named +`gzip', `(unicode . multibyte)' and `unicode'. the way this actually works +is by searching for gzip->multibyte; if not found, look for gzip->external +or gzip->internal. (In general we automatically do conversion between +internal and external as necessary: thus gzip|crlf does the expected, and +maps to gzip->external, external->internal, crlf->internal, which when +fully specified would be gzip|external:external|internal:crlf|internal -- +see below.) To forcibly fit together two converters that have explicitly +specified and incompatible names (say you have unicode->multibyte and +iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this +case are compatible), you can force-cast using :, like this: +ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between +internal and external formats, the conversion happens automatically.) + +-------------------------------------------------------------------------- +ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS +-------------------------------------------------------------------------- + +-- there's the problem that XEmacs can't be run in a directory with + non-ASCII/Latin-1 chars in it, since it will be doing Unicode + processing before we've had a chance to load the tables. In fact, + even finding the tables in such a situation is problematic using + the normal commands. my idea is to eventually load the stuff + extremely extremely early, at the same time as the pdump data gets + loaded. in fact, the unicode table data (stored in an efficient + binary format) can even be stuck into the pdump file (which would + mean as a resource to the executable, for windows). we'd need to + extend pdump a bit: to allow for attaching extra data to the pdump + file. (something like pdump_attach_extra_data (addr, length) + returns a number of some sort, an index into the file, which you + can then retrieve with pdump_load_extra_data(), which returns an + addr (mmap()ed or loaded), and later you pdump_unload_extra_data() + when finished. we'd probably also need + pdump_attach_extra_data_append(), which appends data to the data + just written out with pdump_attach_extra_data(). this way, + multiple tables in memory can be written out into one contiguous + table. (we'd use the tar-like trick of allowing new blocks to be + written without going back to change the old blocks -- we just rely + on the end of file/end of memory.) this same mechanism could be + extracted out of pdump and used to handle the non-pdump situation + (or alternatively, we could just dump either the memory image of + the tables themselves or the compressed binary version). in the + case of extra unicode tables not known about at compile time that + get loaded before dumping, we either just dump them into the image + (pdump and all) or extract them into the compressed binary format, + free the original tables, and treat them like all other tables. + +-------------------------------------------------------------------------- + HANDLING WRITING A FILE SAFELY, WITHOUT DATA LOSS +-------------------------------------------------------------------------- + + -- When writing a file, we need error detection; otherwise somebody + will create a Unicode file without realizing the coding system + of the buffer is Raw, and then lose all the non-ASCII/Latin-1 + text when it's written out. We need two levels + + 1. first, a "safe-charset" level that checks before any actual + encoding to see if all characters in the document can safely + be represented using the given coding system. FSF has a + "safe-charset" property of coding systems, but it's stupid + because this information can be automatically derived from + the coding system, at least the vast majority of the time. + What we need is some sort of + alternative-coding-system-precedence-list, langenv-specific, + where everything on it can be checked for safe charsets and + then the user given a list of possibilities. When the user + does "save with specified encoding", they should see the same + precedence list. Again like with other precedence lists, + there's also a global one, and presumably all coding systems + not on other list get appended to the end (and perhaps not + checked at all when doing safe-checking?). safe-checking + should work something like this: compile a list of all + charsets used in the buffer, along with a count of chars + used. that way, "slightly unsafe" charsets can perhaps be + presented at the end, which will lose only a few characters + and are perhaps what the users were looking for. + + 2. when actually writing out, we need error checking in case an + individual char in a charset can't be written even though the + charsets are safe. again, the user gets the choice of other + reasonable coding systems. + + 3. same thing (error checking, list of alternatives, etc.) needs + to happen when reading! all of this will be a lot of work! + + + --ben +*/ #include <config.h> #include "lisp.h" @@ -31,11 +333,11 @@ #include "insdel.h" #include "lstream.h" #include "opaque.h" -#ifdef MULE -#include "mule-ccl.h" -#include "chartab.h" +#include "file-coding.h" + +#ifdef HAVE_ZLIB +#include "zlib.h" #endif -#include "file-coding.h" Lisp_Object Vkeyboard_coding_system; Lisp_Object Vterminal_coding_system; @@ -43,43 +345,112 @@ Lisp_Object Vcoding_system_for_write; Lisp_Object Vfile_name_coding_system; -/* Table of symbols identifying each coding category. */ -Lisp_Object coding_category_symbol[CODING_CATEGORY_LAST]; - - - -struct file_coding_dump { - /* Coding system currently associated with each coding category. */ - Lisp_Object coding_category_system[CODING_CATEGORY_LAST]; - - /* Table of all coding categories in decreasing order of priority. - This describes a permutation of the possible coding categories. */ - int coding_category_by_priority[CODING_CATEGORY_LAST]; - -#ifdef MULE - Lisp_Object ucs_to_mule_table[65536]; +#ifdef DEBUG_XEMACS +Lisp_Object Vdebug_coding_detection; #endif -} *fcd; - -static const struct lrecord_description fcd_description_1[] = { - { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, coding_category_system), CODING_CATEGORY_LAST }, -#ifdef MULE - { XD_LISP_OBJECT_ARRAY, offsetof (struct file_coding_dump, ucs_to_mule_table), countof (fcd->ucs_to_mule_table) }, -#endif + +typedef struct coding_system_type_entry +{ + struct coding_system_methods *meths; +} coding_system_type_entry; + +typedef struct +{ + Dynarr_declare (coding_system_type_entry); +} coding_system_type_entry_dynarr; + +static coding_system_type_entry_dynarr *the_coding_system_type_entry_dynarr; + +static const struct lrecord_description cste_description_1[] = { + { XD_STRUCT_PTR, offsetof (coding_system_type_entry, meths), 1, &coding_system_methods_description }, + { XD_END } +}; + +static const struct struct_description cste_description = { + sizeof (coding_system_type_entry), + cste_description_1 +}; + +static const struct lrecord_description csted_description_1[] = { + XD_DYNARR_DESC (coding_system_type_entry_dynarr, &cste_description), { XD_END } }; -static const struct struct_description fcd_description = { - sizeof (struct file_coding_dump), - fcd_description_1 +static const struct struct_description csted_description = { + sizeof (coding_system_type_entry_dynarr), + csted_description_1 +}; + +static Lisp_Object Vcoding_system_type_list; + +/* Coding system currently associated with each coding category. */ +Lisp_Object coding_category_system[MAX_DETECTOR_CATEGORIES]; + +/* Table of all coding categories in decreasing order of priority. + This describes a permutation of the possible coding categories. */ +int coding_category_by_priority[MAX_DETECTOR_CATEGORIES]; + +/* Value used with to give a unique name to nameless coding systems */ +int coding_system_tick; + +int coding_detector_count; +int coding_detector_category_count; + +detector_dynarr *all_coding_detectors; + +static const struct lrecord_description struct_detector_category_description_1[] += +{ + { XD_LISP_OBJECT, offsetof (struct detector_category, sym) }, + { XD_END } +}; + +static const struct struct_description struct_detector_category_description = +{ + sizeof (struct detector_category), + struct_detector_category_description_1 }; -Lisp_Object mule_to_ucs_table; +static const struct lrecord_description detector_category_dynarr_description_1[] = +{ + XD_DYNARR_DESC (detector_category_dynarr, + &struct_detector_category_description), + { XD_END } +}; + +static const struct struct_description detector_category_dynarr_description = { + sizeof (detector_category_dynarr), + detector_category_dynarr_description_1 +}; + +static const struct lrecord_description struct_detector_description_1[] += +{ + { XD_STRUCT_PTR, offsetof (struct detector, cats), 1, + &detector_category_dynarr_description }, + { XD_END } +}; + +static const struct struct_description struct_detector_description = +{ + sizeof (struct detector), + struct_detector_description_1 +}; + +static const struct lrecord_description detector_dynarr_description_1[] = +{ + XD_DYNARR_DESC (detector_dynarr, &struct_detector_description), + { XD_END } +}; + +static const struct struct_description detector_dynarr_description = { + sizeof (detector_dynarr), + detector_dynarr_description_1 +}; Lisp_Object Qcoding_systemp; -Lisp_Object Qraw_text, Qno_conversion, Qccl, Qiso2022; -/* Qinternal in general.c */ +Lisp_Object Qraw_text; Lisp_Object Qmnemonic, Qeol_type; Lisp_Object Qcr, Qcrlf, Qlf; @@ -87,399 +458,234 @@ Lisp_Object Qpost_read_conversion; Lisp_Object Qpre_write_conversion; -#ifdef MULE -Lisp_Object Qucs4, Qutf8; -Lisp_Object Qbig5, Qshift_jis; -Lisp_Object Qcharset_g0, Qcharset_g1, Qcharset_g2, Qcharset_g3; -Lisp_Object Qforce_g0_on_output, Qforce_g1_on_output; -Lisp_Object Qforce_g2_on_output, Qforce_g3_on_output; -Lisp_Object Qno_iso6429; -Lisp_Object Qinput_charset_conversion, Qoutput_charset_conversion; +Lisp_Object Qtranslation_table_for_decode; +Lisp_Object Qtranslation_table_for_encode; +Lisp_Object Qsafe_chars; +Lisp_Object Qsafe_charsets; +Lisp_Object Qmime_charset; +Lisp_Object Qvalid_codes; + +Lisp_Object Qno_conversion; +Lisp_Object Qconvert_eol; Lisp_Object Qescape_quoted; -Lisp_Object Qshort, Qno_ascii_eol, Qno_ascii_cntl, Qseven, Qlock_shift; +Lisp_Object Qencode, Qdecode; + +Lisp_Object Qconvert_eol_lf, Qconvert_eol_cr, Qconvert_eol_crlf; +Lisp_Object Qconvert_eol_autodetect; + +Lisp_Object Qnear_certainty, Qquite_probable, Qsomewhat_likely; +Lisp_Object Qas_likely_as_unlikely, Qsomewhat_unlikely, Qquite_improbable; +Lisp_Object Qnearly_impossible; + +Lisp_Object Qdo_eol, Qdo_coding; + +Lisp_Object Qcanonicalize_after_coding; + +/* This is used to convert autodetected coding systems into existing + systems. For example, the chain undecided->convert-eol-autodetect may + have its separate parts detected as mswindows-multibyte and + convert-eol-crlf, and the result needs to be mapped to + mswindows-multibyte-dos. */ +/* #### It's not clear we need this whole chain-canonicalize mechanism + any more. */ +static Lisp_Object Vchain_canonicalize_hash_table; + +#ifdef HAVE_ZLIB +Lisp_Object Qgzip; #endif -Lisp_Object Qencode, Qdecode; - -Lisp_Object Vcoding_system_hash_table; + +/* Maps coding system names to either coding system objects or (for + aliases) other names. */ +static Lisp_Object Vcoding_system_hash_table; int enable_multibyte_characters; -#ifdef MULE -/* Additional information used by the ISO2022 decoder and detector. */ -struct iso2022_decoder -{ - /* CHARSET holds the character sets currently assigned to the G0 - through G3 variables. It is initialized from the array - INITIAL_CHARSET in CODESYS. */ - Lisp_Object charset[4]; - - /* Which registers are currently invoked into the left (GL) and - right (GR) halves of the 8-bit encoding space? */ - int register_left, register_right; - - /* ISO_ESC holds a value indicating part of an escape sequence - that has already been seen. */ - enum iso_esc_flag esc; - - /* This records the bytes we've seen so far in an escape sequence, - in case the sequence is invalid (we spit out the bytes unchanged). */ - unsigned char esc_bytes[8]; - - /* Index for next byte to store in ISO escape sequence. */ - int esc_bytes_index; - -#ifdef ENABLE_COMPOSITE_CHARS - /* Stuff seen so far when composing a string. */ - unsigned_char_dynarr *composite_chars; -#endif - - /* If we saw an invalid designation sequence for a particular - register, we flag it here and switch to ASCII. The next time we - see a valid designation for this register, we turn off the flag - and do the designation normally, but pretend the sequence was - invalid. The effect of all this is that (most of the time) the - escape sequences for both the switch to the unknown charset, and - the switch back to the known charset, get inserted literally into - the buffer and saved out as such. The hope is that we can - preserve the escape sequences so that the resulting written out - file makes sense. If we don't do any of this, the designation - to the invalid charset will be preserved but that switch back - to the known charset will probably get eaten because it was - the same charset that was already present in the register. */ - unsigned char invalid_designated[4]; - - /* We try to do similar things as above for direction-switching - sequences. If we encountered a direction switch while an - invalid designation was present, or an invalid designation - just after a direction switch (i.e. no valid designation - encountered yet), we insert the direction-switch escape - sequence literally into the output stream, and later on - insert the corresponding direction-restoring escape sequence - literally also. */ - unsigned int switched_dir_and_no_valid_charset_yet :1; - unsigned int invalid_switch_dir :1; - - /* Tells the decoder to output the escape sequence literally - even though it was valid. Used in the games we play to - avoid lossage when we encounter invalid designations. */ - unsigned int output_literally :1; - /* We encountered a direction switch followed by an invalid - designation. We didn't output the direction switch - literally because we didn't know about the invalid designation; - but we have to do so now. */ - unsigned int output_direction_sequence :1; -}; -#endif /* MULE */ EXFUN (Fcopy_coding_system, 2); -#ifdef MULE -struct detection_state; -static int detect_coding_sjis (struct detection_state *st, - const Extbyte *src, Bytecount n); -static void decode_coding_sjis (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_sjis (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static int detect_coding_big5 (struct detection_state *st, - const Extbyte *src, Bytecount n); -static void decode_coding_big5 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_big5 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static int detect_coding_ucs4 (struct detection_state *st, - const Extbyte *src, Bytecount n); -static void decode_coding_ucs4 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_ucs4 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static int detect_coding_utf8 (struct detection_state *st, - const Extbyte *src, Bytecount n); -static void decode_coding_utf8 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_utf8 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static int postprocess_iso2022_mask (int mask); -static void reset_iso2022 (Lisp_Object coding_system, - struct iso2022_decoder *iso); -static int detect_coding_iso2022 (struct detection_state *st, - const Extbyte *src, Bytecount n); -static void decode_coding_iso2022 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_iso2022 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -#endif /* MULE */ -static void decode_coding_no_conversion (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void encode_coding_no_conversion (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void mule_decode (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n); -static void mule_encode (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n); - -typedef struct codesys_prop codesys_prop; -struct codesys_prop -{ - Lisp_Object sym; - int prop_type; -}; - -typedef struct -{ - Dynarr_declare (codesys_prop); -} codesys_prop_dynarr; - -static const struct lrecord_description codesys_prop_description_1[] = { - { XD_LISP_OBJECT, offsetof (codesys_prop, sym) }, - { XD_END } -}; - -static const struct struct_description codesys_prop_description = { - sizeof (codesys_prop), - codesys_prop_description_1 -}; - -static const struct lrecord_description codesys_prop_dynarr_description_1[] = { - XD_DYNARR_DESC (codesys_prop_dynarr, &codesys_prop_description), - { XD_END } -}; - -static const struct struct_description codesys_prop_dynarr_description = { - sizeof (codesys_prop_dynarr), - codesys_prop_dynarr_description_1 -}; - -codesys_prop_dynarr *the_codesys_prop_dynarr; - -enum codesys_prop_enum -{ - CODESYS_PROP_ALL_OK, - CODESYS_PROP_ISO2022, - CODESYS_PROP_CCL -}; /************************************************************************/ -/* Coding system functions */ +/* Coding system object methods */ /************************************************************************/ -static Lisp_Object mark_coding_system (Lisp_Object); -static void print_coding_system (Lisp_Object, Lisp_Object, int); -static void finalize_coding_system (void *header, int for_disksave); - -#ifdef MULE -static const struct lrecord_description ccs_description_1[] = { - { XD_LISP_OBJECT, offsetof (charset_conversion_spec, from_charset) }, - { XD_LISP_OBJECT, offsetof (charset_conversion_spec, to_charset) }, - { XD_END } -}; - -static const struct struct_description ccs_description = { - sizeof (charset_conversion_spec), - ccs_description_1 -}; - -static const struct lrecord_description ccsd_description_1[] = { - XD_DYNARR_DESC (charset_conversion_spec_dynarr, &ccs_description), - { XD_END } -}; - -static const struct struct_description ccsd_description = { - sizeof (charset_conversion_spec_dynarr), - ccsd_description_1 -}; -#endif - -static const struct lrecord_description coding_system_description[] = { - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, doc_string) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_lf) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_crlf) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol_cr) }, -#ifdef MULE - { XD_LISP_OBJECT_ARRAY, offsetof (Lisp_Coding_System, iso2022.initial_charset), 4 }, - { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.input_conv), 1, &ccsd_description }, - { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, iso2022.output_conv), 1, &ccsd_description }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.decode) }, - { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, ccl.encode) }, -#endif - { XD_END } -}; - -DEFINE_LRECORD_IMPLEMENTATION ("coding-system", coding_system, - mark_coding_system, print_coding_system, - finalize_coding_system, - 0, 0, coding_system_description, - Lisp_Coding_System); - static Lisp_Object mark_coding_system (Lisp_Object obj) { Lisp_Coding_System *codesys = XCODING_SYSTEM (obj); mark_object (CODING_SYSTEM_NAME (codesys)); - mark_object (CODING_SYSTEM_DOC_STRING (codesys)); + mark_object (CODING_SYSTEM_DESCRIPTION (codesys)); mark_object (CODING_SYSTEM_MNEMONIC (codesys)); + mark_object (CODING_SYSTEM_DOCUMENTATION (codesys)); mark_object (CODING_SYSTEM_EOL_LF (codesys)); mark_object (CODING_SYSTEM_EOL_CRLF (codesys)); mark_object (CODING_SYSTEM_EOL_CR (codesys)); - - switch (CODING_SYSTEM_TYPE (codesys)) - { -#ifdef MULE - int i; - case CODESYS_ISO2022: - for (i = 0; i < 4; i++) - mark_object (CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i)); - if (codesys->iso2022.input_conv) - { - for (i = 0; i < Dynarr_length (codesys->iso2022.input_conv); i++) - { - struct charset_conversion_spec *ccs = - Dynarr_atp (codesys->iso2022.input_conv, i); - mark_object (ccs->from_charset); - mark_object (ccs->to_charset); - } - } - if (codesys->iso2022.output_conv) - { - for (i = 0; i < Dynarr_length (codesys->iso2022.output_conv); i++) - { - struct charset_conversion_spec *ccs = - Dynarr_atp (codesys->iso2022.output_conv, i); - mark_object (ccs->from_charset); - mark_object (ccs->to_charset); - } - } - break; - - case CODESYS_CCL: - mark_object (CODING_SYSTEM_CCL_DECODE (codesys)); - mark_object (CODING_SYSTEM_CCL_ENCODE (codesys)); - break; -#endif /* MULE */ - default: - break; - } + mark_object (CODING_SYSTEM_SUBSIDIARY_PARENT (codesys)); + mark_object (CODING_SYSTEM_CANONICAL (codesys)); + + MAYBE_CODESYSMETH (codesys, mark, (obj)); mark_object (CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys)); return CODING_SYSTEM_POST_READ_CONVERSION (codesys); } static void +print_coding_system_properties (Lisp_Object obj, Lisp_Object printcharfun) +{ + Lisp_Coding_System *c = XCODING_SYSTEM (obj); + print_internal (c->methods->type, printcharfun, 1); + MAYBE_CODESYSMETH (c, print, (obj, printcharfun, 1)); + if (CODING_SYSTEM_EOL_TYPE (c) != EOL_AUTODETECT) + write_fmt_string_lisp (printcharfun, " eol-type=%s", + 1, Fcoding_system_property (obj, Qeol_type)); +} + +static void print_coding_system (Lisp_Object obj, Lisp_Object printcharfun, int escapeflag) { Lisp_Coding_System *c = XCODING_SYSTEM (obj); if (print_readably) - printing_unreadable_object ("#<coding_system 0x%x>", c->header.uid); - - write_c_string ("#<coding_system ", printcharfun); - print_internal (c->name, printcharfun, 1); + printing_unreadable_object + ("printing unreadable object #<coding-system 0x%x>", c->header.uid); + + write_fmt_string_lisp (printcharfun, "#<coding-system %s ", 1, c->name); + print_coding_system_properties (obj, printcharfun); write_c_string (">", printcharfun); } +/* Print an abbreviated version of a coding system (but still containing + all the information), for use within a coding system print method. */ + +static void +print_coding_system_in_print_method (Lisp_Object cs, Lisp_Object printcharfun, + int escapeflag) +{ + print_internal (XCODING_SYSTEM_NAME (cs), printcharfun, 0); + write_c_string ("[", printcharfun); + print_coding_system_properties (cs, printcharfun); + write_c_string ("]", printcharfun); +} + static void finalize_coding_system (void *header, int for_disksave) { - Lisp_Coding_System *c = (Lisp_Coding_System *) header; + Lisp_Object cs = wrap_coding_system ((Lisp_Coding_System *) header); /* Since coding systems never go away, this function is not necessary. But it would be necessary if we changed things so that coding systems could go away. */ if (!for_disksave) /* see comment in lstream.c */ - { - switch (CODING_SYSTEM_TYPE (c)) - { -#ifdef MULE - case CODESYS_ISO2022: - if (c->iso2022.input_conv) - { - Dynarr_free (c->iso2022.input_conv); - c->iso2022.input_conv = 0; - } - if (c->iso2022.output_conv) - { - Dynarr_free (c->iso2022.output_conv); - c->iso2022.output_conv = 0; - } - break; -#endif /* MULE */ - default: - break; - } - } + MAYBE_XCODESYSMETH (cs, finalize, (cs)); +} + +static Bytecount +sizeof_coding_system (const void *header) +{ + const Lisp_Coding_System *p = (const Lisp_Coding_System *) header; + return offsetof (Lisp_Coding_System, data) + p->methods->extra_data_size; } -static eol_type_t -symbol_to_eol_type (Lisp_Object symbol) +static const struct lrecord_description coding_system_methods_description_1[] += { + { XD_LISP_OBJECT, + offsetof (struct coding_system_methods, type) }, + { XD_LISP_OBJECT, + offsetof (struct coding_system_methods, predicate_symbol) }, + { XD_END } +}; + +const struct struct_description coding_system_methods_description = { + sizeof (struct coding_system_methods), + coding_system_methods_description_1 +}; + +const struct lrecord_description coding_system_empty_extra_description[] = { + { XD_END } +}; + +static const struct lrecord_description coding_system_description[] = { - CHECK_SYMBOL (symbol); - if (NILP (symbol)) return EOL_AUTODETECT; - if (EQ (symbol, Qlf)) return EOL_LF; - if (EQ (symbol, Qcrlf)) return EOL_CRLF; - if (EQ (symbol, Qcr)) return EOL_CR; - - invalid_constant ("Unrecognized eol type", symbol); - return EOL_AUTODETECT; /* not reached */ -} - -static Lisp_Object -eol_type_to_symbol (eol_type_t type) + { XD_STRUCT_PTR, offsetof (Lisp_Coding_System, methods), 1, + &coding_system_methods_description }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, name) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, description) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, mnemonic) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, documentation) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, post_read_conversion) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, pre_write_conversion) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, text_file_wrapper) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, auto_eol_wrapper) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[0]) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[1]) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, eol[2]) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, subsidiary_parent) }, + { XD_LISP_OBJECT, offsetof (Lisp_Coding_System, canonical) }, + { XD_CODING_SYSTEM_END } +}; + +DEFINE_LRECORD_SEQUENCE_IMPLEMENTATION ("coding-system", coding_system, + mark_coding_system, + print_coding_system, + finalize_coding_system, + 0, 0, coding_system_description, + sizeof_coding_system, + Lisp_Coding_System); + + +/************************************************************************/ +/* Creating coding systems */ +/************************************************************************/ + +static struct coding_system_methods * +decode_coding_system_type (Lisp_Object type, Error_Behavior errb) { - switch (type) + int i; + + for (i = 0; i < Dynarr_length (the_coding_system_type_entry_dynarr); i++) { - default: abort (); - case EOL_LF: return Qlf; - case EOL_CRLF: return Qcrlf; - case EOL_CR: return Qcr; - case EOL_AUTODETECT: return Qnil; + if (EQ (type, + Dynarr_at (the_coding_system_type_entry_dynarr, i).meths->type)) + return Dynarr_at (the_coding_system_type_entry_dynarr, i).meths; } + + maybe_invalid_constant ("Invalid coding system type", type, + Qcoding_system, errb); + + return 0; } -static void -setup_eol_coding_systems (Lisp_Coding_System *codesys) +static int +valid_coding_system_type_p (Lisp_Object type) { - Lisp_Object codesys_obj; - int len = string_length (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name); - char *codesys_name = (char *) alloca (len + 7); - int mlen = -1; - char *codesys_mnemonic=0; - - Lisp_Object codesys_name_sym, sub_codesys_obj; - - /* kludge */ - - XSETCODING_SYSTEM (codesys_obj, codesys); - - memcpy (codesys_name, - string_data (XSYMBOL (CODING_SYSTEM_NAME (codesys))->name), len); - - if (STRINGP (CODING_SYSTEM_MNEMONIC (codesys))) - { - mlen = XSTRING_LENGTH (CODING_SYSTEM_MNEMONIC (codesys)); - codesys_mnemonic = (char *) alloca (mlen + 7); - memcpy (codesys_mnemonic, - XSTRING_DATA (CODING_SYSTEM_MNEMONIC (codesys)), mlen); - } - -#define DEFINE_SUB_CODESYS(op_sys, op_sys_abbr, Type) do { \ - strcpy (codesys_name + len, "-" op_sys); \ - if (mlen != -1) \ - strcpy (codesys_mnemonic + mlen, op_sys_abbr); \ - codesys_name_sym = intern (codesys_name); \ - sub_codesys_obj = Fcopy_coding_system (codesys_obj, codesys_name_sym); \ - XCODING_SYSTEM_EOL_TYPE (sub_codesys_obj) = Type; \ - if (mlen != -1) \ - XCODING_SYSTEM_MNEMONIC(sub_codesys_obj) = \ - build_string (codesys_mnemonic); \ - CODING_SYSTEM_##Type (codesys) = sub_codesys_obj; \ -} while (0) - - DEFINE_SUB_CODESYS("unix", "", EOL_LF); - DEFINE_SUB_CODESYS("dos", ":T", EOL_CRLF); - DEFINE_SUB_CODESYS("mac", ":t", EOL_CR); + return decode_coding_system_type (type, ERROR_ME_NOT) != 0; +} + +DEFUN ("valid-coding-system-type-p", Fvalid_coding_system_type_p, 1, 1, 0, /* +Given a CODING-SYSTEM-TYPE, return non-nil if it is valid. +Valid types depend on how XEmacs was compiled but may include +'undecided, 'chain, 'integer, 'ccl, 'iso2022, 'big5, 'shift-jis, +'utf-16, 'ucs-4, 'utf-8, etc. +*/ + (coding_system_type)) +{ + return valid_coding_system_type_p (coding_system_type) ? Qt : Qnil; +} + +DEFUN ("coding-system-type-list", Fcoding_system_type_list, 0, 0, 0, /* +Return a list of valid coding system types. +*/ + ()) +{ + return Fcopy_sequence (Vcoding_system_type_list); +} + +void +add_entry_to_coding_system_type_list (struct coding_system_methods *meths) +{ + struct coding_system_type_entry entry; + + entry.meths = meths; + Dynarr_add (the_coding_system_type_entry_dynarr, entry); + Vcoding_system_type_list = Fcons (meths->type, Vcoding_system_type_list); } DEFUN ("coding-system-p", Fcoding_system_p, 1, 1, 0, /* @@ -529,7 +735,8 @@ coding_system_or_name = Fgethash (coding_system_or_name, Vcoding_system_hash_table, Qnil); - if (CODING_SYSTEMP (coding_system_or_name) || NILP (coding_system_or_name)) + if (CODING_SYSTEMP (coding_system_or_name) + || NILP (coding_system_or_name)) return coding_system_or_name; } } @@ -548,12 +755,143 @@ return coding_system; } -/* We store the coding systems in hash tables with the names as the key and the - actual coding system object as the value. Occasionally we need to use them - in a list format. These routines provide us with that. */ +int +coding_system_is_binary (Lisp_Object coding_system) +{ + Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); + return + (EQ (CODING_SYSTEM_TYPE (cs), Qno_conversion) && + CODING_SYSTEM_EOL_TYPE (cs) == EOL_LF && + EQ (CODING_SYSTEM_POST_READ_CONVERSION (cs), Qnil) && + EQ (CODING_SYSTEM_PRE_WRITE_CONVERSION (cs), Qnil)); +} + +static Lisp_Object +coding_system_real_canonical (Lisp_Object cs) +{ + if (!NILP (XCODING_SYSTEM_CANONICAL (cs))) + return XCODING_SYSTEM_CANONICAL (cs); + return cs; +} + +/* Return true if coding system is of the "standard" type that decodes + bytes into characters (suitable for decoding a text file). */ +int +coding_system_is_for_text_file (Lisp_Object coding_system) +{ + return (XCODESYSMETH_OR_GIVEN + (coding_system, conversion_end_type, + (coding_system_real_canonical (coding_system)), + DECODES_BYTE_TO_CHARACTER) == + DECODES_BYTE_TO_CHARACTER); +} + +static int +decoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex) +{ + enum source_sink_type type = + XCODESYSMETH_OR_GIVEN (cs, conversion_end_type, + (coding_system_real_canonical (cs)), + DECODES_BYTE_TO_CHARACTER); + if (sex == CODING_SOURCE) + return (type == DECODES_CHARACTER_TO_CHARACTER || + type == DECODES_CHARACTER_TO_BYTE); + else + return (type == DECODES_CHARACTER_TO_CHARACTER || + type == DECODES_BYTE_TO_CHARACTER); +} + +static int +encoding_source_sink_type_is_char (Lisp_Object cs, enum source_or_sink sex) +{ + return decoding_source_sink_type_is_char (cs, + /* Sex change */ + sex == CODING_SOURCE ? + CODING_SINK : CODING_SOURCE); +} + +/* Like Ffind_coding_system() but check that the coding system is of the + "standard" type that decodes bytes into characters (suitable for + decoding a text file), and if not, returns an appropriate wrapper that + does. Also, if EOL_WRAP is non-zero, check whether this coding system + wants EOL auto-detection, and if so, wrap with a convert-eol coding + system to do this. */ + +Lisp_Object +find_coding_system_for_text_file (Lisp_Object name, int eol_wrap) +{ + Lisp_Object coding_system = Ffind_coding_system (name); + Lisp_Object wrapper = coding_system; + + if (NILP (coding_system)) + return Qnil; + if (!coding_system_is_for_text_file (coding_system)) + { + wrapper = XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system); + if (NILP (wrapper)) + { + Lisp_Object chain; + if (!decoding_source_sink_type_is_char (coding_system, CODING_SINK)) + chain = list2 (coding_system, Qbinary); + else + chain = list1 (coding_system); + if (decoding_source_sink_type_is_char (coding_system, CODING_SOURCE)) + chain = Fcons (Qbinary, chain); + wrapper = + make_internal_coding_system + (coding_system, + "internal-text-file-wrapper", + Qchain, + Qunbound, list4 (Qchain, chain, + Qcanonicalize_after_coding, coding_system)); + XCODING_SYSTEM_TEXT_FILE_WRAPPER (coding_system) = wrapper; + } + } + + if (!eol_wrap || XCODING_SYSTEM_EOL_TYPE (coding_system) != EOL_AUTODETECT) + return wrapper; + + coding_system = wrapper; + wrapper = XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system); + if (!NILP (wrapper)) + return wrapper; + wrapper = + make_internal_coding_system + (coding_system, + "internal-auto-eol-wrapper", + Qundecided, Qunbound, + list4 (Qcoding_system, coding_system, + Qdo_eol, Qt)); + XCODING_SYSTEM_AUTO_EOL_WRAPPER (coding_system) = wrapper; + return wrapper; +} + +/* Like Fget_coding_system() but verify that the coding system is of the + "standard" type that decodes bytes into characters (suitable for + decoding a text file), and if not, returns an appropriate wrapper that + does. Also, if EOL_WRAP is non-zero, check whether this coding system + wants EOL auto-detection, and if so, wrap with a convert-eol coding + system to do this. */ + +Lisp_Object +get_coding_system_for_text_file (Lisp_Object name, int eol_wrap) +{ + Lisp_Object coding_system = find_coding_system_for_text_file (name, + eol_wrap); + if (NILP (coding_system)) + invalid_argument ("No such coding system", name); + return coding_system; +} + +/* We store the coding systems in hash tables with the names as the + key and the actual coding system object as the value. Occasionally + we need to use them in a list format. These routines provide us + with that. */ struct coding_system_list_closure { Lisp_Object *coding_system_list; + int normal; + int internal; }; static int @@ -565,14 +903,24 @@ (struct coding_system_list_closure *) coding_system_list_closure; Lisp_Object *coding_system_list = cscl->coding_system_list; - *coding_system_list = Fcons (key, *coding_system_list); + /* We can't just use VALUE because KEY might be an alias, and we need + the real coding system object. */ + if (XCODING_SYSTEM (Ffind_coding_system (key))->internal_p ? + cscl->internal : cscl->normal) + *coding_system_list = Fcons (key, *coding_system_list); return 0; } -DEFUN ("coding-system-list", Fcoding_system_list, 0, 0, 0, /* +DEFUN ("coding-system-list", Fcoding_system_list, 0, 1, 0, /* Return a list of the names of all defined coding systems. +If INTERNAL is nil, only the normal (non-internal) coding systems are +included. (Internal coding systems are created for various internal +purposes, such as implementing EOL types of CRLF and CR; generally, you do +not want to see these.) If it is t, only the internal coding systems are +included. If it is any other non-nil value both normal and internal are +included. */ - ()) + (internal)) { Lisp_Object coding_system_list = Qnil; struct gcpro gcpro1; @@ -580,6 +928,8 @@ GCPRO1 (coding_system_list); coding_system_list_closure.coding_system_list = &coding_system_list; + coding_system_list_closure.normal = !EQ (internal, Qt); + coding_system_list_closure.internal = !NILP (internal); elisp_maphash (add_coding_system_to_list_mapper, Vcoding_system_hash_table, &coding_system_list_closure); UNGCPRO; @@ -597,92 +947,359 @@ } static Lisp_Coding_System * -allocate_coding_system (enum coding_system_type type, Lisp_Object name) +allocate_coding_system (struct coding_system_methods *codesys_meths, + Bytecount data_size, + Lisp_Object name) { + Bytecount total_size = offsetof (Lisp_Coding_System, data) + data_size; Lisp_Coding_System *codesys = - alloc_lcrecord_type (Lisp_Coding_System, &lrecord_coding_system); - - zero_lcrecord (codesys); + (Lisp_Coding_System *) alloc_lcrecord (total_size, &lrecord_coding_system); + + zero_sized_lcrecord (codesys, total_size); + codesys->methods = codesys_meths; CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = Qnil; CODING_SYSTEM_POST_READ_CONVERSION (codesys) = Qnil; - CODING_SYSTEM_EOL_TYPE (codesys) = EOL_AUTODETECT; + CODING_SYSTEM_EOL_TYPE (codesys) = EOL_LF; CODING_SYSTEM_EOL_CRLF (codesys) = Qnil; CODING_SYSTEM_EOL_CR (codesys) = Qnil; CODING_SYSTEM_EOL_LF (codesys) = Qnil; - CODING_SYSTEM_TYPE (codesys) = type; + CODING_SYSTEM_SUBSIDIARY_PARENT (codesys) = Qnil; + CODING_SYSTEM_CANONICAL (codesys) = Qnil; CODING_SYSTEM_MNEMONIC (codesys) = Qnil; -#ifdef MULE - if (type == CODESYS_ISO2022) - { - int i; - for (i = 0; i < 4; i++) - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i) = Qnil; - } - else if (type == CODESYS_CCL) - { - CODING_SYSTEM_CCL_DECODE (codesys) = Qnil; - CODING_SYSTEM_CCL_ENCODE (codesys) = Qnil; - } -#endif /* MULE */ - CODING_SYSTEM_NAME (codesys) = name; + CODING_SYSTEM_DOCUMENTATION (codesys) = Qnil; + CODING_SYSTEM_TEXT_FILE_WRAPPER (codesys) = Qnil; + CODING_SYSTEM_AUTO_EOL_WRAPPER (codesys) = Qnil; + CODING_SYSTEM_NAME (codesys) = name; + + MAYBE_CODESYSMETH (codesys, init, (wrap_coding_system (codesys))); return codesys; } -#ifdef MULE -/* Given a list of charset conversion specs as specified in a Lisp - program, parse it into STORE_HERE. */ - +static enum eol_type +symbol_to_eol_type (Lisp_Object symbol) +{ + CHECK_SYMBOL (symbol); + if (NILP (symbol)) return EOL_AUTODETECT; + if (EQ (symbol, Qlf)) return EOL_LF; + if (EQ (symbol, Qcrlf)) return EOL_CRLF; + if (EQ (symbol, Qcr)) return EOL_CR; + + invalid_constant ("Unrecognized eol type", symbol); + return EOL_AUTODETECT; /* not reached */ +} + +static Lisp_Object +eol_type_to_symbol (enum eol_type type) +{ + switch (type) + { + default: abort (); + case EOL_LF: return Qlf; + case EOL_CRLF: return Qcrlf; + case EOL_CR: return Qcr; + case EOL_AUTODETECT: return Qnil; + } +} + +struct subsidiary_type +{ + Char_ASCII *extension; + Char_ASCII *mnemonic_ext; + enum eol_type eol; +}; + +static struct subsidiary_type coding_subsidiary_list[] = +{ { "-unix", "", EOL_LF }, + { "-dos", ":T", EOL_CRLF }, + { "-mac", ":t", EOL_CR } }; + +/* kludge */ static void -parse_charset_conversion_specs (charset_conversion_spec_dynarr *store_here, - Lisp_Object spec_list) +setup_eol_coding_systems (Lisp_Object codesys) { - Lisp_Object rest; - - EXTERNAL_LIST_LOOP (rest, spec_list) + int len = string_length (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name); + Intbyte *codesys_name = (Intbyte *) alloca (len + 7); + int mlen = -1; + Intbyte *codesys_mnemonic = 0; + Lisp_Object codesys_name_sym, sub_codesys; + int i; + + memcpy (codesys_name, + string_data (XSYMBOL (XCODING_SYSTEM_NAME (codesys))->name), len); + + if (STRINGP (XCODING_SYSTEM_MNEMONIC (codesys))) { - Lisp_Object car = XCAR (rest); - Lisp_Object from, to; - struct charset_conversion_spec spec; - - if (!CONSP (car) || !CONSP (XCDR (car)) || !NILP (XCDR (XCDR (car)))) - invalid_argument ("Invalid charset conversion spec", car); - from = Fget_charset (XCAR (car)); - to = Fget_charset (XCAR (XCDR (car))); - if (XCHARSET_TYPE (from) != XCHARSET_TYPE (to)) - invalid_operation_2 - ("Attempted conversion between different charset types", - from, to); - spec.from_charset = from; - spec.to_charset = to; - - Dynarr_add (store_here, spec); + mlen = XSTRING_LENGTH (XCODING_SYSTEM_MNEMONIC (codesys)); + codesys_mnemonic = (Intbyte *) alloca (mlen + 7); + memcpy (codesys_mnemonic, + XSTRING_DATA (XCODING_SYSTEM_MNEMONIC (codesys)), mlen); + } + + /* Create three "subsidiary" coding systems, decoding data encoded using + each of the three EOL types. We do this for each subsidiary by + copying the original coding system, setting the EOL type + appropriately, and setting the CANONICAL member of the new coding + system to be a chain consisting of the original coding system followed + by a convert-eol coding system to do the EOL decoding. For EOL type + LF, however, we don't need any decoding, so we skip creating a + CANONICAL. + + If the original coding system is not a text-type coding system + (decodes byte->char), we need to coerce it to one by the appropriate + wrapping in CANONICAL. */ + + for (i = 0; i < countof (coding_subsidiary_list); i++) + { + Char_ASCII *extension = coding_subsidiary_list[i].extension; + Char_ASCII *mnemonic_ext = coding_subsidiary_list[i].mnemonic_ext; + enum eol_type eol = coding_subsidiary_list[i].eol; + + qxestrcpy_c (codesys_name + len, extension); + codesys_name_sym = intern_int (codesys_name); + if (mlen != -1) + qxestrcpy_c (codesys_mnemonic + mlen, mnemonic_ext); + + sub_codesys = Fcopy_coding_system (codesys, codesys_name_sym); + if (mlen != -1) + XCODING_SYSTEM_MNEMONIC (sub_codesys) = + build_intstring (codesys_mnemonic); + + if (eol != EOL_LF) + { + Lisp_Object chain = list2 (get_coding_system_for_text_file + (codesys, 0), + eol == EOL_CR ? Qconvert_eol_cr : + Qconvert_eol_crlf); + Lisp_Object canon = + make_internal_coding_system + (sub_codesys, "internal-subsidiary-eol-wrapper", + Qchain, Qunbound, + mlen != -1 ? + list6 (Qmnemonic, build_intstring (codesys_mnemonic), + Qchain, chain, + Qcanonicalize_after_coding, sub_codesys) : + list4 (Qchain, chain, + Qcanonicalize_after_coding, sub_codesys)); + XCODING_SYSTEM_CANONICAL (sub_codesys) = canon; + } + XCODING_SYSTEM_EOL_TYPE (sub_codesys) = eol; + XCODING_SYSTEM_SUBSIDIARY_PARENT (sub_codesys) = codesys; + XCODING_SYSTEM (codesys)->eol[eol] = sub_codesys; } } -/* Given a dynarr LOAD_HERE of internally-stored charset conversion - specs, return the equivalent as the Lisp programmer would see it. - - If LOAD_HERE is 0, return Qnil. */ +/* Basic function to create new coding systems. For `make-coding-system', + NAME-OR-EXISTING is the NAME argument, PREFIX is null, and TYPE, + DESCRIPTION, and PROPS are the same. All created coding systems are put + in a hash table indexed by NAME. + + If PREFIX is a string, NAME-OR-EXISTING should specify an existing + coding system (or nil), and an internal coding system will be created. + The name of the coding system will be constructed by combining PREFIX + with the name of the existing coding system (if given), and a number + will be appended to insure uniqueness. In such a case, if Qunbound is + given for DESCRIPTION, the description gets created based on the + generated name. Also, if no mnemonic is given in the properties list, a + mnemonic is created based on the generated name. + + For internal coding systems, the coding system is marked as internal + (see `coding-system-list'), and no subsidiaries will be created or + eol-wrapping will happen. Otherwise: + + -- if the eol-type property is `lf' or t, the coding system is merely + created and returned. (For t, the coding system will be wrapped with + an EOL autodetector when it's used to read a file.) + + -- if eol-type is `crlf' or `cr', after the coding system object is + created, it will be wrapped in a chain with the appropriate + convert-eol coding system (either `convert-eol-crlf' or + `convert-eol-cr'), so that CRLF->LF or CR->LF conversion is done at + decoding time, and the opposite at encoding time. The resulting + chain becomes the CANONICAL field of the coding system object. + + -- if eol-type is nil or omitted, "subsidiaries" are generated: Three + coding systems where the original coding system (before wrapping with + convert-eol-autodetect) is either unwrapped or wrapped with + convert-eol-crlf or convert-eol-cr, respectively, so that coding systems + to handle LF, CRLF, and CR end-of-line indicators are created. (This + crazy crap is based on existing behavior in other Mule versions, + including FSF Emacs.) + */ static Lisp_Object -unparse_charset_conversion_specs (charset_conversion_spec_dynarr *load_here) +make_coding_system_1 (Lisp_Object name_or_existing, Char_ASCII *prefix, + Lisp_Object type, Lisp_Object description, + Lisp_Object props) { - int i; - Lisp_Object result; - - if (!load_here) - return Qnil; - for (i = 0, result = Qnil; i < Dynarr_length (load_here); i++) + Lisp_Coding_System *cs; + int need_to_setup_eol_systems = 1; + enum eol_type eol_wrapper = EOL_AUTODETECT; + struct coding_system_methods *meths; + Lisp_Object csobj; + Lisp_Object defmnem = Qnil; + + if (NILP (type)) + type = Qundecided; + meths = decode_coding_system_type (type, ERROR_ME); + + if (prefix) { - struct charset_conversion_spec *ccs = Dynarr_atp (load_here, i); - result = Fcons (list2 (ccs->from_charset, ccs->to_charset), result); + Intbyte *newname = + emacs_sprintf_malloc (NULL, "%s-%s-%d", + prefix, + NILP (name_or_existing) ? (Intbyte *) "nil" : + XSTRING_DATA (Fsymbol_name (XCODING_SYSTEM_NAME + (name_or_existing))), + ++coding_system_tick); + name_or_existing = intern_int (newname); + xfree (newname); + + if (UNBOUNDP (description)) + { + newname = + emacs_sprintf_malloc + (NULL, "For Internal Use (%s)", + XSTRING_DATA (Fsymbol_name (name_or_existing))); + description = build_intstring (newname); + xfree (newname); + } + + newname = emacs_sprintf_malloc (NULL, "Int%d", coding_system_tick); + defmnem = build_intstring (newname); } - - return Fnreverse (result); + else + CHECK_SYMBOL (name_or_existing); + + if (!NILP (Ffind_coding_system (name_or_existing))) + invalid_operation ("Cannot redefine existing coding system", + name_or_existing); + + cs = allocate_coding_system (meths, meths->extra_data_size, + name_or_existing); + XSETCODING_SYSTEM (csobj, cs); + + cs->internal_p = !!prefix; + + if (NILP (description)) + description = build_string (""); + else + CHECK_STRING (description); + CODING_SYSTEM_DESCRIPTION (cs) = description; + + if (!NILP (defmnem)) + CODING_SYSTEM_MNEMONIC (cs) = defmnem; + + { + EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props) + { + int recognized = 1; + + if (EQ (key, Qmnemonic)) + { + if (!NILP (value)) + CHECK_STRING (value); + CODING_SYSTEM_MNEMONIC (cs) = value; + } + + else if (EQ (key, Qdocumentation)) + { + if (!NILP (value)) + CHECK_STRING (value); + CODING_SYSTEM_DOCUMENTATION (cs) = value; + } + + else if (EQ (key, Qeol_type)) + { + need_to_setup_eol_systems = NILP (value); + if (EQ (value, Qt)) + value = Qnil; + eol_wrapper = symbol_to_eol_type (value); + } + + else if (EQ (key, Qpost_read_conversion)) + CODING_SYSTEM_POST_READ_CONVERSION (cs) = value; + else if (EQ (key, Qpre_write_conversion)) + CODING_SYSTEM_PRE_WRITE_CONVERSION (cs) = value; + /* FSF compatibility */ + else if (EQ (key, Qtranslation_table_for_decode)) + ; + else if (EQ (key, Qtranslation_table_for_encode)) + ; + else if (EQ (key, Qsafe_chars)) + ; + else if (EQ (key, Qsafe_charsets)) + ; + else if (EQ (key, Qmime_charset)) + ; + else if (EQ (key, Qvalid_codes)) + ; + else + recognized = CODESYSMETH_OR_GIVEN (cs, putprop, + (csobj, key, value), 0); + + if (!recognized) + invalid_constant ("Unrecognized property", key); + } + } + + { + XCODING_SYSTEM_CANONICAL (csobj) = + CODESYSMETH_OR_GIVEN (cs, canonicalize, (csobj), Qnil); + XCODING_SYSTEM_EOL_TYPE (csobj) = EOL_AUTODETECT; /* for copy-coding-system + below */ + + if (need_to_setup_eol_systems && !cs->internal_p) + setup_eol_coding_systems (csobj); + else if (eol_wrapper == EOL_CR || eol_wrapper == EOL_CRLF) + { + /* If a specific eol-type (other than LF) was specified, we handle + this by converting the coding system into a chain that wraps the + coding system along with a convert-eol system after it, in + exactly that same switcheroo fashion that the normal + canonicalize method works -- BUT we will run into a problem if + we do it the obvious way, because when `chain' creates its + substreams, the substream containing the coding system we're + creating will have canonicalization expansion done on it, + leading to infinite recursion. So we have to generate a new, + internal coding system with the previous value of CANONICAL. */ + Intbyte *newname = + emacs_sprintf_malloc + (NULL, "internal-eol-copy-%s-%d", + XSTRING_DATA (Fsymbol_name (name_or_existing)), + ++coding_system_tick); + Lisp_Object newnamesym = intern_int (newname); + Lisp_Object copied = Fcopy_coding_system (csobj, newnamesym); + xfree (newname); + + XCODING_SYSTEM_CANONICAL (csobj) = + make_internal_coding_system + (csobj, + "internal-eol-wrapper", + Qchain, Qunbound, + list4 (Qchain, + list2 (copied, + eol_wrapper == EOL_CR ? + Qconvert_eol_cr : + Qconvert_eol_crlf), + Qcanonicalize_after_coding, + csobj)); + } + XCODING_SYSTEM_EOL_TYPE (csobj) = eol_wrapper; + } + + Fputhash (name_or_existing, csobj, Vcoding_system_hash_table); + + return csobj; } -#endif /* MULE */ +Lisp_Object +make_internal_coding_system (Lisp_Object existing, Char_ASCII *prefix, + Lisp_Object type, Lisp_Object description, + Lisp_Object props) +{ + return make_coding_system_1 (existing, prefix, type, description, props); +} DEFUN ("make-coding-system", Fmake_coding_system, 2, 4, 0, /* Register symbol NAME as a coding system. @@ -692,17 +1309,27 @@ nil or 'undecided Automatic conversion. XEmacs attempts to detect the coding system used in the file. +'chain + Chain two or more coding systems together to make a combination coding + system. 'no-conversion No conversion. Use this for binary files and such. On output, graphic characters that are not in ASCII or Latin-1 will be replaced by a ?. (For a no-conversion-encoded buffer, these characters will only be present if you explicitly insert them.) +'convert-eol + Convert CRLF sequences or CR to LF. 'shift-jis Shift-JIS (a Japanese encoding commonly used in PC operating systems). -'ucs-4 - ISO 10646 UCS-4 encoding. -'utf-8 - ISO 10646 UTF-8 encoding. +'unicode + Any Unicode encoding (UCS-4, UTF-8, UTF-16, etc.). +'mswindows-unicode-to-multibyte + (MS Windows only) Converts from Windows Unicode to Windows Multibyte + (any code page encoding) upon encoding, and the other way upon decoding. +'mswindows-multibyte + Converts to or from Windows Multibyte (any code page encoding). + This is resolved into a chain of `mswindows-unicode' and + `mswindows-unicode-to-multibyte'. 'iso2022 Any ISO2022-compliant encoding. Among other things, this includes JIS (the Japanese encoding commonly used for e-mail), EUC (the @@ -715,6 +1342,8 @@ The conversion is performed using a user-written pseudo-code program. CCL (Code Conversion Language) is the name of this pseudo-code. +'gzip + GZIP compression format. 'internal Write out or read in the raw contents of the memory representing the buffer's text. This is primarily useful for debugging @@ -726,7 +1355,9 @@ cause XEmacs to crash. Under normal circumstances you should never use 'internal conversion. -DOC-STRING is a string describing the coding system. +DESCRIPTION is a short English phrase describing the coding system, +suitable for use as a menu item. (See also the `documentation' property +below.) PROPS is a property list, describing the specific nature of the character set. Recognized properties are: @@ -735,6 +1366,9 @@ String to be displayed in the modeline when this coding system is active. +'documentation + Detailed documentation on the coding system. + 'eol-type End-of-line conversion to be used. It should be one of @@ -764,14 +1398,70 @@ `coding-system-property' will return nil.) 'post-read-conversion - Function called after a file has been read in, to perform the - decoding. Called with two arguments, START and END, denoting - a region of the current buffer to be decoded. + The value is a function to call after some text is inserted and + decoded by the coding system itself and before any functions in + `after-change-functions' are called. (#### Not actually true in + XEmacs. `after-change-functions' will be called twice if + `post-read-conversion' changes something.) The argument of this + function is the same as for a function in + `after-insert-file-functions', i.e. LENGTH of the text inserted, + with point at the head of the text to be decoded. 'pre-write-conversion - Function called before a file is written out, to perform the - encoding. Called with two arguments, START and END, denoting - a region of the current buffer to be encoded. + The value is a function to call after all functions in + `write-region-annotate-functions' and `buffer-file-format' are + called, and before the text is encoded by the coding system itself. + The arguments to this function are the same as those of a function + in `write-region-annotate-functions', i.e. FROM and TO, specifying + a region of text. + + + +The following properties are allowed for FSF compatibility but currently +ignored: + +'translation-table-for-decode + The value is a translation table to be applied on decoding. See + the function `make-translation-table' for the format of translation + table. This is not applicable to CCL-based coding systems. + +'translation-table-for-encode + The value is a translation table to be applied on encoding. This is + not applicable to CCL-based coding systems. + +'safe-chars + The value is a char table. If a character has non-nil value in it, + the character is safely supported by the coding system. This + overrides the specification of safe-charsets. + +'safe-charsets + The value is a list of charsets safely supported by the coding + system. The value t means that all charsets Emacs handles are + supported. Even if some charset is not in this list, it doesn't + mean that the charset can't be encoded in the coding system; + it just means that some other receiver of text encoded + in the coding system won't be able to handle that charset. + +'mime-charset + The value is a symbol of which name is `MIME-charset' parameter of + the coding system. + +'valid-codes (meaningful only for a coding system based on CCL) + The value is a list to indicate valid byte ranges of the encoded + file. Each element of the list is an integer or a cons of integer. + In the former case, the integer value is a valid byte code. In the + latter case, the integers specifies the range of valid byte codes. + + + +The following additional property is recognized if TYPE is 'convert-eol: + +'subtype + One of `lf', `crlf', `cr' or `autodetect'. When decoding, the + corresponding sequence will be converted to LF. When encoding, the + opposite happens. This coding system converts characters to + characters. + The following additional properties are recognized if TYPE is 'iso2022: @@ -846,6 +1536,7 @@ 'input-charset-conversion. + The following additional properties are recognized (and required) if TYPE is 'ccl: @@ -854,196 +1545,131 @@ 'encode CCL program used for encoding (converting to external format). + + +The following additional properties are recognized if TYPE is 'chain: + +'chain + List of coding systems to be chained together, in decoding order. + +'canonicalize-after-coding + Coding system to be returned by the detector routines in place of + this coding system. + + + +The following additional properties are recognized if TYPE is 'unicode: + +'type + One of `utf-16', `utf-8', `ucs-4', or `utf-7' (the latter is not + yet implemented). `utf-16' is the basic two-byte encoding; + `ucs-4' is the four-byte encoding; `utf-8' is an ASCII-compatible + variable-width 8-bit encoding; `utf-7' is a 7-bit encoding using + only characters that will safely pass through all mail gateways. + +'little-endian + If non-nil, `utf-16' and `ucs-4' will write out the groups of two + or four bytes little-endian instead of big-endian. This is required, + for example, under Windows. + +'need-bom + If non-nil, a byte order mark (BOM, or Unicode FFFE) should be + written out at the beginning of the data. This serves both to + identify the endianness of the following data and to mark the + data as Unicode (at least, this is how Windows uses it). + + + +The following additional properties are recognized if TYPE is +'mswindows-multibyte: + +'code-page + Either a number (specifying a particular code page) or one of the + symbols `ansi', `oem', `mac', or `ebcdic', specifying the ANSI, + OEM, Macintosh, or EBCDIC code page associated with a particular + locale (given by the `locale' property). NOTE: EBCDIC code pages + only exist in Windows 2000 and later. + +'locale + If `code-page' is a symbol, this specifies the locale whose code + page of the corresponding type should be used. This should be + one of the following: A cons of two strings, (LANGUAGE + . SUBLANGUAGE) (see `mswindows-set-current-locale'); a string (a + language; SUBLANG_DEFAULT, i.e. the default sublanguage, is + used); or one of the symbols `current', `user-default', or + `system-default', corresponding to the values of + `mswindows-current-locale', `mswindows-user-default-locale', or + `mswindows-system-default-locale', respectively. + + + +The following additional properties are recognized if TYPE is 'undecided: + +'do-eol + Do EOL detection. + +'do-coding + Do encoding detection. + +'coding-system + If encoding detection is not done, use the specified coding system + to do decoding. This is used internally when implementing coding + systems with an EOL type that specifies autodetection (the default), + so that the detector routines return the proper subsidiary. + + + +The following additional property is recognized if TYPE is 'gzip: + +'level + Compression level: 0 through 9, or `default' (currently 6). + */ - (name, type, doc_string, props)) + (name, type, description, props)) { - Lisp_Coding_System *codesys; - enum coding_system_type ty; - int need_to_setup_eol_systems = 1; - - /* Convert type to constant */ - if (NILP (type) || EQ (type, Qundecided)) - { ty = CODESYS_AUTODETECT; } -#ifdef MULE - else if (EQ (type, Qshift_jis)) { ty = CODESYS_SHIFT_JIS; } - else if (EQ (type, Qiso2022)) { ty = CODESYS_ISO2022; } - else if (EQ (type, Qbig5)) { ty = CODESYS_BIG5; } - else if (EQ (type, Qucs4)) { ty = CODESYS_UCS4; } - else if (EQ (type, Qutf8)) { ty = CODESYS_UTF8; } - else if (EQ (type, Qccl)) { ty = CODESYS_CCL; } -#endif - else if (EQ (type, Qno_conversion)) { ty = CODESYS_NO_CONVERSION; } -#ifdef DEBUG_XEMACS - else if (EQ (type, Qinternal)) { ty = CODESYS_INTERNAL; } -#endif - else - invalid_constant ("Invalid coding system type", type); - - CHECK_SYMBOL (name); - - codesys = allocate_coding_system (ty, name); - - if (NILP (doc_string)) - doc_string = build_string (""); - else - CHECK_STRING (doc_string); - CODING_SYSTEM_DOC_STRING (codesys) = doc_string; - - { - EXTERNAL_PROPERTY_LIST_LOOP_3 (key, value, props) - { - if (EQ (key, Qmnemonic)) - { - if (!NILP (value)) - CHECK_STRING (value); - CODING_SYSTEM_MNEMONIC (codesys) = value; - } - - else if (EQ (key, Qeol_type)) - { - need_to_setup_eol_systems = NILP (value); - if (EQ (value, Qt)) - value = Qnil; - CODING_SYSTEM_EOL_TYPE (codesys) = symbol_to_eol_type (value); - } - - else if (EQ (key, Qpost_read_conversion)) CODING_SYSTEM_POST_READ_CONVERSION (codesys) = value; - else if (EQ (key, Qpre_write_conversion)) CODING_SYSTEM_PRE_WRITE_CONVERSION (codesys) = value; -#ifdef MULE - else if (ty == CODESYS_ISO2022) - { -#define FROB_INITIAL_CHARSET(charset_num) \ - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, charset_num) = \ - ((EQ (value, Qt) || EQ (value, Qnil)) ? value : Fget_charset (value)) - - if (EQ (key, Qcharset_g0)) FROB_INITIAL_CHARSET (0); - else if (EQ (key, Qcharset_g1)) FROB_INITIAL_CHARSET (1); - else if (EQ (key, Qcharset_g2)) FROB_INITIAL_CHARSET (2); - else if (EQ (key, Qcharset_g3)) FROB_INITIAL_CHARSET (3); - -#define FROB_FORCE_CHARSET(charset_num) \ - CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (codesys, charset_num) = !NILP (value) - - else if (EQ (key, Qforce_g0_on_output)) FROB_FORCE_CHARSET (0); - else if (EQ (key, Qforce_g1_on_output)) FROB_FORCE_CHARSET (1); - else if (EQ (key, Qforce_g2_on_output)) FROB_FORCE_CHARSET (2); - else if (EQ (key, Qforce_g3_on_output)) FROB_FORCE_CHARSET (3); - -#define FROB_BOOLEAN_PROPERTY(prop) \ - CODING_SYSTEM_ISO2022_##prop (codesys) = !NILP (value) - - else if (EQ (key, Qshort)) FROB_BOOLEAN_PROPERTY (SHORT); - else if (EQ (key, Qno_ascii_eol)) FROB_BOOLEAN_PROPERTY (NO_ASCII_EOL); - else if (EQ (key, Qno_ascii_cntl)) FROB_BOOLEAN_PROPERTY (NO_ASCII_CNTL); - else if (EQ (key, Qseven)) FROB_BOOLEAN_PROPERTY (SEVEN); - else if (EQ (key, Qlock_shift)) FROB_BOOLEAN_PROPERTY (LOCK_SHIFT); - else if (EQ (key, Qno_iso6429)) FROB_BOOLEAN_PROPERTY (NO_ISO6429); - else if (EQ (key, Qescape_quoted)) FROB_BOOLEAN_PROPERTY (ESCAPE_QUOTED); - - else if (EQ (key, Qinput_charset_conversion)) - { - codesys->iso2022.input_conv = - Dynarr_new (charset_conversion_spec); - parse_charset_conversion_specs (codesys->iso2022.input_conv, - value); - } - else if (EQ (key, Qoutput_charset_conversion)) - { - codesys->iso2022.output_conv = - Dynarr_new (charset_conversion_spec); - parse_charset_conversion_specs (codesys->iso2022.output_conv, - value); - } - else - invalid_constant ("Unrecognized property", key); - } - else if (EQ (type, Qccl)) - { - Lisp_Object sym; - struct ccl_program test_ccl; - Extbyte *suffix; - - /* Check key first. */ - if (EQ (key, Qdecode)) - suffix = "-ccl-decode"; - else if (EQ (key, Qencode)) - suffix = "-ccl-encode"; - else - invalid_constant ("Unrecognized property", key); - - /* If value is vector, register it as a ccl program - associated with an newly created symbol for - backward compatibility. */ - if (VECTORP (value)) - { - sym = Fintern (concat2 (Fsymbol_name (name), - build_string (suffix)), - Qnil); - Fregister_ccl_program (sym, value); - } - else - { - CHECK_SYMBOL (value); - sym = value; - } - /* check if the given ccl programs are valid. */ - if (setup_ccl_program (&test_ccl, sym) < 0) - invalid_argument ("Invalid CCL program", value); - - if (EQ (key, Qdecode)) - CODING_SYSTEM_CCL_DECODE (codesys) = sym; - else if (EQ (key, Qencode)) - CODING_SYSTEM_CCL_ENCODE (codesys) = sym; - - } -#endif /* MULE */ - else - invalid_constant ("Unrecognized property", key); - } - } - - if (need_to_setup_eol_systems) - setup_eol_coding_systems (codesys); - - { - Lisp_Object codesys_obj; - XSETCODING_SYSTEM (codesys_obj, codesys); - Fputhash (name, codesys_obj, Vcoding_system_hash_table); - return codesys_obj; - } + return make_coding_system_1 (name, 0, type, description, props); } DEFUN ("copy-coding-system", Fcopy_coding_system, 2, 2, 0, /* Copy OLD-CODING-SYSTEM to NEW-NAME. If NEW-NAME does not name an existing coding system, a new one will be created. +If you are using this function to create an alias, think again: +Use `define-coding-system-alias' instead. */ (old_coding_system, new_name)) { Lisp_Object new_coding_system; old_coding_system = Fget_coding_system (old_coding_system); - new_coding_system = Ffind_coding_system (new_name); + new_coding_system = + UNBOUNDP (new_name) ? Qnil : Ffind_coding_system (new_name); if (NILP (new_coding_system)) { - XSETCODING_SYSTEM (new_coding_system, - allocate_coding_system - (XCODING_SYSTEM_TYPE (old_coding_system), - new_name)); - Fputhash (new_name, new_coding_system, Vcoding_system_hash_table); + XSETCODING_SYSTEM + (new_coding_system, + allocate_coding_system + (XCODING_SYSTEM (old_coding_system)->methods, + XCODING_SYSTEM (old_coding_system)->methods->extra_data_size, + new_name)); + if (!UNBOUNDP (new_name)) + Fputhash (new_name, new_coding_system, Vcoding_system_hash_table); } + else if (XCODING_SYSTEM (old_coding_system)->methods != + XCODING_SYSTEM (new_coding_system)->methods) + invalid_operation_2 ("Coding systems not same type", + old_coding_system, new_coding_system); { Lisp_Coding_System *to = XCODING_SYSTEM (new_coding_system); Lisp_Coding_System *from = XCODING_SYSTEM (old_coding_system); - memcpy (((char *) to ) + sizeof (to->header), - ((char *) from) + sizeof (from->header), - sizeof (*from) - sizeof (from->header)); + copy_sized_lcrecord (to, from, sizeof_coding_system (from)); to->name = new_name; } return new_coding_system; } -DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p, 1, 1, 0, /* +DEFUN ("coding-system-canonical-name-p", Fcoding_system_canonical_name_p, + 1, 1, 0, /* Return t if OBJECT names a coding system, and is not a coding system alias. */ (object)) @@ -1075,13 +1701,6 @@ return Qnil; /* To keep the compiler happy */ } -static Lisp_Object -append_suffix_to_symbol (Lisp_Object symbol, const char *ascii_string) -{ - return Fintern (concat2 (Fsymbol_name (symbol), build_string (ascii_string)), - Qnil); -} - /* A maphash function, for removing dangling coding system aliases. */ static int dangling_coding_system_alias_p (Lisp_Object alias, @@ -1129,9 +1748,9 @@ if (NILP (aliasee)) { - Lisp_Object subsidiary_unix = append_suffix_to_symbol (alias, "-unix"); - Lisp_Object subsidiary_dos = append_suffix_to_symbol (alias, "-dos"); - Lisp_Object subsidiary_mac = append_suffix_to_symbol (alias, "-mac"); + Lisp_Object subsidiary_unix = add_suffix_to_symbol (alias, "-unix"); + Lisp_Object subsidiary_dos = add_suffix_to_symbol (alias, "-dos"); + Lisp_Object subsidiary_mac = add_suffix_to_symbol (alias, "-mac"); Fremhash (alias, Vcoding_system_hash_table); @@ -1190,9 +1809,9 @@ for (i = 0; i < countof (suffixes); i++) { Lisp_Object alias_subsidiary = - append_suffix_to_symbol (alias, suffixes[i]); + add_suffix_to_symbol (alias, suffixes[i]); Lisp_Object aliasee_subsidiary = - append_suffix_to_symbol (aliasee, suffixes[i]); + add_suffix_to_symbol (aliasee, suffixes[i]); if (! NILP (Ffind_coding_system (aliasee_subsidiary))) Fdefine_coding_system_alias (alias_subsidiary, aliasee_subsidiary); @@ -1205,14 +1824,11 @@ } static Lisp_Object -subsidiary_coding_system (Lisp_Object coding_system, eol_type_t type) +subsidiary_coding_system (Lisp_Object coding_system, enum eol_type type) { Lisp_Coding_System *cs = XCODING_SYSTEM (coding_system); Lisp_Object new_coding_system; - if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT) - return coding_system; - switch (type) { case EOL_AUTODETECT: return coding_system; @@ -1227,27 +1843,78 @@ DEFUN ("subsidiary-coding-system", Fsubsidiary_coding_system, 2, 2, 0, /* Return the subsidiary coding system of CODING-SYSTEM with eol type EOL-TYPE. +The logically opposite operation is `coding-system-base'. */ (coding_system, eol_type)) { - coding_system = Fget_coding_system (coding_system); + coding_system = get_coding_system_for_text_file (coding_system, 0); return subsidiary_coding_system (coding_system, symbol_to_eol_type (eol_type)); } +DEFUN ("coding-system-base", Fcoding_system_base, + 1, 1, 0, /* +Return the base coding system of CODING-SYSTEM. +If CODING-SYSTEM is a subsidiary, this returns its parent; otherwise, it +returns CODING-SYSTEM. +The logically opposite operation is `subsidiary-coding-system'. +*/ + (coding_system)) +{ + Lisp_Object base; + + coding_system = Fget_coding_system (coding_system); + if (EQ (XCODING_SYSTEM_NAME (coding_system), Qbinary)) + return Fget_coding_system (Qraw_text); /* hack! */ + base = XCODING_SYSTEM_SUBSIDIARY_PARENT (coding_system); + if (!NILP (base)) + return base; + return coding_system; +} + +DEFUN ("coding-system-used-for-io", Fcoding_system_used_for_io, + 1, 1, 0, /* +Return the coding system actually used for I/O. +In some cases (e.g. when a particular EOL type is specified) this won't be +the coding system itself. This can be useful when trying to track down +more closely how exactly data is decoded. +*/ + (coding_system)) +{ + Lisp_Object canon; + + coding_system = Fget_coding_system (coding_system); + canon = XCODING_SYSTEM_CANONICAL (coding_system); + if (!NILP (canon)) + return canon; + return coding_system; +} + /************************************************************************/ /* Coding system accessors */ /************************************************************************/ -DEFUN ("coding-system-doc-string", Fcoding_system_doc_string, 1, 1, 0, /* -Return the doc string for CODING-SYSTEM. +DEFUN ("coding-system-description", Fcoding_system_description, 1, 1, 0, /* +Return the description for CODING-SYSTEM. +The `description' of a coding system is a short English phrase giving the +name rendered according to English punctuation rules, plus possibly some +explanatory text (typically in the form of a parenthetical phrase). The +description is intended to be short enough that it can appear as a menu item, +and clear enough to be recognizable even to someone who is assumed to have +some basic familiarity with different encodings but may not know all the +technical names; thus, for `cn-gb-2312' is described as "Chinese EUC" and +`hz-gb-2312' is described as "Hz/ZW (Chinese)", where the actual name of +the encoding is given, followed by a note that this is a Chinese encoding, +because the great majority of people encountering this would have no idea +what it is, and giving the language indicates whether the encoding should +just be ignored or (conceivably) investigated more thoroughly. */ (coding_system)) { coding_system = Fget_coding_system (coding_system); - return XCODING_SYSTEM_DOC_STRING (coding_system); + return XCODING_SYSTEM_DESCRIPTION (coding_system); } DEFUN ("coding-system-type", Fcoding_system_type, 1, 1, 0, /* @@ -1255,89 +1922,31 @@ */ (coding_system)) { - switch (XCODING_SYSTEM_TYPE (Fget_coding_system (coding_system))) - { - default: abort (); - case CODESYS_AUTODETECT: return Qundecided; -#ifdef MULE - case CODESYS_SHIFT_JIS: return Qshift_jis; - case CODESYS_ISO2022: return Qiso2022; - case CODESYS_BIG5: return Qbig5; - case CODESYS_UCS4: return Qucs4; - case CODESYS_UTF8: return Qutf8; - case CODESYS_CCL: return Qccl; -#endif - case CODESYS_NO_CONVERSION: return Qno_conversion; -#ifdef DEBUG_XEMACS - case CODESYS_INTERNAL: return Qinternal; -#endif - } + coding_system = Fget_coding_system (coding_system); + return XCODING_SYSTEM_TYPE (coding_system); } -#ifdef MULE -static -Lisp_Object coding_system_charset (Lisp_Object coding_system, int gnum) -{ - Lisp_Object cs - = XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, gnum); - - return CHARSETP (cs) ? XCHARSET_NAME (cs) : Qnil; -} -#endif /* MULE */ - DEFUN ("coding-system-property", Fcoding_system_property, 2, 2, 0, /* Return the PROP property of CODING-SYSTEM. */ (coding_system, prop)) { - int i, ok = 0; - enum coding_system_type type; - coding_system = Fget_coding_system (coding_system); CHECK_SYMBOL (prop); - type = XCODING_SYSTEM_TYPE (coding_system); - - for (i = 0; !ok && i < Dynarr_length (the_codesys_prop_dynarr); i++) - if (EQ (Dynarr_at (the_codesys_prop_dynarr, i).sym, prop)) - { - ok = 1; - switch (Dynarr_at (the_codesys_prop_dynarr, i).prop_type) - { - case CODESYS_PROP_ALL_OK: - break; -#ifdef MULE - case CODESYS_PROP_ISO2022: - if (type != CODESYS_ISO2022) - invalid_argument - ("Property only valid in ISO2022 coding systems", - prop); - break; - - case CODESYS_PROP_CCL: - if (type != CODESYS_CCL) - invalid_argument - ("Property only valid in CCL coding systems", - prop); - break; -#endif /* MULE */ - default: - abort (); - } - } - - if (!ok) - invalid_constant ("Unrecognized property", prop); if (EQ (prop, Qname)) return XCODING_SYSTEM_NAME (coding_system); else if (EQ (prop, Qtype)) return Fcoding_system_type (coding_system); - else if (EQ (prop, Qdoc_string)) - return XCODING_SYSTEM_DOC_STRING (coding_system); + else if (EQ (prop, Qdescription)) + return XCODING_SYSTEM_DESCRIPTION (coding_system); else if (EQ (prop, Qmnemonic)) return XCODING_SYSTEM_MNEMONIC (coding_system); + else if (EQ (prop, Qdocumentation)) + return XCODING_SYSTEM_DOCUMENTATION (coding_system); else if (EQ (prop, Qeol_type)) - return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE (coding_system)); + return eol_type_to_symbol (XCODING_SYSTEM_EOL_TYPE + (coding_system)); else if (EQ (prop, Qeol_lf)) return XCODING_SYSTEM_EOL_LF (coding_system); else if (EQ (prop, Qeol_crlf)) @@ -1348,242 +1957,1506 @@ return XCODING_SYSTEM_POST_READ_CONVERSION (coding_system); else if (EQ (prop, Qpre_write_conversion)) return XCODING_SYSTEM_PRE_WRITE_CONVERSION (coding_system); -#ifdef MULE - else if (type == CODESYS_ISO2022) + else + { + Lisp_Object value = CODESYSMETH_OR_GIVEN (XCODING_SYSTEM (coding_system), + getprop, + (coding_system, prop), + Qunbound); + if (UNBOUNDP (value)) + invalid_constant ("Unrecognized property", prop); + return value; + } +} + + +/************************************************************************/ +/* Coding stream functions */ +/************************************************************************/ + +/* A coding stream is a stream used for encoding or decoding text. The + coding-stream object keeps track of the actual coding system, the stream + that is at the other end, and data that needs to be persistent across + the lifetime of the stream. */ + +DEFINE_LSTREAM_IMPLEMENTATION ("coding", coding); + +/* Encoding and decoding are parallel operations, so we create just one + stream for both. "Decoding" may involve the extra step of autodetection + of the data format, but that's only because of the conventional + definition of decoding as converting from external- to + internal-formatted data. + + #### We really need to abstract out the concept of "data formats" and + define "converters" that convert from and to specified formats, + eliminating the idea of decoding and encoding. When specifying a + conversion process, we need to give the data formats themselves, not the + conversion processes -- e.g. a coding system called "Unicode->multibyte" + converts in both directions, and we could auto-detect the format of data + at either end. */ + +static Bytecount +coding_reader (Lstream *stream, unsigned char *data, Bytecount size) +{ + unsigned char *orig_data = data; + Bytecount read_size; + int error_occurred = 0; + struct coding_stream *str = CODING_STREAM_DATA (stream); + + /* We need to interface to coding_{de,en}code_1(), which expects to take + some amount of data and store the result into a Dynarr. We have + coding_{de,en}code_1() store into c->runoff, and take data from there + as necessary. */ + + /* We loop until we have enough data, reading chunks from the other + end and converting it. */ + while (1) + { + /* Take data from convert_to if we can. Make sure to take at + most SIZE bytes, and delete the data from convert_to. */ + if (Dynarr_length (str->convert_to) > 0) + { + Bytecount chunk = + min (size, (Bytecount) Dynarr_length (str->convert_to)); + memcpy (data, Dynarr_atp (str->convert_to, 0), chunk); + Dynarr_delete_many (str->convert_to, 0, chunk); + data += chunk; + size -= chunk; + } + + if (size == 0) + break; /* No more room for data */ + + if (str->eof) + break; + + { + /* Exhausted convert_to, so get some more. Read into convert_from, + after existing "rejected" data from the last conversion. */ + Bytecount rejected = Dynarr_length (str->convert_from); + /* #### 1024 is arbitrary; we really need to separate 0 from EOF, + and when we get 0, keep taking more data until we don't get 0 -- + we don't know how much data the conversion routine might need + before it can generate any data of its own */ + Bytecount readmore = max (size, 1024); + + Dynarr_add_many (str->convert_from, 0, readmore); + read_size = Lstream_read (str->other_end, + Dynarr_atp (str->convert_from, rejected), + readmore); + /* Trim size down to how much we actually got */ + Dynarr_set_size (str->convert_from, rejected + max (0, read_size)); + } + + if (read_size < 0) /* LSTREAM_ERROR */ + { + error_occurred = 1; + break; + } + if (read_size == 0) /* LSTREAM_EOF */ + /* There might be some more end data produced in the translation, + so we set a flag and call the conversion method once more to + output any final stuff it may be holding, any "go back to a sane + state" escape sequences, etc. The conversion method is free to + look at this flag, and we use it above to stop looping. */ + str->eof = 1; + { + Bytecount processed; + Bytecount to_process = Dynarr_length (str->convert_from); + + /* Convert the data, and save any rejected data in convert_from */ + processed = + XCODESYSMETH (str->codesys, convert, + (str, Dynarr_atp (str->convert_from, 0), + str->convert_to, to_process)); + if (processed < 0) + { + error_occurred = 1; + break; + } + assert (processed <= to_process); + if (processed < to_process) + memmove (Dynarr_atp (str->convert_from, 0), + Dynarr_atp (str->convert_from, processed), + to_process - processed); + Dynarr_set_size (str->convert_from, to_process - processed); + } + } + + if (data - orig_data == 0) + return error_occurred ? -1 : 0; + else + return data - orig_data; +} + +static Bytecount +coding_writer (Lstream *stream, const unsigned char *data, Bytecount size) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + + /* Convert all our data into convert_to, and then attempt to write + it all out to the other end. */ + Dynarr_reset (str->convert_to); + size = XCODESYSMETH (str->codesys, convert, + (str, data, str->convert_to, size)); + if (Lstream_write (str->other_end, Dynarr_atp (str->convert_to, 0), + Dynarr_length (str->convert_to)) < 0) + return -1; + else + /* The return value indicates how much of the incoming data was + processed, not how many bytes were written. */ + return size; +} + +static int +encode_decode_source_sink_type_is_char (Lisp_Object cs, + enum source_or_sink sex, + enum encode_decode direction) +{ + return (direction == CODING_DECODE ? + decoding_source_sink_type_is_char (cs, sex) : + encoding_source_sink_type_is_char (cs, sex)); +} + +/* Ensure that the convert methods only get full characters sent to them to + convert if the source of that conversion is characters; and that no such + full-character checking happens when the source is bytes. Keep in mind + that (1) the conversion_end_type return values take the perspective of + encoding; (2) the source for decoding is the same as the sink for + encoding; (3) when writing, the data is given to us, and we set our own + stream to be character mode or not; (4) when reading, the data comes + from the other_end stream, and we set that one to be character mode or + not. This is consistent with the comment above the prototype for + Lstream_set_character_mode(), which lays out rules for who is allowed to + modify the character type mode on a stream. + + NOTE: We could potentially implement the full-character checking stuff + ourselves, which might be a bit safer in case people mess up the + character mode themselves. But people shouldn't be doing that -- don't + hide bugs -- and there's no sense duplicating code. */ + +static void +set_coding_character_mode (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + Lstream *stream_to_set = + stream->flags & LSTREAM_FL_WRITE ? stream : str->other_end; + if (encode_decode_source_sink_type_is_char + (str->codesys, CODING_SOURCE, str->direction)) + Lstream_set_character_mode (stream_to_set); + else + Lstream_unset_character_mode (stream_to_set); +} + +static Lisp_Object +coding_marker (Lisp_Object stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (XLSTREAM (stream)); + + mark_object (str->orig_codesys); + mark_object (str->codesys); + MAYBE_XCODESYSMETH (str->codesys, mark_coding_stream, (str)); + return wrap_lstream (str->other_end); +} + +static int +coding_rewinder (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + MAYBE_XCODESYSMETH (str->codesys, rewind_coding_stream, (str)); + + str->ch = 0; + Dynarr_reset (str->convert_to); + Dynarr_reset (str->convert_from); + return Lstream_rewind (str->other_end); +} + +static int +coding_seekable_p (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + return Lstream_seekable_p (str->other_end); +} + +static int +coding_flusher (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + return Lstream_flush (str->other_end); +} + +static int +coding_closer (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + if (stream->flags & LSTREAM_FL_WRITE) + { + str->eof = 1; + coding_writer (stream, 0, 0); + str->eof = 0; + } + /* It's safe to free the runoff dynarrs now because they are used only + during conversion. We need to keep the type-specific data around, + though, because of canonicalize_after_coding. */ + if (str->convert_to) + { + Dynarr_free (str->convert_to); + str->convert_to = 0; + } + if (str->convert_from) { - if (EQ (prop, Qcharset_g0)) - return coding_system_charset (coding_system, 0); - else if (EQ (prop, Qcharset_g1)) - return coding_system_charset (coding_system, 1); - else if (EQ (prop, Qcharset_g2)) - return coding_system_charset (coding_system, 2); - else if (EQ (prop, Qcharset_g3)) - return coding_system_charset (coding_system, 3); - -#define FORCE_CHARSET(charset_num) \ - (XCODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT \ - (coding_system, charset_num) ? Qt : Qnil) - - else if (EQ (prop, Qforce_g0_on_output)) return FORCE_CHARSET (0); - else if (EQ (prop, Qforce_g1_on_output)) return FORCE_CHARSET (1); - else if (EQ (prop, Qforce_g2_on_output)) return FORCE_CHARSET (2); - else if (EQ (prop, Qforce_g3_on_output)) return FORCE_CHARSET (3); - -#define LISP_BOOLEAN(prop) \ - (XCODING_SYSTEM_ISO2022_##prop (coding_system) ? Qt : Qnil) - - else if (EQ (prop, Qshort)) return LISP_BOOLEAN (SHORT); - else if (EQ (prop, Qno_ascii_eol)) return LISP_BOOLEAN (NO_ASCII_EOL); - else if (EQ (prop, Qno_ascii_cntl)) return LISP_BOOLEAN (NO_ASCII_CNTL); - else if (EQ (prop, Qseven)) return LISP_BOOLEAN (SEVEN); - else if (EQ (prop, Qlock_shift)) return LISP_BOOLEAN (LOCK_SHIFT); - else if (EQ (prop, Qno_iso6429)) return LISP_BOOLEAN (NO_ISO6429); - else if (EQ (prop, Qescape_quoted)) return LISP_BOOLEAN (ESCAPE_QUOTED); - - else if (EQ (prop, Qinput_charset_conversion)) - return - unparse_charset_conversion_specs - (XCODING_SYSTEM (coding_system)->iso2022.input_conv); - else if (EQ (prop, Qoutput_charset_conversion)) - return - unparse_charset_conversion_specs - (XCODING_SYSTEM (coding_system)->iso2022.output_conv); - else - abort (); + Dynarr_free (str->convert_from); + str->convert_from = 0; + } + + return Lstream_close (str->other_end); +} + +static void +coding_finalizer (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + + assert (!str->finalized); + MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str)); + if (str->data) + { + xfree (str->data); + str->data = 0; + } + str->finalized = 1; +} + +static Lisp_Object +coding_stream_canonicalize_after_coding (Lstream *stream) +{ + struct coding_stream *str = CODING_STREAM_DATA (stream); + + return XCODESYSMETH_OR_GIVEN (str->codesys, canonicalize_after_coding, + (str), str->codesys); +} + +Lisp_Object +coding_stream_detected_coding_system (Lstream *stream) +{ + Lisp_Object codesys = + coding_stream_canonicalize_after_coding (stream); + if (NILP (codesys)) + return Fget_coding_system (Qidentity); + return codesys; +} + +Lisp_Object +coding_stream_coding_system (Lstream *stream) +{ + return CODING_STREAM_DATA (stream)->codesys; +} + +/* Change the coding system associated with a stream. */ + +void +set_coding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) +{ + struct coding_stream *str = CODING_STREAM_DATA (lstr); + if (EQ (str->orig_codesys, codesys)) + return; + /* We do the equivalent of closing the stream, destroying it, and + reinitializing it. This includes flushing out the data and signalling + EOF, if we're a writing stream; we also replace the type-specific data + with the data appropriate for the new coding system. */ + if (!NILP (str->codesys)) + { + if (lstr->flags & LSTREAM_FL_WRITE) + { + Lstream_flush (lstr); + str->eof = 1; + coding_writer (lstr, 0, 0); + str->eof = 0; + } + MAYBE_XCODESYSMETH (str->codesys, finalize_coding_stream, (str)); + } + str->orig_codesys = codesys; + str->codesys = coding_system_real_canonical (codesys); + + if (str->data) + { + xfree (str->data); + str->data = 0; } - else if (type == CODESYS_CCL) + if (XCODING_SYSTEM_METHODS (str->codesys)->coding_data_size) + str->data = + xmalloc_and_zero (XCODING_SYSTEM_METHODS (str->codesys)-> + coding_data_size); + MAYBE_XCODESYSMETH (str->codesys, init_coding_stream, (str)); + /* The new coding system may have different ideas regarding whether its + ends are characters or bytes. */ + set_coding_character_mode (lstr); +} + +/* WARNING WARNING WARNING WARNING!!!!! If you open up a coding + stream for writing, no automatic code detection will be performed. + The reason for this is that automatic code detection requires a + seekable input. Things will also fail if you open a coding + stream for reading using a non-fully-specified coding system and + a non-seekable input stream. */ + +static Lisp_Object +make_coding_stream_1 (Lstream *stream, Lisp_Object codesys, + const char *mode, enum encode_decode direction) +{ + Lstream *lstr = Lstream_new (lstream_coding, mode); + struct coding_stream *str = CODING_STREAM_DATA (lstr); + Lisp_Object obj; + + codesys = Fget_coding_system (codesys); + xzero (*str); + str->codesys = Qnil; + str->orig_codesys = Qnil; + str->us = lstr; + str->other_end = stream; + str->convert_to = Dynarr_new (unsigned_char); + str->convert_from = Dynarr_new (unsigned_char); + str->direction = direction; + set_coding_stream_coding_system (lstr, codesys); + XSETLSTREAM (obj, lstr); + return obj; +} + +Lisp_Object +make_coding_input_stream (Lstream *stream, Lisp_Object codesys, + enum encode_decode direction) +{ + return make_coding_stream_1 (stream, codesys, "r", direction); +} + +Lisp_Object +make_coding_output_stream (Lstream *stream, Lisp_Object codesys, + enum encode_decode direction) +{ + return make_coding_stream_1 (stream, codesys, "w", direction); +} + +static Lisp_Object +encode_decode_coding_region (Lisp_Object start, Lisp_Object end, + Lisp_Object coding_system, Lisp_Object buffer, + enum encode_decode direction) +{ + Charbpos b, e; + struct buffer *buf = decode_buffer (buffer, 0); + Lisp_Object instream = Qnil, to_outstream = Qnil, outstream = Qnil; + Lisp_Object from_outstream = Qnil, auto_outstream = Qnil; + Lisp_Object lb_outstream = Qnil; + Lisp_Object next; + Lstream *istr, *ostr; + struct gcpro gcpro1, gcpro2, gcpro3, gcpro4, gcpro5; + struct gcpro ngcpro1; + int source_char, sink_char; + + get_buffer_range_char (buf, start, end, &b, &e, 0); + barf_if_buffer_read_only (buf, b, e); + + GCPRO5 (instream, to_outstream, outstream, from_outstream, lb_outstream); + NGCPRO1 (auto_outstream); + + coding_system = Fget_coding_system (coding_system); + source_char = encode_decode_source_sink_type_is_char (coding_system, + CODING_SOURCE, + direction); + sink_char = encode_decode_source_sink_type_is_char (coding_system, + CODING_SINK, + direction); + + /* Order is IN <---> [TO] -> OUT -> [FROM] -> [AUTODETECT-EOL] -> LB */ + instream = make_lisp_buffer_input_stream (buf, b, e, 0); + next = lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); + + if (direction == CODING_DECODE && + XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT) + next = auto_outstream = + make_coding_output_stream + (XLSTREAM (next), Fget_coding_system (Qconvert_eol_autodetect), CODING_DECODE); + + if (!sink_char) + next = from_outstream = + make_coding_output_stream (XLSTREAM (next), Qbinary, CODING_DECODE); + outstream = make_coding_output_stream (XLSTREAM (next), coding_system, + direction); + if (!source_char) { - if (EQ (prop, Qdecode)) - return XCODING_SYSTEM_CCL_DECODE (coding_system); - else if (EQ (prop, Qencode)) - return XCODING_SYSTEM_CCL_ENCODE (coding_system); - else - abort (); + to_outstream = + make_coding_output_stream (XLSTREAM (outstream), + Qbinary, CODING_ENCODE); + ostr = XLSTREAM (to_outstream); + } + else + ostr = XLSTREAM (outstream); + istr = XLSTREAM (instream); + + /* The chain of streams looks like this: + + [BUFFER] <----- send through + ------> [CHAR->BYTE i.e. ENCODE AS BINARY if source is + in bytes] + ------> [ENCODE/DECODE AS SPECIFIED] + ------> [BYTE->CHAR i.e. DECODE AS BINARY + if sink is in bytes] + ------> [AUTODETECT EOL if + we're decoding and + coding system calls + for this] + ------> [BUFFER] + */ + while (1) + { + char tempbuf[1024]; /* some random amount */ + Charbpos newpos, even_newer_pos; + Charbpos oldpos = lisp_buffer_stream_startpos (istr); + Bytecount size_in_bytes = + Lstream_read (istr, tempbuf, sizeof (tempbuf)); + + if (!size_in_bytes) + break; + newpos = lisp_buffer_stream_startpos (istr); + Lstream_write (ostr, tempbuf, size_in_bytes); + even_newer_pos = lisp_buffer_stream_startpos (istr); + buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), + even_newer_pos, 0); } -#endif /* MULE */ - else - abort (); - - return Qnil; /* not reached */ + + { + Charcount retlen = + lisp_buffer_stream_startpos (XLSTREAM (instream)) - b; + Lstream_close (istr); + Lstream_close (ostr); + NUNGCPRO; + UNGCPRO; + Lstream_delete (istr); + if (!NILP (from_outstream)) + Lstream_delete (XLSTREAM (from_outstream)); + Lstream_delete (XLSTREAM (outstream)); + if (!NILP (to_outstream)) + Lstream_delete (XLSTREAM (to_outstream)); + if (!NILP (auto_outstream)) + Lstream_delete (XLSTREAM (auto_outstream)); + Lstream_delete (XLSTREAM (lb_outstream)); + return make_int (retlen); + } +} + +DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /* +Decode the text between START and END which is encoded in CODING-SYSTEM. +This is useful if you've read in encoded text from a file without decoding +it (e.g. you read in a JIS-formatted file but used the `binary' or +`no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B"). +Return length of decoded text. +BUFFER defaults to the current buffer if unspecified. +*/ + (start, end, coding_system, buffer)) +{ + return encode_decode_coding_region (start, end, coding_system, buffer, + CODING_DECODE); +} + +DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /* +Encode the text between START and END using CODING-SYSTEM. +This will, for example, convert Japanese characters into stuff such as +"^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded +text. BUFFER defaults to the current buffer if unspecified. +*/ + (start, end, coding_system, buffer)) +{ + return encode_decode_coding_region (start, end, coding_system, buffer, + CODING_ENCODE); } /************************************************************************/ -/* Coding category functions */ +/* Chain methods */ /************************************************************************/ +/* #### Need a way to create "opposite-direction" coding systems. */ + +/* Chain two or more coding systems together to make a combination coding + system. */ +DEFINE_CODING_SYSTEM_TYPE (chain); + +struct chain_coding_system +{ + /* List of coding systems, in decode order */ + Lisp_Object *chain; + /* Number of coding systems in list */ + int count; + /* Coding system to return as a result of canonicalize-after-coding */ + Lisp_Object canonicalize_after_coding; +}; + +struct chain_coding_stream +{ + int initted; + /* Lstreams for chain coding system */ + Lisp_Object *lstreams; + int lstream_count; +}; + +static const struct lrecord_description lo_description_1[] = { + { XD_LISP_OBJECT, 0 }, + { XD_END } +}; + +static const struct struct_description lo_description = { + sizeof (Lisp_Object), + lo_description_1 +}; + +static const struct lrecord_description chain_coding_system_description[] = { + { XD_INT, + coding_system_data_offset + offsetof (struct chain_coding_system, + count) }, + { XD_STRUCT_PTR, + coding_system_data_offset + offsetof (struct chain_coding_system, + chain), + XD_INDIRECT (0, 0), &lo_description }, + { XD_LISP_OBJECT, + coding_system_data_offset + offsetof (struct chain_coding_system, + canonicalize_after_coding) }, + { XD_END } +}; + +static Lisp_Object +chain_canonicalize (Lisp_Object codesys) +{ + /* We make use of the fact that this method is called at init time, after + properties have been parsed. init_method is called too early. */ + /* #### It's not clear we need this whole chain-canonicalize mechanism + any more. */ + Lisp_Object chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (codesys), + XCODING_SYSTEM_CHAIN_CHAIN (codesys)); + chain = Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (codesys), + Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (codesys), + chain)); + Fputhash (chain, codesys, Vchain_canonicalize_hash_table); + return codesys; +} + +static Lisp_Object +chain_canonicalize_after_coding (struct coding_stream *str) +{ + Lisp_Object cac = + XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (str->codesys); + if (!NILP (cac)) + return cac; + return str->codesys; +#if 0 + struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain); + Lisp_Object us = str->codesys, codesys; + int i; + Lisp_Object chain; + Lisp_Object tail; + int changed = 0; + + /* #### It's not clear we need this whole chain-canonicalize mechanism + any more. */ + if (str->direction == CODING_ENCODE || !data->initted) + return us; + + chain = Flist (XCODING_SYSTEM_CHAIN_COUNT (us), + XCODING_SYSTEM_CHAIN_CHAIN (us)); + + tail = chain; + for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (us); i++) + { + codesys = (coding_stream_canonicalize_after_coding + (XLSTREAM (data->lstreams[i]))); + if (!EQ (codesys, XCAR (tail))) + changed = 1; + XCAR (tail) = codesys; + tail = XCDR (tail); + } + + if (!changed) + return us; + + chain = delq_no_quit (Qnil, chain); + + if (NILP (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us)) && + NILP (XCODING_SYSTEM_POST_READ_CONVERSION (us))) + { + if (NILP (chain)) + return Qnil; + if (NILP (XCDR (chain))) + return XCAR (chain); + } + + codesys = Fgethash (Fcons (XCODING_SYSTEM_PRE_WRITE_CONVERSION (us), + Fcons (XCODING_SYSTEM_POST_READ_CONVERSION (us), + chain)), Vchain_canonicalize_hash_table, + Qnil); + if (!NILP (codesys)) + return codesys; + return make_internal_coding_system + (us, "internal-chain-canonicalizer-wrapper", + Qchain, Qunbound, list2 (Qchain, chain)); +#endif /* 0 */ +} + +static void +chain_init (Lisp_Object codesys) +{ + XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) = Qnil; +} + +static void +chain_mark (Lisp_Object codesys) +{ + int i; + + for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (codesys); i++) + mark_object (XCODING_SYSTEM_CHAIN_CHAIN (codesys)[i]); + mark_object (XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys)); +} + +static void +chain_mark_coding_stream_1 (struct chain_coding_stream *data) +{ + int i; + + for (i = 0; i < data->lstream_count; i++) + mark_object (data->lstreams[i]); +} + +static void +chain_mark_coding_stream (struct coding_stream *str) +{ + chain_mark_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); +} + +static void +chain_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) +{ + int i; + + write_c_string ("(", printcharfun); + for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (cs); i++) + { + write_c_string (i == 0 ? "" : "->", printcharfun); + print_coding_system_in_print_method (XCODING_SYSTEM_CHAIN_CHAIN (cs)[i], + printcharfun, escapeflag); + } + { + Lisp_Object cac = XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (cs); + if (!NILP (cac)) + { + if (i > 0) + write_c_string (" ", printcharfun); + write_c_string ("canonicalize-after-coding=", printcharfun); + print_coding_system_in_print_method (cac, printcharfun, escapeflag); + } + } + + write_c_string (")", printcharfun); +} + +static void +chain_rewind_coding_stream_1 (struct chain_coding_stream *data) +{ + /* Each will rewind the next; there is always at least one stream (the + dynarr stream at the end) if we're initted */ + if (data->initted) + Lstream_rewind (XLSTREAM (data->lstreams[0])); +} + +static void +chain_rewind_coding_stream (struct coding_stream *str) +{ + chain_rewind_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); +} + +static void +chain_init_coding_streams_1 (struct chain_coding_stream *data, + unsigned_char_dynarr *dst, + int ncodesys, Lisp_Object *codesys, + enum encode_decode direction) +{ + int i; + Lisp_Object lstream_out; + + data->lstream_count = ncodesys + 1; + data->lstreams = xnew_array (Lisp_Object, data->lstream_count); + + lstream_out = make_dynarr_output_stream (dst); + Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED, 0); + data->lstreams[data->lstream_count - 1] = lstream_out; + + for (i = ncodesys - 1; i >= 0; i--) + { + data->lstreams[i] = + make_coding_output_stream + (XLSTREAM (lstream_out), + codesys[direction == CODING_ENCODE ? ncodesys - (i + 1) : i], + direction); + lstream_out = data->lstreams[i]; + Lstream_set_buffering (XLSTREAM (lstream_out), LSTREAM_UNBUFFERED, + 0); + } + data->initted = 1; +} + +static Bytecount +chain_convert (struct coding_stream *str, const UExtbyte *src, + unsigned_char_dynarr *dst, Bytecount n) +{ + struct chain_coding_stream *data = CODING_STREAM_TYPE_DATA (str, chain); + + if (str->eof) + { + /* Each will close the next; there is always at least one stream (the + dynarr stream at the end) if we're initted. We need to close now + because more data may be generated. */ + if (data->initted) + Lstream_close (XLSTREAM (data->lstreams[0])); + return n; + } + + if (!data->initted) + chain_init_coding_streams_1 + (data, dst, XCODING_SYSTEM_CHAIN_COUNT (str->codesys), + XCODING_SYSTEM_CHAIN_CHAIN (str->codesys), str->direction); + + if (Lstream_write (XLSTREAM (data->lstreams[0]), src, n) < 0) + return -1; + return n; +} + +static void +chain_finalize_coding_stream_1 (struct chain_coding_stream *data) +{ + if (data->lstreams) + { + /* Order of deletion is important here! Delete from the head of the + chain and work your way towards the tail. In general, when you + delete an object, there should be *NO* pointers to it anywhere. + Deleting back-to-front would be a problem because there are + pointers going forward. If there were pointers in both + directions, you'd have to disconnect the pointers to a particular + object before deleting it. */ + if (!gc_in_progress) + { + int i; + /* During GC, these objects are unmarked, and are about to be + freed. We do NOT want them on the free list, and that will + cause lots of nastiness including crashes. Just let them be + freed normally. */ + for (i = 0; i < data->lstream_count; i++) + Lstream_delete (XLSTREAM ((data->lstreams)[i])); + } + xfree (data->lstreams); + } +} + +static void +chain_finalize_coding_stream (struct coding_stream *str) +{ + chain_finalize_coding_stream_1 (CODING_STREAM_TYPE_DATA (str, chain)); +} + +static void +chain_finalize (Lisp_Object c) +{ + if (XCODING_SYSTEM_CHAIN_CHAIN (c)) + xfree (XCODING_SYSTEM_CHAIN_CHAIN (c)); +} + static int -decode_coding_category (Lisp_Object symbol) +chain_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) +{ + if (EQ (key, Qchain)) + { + Lisp_Object tail; + Lisp_Object *cslist; + int count = 0; + int i; + + EXTERNAL_LIST_LOOP (tail, value) + { + Fget_coding_system (XCAR (tail)); + count++; + } + + cslist = xnew_array (Lisp_Object, count); + XCODING_SYSTEM_CHAIN_CHAIN (codesys) = cslist; + + count = 0; + EXTERNAL_LIST_LOOP (tail, value) + { + cslist[count] = Fget_coding_system (XCAR (tail)); + count++; + } + + XCODING_SYSTEM_CHAIN_COUNT (codesys) = count; + + for (i = 0; i < count - 1; i++) + { + if (decoding_source_sink_type_is_char (cslist[i], CODING_SINK) != + decoding_source_sink_type_is_char (cslist[i + 1], CODING_SOURCE)) + invalid_argument_2 ("Sink of first must match source of second", + cslist[i], cslist[i + 1]); + } + } + else if (EQ (key, Qcanonicalize_after_coding)) + XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (codesys) = + Fget_coding_system (value); + else + return 0; + return 1; +} + +static Lisp_Object +chain_getprop (Lisp_Object coding_system, Lisp_Object prop) +{ + if (EQ (prop, Qchain)) + { + Lisp_Object result = Qnil; + int i; + + for (i = 0; i < XCODING_SYSTEM_CHAIN_COUNT (coding_system); i++) + result = Fcons (XCODING_SYSTEM_CHAIN_CHAIN (coding_system)[i], + result); + + return Fnreverse (result); + } + else if (EQ (prop, Qcanonicalize_after_coding)) + return XCODING_SYSTEM_CHAIN_CANONICALIZE_AFTER_CODING (coding_system); + else + return Qunbound; +} + +static enum source_sink_type +chain_conversion_end_type (Lisp_Object codesys) +{ + Lisp_Object *cslist = XCODING_SYSTEM_CHAIN_CHAIN (codesys); + int n = XCODING_SYSTEM_CHAIN_COUNT (codesys); + int charp_source, charp_sink; + + if (n == 0) + return DECODES_BYTE_TO_BYTE; /* arbitrary */ + charp_source = decoding_source_sink_type_is_char (cslist[0], CODING_SOURCE); + charp_sink = decoding_source_sink_type_is_char (cslist[n - 1], CODING_SINK); + + switch (charp_source * 2 + charp_sink) + { + case 0: return DECODES_BYTE_TO_BYTE; + case 1: return DECODES_BYTE_TO_CHARACTER; + case 2: return DECODES_CHARACTER_TO_BYTE; + case 3: return DECODES_CHARACTER_TO_CHARACTER; + } + + abort (); + return DECODES_BYTE_TO_BYTE; +} + + +/************************************************************************/ +/* No-conversion methods */ +/************************************************************************/ + +/* "No conversion"; used for binary files. We use quotes because there + really is some conversion being applied (it does byte<->char + conversion), but it appears to the user as if the text is read in + without conversion. */ +DEFINE_CODING_SYSTEM_TYPE (no_conversion); + +/* This is used when reading in "binary" files -- i.e. files that may + contain all 256 possible byte values and that are not to be + interpreted as being in any particular encoding. */ +static Bytecount +no_conversion_convert (struct coding_stream *str, + const UExtbyte *src, + unsigned_char_dynarr *dst, Bytecount n) +{ + UExtbyte c; + unsigned int ch = str->ch; + Bytecount orign = n; + + if (str->direction == CODING_DECODE) + { + while (n--) + { + c = *src++; + + DECODE_ADD_BINARY_CHAR (c, dst); + } + + if (str->eof) + DECODE_OUTPUT_PARTIAL_CHAR (ch, dst); + } + else + { + + while (n--) + { + c = *src++; + if (BYTE_ASCII_P (c)) + { + assert (ch == 0); + Dynarr_add (dst, c); + } +#ifdef MULE + else if (INTBYTE_LEADING_BYTE_P (c)) + { + assert (ch == 0); + if (c == LEADING_BYTE_LATIN_ISO8859_1 || + c == LEADING_BYTE_CONTROL_1) + ch = c; + else + Dynarr_add (dst, '~'); /* untranslatable character */ + } + else + { + if (ch == LEADING_BYTE_LATIN_ISO8859_1) + Dynarr_add (dst, c); + else if (ch == LEADING_BYTE_CONTROL_1) + { + assert (c < 0xC0); + Dynarr_add (dst, c - 0x20); + } + /* else it should be the second or third byte of an + untranslatable character, so ignore it */ + ch = 0; + } +#endif /* MULE */ + + } + } + + str->ch = ch; + return orign; +} + +DEFINE_DETECTOR (no_conversion); +DEFINE_DETECTOR_CATEGORY (no_conversion, no_conversion); + +struct no_conversion_detector +{ + int dummy; +}; + +static void +no_conversion_detect (struct detection_state *st, const UExtbyte *src, + Bytecount n) +{ + /* Hack until we get better handling of this stuff! */ + DET_RESULT (st, no_conversion) = DET_SLIGHTLY_LIKELY; +} + + +/************************************************************************/ +/* Convert-eol methods */ +/************************************************************************/ + +/* This is used to handle end-of-line (EOL) differences. It is +character-to-character, and works (when encoding) *BEFORE* sending +data to the main encoding routine -- thus, that routine must handle +different EOL types itself if it does line-oriented type processing. +This is unavoidable because we don't know whether the output of the +main encoding routine is ASCII compatible (Unicode is definitely not, +for example). + +There is one parameter: `subtype', either `cr', `lf', `crlf', or `autodetect'. +*/ + +DEFINE_CODING_SYSTEM_TYPE (convert_eol); + +struct convert_eol_coding_system +{ + enum eol_type subtype; +}; + +#define CODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \ + (CODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype) +#define XCODING_SYSTEM_CONVERT_EOL_SUBTYPE(codesys) \ + (XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol)->subtype) + +struct convert_eol_coding_stream +{ + enum eol_type actual; +}; + +static const struct lrecord_description + convert_eol_coding_system_description[] = { + { XD_END } +}; + +static void +convert_eol_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) +{ + struct convert_eol_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (cs, convert_eol); + + write_fmt_string (printcharfun, "(%s)", + data->subtype == EOL_LF ? "lf" : + data->subtype == EOL_CRLF ? "crlf" : + data->subtype == EOL_CR ? "cr" : + data->subtype == EOL_AUTODETECT ? "autodetect" : + (abort(), "")); +} + +static enum source_sink_type +convert_eol_conversion_end_type (Lisp_Object codesys) +{ + return DECODES_CHARACTER_TO_CHARACTER; +} + +static int +convert_eol_putprop (Lisp_Object codesys, + Lisp_Object key, + Lisp_Object value) +{ + struct convert_eol_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (codesys, convert_eol); + + if (EQ (key, Qsubtype)) + { + if (EQ (value, Qlf) /* || EQ (value, Qunix) */) + data->subtype = EOL_LF; + else if (EQ (value, Qcrlf) /* || EQ (value, Qdos) */) + data->subtype = EOL_CRLF; + else if (EQ (value, Qcr) /* || EQ (value, Qmac) */) + data->subtype = EOL_CR; + else if (EQ (value, Qautodetect) /* || EQ (value, Qmac) */) + data->subtype = EOL_AUTODETECT; + else invalid_constant ("Unrecognized eol type", value); + } + else + return 0; + return 1; +} + +static Lisp_Object +convert_eol_getprop (Lisp_Object coding_system, Lisp_Object prop) +{ + struct convert_eol_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (coding_system, convert_eol); + + if (EQ (prop, Qsubtype)) + { + switch (data->subtype) + { + case EOL_LF: return Qlf; + case EOL_CRLF: return Qcrlf; + case EOL_CR: return Qcr; + case EOL_AUTODETECT: return Qautodetect; + default: abort (); + } + } + + return Qunbound; +} + +static void +convert_eol_init_coding_stream (struct coding_stream *str) +{ + struct convert_eol_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, convert_eol); + data->actual = XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys); +} + +static Bytecount +convert_eol_convert (struct coding_stream *str, const Intbyte *src, + unsigned_char_dynarr *dst, Bytecount n) +{ + if (str->direction == CODING_DECODE) + { + struct convert_eol_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, convert_eol); + + if (data->actual == EOL_AUTODETECT) + { + Bytecount n2 = n; + const Intbyte *src2 = src; + + for (; n2; n2--) + { + Intbyte c = *src2++; + if (c == '\n') + { + data->actual = EOL_LF; + break; + } + else if (c == '\r') + { + if (n2 == 1) + { + /* If we're seeing a '\r' at the end of the data, then + reject the '\r' right now so it doesn't become an + issue in the code below -- unless we're at the end of + the stream, in which case we can't do that (because + then the '\r' will never get written out), and in any + case we should be recognizing it at EOL_CR format. */ + if (str->eof) + data->actual = EOL_CR; + else + n--; + break; + } + else if (*src2 == '\n') + data->actual = EOL_CRLF; + else + data->actual = EOL_CR; + break; + } + } + } + + /* str->eof is set, the caller reached EOF on the other end and has + no new data to give us. The only data we get is the data we + rejected from last time. */ + if (data->actual == EOL_LF || data->actual == EOL_AUTODETECT || + (str->eof)) + Dynarr_add_many (dst, src, n); + else + { + const Intbyte *end = src + n; + while (1) + { + /* Find the next section with no \r and add it. */ + const Intbyte *runstart = src; + src = (Intbyte *) memchr (src, '\r', end - src); + if (!src) + src = end; + Dynarr_add_many (dst, runstart, src - runstart); + /* Stop if at end ... */ + if (src == end) + break; + /* ... else, translate as necessary. */ + src++; + if (data->actual == EOL_CR) + Dynarr_add (dst, '\n'); + /* We need to be careful here with CRLF. If we see a CR at the + end of the data, we don't know if it's part of a CRLF, so we + reject it. Otherwise: If it's part of a CRLF, eat it and + loop; the following LF gets added next time around. If it's + not part of a CRLF, add the CR and loop. The following + character will be processed in the next loop iteration. This + correctly handles a sequence like CR+CR+LF. */ + else if (src == end) + return n - 1; /* reject the CR at the end; we'll get it again + next time the convert method is called */ + else if (*src != '\n') + Dynarr_add (dst, '\r'); + } + } + + return n; + } + else + { + enum eol_type subtype = + XCODING_SYSTEM_CONVERT_EOL_SUBTYPE (str->codesys); + const Intbyte *end = src + n; + + /* We try to be relatively efficient here. */ + if (subtype == EOL_LF) + Dynarr_add_many (dst, src, n); + else + { + while (1) + { + /* Find the next section with no \n and add it. */ + const Intbyte *runstart = src; + src = (Intbyte *) memchr (src, '\n', end - src); + if (!src) + src = end; + Dynarr_add_many (dst, runstart, src - runstart); + /* Stop if at end ... */ + if (src == end) + break; + /* ... else, skip over \n and add its translation. */ + src++; + Dynarr_add (dst, '\r'); + if (subtype == EOL_CRLF) + Dynarr_add (dst, '\n'); + } + } + + return n; + } +} + +static Lisp_Object +convert_eol_canonicalize_after_coding (struct coding_stream *str) +{ + struct convert_eol_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, convert_eol); + + if (str->direction == CODING_ENCODE) + return str->codesys; + + switch (data->actual) + { + case EOL_LF: return Fget_coding_system (Qconvert_eol_lf); + case EOL_CRLF: return Fget_coding_system (Qconvert_eol_crlf); + case EOL_CR: return Fget_coding_system (Qconvert_eol_cr); + case EOL_AUTODETECT: return str->codesys; + default: abort (); return Qnil; + } +} + + +/************************************************************************/ +/* Undecided methods */ +/************************************************************************/ + +/* Do autodetection. We can autodetect the EOL type only, the coding + system only, or both. We only do autodetection when decoding; when + encoding, we just pass the data through. + + When doing just EOL detection, a coding system can be specified; if so, + we will decode this data through the coding system before doing EOL + detection. The reason for specifying this is so that + canonicalize-after-coding works: We will canonicalize the specified + coding system into the appropriate EOL type. When doing both coding and + EOL detection, we do similar canonicalization, and also catch situations + where the EOL type is overspecified, i.e. the detected coding system + specifies an EOL type, and either switch to the equivalent + non-EOL-processing coding system (if possible), or terminate EOL + detection and use the specified EOL type. This prevents data from being + EOL-processed twice. + */ + +DEFINE_CODING_SYSTEM_TYPE (undecided); + +struct undecided_coding_system +{ + int do_eol, do_coding; + Lisp_Object cs; +}; + +struct undecided_coding_stream +{ + Lisp_Object actual; + /* Either 2 or 3 lstreams here; see undecided_convert */ + struct chain_coding_stream c; + + struct detection_state *st; +}; + +static const struct lrecord_description + undecided_coding_system_description[] = { + { XD_LISP_OBJECT, + coding_system_data_offset + offsetof (struct undecided_coding_system, + cs) }, + { XD_END } +}; + +static void +undecided_init (Lisp_Object codesys) +{ + struct undecided_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (codesys, undecided); + + data->cs = Qnil; +} + +static void +undecided_mark (Lisp_Object codesys) +{ + struct undecided_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (codesys, undecided); + + mark_object (data->cs); +} + +static void +undecided_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) +{ + struct undecided_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (cs, undecided); + int need_space = 0; + + write_c_string ("(", printcharfun); + if (data->do_eol) + { + write_c_string ("do-eol", printcharfun); + need_space = 1; + } + if (data->do_coding) + { + if (need_space) + write_c_string (" ", printcharfun); + write_c_string ("do-coding", printcharfun); + need_space = 1; + } + if (!NILP (data->cs)) + { + if (need_space) + write_c_string (" ", printcharfun); + write_c_string ("coding-system=", printcharfun); + print_coding_system_in_print_method (data->cs, printcharfun, escapeflag); + } + write_c_string (")", printcharfun); +} + +static void +undecided_mark_coding_stream (struct coding_stream *str) +{ + chain_mark_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c); +} + +static int +undecided_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) +{ + struct undecided_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (codesys, undecided); + + if (EQ (key, Qdo_eol)) + data->do_eol = 1; + else if (EQ (key, Qdo_coding)) + data->do_coding = 1; + else if (EQ (key, Qcoding_system)) + data->cs = get_coding_system_for_text_file (value, 0); + else + return 0; + return 1; +} + +static Lisp_Object +undecided_getprop (Lisp_Object codesys, Lisp_Object prop) +{ + struct undecided_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (codesys, undecided); + + if (EQ (prop, Qdo_eol)) + return data->do_eol ? Qt : Qnil; + if (EQ (prop, Qdo_coding)) + return data->do_coding ? Qt : Qnil; + if (EQ (prop, Qcoding_system)) + return data->cs; + return Qunbound; +} + +static struct detection_state * +allocate_detection_state (void) +{ + int i; + Bytecount size = MAX_ALIGN_SIZE (sizeof (struct detection_state)); + struct detection_state *block; + + for (i = 0; i < coding_detector_count; i++) + size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size); + + block = (struct detection_state *) xmalloc_and_zero (size); + + size = MAX_ALIGN_SIZE (sizeof (struct detection_state)); + for (i = 0; i < coding_detector_count; i++) + { + block->data_offset[i] = size; + size += MAX_ALIGN_SIZE (Dynarr_at (all_coding_detectors, i).data_size); + } + + return block; +} + +static void +free_detection_state (struct detection_state *st) +{ + int i; + + for (i = 0; i < coding_detector_count; i++) + { + if (Dynarr_at (all_coding_detectors, i).finalize_detection_state_method) + Dynarr_at (all_coding_detectors, i).finalize_detection_state_method + (st); + } + + xfree (st); +} + +static int +coding_category_symbol_to_id (Lisp_Object symbol) { int i; CHECK_SYMBOL (symbol); - for (i = 0; i < CODING_CATEGORY_LAST; i++) - if (EQ (coding_category_symbol[i], symbol)) - return i; - + for (i = 0; i < coding_detector_count; i++) + { + detector_category_dynarr *cats = + Dynarr_at (all_coding_detectors, i).cats; + int j; + + for (j = 0; j < Dynarr_length (cats); j++) + if (EQ (Dynarr_at (cats, j).sym, symbol)) + return Dynarr_at (cats, j).id; + } + invalid_constant ("Unrecognized coding category", symbol); return 0; /* not reached */ } -DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /* -Return a list of all recognized coding categories. -*/ - ()) -{ - int i; - Lisp_Object list = Qnil; - - for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) - list = Fcons (coding_category_symbol[i], list); - return list; -} - -DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /* -Change the priority order of the coding categories. -LIST should be list of coding categories, in descending order of -priority. Unspecified coding categories will be lower in priority -than all specified ones, in the same relative order they were in -previously. -*/ - (list)) -{ - int category_to_priority[CODING_CATEGORY_LAST]; - int i, j; - Lisp_Object rest; - - /* First generate a list that maps coding categories to priorities. */ - - for (i = 0; i < CODING_CATEGORY_LAST; i++) - category_to_priority[i] = -1; - - /* Highest priority comes from the specified list. */ - i = 0; - EXTERNAL_LIST_LOOP (rest, list) - { - int cat = decode_coding_category (XCAR (rest)); - - if (category_to_priority[cat] >= 0) - sferror ("Duplicate coding category in list", XCAR (rest)); - category_to_priority[cat] = i++; - } - - /* Now go through the existing categories by priority to retrieve - the categories not yet specified and preserve their priority - order. */ - for (j = 0; j < CODING_CATEGORY_LAST; j++) - { - int cat = fcd->coding_category_by_priority[j]; - if (category_to_priority[cat] < 0) - category_to_priority[cat] = i++; - } - - /* Now we need to construct the inverse of the mapping we just - constructed. */ - - for (i = 0; i < CODING_CATEGORY_LAST; i++) - fcd->coding_category_by_priority[category_to_priority[i]] = i; - - /* Phew! That was confusing. */ - return Qnil; -} - -DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /* -Return a list of coding categories in descending order of priority. -*/ - ()) +static Lisp_Object +coding_category_id_to_symbol (int id) { int i; - Lisp_Object list = Qnil; - - for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) - list = Fcons (coding_category_symbol[fcd->coding_category_by_priority[i]], - list); - return list; -} - -DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /* -Change the coding system associated with a coding category. -*/ - (coding_category, coding_system)) -{ - int cat = decode_coding_category (coding_category); - - coding_system = Fget_coding_system (coding_system); - fcd->coding_category_system[cat] = coding_system; - return Qnil; + + for (i = 0; i < coding_detector_count; i++) + { + detector_category_dynarr *cats = + Dynarr_at (all_coding_detectors, i).cats; + int j; + + for (j = 0; j < Dynarr_length (cats); j++) + if (id == Dynarr_at (cats, j).id) + return Dynarr_at (cats, j).sym; + } + + abort (); + return Qnil; /* (usually) not reached */ } -DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /* -Return the coding system associated with a coding category. -*/ - (coding_category)) -{ - int cat = decode_coding_category (coding_category); - Lisp_Object sys = fcd->coding_category_system[cat]; - - if (!NILP (sys)) - return XCODING_SYSTEM_NAME (sys); - return Qnil; -} - - -/************************************************************************/ -/* Detecting the encoding of data */ -/************************************************************************/ - -struct detection_state +static Lisp_Object +detection_result_number_to_symbol (enum detection_result result) { - eol_type_t eol_type; - int seen_non_ascii; - int mask; -#ifdef MULE - struct - { - int mask; - int in_second_byte; - } - big5; - - struct - { - int mask; - int in_second_byte; - } - shift_jis; - - struct - { - int mask; - int in_byte; - } - ucs4; - - struct - { - int mask; - int in_byte; - } - utf8; - - struct - { - int mask; - int initted; - struct iso2022_decoder iso; - unsigned int flags; - int high_byte_count; - unsigned int saw_single_shift:1; - } - iso2022; -#endif - struct - { - int seen_anything; - int just_saw_cr; - } - eol; -}; +#define FROB(sym, num) if (result == num) return (sym) + FROB (Qnear_certainty, DET_NEAR_CERTAINTY); + FROB (Qquite_probable, DET_QUITE_PROBABLE); + FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY); + FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY); + FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY); + FROB (Qquite_improbable, DET_QUITE_IMPROBABLE); + FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE); +#undef FROB + + abort (); + return Qnil; /* (usually) not reached */ +} + +static enum detection_result +detection_result_symbol_to_number (Lisp_Object symbol) +{ +#define FROB(sym, num) if (EQ (symbol, sym)) return (num) + FROB (Qnear_certainty, DET_NEAR_CERTAINTY); + FROB (Qquite_probable, DET_QUITE_PROBABLE); + FROB (Qsomewhat_likely, DET_SOMEWHAT_LIKELY); + FROB (Qas_likely_as_unlikely, DET_AS_LIKELY_AS_UNLIKELY); + FROB (Qsomewhat_unlikely, DET_SOMEWHAT_UNLIKELY); + FROB (Qquite_improbable, DET_QUITE_IMPROBABLE); + FROB (Qnearly_impossible, DET_NEARLY_IMPOSSIBLE); +#undef FROB + + invalid_constant ("Unrecognized detection result", symbol); + return ((enum detection_result) 0); /* not reached */ +} + +/* Set all detection results for a given detector to a specified value. */ +void +set_detection_results (struct detection_state *st, int detector, int given) +{ + detector_category_dynarr *cats = + Dynarr_at (all_coding_detectors, detector).cats; + int i; + + for (i = 0; i < Dynarr_length (cats); i++) + st->categories[Dynarr_at (cats, i).id] = given; +} static int acceptable_control_char_p (int c) @@ -1607,116 +3480,147 @@ } } -static int -mask_has_at_most_one_bit_p (int mask) +#ifdef DEBUG_XEMACS + +static UExtbyte +hex_digit_to_char (int digit) { - /* Perhaps the only thing useful you learn from intensive Microsoft - technical interviews */ - return (mask & (mask - 1)) == 0; + if (digit < 10) + return digit + '0'; + else + return digit - 10 + 'A'; } -static eol_type_t -detect_eol_type (struct detection_state *st, const Extbyte *src, - Bytecount n) +static void +output_bytes_in_ascii_and_hex (const UExtbyte *src, Bytecount n) { - while (n--) + UExtbyte *ascii = alloca_array (UExtbyte, n + 1); + UExtbyte *hex = alloca_array (UExtbyte, 3 * n + 1); + int i; + + for (i = 0; i < n; i++) { - unsigned char c = *(unsigned char *)src++; - if (c == '\n') - { - if (st->eol.just_saw_cr) - return EOL_CRLF; - else if (st->eol.seen_anything) - return EOL_LF; - } - else if (st->eol.just_saw_cr) - return EOL_CR; - else if (c == '\r') - st->eol.just_saw_cr = 1; + UExtbyte c = src[i]; + if (c < 0x20) + ascii[i] = '.'; else - st->eol.just_saw_cr = 0; - st->eol.seen_anything = 1; + ascii[i] = c; + hex[3 * i] = hex_digit_to_char (c >> 4); + hex[3 * i + 1] = hex_digit_to_char (c & 0xF); + hex[3 * i + 2] = ' '; } - - return EOL_AUTODETECT; + ascii[i] = '\0'; + hex[3 * i - 1] = '\0'; + stderr_out ("%s %s", ascii, hex); } -/* Attempt to determine the encoding and EOL type of the given text. - Before calling this function for the first type, you must initialize - st->eol_type as appropriate and initialize st->mask to ~0. - - st->eol_type holds the determined EOL type, or EOL_AUTODETECT if - not yet known. - - st->mask holds the determined coding category mask, or ~0 if only - ASCII has been seen so far. +#endif /* DEBUG_XEMACS */ + +/* Attempt to determine the encoding of the given text. Before calling + this function for the first time, you must zero out the detection state. Returns: - 0 == st->eol_type is EOL_AUTODETECT and/or more than coding category - is present in st->mask - 1 == definitive answers are here for both st->eol_type and st->mask + 0 == keep going + 1 == stop */ static int -detect_coding_type (struct detection_state *st, const Extbyte *src, - Bytecount n, int just_do_eol) +detect_coding_type (struct detection_state *st, const UExtbyte *src, + Bytecount n) { - if (st->eol_type == EOL_AUTODETECT) - st->eol_type = detect_eol_type (st, src, n); - - if (just_do_eol) - return st->eol_type != EOL_AUTODETECT; - + Bytecount n2 = n; + const UExtbyte *src2 = src; + int i; + +#ifdef DEBUG_XEMACS + if (!NILP (Vdebug_coding_detection)) + { + int bytes = min (16, n); + stderr_out ("detect_coding_type: processing %ld bytes\n", n); + stderr_out ("First %d: ", bytes); + output_bytes_in_ascii_and_hex (src, bytes); + stderr_out ("\nLast %d: ", bytes); + output_bytes_in_ascii_and_hex (src + n - bytes, bytes); + stderr_out ("\n"); + } +#endif /* DEBUG_XEMACS */ if (!st->seen_non_ascii) { - for (; n; n--, src++) + for (; n2; n2--, src2++) { - unsigned char c = *(unsigned char *) src; + UExtbyte c = *src2; if ((c < 0x20 && !acceptable_control_char_p (c)) || c >= 0x80) { st->seen_non_ascii = 1; -#ifdef MULE - st->shift_jis.mask = ~0; - st->big5.mask = ~0; - st->ucs4.mask = ~0; - st->utf8.mask = ~0; - st->iso2022.mask = ~0; -#endif break; } } } - if (!n) - return 0; -#ifdef MULE - if (!mask_has_at_most_one_bit_p (st->iso2022.mask)) - st->iso2022.mask = detect_coding_iso2022 (st, src, n); - if (!mask_has_at_most_one_bit_p (st->shift_jis.mask)) - st->shift_jis.mask = detect_coding_sjis (st, src, n); - if (!mask_has_at_most_one_bit_p (st->big5.mask)) - st->big5.mask = detect_coding_big5 (st, src, n); - if (!mask_has_at_most_one_bit_p (st->utf8.mask)) - st->utf8.mask = detect_coding_utf8 (st, src, n); - if (!mask_has_at_most_one_bit_p (st->ucs4.mask)) - st->ucs4.mask = detect_coding_ucs4 (st, src, n); - - st->mask - = st->iso2022.mask | st->shift_jis.mask | st->big5.mask - | st->utf8.mask | st->ucs4.mask; + for (i = 0; i < coding_detector_count; i++) + Dynarr_at (all_coding_detectors, i).detect_method (st, src, n); + + st->bytes_seen += n; + +#ifdef DEBUG_XEMACS + if (!NILP (Vdebug_coding_detection)) + { + stderr_out ("seen_non_ascii: %d\n", st->seen_non_ascii); + for (i = 0; i < coding_detector_category_count; i++) + stderr_out_lisp + ("%s: %s\n", + 2, + coding_category_id_to_symbol (i), + detection_result_number_to_symbol ((enum detection_result) + st->categories[i])); + } +#endif /* DEBUG_XEMACS */ + + { + int not_unlikely = 0; + int retval; + + for (i = 0; i < coding_detector_category_count; i++) + if (st->categories[i] >= 0) + not_unlikely++; + + retval = (not_unlikely <= 1 +#if 0 /* this is bogus */ + || st->bytes_seen >= MAX_BYTES_PROCESSED_FOR_DETECTION #endif - { - int retval = mask_has_at_most_one_bit_p (st->mask); - st->mask |= CODING_CATEGORY_NO_CONVERSION_MASK; - return retval && st->eol_type != EOL_AUTODETECT; + ); + +#ifdef DEBUG_XEMACS + if (!NILP (Vdebug_coding_detection)) + stderr_out ("detect_coding_type: returning %d (%s)\n", + retval, retval ? "stop" : "keep going"); +#endif /* DEBUG_XEMACS */ + + return retval; } } static Lisp_Object -coding_system_from_mask (int mask) +detected_coding_system (struct detection_state *st) { - if (mask == ~0) + int i; + int even = 1; + + if (st->seen_non_ascii) + { + for (i = 0; i < coding_detector_category_count; i++) + if (st->categories[i] != DET_AS_LIKELY_AS_UNLIKELY) + { + even = 0; + break; + } + } + + /* #### Here we are ignoring the results of detection when it's all + ASCII. This is obviously a bad thing. But we need to fix up the + existing detection methods somewhat before we can switch. */ + if (even) { /* If the file was entirely or basically ASCII, use the default value of `buffer-file-coding-system'. */ @@ -1724,7 +3628,7 @@ XBUFFER (Vbuffer_defaults)->buffer_file_coding_system; if (!NILP (retval)) { - retval = Ffind_coding_system (retval); + retval = find_coding_system_for_text_file (retval, 0); if (NILP (retval)) { warn_when_safe @@ -1739,24 +3643,38 @@ } else { - int i; - int cat = -1; -#ifdef MULE - mask = postprocess_iso2022_mask (mask); -#endif - /* Look through the coding categories by priority and find - the first one that is allowed. */ - for (i = 0; i < CODING_CATEGORY_LAST; i++) + int likelihood; + Lisp_Object retval = Qnil; + + /* Look through the coding categories first by likelihood and then by + priority and find the first one that is allowed. */ + + for (likelihood = DET_HIGHEST; likelihood >= DET_LOWEST; likelihood--) { - cat = fcd->coding_category_by_priority[i]; - if ((mask & (1 << cat)) && - !NILP (fcd->coding_category_system[cat])) - break; + for (i = 0; i < coding_detector_category_count; i++) + { + int cat = coding_category_by_priority[i]; + if (st->categories[cat] == likelihood && + !NILP (coding_category_system[cat])) + { + retval = (get_coding_system_for_text_file + (coding_category_system[cat], 0)); + if (likelihood < DET_AS_LIKELY_AS_UNLIKELY) + warn_when_safe_lispobj + (intern ("detection"), + Qerror, + emacs_sprintf_string_lisp + ( +"Detected coding %s is unlikely to be correct (likelihood == `%s')", + Qnil, 2, XCODING_SYSTEM_NAME (retval), + detection_result_number_to_symbol + ((enum detection_result) likelihood))); + return retval; + } + } } - if (cat >= 0) - return fcd->coding_category_system[cat]; - else - return Fget_coding_system (Qraw_text); + + return Fget_coding_system (Qraw_text); } } @@ -1771,129 +3689,469 @@ #define LENGTH(string_constant) (sizeof (string_constant) - 1) -void -determine_real_coding_system (Lstream *stream, Lisp_Object *codesys_in_out, - eol_type_t *eol_type_in_out) +static Lisp_Object +unwind_free_detection_state (Lisp_Object opaque) +{ + struct detection_state *st = + (struct detection_state *) get_opaque_ptr (opaque); + free_detection_state (st); + free_opaque_ptr (opaque); + return Qnil; +} + +static Lisp_Object +look_for_coding_system_magic_cookie (const UExtbyte *data, Bytecount len) { - struct detection_state decst; - - if (*eol_type_in_out == EOL_AUTODETECT) - *eol_type_in_out = XCODING_SYSTEM_EOL_TYPE (*codesys_in_out); - - xzero (decst); - decst.eol_type = *eol_type_in_out; - decst.mask = ~0; - - /* If autodetection is called for, do it now. */ - if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT - || *eol_type_in_out == EOL_AUTODETECT) - { - Extbyte buf[4096]; - Lisp_Object coding_system = Qnil; - Extbyte *p; - Bytecount nread = Lstream_read (stream, buf, sizeof (buf)); - Extbyte *scan_end; - - /* Look for initial "-*-"; mode line prefix */ - for (p = buf, - scan_end = buf + nread - LENGTH ("-*-coding:?-*-"); - p <= scan_end + Lisp_Object coding_system = Qnil; + const UExtbyte *p; + const UExtbyte *scan_end; + + /* Look for initial "-*-"; mode line prefix */ + for (p = data, + scan_end = data + len - LENGTH ("-*-coding:?-*-"); + p <= scan_end + && *p != '\n' + && *p != '\r'; + p++) + if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') + { + const UExtbyte *local_vars_beg = p + 3; + /* Look for final "-*-"; mode line suffix */ + for (p = local_vars_beg, + scan_end = data + len - LENGTH ("-*-"); + p <= scan_end && *p != '\n' && *p != '\r'; - p++) - if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') - { - Extbyte *local_vars_beg = p + 3; - /* Look for final "-*-"; mode line suffix */ - for (p = local_vars_beg, - scan_end = buf + nread - LENGTH ("-*-"); - p <= scan_end - && *p != '\n' - && *p != '\r'; - p++) - if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') - { - Extbyte *suffix = p; - /* Look for "coding:" */ - for (p = local_vars_beg, - scan_end = suffix - LENGTH ("coding:?"); - p <= scan_end; - p++) - if (memcmp ("coding:", p, LENGTH ("coding:")) == 0 - && (p == local_vars_beg - || (*(p-1) == ' ' || - *(p-1) == '\t' || - *(p-1) == ';'))) + p++) + if (*p == '-' && *(p+1) == '*' && *(p+2) == '-') + { + const UExtbyte *suffix = p; + /* Look for "coding:" */ + for (p = local_vars_beg, + scan_end = suffix - LENGTH ("coding:?"); + p <= scan_end; + p++) + if (memcmp ("coding:", p, LENGTH ("coding:")) == 0 + && (p == local_vars_beg + || (*(p-1) == ' ' || + *(p-1) == '\t' || + *(p-1) == ';'))) + { + Bytecount n; + Intbyte *name; + + p += LENGTH ("coding:"); + while (*p == ' ' || *p == '\t') p++; + name = alloca_intbytes (suffix - p + 1); + memcpy (name, p, suffix - p); + name[suffix - p] = '\0'; + + /* Get coding system name */ + /* Characters valid in a MIME charset name (rfc 1521), + and in a Lisp symbol name. */ + n = qxestrspn (name, + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "abcdefghijklmnopqrstuvwxyz" + "0123456789" + "!$%&*+-.^_{|}~"); + if (n > 0) { - Extbyte save; - int n; - p += LENGTH ("coding:"); - while (*p == ' ' || *p == '\t') p++; - - /* Get coding system name */ - save = *suffix; *suffix = '\0'; - /* Characters valid in a MIME charset name (rfc 1521), - and in a Lisp symbol name. */ - n = strspn ( (char *) p, - "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - "abcdefghijklmnopqrstuvwxyz" - "0123456789" - "!$%&*+-.^_{|}~"); - *suffix = save; - if (n > 0) - { - save = p[n]; p[n] = '\0'; - coding_system = - Ffind_coding_system (intern ((char *) p)); - p[n] = save; - } - break; + name[n] = '\0'; + coding_system = + find_coding_system_for_text_file (intern_int (name), + 0); } - break; - } + break; + } + break; + } + break; + } + + return coding_system; +} + +static Lisp_Object +determine_real_coding_system (Lstream *stream) +{ + struct detection_state *st = allocate_detection_state (); + int depth = record_unwind_protect (unwind_free_detection_state, + make_opaque_ptr (st)); + UExtbyte buf[4096]; + Bytecount nread = Lstream_read (stream, buf, sizeof (buf)); + Lisp_Object coding_system = look_for_coding_system_magic_cookie (buf, nread); + + if (NILP (coding_system)) + { + while (1) + { + if (detect_coding_type (st, buf, nread)) break; - } - - if (NILP (coding_system)) - do - { - if (detect_coding_type (&decst, buf, nread, - XCODING_SYSTEM_TYPE (*codesys_in_out) - != CODESYS_AUTODETECT)) - break; - nread = Lstream_read (stream, buf, sizeof (buf)); - if (nread == 0) - break; - } - while (1); - - else if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT - && XCODING_SYSTEM_EOL_TYPE (coding_system) == EOL_AUTODETECT) - do - { - if (detect_coding_type (&decst, buf, nread, 1)) - break; - nread = Lstream_read (stream, buf, sizeof (buf)); - if (!nread) - break; - } - while (1); - - *eol_type_in_out = decst.eol_type; - if (XCODING_SYSTEM_TYPE (*codesys_in_out) == CODESYS_AUTODETECT) - { - if (NILP (coding_system)) - *codesys_in_out = coding_system_from_mask (decst.mask); - else - *codesys_in_out = coding_system; + nread = Lstream_read (stream, buf, sizeof (buf)); + if (nread == 0) + break; } + + coding_system = detected_coding_system (st); } - /* If we absolutely can't determine the EOL type, just assume LF. */ - if (*eol_type_in_out == EOL_AUTODETECT) - *eol_type_in_out = EOL_LF; - Lstream_rewind (stream); + + unbind_to (depth); + return coding_system; +} + +static void +undecided_init_coding_stream (struct coding_stream *str) +{ + struct undecided_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, undecided); + struct undecided_coding_system *csdata = + XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided); + + data->actual = Qnil; + + if (str->direction == CODING_DECODE) + { + Lstream *lst = str->other_end; + + if ((lst->flags & LSTREAM_FL_READ) && + Lstream_seekable_p (lst) && + csdata->do_coding) + /* We can determine the coding system now. */ + data->actual = determine_real_coding_system (lst); + } +} + +static void +undecided_rewind_coding_stream (struct coding_stream *str) +{ + chain_rewind_coding_stream_1 (&CODING_STREAM_TYPE_DATA (str, undecided)->c); +} + +static void +undecided_finalize_coding_stream (struct coding_stream *str) +{ + struct undecided_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, undecided); + + chain_finalize_coding_stream_1 + (&CODING_STREAM_TYPE_DATA (str, undecided)->c); + if (data->st) + free_detection_state (data->st); +} + +static Lisp_Object +undecided_canonicalize (Lisp_Object codesys) +{ + struct undecided_coding_system *csdata = + XCODING_SYSTEM_TYPE_DATA (codesys, undecided); + if (!csdata->do_eol && !csdata->do_coding) + return NILP (csdata->cs) ? Fget_coding_system (Qbinary) : csdata->cs; + if (csdata->do_eol && !csdata->do_coding && NILP (csdata->cs)) + return Fget_coding_system (Qconvert_eol_autodetect); + return codesys; +} + +static Bytecount +undecided_convert (struct coding_stream *str, const UExtbyte *src, + unsigned_char_dynarr *dst, Bytecount n) +{ + int first_time = 0; + + if (str->direction == CODING_DECODE) + { + /* At this point, we have only the following possibilities: + + do_eol && do_coding + do_coding only + do_eol only and a coding system was specified + + Other possibilities are removed during undecided_canonicalize. + + Therefore, our substreams are either + + lstream_coding -> lstream_dynarr, or + lstream_coding -> lstream_eol -> lstream_dynarr. + */ + struct undecided_coding_system *csdata = + XCODING_SYSTEM_TYPE_DATA (str->codesys, undecided); + struct undecided_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, undecided); + + if (str->eof) + { + /* Each will close the next. We need to close now because more + data may be generated. */ + if (data->c.initted) + Lstream_close (XLSTREAM (data->c.lstreams[0])); + return n; + } + + if (!data->c.initted) + { + data->c.lstream_count = csdata->do_eol ? 3 : 2; + data->c.lstreams = xnew_array (Lisp_Object, data->c.lstream_count); + + data->c.lstreams[data->c.lstream_count - 1] = + make_dynarr_output_stream (dst); + Lstream_set_buffering + (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]), + LSTREAM_UNBUFFERED, 0); + if (csdata->do_eol) + { + data->c.lstreams[1] = + make_coding_output_stream + (XLSTREAM (data->c.lstreams[data->c.lstream_count - 1]), + Fget_coding_system (Qconvert_eol_autodetect), + CODING_DECODE); + Lstream_set_buffering + (XLSTREAM (data->c.lstreams[1]), + LSTREAM_UNBUFFERED, 0); + } + + data->c.lstreams[0] = + make_coding_output_stream + (XLSTREAM (data->c.lstreams[1]), + /* Substitute binary if we need to detect the encoding */ + csdata->do_coding ? Qbinary : csdata->cs, + CODING_DECODE); + Lstream_set_buffering (XLSTREAM (data->c.lstreams[0]), + LSTREAM_UNBUFFERED, 0); + + first_time = 1; + data->c.initted = 1; + } + + /* If necessary, do encoding-detection now. We do this when we're a + writing stream or a non-seekable reading stream, meaning that we + can't just process the whole input, rewind, and start over. */ + + if (csdata->do_coding) + { + int actual_was_nil = NILP (data->actual); + if (NILP (data->actual)) + { + if (!data->st) + data->st = allocate_detection_state (); + if (first_time) + /* #### This is cheesy. What we really ought to do is buffer + up a certain minimum amount of data to get a better result. + */ + data->actual = look_for_coding_system_magic_cookie (src, n); + if (NILP (data->actual)) + { + /* #### This is cheesy. What we really ought to do is buffer + up a certain minimum amount of data so as to get a less + random result when doing subprocess detection. */ + detect_coding_type (data->st, src, n); + data->actual = detected_coding_system (data->st); + } + } + /* We need to set the detected coding system if we actually have + such a coding system but didn't before. That is the case + either when we just detected it in the previous code or when + it was detected during undecided_init_coding_stream(). We + can check for that using first_time. */ + if (!NILP (data->actual) && (actual_was_nil || first_time)) + { + /* If the detected coding system doesn't allow for EOL + autodetection, try to get the equivalent that does; + otherwise, disable EOL detection (overriding whatever + may already have been detected). */ + if (XCODING_SYSTEM_EOL_TYPE (data->actual) != EOL_AUTODETECT) + { + if (!NILP (XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual))) + data->actual = + XCODING_SYSTEM_SUBSIDIARY_PARENT (data->actual); + else if (data->c.lstream_count == 3) + set_coding_stream_coding_system + (XLSTREAM (data->c.lstreams[1]), + Fget_coding_system (Qidentity)); + } + set_coding_stream_coding_system + (XLSTREAM (data->c.lstreams[0]), data->actual); + } + } + + if (Lstream_write (XLSTREAM (data->c.lstreams[0]), src, n) < 0) + return -1; + return n; + } + else + return no_conversion_convert (str, src, dst, n); +} + +static Lisp_Object +undecided_canonicalize_after_coding (struct coding_stream *str) +{ + struct undecided_coding_stream *data = + CODING_STREAM_TYPE_DATA (str, undecided); + Lisp_Object ret, eolret; + + if (str->direction == CODING_ENCODE) + return str->codesys; + + if (!data->c.initted) + return Fget_coding_system (Qundecided); + + ret = coding_stream_canonicalize_after_coding + (XLSTREAM (data->c.lstreams[0])); + if (NILP (ret)) + ret = Fget_coding_system (Qundecided); + if (XCODING_SYSTEM_EOL_TYPE (ret) != EOL_AUTODETECT) + return ret; + eolret = coding_stream_canonicalize_after_coding + (XLSTREAM (data->c.lstreams[1])); + if (!EQ (XCODING_SYSTEM_TYPE (eolret), Qconvert_eol)) + return ret; + return + Fsubsidiary_coding_system (ret, Fcoding_system_property (eolret, + Qsubtype)); +} + + +/************************************************************************/ +/* Lisp interface: Coding category functions and detection */ +/************************************************************************/ + +DEFUN ("coding-category-list", Fcoding_category_list, 0, 0, 0, /* +Return a list of all recognized coding categories. +*/ + ()) +{ + int i; + Lisp_Object list = Qnil; + + for (i = 0; i < coding_detector_count; i++) + { + detector_category_dynarr *cats = + Dynarr_at (all_coding_detectors, i).cats; + int j; + + for (j = 0; j < Dynarr_length (cats); j++) + list = Fcons (Dynarr_at (cats, j).sym, list); + } + + return Fnreverse (list); +} + +DEFUN ("set-coding-priority-list", Fset_coding_priority_list, 1, 1, 0, /* +Change the priority order of the coding categories. +LIST should be list of coding categories, in descending order of +priority. Unspecified coding categories will be lower in priority +than all specified ones, in the same relative order they were in +previously. +*/ + (list)) +{ + int *category_to_priority = + alloca_array (int, coding_detector_category_count); + int i, j; + Lisp_Object rest; + + /* First generate a list that maps coding categories to priorities. */ + + for (i = 0; i < coding_detector_category_count; i++) + category_to_priority[i] = -1; + + /* Highest priority comes from the specified list. */ + i = 0; + EXTERNAL_LIST_LOOP (rest, list) + { + int cat = coding_category_symbol_to_id (XCAR (rest)); + + if (category_to_priority[cat] >= 0) + sferror ("Duplicate coding category in list", XCAR (rest)); + category_to_priority[cat] = i++; + } + + /* Now go through the existing categories by priority to retrieve + the categories not yet specified and preserve their priority + order. */ + for (j = 0; j < coding_detector_category_count; j++) + { + int cat = coding_category_by_priority[j]; + if (category_to_priority[cat] < 0) + category_to_priority[cat] = i++; + } + + /* Now we need to construct the inverse of the mapping we just + constructed. */ + + for (i = 0; i < coding_detector_category_count; i++) + coding_category_by_priority[category_to_priority[i]] = i; + + /* Phew! That was confusing. */ + return Qnil; +} + +DEFUN ("coding-priority-list", Fcoding_priority_list, 0, 0, 0, /* +Return a list of coding categories in descending order of priority. +*/ + ()) +{ + int i; + Lisp_Object list = Qnil; + + for (i = 0; i < coding_detector_category_count; i++) + list = + Fcons (coding_category_id_to_symbol (coding_category_by_priority[i]), + list); + return Fnreverse (list); +} + +DEFUN ("set-coding-category-system", Fset_coding_category_system, 2, 2, 0, /* +Change the coding system associated with a coding category. +*/ + (coding_category, coding_system)) +{ + coding_category_system[coding_category_symbol_to_id (coding_category)] = + Fget_coding_system (coding_system); + return Qnil; +} + +DEFUN ("coding-category-system", Fcoding_category_system, 1, 1, 0, /* +Return the coding system associated with a coding category. +*/ + (coding_category)) +{ + Lisp_Object sys = + coding_category_system[coding_category_symbol_to_id (coding_category)]; + + if (!NILP (sys)) + return XCODING_SYSTEM_NAME (sys); + return Qnil; +} + +Lisp_Object +detect_coding_stream (Lisp_Object stream) +{ + Lisp_Object val = Qnil; + struct gcpro gcpro1, gcpro2, gcpro3; + UExtbyte random_buffer[65536]; + Lisp_Object binary_instream = + make_coding_input_stream + (XLSTREAM (stream), Qbinary, + CODING_ENCODE); + Lisp_Object decstream = + make_coding_input_stream + (XLSTREAM (binary_instream), + Qundecided, CODING_DECODE); + Lstream *decstr = XLSTREAM (decstream); + + GCPRO3 (decstream, stream, binary_instream); + /* Read and discard all data; detection happens as a side effect of this, + and we examine what was detected afterwards. */ + while (Lstream_read (decstr, random_buffer, sizeof (random_buffer)) > 0) + ; + + val = coding_stream_detected_coding_system (decstr); + Lstream_close (decstr); + Lstream_delete (decstr); + Lstream_delete (XLSTREAM (binary_instream)); + UNGCPRO; + return val; } DEFUN ("detect-coding-region", Fdetect_coding_region, 2, 3, 0, /* @@ -1908,3676 +4166,284 @@ Lisp_Object val = Qnil; struct buffer *buf = decode_buffer (buffer, 0); Charbpos b, e; - Lisp_Object instream, lb_instream; - Lstream *istr, *lb_istr; - struct detection_state decst; - struct gcpro gcpro1, gcpro2; + Lisp_Object lb_instream; get_buffer_range_char (buf, start, end, &b, &e, 0); lb_instream = make_lisp_buffer_input_stream (buf, b, e, 0); - lb_istr = XLSTREAM (lb_instream); - instream = make_encoding_input_stream (lb_istr, Fget_coding_system (Qbinary)); - istr = XLSTREAM (instream); - GCPRO2 (instream, lb_instream); - xzero (decst); - decst.eol_type = EOL_AUTODETECT; - decst.mask = ~0; - while (1) - { - Extbyte random_buffer[4096]; - Bytecount nread = Lstream_read (istr, random_buffer, sizeof (random_buffer)); - - if (!nread) - break; - if (detect_coding_type (&decst, random_buffer, nread, 0)) - break; - } - - if (decst.mask == ~0) - val = subsidiary_coding_system (Fget_coding_system (Qundecided), - decst.eol_type); - else - { - int i; - - val = Qnil; -#ifdef MULE - decst.mask = postprocess_iso2022_mask (decst.mask); -#endif - for (i = CODING_CATEGORY_LAST - 1; i >= 0; i--) - { - int sys = fcd->coding_category_by_priority[i]; - if (decst.mask & (1 << sys)) - { - Lisp_Object codesys = fcd->coding_category_system[sys]; - if (!NILP (codesys)) - codesys = subsidiary_coding_system (codesys, decst.eol_type); - val = Fcons (codesys, val); - } - } - } - Lstream_close (istr); - UNGCPRO; - Lstream_delete (istr); - Lstream_delete (lb_istr); + + val = detect_coding_stream (lb_instream); + Lstream_delete (XLSTREAM (lb_instream)); return val; } -/************************************************************************/ -/* Converting to internal Mule format ("decoding") */ + +#ifdef DEBUG_XEMACS + /************************************************************************/ - -/* A decoding stream is a stream used for decoding text (i.e. - converting from some external format to internal format). - The decoding-stream object keeps track of the actual coding - stream, the stream that is at the other end, and data that - needs to be persistent across the lifetime of the stream. */ - -/* Handle the EOL stuff related to just-read-in character C. - EOL_TYPE is the EOL type of the coding stream. - FLAGS is the current value of FLAGS in the coding stream, and may - be modified by this macro. (The macro only looks at the - CODING_STATE_CR flag.) DST is the Dynarr to which the decoded - bytes are to be written. You need to also define a local goto - label "label_continue_loop" that is at the end of the main - character-reading loop. - - If C is a CR character, then this macro handles it entirely and - jumps to label_continue_loop. Otherwise, this macro does not add - anything to DST, and continues normally. You should continue - processing C normally after this macro. */ - -#define DECODE_HANDLE_EOL_TYPE(eol_type, c, flags, dst) \ -do { \ - if (c == '\r') \ - { \ - if (eol_type == EOL_CR) \ - Dynarr_add (dst, '\n'); \ - else if (eol_type != EOL_CRLF || flags & CODING_STATE_CR) \ - Dynarr_add (dst, c); \ - else \ - flags |= CODING_STATE_CR; \ - goto label_continue_loop; \ - } \ - else if (flags & CODING_STATE_CR) \ - { /* eol_type == CODING_SYSTEM_EOL_CRLF */ \ - if (c != '\n') \ - Dynarr_add (dst, '\r'); \ - flags &= ~CODING_STATE_CR; \ - } \ -} while (0) - -/* C should be a binary character in the range 0 - 255; convert - to internal format and add to Dynarr DST. */ - -#define DECODE_ADD_BINARY_CHAR(c, dst) \ -do { \ - if (BYTE_ASCII_P (c)) \ - Dynarr_add (dst, c); \ - else if (BYTE_C1_P (c)) \ - { \ - Dynarr_add (dst, LEADING_BYTE_CONTROL_1); \ - Dynarr_add (dst, c + 0x20); \ - } \ - else \ - { \ - Dynarr_add (dst, LEADING_BYTE_LATIN_ISO8859_1); \ - Dynarr_add (dst, c); \ - } \ -} while (0) - -#define DECODE_OUTPUT_PARTIAL_CHAR(ch) \ -do { \ - if (ch) \ - { \ - DECODE_ADD_BINARY_CHAR (ch, dst); \ - ch = 0; \ - } \ -} while (0) - -#define DECODE_HANDLE_END_OF_CONVERSION(flags, ch, dst) \ -do { \ - if (flags & CODING_STATE_END) \ - { \ - DECODE_OUTPUT_PARTIAL_CHAR (ch); \ - if (flags & CODING_STATE_CR) \ - Dynarr_add (dst, '\r'); \ - } \ -} while (0) - -#define DECODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, decoding) - -struct decoding_stream -{ - /* Coding system that governs the conversion. */ - Lisp_Coding_System *codesys; - - /* Stream that we read the encoded data from or - write the decoded data to. */ - Lstream *other_end; - - /* If we are reading, then we can return only a fixed amount of - data, so if the conversion resulted in too much data, we store it - here for retrieval the next time around. */ - unsigned_char_dynarr *runoff; - - /* FLAGS holds flags indicating the current state of the decoding. - Some of these flags are dependent on the coding system. */ - unsigned int flags; - - /* CH holds a partially built-up character. Since we only deal - with one- and two-byte characters at the moment, we only use - this to store the first byte of a two-byte character. */ - unsigned int ch; - - /* EOL_TYPE specifies the type of end-of-line conversion that - currently applies. We need to keep this separate from the - EOL type stored in CODESYS because the latter might indicate - automatic EOL-type detection while the former will always - indicate a particular EOL type. */ - eol_type_t eol_type; -#ifdef MULE - /* Additional ISO2022 information. We define the structure above - because it's also needed by the detection routines. */ - struct iso2022_decoder iso2022; - - /* Additional information (the state of the running CCL program) - used by the CCL decoder. */ - struct ccl_program ccl; - - /* counter for UTF-8 or UCS-4 */ - unsigned char counter; -#endif - struct detection_state decst; -}; - -static Bytecount decoding_reader (Lstream *stream, - unsigned char *data, Bytecount size); -static Bytecount decoding_writer (Lstream *stream, - const unsigned char *data, Bytecount size); -static int decoding_rewinder (Lstream *stream); -static int decoding_seekable_p (Lstream *stream); -static int decoding_flusher (Lstream *stream); -static int decoding_closer (Lstream *stream); - -static Lisp_Object decoding_marker (Lisp_Object stream); - -DEFINE_LSTREAM_IMPLEMENTATION ("decoding", lstream_decoding, - sizeof (struct decoding_stream)); - -static Lisp_Object -decoding_marker (Lisp_Object stream) -{ - Lstream *str = DECODING_STREAM_DATA (XLSTREAM (stream))->other_end; - Lisp_Object str_obj; - - /* We do not need to mark the coding systems or charsets stored - within the stream because they are stored in a global list - and automatically marked. */ - - XSETLSTREAM (str_obj, str); - mark_object (str_obj); - if (str->imp->marker) - return (str->imp->marker) (str_obj); - else - return Qnil; -} - -/* Read SIZE bytes of data and store it into DATA. We are a decoding stream - so we read data from the other end, decode it, and store it into DATA. */ +/* Internal methods */ +/************************************************************************/ + +/* Raw (internally-formatted) data. */ +DEFINE_CODING_SYSTEM_TYPE (internal); static Bytecount -decoding_reader (Lstream *stream, unsigned char *data, Bytecount size) +internal_convert (struct coding_stream *str, const UExtbyte *src, + unsigned_char_dynarr *dst, Bytecount n) +{ + Bytecount orign = n; + Dynarr_add_many (dst, src, n); + return orign; +} + +#endif /* DEBUG_XEMACS */ + + + +#ifdef HAVE_ZLIB + +/************************************************************************/ +/* Gzip methods */ +/************************************************************************/ + +DEFINE_CODING_SYSTEM_TYPE (gzip); + +struct gzip_coding_system { - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - unsigned char *orig_data = data; - Bytecount read_size; - int error_occurred = 0; - - /* We need to interface to mule_decode(), which expects to take some - amount of data and store the result into a Dynarr. We have - mule_decode() store into str->runoff, and take data from there - as necessary. */ - - /* We loop until we have enough data, reading chunks from the other - end and decoding it. */ - while (1) - { - /* Take data from the runoff if we can. Make sure to take at - most SIZE bytes, and delete the data from the runoff. */ - if (Dynarr_length (str->runoff) > 0) - { - Bytecount chunk = min (size, (Bytecount) Dynarr_length (str->runoff)); - memcpy (data, Dynarr_atp (str->runoff, 0), chunk); - Dynarr_delete_many (str->runoff, 0, chunk); - data += chunk; - size -= chunk; - } - - if (size == 0) - break; /* No more room for data */ - - if (str->flags & CODING_STATE_END) - /* This means that on the previous iteration, we hit the EOF on - the other end. We loop once more so that mule_decode() can - output any final stuff it may be holding, or any "go back - to a sane state" escape sequences. (This latter makes sense - during encoding.) */ - break; - - /* Exhausted the runoff, so get some more. DATA has at least - SIZE bytes left of storage in it, so it's OK to read directly - into it. (We'll be overwriting above, after we've decoded it - into the runoff.) */ - read_size = Lstream_read (str->other_end, data, size); - if (read_size < 0) - { - error_occurred = 1; - break; - } - if (read_size == 0) - /* There might be some more end data produced in the translation. - See the comment above. */ - str->flags |= CODING_STATE_END; - mule_decode (stream, (Extbyte *) data, str->runoff, read_size); - } - - if (data - orig_data == 0) - return error_occurred ? -1 : 0; - else - return data - orig_data; -} - -static Bytecount -decoding_writer (Lstream *stream, const unsigned char *data, Bytecount size) + int level; /* 0 through 9, or -1 for default */ +}; + +#define CODING_SYSTEM_GZIP_LEVEL(codesys) \ + (CODING_SYSTEM_TYPE_DATA (codesys, gzip)->level) +#define XCODING_SYSTEM_GZIP_LEVEL(codesys) \ + (XCODING_SYSTEM_TYPE_DATA (codesys, gzip)->level) + +struct gzip_coding_stream { - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - Bytecount retval; - - /* Decode all our data into the runoff, and then attempt to write - it all out to the other end. Remove whatever chunk we succeeded - in writing. */ - mule_decode (stream, (Extbyte *) data, str->runoff, size); - retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0), - Dynarr_length (str->runoff)); - if (retval > 0) - Dynarr_delete_many (str->runoff, 0, retval); - /* Do NOT return retval. The return value indicates how much - of the incoming data was written, not how many bytes were - written. */ - return size; + z_stream stream; + int stream_initted; + int reached_eof; /* #### this should be handled by the caller, once we + return LSTREAM_EOF */ +}; + +static const struct lrecord_description + gzip_coding_system_description[] = { + { XD_END } +}; + +enum source_sink_type +gzip_conversion_end_type (Lisp_Object codesys) +{ + return DECODES_BYTE_TO_BYTE; } static void -reset_decoding_stream (struct decoding_stream *str) +gzip_init (Lisp_Object codesys) +{ + struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip); + data->level = -1; +} + +static void +gzip_print (Lisp_Object cs, Lisp_Object printcharfun, int escapeflag) { -#ifdef MULE - if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_ISO2022) - { - Lisp_Object coding_system; - XSETCODING_SYSTEM (coding_system, str->codesys); - reset_iso2022 (coding_system, &str->iso2022); - } - else if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_CCL) - { - setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_DECODE (str->codesys)); - } - str->counter = 0; -#endif /* MULE */ - if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT - || CODING_SYSTEM_EOL_TYPE (str->codesys) == EOL_AUTODETECT) - { - xzero (str->decst); - str->decst.eol_type = EOL_AUTODETECT; - str->decst.mask = ~0; - } - str->flags = str->ch = 0; -} - -static int -decoding_rewinder (Lstream *stream) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - reset_decoding_stream (str); - Dynarr_reset (str->runoff); - return Lstream_rewind (str->other_end); -} - -static int -decoding_seekable_p (Lstream *stream) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - return Lstream_seekable_p (str->other_end); -} - -static int -decoding_flusher (Lstream *stream) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - return Lstream_flush (str->other_end); + struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (cs, gzip); + + write_c_string ("(", printcharfun); + if (data->level == -1) + write_c_string ("default", printcharfun); + else + print_internal (make_int (data->level), printcharfun, 0); + write_c_string (")", printcharfun); } static int -decoding_closer (Lstream *stream) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - if (stream->flags & LSTREAM_FL_WRITE) - { - str->flags |= CODING_STATE_END; - decoding_writer (stream, 0, 0); - } - Dynarr_free (str->runoff); -#ifdef MULE -#ifdef ENABLE_COMPOSITE_CHARS - if (str->iso2022.composite_chars) - Dynarr_free (str->iso2022.composite_chars); -#endif -#endif - return Lstream_close (str->other_end); -} - -Lisp_Object -decoding_stream_coding_system (Lstream *stream) -{ - Lisp_Object coding_system; - struct decoding_stream *str = DECODING_STREAM_DATA (stream); - - XSETCODING_SYSTEM (coding_system, str->codesys); - return subsidiary_coding_system (coding_system, str->eol_type); -} - -void -set_decoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) -{ - Lisp_Coding_System *cs = XCODING_SYSTEM (codesys); - struct decoding_stream *str = DECODING_STREAM_DATA (lstr); - str->codesys = cs; - if (CODING_SYSTEM_EOL_TYPE (cs) != EOL_AUTODETECT) - str->eol_type = CODING_SYSTEM_EOL_TYPE (cs); - reset_decoding_stream (str); -} - -/* WARNING WARNING WARNING WARNING!!!!! If you open up a decoding - stream for writing, no automatic code detection will be performed. - The reason for this is that automatic code detection requires a - seekable input. Things will also fail if you open a decoding - stream for reading using a non-fully-specified coding system and - a non-seekable input stream. */ - -static Lisp_Object -make_decoding_stream_1 (Lstream *stream, Lisp_Object codesys, - const char *mode) +gzip_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) { - Lstream *lstr = Lstream_new (lstream_decoding, mode); - struct decoding_stream *str = DECODING_STREAM_DATA (lstr); - Lisp_Object obj; - - xzero (*str); - str->other_end = stream; - str->runoff = (unsigned_char_dynarr *) Dynarr_new (unsigned_char); - str->eol_type = EOL_AUTODETECT; - if (!strcmp (mode, "r") - && Lstream_seekable_p (stream)) - /* We can determine the coding system now. */ - determine_real_coding_system (stream, &codesys, &str->eol_type); - set_decoding_stream_coding_system (lstr, codesys); - str->decst.eol_type = str->eol_type; - str->decst.mask = ~0; - XSETLSTREAM (obj, lstr); - return obj; -} - -Lisp_Object -make_decoding_input_stream (Lstream *stream, Lisp_Object codesys) -{ - return make_decoding_stream_1 (stream, codesys, "r"); -} - -Lisp_Object -make_decoding_output_stream (Lstream *stream, Lisp_Object codesys) -{ - return make_decoding_stream_1 (stream, codesys, "w"); -} - -/* Note: the decode_coding_* functions all take the same - arguments as mule_decode(), which is to say some SRC data of - size N, which is to be stored into dynamic array DST. - DECODING is the stream within which the decoding is - taking place, but no data is actually read from or - written to that stream; that is handled in decoding_reader() - or decoding_writer(). This allows the same functions to - be used for both reading and writing. */ - -static void -mule_decode (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - - /* If necessary, do encoding-detection now. We do this when - we're a writing stream or a non-seekable reading stream, - meaning that we can't just process the whole input, - rewind, and start over. */ - - if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT || - str->eol_type == EOL_AUTODETECT) + struct gzip_coding_system *data = XCODING_SYSTEM_TYPE_DATA (codesys, gzip); + + if (EQ (key, Qlevel)) { - Lisp_Object codesys; - - XSETCODING_SYSTEM (codesys, str->codesys); - detect_coding_type (&str->decst, src, n, - CODING_SYSTEM_TYPE (str->codesys) != - CODESYS_AUTODETECT); - if (CODING_SYSTEM_TYPE (str->codesys) == CODESYS_AUTODETECT && - str->decst.mask != ~0) - /* #### This is cheesy. What we really ought to do is - buffer up a certain amount of data so as to get a - less random result. */ - codesys = coding_system_from_mask (str->decst.mask); - str->eol_type = str->decst.eol_type; - if (XCODING_SYSTEM (codesys) != str->codesys) + if (EQ (value, Qdefault)) + data->level = -1; + else { - /* Preserve the CODING_STATE_END flag in case it was set. - If we erase it, bad things might happen. */ - int was_end = str->flags & CODING_STATE_END; - set_decoding_stream_coding_system (decoding, codesys); - if (was_end) - str->flags |= CODING_STATE_END; + CHECK_INT (value); + check_int_range (XINT (value), 0, 9); + data->level = XINT (value); } } - - switch (CODING_SYSTEM_TYPE (str->codesys)) - { -#ifdef DEBUG_XEMACS - case CODESYS_INTERNAL: - Dynarr_add_many (dst, src, n); - break; -#endif - case CODESYS_AUTODETECT: - /* If we got this far and still haven't decided on the coding - system, then do no conversion. */ - case CODESYS_NO_CONVERSION: - decode_coding_no_conversion (decoding, src, dst, n); - break; -#ifdef MULE - case CODESYS_SHIFT_JIS: - decode_coding_sjis (decoding, src, dst, n); - break; - case CODESYS_BIG5: - decode_coding_big5 (decoding, src, dst, n); - break; - case CODESYS_UCS4: - decode_coding_ucs4 (decoding, src, dst, n); - break; - case CODESYS_UTF8: - decode_coding_utf8 (decoding, src, dst, n); - break; - case CODESYS_CCL: - str->ccl.last_block = str->flags & CODING_STATE_END; - /* When applying ccl program to stream, MUST NOT set NULL - pointer to src. */ - ccl_driver (&str->ccl, (src ? (unsigned char *)src : (unsigned char*)""), - dst, n, 0, CCL_MODE_DECODING); - break; - case CODESYS_ISO2022: - decode_coding_iso2022 (decoding, src, dst, n); - break; -#endif /* MULE */ - default: - abort (); - } -} - -DEFUN ("decode-coding-region", Fdecode_coding_region, 3, 4, 0, /* -Decode the text between START and END which is encoded in CODING-SYSTEM. -This is useful if you've read in encoded text from a file without decoding -it (e.g. you read in a JIS-formatted file but used the `binary' or -`no-conversion' coding system, so that it shows up as "^[$B!<!+^[(B"). -Return length of decoded text. -BUFFER defaults to the current buffer if unspecified. -*/ - (start, end, coding_system, buffer)) -{ - Charbpos b, e; - struct buffer *buf = decode_buffer (buffer, 0); - Lisp_Object instream, lb_outstream, de_outstream, outstream; - Lstream *istr, *ostr; - struct gcpro gcpro1, gcpro2, gcpro3, gcpro4; - - get_buffer_range_char (buf, start, end, &b, &e, 0); - - barf_if_buffer_read_only (buf, b, e); - - coding_system = Fget_coding_system (coding_system); - instream = make_lisp_buffer_input_stream (buf, b, e, 0); - lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); - de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream), - coding_system); - outstream = make_encoding_output_stream (XLSTREAM (de_outstream), - Fget_coding_system (Qbinary)); - istr = XLSTREAM (instream); - ostr = XLSTREAM (outstream); - GCPRO4 (instream, lb_outstream, de_outstream, outstream); - - /* The chain of streams looks like this: - - [BUFFER] <----- send through - ------> [ENCODE AS BINARY] - ------> [DECODE AS SPECIFIED] - ------> [BUFFER] - */ - - while (1) - { - char tempbuf[1024]; /* some random amount */ - Charbpos newpos, even_newer_pos; - Charbpos oldpos = lisp_buffer_stream_startpos (istr); - Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf)); - - if (!size_in_bytes) - break; - newpos = lisp_buffer_stream_startpos (istr); - Lstream_write (ostr, tempbuf, size_in_bytes); - even_newer_pos = lisp_buffer_stream_startpos (istr); - buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), - even_newer_pos, 0); - } - Lstream_close (istr); - Lstream_close (ostr); - UNGCPRO; - Lstream_delete (istr); - Lstream_delete (ostr); - Lstream_delete (XLSTREAM (de_outstream)); - Lstream_delete (XLSTREAM (lb_outstream)); - return Qnil; -} - - -/************************************************************************/ -/* Converting to an external encoding ("encoding") */ -/************************************************************************/ - -/* An encoding stream is an output stream. When you create the - stream, you specify the coding system that governs the encoding - and another stream that the resulting encoded data is to be - sent to, and then start sending data to it. */ - -#define ENCODING_STREAM_DATA(stream) LSTREAM_TYPE_DATA (stream, encoding) - -struct encoding_stream -{ - /* Coding system that governs the conversion. */ - Lisp_Coding_System *codesys; - - /* Stream that we read the encoded data from or - write the decoded data to. */ - Lstream *other_end; - - /* If we are reading, then we can return only a fixed amount of - data, so if the conversion resulted in too much data, we store it - here for retrieval the next time around. */ - unsigned_char_dynarr *runoff; - - /* FLAGS holds flags indicating the current state of the encoding. - Some of these flags are dependent on the coding system. */ - unsigned int flags; - - /* CH holds a partially built-up character. Since we only deal - with one- and two-byte characters at the moment, we only use - this to store the first byte of a two-byte character. */ - unsigned int ch; -#ifdef MULE - /* Additional information used by the ISO2022 encoder. */ - struct - { - /* CHARSET holds the character sets currently assigned to the G0 - through G3 registers. It is initialized from the array - INITIAL_CHARSET in CODESYS. */ - Lisp_Object charset[4]; - - /* Which registers are currently invoked into the left (GL) and - right (GR) halves of the 8-bit encoding space? */ - int register_left, register_right; - - /* Whether we need to explicitly designate the charset in the - G? register before using it. It is initialized from the - array FORCE_CHARSET_ON_OUTPUT in CODESYS. */ - unsigned char force_charset_on_output[4]; - - /* Other state variables that need to be preserved across - invocations. */ - Lisp_Object current_charset; - int current_half; - int current_char_boundary; - } iso2022; - - /* Additional information (the state of the running CCL program) - used by the CCL encoder. */ - struct ccl_program ccl; -#endif /* MULE */ -}; - -static Bytecount encoding_reader (Lstream *stream, unsigned char *data, Bytecount size); -static Bytecount encoding_writer (Lstream *stream, const unsigned char *data, - Bytecount size); -static int encoding_rewinder (Lstream *stream); -static int encoding_seekable_p (Lstream *stream); -static int encoding_flusher (Lstream *stream); -static int encoding_closer (Lstream *stream); - -static Lisp_Object encoding_marker (Lisp_Object stream); - -DEFINE_LSTREAM_IMPLEMENTATION ("encoding", lstream_encoding, - sizeof (struct encoding_stream)); - -static Lisp_Object -encoding_marker (Lisp_Object stream) -{ - Lstream *str = ENCODING_STREAM_DATA (XLSTREAM (stream))->other_end; - Lisp_Object str_obj; - - /* We do not need to mark the coding systems or charsets stored - within the stream because they are stored in a global list - and automatically marked. */ - - XSETLSTREAM (str_obj, str); - mark_object (str_obj); - if (str->imp->marker) - return (str->imp->marker) (str_obj); else - return Qnil; -} - -/* Read SIZE bytes of data and store it into DATA. We are a encoding stream - so we read data from the other end, encode it, and store it into DATA. */ - -static Bytecount -encoding_reader (Lstream *stream, unsigned char *data, Bytecount size) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - unsigned char *orig_data = data; - Bytecount read_size; - int error_occurred = 0; - - /* We need to interface to mule_encode(), which expects to take some - amount of data and store the result into a Dynarr. We have - mule_encode() store into str->runoff, and take data from there - as necessary. */ - - /* We loop until we have enough data, reading chunks from the other - end and encoding it. */ - while (1) - { - /* Take data from the runoff if we can. Make sure to take at - most SIZE bytes, and delete the data from the runoff. */ - if (Dynarr_length (str->runoff) > 0) - { - int chunk = min ((int) size, Dynarr_length (str->runoff)); - memcpy (data, Dynarr_atp (str->runoff, 0), chunk); - Dynarr_delete_many (str->runoff, 0, chunk); - data += chunk; - size -= chunk; - } - - if (size == 0) - break; /* No more room for data */ - - if (str->flags & CODING_STATE_END) - /* This means that on the previous iteration, we hit the EOF on - the other end. We loop once more so that mule_encode() can - output any final stuff it may be holding, or any "go back - to a sane state" escape sequences. (This latter makes sense - during encoding.) */ - break; - - /* Exhausted the runoff, so get some more. DATA at least SIZE bytes - left of storage in it, so it's OK to read directly into it. - (We'll be overwriting above, after we've encoded it into the - runoff.) */ - read_size = Lstream_read (str->other_end, data, size); - if (read_size < 0) - { - error_occurred = 1; - break; - } - if (read_size == 0) - /* There might be some more end data produced in the translation. - See the comment above. */ - str->flags |= CODING_STATE_END; - mule_encode (stream, data, str->runoff, read_size); - } - - if (data == orig_data) - return error_occurred ? -1 : 0; - else - return data - orig_data; -} - -static Bytecount -encoding_writer (Lstream *stream, const unsigned char *data, Bytecount size) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - Bytecount retval; - - /* Encode all our data into the runoff, and then attempt to write - it all out to the other end. Remove whatever chunk we succeeded - in writing. */ - mule_encode (stream, data, str->runoff, size); - retval = Lstream_write (str->other_end, Dynarr_atp (str->runoff, 0), - Dynarr_length (str->runoff)); - if (retval > 0) - Dynarr_delete_many (str->runoff, 0, retval); - /* Do NOT return retval. The return value indicates how much - of the incoming data was written, not how many bytes were - written. */ - return size; -} - -static void -reset_encoding_stream (struct encoding_stream *str) -{ -#ifdef MULE - switch (CODING_SYSTEM_TYPE (str->codesys)) - { - case CODESYS_ISO2022: - { - int i; - - for (i = 0; i < 4; i++) - { - str->iso2022.charset[i] = - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (str->codesys, i); - str->iso2022.force_charset_on_output[i] = - CODING_SYSTEM_ISO2022_FORCE_CHARSET_ON_OUTPUT (str->codesys, i); - } - str->iso2022.register_left = 0; - str->iso2022.register_right = 1; - str->iso2022.current_charset = Qnil; - str->iso2022.current_half = 0; - str->iso2022.current_char_boundary = 1; - break; - } - case CODESYS_CCL: - setup_ccl_program (&str->ccl, CODING_SYSTEM_CCL_ENCODE (str->codesys)); - break; - default: - break; - } -#endif /* MULE */ - - str->flags = str->ch = 0; -} - -static int -encoding_rewinder (Lstream *stream) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - reset_encoding_stream (str); - Dynarr_reset (str->runoff); - return Lstream_rewind (str->other_end); -} - -static int -encoding_seekable_p (Lstream *stream) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - return Lstream_seekable_p (str->other_end); -} - -static int -encoding_flusher (Lstream *stream) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - return Lstream_flush (str->other_end); -} - -static int -encoding_closer (Lstream *stream) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - if (stream->flags & LSTREAM_FL_WRITE) - { - str->flags |= CODING_STATE_END; - encoding_writer (stream, 0, 0); - } - Dynarr_free (str->runoff); - return Lstream_close (str->other_end); -} - -Lisp_Object -encoding_stream_coding_system (Lstream *stream) -{ - Lisp_Object coding_system; - struct encoding_stream *str = ENCODING_STREAM_DATA (stream); - - XSETCODING_SYSTEM (coding_system, str->codesys); - return coding_system; -} - -void -set_encoding_stream_coding_system (Lstream *lstr, Lisp_Object codesys) -{ - Lisp_Coding_System *cs = XCODING_SYSTEM (codesys); - struct encoding_stream *str = ENCODING_STREAM_DATA (lstr); - str->codesys = cs; - reset_encoding_stream (str); + return 0; + return 1; } static Lisp_Object -make_encoding_stream_1 (Lstream *stream, Lisp_Object codesys, - const char *mode) -{ - Lstream *lstr = Lstream_new (lstream_encoding, mode); - struct encoding_stream *str = ENCODING_STREAM_DATA (lstr); - Lisp_Object obj; - - xzero (*str); - str->runoff = Dynarr_new (unsigned_char); - str->other_end = stream; - set_encoding_stream_coding_system (lstr, codesys); - XSETLSTREAM (obj, lstr); - return obj; -} - -Lisp_Object -make_encoding_input_stream (Lstream *stream, Lisp_Object codesys) -{ - return make_encoding_stream_1 (stream, codesys, "r"); -} - -Lisp_Object -make_encoding_output_stream (Lstream *stream, Lisp_Object codesys) -{ - return make_encoding_stream_1 (stream, codesys, "w"); -} - -/* Convert N bytes of internally-formatted data stored in SRC to an - external format, according to the encoding stream ENCODING. - Store the encoded data into DST. */ - -static void -mule_encode (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) +gzip_getprop (Lisp_Object coding_system, Lisp_Object prop) { - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - - switch (CODING_SYSTEM_TYPE (str->codesys)) - { -#ifdef DEBUG_XEMACS - case CODESYS_INTERNAL: - Dynarr_add_many (dst, src, n); - break; -#endif - case CODESYS_AUTODETECT: - /* If we got this far and still haven't decided on the coding - system, then do no conversion. */ - case CODESYS_NO_CONVERSION: - encode_coding_no_conversion (encoding, src, dst, n); - break; -#ifdef MULE - case CODESYS_SHIFT_JIS: - encode_coding_sjis (encoding, src, dst, n); - break; - case CODESYS_BIG5: - encode_coding_big5 (encoding, src, dst, n); - break; - case CODESYS_UCS4: - encode_coding_ucs4 (encoding, src, dst, n); - break; - case CODESYS_UTF8: - encode_coding_utf8 (encoding, src, dst, n); - break; - case CODESYS_CCL: - str->ccl.last_block = str->flags & CODING_STATE_END; - /* When applying ccl program to stream, MUST NOT set NULL - pointer to src. */ - ccl_driver (&str->ccl, ((src) ? src : (unsigned char*)""), - dst, n, 0, CCL_MODE_ENCODING); - break; - case CODESYS_ISO2022: - encode_coding_iso2022 (encoding, src, dst, n); - break; -#endif /* MULE */ - default: - abort (); - } -} - -DEFUN ("encode-coding-region", Fencode_coding_region, 3, 4, 0, /* -Encode the text between START and END using CODING-SYSTEM. -This will, for example, convert Japanese characters into stuff such as -"^[$B!<!+^[(B" if you use the JIS encoding. Return length of encoded -text. BUFFER defaults to the current buffer if unspecified. -*/ - (start, end, coding_system, buffer)) -{ - Charbpos b, e; - struct buffer *buf = decode_buffer (buffer, 0); - Lisp_Object instream, lb_outstream, de_outstream, outstream; - Lstream *istr, *ostr; - struct gcpro gcpro1, gcpro2, gcpro3, gcpro4; - - get_buffer_range_char (buf, start, end, &b, &e, 0); - - barf_if_buffer_read_only (buf, b, e); - - coding_system = Fget_coding_system (coding_system); - instream = make_lisp_buffer_input_stream (buf, b, e, 0); - lb_outstream = make_lisp_buffer_output_stream (buf, b, 0); - de_outstream = make_decoding_output_stream (XLSTREAM (lb_outstream), - Fget_coding_system (Qbinary)); - outstream = make_encoding_output_stream (XLSTREAM (de_outstream), - coding_system); - istr = XLSTREAM (instream); - ostr = XLSTREAM (outstream); - GCPRO4 (instream, outstream, de_outstream, lb_outstream); - /* The chain of streams looks like this: - - [BUFFER] <----- send through - ------> [ENCODE AS SPECIFIED] - ------> [DECODE AS BINARY] - ------> [BUFFER] - */ - while (1) + struct gzip_coding_system *data = + XCODING_SYSTEM_TYPE_DATA (coding_system, gzip); + + if (EQ (prop, Qlevel)) { - char tempbuf[1024]; /* some random amount */ - Charbpos newpos, even_newer_pos; - Charbpos oldpos = lisp_buffer_stream_startpos (istr); - Bytecount size_in_bytes = Lstream_read (istr, tempbuf, sizeof (tempbuf)); - - if (!size_in_bytes) - break; - newpos = lisp_buffer_stream_startpos (istr); - Lstream_write (ostr, tempbuf, size_in_bytes); - even_newer_pos = lisp_buffer_stream_startpos (istr); - buffer_delete_range (buf, even_newer_pos - (newpos - oldpos), - even_newer_pos, 0); - } - - { - Charcount retlen = - lisp_buffer_stream_startpos (XLSTREAM (instream)) - b; - Lstream_close (istr); - Lstream_close (ostr); - UNGCPRO; - Lstream_delete (istr); - Lstream_delete (ostr); - Lstream_delete (XLSTREAM (de_outstream)); - Lstream_delete (XLSTREAM (lb_outstream)); - return make_int (retlen); - } -} - -#ifdef MULE - -/************************************************************************/ -/* Shift-JIS methods */ -/************************************************************************/ - -/* Shift-JIS is a coding system encoding three character sets: ASCII, right - half of JISX0201-Kana, and JISX0208. An ASCII character is encoded - as is. A character of JISX0201-Kana (DIMENSION1_CHARS94 character set) is - encoded by "position-code + 0x80". A character of JISX0208 - (DIMENSION2_CHARS94 character set) is encoded in 2-byte but two - position-codes are divided and shifted so that it fit in the range - below. - - --- CODE RANGE of Shift-JIS --- - (character set) (range) - ASCII 0x00 .. 0x7F - JISX0201-Kana 0xA0 .. 0xDF - JISX0208 (1st byte) 0x80 .. 0x9F and 0xE0 .. 0xEF - (2nd byte) 0x40 .. 0x7E and 0x80 .. 0xFC - ------------------------------- - -*/ - -/* Is this the first byte of a Shift-JIS two-byte char? */ - -#define BYTE_SJIS_TWO_BYTE_1_P(c) \ - (((c) >= 0x81 && (c) <= 0x9F) || ((c) >= 0xE0 && (c) <= 0xEF)) - -/* Is this the second byte of a Shift-JIS two-byte char? */ - -#define BYTE_SJIS_TWO_BYTE_2_P(c) \ - (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0x80 && (c) <= 0xFC)) - -#define BYTE_SJIS_KATAKANA_P(c) \ - ((c) >= 0xA1 && (c) <= 0xDF) - -static int -detect_coding_sjis (struct detection_state *st, const Extbyte *src, Bytecount n) -{ - while (n--) - { - unsigned char c = *(unsigned char *)src++; - if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO) - return 0; - if (st->shift_jis.in_second_byte) - { - st->shift_jis.in_second_byte = 0; - if (c < 0x40) - return 0; - } - else if ((c >= 0x80 && c < 0xA0) || c >= 0xE0) - st->shift_jis.in_second_byte = 1; - } - return CODING_CATEGORY_SHIFT_JIS_MASK; -} - -/* Convert Shift-JIS data to internal format. */ - -static void -decode_coding_sjis (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = str->eol_type; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - - if (ch) - { - /* Previous character was first byte of Shift-JIS Kanji char. */ - if (BYTE_SJIS_TWO_BYTE_2_P (c)) - { - unsigned char e1, e2; - - Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); - DECODE_SJIS (ch, c, e1, e2); - Dynarr_add (dst, e1); - Dynarr_add (dst, e2); - } - else - { - DECODE_ADD_BINARY_CHAR (ch, dst); - DECODE_ADD_BINARY_CHAR (c, dst); - } - ch = 0; - } - else - { - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - if (BYTE_SJIS_TWO_BYTE_1_P (c)) - ch = c; - else if (BYTE_SJIS_KATAKANA_P (c)) - { - Dynarr_add (dst, LEADING_BYTE_KATAKANA_JISX0201); - Dynarr_add (dst, c); - } - else - DECODE_ADD_BINARY_CHAR (c, dst); - } - label_continue_loop:; - } - - DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); - - str->flags = flags; - str->ch = ch; -} - -/* Convert internally-formatted data to Shift-JIS. */ - -static void -encode_coding_sjis (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); - - while (n--) - { - Intbyte c = *src++; - if (c == '\n') - { - if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) - Dynarr_add (dst, '\r'); - if (eol_type != EOL_CR) - Dynarr_add (dst, '\n'); - ch = 0; - } - else if (BYTE_ASCII_P (c)) - { - Dynarr_add (dst, c); - ch = 0; - } - else if (INTBYTE_LEADING_BYTE_P (c)) - ch = (c == LEADING_BYTE_KATAKANA_JISX0201 || - c == LEADING_BYTE_JAPANESE_JISX0208_1978 || - c == LEADING_BYTE_JAPANESE_JISX0208) ? c : 0; - else if (ch) - { - if (ch == LEADING_BYTE_KATAKANA_JISX0201) - { - Dynarr_add (dst, c); - ch = 0; - } - else if (ch == LEADING_BYTE_JAPANESE_JISX0208_1978 || - ch == LEADING_BYTE_JAPANESE_JISX0208) - ch = c; - else - { - unsigned char j1, j2; - ENCODE_SJIS (ch, c, j1, j2); - Dynarr_add (dst, j1); - Dynarr_add (dst, j2); - ch = 0; - } - } - } - - str->flags = flags; - str->ch = ch; -} - -DEFUN ("decode-shift-jis-char", Fdecode_shift_jis_char, 1, 1, 0, /* -Decode a JISX0208 character of Shift-JIS coding-system. -CODE is the character code in Shift-JIS as a cons of type bytes. -Return the corresponding character. -*/ - (code)) -{ - unsigned char c1, c2, s1, s2; - - CHECK_CONS (code); - CHECK_INT (XCAR (code)); - CHECK_INT (XCDR (code)); - s1 = XINT (XCAR (code)); - s2 = XINT (XCDR (code)); - if (BYTE_SJIS_TWO_BYTE_1_P (s1) && - BYTE_SJIS_TWO_BYTE_2_P (s2)) - { - DECODE_SJIS (s1, s2, c1, c2); - return make_char (MAKE_CHAR (Vcharset_japanese_jisx0208, - c1 & 0x7F, c2 & 0x7F)); - } - else - return Qnil; -} - -DEFUN ("encode-shift-jis-char", Fencode_shift_jis_char, 1, 1, 0, /* -Encode a JISX0208 character CHARACTER to SHIFT-JIS coding-system. -Return the corresponding character code in SHIFT-JIS as a cons of two bytes. -*/ - (character)) -{ - Lisp_Object charset; - int c1, c2, s1, s2; - - CHECK_CHAR_COERCE_INT (character); - BREAKUP_CHAR (XCHAR (character), charset, c1, c2); - if (EQ (charset, Vcharset_japanese_jisx0208)) - { - ENCODE_SJIS (c1 | 0x80, c2 | 0x80, s1, s2); - return Fcons (make_int (s1), make_int (s2)); + if (data->level == -1) + return Qdefault; + return make_int (data->level); } - else - return Qnil; -} - - -/************************************************************************/ -/* Big5 methods */ -/************************************************************************/ - -/* BIG5 is a coding system encoding two character sets: ASCII and - Big5. An ASCII character is encoded as is. Big5 is a two-byte - character set and is encoded in two-byte. - - --- CODE RANGE of BIG5 --- - (character set) (range) - ASCII 0x00 .. 0x7F - Big5 (1st byte) 0xA1 .. 0xFE - (2nd byte) 0x40 .. 0x7E and 0xA1 .. 0xFE - -------------------------- - - Since the number of characters in Big5 is larger than maximum - characters in Emacs' charset (96x96), it can't be handled as one - charset. So, in Emacs, Big5 is divided into two: `charset-big5-1' - and `charset-big5-2'. Both <type>s are DIMENSION2_CHARS94. The former - contains frequently used characters and the latter contains less - frequently used characters. */ - -#define BYTE_BIG5_TWO_BYTE_1_P(c) \ - ((c) >= 0xA1 && (c) <= 0xFE) - -/* Is this the second byte of a Shift-JIS two-byte char? */ - -#define BYTE_BIG5_TWO_BYTE_2_P(c) \ - (((c) >= 0x40 && (c) <= 0x7E) || ((c) >= 0xA1 && (c) <= 0xFE)) - -/* Number of Big5 characters which have the same code in 1st byte. */ - -#define BIG5_SAME_ROW (0xFF - 0xA1 + 0x7F - 0x40) - -/* Code conversion macros. These are macros because they are used in - inner loops during code conversion. - - Note that temporary variables in macros introduce the classic - dynamic-scoping problems with variable names. We use capital- - lettered variables in the assumption that XEmacs does not use - capital letters in variables except in a very formalized way - (e.g. Qstring). */ - -/* Convert Big5 code (b1, b2) into its internal string representation - (lb, c1, c2). */ - -/* There is a much simpler way to split the Big5 charset into two. - For the moment I'm going to leave the algorithm as-is because it - claims to separate out the most-used characters into a single - charset, which perhaps will lead to optimizations in various - places. - - The way the algorithm works is something like this: - - Big5 can be viewed as a 94x157 charset, where the row is - encoded into the bytes 0xA1 .. 0xFE and the column is encoded - into the bytes 0x40 .. 0x7E and 0xA1 .. 0xFE. As for frequency, - the split between low and high column numbers is apparently - meaningless; ascending rows produce less and less frequent chars. - Therefore, we assign the lower half of rows (0xA1 .. 0xC8) to - the first charset, and the upper half (0xC9 .. 0xFE) to the - second. To do the conversion, we convert the character into - a single number where 0 .. 156 is the first row, 157 .. 313 - is the second, etc. That way, the characters are ordered by - decreasing frequency. Then we just chop the space in two - and coerce the result into a 94x94 space. - */ - -#define DECODE_BIG5(b1, b2, lb, c1, c2) do \ -{ \ - int B1 = b1, B2 = b2; \ - int I \ - = (B1 - 0xA1) * BIG5_SAME_ROW + B2 - (B2 < 0x7F ? 0x40 : 0x62); \ - \ - if (B1 < 0xC9) \ - { \ - lb = LEADING_BYTE_CHINESE_BIG5_1; \ - } \ - else \ - { \ - lb = LEADING_BYTE_CHINESE_BIG5_2; \ - I -= (BIG5_SAME_ROW) * (0xC9 - 0xA1); \ - } \ - c1 = I / (0xFF - 0xA1) + 0xA1; \ - c2 = I % (0xFF - 0xA1) + 0xA1; \ -} while (0) - -/* Convert the internal string representation of a Big5 character - (lb, c1, c2) into Big5 code (b1, b2). */ - -#define ENCODE_BIG5(lb, c1, c2, b1, b2) do \ -{ \ - int I = ((c1) - 0xA1) * (0xFF - 0xA1) + ((c2) - 0xA1); \ - \ - if (lb == LEADING_BYTE_CHINESE_BIG5_2) \ - { \ - I += BIG5_SAME_ROW * (0xC9 - 0xA1); \ - } \ - b1 = I / BIG5_SAME_ROW + 0xA1; \ - b2 = I % BIG5_SAME_ROW; \ - b2 += b2 < 0x3F ? 0x40 : 0x62; \ -} while (0) - -static int -detect_coding_big5 (struct detection_state *st, const Extbyte *src, Bytecount n) -{ - while (n--) - { - unsigned char c = *(unsigned char *)src++; - if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO || - (c >= 0x80 && c <= 0xA0)) - return 0; - if (st->big5.in_second_byte) - { - st->big5.in_second_byte = 0; - if (c < 0x40 || (c >= 0x80 && c <= 0xA0)) - return 0; - } - else if (c >= 0xA1) - st->big5.in_second_byte = 1; - } - return CODING_CATEGORY_BIG5_MASK; -} - -/* Convert Big5 data to internal format. */ - -static void -decode_coding_big5 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = str->eol_type; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - if (ch) - { - /* Previous character was first byte of Big5 char. */ - if (BYTE_BIG5_TWO_BYTE_2_P (c)) - { - unsigned char b1, b2, b3; - DECODE_BIG5 (ch, c, b1, b2, b3); - Dynarr_add (dst, b1); - Dynarr_add (dst, b2); - Dynarr_add (dst, b3); - } - else - { - DECODE_ADD_BINARY_CHAR (ch, dst); - DECODE_ADD_BINARY_CHAR (c, dst); - } - ch = 0; - } - else - { - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - if (BYTE_BIG5_TWO_BYTE_1_P (c)) - ch = c; - else - DECODE_ADD_BINARY_CHAR (c, dst); - } - label_continue_loop:; - } - - DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); - - str->flags = flags; - str->ch = ch; -} - -/* Convert internally-formatted data to Big5. */ - -static void -encode_coding_big5 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - unsigned char c; - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); - - while (n--) - { - c = *src++; - if (c == '\n') - { - if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) - Dynarr_add (dst, '\r'); - if (eol_type != EOL_CR) - Dynarr_add (dst, '\n'); - } - else if (BYTE_ASCII_P (c)) - { - /* ASCII. */ - Dynarr_add (dst, c); - } - else if (INTBYTE_LEADING_BYTE_P (c)) - { - if (c == LEADING_BYTE_CHINESE_BIG5_1 || - c == LEADING_BYTE_CHINESE_BIG5_2) - { - /* A recognized leading byte. */ - ch = c; - continue; /* not done with this character. */ - } - /* otherwise just ignore this character. */ - } - else if (ch == LEADING_BYTE_CHINESE_BIG5_1 || - ch == LEADING_BYTE_CHINESE_BIG5_2) - { - /* Previous char was a recognized leading byte. */ - ch = (ch << 8) | c; - continue; /* not done with this character. */ - } - else if (ch) - { - /* Encountering second byte of a Big5 character. */ - unsigned char b1, b2; - - ENCODE_BIG5 (ch >> 8, ch & 0xFF, c, b1, b2); - Dynarr_add (dst, b1); - Dynarr_add (dst, b2); - } - - ch = 0; - } - - str->flags = flags; - str->ch = ch; -} - - -DEFUN ("decode-big5-char", Fdecode_big5_char, 1, 1, 0, /* -Decode a Big5 character CODE of BIG5 coding-system. -CODE is the character code in BIG5, a cons of two integers. -Return the corresponding character. -*/ - (code)) -{ - unsigned char c1, c2, b1, b2; - - CHECK_CONS (code); - CHECK_INT (XCAR (code)); - CHECK_INT (XCDR (code)); - b1 = XINT (XCAR (code)); - b2 = XINT (XCDR (code)); - if (BYTE_BIG5_TWO_BYTE_1_P (b1) && - BYTE_BIG5_TWO_BYTE_2_P (b2)) - { - int leading_byte; - Lisp_Object charset; - DECODE_BIG5 (b1, b2, leading_byte, c1, c2); - charset = CHARSET_BY_LEADING_BYTE (leading_byte); - return make_char (MAKE_CHAR (charset, c1 & 0x7F, c2 & 0x7F)); - } - else - return Qnil; -} - -DEFUN ("encode-big5-char", Fencode_big5_char, 1, 1, 0, /* -Encode the Big5 character CHARACTER in the BIG5 coding-system. -Return the corresponding character code in Big5. -*/ - (character)) -{ - Lisp_Object charset; - int c1, c2, b1, b2; - - CHECK_CHAR_COERCE_INT (character); - BREAKUP_CHAR (XCHAR (character), charset, c1, c2); - if (EQ (charset, Vcharset_chinese_big5_1) || - EQ (charset, Vcharset_chinese_big5_2)) - { - ENCODE_BIG5 (XCHARSET_LEADING_BYTE (charset), c1 | 0x80, c2 | 0x80, - b1, b2); - return Fcons (make_int (b1), make_int (b2)); - } - else - return Qnil; -} - - -/************************************************************************/ -/* UCS-4 methods */ -/* */ -/* UCS-4 character codes are implemented as nonnegative integers. */ -/* */ -/************************************************************************/ - - -DEFUN ("set-ucs-char", Fset_ucs_char, 2, 2, 0, /* -Map UCS-4 code CODE to Mule character CHARACTER. - -Return T on success, NIL on failure. -*/ - (code, character)) -{ - EMACS_INT c; - - CHECK_CHAR (character); - CHECK_NATNUM (code); - c = XINT (code); - - if (c < countof (fcd->ucs_to_mule_table)) - { - fcd->ucs_to_mule_table[c] = character; - return Qt; - } - else - return Qnil; -} - -static Lisp_Object -ucs_to_char (unsigned long code) -{ - if (code < countof (fcd->ucs_to_mule_table)) - { - return fcd->ucs_to_mule_table[code]; - } - else if ((0xe00000 <= code) && (code <= 0xe00000 + 94 * 94 * 14)) - { - unsigned int c; - - code -= 0xe00000; - c = code % (94 * 94); - return make_char - (MAKE_CHAR (CHARSET_BY_ATTRIBUTES - (CHARSET_TYPE_94X94, code / (94 * 94) + '@', - CHARSET_LEFT_TO_RIGHT), - c / 94 + 33, c % 94 + 33)); - } - else - return Qnil; -} - -DEFUN ("ucs-char", Fucs_char, 1, 1, 0, /* -Return Mule character corresponding to UCS code CODE (a positive integer). -*/ - (code)) -{ - CHECK_NATNUM (code); - return ucs_to_char (XINT (code)); -} - -DEFUN ("set-char-ucs", Fset_char_ucs, 2, 2, 0, /* -Map Mule character CHARACTER to UCS code CODE (a positive integer). -*/ - (character, code)) -{ - /* #### Isn't this gilding the lily? Fput_char_table checks its args. - Fset_char_ucs is more restrictive on index arg, but should - check code arg in a char_table method. */ - CHECK_CHAR (character); - CHECK_NATNUM (code); - return Fput_char_table (character, code, mule_to_ucs_table); -} - -DEFUN ("char-ucs", Fchar_ucs, 1, 1, 0, /* -Return the UCS code (a positive integer) corresponding to CHARACTER. -*/ - (character)) -{ - return Fget_char_table (character, mule_to_ucs_table); -} - -/* Decode a UCS-4 character into a buffer. If the lookup fails, use - <GETA MARK> (U+3013) of JIS X 0208, which means correct character - is not found, instead. - #### do something more appropriate (use blob?) - Danger, Will Robinson! Data loss. Should we signal user? */ -static void -decode_ucs4 (unsigned long ch, unsigned_char_dynarr *dst) -{ - Lisp_Object chr = ucs_to_char (ch); - - if (! NILP (chr)) - { - Intbyte work[MAX_EMCHAR_LEN]; - int len; - - ch = XCHAR (chr); - len = (ch < 128) ? - simple_set_charptr_emchar (work, ch) : - non_ascii_set_charptr_emchar (work, ch); - Dynarr_add_many (dst, work, len); - } - else - { - Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); - Dynarr_add (dst, 34 + 128); - Dynarr_add (dst, 46 + 128); - } -} - -static unsigned long -mule_char_to_ucs4 (Lisp_Object charset, - unsigned char h, unsigned char l) -{ - Lisp_Object code - = Fget_char_table (make_char (MAKE_CHAR (charset, h & 127, l & 127)), - mule_to_ucs_table); - - if (INTP (code)) - { - return XINT (code); - } - else if ( (XCHARSET_DIMENSION (charset) == 2) && - (XCHARSET_CHARS (charset) == 94) ) - { - unsigned char final = XCHARSET_FINAL (charset); - - if ( ('@' <= final) && (final < 0x7f) ) - { - return 0xe00000 + (final - '@') * 94 * 94 - + ((h & 127) - 33) * 94 + (l & 127) - 33; - } - else - { - return '?'; - } - } - else - { - return '?'; - } + + return Qunbound; } static void -encode_ucs4 (Lisp_Object charset, - unsigned char h, unsigned char l, unsigned_char_dynarr *dst) +gzip_init_coding_stream (struct coding_stream *str) { - unsigned long code = mule_char_to_ucs4 (charset, h, l); - Dynarr_add (dst, code >> 24); - Dynarr_add (dst, (code >> 16) & 255); - Dynarr_add (dst, (code >> 8) & 255); - Dynarr_add (dst, code & 255); -} - -static int -detect_coding_ucs4 (struct detection_state *st, const Extbyte *src, Bytecount n) -{ - while (n--) + struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip); + if (data->stream_initted) { - unsigned char c = *(unsigned char *)src++; - switch (st->ucs4.in_byte) - { - case 0: - if (c >= 128) - return 0; - else - st->ucs4.in_byte++; - break; - case 3: - st->ucs4.in_byte = 0; - break; - default: - st->ucs4.in_byte++; - } + if (str->direction == CODING_DECODE) + inflateEnd (&data->stream); + else + deflateEnd (&data->stream); + data->stream_initted = 0; } - return CODING_CATEGORY_UCS4_MASK; -} - -static void -decode_coding_ucs4 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - unsigned char counter = str->counter; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - switch (counter) - { - case 0: - ch = c; - counter = 3; - break; - case 1: - decode_ucs4 ( ( ch << 8 ) | c, dst); - ch = 0; - counter = 0; - break; - default: - ch = ( ch << 8 ) | c; - counter--; - } - } - if (counter & CODING_STATE_END) - DECODE_OUTPUT_PARTIAL_CHAR (ch); - - str->flags = flags; - str->ch = ch; - str->counter = counter; + data->reached_eof = 0; } static void -encode_coding_ucs4 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) +gzip_rewind_coding_stream (struct coding_stream *str) { - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - unsigned char char_boundary = str->iso2022.current_char_boundary; - Lisp_Object charset = str->iso2022.current_charset; - -#ifdef ENABLE_COMPOSITE_CHARS - /* flags for handling composite chars. We do a little switcharoo - on the source while we're outputting the composite char. */ - Bytecount saved_n = 0; - const unsigned char *saved_src = NULL; - int in_composite = 0; - - back_to_square_n: -#endif - - while (n--) - { - unsigned char c = *src++; - - if (BYTE_ASCII_P (c)) - { /* Processing ASCII character */ - ch = 0; - encode_ucs4 (Vcharset_ascii, c, 0, dst); - char_boundary = 1; - } - else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) - { /* Processing Leading Byte */ - ch = 0; - charset = CHARSET_BY_LEADING_BYTE (c); - if (LEADING_BYTE_PREFIX_P(c)) - ch = c; - char_boundary = 0; - } - else - { /* Processing Non-ASCII character */ - char_boundary = 1; - if (EQ (charset, Vcharset_control_1)) - { - encode_ucs4 (Vcharset_control_1, c, 0, dst); - } - else - { - switch (XCHARSET_REP_BYTES (charset)) - { - case 2: - encode_ucs4 (charset, c, 0, dst); - break; - case 3: - if (XCHARSET_PRIVATE_P (charset)) - { - encode_ucs4 (charset, c, 0, dst); - ch = 0; - } - else if (ch) - { -#ifdef ENABLE_COMPOSITE_CHARS - if (EQ (charset, Vcharset_composite)) - { - if (in_composite) - { - /* #### Bother! We don't know how to - handle this yet. */ - Dynarr_add (dst, '\0'); - Dynarr_add (dst, '\0'); - Dynarr_add (dst, '\0'); - Dynarr_add (dst, '~'); - } - else - { - Emchar emch = MAKE_CHAR (Vcharset_composite, - ch & 0x7F, c & 0x7F); - Lisp_Object lstr = composite_char_string (emch); - saved_n = n; - saved_src = src; - in_composite = 1; - src = XSTRING_DATA (lstr); - n = XSTRING_LENGTH (lstr); - } - } - else -#endif /* ENABLE_COMPOSITE_CHARS */ - { - encode_ucs4(charset, ch, c, dst); - } - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - case 4: - if (ch) - { - encode_ucs4 (charset, ch, c, dst); - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - default: - abort (); - } - } - } - } - -#ifdef ENABLE_COMPOSITE_CHARS - if (in_composite) - { - n = saved_n; - src = saved_src; - in_composite = 0; - goto back_to_square_n; /* Wheeeeeeeee ..... */ - } -#endif /* ENABLE_COMPOSITE_CHARS */ - - str->flags = flags; - str->ch = ch; - str->iso2022.current_char_boundary = char_boundary; - str->iso2022.current_charset = charset; - - /* Verbum caro factum est! */ + gzip_init_coding_stream (str); } - -/************************************************************************/ -/* UTF-8 methods */ -/************************************************************************/ - -static int -detect_coding_utf8 (struct detection_state *st, const Extbyte *src, Bytecount n) +static Bytecount +gzip_convert (struct coding_stream *str, + const UExtbyte *src, + unsigned_char_dynarr *dst, Bytecount n) { - while (n--) + struct gzip_coding_stream *data = CODING_STREAM_TYPE_DATA (str, gzip); + int zerr; + if (str->direction == CODING_DECODE) { - unsigned char c = *(unsigned char *)src++; - switch (st->utf8.in_byte) - { - case 0: - if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO) - return 0; - else if (c >= 0xfc) - st->utf8.in_byte = 5; - else if (c >= 0xf8) - st->utf8.in_byte = 4; - else if (c >= 0xf0) - st->utf8.in_byte = 3; - else if (c >= 0xe0) - st->utf8.in_byte = 2; - else if (c >= 0xc0) - st->utf8.in_byte = 1; - else if (c >= 0x80) - return 0; - break; - default: - if ((c & 0xc0) != 0x80) - return 0; - else - st->utf8.in_byte--; - } - } - return CODING_CATEGORY_UTF8_MASK; -} - -static void -decode_coding_utf8 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = str->eol_type; - unsigned char counter = str->counter; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - switch (counter) + if (data->reached_eof) + return n; /* eat the data */ + + if (!data->stream_initted) { - case 0: - if ( c >= 0xfc ) - { - ch = c & 0x01; - counter = 5; - } - else if ( c >= 0xf8 ) - { - ch = c & 0x03; - counter = 4; - } - else if ( c >= 0xf0 ) - { - ch = c & 0x07; - counter = 3; - } - else if ( c >= 0xe0 ) - { - ch = c & 0x0f; - counter = 2; - } - else if ( c >= 0xc0 ) - { - ch = c & 0x1f; - counter = 1; - } - else - { - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - decode_ucs4 (c, dst); - } - break; - case 1: - ch = ( ch << 6 ) | ( c & 0x3f ); - decode_ucs4 (ch, dst); - ch = 0; - counter = 0; - break; - default: - ch = ( ch << 6 ) | ( c & 0x3f ); - counter--; + xzero (data->stream); + if (inflateInit (&data->stream) != Z_OK) + return LSTREAM_ERROR; + data->stream_initted = 1; } - label_continue_loop:; - } - - if (flags & CODING_STATE_END) - DECODE_OUTPUT_PARTIAL_CHAR (ch); - - str->flags = flags; - str->ch = ch; - str->counter = counter; -} - -static void -encode_utf8 (Lisp_Object charset, - unsigned char h, unsigned char l, unsigned_char_dynarr *dst) -{ - unsigned long code = mule_char_to_ucs4 (charset, h, l); - if ( code <= 0x7f ) - { - Dynarr_add (dst, code); - } - else if ( code <= 0x7ff ) - { - Dynarr_add (dst, (code >> 6) | 0xc0); - Dynarr_add (dst, (code & 0x3f) | 0x80); - } - else if ( code <= 0xffff ) - { - Dynarr_add (dst, (code >> 12) | 0xe0); - Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); - Dynarr_add (dst, (code & 0x3f) | 0x80); - } - else if ( code <= 0x1fffff ) - { - Dynarr_add (dst, (code >> 18) | 0xf0); - Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); - Dynarr_add (dst, (code & 0x3f) | 0x80); - } - else if ( code <= 0x3ffffff ) - { - Dynarr_add (dst, (code >> 24) | 0xf8); - Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); - Dynarr_add (dst, (code & 0x3f) | 0x80); + + data->stream.next_in = (Bytef *) src; + data->stream.avail_in = n; + + /* Normally we stop when we've fed all data to the decompressor; but + if we're at the end of the input, and the decompressor hasn't + reported EOF, we need to keep going, as there might be more output + to generate. Z_OK from the decompressor means input was processed + or output was generated; if neither, we break out of the loop. + Other return values are: + + Z_STREAM_END EOF from decompressor + Z_DATA_ERROR Corrupted data + Z_BUF_ERROR No progress possible (this should happen if + we try to feed it an incomplete file) + Z_MEM_ERROR Out of memory + Z_STREAM_ERROR (should never happen) + Z_NEED_DICT (#### when will this happen?) + */ + while (data->stream.avail_in > 0 || str->eof) + { + /* Reserve an output buffer of the same size as the input buffer; + if that's not enough, we keep reserving the same size. */ + Bytecount reserved = n; + Dynarr_add_many (dst, 0, reserved); + /* Careful here! Don't retrieve the pointer until after + reserving the space, or it might be bogus */ + data->stream.next_out = + Dynarr_atp (dst, Dynarr_length (dst) - reserved); + data->stream.avail_out = reserved; + zerr = inflate (&data->stream, Z_NO_FLUSH); + /* Lop off the unused portion */ + Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out); + if (zerr != Z_OK) + break; + } + + if (zerr == Z_STREAM_END) + data->reached_eof = 1; + + if ((Bytecount) data->stream.avail_in < n) + return n - data->stream.avail_in; + + if (zerr == Z_OK || zerr == Z_STREAM_END) + return 0; + + return LSTREAM_ERROR; } else { - Dynarr_add (dst, (code >> 30) | 0xfc); - Dynarr_add (dst, ((code >> 24) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 18) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 12) & 0x3f) | 0x80); - Dynarr_add (dst, ((code >> 6) & 0x3f) | 0x80); - Dynarr_add (dst, (code & 0x3f) | 0x80); - } -} - -static void -encode_coding_utf8 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); - unsigned char char_boundary = str->iso2022.current_char_boundary; - Lisp_Object charset = str->iso2022.current_charset; - -#ifdef ENABLE_COMPOSITE_CHARS - /* flags for handling composite chars. We do a little switcharoo - on the source while we're outputting the composite char. */ - Bytecount saved_n = 0; - const unsigned char *saved_src = NULL; - int in_composite = 0; - - back_to_square_n: -#endif /* ENABLE_COMPOSITE_CHARS */ - - while (n--) - { - unsigned char c = *src++; - - if (BYTE_ASCII_P (c)) - { /* Processing ASCII character */ - ch = 0; - if (c == '\n') - { - if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) - Dynarr_add (dst, '\r'); - if (eol_type != EOL_CR) - Dynarr_add (dst, c); - } - else - encode_utf8 (Vcharset_ascii, c, 0, dst); - char_boundary = 1; - } - else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) - { /* Processing Leading Byte */ - ch = 0; - charset = CHARSET_BY_LEADING_BYTE (c); - if (LEADING_BYTE_PREFIX_P(c)) - ch = c; - char_boundary = 0; - } - else - { /* Processing Non-ASCII character */ - char_boundary = 1; - if (EQ (charset, Vcharset_control_1)) - { - encode_utf8 (Vcharset_control_1, c, 0, dst); - } - else - { - switch (XCHARSET_REP_BYTES (charset)) - { - case 2: - encode_utf8 (charset, c, 0, dst); - break; - case 3: - if (XCHARSET_PRIVATE_P (charset)) - { - encode_utf8 (charset, c, 0, dst); - ch = 0; - } - else if (ch) - { -#ifdef ENABLE_COMPOSITE_CHARS - if (EQ (charset, Vcharset_composite)) - { - if (in_composite) - { - /* #### Bother! We don't know how to - handle this yet. */ - encode_utf8 (Vcharset_ascii, '~', 0, dst); - } - else - { - Emchar emch = MAKE_CHAR (Vcharset_composite, - ch & 0x7F, c & 0x7F); - Lisp_Object lstr = composite_char_string (emch); - saved_n = n; - saved_src = src; - in_composite = 1; - src = XSTRING_DATA (lstr); - n = XSTRING_LENGTH (lstr); - } - } - else -#endif /* ENABLE_COMPOSITE_CHARS */ - { - encode_utf8 (charset, ch, c, dst); - } - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - case 4: - if (ch) - { - encode_utf8 (charset, ch, c, dst); - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - default: - abort (); - } - } + if (!data->stream_initted) + { + int level = XCODING_SYSTEM_GZIP_LEVEL (str->codesys); + xzero (data->stream); + if (deflateInit (&data->stream, + level == -1 ? Z_DEFAULT_COMPRESSION : level) != + Z_OK) + return LSTREAM_ERROR; + data->stream_initted = 1; } - } - -#ifdef ENABLE_COMPOSITE_CHARS - if (in_composite) - { - n = saved_n; - src = saved_src; - in_composite = 0; - goto back_to_square_n; /* Wheeeeeeeee ..... */ - } -#endif - - str->flags = flags; - str->ch = ch; - str->iso2022.current_char_boundary = char_boundary; - str->iso2022.current_charset = charset; - - /* Verbum caro factum est! */ -} - - -/************************************************************************/ -/* ISO2022 methods */ -/************************************************************************/ - -/* The following note describes the coding system ISO2022 briefly. - Since the intention of this note is to help understand the - functions in this file, some parts are NOT ACCURATE or OVERLY - SIMPLIFIED. For thorough understanding, please refer to the - original document of ISO2022. - - ISO2022 provides many mechanisms to encode several character sets - in 7-bit and 8-bit environments. For 7-bit environments, all text - is encoded using bytes less than 128. This may make the encoded - text a little bit longer, but the text passes more easily through - several gateways, some of which strip off MSB (Most Signigant Bit). - - There are two kinds of character sets: control character set and - graphic character set. The former contains control characters such - as `newline' and `escape' to provide control functions (control - functions are also provided by escape sequences). The latter - contains graphic characters such as 'A' and '-'. Emacs recognizes - two control character sets and many graphic character sets. - - Graphic character sets are classified into one of the following - four classes, according to the number of bytes (DIMENSION) and - number of characters in one dimension (CHARS) of the set: - - DIMENSION1_CHARS94 - - DIMENSION1_CHARS96 - - DIMENSION2_CHARS94 - - DIMENSION2_CHARS96 - - In addition, each character set is assigned an identification tag, - unique for each set, called "final character" (denoted as <F> - hereafter). The <F> of each character set is decided by ECMA(*) - when it is registered in ISO. The code range of <F> is 0x30..0x7F - (0x30..0x3F are for private use only). - - Note (*): ECMA = European Computer Manufacturers Association - - Here are examples of graphic character set [NAME(<F>)]: - o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... - o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... - o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... - o DIMENSION2_CHARS96 -- none for the moment - - A code area (1 byte = 8 bits) is divided into 4 areas, C0, GL, C1, and GR. - C0 [0x00..0x1F] -- control character plane 0 - GL [0x20..0x7F] -- graphic character plane 0 - C1 [0x80..0x9F] -- control character plane 1 - GR [0xA0..0xFF] -- graphic character plane 1 - - A control character set is directly designated and invoked to C0 or - C1 by an escape sequence. The most common case is that: - - ISO646's control character set is designated/invoked to C0, and - - ISO6429's control character set is designated/invoked to C1, - and usually these designations/invocations are omitted in encoded - text. In a 7-bit environment, only C0 can be used, and a control - character for C1 is encoded by an appropriate escape sequence to - fit into the environment. All control characters for C1 are - defined to have corresponding escape sequences. - - A graphic character set is at first designated to one of four - graphic registers (G0 through G3), then these graphic registers are - invoked to GL or GR. These designations and invocations can be - done independently. The most common case is that G0 is invoked to - GL, G1 is invoked to GR, and ASCII is designated to G0. Usually - these invocations and designations are omitted in encoded text. - In a 7-bit environment, only GL can be used. - - When a graphic character set of CHARS94 is invoked to GL, codes - 0x20 and 0x7F of the GL area work as control characters SPACE and - DEL respectively, and codes 0xA0 and 0xFF of the GR area should not - be used. - - There are two ways of invocation: locking-shift and single-shift. - With locking-shift, the invocation lasts until the next different - invocation, whereas with single-shift, the invocation affects the - following character only and doesn't affect the locking-shift - state. Invocations are done by the following control characters or - escape sequences: - - ---------------------------------------------------------------------- - abbrev function cntrl escape seq description - ---------------------------------------------------------------------- - SI/LS0 (shift-in) 0x0F none invoke G0 into GL - SO/LS1 (shift-out) 0x0E none invoke G1 into GL - LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL - LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL - LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*) - LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*) - LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*) - SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char - SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char - ---------------------------------------------------------------------- - (*) These are not used by any known coding system. - - Control characters for these functions are defined by macros - ISO_CODE_XXX in `coding.h'. - - Designations are done by the following escape sequences: - ---------------------------------------------------------------------- - escape sequence description - ---------------------------------------------------------------------- - ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0 - ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1 - ESC '*' <F> designate DIMENSION1_CHARS94<F> to G2 - ESC '+' <F> designate DIMENSION1_CHARS94<F> to G3 - ESC ',' <F> designate DIMENSION1_CHARS96<F> to G0 (*) - ESC '-' <F> designate DIMENSION1_CHARS96<F> to G1 - ESC '.' <F> designate DIMENSION1_CHARS96<F> to G2 - ESC '/' <F> designate DIMENSION1_CHARS96<F> to G3 - ESC '$' '(' <F> designate DIMENSION2_CHARS94<F> to G0 (**) - ESC '$' ')' <F> designate DIMENSION2_CHARS94<F> to G1 - ESC '$' '*' <F> designate DIMENSION2_CHARS94<F> to G2 - ESC '$' '+' <F> designate DIMENSION2_CHARS94<F> to G3 - ESC '$' ',' <F> designate DIMENSION2_CHARS96<F> to G0 (*) - ESC '$' '-' <F> designate DIMENSION2_CHARS96<F> to G1 - ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2 - ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3 - ---------------------------------------------------------------------- - - In this list, "DIMENSION1_CHARS94<F>" means a graphic character set - of dimension 1, chars 94, and final character <F>, etc... - - Note (*): Although these designations are not allowed in ISO2022, - Emacs accepts them on decoding, and produces them on encoding - CHARS96 character sets in a coding system which is characterized as - 7-bit environment, non-locking-shift, and non-single-shift. - - Note (**): If <F> is '@', 'A', or 'B', the intermediate character - '(' can be omitted. We refer to this as "short-form" hereafter. - - Now you may notice that there are a lot of ways for encoding the - same multilingual text in ISO2022. Actually, there exist many - coding systems such as Compound Text (used in X11's inter client - communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR - (used in Korean internet), EUC (Extended UNIX Code, used in Asian - localized platforms), and all of these are variants of ISO2022. - - In addition to the above, Emacs handles two more kinds of escape - sequences: ISO6429's direction specification and Emacs' private - sequence for specifying character composition. - - ISO6429's direction specification takes the following form: - o CSI ']' -- end of the current direction - o CSI '0' ']' -- end of the current direction - o CSI '1' ']' -- start of left-to-right text - o CSI '2' ']' -- start of right-to-left text - The control character CSI (0x9B: control sequence introducer) is - abbreviated to the escape sequence ESC '[' in a 7-bit environment. - - Character composition specification takes the following form: - o ESC '0' -- start character composition - o ESC '1' -- end character composition - Since these are not standard escape sequences of any ISO standard, - their use with these meanings is restricted to Emacs only. */ - -static void -reset_iso2022 (Lisp_Object coding_system, struct iso2022_decoder *iso) -{ - int i; - - for (i = 0; i < 4; i++) - { - if (!NILP (coding_system)) - iso->charset[i] = - XCODING_SYSTEM_ISO2022_INITIAL_CHARSET (coding_system, i); - else - iso->charset[i] = Qt; - iso->invalid_designated[i] = 0; - } - iso->esc = ISO_ESC_NOTHING; - iso->esc_bytes_index = 0; - iso->register_left = 0; - iso->register_right = 1; - iso->switched_dir_and_no_valid_charset_yet = 0; - iso->invalid_switch_dir = 0; - iso->output_direction_sequence = 0; - iso->output_literally = 0; -#ifdef ENABLE_COMPOSITE_CHARS - if (iso->composite_chars) - Dynarr_reset (iso->composite_chars); -#endif -} - -static int -fit_to_be_escape_quoted (unsigned char c) -{ - switch (c) - { - case ISO_CODE_ESC: - case ISO_CODE_CSI: - case ISO_CODE_SS2: - case ISO_CODE_SS3: - case ISO_CODE_SO: - case ISO_CODE_SI: - return 1; - - default: - return 0; + + data->stream.next_in = (Bytef *) src; + data->stream.avail_in = n; + + /* Normally we stop when we've fed all data to the compressor; but if + we're at the end of the input, and the compressor hasn't reported + EOF, we need to keep going, as there might be more output to + generate. (To signal EOF on our end, we set the FLUSH parameter + to Z_FINISH; when all data is output, Z_STREAM_END will be + returned.) Z_OK from the compressor means input was processed or + output was generated; if neither, we break out of the loop. Other + return values are: + + Z_STREAM_END EOF from compressor + Z_BUF_ERROR No progress possible (should never happen) + Z_STREAM_ERROR (should never happen) + */ + while (data->stream.avail_in > 0 || str->eof) + { + /* Reserve an output buffer of the same size as the input buffer; + if that's not enough, we keep reserving the same size. */ + Bytecount reserved = n; + Dynarr_add_many (dst, 0, reserved); + /* Careful here! Don't retrieve the pointer until after + reserving the space, or it might be bogus */ + data->stream.next_out = + Dynarr_atp (dst, Dynarr_length (dst) - reserved); + data->stream.avail_out = reserved; + zerr = + deflate (&data->stream, + str->eof ? Z_FINISH : Z_NO_FLUSH); + /* Lop off the unused portion */ + Dynarr_set_size (dst, Dynarr_length (dst) - data->stream.avail_out); + if (zerr != Z_OK) + break; + } + + if ((Bytecount) data->stream.avail_in < n) + return n - data->stream.avail_in; + + if (zerr == Z_OK || zerr == Z_STREAM_END) + return 0; + + return LSTREAM_ERROR; } } -/* Parse one byte of an ISO2022 escape sequence. - If the result is an invalid escape sequence, return 0 and - do not change anything in STR. Otherwise, if the result is - an incomplete escape sequence, update ISO2022.ESC and - ISO2022.ESC_BYTES and return -1. Otherwise, update - all the state variables (but not ISO2022.ESC_BYTES) and - return 1. - - If CHECK_INVALID_CHARSETS is non-zero, check for designation - or invocation of an invalid character set and treat that as - an unrecognized escape sequence. - - ******************************************************************** - - #### Strategies for error annotation and coding orthogonalization - - We really want to separate out a number of things. Conceptually, - there is a nested syntax. - - At the top level is the ISO 2022 extension syntax, including charset - designation and invocation, and certain auxiliary controls such as the - ISO 6429 direction specification. These are octet-oriented, with the - single exception (AFAIK) of the "exit Unicode" sequence which uses the - UTF's natural width (1 byte for UTF-7 and UTF-8, 2 bytes for UCS-2 and - UTF-16, and 4 bytes for UCS-4 and UTF-32). This will be treated as a - (deprecated) special case in Unicode processing. - - The middle layer is ISO 2022 character interpretation. This will depend - on the current state of the ISO 2022 registers, and assembles octets - into the character's internal representation. - - The lowest level is translating system control conventions. At present - this is restricted to newline translation, but one could imagine doing - tab conversion or line wrapping here. "Escape from Unicode" processing - would be done at this level. - - At each level the parser will verify the syntax. In the case of a - syntax error or warning (such as a redundant escape sequence that affects - no characters), the parser will take some action, typically inserting the - erroneous octets directly into the output and creating an annotation - which can be used by higher level I/O to mark the affected region. - - This should make it possible to do something sensible about separating - newline convention processing from character construction, and about - preventing ISO 2022 escape sequences from being recognized - inappropriately. - - The basic strategy will be to have octet classification tables, and - switch processing according to the table entry. - - It's possible that, by doing the processing with tables of functions or - the like, the parser can be used for both detection and translation. */ - -static int -parse_iso2022_esc (Lisp_Object codesys, struct iso2022_decoder *iso, - unsigned char c, unsigned int *flags, - int check_invalid_charsets) -{ - /* (1) If we're at the end of a designation sequence, CS is the - charset being designated and REG is the register to designate - it to. - - (2) If we're at the end of a locking-shift sequence, REG is - the register to invoke and HALF (0 == left, 1 == right) is - the half to invoke it into. - - (3) If we're at the end of a single-shift sequence, REG is - the register to invoke. */ - Lisp_Object cs = Qnil; - int reg, half; - - /* NOTE: This code does goto's all over the fucking place. - The reason for this is that we're basically implementing - a state machine here, and hierarchical languages like C - don't really provide a clean way of doing this. */ - - if (! (*flags & CODING_STATE_ESCAPE)) - /* At beginning of escape sequence; we need to reset our - escape-state variables. */ - iso->esc = ISO_ESC_NOTHING; - - iso->output_literally = 0; - iso->output_direction_sequence = 0; - - switch (iso->esc) - { - case ISO_ESC_NOTHING: - iso->esc_bytes_index = 0; - switch (c) - { - case ISO_CODE_ESC: /* Start escape sequence */ - *flags |= CODING_STATE_ESCAPE; - iso->esc = ISO_ESC; - goto not_done; - - case ISO_CODE_CSI: /* ISO6429 (specifying directionality) */ - *flags |= CODING_STATE_ESCAPE; - iso->esc = ISO_ESC_5_11; - goto not_done; - - case ISO_CODE_SO: /* locking shift 1 */ - reg = 1; half = 0; - goto locking_shift; - case ISO_CODE_SI: /* locking shift 0 */ - reg = 0; half = 0; - goto locking_shift; - - case ISO_CODE_SS2: /* single shift */ - reg = 2; - goto single_shift; - case ISO_CODE_SS3: /* single shift */ - reg = 3; - goto single_shift; - - default: /* Other control characters */ - return 0; - } - - case ISO_ESC: - switch (c) - { - /**** single shift ****/ - - case 'N': /* single shift 2 */ - reg = 2; - goto single_shift; - case 'O': /* single shift 3 */ - reg = 3; - goto single_shift; - - /**** locking shift ****/ - - case '~': /* locking shift 1 right */ - reg = 1; half = 1; - goto locking_shift; - case 'n': /* locking shift 2 */ - reg = 2; half = 0; - goto locking_shift; - case '}': /* locking shift 2 right */ - reg = 2; half = 1; - goto locking_shift; - case 'o': /* locking shift 3 */ - reg = 3; half = 0; - goto locking_shift; - case '|': /* locking shift 3 right */ - reg = 3; half = 1; - goto locking_shift; - -#ifdef ENABLE_COMPOSITE_CHARS - /**** composite ****/ - - case '0': - iso->esc = ISO_ESC_START_COMPOSITE; - *flags = (*flags & CODING_STATE_ISO2022_LOCK) | - CODING_STATE_COMPOSITE; - return 1; - - case '1': - iso->esc = ISO_ESC_END_COMPOSITE; - *flags = (*flags & CODING_STATE_ISO2022_LOCK) & - ~CODING_STATE_COMPOSITE; - return 1; -#endif /* ENABLE_COMPOSITE_CHARS */ - - /**** directionality ****/ - - case '[': - iso->esc = ISO_ESC_5_11; - goto not_done; - - /**** designation ****/ - - case '$': /* multibyte charset prefix */ - iso->esc = ISO_ESC_2_4; - goto not_done; - - default: - if (0x28 <= c && c <= 0x2F) - { - iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_8); - goto not_done; - } - - /* This function is called with CODESYS equal to nil when - doing coding-system detection. */ - if (!NILP (codesys) - && XCODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) - && fit_to_be_escape_quoted (c)) - { - iso->esc = ISO_ESC_LITERAL; - *flags &= CODING_STATE_ISO2022_LOCK; - return 1; - } - - /* bzzzt! */ - return 0; - } - - - - /**** directionality ****/ - - case ISO_ESC_5_11: /* ISO6429 direction control */ - if (c == ']') - { - *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); - goto directionality; - } - if (c == '0') iso->esc = ISO_ESC_5_11_0; - else if (c == '1') iso->esc = ISO_ESC_5_11_1; - else if (c == '2') iso->esc = ISO_ESC_5_11_2; - else return 0; - goto not_done; - - case ISO_ESC_5_11_0: - if (c == ']') - { - *flags &= (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); - goto directionality; - } - return 0; - - case ISO_ESC_5_11_1: - if (c == ']') - { - *flags = (CODING_STATE_ISO2022_LOCK & ~CODING_STATE_R2L); - goto directionality; - } - return 0; - - case ISO_ESC_5_11_2: - if (c == ']') - { - *flags = (*flags & CODING_STATE_ISO2022_LOCK) | CODING_STATE_R2L; - goto directionality; - } - return 0; - - directionality: - iso->esc = ISO_ESC_DIRECTIONALITY; - /* Various junk here to attempt to preserve the direction sequences - literally in the text if they would otherwise be swallowed due - to invalid designations that don't show up as actual charset - changes in the text. */ - if (iso->invalid_switch_dir) - { - /* We already inserted a direction switch literally into the - text. We assume (#### this may not be right) that the - next direction switch is the one going the other way, - and we need to output that literally as well. */ - iso->output_literally = 1; - iso->invalid_switch_dir = 0; - } - else - { - int jj; - - /* If we are in the thrall of an invalid designation, - then stick the directionality sequence literally into the - output stream so it ends up in the original text again. */ - for (jj = 0; jj < 4; jj++) - if (iso->invalid_designated[jj]) - break; - if (jj < 4) - { - iso->output_literally = 1; - iso->invalid_switch_dir = 1; - } - else - /* Indicate that we haven't yet seen a valid designation, - so that if a switch-dir is directly followed by an - invalid designation, both get inserted literally. */ - iso->switched_dir_and_no_valid_charset_yet = 1; - } - return 1; - - - /**** designation ****/ - - case ISO_ESC_2_4: - if (0x28 <= c && c <= 0x2F) - { - iso->esc = (enum iso_esc_flag) (c - 0x28 + ISO_ESC_2_4_8); - goto not_done; - } - if (0x40 <= c && c <= 0x42) - { - cs = CHARSET_BY_ATTRIBUTES (CHARSET_TYPE_94X94, c, - *flags & CODING_STATE_R2L ? - CHARSET_RIGHT_TO_LEFT : - CHARSET_LEFT_TO_RIGHT); - reg = 0; - goto designated; - } - return 0; - - default: - { - int type =-1; - - if (c < '0' || c > '~') - return 0; /* bad final byte */ - - if (iso->esc >= ISO_ESC_2_8 && - iso->esc <= ISO_ESC_2_15) - { - type = ((iso->esc >= ISO_ESC_2_12) ? - CHARSET_TYPE_96 : CHARSET_TYPE_94); - reg = (iso->esc - ISO_ESC_2_8) & 3; - } - else if (iso->esc >= ISO_ESC_2_4_8 && - iso->esc <= ISO_ESC_2_4_15) - { - type = ((iso->esc >= ISO_ESC_2_4_12) ? - CHARSET_TYPE_96X96 : CHARSET_TYPE_94X94); - reg = (iso->esc - ISO_ESC_2_4_8) & 3; - } - else - { - /* Can this ever be reached? -slb */ - abort(); - return 0; - } - - cs = CHARSET_BY_ATTRIBUTES (type, c, - *flags & CODING_STATE_R2L ? - CHARSET_RIGHT_TO_LEFT : - CHARSET_LEFT_TO_RIGHT); - goto designated; - } - } - - not_done: - iso->esc_bytes[iso->esc_bytes_index++] = (unsigned char) c; - return -1; - - single_shift: - if (check_invalid_charsets && !CHARSETP (iso->charset[reg])) - /* can't invoke something that ain't there. */ - return 0; - iso->esc = ISO_ESC_SINGLE_SHIFT; - *flags &= CODING_STATE_ISO2022_LOCK; - if (reg == 2) - *flags |= CODING_STATE_SS2; - else - *flags |= CODING_STATE_SS3; - return 1; - - locking_shift: - if (check_invalid_charsets && - !CHARSETP (iso->charset[reg])) - /* can't invoke something that ain't there. */ - return 0; - if (half) - iso->register_right = reg; - else - iso->register_left = reg; - *flags &= CODING_STATE_ISO2022_LOCK; - iso->esc = ISO_ESC_LOCKING_SHIFT; - return 1; - - designated: - if (NILP (cs) && check_invalid_charsets) - { - iso->invalid_designated[reg] = 1; - iso->charset[reg] = Vcharset_ascii; - iso->esc = ISO_ESC_DESIGNATE; - *flags &= CODING_STATE_ISO2022_LOCK; - iso->output_literally = 1; - if (iso->switched_dir_and_no_valid_charset_yet) - { - /* We encountered a switch-direction followed by an - invalid designation. Ensure that the switch-direction - gets outputted; otherwise it will probably get eaten - when the text is written out again. */ - iso->switched_dir_and_no_valid_charset_yet = 0; - iso->output_direction_sequence = 1; - /* And make sure that the switch-dir going the other - way gets outputted, as well. */ - iso->invalid_switch_dir = 1; - } - return 1; - } - /* This function is called with CODESYS equal to nil when - doing coding-system detection. */ - if (!NILP (codesys)) - { - charset_conversion_spec_dynarr *dyn = - XCODING_SYSTEM (codesys)->iso2022.input_conv; - - if (dyn) - { - int i; - - for (i = 0; i < Dynarr_length (dyn); i++) - { - struct charset_conversion_spec *spec = Dynarr_atp (dyn, i); - if (EQ (cs, spec->from_charset)) - cs = spec->to_charset; - } - } - } - - iso->charset[reg] = cs; - iso->esc = ISO_ESC_DESIGNATE; - *flags &= CODING_STATE_ISO2022_LOCK; - if (iso->invalid_designated[reg]) - { - iso->invalid_designated[reg] = 0; - iso->output_literally = 1; - } - if (iso->switched_dir_and_no_valid_charset_yet) - iso->switched_dir_and_no_valid_charset_yet = 0; - return 1; -} - -static int -detect_coding_iso2022 (struct detection_state *st, const Extbyte *src, Bytecount n) -{ - int mask; - - /* #### There are serious deficiencies in the recognition mechanism - here. This needs to be much smarter if it's going to cut it. - The sequence "\xff\x0f" is currently detected as LOCK_SHIFT while - it should be detected as Latin-1. - All the ISO2022 stuff in this file should be synced up with the - code from FSF Emacs-20.4, in which Mule should be more or less stable. - Perhaps we should wait till R2L works in FSF Emacs? */ - - if (!st->iso2022.initted) - { - reset_iso2022 (Qnil, &st->iso2022.iso); - st->iso2022.mask = (CODING_CATEGORY_ISO_7_MASK | - CODING_CATEGORY_ISO_8_DESIGNATE_MASK | - CODING_CATEGORY_ISO_8_1_MASK | - CODING_CATEGORY_ISO_8_2_MASK | - CODING_CATEGORY_ISO_LOCK_SHIFT_MASK); - st->iso2022.flags = 0; - st->iso2022.high_byte_count = 0; - st->iso2022.saw_single_shift = 0; - st->iso2022.initted = 1; - } - - mask = st->iso2022.mask; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - if (c >= 0xA0) - { - mask &= ~CODING_CATEGORY_ISO_7_MASK; - st->iso2022.high_byte_count++; - } - else - { - if (st->iso2022.high_byte_count && !st->iso2022.saw_single_shift) - { - if (st->iso2022.high_byte_count & 1) - /* odd number of high bytes; assume not iso-8-2 */ - mask &= ~CODING_CATEGORY_ISO_8_2_MASK; - } - st->iso2022.high_byte_count = 0; - st->iso2022.saw_single_shift = 0; - if (c > 0x80) - mask &= ~CODING_CATEGORY_ISO_7_MASK; - } - if (!(st->iso2022.flags & CODING_STATE_ESCAPE) - && (BYTE_C0_P (c) || BYTE_C1_P (c))) - { /* control chars */ - switch (c) - { - /* Allow and ignore control characters that you might - reasonably see in a text file */ - case '\r': - case '\n': - case '\t': - case 7: /* bell */ - case 8: /* backspace */ - case 11: /* vertical tab */ - case 12: /* form feed */ - case 26: /* MS-DOS C-z junk */ - case 31: /* '^_' -- for info */ - goto label_continue_loop; - - default: - break; - } - } - - if ((st->iso2022.flags & CODING_STATE_ESCAPE) || BYTE_C0_P (c) - || BYTE_C1_P (c)) - { - if (parse_iso2022_esc (Qnil, &st->iso2022.iso, c, - &st->iso2022.flags, 0)) - { - switch (st->iso2022.iso.esc) - { - case ISO_ESC_DESIGNATE: - mask &= ~CODING_CATEGORY_ISO_8_1_MASK; - mask &= ~CODING_CATEGORY_ISO_8_2_MASK; - break; - case ISO_ESC_LOCKING_SHIFT: - mask = CODING_CATEGORY_ISO_LOCK_SHIFT_MASK; - goto ran_out_of_chars; - case ISO_ESC_SINGLE_SHIFT: - mask &= ~CODING_CATEGORY_ISO_8_DESIGNATE_MASK; - st->iso2022.saw_single_shift = 1; - break; - default: - break; - } - } - else - { - mask = 0; - goto ran_out_of_chars; - } - } - label_continue_loop:; - } - - ran_out_of_chars: - - return mask; -} - -static int -postprocess_iso2022_mask (int mask) -{ - /* #### kind of cheesy */ - /* If seven-bit ISO is allowed, then assume that the encoding is - entirely seven-bit and turn off the eight-bit ones. */ - if (mask & CODING_CATEGORY_ISO_7_MASK) - mask &= ~ (CODING_CATEGORY_ISO_8_DESIGNATE_MASK | - CODING_CATEGORY_ISO_8_1_MASK | - CODING_CATEGORY_ISO_8_2_MASK); - return mask; -} - -/* If FLAGS is a null pointer or specifies right-to-left motion, - output a switch-dir-to-left-to-right sequence to DST. - Also update FLAGS if it is not a null pointer. - If INTERNAL_P is set, we are outputting in internal format and - need to handle the CSI differently. */ - -static void -restore_left_to_right_direction (Lisp_Coding_System *codesys, - unsigned_char_dynarr *dst, - unsigned int *flags, - int internal_p) -{ - if (!flags || (*flags & CODING_STATE_R2L)) - { - if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) - { - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, '['); - } - else if (internal_p) - DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst); - else - Dynarr_add (dst, ISO_CODE_CSI); - Dynarr_add (dst, '0'); - Dynarr_add (dst, ']'); - if (flags) - *flags &= ~CODING_STATE_R2L; - } -} - -/* If FLAGS is a null pointer or specifies a direction different from - DIRECTION (which should be either CHARSET_RIGHT_TO_LEFT or - CHARSET_LEFT_TO_RIGHT), output the appropriate switch-dir escape - sequence to DST. Also update FLAGS if it is not a null pointer. - If INTERNAL_P is set, we are outputting in internal format and - need to handle the CSI differently. */ - -static void -ensure_correct_direction (int direction, Lisp_Coding_System *codesys, - unsigned_char_dynarr *dst, unsigned int *flags, - int internal_p) -{ - if ((!flags || (*flags & CODING_STATE_R2L)) && - direction == CHARSET_LEFT_TO_RIGHT) - restore_left_to_right_direction (codesys, dst, flags, internal_p); - else if (!CODING_SYSTEM_ISO2022_NO_ISO6429 (codesys) - && (!flags || !(*flags & CODING_STATE_R2L)) && - direction == CHARSET_RIGHT_TO_LEFT) - { - if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) - { - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, '['); - } - else if (internal_p) - DECODE_ADD_BINARY_CHAR (ISO_CODE_CSI, dst); - else - Dynarr_add (dst, ISO_CODE_CSI); - Dynarr_add (dst, '2'); - Dynarr_add (dst, ']'); - if (flags) - *flags |= CODING_STATE_R2L; - } -} - -/* Convert ISO2022-format data to internal format. */ - -static void -decode_coding_iso2022 (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = str->eol_type; -#ifdef ENABLE_COMPOSITE_CHARS - unsigned_char_dynarr *real_dst = dst; -#endif - Lisp_Object coding_system; - - XSETCODING_SYSTEM (coding_system, str->codesys); - -#ifdef ENABLE_COMPOSITE_CHARS - if (flags & CODING_STATE_COMPOSITE) - dst = str->iso2022.composite_chars; -#endif /* ENABLE_COMPOSITE_CHARS */ - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - if (flags & CODING_STATE_ESCAPE) - { /* Within ESC sequence */ - int retval = parse_iso2022_esc (coding_system, &str->iso2022, - c, &flags, 1); - - if (retval) - { - switch (str->iso2022.esc) - { -#ifdef ENABLE_COMPOSITE_CHARS - case ISO_ESC_START_COMPOSITE: - if (str->iso2022.composite_chars) - Dynarr_reset (str->iso2022.composite_chars); - else - str->iso2022.composite_chars = Dynarr_new (unsigned_char); - dst = str->iso2022.composite_chars; - break; - case ISO_ESC_END_COMPOSITE: - { - Intbyte comstr[MAX_EMCHAR_LEN]; - Bytecount len; - Emchar emch = lookup_composite_char (Dynarr_atp (dst, 0), - Dynarr_length (dst)); - dst = real_dst; - len = set_charptr_emchar (comstr, emch); - Dynarr_add_many (dst, comstr, len); - break; - } -#endif /* ENABLE_COMPOSITE_CHARS */ - - case ISO_ESC_LITERAL: - DECODE_ADD_BINARY_CHAR (c, dst); - break; - - default: - /* Everything else handled already */ - break; - } - } - - /* Attempted error recovery. */ - if (str->iso2022.output_direction_sequence) - ensure_correct_direction (flags & CODING_STATE_R2L ? - CHARSET_RIGHT_TO_LEFT : - CHARSET_LEFT_TO_RIGHT, - str->codesys, dst, 0, 1); - /* More error recovery. */ - if (!retval || str->iso2022.output_literally) - { - /* Output the (possibly invalid) sequence */ - int i; - for (i = 0; i < str->iso2022.esc_bytes_index; i++) - DECODE_ADD_BINARY_CHAR (str->iso2022.esc_bytes[i], dst); - flags &= CODING_STATE_ISO2022_LOCK; - if (!retval) - n++, src--;/* Repeat the loop with the same character. */ - else - { - /* No sense in reprocessing the final byte of the - escape sequence; it could mess things up anyway. - Just add it now. */ - DECODE_ADD_BINARY_CHAR (c, dst); - } - } - ch = 0; - } - else if (BYTE_C0_P (c) || BYTE_C1_P (c)) - { /* Control characters */ - - /***** Error-handling *****/ - - /* If we were in the middle of a character, dump out the - partial character. */ - DECODE_OUTPUT_PARTIAL_CHAR (ch); - - /* If we just saw a single-shift character, dump it out. - This may dump out the wrong sort of single-shift character, - but least it will give an indication that something went - wrong. */ - if (flags & CODING_STATE_SS2) - { - DECODE_ADD_BINARY_CHAR (ISO_CODE_SS2, dst); - flags &= ~CODING_STATE_SS2; - } - if (flags & CODING_STATE_SS3) - { - DECODE_ADD_BINARY_CHAR (ISO_CODE_SS3, dst); - flags &= ~CODING_STATE_SS3; - } - - /***** Now handle the control characters. *****/ - - /* Handle CR/LF */ - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - - flags &= CODING_STATE_ISO2022_LOCK; - - if (!parse_iso2022_esc (coding_system, &str->iso2022, c, &flags, 1)) - DECODE_ADD_BINARY_CHAR (c, dst); - } - else - { /* Graphic characters */ - Lisp_Object charset; - int lb; - int reg; - - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - - /* Now determine the charset. */ - reg = ((flags & CODING_STATE_SS2) ? 2 - : (flags & CODING_STATE_SS3) ? 3 - : !BYTE_ASCII_P (c) ? str->iso2022.register_right - : str->iso2022.register_left); - charset = str->iso2022.charset[reg]; - - /* Error checking: */ - if (! CHARSETP (charset) - || str->iso2022.invalid_designated[reg] - || (((c & 0x7F) == ' ' || (c & 0x7F) == ISO_CODE_DEL) - && XCHARSET_CHARS (charset) == 94)) - /* Mrmph. We are trying to invoke a register that has no - or an invalid charset in it, or trying to add a character - outside the range of the charset. Insert that char literally - to preserve it for the output. */ - { - DECODE_OUTPUT_PARTIAL_CHAR (ch); - DECODE_ADD_BINARY_CHAR (c, dst); - } - - else - { - /* Things are probably hunky-dorey. */ - - /* Fetch reverse charset, maybe. */ - if (((flags & CODING_STATE_R2L) && - XCHARSET_DIRECTION (charset) == CHARSET_LEFT_TO_RIGHT) - || - (!(flags & CODING_STATE_R2L) && - XCHARSET_DIRECTION (charset) == CHARSET_RIGHT_TO_LEFT)) - { - Lisp_Object new_charset = - XCHARSET_REVERSE_DIRECTION_CHARSET (charset); - if (!NILP (new_charset)) - charset = new_charset; - } - - lb = XCHARSET_LEADING_BYTE (charset); - switch (XCHARSET_REP_BYTES (charset)) - { - case 1: /* ASCII */ - DECODE_OUTPUT_PARTIAL_CHAR (ch); - Dynarr_add (dst, c & 0x7F); - break; - - case 2: /* one-byte official */ - DECODE_OUTPUT_PARTIAL_CHAR (ch); - Dynarr_add (dst, lb); - Dynarr_add (dst, c | 0x80); - break; - - case 3: /* one-byte private or two-byte official */ - if (XCHARSET_PRIVATE_P (charset)) - { - DECODE_OUTPUT_PARTIAL_CHAR (ch); - Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_1); - Dynarr_add (dst, lb); - Dynarr_add (dst, c | 0x80); - } - else - { - if (ch) - { - Dynarr_add (dst, lb); - Dynarr_add (dst, ch | 0x80); - Dynarr_add (dst, c | 0x80); - ch = 0; - } - else - ch = c; - } - break; - - default: /* two-byte private */ - if (ch) - { - Dynarr_add (dst, PRE_LEADING_BYTE_PRIVATE_2); - Dynarr_add (dst, lb); - Dynarr_add (dst, ch | 0x80); - Dynarr_add (dst, c | 0x80); - ch = 0; - } - else - ch = c; - } - } - - if (!ch) - flags &= CODING_STATE_ISO2022_LOCK; - } - - label_continue_loop:; - } - - if (flags & CODING_STATE_END) - DECODE_OUTPUT_PARTIAL_CHAR (ch); - - str->flags = flags; - str->ch = ch; -} - - -/***** ISO2022 encoder *****/ - -/* Designate CHARSET into register REG. */ - -static void -iso2022_designate (Lisp_Object charset, unsigned char reg, - struct encoding_stream *str, unsigned_char_dynarr *dst) -{ - static const char inter94[] = "()*+"; - static const char inter96[] = ",-./"; - int type; - unsigned char final; - Lisp_Object old_charset = str->iso2022.charset[reg]; - - str->iso2022.charset[reg] = charset; - if (!CHARSETP (charset)) - /* charset might be an initial nil or t. */ - return; - type = XCHARSET_TYPE (charset); - final = XCHARSET_FINAL (charset); - if (!str->iso2022.force_charset_on_output[reg] && - CHARSETP (old_charset) && - XCHARSET_TYPE (old_charset) == type && - XCHARSET_FINAL (old_charset) == final) - return; - - str->iso2022.force_charset_on_output[reg] = 0; - - { - charset_conversion_spec_dynarr *dyn = - str->codesys->iso2022.output_conv; - - if (dyn) - { - int i; - - for (i = 0; i < Dynarr_length (dyn); i++) - { - struct charset_conversion_spec *spec = Dynarr_atp (dyn, i); - if (EQ (charset, spec->from_charset)) - charset = spec->to_charset; - } - } - } - - Dynarr_add (dst, ISO_CODE_ESC); - switch (type) - { - case CHARSET_TYPE_94: - Dynarr_add (dst, inter94[reg]); - break; - case CHARSET_TYPE_96: - Dynarr_add (dst, inter96[reg]); - break; - case CHARSET_TYPE_94X94: - Dynarr_add (dst, '$'); - if (reg != 0 - || !(CODING_SYSTEM_ISO2022_SHORT (str->codesys)) - || final < '@' - || final > 'B') - Dynarr_add (dst, inter94[reg]); - break; - case CHARSET_TYPE_96X96: - Dynarr_add (dst, '$'); - Dynarr_add (dst, inter96[reg]); - break; - } - Dynarr_add (dst, final); -} - -static void -ensure_normal_shift (struct encoding_stream *str, unsigned_char_dynarr *dst) -{ - if (str->iso2022.register_left != 0) - { - Dynarr_add (dst, ISO_CODE_SI); - str->iso2022.register_left = 0; - } -} - -static void -ensure_shift_out (struct encoding_stream *str, unsigned_char_dynarr *dst) -{ - if (str->iso2022.register_left != 1) - { - Dynarr_add (dst, ISO_CODE_SO); - str->iso2022.register_left = 1; - } -} - -/* Convert internally-formatted data to ISO2022 format. */ - -static void -encode_coding_iso2022 (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - unsigned char charmask, c; - unsigned char char_boundary; - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - Lisp_Coding_System *codesys = str->codesys; - eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); - int i; - Lisp_Object charset; - int half; - -#ifdef ENABLE_COMPOSITE_CHARS - /* flags for handling composite chars. We do a little switcharoo - on the source while we're outputting the composite char. */ - Bytecount saved_n = 0; - const unsigned char *saved_src = NULL; - int in_composite = 0; -#endif /* ENABLE_COMPOSITE_CHARS */ - - char_boundary = str->iso2022.current_char_boundary; - charset = str->iso2022.current_charset; - half = str->iso2022.current_half; - -#ifdef ENABLE_COMPOSITE_CHARS - back_to_square_n: -#endif - while (n--) - { - c = *src++; - - if (BYTE_ASCII_P (c)) - { /* Processing ASCII character */ - ch = 0; - - restore_left_to_right_direction (codesys, dst, &flags, 0); - - /* Make sure G0 contains ASCII */ - if ((c > ' ' && c < ISO_CODE_DEL) || - !CODING_SYSTEM_ISO2022_NO_ASCII_CNTL (codesys)) - { - ensure_normal_shift (str, dst); - iso2022_designate (Vcharset_ascii, 0, str, dst); - } - - /* If necessary, restore everything to the default state - at end-of-line */ - if (c == '\n' && - !(CODING_SYSTEM_ISO2022_NO_ASCII_EOL (codesys))) - { - restore_left_to_right_direction (codesys, dst, &flags, 0); - - ensure_normal_shift (str, dst); - - for (i = 0; i < 4; i++) - { - Lisp_Object initial_charset = - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i); - iso2022_designate (initial_charset, i, str, dst); - } - } - if (c == '\n') - { - if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) - Dynarr_add (dst, '\r'); - if (eol_type != EOL_CR) - Dynarr_add (dst, c); - } - else - { - if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) - && fit_to_be_escape_quoted (c)) - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, c); - } - char_boundary = 1; - } - - else if (INTBYTE_LEADING_BYTE_P (c) || INTBYTE_LEADING_BYTE_P (ch)) - { /* Processing Leading Byte */ - ch = 0; - charset = CHARSET_BY_LEADING_BYTE (c); - if (LEADING_BYTE_PREFIX_P(c)) - ch = c; - else if (!EQ (charset, Vcharset_control_1) -#ifdef ENABLE_COMPOSITE_CHARS - && !EQ (charset, Vcharset_composite) -#endif - ) - { - int reg; - - ensure_correct_direction (XCHARSET_DIRECTION (charset), - codesys, dst, &flags, 0); - - /* Now determine which register to use. */ - reg = -1; - for (i = 0; i < 4; i++) - { - if (EQ (charset, str->iso2022.charset[i]) || - EQ (charset, - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i))) - { - reg = i; - break; - } - } - - if (reg == -1) - { - if (XCHARSET_GRAPHIC (charset) != 0) - { - if (!NILP (str->iso2022.charset[1]) && - (!CODING_SYSTEM_ISO2022_SEVEN (codesys) || - CODING_SYSTEM_ISO2022_LOCK_SHIFT (codesys))) - reg = 1; - else if (!NILP (str->iso2022.charset[2])) - reg = 2; - else if (!NILP (str->iso2022.charset[3])) - reg = 3; - else - reg = 0; - } - else - reg = 0; - } - - iso2022_designate (charset, reg, str, dst); - - /* Now invoke that register. */ - switch (reg) - { - case 0: - ensure_normal_shift (str, dst); - half = 0; - break; - - case 1: - if (CODING_SYSTEM_ISO2022_SEVEN (codesys)) - { - ensure_shift_out (str, dst); - half = 0; - } - else - half = 1; - break; - - case 2: - if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys)) - { - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, 'N'); - half = 0; - } - else - { - Dynarr_add (dst, ISO_CODE_SS2); - half = 1; - } - break; - - case 3: - if (CODING_SYSTEM_ISO2022_SEVEN (str->codesys)) - { - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, 'O'); - half = 0; - } - else - { - Dynarr_add (dst, ISO_CODE_SS3); - half = 1; - } - break; - - default: - abort (); - } - } - char_boundary = 0; - } - else - { /* Processing Non-ASCII character */ - charmask = (half == 0 ? 0x7F : 0xFF); - char_boundary = 1; - if (EQ (charset, Vcharset_control_1)) - { - if (CODING_SYSTEM_ISO2022_ESCAPE_QUOTED (codesys) - && fit_to_be_escape_quoted (c)) - Dynarr_add (dst, ISO_CODE_ESC); - /* you asked for it ... */ - Dynarr_add (dst, c - 0x20); - } - else - { - switch (XCHARSET_REP_BYTES (charset)) - { - case 2: - Dynarr_add (dst, c & charmask); - break; - case 3: - if (XCHARSET_PRIVATE_P (charset)) - { - Dynarr_add (dst, c & charmask); - ch = 0; - } - else if (ch) - { -#ifdef ENABLE_COMPOSITE_CHARS - if (EQ (charset, Vcharset_composite)) - { - if (in_composite) - { - /* #### Bother! We don't know how to - handle this yet. */ - Dynarr_add (dst, '~'); - } - else - { - Emchar emch = MAKE_CHAR (Vcharset_composite, - ch & 0x7F, c & 0x7F); - Lisp_Object lstr = composite_char_string (emch); - saved_n = n; - saved_src = src; - in_composite = 1; - src = XSTRING_DATA (lstr); - n = XSTRING_LENGTH (lstr); - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, '0'); /* start composing */ - } - } - else -#endif /* ENABLE_COMPOSITE_CHARS */ - { - Dynarr_add (dst, ch & charmask); - Dynarr_add (dst, c & charmask); - } - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - case 4: - if (ch) - { - Dynarr_add (dst, ch & charmask); - Dynarr_add (dst, c & charmask); - ch = 0; - } - else - { - ch = c; - char_boundary = 0; - } - break; - default: - abort (); - } - } - } - } - -#ifdef ENABLE_COMPOSITE_CHARS - if (in_composite) - { - n = saved_n; - src = saved_src; - in_composite = 0; - Dynarr_add (dst, ISO_CODE_ESC); - Dynarr_add (dst, '1'); /* end composing */ - goto back_to_square_n; /* Wheeeeeeeee ..... */ - } -#endif /* ENABLE_COMPOSITE_CHARS */ - - if (char_boundary && flags & CODING_STATE_END) - { - restore_left_to_right_direction (codesys, dst, &flags, 0); - ensure_normal_shift (str, dst); - for (i = 0; i < 4; i++) - { - Lisp_Object initial_charset = - CODING_SYSTEM_ISO2022_INITIAL_CHARSET (codesys, i); - iso2022_designate (initial_charset, i, str, dst); - } - } - - str->flags = flags; - str->ch = ch; - str->iso2022.current_char_boundary = char_boundary; - str->iso2022.current_charset = charset; - str->iso2022.current_half = half; - - /* Verbum caro factum est! */ -} -#endif /* MULE */ - -/************************************************************************/ -/* No-conversion methods */ -/************************************************************************/ - -/* This is used when reading in "binary" files -- i.e. files that may - contain all 256 possible byte values and that are not to be - interpreted as being in any particular decoding. */ -static void -decode_coding_no_conversion (Lstream *decoding, const Extbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - struct decoding_stream *str = DECODING_STREAM_DATA (decoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = str->eol_type; - - while (n--) - { - unsigned char c = *(unsigned char *)src++; - - DECODE_HANDLE_EOL_TYPE (eol_type, c, flags, dst); - DECODE_ADD_BINARY_CHAR (c, dst); - label_continue_loop:; - } - - DECODE_HANDLE_END_OF_CONVERSION (flags, ch, dst); - - str->flags = flags; - str->ch = ch; -} - -static void -encode_coding_no_conversion (Lstream *encoding, const Intbyte *src, - unsigned_char_dynarr *dst, Bytecount n) -{ - unsigned char c; - struct encoding_stream *str = ENCODING_STREAM_DATA (encoding); - unsigned int flags = str->flags; - unsigned int ch = str->ch; - eol_type_t eol_type = CODING_SYSTEM_EOL_TYPE (str->codesys); - - while (n--) - { - c = *src++; - if (c == '\n') - { - if (eol_type != EOL_LF && eol_type != EOL_AUTODETECT) - Dynarr_add (dst, '\r'); - if (eol_type != EOL_CR) - Dynarr_add (dst, '\n'); - ch = 0; - } - else if (BYTE_ASCII_P (c)) - { - assert (ch == 0); - Dynarr_add (dst, c); - } - else if (INTBYTE_LEADING_BYTE_P (c)) - { - assert (ch == 0); - if (c == LEADING_BYTE_LATIN_ISO8859_1 || - c == LEADING_BYTE_CONTROL_1) - ch = c; - else - Dynarr_add (dst, '~'); /* untranslatable character */ - } - else - { - if (ch == LEADING_BYTE_LATIN_ISO8859_1) - Dynarr_add (dst, c); - else if (ch == LEADING_BYTE_CONTROL_1) - { - assert (c < 0xC0); - Dynarr_add (dst, c - 0x20); - } - /* else it should be the second or third byte of an - untranslatable character, so ignore it */ - ch = 0; - } - } - - str->flags = flags; - str->ch = ch; -} +#endif /* HAVE_ZLIB */ - /************************************************************************/ /* Initialization */ /************************************************************************/ @@ -5587,6 +4453,8 @@ { INIT_LRECORD_IMPLEMENTATION (coding_system); + DEFSUBR (Fvalid_coding_system_type_p); + DEFSUBR (Fcoding_system_type_list); DEFSUBR (Fcoding_system_p); DEFSUBR (Ffind_coding_system); DEFSUBR (Fget_coding_system); @@ -5599,9 +4467,11 @@ DEFSUBR (Fcoding_system_aliasee); DEFSUBR (Fdefine_coding_system_alias); DEFSUBR (Fsubsidiary_coding_system); + DEFSUBR (Fcoding_system_base); + DEFSUBR (Fcoding_system_used_for_io); DEFSUBR (Fcoding_system_type); - DEFSUBR (Fcoding_system_doc_string); + DEFSUBR (Fcoding_system_description); DEFSUBR (Fcoding_system_property); DEFSUBR (Fcoding_category_list); @@ -5613,120 +4483,209 @@ DEFSUBR (Fdetect_coding_region); DEFSUBR (Fdecode_coding_region); DEFSUBR (Fencode_coding_region); -#ifdef MULE - DEFSUBR (Fdecode_shift_jis_char); - DEFSUBR (Fencode_shift_jis_char); - DEFSUBR (Fdecode_big5_char); - DEFSUBR (Fencode_big5_char); - DEFSUBR (Fset_ucs_char); - DEFSUBR (Fucs_char); - DEFSUBR (Fset_char_ucs); - DEFSUBR (Fchar_ucs); -#endif /* MULE */ DEFSYMBOL_MULTIWORD_PREDICATE (Qcoding_systemp); DEFSYMBOL (Qno_conversion); + DEFSYMBOL (Qconvert_eol); + DEFSYMBOL (Qconvert_eol_autodetect); + DEFSYMBOL (Qconvert_eol_lf); + DEFSYMBOL (Qconvert_eol_cr); + DEFSYMBOL (Qconvert_eol_crlf); DEFSYMBOL (Qraw_text); -#ifdef MULE - DEFSYMBOL (Qbig5); - DEFSYMBOL (Qshift_jis); - defsymbol (&Qucs4, "ucs-4"); - defsymbol (&Qutf8, "utf-8"); - DEFSYMBOL (Qccl); - DEFSYMBOL (Qiso2022); -#endif /* MULE */ + DEFSYMBOL (Qmnemonic); DEFSYMBOL (Qeol_type); DEFSYMBOL (Qpost_read_conversion); DEFSYMBOL (Qpre_write_conversion); + DEFSYMBOL (Qtranslation_table_for_decode); + DEFSYMBOL (Qtranslation_table_for_encode); + DEFSYMBOL (Qsafe_chars); + DEFSYMBOL (Qsafe_charsets); + DEFSYMBOL (Qmime_charset); + DEFSYMBOL (Qvalid_codes); + DEFSYMBOL (Qcr); DEFSYMBOL (Qlf); DEFSYMBOL (Qcrlf); DEFSYMBOL (Qeol_cr); DEFSYMBOL (Qeol_lf); DEFSYMBOL (Qeol_crlf); -#ifdef MULE - DEFSYMBOL (Qcharset_g0); - DEFSYMBOL (Qcharset_g1); - DEFSYMBOL (Qcharset_g2); - DEFSYMBOL (Qcharset_g3); - DEFSYMBOL (Qforce_g0_on_output); - DEFSYMBOL (Qforce_g1_on_output); - DEFSYMBOL (Qforce_g2_on_output); - DEFSYMBOL (Qforce_g3_on_output); - DEFSYMBOL (Qno_iso6429); - DEFSYMBOL (Qinput_charset_conversion); - DEFSYMBOL (Qoutput_charset_conversion); - - DEFSYMBOL (Qshort); - DEFSYMBOL (Qno_ascii_eol); - DEFSYMBOL (Qno_ascii_cntl); - DEFSYMBOL (Qseven); - DEFSYMBOL (Qlock_shift); - DEFSYMBOL (Qescape_quoted); -#endif /* MULE */ DEFSYMBOL (Qencode); DEFSYMBOL (Qdecode); -#ifdef MULE - defsymbol (&coding_category_symbol[CODING_CATEGORY_SHIFT_JIS], - "shift-jis"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_BIG5], - "big5"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_UCS4], - "ucs-4"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_UTF8], - "utf-8"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_7], - "iso-7"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_DESIGNATE], - "iso-8-designate"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_1], - "iso-8-1"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_8_2], - "iso-8-2"); - defsymbol (&coding_category_symbol[CODING_CATEGORY_ISO_LOCK_SHIFT], - "iso-lock-shift"); -#endif /* MULE */ - defsymbol (&coding_category_symbol[CODING_CATEGORY_NO_CONVERSION], - "no-conversion"); + DEFSYMBOL (Qnear_certainty); + DEFSYMBOL (Qquite_probable); + DEFSYMBOL (Qsomewhat_likely); + DEFSYMBOL (Qas_likely_as_unlikely); + DEFSYMBOL (Qsomewhat_unlikely); + DEFSYMBOL (Qquite_improbable); + DEFSYMBOL (Qnearly_impossible); + + DEFSYMBOL (Qdo_eol); + DEFSYMBOL (Qdo_coding); + + DEFSYMBOL (Qcanonicalize_after_coding); + + DEFSYMBOL (Qescape_quoted); + +#ifdef HAVE_ZLIB + DEFSYMBOL (Qgzip); +#endif + + /* WARNING: The existing categories are intimately tied to the function + `coding-system-category' in coding.el. If you change a category, or + change the layout of any coding system associated with a category, you + need to check that function and make sure it's written properly. */ + +#ifdef HAVE_DEFAULT_EOL_DETECTION + Fprovide (intern ("unix-default-eol-detection")); +#endif } void lstream_type_create_file_coding (void) { - LSTREAM_HAS_METHOD (decoding, reader); - LSTREAM_HAS_METHOD (decoding, writer); - LSTREAM_HAS_METHOD (decoding, rewinder); - LSTREAM_HAS_METHOD (decoding, seekable_p); - LSTREAM_HAS_METHOD (decoding, flusher); - LSTREAM_HAS_METHOD (decoding, closer); - LSTREAM_HAS_METHOD (decoding, marker); - - LSTREAM_HAS_METHOD (encoding, reader); - LSTREAM_HAS_METHOD (encoding, writer); - LSTREAM_HAS_METHOD (encoding, rewinder); - LSTREAM_HAS_METHOD (encoding, seekable_p); - LSTREAM_HAS_METHOD (encoding, flusher); - LSTREAM_HAS_METHOD (encoding, closer); - LSTREAM_HAS_METHOD (encoding, marker); + LSTREAM_HAS_METHOD (coding, reader); + LSTREAM_HAS_METHOD (coding, writer); + LSTREAM_HAS_METHOD (coding, rewinder); + LSTREAM_HAS_METHOD (coding, seekable_p); + LSTREAM_HAS_METHOD (coding, marker); + LSTREAM_HAS_METHOD (coding, flusher); + LSTREAM_HAS_METHOD (coding, closer); + LSTREAM_HAS_METHOD (coding, finalizer); +} + +void +coding_system_type_create (void) +{ + int i; + + staticpro (&Vcoding_system_hash_table); + Vcoding_system_hash_table = + make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ); + + the_coding_system_type_entry_dynarr = Dynarr_new (coding_system_type_entry); + dump_add_root_struct_ptr (&the_coding_system_type_entry_dynarr, + &csted_description); + + Vcoding_system_type_list = Qnil; + staticpro (&Vcoding_system_type_list); + + /* Initialize to something reasonable ... */ + for (i = 0; i < MAX_DETECTOR_CATEGORIES; i++) + { + coding_category_system[i] = Qnil; + dump_add_root_object (&coding_category_system[i]); + coding_category_by_priority[i] = i; + } + + dump_add_opaque (coding_category_by_priority, + sizeof (coding_category_by_priority)); + + all_coding_detectors = Dynarr_new2 (detector_dynarr, struct detector); + dump_add_root_struct_ptr (&all_coding_detectors, + &detector_dynarr_description); + + dump_add_opaque_int (&coding_system_tick); + dump_add_opaque_int (&coding_detector_count); + dump_add_opaque_int (&coding_detector_category_count); + + INITIALIZE_CODING_SYSTEM_TYPE (no_conversion, + "no-conversion-coding-system-p"); + CODING_SYSTEM_HAS_METHOD (no_conversion, convert); + + INITIALIZE_DETECTOR (no_conversion); + DETECTOR_HAS_METHOD (no_conversion, detect); + INITIALIZE_DETECTOR_CATEGORY (no_conversion, no_conversion); + + INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (convert_eol, + "convert-eol-coding-system-p"); + CODING_SYSTEM_HAS_METHOD (convert_eol, print); + CODING_SYSTEM_HAS_METHOD (convert_eol, convert); + CODING_SYSTEM_HAS_METHOD (convert_eol, getprop); + CODING_SYSTEM_HAS_METHOD (convert_eol, putprop); + CODING_SYSTEM_HAS_METHOD (convert_eol, conversion_end_type); + CODING_SYSTEM_HAS_METHOD (convert_eol, canonicalize_after_coding); + CODING_SYSTEM_HAS_METHOD (convert_eol, init_coding_stream); + + INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (undecided, + "undecided-coding-system-p"); + CODING_SYSTEM_HAS_METHOD (undecided, init); + CODING_SYSTEM_HAS_METHOD (undecided, mark); + CODING_SYSTEM_HAS_METHOD (undecided, print); + CODING_SYSTEM_HAS_METHOD (undecided, convert); + CODING_SYSTEM_HAS_METHOD (undecided, putprop); + CODING_SYSTEM_HAS_METHOD (undecided, getprop); + CODING_SYSTEM_HAS_METHOD (undecided, init_coding_stream); + CODING_SYSTEM_HAS_METHOD (undecided, rewind_coding_stream); + CODING_SYSTEM_HAS_METHOD (undecided, finalize_coding_stream); + CODING_SYSTEM_HAS_METHOD (undecided, mark_coding_stream); + CODING_SYSTEM_HAS_METHOD (undecided, canonicalize); + CODING_SYSTEM_HAS_METHOD (undecided, canonicalize_after_coding); + + INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (chain, "chain-coding-system-p"); + + CODING_SYSTEM_HAS_METHOD (chain, print); + CODING_SYSTEM_HAS_METHOD (chain, canonicalize); + CODING_SYSTEM_HAS_METHOD (chain, init); + CODING_SYSTEM_HAS_METHOD (chain, mark); + CODING_SYSTEM_HAS_METHOD (chain, mark_coding_stream); + CODING_SYSTEM_HAS_METHOD (chain, convert); + CODING_SYSTEM_HAS_METHOD (chain, rewind_coding_stream); + CODING_SYSTEM_HAS_METHOD (chain, finalize_coding_stream); + CODING_SYSTEM_HAS_METHOD (chain, finalize); + CODING_SYSTEM_HAS_METHOD (chain, putprop); + CODING_SYSTEM_HAS_METHOD (chain, getprop); + CODING_SYSTEM_HAS_METHOD (chain, conversion_end_type); + CODING_SYSTEM_HAS_METHOD (chain, canonicalize_after_coding); + +#ifdef DEBUG_XEMACS + INITIALIZE_CODING_SYSTEM_TYPE (internal, "internal-coding-system-p"); + CODING_SYSTEM_HAS_METHOD (internal, convert); +#endif + +#ifdef HAVE_ZLIB + INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (gzip, "gzip-coding-system-p"); + CODING_SYSTEM_HAS_METHOD (gzip, conversion_end_type); + CODING_SYSTEM_HAS_METHOD (gzip, convert); + CODING_SYSTEM_HAS_METHOD (gzip, init); + CODING_SYSTEM_HAS_METHOD (gzip, print); + CODING_SYSTEM_HAS_METHOD (gzip, init_coding_stream); + CODING_SYSTEM_HAS_METHOD (gzip, rewind_coding_stream); + CODING_SYSTEM_HAS_METHOD (gzip, putprop); + CODING_SYSTEM_HAS_METHOD (gzip, getprop); +#endif +} + +void +reinit_coding_system_type_create (void) +{ + REINITIALIZE_CODING_SYSTEM_TYPE (no_conversion); + REINITIALIZE_CODING_SYSTEM_TYPE (convert_eol); + REINITIALIZE_CODING_SYSTEM_TYPE (undecided); + REINITIALIZE_CODING_SYSTEM_TYPE (chain); +#if 0 + REINITIALIZE_CODING_SYSTEM_TYPE (text_file_wrapper); +#endif /* 0 */ +#ifdef DEBUG_XEMACS + REINITIALIZE_CODING_SYSTEM_TYPE (internal); +#endif +#ifdef HAVE_ZLIB + REINITIALIZE_CODING_SYSTEM_TYPE (gzip); +#endif +} + +void +reinit_vars_of_file_coding (void) +{ } void vars_of_file_coding (void) { - int i; - - fcd = xnew (struct file_coding_dump); - dump_add_root_struct_ptr (&fcd, &fcd_description); - - /* Initialize to something reasonable ... */ - for (i = 0; i < CODING_CATEGORY_LAST; i++) - { - fcd->coding_category_system[i] = Qnil; - fcd->coding_category_by_priority[i] = i; - } - + reinit_vars_of_file_coding (); + + /* We always have file-coding support */ Fprovide (intern ("file-coding")); DEFVAR_LISP ("keyboard-coding-system", &Vkeyboard_coding_system /* @@ -5768,93 +4727,167 @@ Vfile_name_coding_system = Qnil; DEFVAR_BOOL ("enable-multibyte-characters", &enable_multibyte_characters /* -Non-nil means the buffer contents are regarded as multi-byte form -of characters, not a binary code. This affects the display, file I/O, -and behaviors of various editing commands. - -Setting this to nil does not do anything. +Setting this has no effect. It is purely for FSF compatibility. */ ); enable_multibyte_characters = 1; + + Vchain_canonicalize_hash_table = + make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQUAL); + staticpro (&Vchain_canonicalize_hash_table); + +#ifdef DEBUG_XEMACS + DEFVAR_LISP ("debug-coding-detection", &Vdebug_coding_detection /* +If non-nil, display debug information about detection operations in progress. +Information is displayed on stderr. +*/ ); + Vdebug_coding_detection = Qnil; +#endif } void complex_vars_of_file_coding (void) { - staticpro (&Vcoding_system_hash_table); - Vcoding_system_hash_table = - make_lisp_hash_table (50, HASH_TABLE_NON_WEAK, HASH_TABLE_EQ); - - the_codesys_prop_dynarr = Dynarr_new (codesys_prop); - dump_add_root_struct_ptr (&the_codesys_prop_dynarr, &codesys_prop_dynarr_description); - -#define DEFINE_CODESYS_PROP(Prop_Type, Sym) do \ -{ \ - struct codesys_prop csp; \ - csp.sym = (Sym); \ - csp.prop_type = (Prop_Type); \ - Dynarr_add (the_codesys_prop_dynarr, csp); \ -} while (0) - - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qmnemonic); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_type); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_cr); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_crlf); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qeol_lf); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpost_read_conversion); - DEFINE_CODESYS_PROP (CODESYS_PROP_ALL_OK, Qpre_write_conversion); -#ifdef MULE - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g0); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g1); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g2); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qcharset_g3); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g0_on_output); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g1_on_output); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g2_on_output); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qforce_g3_on_output); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qshort); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_eol); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_ascii_cntl); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qseven); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qlock_shift); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qno_iso6429); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qescape_quoted); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qinput_charset_conversion); - DEFINE_CODESYS_PROP (CODESYS_PROP_ISO2022, Qoutput_charset_conversion); - - DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qencode); - DEFINE_CODESYS_PROP (CODESYS_PROP_CCL, Qdecode); -#endif /* MULE */ + Fmake_coding_system + (Qconvert_eol_cr, Qconvert_eol, + build_msg_string ("Convert CR to LF"), + nconc2 (list6 (Qdocumentation, + build_msg_string ( +"Converts CR (used to mark the end of a line on Macintosh systems) to LF\n" +"(used internally and under Unix to mark the end of a line)."), + Qmnemonic, build_string ("CR->LF"), + Qsubtype, Qcr), + /* VERY IMPORTANT! Tell make-coding-system not to generate + subsidiaries -- it needs the coding systems we're creating + to do so! */ + list2 (Qeol_type, Qlf))); + + Fmake_coding_system + (Qconvert_eol_lf, Qconvert_eol, + build_msg_string ("Convert LF to LF (do nothing)"), + nconc2 (list6 (Qdocumentation, + build_msg_string ( +"Do nothing."), + Qmnemonic, build_string ("LF->LF"), + Qsubtype, Qlf), + /* VERY IMPORTANT! Tell make-coding-system not to generate + subsidiaries -- it needs the coding systems we're creating + to do so! */ + list2 (Qeol_type, Qlf))); + + Fmake_coding_system + (Qconvert_eol_crlf, Qconvert_eol, + build_msg_string ("Convert CRLF to LF"), + nconc2 (list6 (Qdocumentation, + build_msg_string ( +"Converts CR+LF (used to mark the end of a line on Macintosh systems) to LF\n" +"(used internally and under Unix to mark the end of a line)."), + Qmnemonic, build_string ("CRLF->LF"), + Qsubtype, Qcrlf), + /* VERY IMPORTANT! Tell make-coding-system not to generate + subsidiaries -- it needs the coding systems we're creating + to do so! */ + list2 (Qeol_type, Qlf))); + + Fmake_coding_system + (Qconvert_eol_autodetect, Qconvert_eol, + build_msg_string ("Autodetect EOL type"), + nconc2 (list6 (Qdocumentation, + build_msg_string ( +"Autodetect the end-of-line type."), + Qmnemonic, build_string ("Auto-EOL"), + Qsubtype, Qautodetect), + /* VERY IMPORTANT! Tell make-coding-system not to generate + subsidiaries -- it needs the coding systems we're creating + to do so! */ + list2 (Qeol_type, Qlf))); + + Fmake_coding_system + (Qundecided, Qundecided, + build_msg_string ("Undecided (auto-detect)"), + nconc2 (list4 (Qdocumentation, + build_msg_string + ("Automatically detects the correct encoding."), + Qmnemonic, build_string ("Auto")), + list6 (Qdo_eol, Qt, Qdo_coding, Qt, + /* We do EOL detection ourselves so we don't need to be + wrapped in an EOL detector. (It doesn't actually hurt, + though, I don't think.) */ + Qeol_type, Qlf))); + + Fmake_coding_system + (intern ("undecided-dos"), Qundecided, + build_msg_string ("Undecided (auto-detect) (CRLF)"), + nconc2 (list4 (Qdocumentation, + build_msg_string + ("Automatically detects the correct encoding; EOL type of CRLF forced."), + Qmnemonic, build_string ("Auto")), + list4 (Qdo_coding, Qt, + Qeol_type, Qcrlf))); + + Fmake_coding_system + (intern ("undecided-unix"), Qundecided, + build_msg_string ("Undecided (auto-detect) (LF)"), + nconc2 (list4 (Qdocumentation, + build_msg_string + ("Automatically detects the correct encoding; EOL type of LF forced."), + Qmnemonic, build_string ("Auto")), + list4 (Qdo_coding, Qt, + Qeol_type, Qlf))); + + Fmake_coding_system + (intern ("undecided-mac"), Qundecided, + build_msg_string ("Undecided (auto-detect) (CR)"), + nconc2 (list4 (Qdocumentation, + build_msg_string + ("Automatically detects the correct encoding; EOL type of CR forced."), + Qmnemonic, build_string ("Auto")), + list4 (Qdo_coding, Qt, + Qeol_type, Qcr))); + /* Need to create this here or we're really screwed. */ Fmake_coding_system (Qraw_text, Qno_conversion, - build_string ("Raw text, which means it converts only line-break-codes."), - list2 (Qmnemonic, build_string ("Raw"))); + build_msg_string ("Raw Text"), + list4 (Qdocumentation, + build_msg_string ("Raw text converts only line-break codes, and acts otherwise like `binary'."), + Qmnemonic, build_string ("Raw"))); Fmake_coding_system (Qbinary, Qno_conversion, - build_string ("Binary, which means it does not convert anything."), - list4 (Qeol_type, Qlf, + build_msg_string ("Binary"), + list6 (Qdocumentation, + build_msg_string ( +"This coding system is as close as it comes to doing no conversion.\n" +"On input, each byte is converted directly into the character\n" +"with the corresponding code -- i.e. from the `ascii', `control-1',\n" +"or `latin-1' character sets. On output, these characters are\n" +"converted back to the corresponding bytes, and other characters\n" +"are converted to the default character, i.e. `~'."), + Qeol_type, Qlf, Qmnemonic, build_string ("Binary"))); - Fdefine_coding_system_alias (Qno_conversion, Qraw_text); - + /* Formerly aliased to raw-text! Completely bogus and not even the same + as FSF Emacs. */ + Fdefine_coding_system_alias (Qno_conversion, Qbinary); + Fdefine_coding_system_alias (intern ("no-conversion-unix"), + intern ("raw-text-unix")); + Fdefine_coding_system_alias (intern ("no-conversion-dos"), + intern ("raw-text-dos")); + Fdefine_coding_system_alias (intern ("no-conversion-mac"), + intern ("raw-text-mac")); + + /* These four below will get their defaults set correctly in + code-init.el. We init them now so we can handle stuff at dump + time before we get to code-init.el. */ Fdefine_coding_system_alias (Qfile_name, Qbinary); + Fdefine_coding_system_alias (Qnative, Qfile_name); Fdefine_coding_system_alias (Qterminal, Qbinary); Fdefine_coding_system_alias (Qkeyboard, Qbinary); + Fdefine_coding_system_alias (Qidentity, Qconvert_eol_lf); + /* Need this for bootstrapping */ - fcd->coding_category_system[CODING_CATEGORY_NO_CONVERSION] = + coding_category_system[detector_category_no_conversion] = Fget_coding_system (Qraw_text); - -#ifdef MULE - { - int i; - - for (i = 0; i < countof (fcd->ucs_to_mule_table); i++) - fcd->ucs_to_mule_table[i] = Qnil; - } - staticpro (&mule_to_ucs_table); - mule_to_ucs_table = Fmake_char_table (Qgeneric); -#endif /* MULE */ }