Mercurial > hg > xemacs-beta
comparison src/unicode.c @ 4690:257b468bf2ca
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we need such
functionality if we're going to have a reliable and portable
#'query-coding-region implementation. However, this change doesn't yet
provide #'query-coding-region for the mswindow-multibyte coding systems,
there should be no functional differences between an XEmacs with this change
and one without it.
src/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
This is necessary because there is no reasonable way to access the
corresponding mswindows-multibyte functionality from Lisp, and we
need such functionality if we're going to have a reliable and
portable #'query-coding-region implementation. However, this
change doesn't yet provide #'query-coding-region for the
mswindow-multibyte coding systems, there should be no functional
differences between an XEmacs with this change and one without it.
* mule-coding.c (struct fixed_width_coding_system):
Add a new coding system type, fixed_width, and implement it. It
uses the CCL infrastructure but has a much simpler creation API,
and its own query_method, formerly in lisp/mule/mule-coding.el.
* unicode.c:
Move the Unicode query method implementation here from
unicode.el.
* lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table
here.
* intl-win32.c (complex_vars_of_intl_win32):
Use Fmake_coding_system_internal, not Fmake_coding_system.
* general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence
here.
* file-coding.h (enum coding_system_variant):
Add fixed_width_coding_system here.
(struct coding_system_methods):
Add query_method and query_lstream_method to the coding system
methods.
Provide flags for the query methods.
Declare the default query method; initialise it correctly in
INITIALIZE_CODING_SYSTEM_TYPE.
* file-coding.c (default_query_method):
New function, the default query method for coding systems that do
not set it. Moved from coding.el.
(make_coding_system_1):
Accept new elements in PROPS in #'make-coding-system; aliases, a
list of aliases; safe-chars and safe-charsets (these were
previously accepted but not saved); and category.
(Fmake_coding_system_internal):
New function, what used to be #'make-coding-system--on Mule
builds, we've now moved some of the functionality of this to
Lisp.
(Fcoding_system_canonical_name_p):
Move this earlier in the file, since it's now called from within
make_coding_system_1.
(Fquery_coding_region):
Move the implementation of this here, from coding.el.
(complex_vars_of_file_coding):
Call Fmake_coding_system_internal, not Fmake_coding_system;
specify safe-charsets properties when we're a mule build.
* extents.h (mouse_highlight_priority, Fset_extent_priority,
Fset_extent_face, Fmap_extents):
Make these available to other C files.
lisp/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
Move the #'query-coding-region implementation to C.
* coding.el:
Consolidate code that depends on the presence or absence of Mule
at the end of this file.
(default-query-coding-region, query-coding-region):
Move these functions to C.
(default-query-coding-region-safe-charset-skip-chars-map):
Remove this variable, the corresponding C variable is
Vdefault_query_coding_region_chartab_cache in file-coding.c.
(query-coding-string): Update docstring to reflect actual multiple
values, be more careful about not modifying a range table that
we're currently mapping over.
(encode-coding-char): Make the implementation of this simpler.
(featurep 'mule): Autoload #'make-coding-system from
mule/make-coding-system.el if we're a mule build; provide an
appropriate compiler macro.
Do various non-mule compatibility things if we're not a mule
build.
* update-elc.el (additional-dump-dependencies):
Add mule/make-coding-system as a dump time dependency if we're a
mule build.
* unicode.el (ccl-encode-to-ucs-2):
(decode-char):
(encode-char):
Move these earlier in the file, for the sake of some byte compile
warnings.
(unicode-query-coding-region):
Move this to unicode.c
* mule/make-coding-system.el:
New file, not dumped. Contains the functionality to rework the
arguments necessary for fixed-width coding systems, and contains
the implementation of #'make-coding-system, which now calls
#'make-coding-system-internal.
* mule/vietnamese.el (viscii):
* mule/latin.el (iso-8859-2):
(windows-1250):
(iso-8859-3):
(iso-8859-4):
(iso-8859-14):
(iso-8859-15):
(iso-8859-16):
(iso-8859-9):
(macintosh):
(windows-1252):
* mule/hebrew.el (iso-8859-8):
* mule/greek.el (iso-8859-7):
(windows-1253):
* mule/cyrillic.el (iso-8859-5):
(koi8-r):
(koi8-u):
(windows-1251):
(alternativnyj):
(koi8-ru):
(koi8-t):
(koi8-c):
(koi8-o):
* mule/arabic.el (iso-8859-6):
(windows-1256):
Move all these coding systems to being of type fixed-width, not of
type CCL. This allows the distinct query-coding-region for them to
be in C, something which will eventually allow us to implement
query-coding-region for the mswindows-multibyte coding systems.
* mule/general-late.el (posix-charset-to-coding-system-hash):
Document why we're pre-emptively persuading the byte compiler that
the ELC for this file needs to be written using escape-quoted.
Call #'set-unicode-query-skip-chars-args, now the Unicode
query-coding-region implementation is in C.
* mule/thai-xtis.el (tis-620):
Don't bother checking whether we're XEmacs or not here.
* mule/mule-coding.el:
Move the eight bit fixed-width functionality from this file to
make-coding-system.el.
tests/ChangeLog addition:
2009-09-19 Aidan Kehoe <kehoea@parhasard.net>
* automated/mule-tests.el:
Check a coding system's type, not an 8-bit-fixed property, for
whether that coding system should be treated as a fixed-width
coding system.
* automated/query-coding-tests.el:
Don't test the query coding functionality for mswindows-multibyte
coding systems, it's not yet implemented.
author | Aidan Kehoe <kehoea@parhasard.net> |
---|---|
date | Sat, 19 Sep 2009 22:53:13 +0100 |
parents | 7e54adf407a1 |
children | b9aaf2a18957 |
comparison
equal
deleted
inserted
replaced
4689:0636c6ccb430 | 4690:257b468bf2ca |
---|---|
38 #include "lisp.h" | 38 #include "lisp.h" |
39 | 39 |
40 #include "charset.h" | 40 #include "charset.h" |
41 #include "file-coding.h" | 41 #include "file-coding.h" |
42 #include "opaque.h" | 42 #include "opaque.h" |
43 | |
44 #include "buffer.h" | |
45 #include "rangetab.h" | |
46 #include "extents.h" | |
43 | 47 |
44 #include "sysfile.h" | 48 #include "sysfile.h" |
45 | 49 |
46 /* For more info about how Unicode works under Windows, see intl-win32.c. */ | 50 /* For more info about how Unicode works under Windows, see intl-win32.c. */ |
47 | 51 |
190 | 194 |
191 Lisp_Object Qutf_16_little_endian, Qutf_16_bom; | 195 Lisp_Object Qutf_16_little_endian, Qutf_16_bom; |
192 Lisp_Object Qutf_16_little_endian_bom; | 196 Lisp_Object Qutf_16_little_endian_bom; |
193 | 197 |
194 Lisp_Object Qutf_8_bom; | 198 Lisp_Object Qutf_8_bom; |
199 | |
200 #ifdef MULE | |
201 /* These range tables are not directly accessible from Lisp: */ | |
202 static Lisp_Object Vunicode_invalid_and_query_skip_chars; | |
203 static Lisp_Object Vutf_8_invalid_and_query_skip_chars; | |
204 static Lisp_Object Vunicode_query_skip_chars; | |
205 | |
206 static Lisp_Object Vunicode_query_string, Vunicode_invalid_string, | |
207 Vutf_8_invalid_string; | |
208 #endif /* MULE */ | |
195 | 209 |
196 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this | 210 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this |
197 algorithm. | 211 algorithm. |
198 | 212 |
199 (They also give another, really verbose one, as part of their explanation | 213 (They also give another, really verbose one, as part of their explanation |
2816 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs)) | 2830 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs)) |
2817 write_c_string (printcharfun, ", need-bom"); | 2831 write_c_string (printcharfun, ", need-bom"); |
2818 write_c_string (printcharfun, ")"); | 2832 write_c_string (printcharfun, ")"); |
2819 } | 2833 } |
2820 | 2834 |
2835 #ifdef MULE | |
2836 DEFUN ("set-unicode-query-skip-chars-args", Fset_unicode_query_skip_chars_args, | |
2837 3, 3, 0, /* | |
2838 Specify strings as matching characters known to Unicode coding systems. | |
2839 | |
2840 QUERY-STRING is a string matching characters that can unequivocally be | |
2841 encoded by the Unicode coding systems. | |
2842 | |
2843 INVALID-STRING is a string to match XEmacs characters that represent known | |
2844 octets on disk, but that are invalid sequences according to Unicode. | |
2845 | |
2846 UTF-8-INVALID-STRING is a more restrictive string to match XEmacs characters | |
2847 that are invalid UTF-8 octets. | |
2848 | |
2849 All three strings are in the format accepted by `skip-chars-forward'. | |
2850 */ | |
2851 (query_string, invalid_string, utf_8_invalid_string)) | |
2852 { | |
2853 CHECK_STRING (query_string); | |
2854 CHECK_STRING (invalid_string); | |
2855 CHECK_STRING (utf_8_invalid_string); | |
2856 | |
2857 Vunicode_query_string = query_string; | |
2858 Vunicode_invalid_string = invalid_string; | |
2859 Vutf_8_invalid_string = utf_8_invalid_string; | |
2860 | |
2861 return Qnil; | |
2862 } | |
2863 | |
2864 static void | |
2865 add_lisp_string_to_skip_chars_range (Lisp_Object string, Lisp_Object rtab, | |
2866 Lisp_Object value) | |
2867 { | |
2868 Ibyte *p, *pend; | |
2869 Ichar c; | |
2870 | |
2871 p = XSTRING_DATA (string); | |
2872 pend = p + XSTRING_LENGTH (string); | |
2873 | |
2874 while (p != pend) | |
2875 { | |
2876 c = itext_ichar (p); | |
2877 | |
2878 INC_IBYTEPTR (p); | |
2879 | |
2880 if (c == '\\') | |
2881 { | |
2882 if (p == pend) break; | |
2883 c = itext_ichar (p); | |
2884 INC_IBYTEPTR (p); | |
2885 } | |
2886 | |
2887 if (p != pend && *p == '-') | |
2888 { | |
2889 Ichar cend; | |
2890 | |
2891 /* Skip over the dash. */ | |
2892 p++; | |
2893 if (p == pend) break; | |
2894 cend = itext_ichar (p); | |
2895 | |
2896 Fput_range_table (make_int (c), make_int (cend), value, | |
2897 rtab); | |
2898 | |
2899 INC_IBYTEPTR (p); | |
2900 } | |
2901 else | |
2902 { | |
2903 Fput_range_table (make_int (c), make_int (c), value, rtab); | |
2904 } | |
2905 } | |
2906 } | |
2907 | |
2908 /* This function wouldn't be necessary if initialised range tables were | |
2909 dumped properly; see | |
2910 http://mid.gmane.org/18179.49815.622843.336527@parhasard.net . */ | |
2911 static void | |
2912 initialize_unicode_query_range_tables_from_strings (void) | |
2913 { | |
2914 CHECK_STRING (Vunicode_query_string); | |
2915 CHECK_STRING (Vunicode_invalid_string); | |
2916 CHECK_STRING (Vutf_8_invalid_string); | |
2917 | |
2918 Vunicode_query_skip_chars = Fmake_range_table (Qstart_closed_end_closed); | |
2919 | |
2920 add_lisp_string_to_skip_chars_range (Vunicode_query_string, | |
2921 Vunicode_query_skip_chars, | |
2922 Qsucceeded); | |
2923 | |
2924 Vunicode_invalid_and_query_skip_chars | |
2925 = Fcopy_range_table (Vunicode_query_skip_chars); | |
2926 | |
2927 add_lisp_string_to_skip_chars_range (Vunicode_invalid_string, | |
2928 Vunicode_invalid_and_query_skip_chars, | |
2929 Qinvalid_sequence); | |
2930 | |
2931 Vutf_8_invalid_and_query_skip_chars | |
2932 = Fcopy_range_table (Vunicode_query_skip_chars); | |
2933 | |
2934 add_lisp_string_to_skip_chars_range (Vutf_8_invalid_string, | |
2935 Vutf_8_invalid_and_query_skip_chars, | |
2936 Qinvalid_sequence); | |
2937 } | |
2938 | |
2939 static Lisp_Object | |
2940 unicode_query (Lisp_Object codesys, struct buffer *buf, Charbpos end, | |
2941 int flags) | |
2942 { | |
2943 Charbpos pos = BUF_PT (buf), fail_range_start, fail_range_end; | |
2944 Charbpos pos_byte = BYTE_BUF_PT (buf); | |
2945 Lisp_Object skip_chars_range_table, result = Qnil; | |
2946 enum query_coding_failure_reasons failed_reason, | |
2947 previous_failed_reason = query_coding_succeeded; | |
2948 int checked_unicode, invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START, | |
2949 invalid_upper_limit, unicode_type = XCODING_SYSTEM_UNICODE_TYPE (codesys); | |
2950 | |
2951 if (flags & QUERY_METHOD_HIGHLIGHT && | |
2952 /* If we're being called really early, live without highlights getting | |
2953 cleared properly: */ | |
2954 !(UNBOUNDP (XSYMBOL (Qquery_coding_clear_highlights)->function))) | |
2955 { | |
2956 /* It's okay to call Lisp here, the only non-stack object we may have | |
2957 allocated up to this point is skip_chars_range_table, and that's | |
2958 reachable from its entry in Vfixed_width_query_ranges_cache. */ | |
2959 call3 (Qquery_coding_clear_highlights, make_int (pos), make_int (end), | |
2960 wrap_buffer (buf)); | |
2961 } | |
2962 | |
2963 if (NILP (Vunicode_query_skip_chars)) | |
2964 { | |
2965 initialize_unicode_query_range_tables_from_strings(); | |
2966 } | |
2967 | |
2968 if (flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) | |
2969 { | |
2970 switch (unicode_type) | |
2971 { | |
2972 case UNICODE_UTF_8: | |
2973 skip_chars_range_table = Vutf_8_invalid_and_query_skip_chars; | |
2974 break; | |
2975 case UNICODE_UTF_7: | |
2976 /* #### See above. */ | |
2977 return Qunbound; | |
2978 break; | |
2979 default: | |
2980 skip_chars_range_table = Vunicode_invalid_and_query_skip_chars; | |
2981 break; | |
2982 } | |
2983 } | |
2984 else | |
2985 { | |
2986 switch (unicode_type) | |
2987 { | |
2988 case UNICODE_UTF_8: | |
2989 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START + 0x80; | |
2990 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF; | |
2991 break; | |
2992 case UNICODE_UTF_7: | |
2993 /* #### Work out what to do here in reality, read the spec and decide | |
2994 which octets are invalid. */ | |
2995 return Qunbound; | |
2996 break; | |
2997 default: | |
2998 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START; | |
2999 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF; | |
3000 break; | |
3001 } | |
3002 | |
3003 skip_chars_range_table = Vunicode_query_skip_chars; | |
3004 } | |
3005 | |
3006 while (pos < end) | |
3007 { | |
3008 Ichar ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte); | |
3009 if ((ch < 0x100 ? 1 : | |
3010 (!EQ (Qnil, Fget_range_table (make_int (ch), skip_chars_range_table, | |
3011 Qnil))))) | |
3012 { | |
3013 pos++; | |
3014 INC_BYTEBPOS (buf, pos_byte); | |
3015 } | |
3016 else | |
3017 { | |
3018 fail_range_start = pos; | |
3019 while ((pos < end) && | |
3020 ((checked_unicode = ichar_to_unicode (ch), | |
3021 -1 == checked_unicode | |
3022 && (failed_reason = query_coding_unencodable)) | |
3023 || (!(flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) && | |
3024 (invalid_lower_limit <= checked_unicode) && | |
3025 (checked_unicode <= invalid_upper_limit) | |
3026 && (failed_reason = query_coding_invalid_sequence))) | |
3027 && (previous_failed_reason == query_coding_succeeded | |
3028 || previous_failed_reason == failed_reason)) | |
3029 { | |
3030 pos++; | |
3031 INC_BYTEBPOS (buf, pos_byte); | |
3032 ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte); | |
3033 previous_failed_reason = failed_reason; | |
3034 } | |
3035 | |
3036 if (fail_range_start == pos) | |
3037 { | |
3038 /* The character can actually be encoded; move on. */ | |
3039 pos++; | |
3040 INC_BYTEBPOS (buf, pos_byte); | |
3041 } | |
3042 else | |
3043 { | |
3044 assert (previous_failed_reason == query_coding_invalid_sequence | |
3045 || previous_failed_reason == query_coding_unencodable); | |
3046 | |
3047 if (flags & QUERY_METHOD_ERRORP) | |
3048 { | |
3049 DECLARE_EISTRING (error_details); | |
3050 | |
3051 eicpy_ascii (error_details, "Cannot encode "); | |
3052 eicat_lstr (error_details, | |
3053 make_string_from_buffer (buf, fail_range_start, | |
3054 pos - | |
3055 fail_range_start)); | |
3056 eicat_ascii (error_details, " using coding system"); | |
3057 | |
3058 signal_error (Qtext_conversion_error, | |
3059 (const CIbyte *)(eidata (error_details)), | |
3060 XCODING_SYSTEM_NAME (codesys)); | |
3061 } | |
3062 | |
3063 if (NILP (result)) | |
3064 { | |
3065 result = Fmake_range_table (Qstart_closed_end_open); | |
3066 } | |
3067 | |
3068 fail_range_end = pos; | |
3069 | |
3070 Fput_range_table (make_int (fail_range_start), | |
3071 make_int (fail_range_end), | |
3072 (previous_failed_reason | |
3073 == query_coding_unencodable ? | |
3074 Qunencodable : Qinvalid_sequence), | |
3075 result); | |
3076 previous_failed_reason = query_coding_succeeded; | |
3077 | |
3078 if (flags & QUERY_METHOD_HIGHLIGHT) | |
3079 { | |
3080 Lisp_Object extent | |
3081 = Fmake_extent (make_int (fail_range_start), | |
3082 make_int (fail_range_end), | |
3083 wrap_buffer (buf)); | |
3084 | |
3085 Fset_extent_priority | |
3086 (extent, make_int (2 + mouse_highlight_priority)); | |
3087 Fset_extent_face (extent, Qquery_coding_warning_face); | |
3088 } | |
3089 } | |
3090 } | |
3091 } | |
3092 | |
3093 return result; | |
3094 } | |
3095 #else /* !MULE */ | |
3096 unicode_query (Lisp_Object UNUSED (codesys), | |
3097 struct buffer * UNUSED (buf), | |
3098 Charbpos UNUSED (end), int UNUSED (flags)) | |
3099 { | |
3100 return Qnil; | |
3101 } | |
3102 #endif | |
3103 | |
2821 int | 3104 int |
2822 dfc_coding_system_is_unicode ( | 3105 dfc_coding_system_is_unicode ( |
2823 #ifdef WIN32_ANY | 3106 #ifdef WIN32_ANY |
2824 Lisp_Object codesys | 3107 Lisp_Object codesys |
2825 #else | 3108 #else |
2854 DEFSUBR (Fdefault_unicode_precedence_list); | 3137 DEFSUBR (Fdefault_unicode_precedence_list); |
2855 DEFSUBR (Fset_unicode_conversion); | 3138 DEFSUBR (Fset_unicode_conversion); |
2856 | 3139 |
2857 DEFSUBR (Fload_unicode_mapping_table); | 3140 DEFSUBR (Fload_unicode_mapping_table); |
2858 | 3141 |
3142 DEFSUBR (Fset_unicode_query_skip_chars_args); | |
3143 | |
2859 DEFSYMBOL (Qccl_encode_to_ucs_2); | 3144 DEFSYMBOL (Qccl_encode_to_ucs_2); |
2860 DEFSYMBOL (Qlast_allocated_character); | 3145 DEFSYMBOL (Qlast_allocated_character); |
2861 DEFSYMBOL (Qignore_first_column); | 3146 DEFSYMBOL (Qignore_first_column); |
2862 | 3147 |
2863 DEFSYMBOL (Qunicode_registries); | 3148 DEFSYMBOL (Qunicode_registries); |
2888 coding_system_type_create_unicode (void) | 3173 coding_system_type_create_unicode (void) |
2889 { | 3174 { |
2890 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p"); | 3175 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p"); |
2891 CODING_SYSTEM_HAS_METHOD (unicode, print); | 3176 CODING_SYSTEM_HAS_METHOD (unicode, print); |
2892 CODING_SYSTEM_HAS_METHOD (unicode, convert); | 3177 CODING_SYSTEM_HAS_METHOD (unicode, convert); |
3178 CODING_SYSTEM_HAS_METHOD (unicode, query); | |
2893 CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream); | 3179 CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream); |
2894 CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream); | 3180 CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream); |
2895 CODING_SYSTEM_HAS_METHOD (unicode, putprop); | 3181 CODING_SYSTEM_HAS_METHOD (unicode, putprop); |
2896 CODING_SYSTEM_HAS_METHOD (unicode, getprop); | 3182 CODING_SYSTEM_HAS_METHOD (unicode, getprop); |
2897 | 3183 |
2978 those used when no font matching the charset's registries property has been | 3264 those used when no font matching the charset's registries property has been |
2979 found (that is, they're probably Mule-specific charsets like Ethiopic or | 3265 found (that is, they're probably Mule-specific charsets like Ethiopic or |
2980 IPA.) | 3266 IPA.) |
2981 */ ); | 3267 */ ); |
2982 Qunicode_registries = vector1(build_string("iso10646-1")); | 3268 Qunicode_registries = vector1(build_string("iso10646-1")); |
3269 | |
3270 /* Initialised in lisp/mule/general-late.el, by a call to | |
3271 #'set-unicode-query-skip-chars-args. Or at least they would be, but we | |
3272 can't do this at dump time right now, initialised range tables aren't | |
3273 dumped properly. */ | |
3274 staticpro (&Vunicode_invalid_and_query_skip_chars); | |
3275 Vunicode_invalid_and_query_skip_chars = Qnil; | |
3276 staticpro (&Vutf_8_invalid_and_query_skip_chars); | |
3277 Vutf_8_invalid_and_query_skip_chars = Qnil; | |
3278 staticpro (&Vunicode_query_skip_chars); | |
3279 Vunicode_query_skip_chars = Qnil; | |
3280 | |
3281 /* If we could dump the range table above these wouldn't be necessary: */ | |
3282 staticpro (&Vunicode_query_string); | |
3283 Vunicode_query_string = Qnil; | |
3284 staticpro (&Vunicode_invalid_string); | |
3285 Vunicode_invalid_string = Qnil; | |
3286 staticpro (&Vutf_8_invalid_string); | |
3287 Vutf_8_invalid_string = Qnil; | |
2983 #endif /* MULE */ | 3288 #endif /* MULE */ |
2984 } | 3289 } |