comparison src/unicode.c @ 4690:257b468bf2ca

Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. src/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. This is necessary because there is no reasonable way to access the corresponding mswindows-multibyte functionality from Lisp, and we need such functionality if we're going to have a reliable and portable #'query-coding-region implementation. However, this change doesn't yet provide #'query-coding-region for the mswindow-multibyte coding systems, there should be no functional differences between an XEmacs with this change and one without it. * mule-coding.c (struct fixed_width_coding_system): Add a new coding system type, fixed_width, and implement it. It uses the CCL infrastructure but has a much simpler creation API, and its own query_method, formerly in lisp/mule/mule-coding.el. * unicode.c: Move the Unicode query method implementation here from unicode.el. * lisp.h: Declare Fmake_coding_system_internal, Fcopy_range_table here. * intl-win32.c (complex_vars_of_intl_win32): Use Fmake_coding_system_internal, not Fmake_coding_system. * general-slots.h: Add Qsucceeded, Qunencodable, Qinvalid_sequence here. * file-coding.h (enum coding_system_variant): Add fixed_width_coding_system here. (struct coding_system_methods): Add query_method and query_lstream_method to the coding system methods. Provide flags for the query methods. Declare the default query method; initialise it correctly in INITIALIZE_CODING_SYSTEM_TYPE. * file-coding.c (default_query_method): New function, the default query method for coding systems that do not set it. Moved from coding.el. (make_coding_system_1): Accept new elements in PROPS in #'make-coding-system; aliases, a list of aliases; safe-chars and safe-charsets (these were previously accepted but not saved); and category. (Fmake_coding_system_internal): New function, what used to be #'make-coding-system--on Mule builds, we've now moved some of the functionality of this to Lisp. (Fcoding_system_canonical_name_p): Move this earlier in the file, since it's now called from within make_coding_system_1. (Fquery_coding_region): Move the implementation of this here, from coding.el. (complex_vars_of_file_coding): Call Fmake_coding_system_internal, not Fmake_coding_system; specify safe-charsets properties when we're a mule build. * extents.h (mouse_highlight_priority, Fset_extent_priority, Fset_extent_face, Fmap_extents): Make these available to other C files. lisp/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> Move the #'query-coding-region implementation to C. * coding.el: Consolidate code that depends on the presence or absence of Mule at the end of this file. (default-query-coding-region, query-coding-region): Move these functions to C. (default-query-coding-region-safe-charset-skip-chars-map): Remove this variable, the corresponding C variable is Vdefault_query_coding_region_chartab_cache in file-coding.c. (query-coding-string): Update docstring to reflect actual multiple values, be more careful about not modifying a range table that we're currently mapping over. (encode-coding-char): Make the implementation of this simpler. (featurep 'mule): Autoload #'make-coding-system from mule/make-coding-system.el if we're a mule build; provide an appropriate compiler macro. Do various non-mule compatibility things if we're not a mule build. * update-elc.el (additional-dump-dependencies): Add mule/make-coding-system as a dump time dependency if we're a mule build. * unicode.el (ccl-encode-to-ucs-2): (decode-char): (encode-char): Move these earlier in the file, for the sake of some byte compile warnings. (unicode-query-coding-region): Move this to unicode.c * mule/make-coding-system.el: New file, not dumped. Contains the functionality to rework the arguments necessary for fixed-width coding systems, and contains the implementation of #'make-coding-system, which now calls #'make-coding-system-internal. * mule/vietnamese.el (viscii): * mule/latin.el (iso-8859-2): (windows-1250): (iso-8859-3): (iso-8859-4): (iso-8859-14): (iso-8859-15): (iso-8859-16): (iso-8859-9): (macintosh): (windows-1252): * mule/hebrew.el (iso-8859-8): * mule/greek.el (iso-8859-7): (windows-1253): * mule/cyrillic.el (iso-8859-5): (koi8-r): (koi8-u): (windows-1251): (alternativnyj): (koi8-ru): (koi8-t): (koi8-c): (koi8-o): * mule/arabic.el (iso-8859-6): (windows-1256): Move all these coding systems to being of type fixed-width, not of type CCL. This allows the distinct query-coding-region for them to be in C, something which will eventually allow us to implement query-coding-region for the mswindows-multibyte coding systems. * mule/general-late.el (posix-charset-to-coding-system-hash): Document why we're pre-emptively persuading the byte compiler that the ELC for this file needs to be written using escape-quoted. Call #'set-unicode-query-skip-chars-args, now the Unicode query-coding-region implementation is in C. * mule/thai-xtis.el (tis-620): Don't bother checking whether we're XEmacs or not here. * mule/mule-coding.el: Move the eight bit fixed-width functionality from this file to make-coding-system.el. tests/ChangeLog addition: 2009-09-19 Aidan Kehoe <kehoea@parhasard.net> * automated/mule-tests.el: Check a coding system's type, not an 8-bit-fixed property, for whether that coding system should be treated as a fixed-width coding system. * automated/query-coding-tests.el: Don't test the query coding functionality for mswindows-multibyte coding systems, it's not yet implemented.
author Aidan Kehoe <kehoea@parhasard.net>
date Sat, 19 Sep 2009 22:53:13 +0100
parents 7e54adf407a1
children b9aaf2a18957
comparison
equal deleted inserted replaced
4689:0636c6ccb430 4690:257b468bf2ca
38 #include "lisp.h" 38 #include "lisp.h"
39 39
40 #include "charset.h" 40 #include "charset.h"
41 #include "file-coding.h" 41 #include "file-coding.h"
42 #include "opaque.h" 42 #include "opaque.h"
43
44 #include "buffer.h"
45 #include "rangetab.h"
46 #include "extents.h"
43 47
44 #include "sysfile.h" 48 #include "sysfile.h"
45 49
46 /* For more info about how Unicode works under Windows, see intl-win32.c. */ 50 /* For more info about how Unicode works under Windows, see intl-win32.c. */
47 51
190 194
191 Lisp_Object Qutf_16_little_endian, Qutf_16_bom; 195 Lisp_Object Qutf_16_little_endian, Qutf_16_bom;
192 Lisp_Object Qutf_16_little_endian_bom; 196 Lisp_Object Qutf_16_little_endian_bom;
193 197
194 Lisp_Object Qutf_8_bom; 198 Lisp_Object Qutf_8_bom;
199
200 #ifdef MULE
201 /* These range tables are not directly accessible from Lisp: */
202 static Lisp_Object Vunicode_invalid_and_query_skip_chars;
203 static Lisp_Object Vutf_8_invalid_and_query_skip_chars;
204 static Lisp_Object Vunicode_query_skip_chars;
205
206 static Lisp_Object Vunicode_query_string, Vunicode_invalid_string,
207 Vutf_8_invalid_string;
208 #endif /* MULE */
195 209
196 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this 210 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this
197 algorithm. 211 algorithm.
198 212
199 (They also give another, really verbose one, as part of their explanation 213 (They also give another, really verbose one, as part of their explanation
2816 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs)) 2830 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs))
2817 write_c_string (printcharfun, ", need-bom"); 2831 write_c_string (printcharfun, ", need-bom");
2818 write_c_string (printcharfun, ")"); 2832 write_c_string (printcharfun, ")");
2819 } 2833 }
2820 2834
2835 #ifdef MULE
2836 DEFUN ("set-unicode-query-skip-chars-args", Fset_unicode_query_skip_chars_args,
2837 3, 3, 0, /*
2838 Specify strings as matching characters known to Unicode coding systems.
2839
2840 QUERY-STRING is a string matching characters that can unequivocally be
2841 encoded by the Unicode coding systems.
2842
2843 INVALID-STRING is a string to match XEmacs characters that represent known
2844 octets on disk, but that are invalid sequences according to Unicode.
2845
2846 UTF-8-INVALID-STRING is a more restrictive string to match XEmacs characters
2847 that are invalid UTF-8 octets.
2848
2849 All three strings are in the format accepted by `skip-chars-forward'.
2850 */
2851 (query_string, invalid_string, utf_8_invalid_string))
2852 {
2853 CHECK_STRING (query_string);
2854 CHECK_STRING (invalid_string);
2855 CHECK_STRING (utf_8_invalid_string);
2856
2857 Vunicode_query_string = query_string;
2858 Vunicode_invalid_string = invalid_string;
2859 Vutf_8_invalid_string = utf_8_invalid_string;
2860
2861 return Qnil;
2862 }
2863
2864 static void
2865 add_lisp_string_to_skip_chars_range (Lisp_Object string, Lisp_Object rtab,
2866 Lisp_Object value)
2867 {
2868 Ibyte *p, *pend;
2869 Ichar c;
2870
2871 p = XSTRING_DATA (string);
2872 pend = p + XSTRING_LENGTH (string);
2873
2874 while (p != pend)
2875 {
2876 c = itext_ichar (p);
2877
2878 INC_IBYTEPTR (p);
2879
2880 if (c == '\\')
2881 {
2882 if (p == pend) break;
2883 c = itext_ichar (p);
2884 INC_IBYTEPTR (p);
2885 }
2886
2887 if (p != pend && *p == '-')
2888 {
2889 Ichar cend;
2890
2891 /* Skip over the dash. */
2892 p++;
2893 if (p == pend) break;
2894 cend = itext_ichar (p);
2895
2896 Fput_range_table (make_int (c), make_int (cend), value,
2897 rtab);
2898
2899 INC_IBYTEPTR (p);
2900 }
2901 else
2902 {
2903 Fput_range_table (make_int (c), make_int (c), value, rtab);
2904 }
2905 }
2906 }
2907
2908 /* This function wouldn't be necessary if initialised range tables were
2909 dumped properly; see
2910 http://mid.gmane.org/18179.49815.622843.336527@parhasard.net . */
2911 static void
2912 initialize_unicode_query_range_tables_from_strings (void)
2913 {
2914 CHECK_STRING (Vunicode_query_string);
2915 CHECK_STRING (Vunicode_invalid_string);
2916 CHECK_STRING (Vutf_8_invalid_string);
2917
2918 Vunicode_query_skip_chars = Fmake_range_table (Qstart_closed_end_closed);
2919
2920 add_lisp_string_to_skip_chars_range (Vunicode_query_string,
2921 Vunicode_query_skip_chars,
2922 Qsucceeded);
2923
2924 Vunicode_invalid_and_query_skip_chars
2925 = Fcopy_range_table (Vunicode_query_skip_chars);
2926
2927 add_lisp_string_to_skip_chars_range (Vunicode_invalid_string,
2928 Vunicode_invalid_and_query_skip_chars,
2929 Qinvalid_sequence);
2930
2931 Vutf_8_invalid_and_query_skip_chars
2932 = Fcopy_range_table (Vunicode_query_skip_chars);
2933
2934 add_lisp_string_to_skip_chars_range (Vutf_8_invalid_string,
2935 Vutf_8_invalid_and_query_skip_chars,
2936 Qinvalid_sequence);
2937 }
2938
2939 static Lisp_Object
2940 unicode_query (Lisp_Object codesys, struct buffer *buf, Charbpos end,
2941 int flags)
2942 {
2943 Charbpos pos = BUF_PT (buf), fail_range_start, fail_range_end;
2944 Charbpos pos_byte = BYTE_BUF_PT (buf);
2945 Lisp_Object skip_chars_range_table, result = Qnil;
2946 enum query_coding_failure_reasons failed_reason,
2947 previous_failed_reason = query_coding_succeeded;
2948 int checked_unicode, invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START,
2949 invalid_upper_limit, unicode_type = XCODING_SYSTEM_UNICODE_TYPE (codesys);
2950
2951 if (flags & QUERY_METHOD_HIGHLIGHT &&
2952 /* If we're being called really early, live without highlights getting
2953 cleared properly: */
2954 !(UNBOUNDP (XSYMBOL (Qquery_coding_clear_highlights)->function)))
2955 {
2956 /* It's okay to call Lisp here, the only non-stack object we may have
2957 allocated up to this point is skip_chars_range_table, and that's
2958 reachable from its entry in Vfixed_width_query_ranges_cache. */
2959 call3 (Qquery_coding_clear_highlights, make_int (pos), make_int (end),
2960 wrap_buffer (buf));
2961 }
2962
2963 if (NILP (Vunicode_query_skip_chars))
2964 {
2965 initialize_unicode_query_range_tables_from_strings();
2966 }
2967
2968 if (flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES)
2969 {
2970 switch (unicode_type)
2971 {
2972 case UNICODE_UTF_8:
2973 skip_chars_range_table = Vutf_8_invalid_and_query_skip_chars;
2974 break;
2975 case UNICODE_UTF_7:
2976 /* #### See above. */
2977 return Qunbound;
2978 break;
2979 default:
2980 skip_chars_range_table = Vunicode_invalid_and_query_skip_chars;
2981 break;
2982 }
2983 }
2984 else
2985 {
2986 switch (unicode_type)
2987 {
2988 case UNICODE_UTF_8:
2989 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START + 0x80;
2990 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF;
2991 break;
2992 case UNICODE_UTF_7:
2993 /* #### Work out what to do here in reality, read the spec and decide
2994 which octets are invalid. */
2995 return Qunbound;
2996 break;
2997 default:
2998 invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START;
2999 invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF;
3000 break;
3001 }
3002
3003 skip_chars_range_table = Vunicode_query_skip_chars;
3004 }
3005
3006 while (pos < end)
3007 {
3008 Ichar ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte);
3009 if ((ch < 0x100 ? 1 :
3010 (!EQ (Qnil, Fget_range_table (make_int (ch), skip_chars_range_table,
3011 Qnil)))))
3012 {
3013 pos++;
3014 INC_BYTEBPOS (buf, pos_byte);
3015 }
3016 else
3017 {
3018 fail_range_start = pos;
3019 while ((pos < end) &&
3020 ((checked_unicode = ichar_to_unicode (ch),
3021 -1 == checked_unicode
3022 && (failed_reason = query_coding_unencodable))
3023 || (!(flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) &&
3024 (invalid_lower_limit <= checked_unicode) &&
3025 (checked_unicode <= invalid_upper_limit)
3026 && (failed_reason = query_coding_invalid_sequence)))
3027 && (previous_failed_reason == query_coding_succeeded
3028 || previous_failed_reason == failed_reason))
3029 {
3030 pos++;
3031 INC_BYTEBPOS (buf, pos_byte);
3032 ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte);
3033 previous_failed_reason = failed_reason;
3034 }
3035
3036 if (fail_range_start == pos)
3037 {
3038 /* The character can actually be encoded; move on. */
3039 pos++;
3040 INC_BYTEBPOS (buf, pos_byte);
3041 }
3042 else
3043 {
3044 assert (previous_failed_reason == query_coding_invalid_sequence
3045 || previous_failed_reason == query_coding_unencodable);
3046
3047 if (flags & QUERY_METHOD_ERRORP)
3048 {
3049 DECLARE_EISTRING (error_details);
3050
3051 eicpy_ascii (error_details, "Cannot encode ");
3052 eicat_lstr (error_details,
3053 make_string_from_buffer (buf, fail_range_start,
3054 pos -
3055 fail_range_start));
3056 eicat_ascii (error_details, " using coding system");
3057
3058 signal_error (Qtext_conversion_error,
3059 (const CIbyte *)(eidata (error_details)),
3060 XCODING_SYSTEM_NAME (codesys));
3061 }
3062
3063 if (NILP (result))
3064 {
3065 result = Fmake_range_table (Qstart_closed_end_open);
3066 }
3067
3068 fail_range_end = pos;
3069
3070 Fput_range_table (make_int (fail_range_start),
3071 make_int (fail_range_end),
3072 (previous_failed_reason
3073 == query_coding_unencodable ?
3074 Qunencodable : Qinvalid_sequence),
3075 result);
3076 previous_failed_reason = query_coding_succeeded;
3077
3078 if (flags & QUERY_METHOD_HIGHLIGHT)
3079 {
3080 Lisp_Object extent
3081 = Fmake_extent (make_int (fail_range_start),
3082 make_int (fail_range_end),
3083 wrap_buffer (buf));
3084
3085 Fset_extent_priority
3086 (extent, make_int (2 + mouse_highlight_priority));
3087 Fset_extent_face (extent, Qquery_coding_warning_face);
3088 }
3089 }
3090 }
3091 }
3092
3093 return result;
3094 }
3095 #else /* !MULE */
3096 unicode_query (Lisp_Object UNUSED (codesys),
3097 struct buffer * UNUSED (buf),
3098 Charbpos UNUSED (end), int UNUSED (flags))
3099 {
3100 return Qnil;
3101 }
3102 #endif
3103
2821 int 3104 int
2822 dfc_coding_system_is_unicode ( 3105 dfc_coding_system_is_unicode (
2823 #ifdef WIN32_ANY 3106 #ifdef WIN32_ANY
2824 Lisp_Object codesys 3107 Lisp_Object codesys
2825 #else 3108 #else
2854 DEFSUBR (Fdefault_unicode_precedence_list); 3137 DEFSUBR (Fdefault_unicode_precedence_list);
2855 DEFSUBR (Fset_unicode_conversion); 3138 DEFSUBR (Fset_unicode_conversion);
2856 3139
2857 DEFSUBR (Fload_unicode_mapping_table); 3140 DEFSUBR (Fload_unicode_mapping_table);
2858 3141
3142 DEFSUBR (Fset_unicode_query_skip_chars_args);
3143
2859 DEFSYMBOL (Qccl_encode_to_ucs_2); 3144 DEFSYMBOL (Qccl_encode_to_ucs_2);
2860 DEFSYMBOL (Qlast_allocated_character); 3145 DEFSYMBOL (Qlast_allocated_character);
2861 DEFSYMBOL (Qignore_first_column); 3146 DEFSYMBOL (Qignore_first_column);
2862 3147
2863 DEFSYMBOL (Qunicode_registries); 3148 DEFSYMBOL (Qunicode_registries);
2888 coding_system_type_create_unicode (void) 3173 coding_system_type_create_unicode (void)
2889 { 3174 {
2890 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p"); 3175 INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p");
2891 CODING_SYSTEM_HAS_METHOD (unicode, print); 3176 CODING_SYSTEM_HAS_METHOD (unicode, print);
2892 CODING_SYSTEM_HAS_METHOD (unicode, convert); 3177 CODING_SYSTEM_HAS_METHOD (unicode, convert);
3178 CODING_SYSTEM_HAS_METHOD (unicode, query);
2893 CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream); 3179 CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream);
2894 CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream); 3180 CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream);
2895 CODING_SYSTEM_HAS_METHOD (unicode, putprop); 3181 CODING_SYSTEM_HAS_METHOD (unicode, putprop);
2896 CODING_SYSTEM_HAS_METHOD (unicode, getprop); 3182 CODING_SYSTEM_HAS_METHOD (unicode, getprop);
2897 3183
2978 those used when no font matching the charset's registries property has been 3264 those used when no font matching the charset's registries property has been
2979 found (that is, they're probably Mule-specific charsets like Ethiopic or 3265 found (that is, they're probably Mule-specific charsets like Ethiopic or
2980 IPA.) 3266 IPA.)
2981 */ ); 3267 */ );
2982 Qunicode_registries = vector1(build_string("iso10646-1")); 3268 Qunicode_registries = vector1(build_string("iso10646-1"));
3269
3270 /* Initialised in lisp/mule/general-late.el, by a call to
3271 #'set-unicode-query-skip-chars-args. Or at least they would be, but we
3272 can't do this at dump time right now, initialised range tables aren't
3273 dumped properly. */
3274 staticpro (&Vunicode_invalid_and_query_skip_chars);
3275 Vunicode_invalid_and_query_skip_chars = Qnil;
3276 staticpro (&Vutf_8_invalid_and_query_skip_chars);
3277 Vutf_8_invalid_and_query_skip_chars = Qnil;
3278 staticpro (&Vunicode_query_skip_chars);
3279 Vunicode_query_skip_chars = Qnil;
3280
3281 /* If we could dump the range table above these wouldn't be necessary: */
3282 staticpro (&Vunicode_query_string);
3283 Vunicode_query_string = Qnil;
3284 staticpro (&Vunicode_invalid_string);
3285 Vunicode_invalid_string = Qnil;
3286 staticpro (&Vutf_8_invalid_string);
3287 Vutf_8_invalid_string = Qnil;
2983 #endif /* MULE */ 3288 #endif /* MULE */
2984 } 3289 }