annotate src/unicode.c @ 4952:19a72041c5ed
Mule-izing, various fixes related to char * arguments
-------------------- ChangeLog entries follow: --------------------
modules/ChangeLog addition:
2010-01-26 Ben Wing <ben@xemacs.org>
* postgresql/postgresql.c:
* postgresql/postgresql.c (CHECK_LIVE_CONNECTION):
* postgresql/postgresql.c (print_pgresult):
* postgresql/postgresql.c (Fpq_conn_defaults):
* postgresql/postgresql.c (Fpq_connectdb):
* postgresql/postgresql.c (Fpq_connect_start):
* postgresql/postgresql.c (Fpq_result_status):
* postgresql/postgresql.c (Fpq_res_status):
Mule-ize large parts of it.
2010-01-26 Ben Wing <ben@xemacs.org>
* ldap/eldap.c (print_ldap):
* ldap/eldap.c (allocate_ldap):
Use write_ascstring().
src/ChangeLog addition:
2010-01-26 Ben Wing <ben@xemacs.org>
* alloc.c:
* alloc.c (build_ascstring):
* alloc.c (build_msg_cistring):
* alloc.c (staticpro_1):
* alloc.c (staticpro_name):
* alloc.c (staticpro_nodump_1):
* alloc.c (staticpro_nodump_name):
* alloc.c (unstaticpro_nodump_1):
* alloc.c (mcpro_1):
* alloc.c (mcpro_name):
* alloc.c (object_memory_usage_stats):
* alloc.c (common_init_alloc_early):
* alloc.c (init_alloc_once_early):
* buffer.c (print_buffer):
* buffer.c (vars_of_buffer):
* buffer.c (common_init_complex_vars_of_buffer):
* buffer.c (init_initial_directory):
* bytecode.c (invalid_byte_code):
* bytecode.c (print_compiled_function):
* bytecode.c (mark_compiled_function):
* chartab.c (print_table_entry):
* chartab.c (print_char_table):
* config.h.in:
* console-gtk.c:
* console-gtk.c (gtk_device_to_console_connection):
* console-gtk.c (gtk_semi_canonicalize_console_connection):
* console-gtk.c (gtk_canonicalize_console_connection):
* console-gtk.c (gtk_semi_canonicalize_device_connection):
* console-gtk.c (gtk_canonicalize_device_connection):
* console-stream.c (stream_init_frame_1):
* console-stream.c (vars_of_console_stream):
* console-stream.c (init_console_stream):
* console-x.c (x_semi_canonicalize_console_connection):
* console-x.c (x_semi_canonicalize_device_connection):
* console-x.c (x_canonicalize_device_connection):
* console-x.h:
* data.c (eq_with_ebola_notice):
* data.c (Fsubr_interactive):
* data.c (Fnumber_to_string):
* data.c (digit_to_number):
* device-gtk.c (gtk_init_device):
* device-msw.c (print_devmode):
* device-x.c (x_event_name):
* dialog-msw.c (handle_directory_dialog_box):
* dialog-msw.c (handle_file_dialog_box):
* dialog-msw.c (vars_of_dialog_mswindows):
* doc.c (weird_doc):
* doc.c (Fsnarf_documentation):
* doc.c (vars_of_doc):
* dumper.c (pdump):
* dynarr.c:
* dynarr.c (Dynarr_realloc):
* editfns.c (Fuser_real_login_name):
* editfns.c (get_home_directory):
* elhash.c (print_hash_table_data):
* elhash.c (print_hash_table):
* emacs.c (main_1):
* emacs.c (vars_of_emacs):
* emodules.c:
* emodules.c (_emodules_list):
* emodules.c (Fload_module):
* emodules.c (Funload_module):
* emodules.c (Flist_modules):
* emodules.c (find_make_module):
* emodules.c (attempt_module_delete):
* emodules.c (emodules_load):
* emodules.c (emodules_doc_subr):
* emodules.c (emodules_doc_sym):
* emodules.c (syms_of_module):
* emodules.c (vars_of_module):
* emodules.h:
* eval.c (print_subr):
* eval.c (signal_call_debugger):
* eval.c (build_error_data):
* eval.c (signal_error):
* eval.c (maybe_signal_error):
* eval.c (signal_continuable_error):
* eval.c (maybe_signal_continuable_error):
* eval.c (signal_error_2):
* eval.c (maybe_signal_error_2):
* eval.c (signal_continuable_error_2):
* eval.c (maybe_signal_continuable_error_2):
* eval.c (signal_ferror):
* eval.c (maybe_signal_ferror):
* eval.c (signal_continuable_ferror):
* eval.c (maybe_signal_continuable_ferror):
* eval.c (signal_ferror_with_frob):
* eval.c (maybe_signal_ferror_with_frob):
* eval.c (signal_continuable_ferror_with_frob):
* eval.c (maybe_signal_continuable_ferror_with_frob):
* eval.c (syntax_error):
* eval.c (syntax_error_2):
* eval.c (maybe_syntax_error):
* eval.c (sferror):
* eval.c (sferror_2):
* eval.c (maybe_sferror):
* eval.c (invalid_argument):
* eval.c (invalid_argument_2):
* eval.c (maybe_invalid_argument):
* eval.c (invalid_constant):
* eval.c (invalid_constant_2):
* eval.c (maybe_invalid_constant):
* eval.c (invalid_operation):
* eval.c (invalid_operation_2):
* eval.c (maybe_invalid_operation):
* eval.c (invalid_change):
* eval.c (invalid_change_2):
* eval.c (maybe_invalid_change):
* eval.c (invalid_state):
* eval.c (invalid_state_2):
* eval.c (maybe_invalid_state):
* eval.c (wtaerror):
* eval.c (stack_overflow):
* eval.c (out_of_memory):
* eval.c (print_multiple_value):
* eval.c (issue_call_trapping_problems_warning):
* eval.c (backtrace_specials):
* eval.c (backtrace_unevalled_args):
* eval.c (Fbacktrace):
* eval.c (warn_when_safe):
* event-Xt.c (modwarn):
* event-Xt.c (modbarf):
* event-Xt.c (check_modifier):
* event-Xt.c (store_modifier):
* event-Xt.c (emacs_Xt_format_magic_event):
* event-Xt.c (describe_event):
* event-gtk.c (dragndrop_data_received):
* event-gtk.c (store_modifier):
* event-gtk.c (gtk_reset_modifier_mapping):
* event-msw.c (dde_eval_string):
* event-msw.c (Fdde_alloc_advise_item):
* event-msw.c (mswindows_dde_callback):
* event-msw.c (FROB):
* event-msw.c (emacs_mswindows_format_magic_event):
* event-stream.c (external_debugging_print_event):
* event-stream.c (execute_help_form):
* event-stream.c (vars_of_event_stream):
* events.c (print_event_1):
* events.c (print_event):
* events.c (event_equal):
* extents.c (print_extent_1):
* extents.c (print_extent):
* extents.c (vars_of_extents):
* faces.c (print_face):
* faces.c (complex_vars_of_faces):
* file-coding.c:
* file-coding.c (print_coding_system):
* file-coding.c (print_coding_system_in_print_method):
* file-coding.c (default_query_method):
* file-coding.c (find_coding_system):
* file-coding.c (make_coding_system_1):
* file-coding.c (chain_print):
* file-coding.c (undecided_print):
* file-coding.c (gzip_print):
* file-coding.c (vars_of_file_coding):
* file-coding.c (complex_vars_of_file_coding):
* fileio.c:
* fileio.c (report_file_type_error):
* fileio.c (report_error_with_errno):
* fileio.c (report_file_error):
* fileio.c (barf_or_query_if_file_exists):
* fileio.c (vars_of_fileio):
* floatfns.c (matherr):
* fns.c (print_bit_vector):
* fns.c (Fmapconcat):
* fns.c (add_suffix_to_symbol):
* fns.c (add_prefix_to_symbol):
* frame-gtk.c:
* frame-gtk.c (Fgtk_window_id):
* frame-x.c (def):
* frame-x.c (x_cde_transfer_callback):
* frame.c:
* frame.c (Fmake_frame):
* gc.c (show_gc_cursor_and_message):
* gc.c (vars_of_gc):
* glyphs-eimage.c (png_instantiate):
* glyphs-eimage.c (tiff_instantiate):
* glyphs-gtk.c (gtk_print_image_instance):
* glyphs-msw.c (mswindows_print_image_instance):
* glyphs-x.c (x_print_image_instance):
* glyphs-x.c (update_widget_face):
* glyphs.c (make_string_from_file):
* glyphs.c (print_image_instance):
* glyphs.c (signal_image_error):
* glyphs.c (signal_image_error_2):
* glyphs.c (signal_double_image_error):
* glyphs.c (signal_double_image_error_2):
* glyphs.c (xbm_mask_file_munging):
* glyphs.c (pixmap_to_lisp_data):
* glyphs.h:
* gui.c (gui_item_display_flush_left):
* hpplay.c (player_error_internal):
* hpplay.c (myHandler):
* intl-win32.c:
* intl-win32.c (langcode_to_lang):
* intl-win32.c (sublangcode_to_lang):
* intl-win32.c (Fmswindows_get_locale_info):
* intl-win32.c (lcid_to_locale_mule_or_no):
* intl-win32.c (mswindows_multibyte_to_unicode_print):
* intl-win32.c (complex_vars_of_intl_win32):
* keymap.c:
* keymap.c (print_keymap):
* keymap.c (ensure_meta_prefix_char_keymapp):
* keymap.c (Fkey_description):
* keymap.c (Ftext_char_description):
* lisp.h:
* lisp.h (struct):
* lisp.h (DECLARE_INLINE_HEADER):
* lread.c (Fload_internal):
* lread.c (locate_file):
* lread.c (read_escape):
* lread.c (read_raw_string):
* lread.c (read1):
* lread.c (read_list):
* lread.c (read_compiled_function):
* lread.c (init_lread):
* lrecord.h:
* marker.c (print_marker):
* marker.c (marker_equal):
* menubar-msw.c (displayable_menu_item):
* menubar-x.c (command_builder_operate_menu_accelerator):
* menubar.c (vars_of_menubar):
* minibuf.c (reinit_complex_vars_of_minibuf):
* minibuf.c (complex_vars_of_minibuf):
* mule-charset.c (Fmake_charset):
* mule-charset.c (complex_vars_of_mule_charset):
* mule-coding.c (iso2022_print):
* mule-coding.c (fixed_width_query):
* number.c (bignum_print):
* number.c (ratio_print):
* number.c (bigfloat_print):
* number.c (bigfloat_finalize):
* objects-msw.c:
* objects-msw.c (mswindows_color_to_string):
* objects-msw.c (mswindows_color_list):
* objects-tty.c:
* objects-tty.c (tty_font_list):
* objects-tty.c (tty_find_charset_font):
* objects-xlike-inc.c (xft_find_charset_font):
* objects-xlike-inc.c (endif):
* print.c:
* print.c (write_istring):
* print.c (write_ascstring):
* print.c (Fterpri):
* print.c (Fprint):
* print.c (print_error_message):
* print.c (print_vector_internal):
* print.c (print_cons):
* print.c (print_string):
* print.c (printing_unreadable_object):
* print.c (print_internal):
* print.c (print_float):
* print.c (print_symbol):
* process-nt.c (mswindows_report_winsock_error):
* process-nt.c (nt_canonicalize_host_name):
* process-unix.c (unix_canonicalize_host_name):
* process.c (print_process):
* process.c (report_process_error):
* process.c (report_network_error):
* process.c (make_process_internal):
* process.c (Fstart_process_internal):
* process.c (status_message):
* process.c (putenv_internal):
* process.c (vars_of_process):
* process.h:
* profile.c (vars_of_profile):
* rangetab.c (print_range_table):
* realpath.c (vars_of_realpath):
* redisplay.c (vars_of_redisplay):
* search.c (wordify):
* search.c (Freplace_match):
* sheap.c (sheap_adjust_h):
* sound.c (report_sound_error):
* sound.c (Fplay_sound_file):
* specifier.c (print_specifier):
* symbols.c (Fsubr_name):
* symbols.c (do_symval_forwarding):
* symbols.c (set_default_buffer_slot_variable):
* symbols.c (set_default_console_slot_variable):
* symbols.c (store_symval_forwarding):
* symbols.c (default_value):
* symbols.c (defsymbol_massage_name_1):
* symbols.c (defsymbol_massage_name_nodump):
* symbols.c (defsymbol_massage_name):
* symbols.c (defsymbol_massage_multiword_predicate_nodump):
* symbols.c (defsymbol_massage_multiword_predicate):
* symbols.c (defsymbol_nodump):
* symbols.c (defsymbol):
* symbols.c (defkeyword):
* symbols.c (defkeyword_massage_name):
* symbols.c (check_module_subr):
* symbols.c (deferror_1):
* symbols.c (deferror):
* symbols.c (deferror_massage_name):
* symbols.c (deferror_massage_name_and_message):
* symbols.c (defvar_magic):
* symeval.h:
* symeval.h (DEFVAR_SYMVAL_FWD):
* sysdep.c:
* sysdep.c (init_system_name):
* sysdll.c:
* sysdll.c (MAYBE_PREPEND_UNDERSCORE):
* sysdll.c (dll_function):
* sysdll.c (dll_variable):
* sysdll.c (dll_error):
* sysdll.c (dll_open):
* sysdll.c (dll_close):
* sysdll.c (image_for_address):
* sysdll.c (my_find_image):
* sysdll.c (search_linked_libs):
* sysdll.h:
* sysfile.h:
* sysfile.h (DEFAULT_DIRECTORY_FALLBACK):
* syswindows.h:
* tests.c (DFC_CHECK_LENGTH):
* tests.c (DFC_CHECK_CONTENT):
* tests.c (Ftest_hash_tables):
* text.c (vars_of_text):
* text.h:
* tooltalk.c (tt_opnum_string):
* tooltalk.c (tt_message_arg_ival_string):
* tooltalk.c (Ftooltalk_default_procid):
* tooltalk.c (Ftooltalk_default_session):
* tooltalk.c (init_tooltalk):
* tooltalk.c (vars_of_tooltalk):
* ui-gtk.c (Fdll_load):
* ui-gtk.c (type_to_marshaller_type):
* ui-gtk.c (Fgtk_import_function_internal):
* ui-gtk.c (emacs_gtk_object_printer):
* ui-gtk.c (emacs_gtk_boxed_printer):
* unicode.c (unicode_to_ichar):
* unicode.c (unicode_print):
* unicode.c (unicode_query):
* unicode.c (vars_of_unicode):
* unicode.c (complex_vars_of_unicode):
* win32.c:
* win32.c (mswindows_report_process_error):
* window.c (print_window):
* xemacs.def.in.in:
BASIC IDEA: Further fixing up uses of char * and CIbyte *
to reflect their actual semantics; Mule-izing some code;
redoing of the not-yet-working code to handle message translation.
Clean up code to handle message-translation (not yet working).
Create separate versions of build_msg_string() for working with
Ibyte *, CIbyte *, and Ascbyte * arguments. Assert that Ascbyte *
arguments are pure-ASCII. Make build_msg_string() be the same
as build_msg_ascstring(). Create same three versions of GETTEXT()
and DEFER_GETTEXT(). Also create build_defer_string() and
variants for the equivalent of DEFER_GETTEXT() when building a
string. Remove old CGETTEXT(). Clean up code where GETTEXT(),
DEFER_GETTEXT(), build_msg_string(), etc. were being called and
introduce some new calls to build_msg_string(), etc. Remove
GETTEXT() from calls to weird_doc() -- we assume that the
message snarfer knows about weird_doc(). Remove uses of
DEFER_GETTEXT() from error messages in sysdep.c and instead use
special comments /* @@@begin-snarf@@@ */ and /* @@@end-snarf@@@ */
that the message snarfer presumably knows about.
Create build_ascstring() and use it in many instances in place
of build_string(). The purpose of having Ascbyte * variants is
to make the code more self-documenting in terms of what sort of
semantics is expected for char * strings. In fact in the process
of looking for uses of build_string(), much improperly Mule-ized
code was discovered.
Mule-ize a lot of code as described in previous paragraph,
e.g. in sysdep.c.
Make the error functions take Ascbyte * strings and fix up a
couple of places where non-pure-ASCII strings were being passed in
(file-coding.c, mule-coding.c, unicode.c). (It's debatable whether
we really need to make the error functions work this way. It
helps catch places where code is written in a way that message
translation won't work, but we may well never implement message
translation.)
Make staticpro() and friends take Ascbyte * strings instead of
raw char * strings. Create a const_Ascbyte_ptr dynarr type
to describe what's held by staticpro_names[] and friends,
create pdump descriptions for const_Ascbyte_ptr dynarrs, and
use them in place of specially-crafted staticpro descriptions.
Mule-ize certain other functions (e.g. x_event_name) by correcting
raw use of char * to Ascbyte *, Rawbyte * or another such type,
and raw use of char[] buffers to another type (usually Ascbyte[]).
Change many uses of write_c_string() to write_msg_string(),
write_ascstring(), etc.
Mule-ize emodules.c, emodules.h, sysdll.h.
Fix some un-Mule-ized code in intl-win32.c.
A comment in event-Xt.c about the limitations of the message
snarfer (make-msgfile or whatever) is presumably incorrect --
it should be smart enough to handle function calls spread over
more than one line. Clean up code in event-Xt.c that was
written awkwardly for this reason.
In config.h.in, instead of NEED_ERROR_CHECK_TYPES_INLINES,
create a more general XEMACS_DEFS_NEEDS_INLINE_DECLS to
indicate when inlined functions need to be declared in
xemacs.def.in.in, and make use of it in xemacs.def.in.in.
We need to do this because postgresql.c now calls qxestrdup(),
which is an inline function.
Make nconc2() and other such functions MODULE_API and put
them in xemacs.def.in.in since postgresql.c now uses them.
Clean up indentation in lread.c and a few other places.
In text.h, document ASSERT_ASCTEXT_ASCII() and
ASSERT_ASCTEXT_ASCII_LEN(), group together the stand-in
encodings and add some more for DLL symbols, function and
variable names, etc.
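
The commit message above says that Ascbyte * arguments are asserted to be pure ASCII. A minimal sketch of that kind of check in plain C is shown here. The names `is_pure_ascii` and `checked_ascii` are hypothetical and chosen for illustration only; they are not the ASSERT_ASCTEXT_ASCII() macro from text.h, whose definition does not appear on this page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the XEmacs sources): returns 1 if every
   byte of the NUL-terminated string is 7-bit ASCII, 0 otherwise. */
static int
is_pure_ascii (const char *s)
{
  const unsigned char *p = (const unsigned char *) s;

  for (; *p; p++)
    if (*p >= 0x80)
      return 0;
  return 1;
}

/* Hypothetical wrapper: asserts that the argument is pure ASCII and
   returns it unchanged, so it can be used inline at a call site. */
static const char *
checked_ascii (const char *s)
{
  assert (is_pure_ascii (s));
  return s;
}
```

A call site that passes a string literal through `checked_ascii` would abort in a debug build as soon as a non-ASCII byte slipped in, which is presumably the kind of early failure the Ascbyte * convention is meant to provide.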
author | Ben Wing <ben@xemacs.org> |
---|---|
date | Tue, 26 Jan 2010 23:22:30 -0600 |
parents | b3ea9c582280 |
children | 304aebb79cd3 |
rev | line source |
---|---|
771 | 1 /* Code to handle Unicode conversion. |
4834
b3ea9c582280
Use new cygwin_conv_path API with Cygwin 1.7 for converting names between Win32 and POSIX, UTF-8-aware, with attendant changes elsewhere
Ben Wing <ben@xemacs.org>
parents:
4824
diff
changeset
|
2 Copyright (C) 2000, 2001, 2002, 2003, 2004, 2005, 2010 Ben Wing. |
771 | 3 |
4 This file is part of XEmacs. | |
5 | |
6 XEmacs is free software; you can redistribute it and/or modify it | |
7 under the terms of the GNU General Public License as published by the | |
8 Free Software Foundation; either version 2, or (at your option) any | |
9 later version. | |
10 | |
11 XEmacs is distributed in the hope that it will be useful, but WITHOUT | |
12 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | |
13 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License | |
14 for more details. | |
15 | |
16 You should have received a copy of the GNU General Public License | |
17 along with XEmacs; see the file COPYING. If not, write to | |
18 the Free Software Foundation, Inc., 59 Temple Place - Suite 330, | |
19 Boston, MA 02111-1307, USA. */ | |
20 | |
21 /* Synched up with: FSF 20.3. Not in FSF. */ | |
22 | |
23 /* Authorship: | |
24 | |
25 Current primary author: Ben Wing <ben@xemacs.org> | |
26 | |
27 Written by Ben Wing <ben@xemacs.org>, June, 2001. | |
28 Separated out into this file, August, 2001. | |
29 Includes Unicode coding systems, some parts of which have been written | |
877 | 30 by someone else. #### Morioka and Hayashi, I think. |
771 | 31 |
32 As of September 2001, the detection code is here and abstraction of the | |
877 | 33 detection system is finished. The unicode detectors have been rewritten |
771 | 34 to include multiple levels of likelihood. |
35 */ | |
36 | |
37 #include <config.h> | |
38 #include "lisp.h" | |
39 | |
40 #include "charset.h" | |
41 #include "file-coding.h" | |
42 #include "opaque.h" | |
43 | |
4690
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
44 #include "buffer.h" |
45 #include "rangetab.h" |
46 #include "extents.h" |
47 |
771 | 48 #include "sysfile.h" |
49 | |
2367 | 50 /* For more info about how Unicode works under Windows, see intl-win32.c. */ |
51 | |
52 /* Info about Unicode translation tables [ben]: | |
53 | |
54 FORMAT: | |
55 ------- | |
56 | |
57 We currently use the following format for tables: | |
58 | |
59 If dimension == 1, to_unicode_table is a 96-element array of ints | |
60 (Unicode code points); else, it's a 96-element array of int * pointers, | |
61 each of which points to a 96-element array of ints. If no elements in a | |
62 row have been filled in, the pointer will point to a default empty | |
63 table; that way, memory usage is more reasonable but lookup still fast. | |
64 | |
65 -- If from_unicode_levels == 1, from_unicode_table is a 256-element | |
66 array of shorts (octet 1 in high byte, octet 2 in low byte; we don't | |
67 store Ichars directly to save space). | |
68 | |
69 -- If from_unicode_levels == 2, from_unicode_table is a 256-element | |
70 array of short * pointers, each of which points to a 256-element array | |
71 of shorts. | |
72 | |
73 -- If from_unicode_levels == 3, from_unicode_table is a 256-element | |
74 array of short ** pointers, each of which points to a 256-element array | |
75 of short * pointers, each of which points to a 256-element array of | |
76 shorts. | |
77 | |
78 -- If from_unicode_levels == 4, same thing but one level deeper. | |
79 | |
80 Just as for to_unicode_table, we use default tables to fill in all | |
81 entries with no values in them. | |
82 | |
83 #### An obvious space-saving optimization is to use variable-sized | |
84 tables, where each table instead of just being a 256-element array, is a | |
85 structure with a start value, an end value, and a variable number of | |
86 entries (END - START + 1). Only 8 bits are needed for END and START, | |
87 and could be stored at the end to avoid alignment problems. However, | |
88 before charging off and implementing this, we need to consider whether | |
89 it's worth it: | |
90 | |
91 (1) Most tables will be highly localized in which code points are | |
92 defined, heavily reducing the possible memory waste. Before doing any | |
93 rewriting, write some code to see how much memory is actually being | |
94 wasted (i.e. ratio of empty entries to total # of entries) and only | |
95 start rewriting if it's unacceptably high. You have to check over all | |
96 charsets. | |
97 | |
98 (2) Since entries are usually added one at a time, you have to be very | |
99 careful when creating the tables to avoid realloc()/free() thrashing in | |
100 the common case when you are in an area of high localization and are | |
101 going to end up using most entries in the table. You'd certainly want | |
102 to allow only certain sizes, not arbitrary ones (probably powers of 2, | |
103 where you want the entire block including the START/END values to fit | |
104 into a power of 2, minus any malloc overhead if there is any -- there's | |
105 none under gmalloc.c, and probably most system malloc() functions are | |
106 quite smart nowadays and also have no overhead). You could optimize | |
107 somewhat during the in-C initializations, because you can compute the | |
108 actual usage of various tables by scanning the entries you're going to | |
109 add in a separate pass before adding them. (You could actually do the | |
110 same thing when entries are added on the Lisp level by making the | |
111 assumption that all the entries will come in one after another before | |
112 any use is made of the data. So as they're coming in, you just store | |
113 them in a big long list, and the first time you need to retrieve an | |
114 entry, you compute the whole table at once.) You'd still have to deal | |
115 with the possibility of later entries coming in, though. | |
116 | |
117 (3) You do lose some speed using START/END values, since you need a | |
118 couple of comparisons at each level. This could easily make each single | |
119 lookup become 3-4 times slower. The Unicode book considers this a big | |
120 issue, and recommends against variable-sized tables for this reason; | |
121 however, they almost certainly have in mind applications that primarily | |
122 involve conversion of large amounts of data. Most Unicode strings that | |
123 are translated in XEmacs are fairly small. The only place where this | |
124 might matter is in loading large files -- e.g. a 3-megabyte | |
125 Unicode-encoded file. So think about this, and maybe do a trial | |
126 implementation where you don't worry too much about the intricacies of | |
127 (2) and just implement some basic "multiply by 1.5" trick or something | |
128 to do the resizing. There is a very good FAQ on Unicode called | |
129 something like the Linux-Unicode How-To (it should be part of the Linux | |
130 How-To's, I think), that lists the url of a guy with a whole bunch of | |
131 unicode files you can use to stress-test your implementations, and he's | |
132 highly likely to have a good multi-megabyte Unicode-encoded file (with | |
133 normal text in it -- if you created your own just by creating repeated | |
134 strings of letters and numbers, you probably wouldn't get accurate | |
135 results). | |
136 | |
137 INITIALIZATION: | |
138 --------------- | |
139 | |
140 There are advantages and disadvantages to loading the tables at | |
141 run-time. | |
142 | |
143 Advantages: | |
144 | |
145 They're big, and it's very fast to recreate them (a fraction of a second | |
146 on modern processors). | |
147 | |
148 Disadvantages: | |
149 | |
150 (1) User-defined charsets: It would be inconvenient to require all | |
151 dumped user-defined charsets to be reloaded at init time. | |
152 | |
153 NB With run-time loading, we load in init-mule-at-startup, in | |
154 mule-cmds.el. This is called from startup.el, which is quite late in | |
155 the initialization process -- but data-directory isn't set until then. | |
156 With dump-time loading, you still can't dump in a Japanese directory | |
157 (again, until we move to Unicode internally), but this is not such an | |
158 imposition. | |
159 | |
160 | |
161 */ | |
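
The FORMAT notes in the comment above describe multi-level tables in which unfilled rows point at a shared default table. A minimal two-level sketch of that idea follows. It is not the XEmacs implementation: the names `table2`, `blank_row`, `table2_set` and `table2_get`, and the use of -1 as the "no mapping" value, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define EMPTY_ENTRY ((short) -1)   /* assumed "no mapping" sentinel */

static short blank_row[256];       /* shared default row, all empty */
static short *table2[256];         /* level 2: 256 row pointers */

/* Point every row at the shared blank row, so an untouched table costs
   one row of storage instead of 256. */
static void
table2_init (void)
{
  int i;

  for (i = 0; i < 256; i++)
    blank_row[i] = EMPTY_ENTRY;
  for (i = 0; i < 256; i++)
    table2[i] = blank_row;
}

/* Store VALUE for the 16-bit CODE.  A private row is allocated the
   first time a row is written.  Returns 0 on success, -1 if malloc
   fails. */
static int
table2_set (unsigned code, short value)
{
  unsigned hi = (code >> 8) & 0xFF, lo = code & 0xFF;

  if (table2[hi] == blank_row)
    {
      short *row = malloc (256 * sizeof (short));
      int i;

      if (!row)
        return -1;
      for (i = 0; i < 256; i++)
        row[i] = EMPTY_ENTRY;
      table2[hi] = row;
    }
  table2[hi][lo] = value;
  return 0;
}

/* Lookup is two array indexings with no branches, because blank rows
   are real arrays rather than NULL pointers. */
static short
table2_get (unsigned code)
{
  return table2[(code >> 8) & 0xFF][code & 0xFF];
}
```

The point of the shared blank row is visible in `table2_get`: the lookup path never has to test for a missing row.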
162 | |
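
The "####" note in the comment above proposes variable-sized tables holding a START value, an END value and END - START + 1 entries. The sketch below illustrates that proposal under stated assumptions: `struct var_table`, `var_table_get`, `var_table_set` and the -1 sentinel are hypothetical, and the table grows to an exact fit on every out-of-range insert, which is precisely the reallocation pattern that point (2) of the comment warns about. A real implementation would likely round sizes up.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_MAPPING ((short) -1)    /* assumed "no mapping" sentinel */

/* Hypothetical variable-sized table covering indexes START..END. */
struct var_table
{
  unsigned char start;
  unsigned char end;
  short *entries;                  /* end - start + 1 elements, or NULL */
};

/* Lookup needs the extra range comparisons that point (3) mentions. */
static short
var_table_get (const struct var_table *t, unsigned idx)
{
  if (t->entries == NULL || idx < t->start || idx > t->end)
    return NO_MAPPING;
  return t->entries[idx - t->start];
}

/* Store VALUE at IDX (0..255), widening the covered range if needed.
   Returns 0 on success, -1 on a bad index or allocation failure. */
static int
var_table_set (struct var_table *t, unsigned idx, short value)
{
  unsigned new_start, new_end, n, i;
  short *fresh;

  if (idx > 255)
    return -1;
  new_start = (t->entries && t->start < idx) ? t->start : idx;
  new_end = (t->entries && t->end > idx) ? t->end : idx;
  if (t->entries == NULL || new_start != t->start || new_end != t->end)
    {
      n = new_end - new_start + 1;
      fresh = malloc (n * sizeof (short));
      if (!fresh)
        return -1;
      for (i = 0; i < n; i++)
        fresh[i] = NO_MAPPING;
      if (t->entries)
        for (i = t->start; i <= t->end; i++)
          fresh[i - new_start] = t->entries[i - t->start];
      free (t->entries);
      t->entries = fresh;
      t->start = (unsigned char) new_start;
      t->end = (unsigned char) new_end;
    }
  t->entries[idx - t->start] = value;
  return 0;
}
```

Compared with a fixed 256-element row, this trades memory for the two comparisons in `var_table_get`, which is the speed cost the comment weighs.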
771 | 163 /* #### WARNING! The current sledgehammer routines have a fundamental |
164 problem in that they can't handle two characters mapping to a | |
165 single Unicode codepoint or vice-versa in a single charset table. | |
166 It's not clear there is any way to handle this and still make the | |
877 | 167 sledgehammer routines useful. |
168 | |
169 Inquiring Minds Want To Know Dept: does the above WARNING mean that | |
170 _if_ it happens, then it will signal error, or then it will do | |
171 something evil and unpredictable? Signaling an error is OK: for | |
172 all national standards, the national to Unicode map is an inclusion | |
173 (1-to-1). Any character set that does not behave that way is | |
1318 | 174 broken according to the Unicode standard. |
175 | |
2500 | 176 Answer: You will get an ABORT(), since the purpose of the sledgehammer |
1318 | 177 routines is self-checking. The above problem with non-1-to-1 mapping |
178 occurs in the Big5 tables, as provided by the Unicode Consortium. */ | |
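
The WARNING above concerns charset tables in which two characters map to the same Unicode code point. The sledgehammer routines themselves are not shown in this part of the file, so the following is only a sketch of the property they are described as checking. The function name `mapping_is_one_to_one` and the -1 "unassigned" value are assumptions, and where the real code is said to ABORT(), this sketch returns a flag.

```c
#include <assert.h>

#define UNASSIGNED (-1)            /* assumed marker for an empty slot */

/* Hypothetical check: returns 1 if no Unicode code point appears twice
   among the assigned entries of TO_UNICODE (N entries), 0 otherwise.
   Quadratic, which is acceptable for a 96-entry row. */
static int
mapping_is_one_to_one (const int *to_unicode, int n)
{
  int i, j;

  for (i = 0; i < n; i++)
    {
      if (to_unicode[i] == UNASSIGNED)
        continue;
      for (j = i + 1; j < n; j++)
        if (to_unicode[j] == to_unicode[i])
          return 0;
    }
  return 1;
}
```

A table like the Big5 one mentioned above, where two positions share a code point, would make this check fail.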
877 | 179 |
771 | 180 /* #define SLEDGEHAMMER_CHECK_UNICODE */ |
181 | |
182 /* When MULE is not defined, we may still need some Unicode support -- | |
183 in particular, some Windows API's always want Unicode, and the way | |
184 we've set up the Unicode encapsulation, we may as well go ahead and | |
185 always use the Unicode versions of split API's. (It would be | |
186 trickier to not use them, and pointless -- under NT, the ANSI API's | |
187 call the Unicode ones anyway, so in the case of structures, we'd be | |
188 converting from Unicode to ANSI structures, only to have the OS | |
189 convert them back.) */ | |
190 | |
191 Lisp_Object Qunicode; | |
4096 | 192 Lisp_Object Qutf_16, Qutf_8, Qucs_4, Qutf_7, Qutf_32; |
771 | 193 Lisp_Object Qneed_bom; |
194 | |
195 Lisp_Object Qutf_16_little_endian, Qutf_16_bom; | |
196 Lisp_Object Qutf_16_little_endian_bom; | |
197 | |
985 | 198 Lisp_Object Qutf_8_bom; |
199 | |
4690 | 200 #ifdef MULE |
201 /* These range tables are not directly accessible from Lisp: */ |
202 static Lisp_Object Vunicode_invalid_and_query_skip_chars; |
203 static Lisp_Object Vutf_8_invalid_and_query_skip_chars; |
204 static Lisp_Object Vunicode_query_skip_chars; |
205 |
206 static Lisp_Object Vunicode_query_string, Vunicode_invalid_string, |
207 Vutf_8_invalid_string; |
208 #endif /* MULE */ |
209 |
3952 | 210 /* See the Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#35 for this |
211 algorithm. | |
212 | |
213 (They also give another, really verbose one, as part of their explanation | |
214 of the various planes of the encoding, but we won't use that.) */ | |
215 | |
216 #define UTF_16_LEAD_OFFSET (0xD800 - (0x10000 >> 10)) | |
217 #define UTF_16_SURROGATE_OFFSET (0x10000 - (0xD800 << 10) - 0xDC00) | |
218 | |
219 #define utf_16_surrogates_to_code(lead, trail) \ | |
220 (((lead) << 10) + (trail) + UTF_16_SURROGATE_OFFSET) | |
221 | |
222 #define CODE_TO_UTF_16_SURROGATES(codepoint, lead, trail) do { \ | |
223 int __ctu16s_code = (codepoint); \ | |
224 lead = UTF_16_LEAD_OFFSET + (__ctu16s_code >> 10); \ | |
225 trail = 0xDC00 + (__ctu16s_code & 0x3FF); \ | |
226 } while (0) | |
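
As a worked example of the surrogate macros above, the block below repeats the four definitions so that it compiles on its own, then round-trips a code point through them. The wrapper function `surrogate_round_trip` is hypothetical and exists only for the demonstration. For U+1F600 the arithmetic gives lead 0xD7C0 + 0x7D = 0xD83D and trail 0xDC00 + 0x200 = 0xDE00.

```c
#include <assert.h>

/* Copied from the listing above so the example is self-contained. */
#define UTF_16_LEAD_OFFSET (0xD800 - (0x10000 >> 10))
#define UTF_16_SURROGATE_OFFSET (0x10000 - (0xD800 << 10) - 0xDC00)

#define utf_16_surrogates_to_code(lead, trail) \
  (((lead) << 10) + (trail) + UTF_16_SURROGATE_OFFSET)

#define CODE_TO_UTF_16_SURROGATES(codepoint, lead, trail) do {	\
  int __ctu16s_code = (codepoint);				\
  lead = UTF_16_LEAD_OFFSET + (__ctu16s_code >> 10);		\
  trail = 0xDC00 + (__ctu16s_code & 0x3FF);			\
} while (0)

/* Hypothetical helper: splits CODEPOINT (U+10000..U+10FFFF) into a
   surrogate pair, stores the pair through the out parameters, and
   returns the code point recombined from that pair. */
static int
surrogate_round_trip (int codepoint, int *lead_out, int *trail_out)
{
  int lead, trail;

  CODE_TO_UTF_16_SURROGATES (codepoint, lead, trail);
  *lead_out = lead;
  *trail_out = trail;
  return utf_16_surrogates_to_code (lead, trail);
}
```

The macros do no range checking, so the caller is responsible for passing only code points above U+FFFF.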
227 | |
771 | 228 #ifdef MULE |
229 | |
3352 | 230 /* Using ints for to_unicode is OK (as long as they are >= 32 bits). |
231 In from_unicode, we're converting from Mule characters, which means | |
232 that the values being converted to are only 96x96, and we can save | |
233 space by using shorts (signedness doesn't matter). */ | |
771 | 234 static int *to_unicode_blank_1; |
235 static int **to_unicode_blank_2; | |
236 | |
237 static short *from_unicode_blank_1; | |
238 static short **from_unicode_blank_2; | |
239 static short ***from_unicode_blank_3; | |
240 static short ****from_unicode_blank_4; | |
241 | |
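
The comment above the blank-table declarations explains why from_unicode entries fit in a short, and the FORMAT notes say octet 1 goes in the high byte and octet 2 in the low byte. A small sketch of that packing follows. The helper names are hypothetical, and the compile-time check is one possible way to state the "ints are at least 32 bits" assumption rather than something taken from the XEmacs sources.

```c
#include <assert.h>
#include <limits.h>

/* Fails to compile if int is narrower than 32 bits (array size -1). */
typedef char int_has_32_bits[(sizeof (int) * CHAR_BIT >= 32) ? 1 : -1];

/* Hypothetical helper: octet 1 in the high byte, octet 2 in the low
   byte.  When octet 1 is 0x80 or above, the conversion to short is
   implementation-defined in ISO C; common compilers wrap. */
static short
pack_octets (int octet1, int octet2)
{
  return (short) ((octet1 << 8) | octet2);
}

/* Going through unsigned short makes the result independent of the
   sign of the stored value. */
static int
unpack_octet1 (short packed)
{
  return ((unsigned short) packed >> 8) & 0xFF;
}

static int
unpack_octet2 (short packed)
{
  return packed & 0xFF;
}
```

Unpacking through `unsigned short` is what lets the stored value be negative without affecting the recovered octets, in line with the remark that signedness does not matter.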
1204 | 242 static const struct memory_description to_unicode_level_0_desc_1[] = { |
771 | 243 { XD_END } |
244 }; | |
245 | |
1204 | 246 static const struct sized_memory_description to_unicode_level_0_desc = { |
247 sizeof (int), to_unicode_level_0_desc_1 | |
771 | 248 }; |
249 | |
1204 | 250 static const struct memory_description to_unicode_level_1_desc_1[] = { |
2551 | 251 { XD_BLOCK_PTR, 0, 96, { &to_unicode_level_0_desc } }, |
771 | 252 { XD_END } |
253 }; | |
254 | |
1204 | 255 static const struct sized_memory_description to_unicode_level_1_desc = { |
256 sizeof (void *), to_unicode_level_1_desc_1 | |
771 | 257 }; |
258 | |
1204 | 259 static const struct memory_description to_unicode_description_1[] = { |
2551 | 260 { XD_BLOCK_PTR, 1, 96, { &to_unicode_level_0_desc } }, |
261 { XD_BLOCK_PTR, 2, 96, { &to_unicode_level_1_desc } }, | |
771 | 262 { XD_END } |
263 }; | |
264 | |
265 /* Not static because each charset has a set of to and from tables and | |
266 needs to describe them to pdump. */ | |
1204 | 267 const struct sized_memory_description to_unicode_description = { |
268 sizeof (void *), to_unicode_description_1 | |
269 }; | |
270 | |
2367 | 271 /* Used only for to_unicode_blank_2 */ |
272 static const struct memory_description to_unicode_level_2_desc_1[] = { | |
2551 | 273 { XD_BLOCK_PTR, 0, 96, { &to_unicode_level_1_desc } }, |
2367 | 274 { XD_END } |
275 }; | |
276 | |
1204 | 277 static const struct memory_description from_unicode_level_0_desc_1[] = { |
771 | 278 { XD_END } |
279 }; | |
280 | |
1204 | 281 static const struct sized_memory_description from_unicode_level_0_desc = { |
282 sizeof (short), from_unicode_level_0_desc_1 | |
771 | 283 }; |
284 | |
1204 | 285 static const struct memory_description from_unicode_level_1_desc_1[] = { |
2551 | 286 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_0_desc } }, |
771 | 287 { XD_END } |
288 }; | |
289 | |
1204 | 290 static const struct sized_memory_description from_unicode_level_1_desc = { |
291 sizeof (void *), from_unicode_level_1_desc_1 | |
771 | 292 }; |
293 | |
1204 | 294 static const struct memory_description from_unicode_level_2_desc_1[] = { |
2551 | 295 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_1_desc } }, |
771 | 296 { XD_END } |
297 }; | |
298 | |
1204 | 299 static const struct sized_memory_description from_unicode_level_2_desc = { |
300 sizeof (void *), from_unicode_level_2_desc_1 | |
771 | 301 }; |
302 | |
1204 | 303 static const struct memory_description from_unicode_level_3_desc_1[] = { |
2551 | 304 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_2_desc } }, |
771 | 305 { XD_END } |
306 }; | |
307 | |
1204 | 308 static const struct sized_memory_description from_unicode_level_3_desc = { |
309 sizeof (void *), from_unicode_level_3_desc_1 | |
771 | 310 }; |
311 | |
1204 | 312 static const struct memory_description from_unicode_description_1[] = { |
2551 | 313 { XD_BLOCK_PTR, 1, 256, { &from_unicode_level_0_desc } }, |
314 { XD_BLOCK_PTR, 2, 256, { &from_unicode_level_1_desc } }, | |
315 { XD_BLOCK_PTR, 3, 256, { &from_unicode_level_2_desc } }, | |
316 { XD_BLOCK_PTR, 4, 256, { &from_unicode_level_3_desc } }, | |
771 | 317 { XD_END } |
318 }; | |
319 | |
320 /* Not static because each charset has a set of to and from tables and | |
321 needs to describe them to pdump. */ | |
1204 | 322 const struct sized_memory_description from_unicode_description = { |
323 sizeof (void *), from_unicode_description_1 | |
771 | 324 }; |
325 | |
2367 | 326 /* Used only for from_unicode_blank_4 */ |
327 static const struct memory_description from_unicode_level_4_desc_1[] = { | |
2551 | 328 { XD_BLOCK_PTR, 0, 256, { &from_unicode_level_3_desc } }, |
2367 | 329 { XD_END } |
330 }; | |
331 | |
771 | 332 static Lisp_Object_dynarr *unicode_precedence_dynarr; |
333 | |
1204 | 334 static const struct memory_description lod_description_1[] = { |
335 XD_DYNARR_DESC (Lisp_Object_dynarr, &lisp_object_description), | |
771 | 336 { XD_END } |
337 }; | |
338 | |
1204 | 339 static const struct sized_memory_description lisp_object_dynarr_description = { |
771 | 340 sizeof (Lisp_Object_dynarr), |
341 lod_description_1 | |
342 }; | |
343 | |
344 Lisp_Object Vlanguage_unicode_precedence_list; | |
345 Lisp_Object Vdefault_unicode_precedence_list; | |
346 | |
347 Lisp_Object Qignore_first_column; | |
348 | |
3439 | 349 Lisp_Object Vcurrent_jit_charset; |
350 Lisp_Object Qlast_allocated_character; | |
351 Lisp_Object Qccl_encode_to_ucs_2; | |
352 | |
4268 | 353 Lisp_Object Vnumber_of_jit_charsets; |
354 Lisp_Object Vlast_jit_charset_final; | |
355 Lisp_Object Vcharset_descr; | |
356 | |
357 | |
771 | 358 |
359 /************************************************************************/ | |
360 /* Unicode implementation */ | |
361 /************************************************************************/ | |
362 | |
363 #define BREAKUP_UNICODE_CODE(val, u1, u2, u3, u4, levels) \ | |
364 do { \ | |
365 int buc_val = (val); \ | |
366 \ | |
367 (u1) = buc_val >> 24; \ | |
368 (u2) = (buc_val >> 16) & 255; \ | |
369 (u3) = (buc_val >> 8) & 255; \ | |
370 (u4) = buc_val & 255; \ | |
371 (levels) = (buc_val <= 0xFF ? 1 : \ | |
372 buc_val <= 0xFFFF ? 2 : \ | |
373 buc_val <= 0xFFFFFF ? 3 : \ | |
374 4); \ | |
375 } while (0) | |
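The byte split performed by BREAKUP_UNICODE_CODE can be tried in isolation with the sketch below. The names `code_parts` and `breakup_code` are hypothetical and do not appear in unicode.c; only the arithmetic is taken from the macro. Note that the macro's first output parameter receives the most significant byte, while the callers in this file pass their variables in the order (u4, u3, u2, u1), so at the call sites u1 ends up holding the least significant byte.

```c
#include <assert.h>

/* Hypothetical stand-in for BREAKUP_UNICODE_CODE; not part of unicode.c. */
struct code_parts
{
  int b[4];    /* b[0] is the most significant byte, b[3] the least */
  int levels;  /* number of table levels needed to index this code */
};

static struct code_parts
breakup_code (int val)
{
  struct code_parts p;

  p.b[0] = val >> 24;
  p.b[1] = (val >> 16) & 255;
  p.b[2] = (val >> 8) & 255;
  p.b[3] = val & 255;
  /* One level per significant byte, exactly as in the macro. */
  p.levels = (val <= 0xFF ? 1 :
              val <= 0xFFFF ? 2 :
              val <= 0xFFFFFF ? 3 :
              4);
  return p;
}
```

For example, U+20AC splits into the bytes 0x00, 0x00, 0x20, 0xAC and needs two levels.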
376 | |
377 static void | |
378 init_blank_unicode_tables (void) | |
379 { | |
380 int i; | |
381 | |
382 from_unicode_blank_1 = xnew_array (short, 256); | |
383 from_unicode_blank_2 = xnew_array (short *, 256); | |
384 from_unicode_blank_3 = xnew_array (short **, 256); | |
385 from_unicode_blank_4 = xnew_array (short ***, 256); | |
386 for (i = 0; i < 256; i++) | |
387 { | |
877 | 388 /* #### IMWTK: Why does using -1 here work? Simply because there are |
1318 | 389 no existing 96x96 charsets? |
390 | |
391 Answer: I don't understand the concern. -1 indicates there is no | |
392 entry for this particular codepoint, which is always the case for | |
393 blank tables. */ | |
771 | 394 from_unicode_blank_1[i] = (short) -1; |
395 from_unicode_blank_2[i] = from_unicode_blank_1; | |
396 from_unicode_blank_3[i] = from_unicode_blank_2; | |
397 from_unicode_blank_4[i] = from_unicode_blank_3; | |
398 } | |
399 | |
400 to_unicode_blank_1 = xnew_array (int, 96); | |
401 to_unicode_blank_2 = xnew_array (int *, 96); | |
402 for (i = 0; i < 96; i++) | |
403 { | |
877 | 404 /* Here -1 is guaranteed OK. */ |
771 | 405 to_unicode_blank_1[i] = -1; |
406 to_unicode_blank_2[i] = to_unicode_blank_1; | |
407 } | |
408 } | |
409 | |
410 static void * | |
411 create_new_from_unicode_table (int level) | |
412 { | |
413 switch (level) | |
414 { | |
415 /* WARNING: If you are thinking of compressing these, keep in | |
416 mind that sizeof (short) does not equal sizeof (short *). */ | |
417 case 1: | |
418 { | |
419 short *newtab = xnew_array (short, 256); | |
420 memcpy (newtab, from_unicode_blank_1, 256 * sizeof (short)); | |
421 return newtab; | |
422 } | |
423 case 2: | |
424 { | |
425 short **newtab = xnew_array (short *, 256); | |
426 memcpy (newtab, from_unicode_blank_2, 256 * sizeof (short *)); | |
427 return newtab; | |
428 } | |
429 case 3: | |
430 { | |
431 short ***newtab = xnew_array (short **, 256); | |
432 memcpy (newtab, from_unicode_blank_3, 256 * sizeof (short **)); | |
433 return newtab; | |
434 } | |
435 case 4: | |
436 { | |
437 short ****newtab = xnew_array (short ***, 256); | |
438 memcpy (newtab, from_unicode_blank_4, 256 * sizeof (short ***)); | |
439 return newtab; | |
440 } | |
441 default: | |
2500 | 442 ABORT (); |
771 | 443 return 0; |
444 } | |
445 } | |
446 | |
877 | 447 /* Allocate and blank the tables. |
1318 | 448 Loading them up is done by load-unicode-mapping-table. */ |
771 | 449 void |
450 init_charset_unicode_tables (Lisp_Object charset) | |
451 { | |
452 if (XCHARSET_DIMENSION (charset) == 1) | |
453 { | |
454 int *to_table = xnew_array (int, 96); | |
455 memcpy (to_table, to_unicode_blank_1, 96 * sizeof (int)); | |
456 XCHARSET_TO_UNICODE_TABLE (charset) = to_table; | |
457 } | |
458 else | |
459 { | |
460 int **to_table = xnew_array (int *, 96); | |
461 memcpy (to_table, to_unicode_blank_2, 96 * sizeof (int *)); | |
462 XCHARSET_TO_UNICODE_TABLE (charset) = to_table; | |
463 } | |
464 | |
465 { | |
2367 | 466 XCHARSET_FROM_UNICODE_TABLE (charset) = |
467 create_new_from_unicode_table (1); | |
771 | 468 XCHARSET_FROM_UNICODE_LEVELS (charset) = 1; |
469 } | |
470 } | |
471 | |
472 static void | |
473 free_from_unicode_table (void *table, int level) | |
474 { | |
475 int i; | |
476 | |
477 switch (level) | |
478 { | |
479 case 2: | |
480 { | |
481 short **tab = (short **) table; | |
482 for (i = 0; i < 256; i++) | |
483 { | |
484 if (tab[i] != from_unicode_blank_1) | |
485 free_from_unicode_table (tab[i], 1); | |
486 } | |
487 break; | |
488 } | |
489 case 3: | |
490 { | |
491 short ***tab = (short ***) table; | |
492 for (i = 0; i < 256; i++) | |
493 { | |
494 if (tab[i] != from_unicode_blank_2) | |
495 free_from_unicode_table (tab[i], 2); | |
496 } | |
497 break; | |
498 } | |
499 case 4: | |
500 { | |
501 short ****tab = (short ****) table; | |
502 for (i = 0; i < 256; i++) | |
503 { | |
504 if (tab[i] != from_unicode_blank_3) | |
505 free_from_unicode_table (tab[i], 3); | |
506 } | |
507 break; | |
508 } | |
509 } | |
510 | |
1726 | 511 xfree (table, void *); |
771 | 512 } |
513 | |
514 static void | |
515 free_to_unicode_table (void *table, int level) | |
516 { | |
517 if (level == 2) | |
518 { | |
519 int i; | |
520 int **tab = (int **) table; | |
521 | |
522 for (i = 0; i < 96; i++) | |
523 { | |
524 if (tab[i] != to_unicode_blank_1) | |
525 free_to_unicode_table (tab[i], 1); | |
526 } | |
527 } | |
528 | |
1726 | 529 xfree (table, void *); |
771 | 530 } |
531 | |
532 void | |
533 free_charset_unicode_tables (Lisp_Object charset) | |
534 { | |
535 free_to_unicode_table (XCHARSET_TO_UNICODE_TABLE (charset), | |
536 XCHARSET_DIMENSION (charset)); | |
537 free_from_unicode_table (XCHARSET_FROM_UNICODE_TABLE (charset), | |
538 XCHARSET_FROM_UNICODE_LEVELS (charset)); | |
539 } | |
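The allocation and freeing scheme above relies on shared "blank" tables: every empty slot points at one common blank table, and a private table is allocated only when a slot is first written. A minimal two-level sketch of that idea follows, assuming plain malloc/free in place of xnew_array/xfree. All names in it (`blank_leaf`, `new_root`, `table_set`, `table_get`, `free_root`) are hypothetical; unicode.c itself uses up to four levels.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-level sketch; not the actual unicode.c data structure. */
static short blank_leaf[256];
static int blank_ready = 0;

/* Fill the shared blank leaf with -1 ("no entry") once. */
static void
init_blank (void)
{
  int i;

  if (blank_ready)
    return;
  for (i = 0; i < 256; i++)
    blank_leaf[i] = (short) -1;
  blank_ready = 1;
}

/* Create a root whose 256 slots all point at the shared blank leaf. */
static short **
new_root (void)
{
  short **root = malloc (256 * sizeof (short *));
  int i;

  init_blank ();
  if (!root)
    abort ();
  for (i = 0; i < 256; i++)
    root[i] = blank_leaf;
  return root;
}

/* Store VAL for CODE, giving the slot a private leaf on first write. */
static void
table_set (short **root, int code, short val)
{
  int hi = (code >> 8) & 255, lo = code & 255;

  if (root[hi] == blank_leaf)
    {
      short *leaf = malloc (256 * sizeof (short));
      if (!leaf)
        abort ();
      memcpy (leaf, blank_leaf, 256 * sizeof (short));
      root[hi] = leaf;
    }
  root[hi][lo] = val;
}

/* Look up CODE; -1 means there is no entry. */
static short
table_get (short **root, int code)
{
  return root[(code >> 8) & 255][code & 255];
}

/* Free private leaves only; the shared blank leaf is never freed. */
static void
free_root (short **root)
{
  int i;

  for (i = 0; i < 256; i++)
    if (root[i] != blank_leaf)
      free (root[i]);
  free (root);
}
```

The comparison against the blank table in `free_root` mirrors the `tab[i] != from_unicode_blank_1` tests in free_from_unicode_table.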
540 | |
541 #ifdef MEMORY_USAGE_STATS | |
542 | |
543 static Bytecount | |
544 compute_from_unicode_table_size_1 (void *table, int level, | |
545 struct overhead_stats *stats) | |
546 { | |
547 int i; | |
548 Bytecount size = 0; | |
549 | |
550 switch (level) | |
551 { | |
552 case 2: | |
553 { | |
554 short **tab = (short **) table; | |
555 for (i = 0; i < 256; i++) | |
556 { | |
557 if (tab[i] != from_unicode_blank_1) | |
558 size += compute_from_unicode_table_size_1 (tab[i], 1, stats); | |
559 } | |
560 break; | |
561 } | |
562 case 3: | |
563 { | |
564 short ***tab = (short ***) table; | |
565 for (i = 0; i < 256; i++) | |
566 { | |
567 if (tab[i] != from_unicode_blank_2) | |
568 size += compute_from_unicode_table_size_1 (tab[i], 2, stats); | |
569 } | |
570 break; | |
571 } | |
572 case 4: | |
573 { | |
574 short ****tab = (short ****) table; | |
575 for (i = 0; i < 256; i++) | |
576 { | |
577 if (tab[i] != from_unicode_blank_3) | |
578 size += compute_from_unicode_table_size_1 (tab[i], 3, stats); | |
579 } | |
580 break; | |
581 } | |
582 } | |
583 | |
3024 | 584 size += malloced_storage_size (table, |
771 | 585 256 * (level == 1 ? sizeof (short) : |
586 sizeof (void *)), | |
587 stats); | |
588 return size; | |
589 } | |
590 | |
591 static Bytecount | |
592 compute_to_unicode_table_size_1 (void *table, int level, | |
593 struct overhead_stats *stats) | |
594 { | |
595 Bytecount size = 0; | |
596 | |
597 if (level == 2) | |
598 { | |
599 int i; | |
600 int **tab = (int **) table; | |
601 | |
602 for (i = 0; i < 96; i++) | |
603 { | |
604 if (tab[i] != to_unicode_blank_1) | |
605 size += compute_to_unicode_table_size_1 (tab[i], 1, stats); | |
606 } | |
607 } | |
608 | |
3024 | 609 size += malloced_storage_size (table, |
771 | 610 96 * (level == 1 ? sizeof (int) : |
611 sizeof (void *)), | |
612 stats); | |
613 return size; | |
614 } | |
615 | |
616 Bytecount | |
617 compute_from_unicode_table_size (Lisp_Object charset, | |
618 struct overhead_stats *stats) | |
619 { | |
620 return (compute_from_unicode_table_size_1 | |
621 (XCHARSET_FROM_UNICODE_TABLE (charset), | |
622 XCHARSET_FROM_UNICODE_LEVELS (charset), | |
623 stats)); | |
624 } | |
625 | |
626 Bytecount | |
627 compute_to_unicode_table_size (Lisp_Object charset, | |
628 struct overhead_stats *stats) | |
629 { | |
630 return (compute_to_unicode_table_size_1 | |
631 (XCHARSET_TO_UNICODE_TABLE (charset), | |
632 XCHARSET_DIMENSION (charset), | |
633 stats)); | |
634 } | |
635 | |
636 #endif | |
637 | |
638 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
639 | |
640 /* "Sledgehammer checks" are checks that verify the self-consistency | |
641 of an entire structure every time a change is about to be made or | |
642 has been made to the structure. Not fast but a pretty much | |
643 sure-fire way of flushing out any errors in the algorithms | |
644 that create the structure. | |
645 | |
646 Checking only after a change has been made will speed things up by | |
647 a factor of 2, but it doesn't absolutely prove that the code just | |
648 checked caused the problem; perhaps it happened elsewhere, either | |
649 in some code you forgot to sledgehammer check or as a result of | |
650 data corruption. */ | |
651 | |
652 static void | |
653 assert_not_any_blank_table (void *tab) | |
654 { | |
655 assert (tab != from_unicode_blank_1); | |
656 assert (tab != from_unicode_blank_2); | |
657 assert (tab != from_unicode_blank_3); | |
658 assert (tab != from_unicode_blank_4); | |
659 assert (tab != to_unicode_blank_1); | |
660 assert (tab != to_unicode_blank_2); | |
661 assert (tab); | |
662 } | |
663 | |
664 static void | |
665 sledgehammer_check_from_table (Lisp_Object charset, void *table, int level, | |
666 int codetop) | |
667 { | |
668 int i; | |
669 | |
670 switch (level) | |
671 { | |
672 case 1: | |
673 { | |
674 short *tab = (short *) table; | |
675 for (i = 0; i < 256; i++) | |
676 { | |
677 if (tab[i] != -1) | |
678 { | |
679 Lisp_Object char_charset; | |
680 int c1, c2; | |
681 | |
867 | 682 assert (valid_ichar_p (tab[i])); |
683 BREAKUP_ICHAR (tab[i], char_charset, c1, c2); | |
771 | 684 assert (EQ (charset, char_charset)); |
685 if (XCHARSET_DIMENSION (charset) == 1) | |
686 { | |
687 int *to_table = | |
688 (int *) XCHARSET_TO_UNICODE_TABLE (charset); | |
689 assert_not_any_blank_table (to_table); | |
690 assert (to_table[c1 - 32] == (codetop << 8) + i); | |
691 } | |
692 else | |
693 { | |
694 int **to_table = | |
695 (int **) XCHARSET_TO_UNICODE_TABLE (charset); | |
696 assert_not_any_blank_table (to_table); | |
697 assert_not_any_blank_table (to_table[c1 - 32]); | |
698 assert (to_table[c1 - 32][c2 - 32] == (codetop << 8) + i); | |
699 } | |
700 } | |
701 } | |
702 break; | |
703 } | |
704 case 2: | |
705 { | |
706 short **tab = (short **) table; | |
707 for (i = 0; i < 256; i++) | |
708 { | |
709 if (tab[i] != from_unicode_blank_1) | |
710 sledgehammer_check_from_table (charset, tab[i], 1, | |
711 (codetop << 8) + i); | |
712 } | |
713 break; | |
714 } | |
715 case 3: | |
716 { | |
717 short ***tab = (short ***) table; | |
718 for (i = 0; i < 256; i++) | |
719 { | |
720 if (tab[i] != from_unicode_blank_2) | |
721 sledgehammer_check_from_table (charset, tab[i], 2, | |
722 (codetop << 8) + i); | |
723 } | |
724 break; | |
725 } | |
726 case 4: | |
727 { | |
728 short ****tab = (short ****) table; | |
729 for (i = 0; i < 256; i++) | |
730 { | |
731 if (tab[i] != from_unicode_blank_3) | |
732 sledgehammer_check_from_table (charset, tab[i], 3, | |
733 (codetop << 8) + i); | |
734 } | |
735 break; | |
736 } | |
737 default: | |
2500 | 738 ABORT (); |
771 | 739 } |
740 } | |
741 | |
742 static void | |
743 sledgehammer_check_to_table (Lisp_Object charset, void *table, int level, | |
744 int codetop) | |
745 { | |
746 int i; | |
747 | |
748 switch (level) | |
749 { | |
750 case 1: | |
751 { | |
752 int *tab = (int *) table; | |
753 | |
754 if (XCHARSET_CHARS (charset) == 94) | |
755 { | |
756 assert (tab[0] == -1); | |
757 assert (tab[95] == -1); | |
758 } | |
759 | |
760 for (i = 0; i < 96; i++) | |
761 { | |
762 if (tab[i] != -1) | |
763 { | |
764 int u4, u3, u2, u1, levels; | |
867 | 765 Ichar ch; |
766 Ichar this_ch; | |
771 | 767 short val; |
768 void *frtab = XCHARSET_FROM_UNICODE_TABLE (charset); | |
769 | |
770 if (XCHARSET_DIMENSION (charset) == 1) | |
867 | 771 this_ch = make_ichar (charset, i + 32, 0); |
771 | 772 else |
867 | 773 this_ch = make_ichar (charset, codetop + 32, i + 32); |
771 | 774 |
775 assert (tab[i] >= 0); | |
776 BREAKUP_UNICODE_CODE (tab[i], u4, u3, u2, u1, levels); | |
777 assert (levels <= XCHARSET_FROM_UNICODE_LEVELS (charset)); | |
778 | |
779 switch (XCHARSET_FROM_UNICODE_LEVELS (charset)) | |
780 { | |
781 case 1: val = ((short *) frtab)[u1]; break; | |
782 case 2: val = ((short **) frtab)[u2][u1]; break; | |
783 case 3: val = ((short ***) frtab)[u3][u2][u1]; break; | |
784 case 4: val = ((short ****) frtab)[u4][u3][u2][u1]; break; | |
2500 | 785 default: ABORT (); |
771 | 786 } |
787 | |
867 | 788 ch = make_ichar (charset, val >> 8, val & 0xFF); |
771 | 789 assert (ch == this_ch); |
790 | |
791 switch (XCHARSET_FROM_UNICODE_LEVELS (charset)) | |
792 { | |
793 case 4: | |
794 assert_not_any_blank_table (frtab); | |
795 frtab = ((short ****) frtab)[u4]; | |
796 /* fall through */ | |
797 case 3: | |
798 assert_not_any_blank_table (frtab); | |
799 frtab = ((short ***) frtab)[u3]; | |
800 /* fall through */ | |
801 case 2: | |
802 assert_not_any_blank_table (frtab); | |
803 frtab = ((short **) frtab)[u2]; | |
804 /* fall through */ | |
805 case 1: | |
806 assert_not_any_blank_table (frtab); | |
807 break; | |
2500 | 808 default: ABORT (); |
771 | 809 } |
810 } | |
811 } | |
812 break; | |
813 } | |
814 case 2: | |
815 { | |
816 int **tab = (int **) table; | |
817 | |
818 if (XCHARSET_CHARS (charset) == 94) | |
819 { | |
820 assert (tab[0] == to_unicode_blank_1); | |
821 assert (tab[95] == to_unicode_blank_1); | |
822 } | |
823 | |
824 for (i = 0; i < 96; i++) | |
825 { | |
826 if (tab[i] != to_unicode_blank_1) | |
827 sledgehammer_check_to_table (charset, tab[i], 1, i); | |
828 } | |
829 break; | |
830 } | |
831 default: | |
2500 | 832 ABORT (); |
771 | 833 } |
834 } | |
835 | |
836 static void | |
837 sledgehammer_check_unicode_tables (Lisp_Object charset) | |
838 { | |
839 /* verify that the blank tables have not been modified */ | |
840 int i; | |
841 int from_level = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
842 int to_level = XCHARSET_DIMENSION (charset); | |
843 | |
844 for (i = 0; i < 256; i++) | |
845 { | |
846 assert (from_unicode_blank_1[i] == (short) -1); | |
847 assert (from_unicode_blank_2[i] == from_unicode_blank_1); | |
848 assert (from_unicode_blank_3[i] == from_unicode_blank_2); | |
849 assert (from_unicode_blank_4[i] == from_unicode_blank_3); | |
850 } | |
851 | |
852 for (i = 0; i < 96; i++) | |
853 { | |
854 assert (to_unicode_blank_1[i] == -1); | |
855 assert (to_unicode_blank_2[i] == to_unicode_blank_1); | |
856 } | |
857 | |
858 assert (from_level >= 1 && from_level <= 4); | |
859 | |
860 sledgehammer_check_from_table (charset, | |
861 XCHARSET_FROM_UNICODE_TABLE (charset), | |
862 from_level, 0); | |
863 | |
864 sledgehammer_check_to_table (charset, | |
865 XCHARSET_TO_UNICODE_TABLE (charset), | |
866 XCHARSET_DIMENSION (charset), 0); | |
867 } | |
868 | |
869 #endif /* SLEDGEHAMMER_CHECK_UNICODE */ | |
870 | |
871 static void | |
867 | 872 set_unicode_conversion (Ichar chr, int code) |
771 | 873 { |
874 Lisp_Object charset; | |
875 int c1, c2; | |
876 | |
867 | 877 BREAKUP_ICHAR (chr, charset, c1, c2); |
771 | 878 |
877 | 879 /* I tried an assert on code > 255 || chr == code, but that fails because |
880 Mule gives many Latin characters separate code points for different | |
881 ISO 8859 coded character sets. Obvious in hindsight.... */ | |
882 assert (!EQ (charset, Vcharset_ascii) || chr == code); | |
883 assert (!EQ (charset, Vcharset_latin_iso8859_1) || chr == code); | |
884 assert (!EQ (charset, Vcharset_control_1) || chr == code); | |
885 | |
886 /* This assert is needed because it is simply unimplemented. */ | |
771 | 887 assert (!EQ (charset, Vcharset_composite)); |
888 | |
889 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
890 sledgehammer_check_unicode_tables (charset); | |
891 #endif | |
892 | |
2704 | 893 if (EQ(charset, Vcharset_ascii) || EQ(charset, Vcharset_control_1)) |
894 return; | |
895 | |
771 | 896 /* First, the char -> unicode translation */ |
897 | |
898 if (XCHARSET_DIMENSION (charset) == 1) | |
899 { | |
900 int *to_table = (int *) XCHARSET_TO_UNICODE_TABLE (charset); | |
901 to_table[c1 - 32] = code; | |
902 } | |
903 else | |
904 { | |
905 int **to_table_2 = (int **) XCHARSET_TO_UNICODE_TABLE (charset); | |
906 int *to_table_1; | |
907 | |
908 assert (XCHARSET_DIMENSION (charset) == 2); | |
909 to_table_1 = to_table_2[c1 - 32]; | |
910 if (to_table_1 == to_unicode_blank_1) | |
911 { | |
912 to_table_1 = xnew_array (int, 96); | |
913 memcpy (to_table_1, to_unicode_blank_1, 96 * sizeof (int)); | |
914 to_table_2[c1 - 32] = to_table_1; | |
915 } | |
916 to_table_1[c2 - 32] = code; | |
917 } | |
918 | |
919 /* Then, unicode -> char: much harder */ | |
920 | |
921 { | |
922 int charset_levels; | |
923 int u4, u3, u2, u1; | |
924 int code_levels; | |
925 BREAKUP_UNICODE_CODE (code, u4, u3, u2, u1, code_levels); | |
926 | |
927 charset_levels = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
928 | |
929 /* Make sure the charset's tables have at least as many levels as | |
930 the code point has: Note that the charset is guaranteed to have | |
931 at least one level, because it was created that way */ | |
932 if (charset_levels < code_levels) | |
933 { | |
934 int i; | |
935 | |
936 assert (charset_levels > 0); | |
937 for (i = 2; i <= code_levels; i++) | |
938 { | |
939 if (charset_levels < i) | |
940 { | |
941 void *old_table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
942 void *table = create_new_from_unicode_table (i); | |
943 XCHARSET_FROM_UNICODE_TABLE (charset) = table; | |
944 | |
945 switch (i) | |
946 { | |
947 case 2: | |
948 ((short **) table)[0] = (short *) old_table; | |
949 break; | |
950 case 3: | |
951 ((short ***) table)[0] = (short **) old_table; | |
952 break; | |
953 case 4: | |
954 ((short ****) table)[0] = (short ***) old_table; | |
955 break; | |
2500 | 956 default: ABORT (); |
771 | 957 } |
958 } | |
959 } | |
960 | |
961 charset_levels = code_levels; | |
962 XCHARSET_FROM_UNICODE_LEVELS (charset) = code_levels; | |
963 } | |
964 | |
965 /* Now, make sure there is a non-default table at each level */ | |
966 { | |
967 int i; | |
968 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
969 | |
970 for (i = charset_levels; i >= 2; i--) | |
971 { | |
972 switch (i) | |
973 { | |
974 case 4: | |
975 if (((short ****) table)[u4] == from_unicode_blank_3) | |
976 ((short ****) table)[u4] = | |
977 ((short ***) create_new_from_unicode_table (3)); | |
978 table = ((short ****) table)[u4]; | |
979 break; | |
980 case 3: | |
981 if (((short ***) table)[u3] == from_unicode_blank_2) | |
982 ((short ***) table)[u3] = | |
983 ((short **) create_new_from_unicode_table (2)); | |
984 table = ((short ***) table)[u3]; | |
985 break; | |
986 case 2: | |
987 if (((short **) table)[u2] == from_unicode_blank_1) | |
988 ((short **) table)[u2] = | |
989 ((short *) create_new_from_unicode_table (1)); | |
990 table = ((short **) table)[u2]; | |
991 break; | |
2500 | 992 default: ABORT (); |
771 | 993 } |
994 } | |
995 } | |
996 | |
997 /* Finally, set the character */ | |
998 | |
999 { | |
1000 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
1001 switch (charset_levels) | |
1002 { | |
1003 case 1: ((short *) table)[u1] = (c1 << 8) + c2; break; | |
1004 case 2: ((short **) table)[u2][u1] = (c1 << 8) + c2; break; | |
1005 case 3: ((short ***) table)[u3][u2][u1] = (c1 << 8) + c2; break; | |
1006 case 4: ((short ****) table)[u4][u3][u2][u1] = (c1 << 8) + c2; break; | |
2500 | 1007 default: ABORT (); |
771 | 1008 } |
1009 } | |
1010 } | |
1011 | |
1012 #ifdef SLEDGEHAMMER_CHECK_UNICODE | |
1013 sledgehammer_check_unicode_tables (charset); | |
1014 #endif | |
1015 } | |
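The level-growing step in set_unicode_conversion works because every code that fits in the old, shallower table has zero in its higher bytes, so the old table can simply become slot 0 of the new top level. The sketch below shows this for growth from one level to two, assuming plain malloc in place of xnew_array. The names `shared_blank` and `grow_to_two_levels` are hypothetical and are not taken from unicode.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared blank leaf; the caller fills it with -1. */
static short shared_blank[256];

/* Wrap a one-level table in a new two-level root.  Codes 0..0xFF have
   a zero high byte, so the old table becomes slot 0 of the new root;
   every other slot points at the shared blank leaf. */
static short **
grow_to_two_levels (short *old_table)
{
  short **root = malloc (256 * sizeof (short *));
  int i;

  if (!root)
    abort ();
  for (i = 0; i < 256; i++)
    root[i] = shared_blank;
  root[0] = old_table;
  return root;
}
```

Existing entries remain reachable at the same code after the growth; they are now addressed as `root[0][code]`.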
1016 | |
788 | 1017 int |
867 | 1018 ichar_to_unicode (Ichar chr) |
771 | 1019 { |
1020 Lisp_Object charset; | |
1021 int c1, c2; | |
1022 | |
867 | 1023 type_checking_assert (valid_ichar_p (chr)); |
877 | 1024 /* This shortcut depends on the representation of an Ichar, see text.c. */ |
771 | 1025 if (chr < 256) |
1026 return (int) chr; | |
1027 | |
867 | 1028 BREAKUP_ICHAR (chr, charset, c1, c2); |
771 | 1029 if (EQ (charset, Vcharset_composite)) |
1030 return -1; /* #### don't know how to handle */ | |
1031 else if (XCHARSET_DIMENSION (charset) == 1) | |
1032 return ((int *) XCHARSET_TO_UNICODE_TABLE (charset))[c1 - 32]; | |
1033 else | |
1034 return ((int **) XCHARSET_TO_UNICODE_TABLE (charset))[c1 - 32][c2 - 32]; | |
1035 } | |
1036 | |
867 | 1037 static Ichar |
3439 | 1038 get_free_codepoint(Lisp_Object charset) |
1039 { | |
1040 Lisp_Object name = Fcharset_name(charset); | |
1041 Lisp_Object zeichen = Fget(name, Qlast_allocated_character, Qnil); | |
1042 Ichar res; | |
1043 | |
1044 /* Only allow this with the 96x96 character sets we are using for | |
1045 temporary Unicode support. */ | |
1046 assert(2 == XCHARSET_DIMENSION(charset) && 96 == XCHARSET_CHARS(charset)); | |
1047 | |
1048 if (!NILP(zeichen)) | |
1049 { | |
1050 int c1, c2; | |
1051 | |
1052 BREAKUP_ICHAR(XCHAR(zeichen), charset, c1, c2); | |
1053 | |
1054 if (127 == c1 && 127 == c2) | |
1055 { | |
1056 /* We've already used the highest-numbered character in this | |
1057 set--tell our caller to create another. */ | |
1058 return -1; | |
1059 } | |
1060 | |
1061 if (127 == c2) | |
1062 { | |
1063 ++c1; | |
1064 c2 = 0x20; | |
1065 } | |
1066 else | |
1067 { | |
1068 ++c2; | |
1069 } | |
1070 | |
1071 res = make_ichar(charset, c1, c2); | |
1072 Fput(name, Qlast_allocated_character, make_char(res)); | |
1073 } | |
1074 else | |
1075 { | |
1076 res = make_ichar(charset, 32, 32); | |
1077 Fput(name, Qlast_allocated_character, make_char(res)); | |
1078 } | |
1079 return res; | |
1080 } | |
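The stepping logic in get_free_codepoint walks a 96x96 grid whose two position bytes each run from 32 to 127. A standalone sketch of just that stepping follows. The helper name `next_position` is hypothetical, and the sketch leaves out the property-list bookkeeping (Fget/Fput on the charset name) that the real function performs.

```c
#include <assert.h>

/* Hypothetical helper: advance (*c1, *c2) to the next position in a
   96x96 grid whose bytes run from 32 to 127.  Returns 1 on success,
   or 0 (leaving the position unchanged) when the grid is exhausted. */
static int
next_position (int *c1, int *c2)
{
  if (*c1 == 127 && *c2 == 127)
    return 0;           /* last position already used */
  if (*c2 == 127)
    {
      ++*c1;            /* move to the next row */
      *c2 = 32;         /* and restart the column */
    }
  else
    ++*c2;
  return 1;
}
```

A return value of 0 corresponds to the point where get_free_codepoint returns -1 and the caller creates another just-in-time charset.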
1081 | |
1082 /* The just-in-time creation of XEmacs characters that correspond to unknown | |
1083 Unicode code points happens when: | |
1084 | |
1085 1. The lookup would otherwise fail. | |
1086 | |
1087 2. The charsets array is nil or the default unicode precedence list. | |
1088 | |
1089 If there are no free code points in the just-in-time Unicode character | |
1090 set, and the charsets array is the default unicode precedence list, | |
1091 create a new just-in-time Unicode character set, add it at the end of the | |
1092 unicode precedence list, create the XEmacs character in that character | |
1093 set, and return it. */ | |
1094 | |
1095 static Ichar | |
877 | 1096 unicode_to_ichar (int code, Lisp_Object_dynarr *charsets) |
771 | 1097 { |
1098 int u1, u2, u3, u4; | |
1099 int code_levels; | |
1100 int i; | |
1101 int n = Dynarr_length (charsets); | |
1102 | |
1103 type_checking_assert (code >= 0); | |
877 | 1104 /* This shortcut depends on the representation of an Ichar, see text.c. |
1105 Note that it may _not_ be extended to U+00A0 to U+00FF (many ISO 8859 | |
893 | 1106 coded character sets have points that map into that region, so this |
1107 function is many-valued). */ | |
877 | 1108 if (code < 0xA0) |
867 | 1109 return (Ichar) code; |
771 | 1110 |
1111 BREAKUP_UNICODE_CODE (code, u4, u3, u2, u1, code_levels); | |
1112 | |
1113 for (i = 0; i < n; i++) | |
1114 { | |
1115 Lisp_Object charset = Dynarr_at (charsets, i); | |
1116 int charset_levels = XCHARSET_FROM_UNICODE_LEVELS (charset); | |
1117 if (charset_levels >= code_levels) | |
1118 { | |
1119 void *table = XCHARSET_FROM_UNICODE_TABLE (charset); | |
1120 short retval; | |
1121 | |
1122 switch (charset_levels) | |
1123 { | |
1124 case 1: retval = ((short *) table)[u1]; break; | |
1125 case 2: retval = ((short **) table)[u2][u1]; break; | |
1126 case 3: retval = ((short ***) table)[u3][u2][u1]; break; | |
1127 case 4: retval = ((short ****) table)[u4][u3][u2][u1]; break; | |
2500 | 1128 default: ABORT (); retval = 0; |
771 | 1129 } |
1130 | |
1131 if (retval != -1) | |
867 | 1132 return make_ichar (charset, retval >> 8, retval & 0xFF); |
771 | 1133 } |
1134 } | |
3439 | 1135 |
1136 /* Only do the magic just-in-time assignment if we're using the default | |
1137 list. */ | |
1138 if (unicode_precedence_dynarr == charsets) | |
1139 { | |
1140 if (NILP (Vcurrent_jit_charset) || | |
1141 (-1 == (i = get_free_codepoint(Vcurrent_jit_charset)))) | |
1142 { | |
3452 | 1143 Ibyte setname[32]; |
4268 | 1144 int number_of_jit_charsets = XINT (Vnumber_of_jit_charsets); |
1145 Ascbyte last_jit_charset_final = XCHAR (Vlast_jit_charset_final); | |
1146 | |
1147 /* The handling of the final byte here is admittedly inelegant. */ | |
1148 assert (last_jit_charset_final >= 0x30); | |
3439 | 1149 |
3452 | 1150 /* Assertion added partly because our Win32 layer doesn't |
1151 support snprintf; with this, we're sure it won't overflow | |
1152 the buffer. */ | |
1153 assert(100 > number_of_jit_charsets); | |
1154 | |
4268 | 1155 qxesprintf(setname, "jit-ucs-charset-%d", number_of_jit_charsets); |
1156 | |
3439 | 1157 Vcurrent_jit_charset = Fmake_charset |
4952 | 1158 (intern_int (setname), Vcharset_descr, |
3439 | 1159 /* Set encode-as-utf-8 to t, to have this character set written |
1160 using UTF-8 escapes in escape-quoted and ctext. This | |
1161 sidesteps the fact that our internal character -> Unicode | |
1162 mapping is not stable from one invocation to the next. */ | |
1163 nconc2 (list2(Qencode_as_utf_8, Qt), | |
1164 nconc2 (list6(Qcolumns, make_int(1), Qchars, make_int(96), | |
1165 Qdimension, make_int(2)), | |
3659 | 1166 list6(Qregistries, Qunicode_registries, |
4268 | 1167 Qfinal, make_char(last_jit_charset_final), |
3439 | 1168 /* This CCL program is initialised in |
1169 unicode.el. */ | |
1170 Qccl_program, Qccl_encode_to_ucs_2)))); | |
4268 | 1171 |
1172 /* Record for the Unicode infrastructure that we've created | |
1173 this character set. */ | |
1174 Vnumber_of_jit_charsets = make_int (number_of_jit_charsets + 1); | |
1175 Vlast_jit_charset_final = make_char (last_jit_charset_final + 1); | |
3439 | 1176 |
1177 i = get_free_codepoint(Vcurrent_jit_charset); | |
1178 } | |
1179 | |
1180 if (-1 != i) | |
1181 { | |
1182 set_unicode_conversion((Ichar)i, code); | |
1183 /* No need to add the charset to the end of the list; it's done | |
1184 automatically. */ | |
1185 } | |
1186 } | |
1187 return (Ichar) i; | |
771 | 1188 } |
1189 | |
877 | 1190 /* Add charsets to precedence list. |
1191 LIST must be a list of charsets. Charsets which are in the list more | |
1192 than once are given the precedence implied by their earliest appearance. | |
1193 Later appearances are ignored. */ | |
771 | 1194 static void |
1195 add_charsets_to_precedence_list (Lisp_Object list, int *lbs, | |
1196 Lisp_Object_dynarr *dynarr) | |
1197 { | |
1198 { | |
1199 EXTERNAL_LIST_LOOP_2 (elt, list) | |
1200 { | |
1201 Lisp_Object charset = Fget_charset (elt); | |
778 | 1202 int lb = XCHARSET_LEADING_BYTE (charset); |
771 | 1203 if (lbs[lb - MIN_LEADING_BYTE] == 0) |
1204 { | |
877 | 1205 Dynarr_add (dynarr, charset); |
771 | 1206 lbs[lb - MIN_LEADING_BYTE] = 1; |
1207 } | |
1208 } | |
1209 } | |
1210 } | |
1211 | |
877 | 1212 /* Rebuild the charset precedence array. |
1213 The "charsets preferred for the current language" get highest precedence, | |
1214 followed by the "charsets preferred by default", ordered as in | |
1215 Vlanguage_unicode_precedence_list and Vdefault_unicode_precedence_list, | |
1216 respectively. All remaining charsets follow in an arbitrary order. */ | |
771 | 1217 void |
1218 recalculate_unicode_precedence (void) | |
1219 { | |
1220 int lbs[NUM_LEADING_BYTES]; | |
1221 int i; | |
1222 | |
1223 for (i = 0; i < NUM_LEADING_BYTES; i++) | |
1224 lbs[i] = 0; | |
1225 | |
1226 Dynarr_reset (unicode_precedence_dynarr); | |
1227 | |
1228 add_charsets_to_precedence_list (Vlanguage_unicode_precedence_list, | |
1229 lbs, unicode_precedence_dynarr); | |
1230 add_charsets_to_precedence_list (Vdefault_unicode_precedence_list, | |
1231 lbs, unicode_precedence_dynarr); | |
1232 | |
1233 for (i = 0; i < NUM_LEADING_BYTES; i++) | |
1234 { | |
1235 if (lbs[i] == 0) | |
1236 { | |
826 | 1237 Lisp_Object charset = charset_by_leading_byte (i + MIN_LEADING_BYTE); |
771 | 1238 if (!NILP (charset)) |
1239 Dynarr_add (unicode_precedence_dynarr, charset); | |
1240 } | |
1241 } | |
1242 } | |
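The rebuild performed by recalculate_unicode_precedence is a first-occurrence merge: language-specific entries first, then default entries, then everything not yet seen. The sketch below uses small integer ids in place of charsets and leading bytes. `NUM_IDS` and `build_precedence` are hypothetical names, and unlike the real function the sketch assumes that every id is a valid charset, whereas unicode.c skips leading bytes with no charset assigned.

```c
#include <assert.h>

#define NUM_IDS 8   /* hypothetical number of charset ids */

/* Merge LANG, then DEF, then all remaining ids into OUT, keeping only
   the first occurrence of each id.  Returns the number of ids written,
   which is always NUM_IDS in this sketch. */
static int
build_precedence (const int *lang, int nlang,
                  const int *def, int ndef, int *out)
{
  int seen[NUM_IDS] = { 0 };
  int n = 0, i;

  for (i = 0; i < nlang; i++)
    if (!seen[lang[i]])
      {
        seen[lang[i]] = 1;
        out[n++] = lang[i];
      }
  for (i = 0; i < ndef; i++)
    if (!seen[def[i]])
      {
        seen[def[i]] = 1;
        out[n++] = def[i];
      }
  /* Everything not mentioned in either list follows, in id order. */
  for (i = 0; i < NUM_IDS; i++)
    if (!seen[i])
      out[n++] = i;
  return n;
}
```

The `seen` array plays the role of the `lbs` array, which the real code indexes by leading byte minus MIN_LEADING_BYTE.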
1243 | |
877 | 1244 DEFUN ("unicode-precedence-list", |
1245 Funicode_precedence_list, | |
1246 0, 0, 0, /* | |
1247 Return the precedence order among charsets used for Unicode decoding. | |
1248 | |
1249 Value is a list of charsets, which are searched in order for a translation | |
1250 matching a given Unicode character. | |
1251 | |
1252 The highest precedence is given to the language-specific precedence list of | |
1253 charsets, defined by `set-language-unicode-precedence-list'. These are | |
1254 followed by charsets in the default precedence list, defined by | |
1255 `set-default-unicode-precedence-list'. Charsets occurring multiple times are | |
1256 given precedence according to their first occurrence in either list. These | |
1257 are followed by the remaining charsets, in some arbitrary order. | |
771 | 1258 |
1259 The language-specific precedence list is meant to be set as part of the | |
1260 language environment initialization; the default precedence list is meant | |
1261 to be set by the user. | |
1318 | 1262 |
1263 #### NOTE: This interface may be changed. | |
771 | 1264 */ |
877 | 1265 ()) |
1266 { | |
1267 int i; | |
1268 Lisp_Object list = Qnil; | |
1269 | |
1270 for (i = Dynarr_length (unicode_precedence_dynarr) - 1; i >= 0; i--) | |
1271 list = Fcons (Dynarr_at (unicode_precedence_dynarr, i), list); | |
1272 return list; | |
1273 } | |
1274 | |
1275 | |
1276 /* #### This interface is wrong. Cyrillic users and Chinese users are going | |
1277 to have varying opinions about whether ISO Cyrillic, KOI8-R, or Windows | |
1278 1251 should take precedence, and whether Big Five or CNS should take | |
1279 precedence, respectively. This means that users are sometimes going to | |
1280 want to set Vlanguage_unicode_precedence_list. | |
1281 Furthermore, this should be language-local (buffer-local would be a | |
1318 | 1282 reasonable approximation). |
1283 | |
1284 Answer: You are right, this needs rethinking. */ | |
877 | 1285 DEFUN ("set-language-unicode-precedence-list", |
1286 Fset_language_unicode_precedence_list, | |
1287 1, 1, 0, /* | |
1288 Set the language-specific precedence of charsets in Unicode decoding. | |
1289 LIST is a list of charsets. | |
1290 See `unicode-precedence-list' for more information. | |
1318 | 1291 |
1292 #### NOTE: This interface may be changed. | |
877 | 1293 */ |
771 | 1294 (list)) |
1295 { | |
1296 { | |
1297 EXTERNAL_LIST_LOOP_2 (elt, list) | |
1298 Fget_charset (elt); | |
1299 } | |
1300 | |
1301 Vlanguage_unicode_precedence_list = list; | |
1302 recalculate_unicode_precedence (); | |
1303 return Qnil; | |
1304 } | |
1305 | |
1306 DEFUN ("language-unicode-precedence-list", | |
1307 Flanguage_unicode_precedence_list, | |
1308 0, 0, 0, /* | |
1309 Return the language-specific precedence list used for Unicode decoding. | |
877 | 1310 See `unicode-precedence-list' for more information. |
1318 | 1311 |
1312 #### NOTE: This interface may be changed. | |
771 | 1313 */ |
1314 ()) | |
1315 { | |
1316 return Vlanguage_unicode_precedence_list; | |
1317 } | |
1318 | |
1319 DEFUN ("set-default-unicode-precedence-list", | |
1320 Fset_default_unicode_precedence_list, | |
1321 1, 1, 0, /* | |
1322 Set the default precedence list used for Unicode decoding. | |
877 | 1323 This is intended to be set by the user. See |
1324 `unicode-precedence-list' for more information. | |
1318 | 1325 |
1326 #### NOTE: This interface may be changed. | |
771 | 1327 */ |
1328 (list)) | |
1329 { | |
1330 { | |
1331 EXTERNAL_LIST_LOOP_2 (elt, list) | |
1332 Fget_charset (elt); | |
1333 } | |
1334 | |
1335 Vdefault_unicode_precedence_list = list; | |
1336 recalculate_unicode_precedence (); | |
1337 return Qnil; | |
1338 } | |
1339 | |
1340 DEFUN ("default-unicode-precedence-list", | |
1341 Fdefault_unicode_precedence_list, | |
1342 0, 0, 0, /* | |
1343 Return the default precedence list used for Unicode decoding. | |
877 | 1344 See `unicode-precedence-list' for more information. |
1318 | 1345 |
1346 #### NOTE: This interface may be changed. | |
771 | 1347 */ |
1348 ()) | |
1349 { | |
1350 return Vdefault_unicode_precedence_list; | |
1351 } | |
1352 | |
1353 DEFUN ("set-unicode-conversion", Fset_unicode_conversion, | |
1354 2, 2, 0, /* | |
1355 Add conversion information between Unicode codepoints and characters. | |
877 | 1356 Conversions for U+0000 to U+00FF are hardwired to ASCII, Control-1, and |
1357 Latin-1. Attempts to set these values will raise an error. | |
1358 | |
771 | 1359 CHARACTER is one of the following: |
1360 | |
1361 -- A character (in which case CODE must be a non-negative integer; values | |
1362 above 2^20 - 1 are allowed for the purpose of specifying private | |
877 | 1363 characters, but are illegal in standard Unicode---they will cause errors |
1364 when converted to utf-16) | |
771 | 1365 -- A vector of characters (in which case CODE must be a vector of integers |
1366 of the same length) | |
1367 */ | |
1368 (character, code)) | |
1369 { | |
1370 Lisp_Object charset; | |
877 | 1371 int ichar, unicode; |
771 | 1372 |
1373 CHECK_CHAR (character); | |
1374 CHECK_NATNUM (code); | |
1375 | |
877 | 1376 unicode = XINT (code); |
1377 ichar = XCHAR (character); | |
1378 charset = ichar_charset (ichar); | |
1379 | |
1380 /* The translations of ASCII, Control-1, and Latin-1 code points are | |
1381 hard-coded in ichar_to_unicode and unicode_to_ichar. | |
1382 | |
1383 Checking unicode < 256 && ichar != unicode is wrong because Mule gives | |
1384 many Latin characters code points in a few different character sets. */ | |
1385 if ((EQ (charset, Vcharset_ascii) || | |
1386 EQ (charset, Vcharset_control_1) || | |
1387 EQ (charset, Vcharset_latin_iso8859_1)) | |
1388 && unicode != ichar) | |
893 | 1389 signal_error (Qinvalid_argument, "Can't change Unicode translation for ASCII, Control-1 or Latin-1 character", |
771 | 1390 character); |
1391 | |
877 | 1392 /* #### Composite characters are not properly implemented yet. */ |
1393 if (EQ (charset, Vcharset_composite)) | |
1394 signal_error (Qinvalid_argument, "Can't set Unicode translation for Composite char", | |
1395 character); | |
1396 | |
1397 set_unicode_conversion (ichar, unicode); | |
771 | 1398 return Qnil; |
1399 } | |
1400 | |
1401 #endif /* MULE */ | |
1402 | |
800 | 1403 DEFUN ("char-to-unicode", Fchar_to_unicode, 1, 1, 0, /* |
771 | 1404 Convert character to Unicode codepoint. |
3025 | 1405 When there is no international support (i.e. the `mule' feature is not |
877 | 1406 present), this function simply does `char-to-int'. |
771 | 1407 */ |
1408 (character)) | |
1409 { | |
1410 CHECK_CHAR (character); | |
1411 #ifdef MULE | |
867 | 1412 return make_int (ichar_to_unicode (XCHAR (character))); |
771 | 1413 #else |
1414 return Fchar_to_int (character); | |
1415 #endif /* MULE */ | |
1416 } | |
1417 | |
800 | 1418 DEFUN ("unicode-to-char", Funicode_to_char, 1, 2, 0, /* |
771 | 1419 Convert Unicode codepoint to character. |
1420 CODE should be a non-negative integer. | |
1421 If CHARSETS is given, it should be a list of charsets, and only those | |
1422 charsets will be consulted, in the given order, for a translation. | |
1423 Otherwise, the default ordering of all charsets will be used (see | |
1424 `set-unicode-charset-precedence'). | |
1425 | |
3025 | 1426 When there is no international support (i.e. the `mule' feature is not |
877 | 1427 present), this function simply does `int-to-char' and ignores the CHARSETS |
1428 argument. | |
2622 | 1429 |
3439 | 1430 If the CODE would not otherwise be converted to an XEmacs character, and the |
1431 list of character sets to be consulted is nil or the default, a new XEmacs | |
1432 character will be created for it in one of the `jit-ucs-charset' Mule | |
4268 | 1433 character sets, and that character will be returned. |
1434 | |
1435 This is limited to around 400,000 characters per XEmacs session, though, so | |
1436 while normal usage will not be problematic, things like: | |
1437 | |
1438 \(dotimes (i #x110000) (decode-char 'ucs i)) | |
1439 | |
1440 will eventually error. The long-term solution to this is Unicode as an | |
1441 internal encoding. | |
771 | 1442 */ |
2333 | 1443 (code, USED_IF_MULE (charsets))) |
771 | 1444 { |
1445 #ifdef MULE | |
1446 Lisp_Object_dynarr *dyn; | |
1447 int lbs[NUM_LEADING_BYTES]; | |
1448 int c; | |
1449 | |
1450 CHECK_NATNUM (code); | |
1451 c = XINT (code); | |
1452 { | |
1453 EXTERNAL_LIST_LOOP_2 (elt, charsets) | |
1454 Fget_charset (elt); | |
1455 } | |
1456 | |
1457 if (NILP (charsets)) | |
1458 { | |
877 | 1459 Ichar ret = unicode_to_ichar (c, unicode_precedence_dynarr); |
771 | 1460 if (ret == -1) |
1461 return Qnil; | |
1462 return make_char (ret); | |
1463 } | |
1464 | |
1465 dyn = Dynarr_new (Lisp_Object); | |
1466 memset (lbs, 0, NUM_LEADING_BYTES * sizeof (int)); | |
1467 add_charsets_to_precedence_list (charsets, lbs, dyn); | |
1468 { | |
877 | 1469 Ichar ret = unicode_to_ichar (c, dyn); |
771 | 1470 Dynarr_free (dyn); |
1471 if (ret == -1) | |
1472 return Qnil; | |
1473 return make_char (ret); | |
1474 } | |
1475 #else | |
1476 CHECK_NATNUM (code); | |
1477 return Fint_to_char (code); | |
1478 #endif /* MULE */ | |
1479 } | |
1480 | |
872 | 1481 #ifdef MULE |
1482 | |
771 | 1483 static Lisp_Object |
1484 cerrar_el_fulano (Lisp_Object fulano) | |
1485 { | |
1486 FILE *file = (FILE *) get_opaque_ptr (fulano); | |
1487 retry_fclose (file); | |
1488 return Qnil; | |
1489 } | |
1490 | |
1318 | 1491 DEFUN ("load-unicode-mapping-table", Fload_unicode_mapping_table, |
771 | 1492 2, 6, 0, /* |
877 | 1493 Load Unicode tables with the Unicode mapping data in FILENAME for CHARSET. |
771 | 1494 Data is text, in the form of one translation per line -- charset |
1495 codepoint followed by Unicode codepoint. Numbers are decimal or hex | |
1496 \(preceded by 0x). Comments are marked with a #. Charset codepoints | |
877 | 1497 for two-dimensional charsets have the first octet stored in the |
771 | 1498 high 8 bits of the hex number and the second in the low 8 bits. |
1499 | |
1500 If START and END are given, only charset codepoints within the given | |
877 | 1501 range will be processed. (START and END apply to the codepoints in the |
1502 file, before OFFSET is applied.) | |
771 | 1503 |
877 | 1504 If OFFSET is given, that value will be added to all charset codepoints |
1505 in the file to obtain the internal charset codepoint. \(We assume | |
1506 that octets in the table are in the range 33 to 126 or 32 to 127. If | |
1507 you have a table in ku-ten form, with octets in the range 1 to 94, you | |
1508 will have to use an offset of 8224, i.e. 0x2020.) | |
771 | 1509 |
1510 FLAGS, if specified, control further how the tables are interpreted | |
877 | 1511 and are used to special-case certain known format deviations in the |
1512 Unicode tables or in the charset: | |
771 | 1513 |
1514 `ignore-first-column' | |
877 | 1515 The JIS X 0208 tables have 3 columns of data instead of 2. The first |
1516 column contains the Shift-JIS codepoint, which we ignore. | |
771 | 1517 `big5' |
877 | 1518 The charset codepoints are Big Five codepoints; convert them to the |
1519 hacked-up Mule codepoint in `chinese-big5-1' or `chinese-big5-2'. | |
771 | 1520 */ |
1521 (filename, charset, start, end, offset, flags)) | |
1522 { | |
1523 int st = 0, en = INT_MAX, of = 0; | |
1524 FILE *file; | |
1525 struct gcpro gcpro1; | |
1526 char line[1025]; | |
1527 int fondo = specpdl_depth (); | |
1528 int ignore_first_column = 0; | |
1529 int big5 = 0; | |
1530 | |
1531 CHECK_STRING (filename); | |
1532 charset = Fget_charset (charset); | |
1533 if (!NILP (start)) | |
1534 { | |
1535 CHECK_INT (start); | |
1536 st = XINT (start); | |
1537 } | |
1538 if (!NILP (end)) | |
1539 { | |
1540 CHECK_INT (end); | |
1541 en = XINT (end); | |
1542 } | |
1543 if (!NILP (offset)) | |
1544 { | |
1545 CHECK_INT (offset); | |
1546 of = XINT (offset); | |
1547 } | |
1548 | |
1549 if (!LISTP (flags)) | |
1550 flags = list1 (flags); | |
1551 | |
1552 { | |
1553 EXTERNAL_LIST_LOOP_2 (elt, flags) | |
1554 { | |
1555 if (EQ (elt, Qignore_first_column)) | |
1556 ignore_first_column = 1; | |
1557 else if (EQ (elt, Qbig5)) | |
1558 big5 = 1; | |
1559 else | |
1560 invalid_constant | |
1318 | 1561 ("Unrecognized `load-unicode-mapping-table' flag", elt); |
771 | 1562 } |
1563 } | |
1564 | |
1565 GCPRO1 (filename); | |
1566 filename = Fexpand_file_name (filename, Qnil); | |
1567 file = qxe_fopen (XSTRING_DATA (filename), READ_TEXT); | |
1568 if (!file) | |
1569 report_file_error ("Cannot open", filename); | |
1570 record_unwind_protect (cerrar_el_fulano, make_opaque_ptr (file)); | |
1571 while (fgets (line, sizeof (line), file)) | |
1572 { | |
1573 char *p = line; | |
1574 int cp1, cp2, endcount; | |
1575 int cp1high, cp1low; | |
1576 int dummy; | |
1577 | |
1578 while (*p) /* erase all comments out of the line */ | |
1579 { | |
1580 if (*p == '#') | |
1581 *p = '\0'; | |
1582 else | |
1583 p++; | |
1584 } | |
1585 /* see if line is nothing but whitespace and skip if so */ | |
1586 p = line + strspn (line, " \t\n\r\f"); | |
1587 if (!*p) | |
1588 continue; | |
1589 /* NOTE: It appears that MS Windows and Newlib sscanf() have | |
1590 different interpretations for whitespace (== "skip all whitespace | |
1591 at processing point"): Newlib requires at least one corresponding | |
1592 whitespace character in the input, but MS allows none. The | |
1593 following would be easier to write if we could count on the MS | |
1594 interpretation. | |
1595 | |
1596 Also, the return value does NOT include %n storage. */ | |
1597 if ((!ignore_first_column ? | |
1598 sscanf (p, "%i %i%n", &cp1, &cp2, &endcount) < 2 : | |
1599 sscanf (p, "%i %i %i%n", &dummy, &cp1, &cp2, &endcount) < 3) | |
2367 | 1600 /* #### Temporary code! Cygwin newlib fucked up scanf() handling |
1601 of numbers beginning 0x0... starting in 04/2004, in an attempt | |
1602 to fix another bug. A partial fix for this was put in in | |
1603 06/2004, but as of 10/2004 the value of ENDCOUNT returned in | |
1604 such case is still wrong. If this gets fixed soon, remove | |
1605 this code. --ben */ | |
1606 #ifndef CYGWIN_SCANF_BUG | |
1607 || *(p + endcount + strspn (p + endcount, " \t\n\r\f")) | |
1608 #endif | |
1609 ) | |
771 | 1610 { |
793 | 1611 warn_when_safe (Qunicode, Qwarning, |
771 | 1612 "Unrecognized line in translation file %s:\n%s", |
1613 XSTRING_DATA (filename), line); | |
1614 continue; | |
1615 } | |
1616 if (cp1 >= st && cp1 <= en) | |
1617 { | |
1618 cp1 += of; | |
1619 if (cp1 < 0 || cp1 >= 65536) | |
1620 { | |
1621 out_of_range: | |
793 | 1622 warn_when_safe (Qunicode, Qwarning, |
1623 "Out of range first codepoint 0x%x in " | |
1624 "translation file %s:\n%s", | |
771 | 1625 cp1, XSTRING_DATA (filename), line); |
1626 continue; | |
1627 } | |
1628 | |
1629 cp1high = cp1 >> 8; | |
1630 cp1low = cp1 & 255; | |
1631 | |
1632 if (big5) | |
1633 { | |
867 | 1634 Ichar ch = decode_big5_char (cp1high, cp1low); |
771 | 1635 if (ch == -1) |
793 | 1636 |
1637 warn_when_safe (Qunicode, Qwarning, | |
1638 "Out of range Big5 codepoint 0x%x in " | |
1639 "translation file %s:\n%s", | |
771 | 1640 cp1, XSTRING_DATA (filename), line); |
1641 else | |
1642 set_unicode_conversion (ch, cp2); | |
1643 } | |
1644 else | |
1645 { | |
1646 int l1, h1, l2, h2; | |
867 | 1647 Ichar emch; |
771 | 1648 |
1649 switch (XCHARSET_TYPE (charset)) | |
1650 { | |
1651 case CHARSET_TYPE_94: l1 = 33; h1 = 126; l2 = 0; h2 = 0; break; | |
1652 case CHARSET_TYPE_96: l1 = 32; h1 = 127; l2 = 0; h2 = 0; break; | |
1653 case CHARSET_TYPE_94X94: l1 = 33; h1 = 126; l2 = 33; h2 = 126; | |
1654 break; | |
1655 case CHARSET_TYPE_96X96: l1 = 32; h1 = 127; l2 = 32; h2 = 127; | |
1656 break; | |
2500 | 1657 default: ABORT (); l1 = 0; h1 = 0; l2 = 0; h2 = 0; |
771 | 1658 } |
1659 | |
1660 if (cp1high < l2 || cp1high > h2 || cp1low < l1 || cp1low > h1) | |
1661 goto out_of_range; | |
1662 | |
867 | 1663 emch = (cp1high == 0 ? make_ichar (charset, cp1low, 0) : |
1664 make_ichar (charset, cp1high, cp1low)); | |
771 | 1665 set_unicode_conversion (emch, cp2); |
1666 } | |
1667 } | |
1668 } | |
1669 | |
1670 if (ferror (file)) | |
1671 report_file_error ("IO error when reading", filename); | |
1672 | |
1673 unbind_to (fondo); /* close file */ | |
1674 UNGCPRO; | |
1675 return Qnil; | |
1676 } | |
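
/* Illustration only, not part of unicode.c: a minimal sketch of how one
   line of a mapping table is interpreted, following the steps in
   `load-unicode-mapping-table' above (strip the comment, read two numbers
   with %i, apply OFFSET, split the charset codepoint into octets).
   parse_mapping_line is a hypothetical name; START/END filtering, FLAGS
   and the trailing-garbage check are deliberately left out.  */

```c
#include <stdio.h>
#include <string.h>

/* Parse one "charset-codepoint unicode-codepoint" line.  Returns 1 and
   fills *HIGH, *LOW and *UCS on success; returns 0 for blank,
   comment-only, malformed or out-of-range lines.  OFFSET is added to the
   charset codepoint before it is split into octets.  */
int
parse_mapping_line (const char *text, int offset,
                    int *high, int *low, int *ucs)
{
  char line[1025];
  char *hash;
  int cp1, cp2;

  strncpy (line, text, sizeof (line) - 1);
  line[sizeof (line) - 1] = '\0';

  hash = strchr (line, '#');    /* cut the line at the first comment mark */
  if (hash)
    *hash = '\0';

  /* %i accepts decimal as well as 0x-prefixed hex.  */
  if (sscanf (line, "%i %i", &cp1, &cp2) < 2)
    return 0;

  cp1 += offset;
  if (cp1 < 0 || cp1 >= 65536)
    return 0;

  *high = cp1 >> 8;             /* first octet; 0 for one-dimensional sets */
  *low = cp1 & 255;             /* second (or only) octet */
  *ucs = cp2;
  return 1;
}
```

/* A ku-ten style entry such as 0x1001 with an offset of 0x2020 yields the
   octets 0x30 and 0x21, the same result as a table that stores 0x3021
   directly.  */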
1677 | |
1678 #endif /* MULE */ | |
1679 | |
1680 | |
1681 /************************************************************************/ | |
1682 /* Unicode coding system */ | |
1683 /************************************************************************/ | |
1684 | |
1685 struct unicode_coding_system | |
1686 { | |
1687 enum unicode_type type; | |
1887 | 1688 unsigned int little_endian :1; |
1689 unsigned int need_bom :1; | |
771 | 1690 }; |
1691 | |
1692 #define CODING_SYSTEM_UNICODE_TYPE(codesys) \ | |
1693 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->type) | |
1694 #define XCODING_SYSTEM_UNICODE_TYPE(codesys) \ | |
1695 CODING_SYSTEM_UNICODE_TYPE (XCODING_SYSTEM (codesys)) | |
1696 #define CODING_SYSTEM_UNICODE_LITTLE_ENDIAN(codesys) \ | |
1697 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->little_endian) | |
1698 #define XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN(codesys) \ | |
1699 CODING_SYSTEM_UNICODE_LITTLE_ENDIAN (XCODING_SYSTEM (codesys)) | |
1700 #define CODING_SYSTEM_UNICODE_NEED_BOM(codesys) \ | |
1701 (CODING_SYSTEM_TYPE_DATA (codesys, unicode)->need_bom) | |
1702 #define XCODING_SYSTEM_UNICODE_NEED_BOM(codesys) \ | |
1703 CODING_SYSTEM_UNICODE_NEED_BOM (XCODING_SYSTEM (codesys)) | |
1704 | |
1705 struct unicode_coding_stream | |
1706 { | |
1707 /* decode */ | |
1708 unsigned char counter; | |
4096 | 1709 unsigned char indicated_length; |
771 | 1710 int seen_char; |
1711 /* encode */ | |
1712 Lisp_Object current_charset; | |
1713 int current_char_boundary; | |
1714 int wrote_bom; | |
1715 }; | |
1716 | |
1204 | 1717 static const struct memory_description unicode_coding_system_description[] = { |
771 | 1718 { XD_END } |
1719 }; | |
1720 | |
1204 | 1721 DEFINE_CODING_SYSTEM_TYPE_WITH_DATA (unicode); |
1722 | |
771 | 1723 static void |
1724 decode_unicode_char (int ch, unsigned_char_dynarr *dst, | |
1887 | 1725 struct unicode_coding_stream *data, |
1726 unsigned int ignore_bom) | |
771 | 1727 { |
1728 if (ch == 0xFEFF && !data->seen_char && ignore_bom) | |
1729 ; | |
1730 else | |
1731 { | |
1732 #ifdef MULE | |
877 | 1733 Ichar chr = unicode_to_ichar (ch, unicode_precedence_dynarr); |
771 | 1734 |
1735 if (chr != -1) | |
1736 { | |
867 | 1737 Ibyte work[MAX_ICHAR_LEN]; |
771 | 1738 int len; |
1739 | |
867 | 1740 len = set_itext_ichar (work, chr); |
771 | 1741 Dynarr_add_many (dst, work, len); |
1742 } | |
1743 else | |
1744 { | |
1745 Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208); | |
1746 Dynarr_add (dst, 34 + 128); | |
1747 Dynarr_add (dst, 46 + 128); | |
1748 } | |
1749 #else | |
867 | 1750 Dynarr_add (dst, (Ibyte) ch); |
771 | 1751 #endif /* MULE */ |
1752 } | |
1753 | |
1754 data->seen_char = 1; | |
1755 } | |
1756 | |
4096 | 1757 #define DECODE_ERROR_OCTET(octet, dst, data, ignore_bom) \ |
1758 decode_unicode_char ((octet) + UNICODE_ERROR_OCTET_RANGE_START, \ | |
1759 dst, data, ignore_bom) | |
1760 | |
1761 static inline void | |
1762 indicate_invalid_utf_8 (unsigned char indicated_length, | |
1763 unsigned char counter, | |
1764 int ch, unsigned_char_dynarr *dst, | |
1765 struct unicode_coding_stream *data, | |
1766 unsigned int ignore_bom) | |
1767 { | |
1768 Binbyte stored = indicated_length - counter; | |
1769 Binbyte mask = "\x00\x00\xC0\xE0\xF0\xF8\xFC"[indicated_length]; | |
1770 | |
1771 while (stored > 0) | |
1772 { | |
1773 DECODE_ERROR_OCTET (((ch >> (6 * (stored - 1))) & 0x3f) | mask, | |
1774 dst, data, ignore_bom); | |
1775 mask = 0x80, stored--; | |
1776 } | |
1777 } | |
1778 | |
771 | 1779 static void |
1780 encode_unicode_char_1 (int code, unsigned_char_dynarr *dst, | |
4096 | 1781 enum unicode_type type, unsigned int little_endian, |
1782 int write_error_characters_as_such) | |
771 | 1783 { |
1784 switch (type) | |
1785 { | |
1786 case UNICODE_UTF_16: | |
1787 if (little_endian) | |
1788 { | |
3952 | 1789 if (code < 0x10000) { |
1790 Dynarr_add (dst, (unsigned char) (code & 255)); | |
1791 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
4096 | 1792 } else if (write_error_characters_as_such && |
1793 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
1794 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
1795 { | |
1796 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
1797 } | |
1798 else if (code < 0x110000) | |
1799 { | |
1800 /* Little endian; least significant byte first. */ | |
1801 int first, second; | |
1802 | |
1803 CODE_TO_UTF_16_SURROGATES(code, first, second); | |
1804 | |
1805 Dynarr_add (dst, (unsigned char) (first & 255)); | |
1806 Dynarr_add (dst, (unsigned char) ((first >> 8) & 255)); | |
1807 | |
1808 Dynarr_add (dst, (unsigned char) (second & 255)); | |
1809 Dynarr_add (dst, (unsigned char) ((second >> 8) & 255)); | |
1810 } | |
1811 else | |
1812 { | |
1813 /* Not valid Unicode. Pass U+FFFD, least significant byte | |
1814 first. */ | |
1815 Dynarr_add (dst, (unsigned char) 0xFD); | |
1816 Dynarr_add (dst, (unsigned char) 0xFF); | |
1817 } | |
771 | 1818 } |
1819 else | |
1820 { | |
3952 | 1821 if (code < 0x10000) { |
1822 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
1823 Dynarr_add (dst, (unsigned char) (code & 255)); | |
4096 | 1824 } else if (write_error_characters_as_such && |
1825 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
1826 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
1827 { | |
1828 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
1829 } | |
1830 else if (code < 0x110000) | |
1831 { | |
1832 /* Big endian; most significant byte first. */ | |
1833 int first, second; | |
1834 | |
1835 CODE_TO_UTF_16_SURROGATES(code, first, second); | |
1836 | |
1837 Dynarr_add (dst, (unsigned char) ((first >> 8) & 255)); | |
1838 Dynarr_add (dst, (unsigned char) (first & 255)); | |
1839 | |
1840 Dynarr_add (dst, (unsigned char) ((second >> 8) & 255)); | |
1841 Dynarr_add (dst, (unsigned char) (second & 255)); | |
1842 } | |
1843 else | |
1844 { | |
1845 /* Not valid Unicode. Pass U+FFFD, most significant byte | |
1846 first. */ | |
1847 Dynarr_add (dst, (unsigned char) 0xFF); | |
1848 Dynarr_add (dst, (unsigned char) 0xFD); | |
1849 } | |
771 | 1850 } |
1851 break; | |
1852 | |
1853 case UNICODE_UCS_4: | |
4096 | 1854 case UNICODE_UTF_32: |
771 | 1855 if (little_endian) |
1856 { | |
4096 | 1857 if (write_error_characters_as_such && |
1858 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
1859 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
1860 { | |
1861 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
1862 } | |
1863 else | |
1864 { | |
1865 /* We generate and accept incorrect sequences here, which is | |
1866 okay, in the interest of preservation of the user's | |
1867 data. */ | |
1868 Dynarr_add (dst, (unsigned char) (code & 255)); | |
1869 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
1870 Dynarr_add (dst, (unsigned char) ((code >> 16) & 255)); | |
1871 Dynarr_add (dst, (unsigned char) (code >> 24)); | |
1872 } | |
771 | 1873 } |
1874 else | |
1875 { | |
4096 | 1876 if (write_error_characters_as_such && |
1877 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
1878 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
1879 { | |
1880 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
1881 } | |
1882 else | |
1883 { | |
1884 /* We generate and accept incorrect sequences here, which is okay, | |
1885 in the interest of preservation of the user's data. */ | |
1886 Dynarr_add (dst, (unsigned char) (code >> 24)); | |
1887 Dynarr_add (dst, (unsigned char) ((code >> 16) & 255)); | |
1888 Dynarr_add (dst, (unsigned char) ((code >> 8) & 255)); | |
1889 Dynarr_add (dst, (unsigned char) (code & 255)); | |
1890 } | |
771 | 1891 } |
1892 break; | |
1893 | |
1894 case UNICODE_UTF_8: | |
1895 if (code <= 0x7f) | |
1896 { | |
1897 Dynarr_add (dst, (unsigned char) code); | |
1898 } | |
1899 else if (code <= 0x7ff) | |
1900 { | |
1901 Dynarr_add (dst, (unsigned char) ((code >> 6) | 0xc0)); | |
1902 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
1903 } | |
1904 else if (code <= 0xffff) | |
1905 { | |
1906 Dynarr_add (dst, (unsigned char) ((code >> 12) | 0xe0)); | |
1907 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
1908 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
1909 } | |
1910 else if (code <= 0x1fffff) | |
1911 { | |
1912 Dynarr_add (dst, (unsigned char) ((code >> 18) | 0xf0)); | |
1913 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
1914 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
1915 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
1916 } | |
1917 else if (code <= 0x3ffffff) | |
1918 { | |
4096 | 1919 |
1920 #if !(UNICODE_ERROR_OCTET_RANGE_START > 0x1fffff \ | |
1921 && UNICODE_ERROR_OCTET_RANGE_START < 0x3ffffff) | |
1922 #error "This code needs to be rewritten. " | |
1923 #endif | |
1924 if (write_error_characters_as_such && | |
1925 code >= UNICODE_ERROR_OCTET_RANGE_START && | |
1926 code < (UNICODE_ERROR_OCTET_RANGE_START + 0x100)) | |
1927 { | |
1928 Dynarr_add (dst, (unsigned char) ((code & 0xFF))); | |
1929 } | |
1930 else | |
1931 { | |
1932 Dynarr_add (dst, (unsigned char) ((code >> 24) | 0xf8)); | |
1933 Dynarr_add (dst, (unsigned char) (((code >> 18) & 0x3f) | 0x80)); | |
1934 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
1935 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
1936 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
1937 } | |
771 | 1938 } |
1939 else | |
1940 { | |
1941 Dynarr_add (dst, (unsigned char) ((code >> 30) | 0xfc)); | |
1942 Dynarr_add (dst, (unsigned char) (((code >> 24) & 0x3f) | 0x80)); | |
1943 Dynarr_add (dst, (unsigned char) (((code >> 18) & 0x3f) | 0x80)); | |
1944 Dynarr_add (dst, (unsigned char) (((code >> 12) & 0x3f) | 0x80)); | |
1945 Dynarr_add (dst, (unsigned char) (((code >> 6) & 0x3f) | 0x80)); | |
1946 Dynarr_add (dst, (unsigned char) ((code & 0x3f) | 0x80)); | |
1947 } | |
1948 break; | |
1949 | |
2500 | 1950 case UNICODE_UTF_7: ABORT (); |
771 | 1951 |
2500 | 1952 default: ABORT (); |
771 | 1953 } |
1954 } | |
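
/* Illustration only, not part of unicode.c: CODE_TO_UTF_16_SURROGATES is
   defined elsewhere and is not shown in this listing.  The sketch below
   uses the standard UTF-16 surrogate computation, which is assumed (not
   confirmed here) to be what the macro does, and shows the byte order used
   by the two UTF-16 branches above.  code_to_surrogates and put_utf16_pair
   are hypothetical names.  */

```c
#include <assert.h>

/* Split a code point in the range 0x10000..0x10FFFF into a high (first)
   and a low (second) surrogate.  */
void
code_to_surrogates (int code, int *first, int *second)
{
  assert (code >= 0x10000 && code <= 0x10FFFF);
  code -= 0x10000;
  *first = 0xD800 | (code >> 10);       /* top 10 bits */
  *second = 0xDC00 | (code & 0x3FF);    /* bottom 10 bits */
}

/* Write the surrogate pair for CODE into BUF (4 bytes).  Each 16-bit unit
   is written low byte first when LITTLE_ENDIAN_P is nonzero, high byte
   first otherwise; the order of the two units never changes.  */
void
put_utf16_pair (int code, int little_endian_p, unsigned char *buf)
{
  int units[2], i;

  code_to_surrogates (code, &units[0], &units[1]);
  for (i = 0; i < 2; i++)
    {
      unsigned char hi = (unsigned char) ((units[i] >> 8) & 255);
      unsigned char lo = (unsigned char) (units[i] & 255);

      buf[2 * i] = little_endian_p ? lo : hi;
      buf[2 * i + 1] = little_endian_p ? hi : lo;
    }
}
```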
1955 | |
3439 | 1956 /* Also used in mule-coding.c for UTF-8 handling in ISO 2022-oriented |
1957 encodings. */ | |
1958 void | |
2333 | 1959 encode_unicode_char (Lisp_Object USED_IF_MULE (charset), int h, |
1960 int USED_IF_MULE (l), unsigned_char_dynarr *dst, | |
4096 | 1961 enum unicode_type type, unsigned int little_endian, |
1962 int write_error_characters_as_such) | |
771 | 1963 { |
1964 #ifdef MULE | |
867 | 1965 int code = ichar_to_unicode (make_ichar (charset, h & 127, l & 127)); |
771 | 1966 |
1967 if (code == -1) | |
1968 { | |
1969 if (type != UNICODE_UTF_16 && | |
1970 XCHARSET_DIMENSION (charset) == 2 && | |
1971 XCHARSET_CHARS (charset) == 94) | |
1972 { | |
1973 unsigned char final = XCHARSET_FINAL (charset); | |
1974 | |
1975 if (('@' <= final) && (final < 0x7f)) | |
1976 code = (0xe00000 + (final - '@') * 94 * 94 | |
1977 + ((h & 127) - 33) * 94 + (l & 127) - 33); | |
1978 else | |
1979 code = '?'; | |
1980 } | |
1981 else | |
1982 code = '?'; | |
1983 } | |
1984 #else | |
1985 int code = h; | |
1986 #endif /* MULE */ | |
1987 | |
4096 | 1988 encode_unicode_char_1 (code, dst, type, little_endian, |
1989 write_error_characters_as_such); | |
771 | 1990 } |
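
/* Illustration only, not part of unicode.c: a minimal sketch of the
   fallback arithmetic in encode_unicode_char above, which assigns a code
   starting at 0xe00000 to characters of 94x94 charsets that have no
   Unicode translation.  private_code_for_94x94 mirrors the expression in
   the source; private_code_decode is a hypothetical inverse added only to
   show that the mapping is reversible for octets in the range 33 to 126.  */

```c
/* FINAL is the charset's final byte; H and L are the two octets.
   Returns '?' when FINAL is outside '@'..0x7e, as the source does.  */
int
private_code_for_94x94 (int final, int h, int l)
{
  if (final < '@' || final >= 0x7f)
    return '?';
  return 0xe00000 + (final - '@') * 94 * 94
    + ((h & 127) - 33) * 94 + ((l & 127) - 33);
}

/* Hypothetical inverse: recover FINAL, H and L from CODE.  Returns 0 if
   CODE lies outside the block produced by the function above.  */
int
private_code_decode (int code, int *final, int *h, int *l)
{
  int rest = code - 0xe00000;

  if (rest < 0 || rest >= (0x7f - '@') * 94 * 94)
    return 0;
  *final = '@' + rest / (94 * 94);
  rest %= 94 * 94;
  *h = 33 + rest / 94;
  *l = 33 + rest % 94;
  return 1;
}
```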
1991 | |
1992 static Bytecount | |
1993 unicode_convert (struct coding_stream *str, const UExtbyte *src, | |
1994 unsigned_char_dynarr *dst, Bytecount n) | |
1995 { | |
1996 unsigned int ch = str->ch; | |
1997 struct unicode_coding_stream *data = CODING_STREAM_TYPE_DATA (str, unicode); | |
1998 enum unicode_type type = | |
1999 XCODING_SYSTEM_UNICODE_TYPE (str->codesys); | |
1887 | 2000 unsigned int little_endian = |
2001 XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (str->codesys); | |
2002 unsigned int ignore_bom = XCODING_SYSTEM_UNICODE_NEED_BOM (str->codesys); | |
771 | 2003 Bytecount orign = n; |
2004 | |
2005 if (str->direction == CODING_DECODE) | |
2006 { | |
2007 unsigned char counter = data->counter; | |
4096 | 2008 unsigned char indicated_length |
2009 = data->indicated_length; | |
771 | 2010 |
2011 while (n--) | |
2012 { | |
2013 UExtbyte c = *src++; | |
2014 | |
2015 switch (type) | |
2016 { | |
2017 case UNICODE_UTF_8: | |
4096 | 2018 if (0 == counter) |
2019 { | |
2020 if (0 == (c & 0x80)) | |
2021 { | |
2022 /* ASCII. */ | |
2023 decode_unicode_char (c, dst, data, ignore_bom); | |
2024 } | |
2025 else if (0 == (c & 0x40)) | |
2026 { | |
2027 /* Highest bit set, second highest not--there's | |
2028 something wrong. */ | |
2029 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
2030 } | |
2031 else if (0 == (c & 0x20)) | |
2032 { | |
2033 ch = c & 0x1f; | |
2034 counter = 1; | |
2035 indicated_length = 2; | |
2036 } | |
2037 else if (0 == (c & 0x10)) | |
2038 { | |
2039 ch = c & 0x0f; | |
2040 counter = 2; | |
2041 indicated_length = 3; | |
2042 } | |
2043 else if (0 == (c & 0x08)) | |
2044 { | |
2045 ch = c & 0x07; | |
2046 counter = 3; | |
2047 indicated_length = 4; | |
2048 } | |
2049 else | |
2050 { | |
2051 /* We don't support lengths longer than 4 in | |
2052 external-format data. */ | |
2053 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
2054 | |
2055 } | |
2056 } | |
2057 else | |
2058 { | |
2059 /* counter != 0 */ | |
2060 if ((0 == (c & 0x80)) || (0 != (c & 0x40))) | |
2061 { | |
2062 indicate_invalid_utf_8(indicated_length, | |
2063 counter, | |
2064 ch, dst, data, ignore_bom); | |
2065 if (c & 0x80) | |
2066 { | |
2067 DECODE_ERROR_OCTET (c, dst, data, ignore_bom); | |
2068 } | |
2069 else | |
2070 { | |
2071 /* The character just read is ASCII. Treat it as | |
2072 such. */ | |
2073 decode_unicode_char (c, dst, data, ignore_bom); | |
2074 } | |
2075 ch = 0; | |
2076 counter = 0; | |
2077 } | |
2078 else | |
2079 { | |
2080 ch = (ch << 6) | (c & 0x3f); | |
2081 counter--; | |
2082 /* Just processed the final byte. Emit the character. */ | |
2083 if (!counter) | |
2084 { | |
2085 /* Don't accept over-long sequences, surrogates, | |
2086 or codes above #x10FFFF. */ | |
2087 if ((ch < 0x80) || | |
2088 ((ch < 0x800) && indicated_length > 2) || | |
2089 ((ch < 0x10000) && indicated_length > 3) || | |
2090 valid_utf_16_surrogate(ch) || (ch > 0x10FFFF)) | |
2091 { | |
2092 indicate_invalid_utf_8(indicated_length, | |
2093 counter, | |
2094 ch, dst, data, | |
2095 ignore_bom); | |
2096 } | |
2097 else | |
2098 { | |
2099 decode_unicode_char (ch, dst, data, ignore_bom); | |
2100 } | |
2101 ch = 0; | |
2102 } | |
2103 } | |
771 | 2104 } |
2105 break; | |
2106 | |
2107 case UNICODE_UTF_16: | |
3952 | 2108 |
771 | 2109 if (little_endian) |
2110 ch = (c << counter) | ch; | |
2111 else | |
2112 ch = (ch << 8) | c; | |
4096 | 2113 |
771 | 2114 counter += 8; |
3952 | 2115 |
4096 | 2116 if (16 == counter) |
2117 { | |
771 | 2118 int tempch = ch; |
4096 | 2119 |
2120 if (valid_utf_16_first_surrogate(ch)) | |
2121 { | |
2122 break; | |
2123 } | |
771 | 2124 ch = 0; |
2125 counter = 0; | |
2126 decode_unicode_char (tempch, dst, data, ignore_bom); | |
2127 } | |
4096 | 2128 else if (32 == counter) |
3952 | 2129 { |
2130 int tempch; | |
4096 | 2131 |
4583
2669b1b7e33b
Correct little-endian UTF-16 surrogate handling.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4270
diff
changeset
|
2132 if (little_endian) |
4096 | 2133 { |
2134 if (!valid_utf_16_last_surrogate(ch >> 16)) |
2135 { |
2136 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, |
2137 ignore_bom); |
2138 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, |
2139 ignore_bom); |
2140 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, |
2141 ignore_bom); |
2142 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, |
2143 ignore_bom); |
2144 } |
2145 else |
2146 { |
2147 tempch = utf_16_surrogates_to_code((ch & 0xffff), |
2148 (ch >> 16)); |
2149 decode_unicode_char(tempch, dst, data, ignore_bom); |
2669b1b7e33b
Correct little-endian UTF-16 surrogate handling.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4270
diff
changeset
|
2150 } |
4096 | 2151 } |
4583
2669b1b7e33b
Correct little-endian UTF-16 surrogate handling.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4270
diff
changeset
|
2152 else |
2153 { | |
2154 if (!valid_utf_16_last_surrogate(ch & 0xFFFF)) | |
2155 { | |
2156 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
2157 ignore_bom); | |
2158 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
2159 ignore_bom); | |
2160 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2161 ignore_bom); | |
2162 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
2163 ignore_bom); | |
2164 } | |
2165 else | |
2166 { | |
2167 tempch = utf_16_surrogates_to_code((ch >> 16), | |
2168 (ch & 0xffff)); | |
2169 decode_unicode_char(tempch, dst, data, ignore_bom); | |
2170 } | |
2171 } | |
2172 | |
3952 | 2173 ch = 0; |
2174 counter = 0; | |
4096 | 2175 } |
2176 else | |
2177 assert(8 == counter || 24 == counter); | |
771 | 2178 break; |
2179 | |
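The UTF-16 case above packs two 16-bit code units into the 32-bit accumulator `ch`, and which half holds the lead surrogate depends on the byte order. The following is a minimal sketch of that combination step, assuming the standard UTF-16 surrogate formula. The names `combine_surrogates` and `pair_from_accumulator` are hypothetical and are not the XEmacs helpers `utf_16_surrogates_to_code` or `valid_utf_16_last_surrogate`.

```c
#include <assert.h>
#include <stdint.h>

/* Standard UTF-16 rule: lead in 0xD800..0xDBFF, trail in 0xDC00..0xDFFF. */
static uint32_t
combine_surrogates (unsigned int lead, unsigned int trail)
{
  assert (lead >= 0xD800 && lead <= 0xDBFF);
  assert (trail >= 0xDC00 && trail <= 0xDFFF);
  return 0x10000u + ((uint32_t) (lead - 0xD800) << 10)
    + (uint32_t) (trail - 0xDC00);
}

/* Split a 32-bit accumulator into (lead, trail) according to byte order and
   combine them.  Returns 0 when the halves do not form a valid pair; a real
   decoder would emit error octets instead (assumption of this sketch). */
static uint32_t
pair_from_accumulator (uint32_t ch, int little_endian)
{
  unsigned int lead = little_endian ? (ch & 0xFFFF) : (ch >> 16);
  unsigned int trail = little_endian ? (ch >> 16) : (ch & 0xFFFF);

  if (lead < 0xD800 || lead > 0xDBFF || trail < 0xDC00 || trail > 0xDFFF)
    return 0;
  return combine_surrogates (lead, trail);
}
```

For U+1F600 (code units D83D DE00), a big-endian accumulator holds 0xD83DDE00 and a little-endian one holds 0xDE00D83D; both calls yield 0x1F600.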
2180 case UNICODE_UCS_4: | |
4096 | 2181 case UNICODE_UTF_32: |
771 | 2182 if (little_endian) |
2183 ch = (c << counter) | ch; | |
2184 else | |
2185 ch = (ch << 8) | c; | |
2186 counter += 8; | |
2187 if (counter == 32) | |
2188 { | |
4096 | 2189 if (ch > 0x10ffff) |
2190 { | |
2191 /* ch is not a legal Unicode character. We're fine | |
2192 with that in UCS-4, though not in UTF-32. */ | |
2193 if (UNICODE_UCS_4 == type && ch < 0x80000000) | |
2194 { | |
2195 decode_unicode_char (ch, dst, data, ignore_bom); | |
2196 } | |
2197 else if (little_endian) | |
2198 { | |
2199 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
2200 ignore_bom); | |
2201 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2202 ignore_bom); | |
2203 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
2204 ignore_bom); | |
2205 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
2206 ignore_bom); | |
2207 } | |
2208 else | |
2209 { | |
2210 DECODE_ERROR_OCTET ((ch >> 24) & 0xFF, dst, data, | |
2211 ignore_bom); | |
2212 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
2213 ignore_bom); | |
2214 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2215 ignore_bom); | |
2216 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
2217 ignore_bom); | |
2218 } | |
2219 } | |
2220 else | |
2221 { | |
2222 decode_unicode_char (ch, dst, data, ignore_bom); | |
2223 } | |
771 | 2224 ch = 0; |
2225 counter = 0; | |
2226 } | |
2227 break; | |
2228 | |
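The UCS-4/UTF-32 case above builds a 32-bit value one octet at a time and then, per the comment at 2191-2192, tolerates values above 0x10ffff in UCS-4 but not in UTF-32. The sketch below restates both steps in isolation. `assemble_u32`, `value_is_acceptable` and the `SKETCH_*` constants are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_type { SKETCH_UCS_4, SKETCH_UTF_32 };

/* Assemble four octets, given in stream order, into one 32-bit value using
   the same two update rules as the listing. */
static uint32_t
assemble_u32 (const unsigned char b[4], int little_endian)
{
  uint32_t ch = 0;
  int counter = 0;

  assert (b != NULL);
  for (int i = 0; i < 4; i++)
    {
      if (little_endian)
        ch = ((uint32_t) b[i] << counter) | ch;   /* new octet goes on top */
      else
        ch = (ch << 8) | b[i];                    /* new octet goes below */
      counter += 8;
    }
  return ch;
}

/* Range check only: anything up to 0x10FFFF passes; larger values pass only
   for UCS-4 and only below 0x80000000. */
static int
value_is_acceptable (uint32_t ch, enum sketch_type type)
{
  if (ch <= 0x10FFFF)
    return 1;
  return type == SKETCH_UCS_4 && ch < 0x80000000u;
}
```

The sketch checks only the numeric range; whatever further checks `decode_unicode_char` performs are outside its scope.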
2229 case UNICODE_UTF_7: | |
2500 | 2230 ABORT (); |
771 | 2231 break; |
2232 | |
2500 | 2233 default: ABORT (); |
771 | 2234 } |
2235 | |
2236 } | |
4096 | 2237 |
4688
7e54adf407a1
Fix a bug with Unicode error sequences and very short input strings.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4583
diff
changeset
|
2238 if (str->eof && counter) |
4096 | 2239 { |
2240 switch (type) | |
2241 { | |
2242 case UNICODE_UTF_8: | |
2243 indicate_invalid_utf_8(indicated_length, | |
2244 counter, ch, dst, data, | |
2245 ignore_bom); | |
2246 break; | |
2247 | |
2248 case UNICODE_UTF_16: | |
2249 case UNICODE_UCS_4: | |
2250 case UNICODE_UTF_32: | |
2251 if (8 == counter) | |
2252 { | |
2253 DECODE_ERROR_OCTET (ch, dst, data, ignore_bom); | |
2254 } | |
2255 else if (16 == counter) | |
2256 { | |
2257 if (little_endian) | |
2258 { | |
2259 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
2260 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2261 ignore_bom); | |
2262 } | |
2263 else | |
2264 { | |
2265 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2266 ignore_bom); | |
2267 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
2268 } | |
2269 } | |
2270 else if (24 == counter) | |
2271 { | |
2272 if (little_endian) | |
2273 { | |
2274 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, ignore_bom); | |
2275 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2276 ignore_bom); | |
2277 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
2278 ignore_bom); | |
2279 } | |
2280 else | |
2281 { | |
2282 DECODE_ERROR_OCTET ((ch >> 16) & 0xFF, dst, data, | |
2283 ignore_bom); | |
2284 DECODE_ERROR_OCTET ((ch >> 8) & 0xFF, dst, data, | |
2285 ignore_bom); | |
2286 DECODE_ERROR_OCTET (ch & 0xFF, dst, data, | |
2287 ignore_bom); | |
2288 } | |
2289 } | |
2290 else assert(0); | |
2291 break; | |
2292 } | |
2293 ch = 0; | |
4688 | 2294 counter = 0; |
4096 | 2295 } |
771 | 2296 |
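When input ends in the middle of a multi-octet unit, the block above hands the octets collected so far to `DECODE_ERROR_OCTET`. A minimal sketch of recovering those octets in their original stream order follows, assuming the accumulation scheme of the UCS-4/UTF-32 case (`ch = (c << counter) | ch` for little-endian input, `ch = (ch << 8) | c` otherwise). `pending_octets` is a hypothetical helper, not part of unicode.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the 1..3 octets held in CH to OUT in the order they were read.
   COUNTER is the number of accumulated bits (8, 16 or 24).
   Returns the number of octets written. */
static size_t
pending_octets (uint32_t ch, int counter, int little_endian,
                unsigned char out[3])
{
  size_t n;

  assert (counter == 8 || counter == 16 || counter == 24);
  n = (size_t) counter / 8;
  for (size_t i = 0; i < n; i++)
    {
      /* Little-endian: first octet read sits in the lowest bits.
         Big-endian: first octet read sits in the highest occupied bits. */
      int shift = little_endian ? 8 * (int) i : 8 * (int) (n - 1 - i);
      out[i] = (unsigned char) ((ch >> shift) & 0xFF);
    }
  return n;
}
```

For the octets 12 34 56, a big-endian accumulator holds 0x123456 and a little-endian one holds 0x563412; in both cases the helper returns 12 34 56.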
2297 data->counter = counter; | |
4096 | 2298 data->indicated_length = indicated_length; |
771 | 2299 } |
2300 else | |
2301 { | |
2302 unsigned char char_boundary = data->current_char_boundary; | |
2303 Lisp_Object charset = data->current_charset; | |
2304 | |
2305 #ifdef ENABLE_COMPOSITE_CHARS | |
2306 /* flags for handling composite chars. We do a little switcheroo | |
2307 on the source while we're outputting the composite char. */ | |
2308 Bytecount saved_n = 0; | |
867 | 2309 const Ibyte *saved_src = NULL; |
771 | 2310 int in_composite = 0; |
2311 | |
2312 back_to_square_n: | |
2313 #endif /* ENABLE_COMPOSITE_CHARS */ | |
2314 | |
2315 if (XCODING_SYSTEM_UNICODE_NEED_BOM (str->codesys) && !data->wrote_bom) | |
2316 { | |
4096 | 2317 encode_unicode_char_1 (0xFEFF, dst, type, little_endian, 1); |
771 | 2318 data->wrote_bom = 1; |
2319 } | |
2320 | |
2321 while (n--) | |
2322 { | |
867 | 2323 Ibyte c = *src++; |
771 | 2324 |
2325 #ifdef MULE | |
826 | 2326 if (byte_ascii_p (c)) |
771 | 2327 #endif /* MULE */ |
2328 { /* Processing ASCII character */ | |
2329 ch = 0; | |
2330 encode_unicode_char (Vcharset_ascii, c, 0, dst, type, | |
4096 | 2331 little_endian, 1); |
771 | 2332 |
2333 char_boundary = 1; | |
2334 } | |
2335 #ifdef MULE | |
867 | 2336 else if (ibyte_leading_byte_p (c) || ibyte_leading_byte_p (ch)) |
771 | 2337 { /* Processing Leading Byte */ |
2338 ch = 0; | |
826 | 2339 charset = charset_by_leading_byte (c); |
2340 if (leading_byte_prefix_p(c)) | |
771 | 2341 ch = c; |
2342 char_boundary = 0; | |
2343 } | |
2344 else | |
2345 { /* Processing Non-ASCII character */ | |
2346 char_boundary = 1; | |
2347 if (EQ (charset, Vcharset_control_1)) | |
2704 | 2348 /* See: |
2349 | |
2350 (Info-goto-node "(internals)Internal String Encoding") | |
2351 | |
2352 for the rationale behind subtracting #xa0 from the | |
2353 character's code. */ | |
2354 encode_unicode_char (Vcharset_control_1, c - 0xa0, 0, dst, | |
4096 | 2355 type, little_endian, 1); |
771 | 2356 else |
2357 { | |
2358 switch (XCHARSET_REP_BYTES (charset)) | |
2359 { | |
2360 case 2: | |
2361 encode_unicode_char (charset, c, 0, dst, type, | |
4096 | 2362 little_endian, 1); |
771 | 2363 break; |
2364 case 3: | |
2365 if (XCHARSET_PRIVATE_P (charset)) | |
2366 { | |
2367 encode_unicode_char (charset, c, 0, dst, type, | |
4096 | 2368 little_endian, 1); |
771 | 2369 ch = 0; |
2370 } | |
2371 else if (ch) | |
2372 { | |
2373 #ifdef ENABLE_COMPOSITE_CHARS | |
2374 if (EQ (charset, Vcharset_composite)) | |
2375 { | |
2376 if (in_composite) | |
2377 { | |
2378 /* #### Bother! We don't know how to | |
2379 handle this yet. */ | |
2380 encode_unicode_char (Vcharset_ascii, '~', 0, | |
2381 dst, type, | |
4096 | 2382 little_endian, 1); |
771 | 2383 } |
2384 else | |
2385 { | |
867 | 2386 Ichar emch = make_ichar (Vcharset_composite, |
771 | 2387 ch & 0x7F, |
2388 c & 0x7F); | |
2389 Lisp_Object lstr = | |
2390 composite_char_string (emch); | |
2391 saved_n = n; | |
2392 saved_src = src; | |
2393 in_composite = 1; | |
2394 src = XSTRING_DATA (lstr); | |
2395 n = XSTRING_LENGTH (lstr); | |
2396 } | |
2397 } | |
2398 else | |
2399 #endif /* ENABLE_COMPOSITE_CHARS */ | |
2400 encode_unicode_char (charset, ch, c, dst, type, | |
4096 | 2401 little_endian, 1); |
771 | 2402 ch = 0; |
2403 } | |
2404 else | |
2405 { | |
2406 ch = c; | |
2407 char_boundary = 0; | |
2408 } | |
2409 break; | |
2410 case 4: | |
2411 if (ch) | |
2412 { | |
2413 encode_unicode_char (charset, ch, c, dst, type, | |
4096 | 2414 little_endian, 1); |
771 | 2415 ch = 0; |
2416 } | |
2417 else | |
2418 { | |
2419 ch = c; | |
2420 char_boundary = 0; | |
2421 } | |
2422 break; | |
2423 default: | |
2500 | 2424 ABORT (); |
771 | 2425 } |
2426 } | |
2427 } | |
2428 #endif /* MULE */ | |
2429 } | |
2430 | |
2431 #ifdef ENABLE_COMPOSITE_CHARS | |
2432 if (in_composite) | |
2433 { | |
2434 n = saved_n; | |
2435 src = saved_src; | |
2436 in_composite = 0; | |
2437 goto back_to_square_n; /* Wheeeeeeeee ..... */ | |
2438 } | |
2439 #endif /* ENABLE_COMPOSITE_CHARS */ | |
2440 | |
2441 data->current_char_boundary = char_boundary; | |
2442 data->current_charset = charset; | |
2443 | |
2444 /* La palabra se hizo carne! */ | |
2445 /* A palavra fez-se carne! */ | |
2446 /* Whatever. */ | |
2447 } | |
2448 | |
2449 str->ch = ch; | |
2450 return orign; | |
2451 } | |
2452 | |
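On the encoding side, the function above writes U+FEFF once per stream when the coding system asks for a BOM, then encodes each character with the configured byte order. The sketch below shows that idea for UTF-16 only. `struct sketch_encoder`, `put_utf16_unit` and `encode_utf16` are hypothetical stand-ins; the real work is done by `encode_unicode_char` and `encode_unicode_char_1`, whose internals are not shown in this listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_encoder
{
  int need_bom;       /* caller asked for a byte order mark */
  int wrote_bom;      /* already emitted for this stream? */
  int little_endian;
};

/* Store one 16-bit code unit in the requested byte order. */
static size_t
put_utf16_unit (unsigned int unit, int little_endian, unsigned char *dst)
{
  unsigned char hi = (unsigned char) ((unit >> 8) & 0xFF);
  unsigned char lo = (unsigned char) (unit & 0xFF);

  dst[0] = little_endian ? lo : hi;
  dst[1] = little_endian ? hi : lo;
  return 2;
}

/* Encode one code point; DST needs room for 6 octets (BOM plus a surrogate
   pair).  Surrogate code points are not rejected in this sketch. */
static size_t
encode_utf16 (struct sketch_encoder *enc, uint32_t cp, unsigned char *dst)
{
  size_t n = 0;

  assert (cp <= 0x10FFFF);
  if (enc->need_bom && !enc->wrote_bom)
    {
      n += put_utf16_unit (0xFEFF, enc->little_endian, dst + n);
      enc->wrote_bom = 1;
    }
  if (cp >= 0x10000)
    {
      uint32_t v = cp - 0x10000;
      n += put_utf16_unit (0xD800 + (v >> 10), enc->little_endian, dst + n);
      n += put_utf16_unit (0xDC00 + (v & 0x3FF), enc->little_endian, dst + n);
    }
  else
    n += put_utf16_unit (cp, enc->little_endian, dst + n);
  return n;
}
```

With a little-endian encoder that needs a BOM, the first call for 'A' produces FF FE 41 00; later calls produce only the character itself.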
2453 /* DEFINE_DETECTOR (utf_7); */ | |
2454 DEFINE_DETECTOR (utf_8); | |
2455 DEFINE_DETECTOR_CATEGORY (utf_8, utf_8); | |
985 | 2456 DEFINE_DETECTOR_CATEGORY (utf_8, utf_8_bom); |
771 | 2457 DEFINE_DETECTOR (ucs_4); |
2458 DEFINE_DETECTOR_CATEGORY (ucs_4, ucs_4); | |
2459 DEFINE_DETECTOR (utf_16); | |
2460 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16); | |
2461 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian); | |
2462 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_bom); | |
2463 DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom); | |
2464 | |
2465 struct ucs_4_detector | |
2466 { | |
2467 int in_ucs_4_byte; | |
2468 }; | |
2469 | |
2470 static void | |
2471 ucs_4_detect (struct detection_state *st, const UExtbyte *src, | |
2472 Bytecount n) | |
2473 { | |
2474 struct ucs_4_detector *data = DETECTION_STATE_DATA (st, ucs_4); | |
2475 | |
2476 while (n--) | |
2477 { | |
2478 UExtbyte c = *src++; | |
2479 switch (data->in_ucs_4_byte) | |
2480 { | |
2481 case 0: | |
2482 if (c >= 128) | |
2483 { | |
2484 DET_RESULT (st, ucs_4) = DET_NEARLY_IMPOSSIBLE; | |
2485 return; | |
2486 } | |
2487 else | |
2488 data->in_ucs_4_byte++; | |
2489 break; | |
2490 case 3: | |
2491 data->in_ucs_4_byte = 0; | |
2492 break; | |
2493 default: | |
2494 data->in_ucs_4_byte++; | |
2495 } | |
2496 } | |
2497 | |
2498 /* !!#### write this for real */ | |
2499 DET_RESULT (st, ucs_4) = DET_AS_LIKELY_AS_UNLIKELY; | |
2500 } | |
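The only hard rule in `ucs_4_detect` is that the first octet of every four-octet group must be below 128, which holds for big-endian UCS-4 because its values fit in 31 bits. A one-shot sketch of that test is given here; `could_be_ucs4` is a hypothetical name, and unlike the original it does not carry its position across calls in `data->in_ucs_4_byte`.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 as soon as the first octet of a 4-octet group has its high bit
   set, otherwise 1 (meaning only "cannot be ruled out"). */
static int
could_be_ucs4 (const unsigned char *src, size_t n)
{
  int pos = 0;                  /* position inside the current group, 0..3 */

  assert (src != NULL || n == 0);
  while (n--)
    {
      unsigned char c = *src++;
      if (pos == 0 && c >= 128)
        return 0;
      pos = (pos + 1) % 4;
    }
  return 1;
}
```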
2501 | |
2502 struct utf_16_detector | |
2503 { | |
2504 unsigned int seen_ffff:1; | |
2505 unsigned int seen_forward_bom:1; | |
2506 unsigned int seen_rev_bom:1; | |
2507 int byteno; | |
2508 int prev_char; | |
2509 int text, rev_text; | |
1267 | 2510 int sep, rev_sep; |
2511 int num_ascii; | |
771 | 2512 }; |
2513 | |
2514 static void | |
2515 utf_16_detect (struct detection_state *st, const UExtbyte *src, | |
2516 Bytecount n) | |
2517 { | |
2518 struct utf_16_detector *data = DETECTION_STATE_DATA (st, utf_16); | |
2519 | |
2520 while (n--) | |
2521 { | |
2522 UExtbyte c = *src++; | |
2523 int prevc = data->prev_char; | |
2524 if (data->byteno == 1 && c == 0xFF && prevc == 0xFE) | |
2525 data->seen_forward_bom = 1; | |
2526 else if (data->byteno == 1 && c == 0xFE && prevc == 0xFF) | |
2527 data->seen_rev_bom = 1; | |
2528 | |
2529 if (data->byteno & 1) | |
2530 { | |
2531 if (c == 0xFF && prevc == 0xFF) | |
2532 data->seen_ffff = 1; | |
2533 if (prevc == 0 | |
2534 && (c == '\r' || c == '\n' | |
2535 || (c >= 0x20 && c <= 0x7E))) | |
2536 data->text++; | |
2537 if (c == 0 | |
2538 && (prevc == '\r' || prevc == '\n' | |
2539 || (prevc >= 0x20 && prevc <= 0x7E))) | |
2540 data->rev_text++; | |
1267 | 2541 /* #### 0x2028 is LINE SEPARATOR and 0x2029 is PARAGRAPH SEPARATOR. |
2542 I used to count these in text and rev_text but that is very bad, | |
2543 as 0x2028 is also space + left-paren in ASCII, which is extremely | |
2544 common. So, what do we do with these? */ | |
771 | 2545 if (prevc == 0x20 && (c == 0x28 || c == 0x29)) |
1267 | 2546 data->sep++; |
771 | 2547 if (c == 0x20 && (prevc == 0x28 || prevc == 0x29)) |
1267 | 2548 data->rev_sep++; |
771 | 2549 } |
2550 | |
1267 | 2551 if ((c >= ' ' && c <= '~') || c == '\n' || c == '\r' || c == '\t' || |
2552 c == '\f' || c == '\v') | |
2553 data->num_ascii++; | |
771 | 2554 data->byteno++; |
2555 data->prev_char = c; | |
2556 } | |
2557 | |
2558 { | |
2559 int variance_indicates_big_endian = | |
2560 (data->text >= 10 | |
2561 && (data->rev_text == 0 | |
2562 || data->text / data->rev_text >= 10)); | |
2563 int variance_indicates_little_endian = | |
2564 (data->rev_text >= 10 | |
2565 && (data->text == 0 | |
2566 || data->rev_text / data->text >= 10)); | |
2567 | |
2568 if (data->seen_ffff) | |
2569 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2570 else if (data->seen_forward_bom) | |
2571 { | |
2572 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2573 if (variance_indicates_big_endian) | |
2574 DET_RESULT (st, utf_16_bom) = DET_NEAR_CERTAINTY; | |
2575 else if (variance_indicates_little_endian) | |
2576 DET_RESULT (st, utf_16_bom) = DET_SOMEWHAT_LIKELY; | |
2577 else | |
2578 DET_RESULT (st, utf_16_bom) = DET_QUITE_PROBABLE; | |
2579 } | |
2580 else if (data->seen_forward_bom) | |
2581 { | |
2582 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2583 if (variance_indicates_big_endian) | |
2584 DET_RESULT (st, utf_16_bom) = DET_NEAR_CERTAINTY; | |
2585 else if (variance_indicates_little_endian) | |
2586 /* #### may need to rethink */ | |
2587 DET_RESULT (st, utf_16_bom) = DET_SOMEWHAT_LIKELY; | |
2588 else | |
2589 /* #### may need to rethink */ | |
2590 DET_RESULT (st, utf_16_bom) = DET_QUITE_PROBABLE; | |
2591 } | |
2592 else if (data->seen_rev_bom) | |
2593 { | |
2594 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2595 if (variance_indicates_little_endian) | |
2596 DET_RESULT (st, utf_16_little_endian_bom) = DET_NEAR_CERTAINTY; | |
2597 else if (variance_indicates_big_endian) | |
2598 /* #### may need to rethink */ | |
2599 DET_RESULT (st, utf_16_little_endian_bom) = DET_SOMEWHAT_LIKELY; | |
2600 else | |
2601 /* #### may need to rethink */ | |
2602 DET_RESULT (st, utf_16_little_endian_bom) = DET_QUITE_PROBABLE; | |
2603 } | |
2604 else if (variance_indicates_big_endian) | |
2605 { | |
2606 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2607 DET_RESULT (st, utf_16) = DET_SOMEWHAT_LIKELY; | |
2608 DET_RESULT (st, utf_16_little_endian) = DET_SOMEWHAT_UNLIKELY; | |
2609 } | |
2610 else if (variance_indicates_little_endian) | |
2611 { | |
2612 SET_DET_RESULTS (st, utf_16, DET_NEARLY_IMPOSSIBLE); | |
2613 DET_RESULT (st, utf_16) = DET_SOMEWHAT_UNLIKELY; | |
2614 DET_RESULT (st, utf_16_little_endian) = DET_SOMEWHAT_LIKELY; | |
2615 } | |
2616 else | |
1267 | 2617 { |
2618 /* #### FUCKME! There should really be an ASCII detector. This | |
2619 would rule out the need to have this built-in here as | |
2620 well. --ben */ | |
1292 | 2621 int pct_ascii = data->byteno ? (100 * data->num_ascii) / data->byteno |
2622 : 100; | |
1267 | 2623 |
2624 if (pct_ascii > 90) | |
2625 SET_DET_RESULTS (st, utf_16, DET_QUITE_IMPROBABLE); | |
2626 else if (pct_ascii > 75) | |
2627 SET_DET_RESULTS (st, utf_16, DET_SOMEWHAT_UNLIKELY); | |
2628 else | |
2629 SET_DET_RESULTS (st, utf_16, DET_AS_LIKELY_AS_UNLIKELY); | |
2630 } | |
771 | 2631 } |
2632 } | |
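Without a BOM, `utf_16_detect` guesses the byte order by counting octet pairs that look like ASCII text in each orientation and requiring at least ten hits with a ratio of ten to one. A minimal sketch of just that heuristic follows. `struct sketch_counts`, `count_ascii_pairs` and `endianness_hint` are hypothetical names, and the sketch ignores the BOM, 0xFFFF and separator bookkeeping of the original.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_counts { int text; int rev_text; };

static int
is_plain_ascii (unsigned char c)
{
  return c == '\r' || c == '\n' || (c >= 0x20 && c <= 0x7E);
}

/* Count pairs of the form (0, ASCII) and (ASCII, 0) at even offsets. */
static struct sketch_counts
count_ascii_pairs (const unsigned char *src, size_t n)
{
  struct sketch_counts k = { 0, 0 };

  assert (src != NULL || n == 0);
  for (size_t i = 1; i < n; i += 2)
    {
      unsigned char prevc = src[i - 1], c = src[i];
      if (prevc == 0 && is_plain_ascii (c))
        k.text++;                 /* looks like big-endian text */
      if (c == 0 && is_plain_ascii (prevc))
        k.rev_text++;             /* looks like little-endian text */
    }
  return k;
}

/* 1 = big-endian, -1 = little-endian, 0 = no verdict. */
static int
endianness_hint (struct sketch_counts k)
{
  if (k.text >= 10 && (k.rev_text == 0 || k.text / k.rev_text >= 10))
    return 1;
  if (k.rev_text >= 10 && (k.text == 0 || k.rev_text / k.text >= 10))
    return -1;
  return 0;
}
```

The zero checks before each division matter: a stream with no hits in one orientation would otherwise divide by zero.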
2633 | |
2634 struct utf_8_detector | |
2635 { | |
985 | 2636 int byteno; |
2637 int first_byte; | |
2638 int second_byte; | |
1267 | 2639 int prev_byte; |
771 | 2640 int in_utf_8_byte; |
1267 | 2641 int recent_utf_8_sequence; |
2642 int seen_bogus_utf8; | |
2643 int seen_really_bogus_utf8; | |
2644 int seen_2byte_sequence; | |
2645 int seen_longer_sequence; | |
2646 int seen_iso2022_esc; | |
2647 int seen_iso_shift; | |
1887 | 2648 unsigned int seen_utf_bom:1; |
771 | 2649 }; |
2650 | |
2651 static void | |
2652 utf_8_detect (struct detection_state *st, const UExtbyte *src, | |
2653 Bytecount n) | |
2654 { | |
2655 struct utf_8_detector *data = DETECTION_STATE_DATA (st, utf_8); | |
2656 | |
2657 while (n--) | |
2658 { | |
2659 UExtbyte c = *src++; | |
985 | 2660 switch (data->byteno) |
2661 { | |
2662 case 0: | |
2663 data->first_byte = c; | |
2664 break; | |
2665 case 1: | |
2666 data->second_byte = c; | |
2667 break; | |
2668 case 2: | |
2669 if (data->first_byte == 0xef && | |
2670 data->second_byte == 0xbb && | |
2671 c == 0xbf) | |
1267 | 2672 data->seen_utf_bom = 1; |
985 | 2673 break; |
2674 } | |
2675 | |
771 | 2676 switch (data->in_utf_8_byte) |
2677 { | |
2678 case 0: | |
1267 | 2679 if (data->prev_byte == ISO_CODE_ESC && c >= 0x28 && c <= 0x2F) |
2680 data->seen_iso2022_esc++; | |
2681 else if (c == ISO_CODE_SI || c == ISO_CODE_SO) | |
2682 data->seen_iso_shift++; | |
771 | 2683 else if (c >= 0xfc) |
2684 data->in_utf_8_byte = 5; | |
2685 else if (c >= 0xf8) | |
2686 data->in_utf_8_byte = 4; | |
2687 else if (c >= 0xf0) | |
2688 data->in_utf_8_byte = 3; | |
2689 else if (c >= 0xe0) | |
2690 data->in_utf_8_byte = 2; | |
2691 else if (c >= 0xc0) | |
2692 data->in_utf_8_byte = 1; | |
2693 else if (c >= 0x80) | |
1267 | 2694 data->seen_bogus_utf8++; |
2695 if (data->in_utf_8_byte > 0) | |
2696 data->recent_utf_8_sequence = data->in_utf_8_byte; | |
771 | 2697 break; |
2698 default: | |
2699 if ((c & 0xc0) != 0x80) | |
1267 | 2700 data->seen_really_bogus_utf8++; |
2701 else | |
771 | 2702 { |
1267 | 2703 data->in_utf_8_byte--; |
2704 if (data->in_utf_8_byte == 0) | |
2705 { | |
2706 if (data->recent_utf_8_sequence == 1) | |
2707 data->seen_2byte_sequence++; | |
2708 else | |
2709 { | |
2710 assert (data->recent_utf_8_sequence >= 2); | |
2711 data->seen_longer_sequence++; | |
2712 } | |
2713 } | |
771 | 2714 } |
2715 } | |
985 | 2716 |
2717 data->byteno++; | |
1267 | 2718 data->prev_byte = c; |
771 | 2719 } |
1267 | 2720 |
2721 /* either BOM or no BOM, but not both */ | |
2722 SET_DET_RESULTS (st, utf_8, DET_NEARLY_IMPOSSIBLE); | |
2723 | |
2724 | |
2725 if (data->seen_utf_bom) | |
2726 DET_RESULT (st, utf_8_bom) = DET_NEAR_CERTAINTY; | |
2727 else | |
2728 { | |
2729 if (data->seen_really_bogus_utf8 || | |
2730 data->seen_bogus_utf8 >= 2) | |
2731 ; /* bogus */ | |
2732 else if (data->seen_bogus_utf8) | |
2733 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
2734 else if ((data->seen_longer_sequence >= 5 || | |
2735 data->seen_2byte_sequence >= 10) && | |
2736 (!(data->seen_iso2022_esc + data->seen_iso_shift) || | |
2737 (data->seen_longer_sequence * 2 + data->seen_2byte_sequence) / | |
2738 (data->seen_iso2022_esc + data->seen_iso_shift) >= 10)) | |
2739 /* heuristics, heuristics, we love heuristics */ | |
2740 DET_RESULT (st, utf_8) = DET_QUITE_PROBABLE; | |
2741 else if (data->seen_iso2022_esc || | |
2742 data->seen_iso_shift >= 3) | |
2743 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
2744 else if (data->seen_longer_sequence || | |
2745 data->seen_2byte_sequence) | |
2746 DET_RESULT (st, utf_8) = DET_SOMEWHAT_LIKELY; | |
2747 else if (data->seen_iso_shift) | |
2748 DET_RESULT (st, utf_8) = DET_SOMEWHAT_UNLIKELY; | |
2749 else | |
2750 DET_RESULT (st, utf_8) = DET_AS_LIKELY_AS_UNLIKELY; | |
2751 } | |
771 | 2752 } |
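`utf_8_detect` rests on two small tests: classifying a lead octet by how many continuation octets it announces, and spotting the EF BB BF signature at the start of the stream. Both are sketched below with the same thresholds as the listing. The function names are hypothetical. Note that the 0xf8 and 0xfc thresholds correspond to the five- and six-octet forms of the original UTF-8 definition, which RFC 3629 no longer permits.

```c
#include <assert.h>
#include <stddef.h>

/* Number of continuation octets announced by C: 0 for ASCII, 1..5 for a
   lead octet, -1 for a continuation octet appearing where a lead octet
   was expected. */
static int
continuation_count (unsigned char c)
{
  if (c >= 0xfc) return 5;
  if (c >= 0xf8) return 4;
  if (c >= 0xf0) return 3;
  if (c >= 0xe0) return 2;
  if (c >= 0xc0) return 1;
  if (c >= 0x80) return -1;
  return 0;
}

/* A continuation octet has the bit pattern 10xxxxxx. */
static int
is_continuation (unsigned char c)
{
  return (c & 0xc0) == 0x80;
}

static int
starts_with_utf8_bom (const unsigned char *src, size_t n)
{
  assert (src != NULL || n == 0);
  return n >= 3 && src[0] == 0xef && src[1] == 0xbb && src[2] == 0xbf;
}
```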
2753 | |
2754 static void | |
2755 unicode_init_coding_stream (struct coding_stream *str) | |
2756 { | |
2757 struct unicode_coding_stream *data = | |
2758 CODING_STREAM_TYPE_DATA (str, unicode); | |
2759 xzero (*data); | |
2760 data->current_charset = Qnil; | |
2761 } | |
2762 | |
2763 static void | |
2764 unicode_rewind_coding_stream (struct coding_stream *str) | |
2765 { | |
2766 unicode_init_coding_stream (str); | |
2767 } | |
2768 | |
2769 static int | |
2770 unicode_putprop (Lisp_Object codesys, Lisp_Object key, Lisp_Object value) | |
2771 { | |
3767 | 2772 if (EQ (key, Qunicode_type)) |
771 | 2773 { |
2774 enum unicode_type type; | |
2775 | |
2776 if (EQ (value, Qutf_8)) | |
2777 type = UNICODE_UTF_8; | |
2778 else if (EQ (value, Qutf_16)) | |
2779 type = UNICODE_UTF_16; | |
2780 else if (EQ (value, Qutf_7)) | |
2781 type = UNICODE_UTF_7; | |
2782 else if (EQ (value, Qucs_4)) | |
2783 type = UNICODE_UCS_4; | |
4096 | 2784 else if (EQ (value, Qutf_32)) |
2785 type = UNICODE_UTF_32; | |
771 | 2786 else |
2787 invalid_constant ("Invalid Unicode type", key); | |
2788 | |
2789 XCODING_SYSTEM_UNICODE_TYPE (codesys) = type; | |
2790 } | |
2791 else if (EQ (key, Qlittle_endian)) | |
2792 XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (codesys) = !NILP (value); | |
2793 else if (EQ (key, Qneed_bom)) | |
2794 XCODING_SYSTEM_UNICODE_NEED_BOM (codesys) = !NILP (value); | |
2795 else | |
2796 return 0; | |
2797 return 1; | |
2798 } | |
2799 | |
2800 static Lisp_Object | |
2801 unicode_getprop (Lisp_Object coding_system, Lisp_Object prop) | |
2802 { | |
3767 | 2803 if (EQ (prop, Qunicode_type)) |
771 | 2804 { |
2805 switch (XCODING_SYSTEM_UNICODE_TYPE (coding_system)) | |
2806 { | |
2807 case UNICODE_UTF_16: return Qutf_16; | |
2808 case UNICODE_UTF_8: return Qutf_8; | |
2809 case UNICODE_UTF_7: return Qutf_7; | |
2810 case UNICODE_UCS_4: return Qucs_4; | |
4096 | 2811 case UNICODE_UTF_32: return Qutf_32; |
2500 | 2812 default: ABORT (); |
771 | 2813 } |
2814 } | |
2815 else if (EQ (prop, Qlittle_endian)) | |
2816 return XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (coding_system) ? Qt : Qnil; | |
2817 else if (EQ (prop, Qneed_bom)) | |
2818 return XCODING_SYSTEM_UNICODE_NEED_BOM (coding_system) ? Qt : Qnil; | |
2819 return Qunbound; | |
2820 } | |
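`unicode_putprop` and `unicode_getprop` are inverse mappings between Lisp symbols and `enum unicode_type`. The sketch below shows the same round trip with plain C strings in place of Lisp symbols. The spelled-out names ("utf-8" and so on) and all `sketch_` identifiers are assumptions made for illustration; the listing itself only shows the C-level symbols such as `Qutf_8`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_unicode_type { SK_UTF_8, SK_UTF_16, SK_UTF_7, SK_UCS_4, SK_UTF_32 };

/* Indexed by enum value, so both directions share one table. */
static const char *const sketch_names[] =
  { "utf-8", "utf-16", "utf-7", "ucs-4", "utf-32" };

#define SKETCH_NAME_COUNT (sizeof sketch_names / sizeof sketch_names[0])

/* Returns 1 and stores the type on success, 0 for an unknown name (where
   the original signals an invalid-constant error instead). */
static int
sketch_type_from_name (const char *name, enum sketch_unicode_type *out)
{
  for (size_t i = 0; i < SKETCH_NAME_COUNT; i++)
    if (strcmp (name, sketch_names[i]) == 0)
      {
        *out = (enum sketch_unicode_type) i;
        return 1;
      }
  return 0;
}

static const char *
sketch_name_from_type (enum sketch_unicode_type t)
{
  assert ((size_t) t < SKETCH_NAME_COUNT);
  return sketch_names[t];
}
```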
2821 | |
2822 static void | |
2286 | 2823 unicode_print (Lisp_Object cs, Lisp_Object printcharfun, |
2824 int UNUSED (escapeflag)) | |
771 | 2825 { |
3767 | 2826 write_fmt_string_lisp (printcharfun, "(%s", 1, |
2827 unicode_getprop (cs, Qunicode_type)); | |
771 | 2828 if (XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (cs)) |
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4834
diff
changeset
|
2829 write_ascstring (printcharfun, ", little-endian"); |
771 | 2830 if (XCODING_SYSTEM_UNICODE_NEED_BOM (cs)) |
4952 | 2831 write_ascstring (printcharfun, ", need-bom"); |
2832 write_ascstring (printcharfun, ")"); | |
771 | 2833 } |
2834 | |
4690
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2835 #ifdef MULE |
2836 DEFUN ("set-unicode-query-skip-chars-args", Fset_unicode_query_skip_chars_args, | |
2837 3, 3, 0, /* | |
2838 Specify strings as matching characters known to Unicode coding systems. | |
2839 | |
2840 QUERY-STRING is a string matching characters that can unequivocally be | |
2841 encoded by the Unicode coding systems. | |
2842 | |
2843 INVALID-STRING is a string to match XEmacs characters that represent known | |
2844 octets on disk, but that are invalid sequences according to Unicode. | |
2845 | |
2846 UTF-8-INVALID-STRING is a more restrictive string to match XEmacs characters | |
2847 that are invalid UTF-8 octets. | |
2848 | |
2849 All three strings are in the format accepted by `skip-chars-forward'. | |
2850 */ | |
2851 (query_string, invalid_string, utf_8_invalid_string)) | |
2852 { | |
2853 CHECK_STRING (query_string); | |
2854 CHECK_STRING (invalid_string); | |
2855 CHECK_STRING (utf_8_invalid_string); | |
2856 | |
2857 Vunicode_query_string = query_string; | |
2858 Vunicode_invalid_string = invalid_string; | |
2859 Vutf_8_invalid_string = utf_8_invalid_string; | |
2860 | |
2861 return Qnil; | |
2862 } | |
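The docstring above says the three strings use the `skip-chars-forward' format, in which a backslash quotes the next character and `a-z' denotes a range. The following is a reduced sketch of parsing that format, limited to single-octet characters and a 256-entry membership table. `fill_skip_table` is a hypothetical function, and its handling of a trailing dash (treated as a literal) is an assumption of the sketch rather than something the listing shows. Negation with a leading `^' and character classes are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Set TABLE[c] to 1 for every character matched by SPEC. */
static void
fill_skip_table (const char *spec, unsigned char table[256])
{
  const unsigned char *p, *pend;

  assert (spec != NULL && table != NULL);
  p = (const unsigned char *) spec;
  pend = p + strlen (spec);
  memset (table, 0, 256);

  while (p != pend)
    {
      unsigned char c = *p++;

      if (c == '\\')
        {
          if (p == pend)
            break;              /* lone trailing backslash: ignore it */
          c = *p++;
        }

      if (p != pend && *p == '-')
        {
          p++;                  /* skip over the dash */
          if (p == pend)
            {
              /* Assumption: a trailing dash is a literal dash. */
              table[c] = 1;
              table['-'] = 1;
              break;
            }
          unsigned char cend = *p++;
          for (unsigned int i = c; i <= cend; i++)
            table[i] = 1;       /* empty when c > cend */
        }
      else
        table[c] = 1;
    }
}
```

For example, the C string `"a-c\\-"` marks `a`, `b`, `c` and the literal dash, and nothing else.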
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2863 |
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2864 static void |
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
2865 add_lisp_string_to_skip_chars_range (Lisp_Object string, Lisp_Object rtab, |
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
                                     Lisp_Object value)
{
  Ibyte *p, *pend;
  Ichar c;

  p = XSTRING_DATA (string);
  pend = p + XSTRING_LENGTH (string);

  while (p != pend)
    {
      c = itext_ichar (p);

      INC_IBYTEPTR (p);

      if (c == '\\')
        {
          if (p == pend) break;
          c = itext_ichar (p);
          INC_IBYTEPTR (p);
        }

      if (p != pend && *p == '-')
        {
          Ichar cend;

          /* Skip over the dash. */
          p++;
          if (p == pend) break;
          cend = itext_ichar (p);

          Fput_range_table (make_int (c), make_int (cend), value,
                            rtab);

          INC_IBYTEPTR (p);
        }
      else
        {
          Fput_range_table (make_int (c), make_int (c), value, rtab);
        }
    }
}

/* This function wouldn't be necessary if initialised range tables were
   dumped properly; see
   http://mid.gmane.org/18179.49815.622843.336527@parhasard.net . */
static void
initialize_unicode_query_range_tables_from_strings (void)
{
  CHECK_STRING (Vunicode_query_string);
  CHECK_STRING (Vunicode_invalid_string);
  CHECK_STRING (Vutf_8_invalid_string);

  Vunicode_query_skip_chars = Fmake_range_table (Qstart_closed_end_closed);

  add_lisp_string_to_skip_chars_range (Vunicode_query_string,
                                       Vunicode_query_skip_chars,
                                       Qsucceeded);

  Vunicode_invalid_and_query_skip_chars
    = Fcopy_range_table (Vunicode_query_skip_chars);

  add_lisp_string_to_skip_chars_range (Vunicode_invalid_string,
                                       Vunicode_invalid_and_query_skip_chars,
                                       Qinvalid_sequence);

  Vutf_8_invalid_and_query_skip_chars
    = Fcopy_range_table (Vunicode_query_skip_chars);

  add_lisp_string_to_skip_chars_range (Vutf_8_invalid_string,
                                       Vutf_8_invalid_and_query_skip_chars,
                                       Qinvalid_sequence);
}

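The initialiser above follows a build-on-first-use pattern: one base table is filled, and each derived table starts as a copy of the base and is then extended. A minimal stand-alone sketch of that pattern follows. Everything in it is an assumption made for illustration: the `range_table` struct, the helper names, and the numeric ranges are invented and do not correspond to XEmacs range tables or to the real value of UNICODE_ERROR_OCTET_RANGE_START.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 8

/* Hypothetical miniature of a range table: closed ranges mapped to a tag.  */
struct range { int first, last; const char *value; };
struct range_table { struct range r[MAX_RANGES]; size_t n; };

static struct range_table query_skip;             /* base table */
static struct range_table invalid_and_query_skip; /* base plus error octets */
static int tables_ready;                          /* 0 until first use */

static void
table_put (struct range_table *t, int first, int last, const char *value)
{
  assert (t->n < MAX_RANGES);
  t->r[t->n].first = first;
  t->r[t->n].last = last;
  t->r[t->n].value = value;
  t->n++;
}

/* Return the tag for C, or NULL if no range covers it.  */
static const char *
table_get (const struct range_table *t, int c)
{
  size_t i;
  for (i = 0; i < t->n; i++)
    if (t->r[i].first <= c && c <= t->r[i].last)
      return t->r[i].value;
  return NULL;
}

/* Build the tables on first use; the numeric ranges are made up.  */
static void
ensure_tables (void)
{
  if (tables_ready)
    return;
  table_put (&query_skip, 0x0100, 0x017F, "succeeded");
  invalid_and_query_skip = query_skip;  /* copy, then extend the copy */
  table_put (&invalid_and_query_skip, 0x200000, 0x2000FF,
             "invalid-sequence");
  tables_ready = 1;
}
```

Because the derived table is a copy, extending it leaves the base table unchanged, so a caller can pick whichever table matches its flags, much as unicode_query chooses between the skip-chars tables.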
static Lisp_Object
unicode_query (Lisp_Object codesys, struct buffer *buf, Charbpos end,
               int flags)
{
  Charbpos pos = BUF_PT (buf), fail_range_start, fail_range_end;
  Charbpos pos_byte = BYTE_BUF_PT (buf);
  Lisp_Object skip_chars_range_table, result = Qnil;
  enum query_coding_failure_reasons failed_reason,
    previous_failed_reason = query_coding_succeeded;
4824
c12b646d84ee
changes to get things to compile under latest cygwin
Ben Wing <ben@xemacs.org>
parents:
4770
diff
changeset
|
  int checked_unicode,
    invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START,
    invalid_upper_limit = -1,
    unicode_type = XCODING_SYSTEM_UNICODE_TYPE (codesys);
4690
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|

  if (flags & QUERY_METHOD_HIGHLIGHT &&
      /* If we're being called really early, live without highlights getting
         cleared properly: */
      !(UNBOUNDP (XSYMBOL (Qquery_coding_clear_highlights)->function)))
    {
      /* It's okay to call Lisp here, the only non-stack object we may have
         allocated up to this point is skip_chars_range_table, and that's
         reachable from its entry in Vfixed_width_query_ranges_cache. */
      call3 (Qquery_coding_clear_highlights, make_int (pos), make_int (end),
             wrap_buffer (buf));
    }

  if (NILP (Vunicode_query_skip_chars))
    {
      initialize_unicode_query_range_tables_from_strings();
    }

  if (flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES)
    {
      switch (unicode_type)
        {
        case UNICODE_UTF_8:
          skip_chars_range_table = Vutf_8_invalid_and_query_skip_chars;
          break;
        case UNICODE_UTF_7:
          /* #### See above. */
          return Qunbound;
          break;
        default:
          skip_chars_range_table = Vunicode_invalid_and_query_skip_chars;
          break;
        }
    }
  else
    {
      switch (unicode_type)
        {
        case UNICODE_UTF_8:
          invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START + 0x80;
          invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF;
          break;
        case UNICODE_UTF_7:
          /* #### Work out what to do here in reality, read the spec and decide
             which octets are invalid. */
          return Qunbound;
          break;
        default:
          invalid_lower_limit = UNICODE_ERROR_OCTET_RANGE_START;
          invalid_upper_limit = UNICODE_ERROR_OCTET_RANGE_START + 0xFF;
          break;
        }

      skip_chars_range_table = Vunicode_query_skip_chars;
    }

  while (pos < end)
    {
      Ichar ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte);
      if ((ch < 0x100 ? 1 :
           (!EQ (Qnil, Fget_range_table (make_int (ch), skip_chars_range_table,
                                         Qnil)))))
        {
          pos++;
          INC_BYTEBPOS (buf, pos_byte);
        }
      else
        {
          fail_range_start = pos;
          while ((pos < end) &&
                 ((checked_unicode = ichar_to_unicode (ch),
                   -1 == checked_unicode
                   && (failed_reason = query_coding_unencodable))
                  || (!(flags & QUERY_METHOD_IGNORE_INVALID_SEQUENCES) &&
                      (invalid_lower_limit <= checked_unicode) &&
                      (checked_unicode <= invalid_upper_limit)
                      && (failed_reason = query_coding_invalid_sequence)))
                 && (previous_failed_reason == query_coding_succeeded
                     || previous_failed_reason == failed_reason))
            {
              pos++;
              INC_BYTEBPOS (buf, pos_byte);
              ch = BYTE_BUF_FETCH_CHAR (buf, pos_byte);
              previous_failed_reason = failed_reason;
            }

          if (fail_range_start == pos)
            {
              /* The character can actually be encoded; move on. */
              pos++;
              INC_BYTEBPOS (buf, pos_byte);
            }
          else
            {
              assert (previous_failed_reason == query_coding_invalid_sequence
                      || previous_failed_reason == query_coding_unencodable);

              if (flags & QUERY_METHOD_ERRORP)
                {
4952
19a72041c5ed
Mule-izing, various fixes related to char * arguments
Ben Wing <ben@xemacs.org>
parents:
4834
diff
changeset
|
                  signal_error_2
                    (Qtext_conversion_error,
                     "Cannot encode using coding system",
                     make_string_from_buffer (buf, fail_range_start,
                                              pos - fail_range_start),
                     XCODING_SYSTEM_NAME (codesys));
4690
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
                }

              if (NILP (result))
                {
                  result = Fmake_range_table (Qstart_closed_end_open);
                }

              fail_range_end = pos;

              Fput_range_table (make_int (fail_range_start),
                                make_int (fail_range_end),
                                (previous_failed_reason
                                 == query_coding_unencodable ?
                                 Qunencodable : Qinvalid_sequence),
                                result);
              previous_failed_reason = query_coding_succeeded;

              if (flags & QUERY_METHOD_HIGHLIGHT)
                {
                  Lisp_Object extent
                    = Fmake_extent (make_int (fail_range_start),
                                    make_int (fail_range_end),
                                    wrap_buffer (buf));

                  Fset_extent_priority
                    (extent, make_int (2 + mouse_highlight_priority));
                  Fset_extent_face (extent, Qquery_coding_warning_face);
                }
            }
        }
    }

257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
3089 return result; |
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
3090 } |
257b468bf2ca
Move the #'query-coding-region implementation to C.
Aidan Kehoe <kehoea@parhasard.net>
parents:
4688
diff
changeset
|
3091 #else /* !MULE */ |
4770:b9aaf2a18957  Add missing return value type to unicode_query.  Stephen J. Turnbull <stephen@xemacs.org>  (parent 4690)
4770  3092  static Lisp_Object
4690  3093  unicode_query (Lisp_Object UNUSED (codesys),
      3094                 struct buffer * UNUSED (buf),
      3095                 Charbpos UNUSED (end), int UNUSED (flags))
      3096  {
      3097    return Qnil;
      3098  }
      3099  #endif
      3100
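The tail of the MULE version of unicode_query above closes a failed range at the current position, stores it in a range table together with the reason (unencodable or invalid sequence), and resets the tracked reason to "succeeded". The following is a minimal standalone sketch of that run-collecting idea only, not the XEmacs implementation. The names `reason`, `fail_range` and `collect_fail_ranges` are hypothetical, the per-character status array is an assumption, and the ranges are half-open by choice in this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the query_coding_* reasons in the listing. */
enum reason { SUCCEEDED, UNENCODABLE, INVALID_SEQUENCE };

/* One recorded failure: characters start .. end-1 failed for reason WHY. */
struct fail_range { size_t start, end; enum reason why; };

/* Scan N per-character statuses and store one [start, end) range for each
   maximal run of identical failure reasons.  Returns the number of ranges
   stored; runs beyond MAX_OUT are dropped. */
size_t
collect_fail_ranges (const enum reason *status, size_t n,
                     struct fail_range *out, size_t max_out)
{
  enum reason previous = SUCCEEDED;
  size_t start = 0, count = 0, pos;

  assert (status != NULL || n == 0);

  for (pos = 0; pos <= n; pos++)
    {
      /* Treat the end of input as a success so a trailing run is closed. */
      enum reason current = pos < n ? status[pos] : SUCCEEDED;

      if (current == previous)
        continue;

      /* A failing run just ended: record it, as the listing does with
         Fput_range_table. */
      if (previous != SUCCEEDED && count < max_out)
        {
          out[count].start = start;
          out[count].end = pos;
          out[count].why = previous;
          count++;
        }

      /* A new failing run begins here. */
      if (current != SUCCEEDED)
        start = pos;
      previous = current;
    }

  return count;
}
```

A caller would pass the statuses of the characters it has just tried to encode and then walk the returned ranges, for example to highlight each one.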
 771  3101  int
2286  3102  dfc_coding_system_is_unicode (
      3103  #ifdef WIN32_ANY
      3104                                Lisp_Object codesys
      3105  #else
      3106                                Lisp_Object UNUSED (codesys)
      3107  #endif
      3108                                )
 771  3109  {
1315  3110  #ifdef WIN32_ANY
 771  3111    codesys = Fget_coding_system (codesys);
      3112    return (EQ (XCODING_SYSTEM_TYPE (codesys), Qunicode) &&
      3113            XCODING_SYSTEM_UNICODE_TYPE (codesys) == UNICODE_UTF_16 &&
      3114            XCODING_SYSTEM_UNICODE_LITTLE_ENDIAN (codesys));
      3115
      3116  #else
      3117    return 0;
      3118  #endif
      3119  }
      3120
      3121
      3122  /************************************************************************/
      3123  /* Initialization */
      3124  /************************************************************************/
      3125
      3126  void
      3127  syms_of_unicode (void)
      3128  {
      3129  #ifdef MULE
 877  3130    DEFSUBR (Funicode_precedence_list);
 771  3131    DEFSUBR (Fset_language_unicode_precedence_list);
      3132    DEFSUBR (Flanguage_unicode_precedence_list);
      3133    DEFSUBR (Fset_default_unicode_precedence_list);
      3134    DEFSUBR (Fdefault_unicode_precedence_list);
      3135    DEFSUBR (Fset_unicode_conversion);
      3136
1318  3137    DEFSUBR (Fload_unicode_mapping_table);
 771  3138
4690  3139    DEFSUBR (Fset_unicode_query_skip_chars_args);
      3140
3439  3141    DEFSYMBOL (Qccl_encode_to_ucs_2);
      3142    DEFSYMBOL (Qlast_allocated_character);
 771  3143    DEFSYMBOL (Qignore_first_column);
3659  3144
      3145    DEFSYMBOL (Qunicode_registries);
 771  3146  #endif /* MULE */
      3147
 800  3148    DEFSUBR (Fchar_to_unicode);
      3149    DEFSUBR (Funicode_to_char);
 771  3150
      3151    DEFSYMBOL (Qunicode);
      3152    DEFSYMBOL (Qucs_4);
      3153    DEFSYMBOL (Qutf_16);
4096  3154    DEFSYMBOL (Qutf_32);
 771  3155    DEFSYMBOL (Qutf_8);
      3156    DEFSYMBOL (Qutf_7);
      3157
      3158    DEFSYMBOL (Qneed_bom);
      3159
      3160    DEFSYMBOL (Qutf_16);
      3161    DEFSYMBOL (Qutf_16_little_endian);
      3162    DEFSYMBOL (Qutf_16_bom);
      3163    DEFSYMBOL (Qutf_16_little_endian_bom);
 985  3164
      3165    DEFSYMBOL (Qutf_8);
      3166    DEFSYMBOL (Qutf_8_bom);
 771  3167  }
      3168
      3169  void
      3170  coding_system_type_create_unicode (void)
      3171  {
      3172    INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (unicode, "unicode-coding-system-p");
      3173    CODING_SYSTEM_HAS_METHOD (unicode, print);
      3174    CODING_SYSTEM_HAS_METHOD (unicode, convert);
4690  3175    CODING_SYSTEM_HAS_METHOD (unicode, query);
 771  3176    CODING_SYSTEM_HAS_METHOD (unicode, init_coding_stream);
      3177    CODING_SYSTEM_HAS_METHOD (unicode, rewind_coding_stream);
      3178    CODING_SYSTEM_HAS_METHOD (unicode, putprop);
      3179    CODING_SYSTEM_HAS_METHOD (unicode, getprop);
      3180
      3181    INITIALIZE_DETECTOR (utf_8);
      3182    DETECTOR_HAS_METHOD (utf_8, detect);
      3183    INITIALIZE_DETECTOR_CATEGORY (utf_8, utf_8);
 985  3184    INITIALIZE_DETECTOR_CATEGORY (utf_8, utf_8_bom);
 771  3185
      3186    INITIALIZE_DETECTOR (ucs_4);
      3187    DETECTOR_HAS_METHOD (ucs_4, detect);
      3188    INITIALIZE_DETECTOR_CATEGORY (ucs_4, ucs_4);
      3189
      3190    INITIALIZE_DETECTOR (utf_16);
      3191    DETECTOR_HAS_METHOD (utf_16, detect);
      3192    INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16);
      3193    INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian);
      3194    INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_bom);
      3195    INITIALIZE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom);
      3196  }
      3197
      3198  void
      3199  reinit_coding_system_type_create_unicode (void)
      3200  {
      3201    REINITIALIZE_CODING_SYSTEM_TYPE (unicode);
      3202  }
      3203
      3204  void
      3205  vars_of_unicode (void)
      3206  {
      3207    Fprovide (intern ("unicode"));
      3208
      3209  #ifdef MULE
4270  3210    staticpro (&Vnumber_of_jit_charsets);
      3211    Vnumber_of_jit_charsets = make_int (0);
      3212    staticpro (&Vlast_jit_charset_final);
      3213    Vlast_jit_charset_final = make_char (0x30);
      3214    staticpro (&Vcharset_descr);
      3215    Vcharset_descr
4952  3216      = build_defer_string ("Mule charset for otherwise unknown Unicode code points.");
4270  3217
 771  3218    staticpro (&Vlanguage_unicode_precedence_list);
      3219    Vlanguage_unicode_precedence_list = Qnil;
      3220
      3221    staticpro (&Vdefault_unicode_precedence_list);
      3222    Vdefault_unicode_precedence_list = Qnil;
      3223
      3224    unicode_precedence_dynarr = Dynarr_new (Lisp_Object);
2367  3225    dump_add_root_block_ptr (&unicode_precedence_dynarr,
 771  3226                             &lisp_object_dynarr_description);
2367  3227
3659  3228
      3229
2367  3230    init_blank_unicode_tables ();
      3231
3439  3232    staticpro (&Vcurrent_jit_charset);
      3233    Vcurrent_jit_charset = Qnil;
      3234
2367  3235    /* Note that the "block" we are describing is a single pointer, and hence
      3236       we could potentially use dump_add_root_block_ptr(). However, given
      3237       the way the descriptions are written, we couldn't use them, and would
      3238       have to write new descriptions for each of the pointers below, since
      3239       we would have to make use of a description with an XD_BLOCK_ARRAY
      3240       in it. */
      3241
      3242    dump_add_root_block (&to_unicode_blank_1, sizeof (void *),
      3243                         to_unicode_level_1_desc_1);
      3244    dump_add_root_block (&to_unicode_blank_2, sizeof (void *),
      3245                         to_unicode_level_2_desc_1);
      3246
      3247    dump_add_root_block (&from_unicode_blank_1, sizeof (void *),
      3248                         from_unicode_level_1_desc_1);
      3249    dump_add_root_block (&from_unicode_blank_2, sizeof (void *),
      3250                         from_unicode_level_2_desc_1);
      3251    dump_add_root_block (&from_unicode_blank_3, sizeof (void *),
      3252                         from_unicode_level_3_desc_1);
      3253    dump_add_root_block (&from_unicode_blank_4, sizeof (void *),
      3254                         from_unicode_level_4_desc_1);
3659  3255
      3256    DEFVAR_LISP ("unicode-registries", &Qunicode_registries /*
      3257  Vector describing the X11 registries searched when using fallback fonts.
      3258
      3259  "Fallback fonts" here includes by default those fonts used by redisplay when
      3260  displaying charsets for which the `encode-as-utf-8' property is true, and
      3261  those used when no font matching the charset's registries property has been
      3262  found (that is, they're probably Mule-specific charsets like Ethiopic or
      3263  IPA.)
      3264  */ );
4952  3265    Qunicode_registries = vector1(build_ascstring("iso10646-1"));
4690  3266
      3267    /* Initialised in lisp/mule/general-late.el, by a call to
      3268       #'set-unicode-query-skip-chars-args. Or at least they would be, but we
      3269       can't do this at dump time right now, initialised range tables aren't
      3270       dumped properly. */
      3271    staticpro (&Vunicode_invalid_and_query_skip_chars);
      3272    Vunicode_invalid_and_query_skip_chars = Qnil;
      3273    staticpro (&Vutf_8_invalid_and_query_skip_chars);
      3274    Vutf_8_invalid_and_query_skip_chars = Qnil;
      3275    staticpro (&Vunicode_query_skip_chars);
      3276    Vunicode_query_skip_chars = Qnil;
      3277
      3278    /* If we could dump the range table above these wouldn't be necessary: */
      3279    staticpro (&Vunicode_query_string);
      3280    Vunicode_query_string = Qnil;
      3281    staticpro (&Vunicode_invalid_string);
      3282    Vunicode_invalid_string = Qnil;
      3283    staticpro (&Vutf_8_invalid_string);
      3284    Vutf_8_invalid_string = Qnil;
 771  3285  #endif /* MULE */
      3286  }
4834:b3ea9c582280  Use new cygwin_conv_path API with Cygwin 1.7 for converting names between Win32 and POSIX, UTF-8-aware, with attendant changes elsewhere  Ben Wing <ben@xemacs.org>  (parent 4824)
4834  3287
      3288  void
      3289  complex_vars_of_unicode (void)
      3290  {
      3291    /* We used to define this in unicode.el. But we need it early for
      3292       Cygwin 1.7 -- used in LOCAL_FILE_FORMAT_TO_TSTR() et al. */
      3293    Fmake_coding_system_internal
      3294      (Qutf_8, Qunicode,
4952  3295       build_defer_string ("UTF-8"),
4834  3296       nconc2 (list4 (Qdocumentation,
4952  3297                      build_defer_string (
4834  3298  "UTF-8 Unicode encoding -- ASCII-compatible 8-bit variable-width encoding\n"
      3299  "sharing the following principles with the Mule-internal encoding:\n"
      3300  "\n"
      3301  " -- All ASCII characters (codepoints 0 through 127) are represented\n"
      3302  " by themselves (i.e. using one byte, with the same value as the\n"
      3303  " ASCII codepoint), and these bytes are disjoint from bytes\n"
      3304  " representing non-ASCII characters.\n"
      3305  "\n"
      3306  " This means that any 8-bit clean application can safely process\n"
      3307  " UTF-8-encoded text as it were ASCII, with no corruption (e.g. a\n"
      3308  " '/' byte is always a slash character, never the second byte of\n"
      3309  " some other character, as with Big5, so a pathname encoded in\n"
      3310  " UTF-8 can safely be split up into components and reassembled\n"
      3311  " again using standard ASCII processes).\n"
      3312  "\n"
      3313  " -- Leading bytes and non-leading bytes in the encoding of a\n"
      3314  " character are disjoint, so moving backwards is easy.\n"
      3315  "\n"
      3316  " -- Given only the leading byte, you know how many following bytes\n"
      3317  " are present.\n"
      3318  ),
4952  3319                      Qmnemonic, build_ascstring ("UTF8")),
4834  3320               list2 (Qunicode_type, Qutf_8)));
      3321  }
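
The documentation string above names three properties of UTF-8: ASCII bytes never occur inside a multi-byte character, leading and non-leading bytes are disjoint, and the leading byte alone tells you the length of the sequence. The following is a minimal standalone sketch of those properties, assuming the byte ranges of RFC 3629. The function names are hypothetical and are not part of unicode.c.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the sequence introduced by LEAD, or 0 if LEAD is a
   continuation byte or cannot start a sequence under RFC 3629. */
size_t
utf8_sequence_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* ASCII, represented by itself */
  if (lead >= 0xC2 && lead <= 0xDF)
    return 2;
  if (lead >= 0xE0 && lead <= 0xEF)
    return 3;
  if (lead >= 0xF0 && lead <= 0xF4)
    return 4;
  return 0;                     /* 0x80-0xC1 and 0xF5-0xFF never lead */
}

/* Non-leading (continuation) bytes all have the bit pattern 10xxxxxx. */
int
utf8_is_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Return the index at which the character before POS starts.  Because
   continuation bytes are disjoint from leading bytes, it is enough to
   skip backwards over continuation bytes.  TEXT is assumed to be
   well-formed UTF-8 and POS to lie on a character boundary. */
size_t
utf8_previous_char_start (const unsigned char *text, size_t pos)
{
  assert (pos > 0);
  do
    pos--;
  while (pos > 0 && utf8_is_continuation (text[pos]));
  return pos;
}
```

For a pathname, this is why a plain byte search for '/' works on UTF-8 text: a 0x2F byte always has sequence length 1 and can never be a continuation byte.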