oct 27, 2001:

-------- proposal for better buffer-switching commands:

implement what VC++ currently has. you have a single "switch" command like
CTRL-TAB, which as long as you hold the CTRL button down, brings successive
buffers that are "next in line" into the current position, bumping the rest
forward. once you release the CTRL key, the chain is broken, and further
CTRL-TABs will start from the beginning again. this way, frequently used
buffers naturally move toward the front of the chain, and you can switch
back and forth between two buffers using CTRL-TAB. the only thing about
CTRL-TAB is it's a bit awkward. the way to implement it is to have
modifier-up strokes fire off a hook, like modifier-up-hook. this is driven
by event dispatch, so there are no synchronization issues. when C-tab is
pressed, the binding function does something like set a one-shot handler on
the modifier-up-hook (perhaps separate hooks for separate modifiers?).
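
rough sketch of the chain logic described above, in plain C. all names
here (buf_chain, chain_ctrl_tab, chain_modifier_up) are made up for
illustration and are not existing XEmacs functions; buffers are just ints,
and the "hook" is modelled as a flag plus a function the event loop would
call when CTRL goes up.

```c
#include <string.h>

#define MAX_BUFS 16

/* hypothetical structure, not XEmacs API */
struct buf_chain {
  int bufs[MAX_BUFS];   /* bufs[0] is the current buffer */
  int count;
  int pos;              /* how far this C-tab run has walked into the chain */
  int hook_armed;       /* one-shot modifier-up handler is pending */
};

/* C-tab pressed: step to the next buffer in line and arm the one-shot
   handler. returns the buffer to display, or -1 if there are none. */
int chain_ctrl_tab (struct buf_chain *c)
{
  if (c->count == 0)
    return -1;
  c->pos = (c->pos + 1) % c->count;
  c->hook_armed = 1;
  return c->bufs[c->pos];
}

/* modifier-up fired (CTRL released): move the chosen buffer to the front,
   bump the ones that were ahead of it forward by one, break the chain. */
void chain_modifier_up (struct buf_chain *c)
{
  if (!c->hook_armed)
    return;
  int chosen = c->bufs[c->pos];
  memmove (&c->bufs[1], &c->bufs[0], (size_t) c->pos * sizeof c->bufs[0]);
  c->bufs[0] = chosen;
  c->pos = 0;
  c->hook_armed = 0;
}
```

with the chain {1, 2, 3}, one C-tab plus release gives {2, 1, 3}, and a
second one gives {1, 2, 3} again, which is the back-and-forth behavior
wanted above.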

to do this, we'd also want to change the buffer tabs so that they maintain
their own order. in particular, they start out synched to the regular
order, but as you make changes, you don't want the tabs to change
order. (in fact, they may already do this.) selecting a particular buffer
from the buffer tabs DOES make the buffer go to the head of the line. the
invariant is that if the tabs are displaying X items, those X items are the
first X items in the standard buffer list, but may be in a different
order. (it looks like the tabs may already implement all of this.)

oct 26, 2001:

necessary testing/changes:

- test all eol detection stuff under windows w/ and w/o mule, unix w/ and
  w/o mule. (test configure flag, command-line flag, menu option) may need
  a way of pretending to be unix under cygwin.
- test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x
  windows w/ and w/o mule.
- test undecided-dos/unix/mac.
- check ESC ESC works as isearch-quit under TTY's.
- test coding-system-base and all its uses (grep for them).
- menu item to revert to most recent auto save.
- consider renaming build_string -> build_intstring and build_c_string to
  build_string. (consistent with build_msg_string et al; many more
  build_c_string than build_string)

oct 20, 2001:

fixed problem causing crash due to invalid internal-format data, fixed an
existing bug in valid_char_p, and added checks to more quickly catch when
invalid chars are generated. still need to investigate why
mswindows-multibyte is being detected.

i now see why -- we only process 65536 bytes due to a constant
MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as
long as we have a seekable stream. we also need to write
stderr_out_lisp(), used in the debug info routines i wrote.

check once more about DEBUG_XEMACS. i think debugging info should be
ON by default. make sure it is. check that nothing untoward will result
in a production system, e.g. presumably assert()s should not really abort().
(!! Actually, this should be runtime settable! Use a variable for this, and
it can be set using the same XEMACSDEBUG method. In fact, now that I think
of it, I'm sure that debugging info should be on always, with runtime ways
of turning on or off any funny behavior.)

oct 19, 2001:

fixed various bugs preventing packages from being able to be built. still
another bug, with psgml/etc/cdtd/docbook, which contains some strange
characters starting around char pos 110,000. It gets detected as
mswindows-multibyte (wrong! why?) and then invalid internal-format data is
generated. need to fix mswindows-multibyte (and possibly add something
that signals an error as well; need to work on this error-signalling
mechanism) and figure out why it's getting detected as such. what i should
do is add a debug var that outputs blow-by-blow info of the detection
process.

oct 9, 2001:

the stuff with global-window-system-map doesn't appear to work. in any
case it needs better documentation. [DONE]

M-home, M-end do work, but cause cl-macs to get loaded. why?

oct 8, 2001:

finished the coding system changes and they finally work!

need to implement undecided-unix/dos/mac. they should be easy to do; it
should be enough to specify an eol-type but not do-eol, but check this.

consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as
aliases.
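
a sketch of what the alias handling could look like if the lf/crlf/cr names
became canonical. this is only an illustration of the naming idea;
canonical_eol_name is a made-up helper, not an existing function, and the
real lookup would presumably go through the coding system alias machinery
rather than string surgery.

```c
#include <stdio.h>
#include <string.h>

/* rewrite "foo-unix" -> "foo-lf", "foo-dos" -> "foo-crlf",
   "foo-mac" -> "foo-cr". returns 1 if name ended in an alias suffix,
   0 otherwise (out then just gets a copy of name). hypothetical helper. */
int canonical_eol_name (const char *name, char *out, size_t outlen)
{
  static const char *const alias[][2] = {
    { "-unix", "-lf" }, { "-dos", "-crlf" }, { "-mac", "-cr" },
  };
  size_t n = strlen (name);
  for (size_t i = 0; i < sizeof alias / sizeof alias[0]; i++) {
    size_t s = strlen (alias[i][0]);
    if (n > s && strcmp (name + n - s, alias[i][0]) == 0) {
      /* copy everything before the suffix, then the canonical suffix */
      snprintf (out, outlen, "%.*s%s", (int) (n - s), name, alias[i][1]);
      return 1;
    }
  }
  snprintf (out, outlen, "%s", name);
  return 0;
}
```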

print methods for coding systems should include some of the generic
properties. (also then fix print_..._within_print_method). [DONE]

in a little while, go back and delete the text-file-wrapper-coding-system
code. (it'll be in CVS if necessary to get at it.) [DONE]

need to verify at some point that non-text-file coding systems work
properly when specified. when gzip is working, this would be a good test
case. (and consider creating base64 as well!)

remove extra crap from coding-system-category that checks for chain coding
systems. [DONE]

perhaps make a primitive that gets at coding-system-canonical. [DONE]

need to test cygwin, compiling the mule packages, get unix-eol stuff
working. frank from germany says he doesn't see a lisp backtrace when he
gets an error during temacs? verify that this actually gets outputted.

consider putting the current language on the modeline, mousable so it can
be switched. also consider making the coding system be mousable and the
line number (pick a line) and the percentage (pick a percentage).

oct 6, 2001:

added code so that debug_print() will output a newline to the mswindows
debugging output, not just the console. need to test. [DONE]

working on problem where all files are being detected as binary. the
problem may be that the undecided coding system is getting wrapped with an
auto-eol coding system, which it shouldn't be -- but even in this
situation, we should get the right results! check the
canonicalize-after-coding methods. also, determine_real_coding_system
appears to be getting called even when we're not detecting encoding. also,
undecided needs a print method to show its params, and chain needs to be
updated to show canonicalize_after_coding. check others as well. [DONE]

oct 5, 2001:

finished up coding system changes, testing.

errors byte-compiling files in iso-2022-7-bit. perhaps it's not correctly
detecting the encoding?

noticed a problem in the dfc macros: we call
get_coding_system_for_text_file with eol_wrap == 1, to allow for
auto-detection of the eol type; but this defeats the check and
short-circuit for unicode.

still need to implement calling determine_real_coding_system() for
non-seekable streams. to implement correctly, we need to do our own
buffering. [DONE, BUT WITHOUT BUFFERING]

oct 4, 2001:

implemented most stuff below.

need to finish up changes to make_coding_system_1. (i changed the way
internal coding systems were handled; i need to create subsidiaries for all
types of coding systems, not just text ones.) there's a nasty xfree() crash
i was hitting; perhaps it'll go away once all stuff has been rewritten.

check under cygwin to make sure that when an error occurs during loadup, a
backtrace is output.

as soon as andy releases his new setup, we should put it onto various
standard windows software repositories.

oct 3, 2001:

added global-tty-map and global-window-system-map. add some stuff to the
maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY's, and of course ESC
ESC on window systems vs. ESC ESC ESC on TTY's. [TEST]

was working on integrating the two help-for-tutorial versions (mule,
non-mule). [DONE, but test under non-Mule]

was working on the file-coding changes. need to think more about
text-file-wrapper. conclusion i think is that
get_coding_system_for_text_file should wrap using a special coding system
type called a text-file-wrapper, which inherits from chain, and implements
canonicalize-after-decoding to just return the unwrapped coding system. We
need to implement inheritance of coding systems, which will certainly come
in extremely useful when coding systems get implemented in Lisp, which
should happen at some point. (see existing docs about this.) essentially,
we have a way of declaring that we inherit from some system, and the
appropriate data structures get created, perhaps just an extra inheritance
pointer. but when we create the coding system, the extra data needs to be
a stretchy array of offsets, pointing to the type-specific data for the
coding system type and all its parents. that means that the methods
structure for a coding system (which perhaps should be expanded beyond
methods; it's really just a "class structure") holds the index into these
arrays of offsets. CODING_SYSTEM_DATA() can take any of the coding system
classes (rename type to class!) that make up this class. similarly, a coding
system class inherits its methods from the class above unless specifying
its own method, and can call the superclass method at any point by either
just invoking its name, or conceivably by some macro like

CALL_SUPER (method, (args))

similar mods would have to be made to coding stream structures.
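
a minimal sketch of the method-inheritance part of this idea, to make it
concrete. everything here is hypothetical: the struct layouts, find_canon,
CALL_METHOD and the example classes are invented for illustration, and this
CALL_SUPER takes the defining class explicitly rather than having the
(method, (args)) shape floated above. the stretchy array of per-class data
offsets is left out entirely.

```c
#include <stddef.h>

struct coding_system;
typedef const char *(*canon_fn) (const struct coding_system *cs);

/* hypothetical "class structure" for a coding system type */
struct cs_class {
  const char *name;
  const struct cs_class *parent;        /* the extra inheritance pointer */
  canon_fn canonicalize_after_decoding; /* NULL means inherit from parent */
};

struct coding_system {
  const struct cs_class *cls;
  const char *wrapped;                  /* name of the cs being wrapped */
};

/* nearest class at or above cls that defines the method */
static canon_fn find_canon (const struct cs_class *cls)
{
  for (; cls; cls = cls->parent)
    if (cls->canonicalize_after_decoding)
      return cls->canonicalize_after_decoding;
  return NULL;
}

/* normal dispatch starts at the object's own class; CALL_SUPER starts one
   level above the class whose method is currently running */
#define CALL_METHOD(cs)      (find_canon ((cs)->cls) (cs))
#define CALL_SUPER(cls, cs)  (find_canon ((cls)->parent) (cs))

static const char *chain_canon (const struct coding_system *cs)
{
  (void) cs;
  return "chain";
}

static const char *wrapper_canon (const struct coding_system *cs)
{
  return cs->wrapped;   /* just hand back the unwrapped coding system */
}

static const struct cs_class chain_class = { "chain", NULL, chain_canon };
static const struct cs_class text_file_wrapper_class =
  { "text-file-wrapper", &chain_class, wrapper_canon };
/* a subclass that overrides nothing and so inherits chain's method */
static const struct cs_class plain_child_class =
  { "plain-child", &chain_class, NULL };
```

the lookup walks the parent pointers at call time; a real version would
more likely copy inherited methods into the class structure once, when the
class is created.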

perhaps for the immediate future we can just sort of fake things like we
currently do with undecided calling some stuff from chain.

oct 2, 2001:

need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol.
figure out how to fall back to iso-8859-1 as necessary.

leave the current bindings the way they are for the moment, but bump off
M-home and M-end (hardly used), and substitute my buffer movement stuff
there. [DONE, but test]

there's something to be said for combining block of 6 and paragraph,
esp. if we make the definition of "paragraph" be so that it skips by 6 when
within code. hmm.

eliminate advertised-undo crap, and similar hacks. [DONE]

think about obsolete stuff to be eliminated. think about eliminating or
dimming obsolete items from hyper-apropos and something similar in
completion buffers.

sep 30, 2001:

synched up the tutorials with FSF 21.0.105. was rewriting them to favor
the cursor keys over the older C-p, etc. keys.

Got thinking about key bindings again.

(1) I think that M-up/down and M-C-up/down should be reversed. I use
    scroll-up/down much more often than motion by paragraph.

(2) Should we eliminate move by block (of 6) and substitute it for
    paragraph? This would have the advantage that I could make bindings
    for buffer change (forward/back buffer, perhaps M-C-up/down). with
    shift, M-C-S-up/down only goes within the same type (C files, etc.).
    alternatively, just bump off beginning-of-defun from C-M-home, since
    it's on C-M-a already.

need someone to go over the other tutorials (five new ones, from FSF
21.0.105) and fix them up to correspond to the english one.

shouldn't shift-motion work with C-a and such as well as arrows?

sep 29, 2001:

charcount_to_bytecount can also be made to scream -- as can scan_buffer,
buffer_mule_signal_inserted_region, others? we should start profiling
though before going too far down this line.

Debug code that causes no slowdown should in general remain in the
executable even in the release version because it may be useful (e.g. for
people to see the event output). so DEBUG_XEMACS should be rethought.
things like use of msvcrtd.dll should be controlled by error_checking on.
maybe DEBUG_XEMACS controls general debug code (e.g. use of msvcrtd.dll,
asserts abort, error checking), and the actual debugging code should remain
always, or be conditionalized on something else
(e.g. DEBUGGING_FUNS_PRESENT).

doc strings in dumped files are displayed with an extra blank line between
each line. presumably this is recent? i assume either the change to
detect-coding-region or the double-wrapping mentioned below.

error with coding-system-property on iso-2022-jp-dos. problem is that that
coding system is wrapped, so its type shows up as chain, not iso-2022.
this is a general problem, and i think the way to fix it is to in essence
do late canonicalization -- similar in spirit to what was done long ago,
canonicalize_when_code, except that the new coding system (the wrapper) is
created only once, either when the original cs is created or when first
needed. this way, operations on the coding system work like expected, and
you get the same results as currently when decoding/encoding. the only
thing tricky is handling canonicalize-after-coding and the ever-tricky
double-wrapping problem mentioned below. i think the proper solution is to
move the autodetection of eol into the main autodetect type. it can be
asked to autodetect eol, coding, or both. for just coding, it does like it
currently does. for just eol, it does similar to what it currently does
but runs the detection code that convert-eol currently does, and selects
the appropriate convert-eol system. when it does both eol and coding, it
does something on the order of creating two more autodetect coding systems,
one for eol only and one for coding only, and chains them together. when
each has detected the appropriate value, the results are combined. this
automatically eliminates the double-wrapping problem, removes the need for
complicated canonicalize-after-coding stuff in chain, and fixes the problem
of autodetect not having a seekable stream because hidden inside of a
chain. (we presume that in the both-eol-and-coding case, the various
autodetect coding streams can communicate with each other appropriately.)
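
for reference, the eol-only half of such an autodetector boils down to
something like the sketch below. this is not the existing convert-eol
detection code, just a guess at its general shape; detect_eol and the
enum are made-up names, and treating mixed line endings as their own
result is an assumption.

```c
#include <stddef.h>

enum eol_type { EOL_UNKNOWN, EOL_LF, EOL_CRLF, EOL_CR, EOL_MIXED };

/* scan a block and report which line-ending convention it uses.
   hypothetical helper, not the actual convert-eol detector. */
enum eol_type detect_eol (const unsigned char *p, size_t len)
{
  enum eol_type seen = EOL_UNKNOWN;
  for (size_t i = 0; i < len; i++) {
    enum eol_type here;
    if (p[i] == '\r') {
      if (i + 1 < len && p[i + 1] == '\n') {
        here = EOL_CRLF;
        i++;                       /* skip the LF of the pair */
      } else if (i + 1 == len)
        break;                     /* lone CR at block end: can't tell yet */
      else
        here = EOL_CR;
    } else if (p[i] == '\n')
      here = EOL_LF;
    else
      continue;
    if (seen == EOL_UNKNOWN)
      seen = here;
    else if (seen != here)
      return EOL_MIXED;
  }
  return seen;
}
```

the eol-only autodetect system would then pick the matching convert-eol
system from the result, and the combined case would wait until both this
and the coding detector have an answer.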

also, we should solve the problem of internal coding systems floating
around and clogging up the list simply by having an "internal" property on
cs's and an internal param to coding-system-list (optional; if not given,
you don't get the internal ones). [DONE]

we should try to reduce the size of the from-unicode tables (the dominant
memory hog in the tables). one obvious thing is to not store a whole
emchar as the mapped-to value, but a short that encodes the octets. [DONE]

sep 28, 2001:

need to merge up to latest in trunk.

add unicode charsets for all non-translatable unicode chars; probably want
to extend the concept of charsets to allow for dimension 3 and dimension 4
charsets. for the moment we should stick with just dimension 3 charsets;
otherwise we run past the current maximum of 4 bytes per emchar. (most code
would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in
certain code that has intimate knowledge of the representation.
e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4,
and has special ways of handling each number. with 5 or 6 bytes per char,
we'd have to change that code in various ways.) 96x96x96 = 884,736,
so with two 96x96x96 charsets, we could tackle all Unicode values
representable by UTF-16 and then some -- and only these codepoints will
ever have assigned chars, as far as we know.
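
the arithmetic, spelled out: 96^3 = 884,736, two such charsets give
1,769,472 slots, and UTF-16 can represent 0x110000 = 1,114,112 code points,
so two charsets are enough. the sketch below uses a plain linear split of
the code point into a charset number and three octets in 0..95. that
mapping, the struct, and the function names are assumptions for
illustration only; real charset octets would presumably be offset into the
usual 96-char range.

```c
#define CHARS_PER_SET (96L * 96L * 96L)   /* 884,736 */

/* hypothetical position in one of the dimension-3 charsets */
struct cs3_pos {
  int charset;        /* 0 or 1 */
  int o1, o2, o3;     /* octet positions, each 0..95 */
};

/* returns 0 on success, -1 if code is outside 0..0x10FFFF */
int unicode_to_cs3 (long code, struct cs3_pos *out)
{
  if (code < 0 || code > 0x10FFFFL)
    return -1;
  out->charset = (int) (code / CHARS_PER_SET);
  long rest = code % CHARS_PER_SET;
  out->o1 = (int) (rest / (96 * 96));
  out->o2 = (int) (rest / 96 % 96);
  out->o3 = (int) (rest % 96);
  return 0;
}

/* inverse of the above */
long cs3_to_unicode (const struct cs3_pos *p)
{
  return p->charset * CHARS_PER_SET + (p->o1 * 96L + p->o2) * 96L + p->o3;
}
```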

need an easy way of showing the current language environment. some menus
need to have the current one checked or whatever. [DONE]

implement unicode surrogates.

implement buffer-file-coding-system-when-loaded -- make sure find-file,
revert-file, etc. set the coding system [DONE]

verify all the menu stuff [DONE]

implemented the entirely-ascii check in buffers. not sure how much gain
it'll get us as we already have a known range inside of which is constant
time, and with pure-ascii files the known range spans the whole buffer.
improved the comment about how bufpos-to-bytind and vice-versa work. [DONE]

fix double-wrapping of convert-eol: when undecided converts itself to
something with a non-autodetect eol, it needs to tell the adjacent
convert-eol to reduce itself to nothing.

need menu item for find file with specified encoding. [DONE]

renamed coding systems mswindows-### to windows-### to follow the standard
in rfc1345. [DONE]

implemented coding-system-subsidiary-parent [DONE]

HAVE_MULE -> MULE in files in nt/ so that depend checking works [DONE]

need to take the smarter search-all-files-in-dir stuff from my sample init
file and put it on the grep menu [DONE]

added item for revert w/specified encoding; mostly works, but needs fixes.
in particular, you get the correct results, but buffer-file-coding-system
does not reflect things right. also, there are too many entries. need to
split into submenus. there is already split code out there; see if it's
generalized and if not make it so. it should only split when there's more
than a specified number, and when splitting, split into groups of a
specified size, not into a specified number of groups. [DONE]
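
the splitting rule from the last paragraph, as a small sketch. the
function names and the half-open range convention are made up; the actual
menu code is in Lisp and is not shown here.

```c
/* number of submenus needed for n items; 0 means "don't split".
   split only when n exceeds threshold, and then into groups of
   group_size items, the last group taking whatever is left over. */
int submenu_count (int n, int threshold, int group_size)
{
  if (group_size <= 0 || n <= threshold)
    return 0;
  return (n + group_size - 1) / group_size;
}

/* half-open item range [*start, *end) covered by submenu number g */
void submenu_range (int n, int group_size, int g, int *start, int *end)
{
  *start = g * group_size;
  *end = *start + group_size < n ? *start + group_size : n;
}
```

e.g. 45 entries with a threshold of 20 and a group size of 20 give three
submenus holding 20, 20 and 5 entries.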

too many entries in the langenv menus; need to split. [DONE]

sep 27, 2001:

NOTE: M-x grep for make-string causes crash now. something definitely to
do with string changes. check very carefully the diffs and put in those
sledgehammer checks. [DONE]

fix font-lock bug i introduced. [DONE]

added optimization to strings (keeps track of # of bytes of ascii at the
beginning of a string). perhaps should also keep an all-ascii flag to deal
with really large (> 2 MB) strings. rewrite code to count ascii-begin to
use the 4-or-8-at-a-time stuff in bytecount_to_charcount.
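
a sketch of counting ascii-begin a word at a time. this is not the code in
bytecount_to_charcount, just the same general trick as i understand it:
test the high bit of 8 bytes at once, then finish byte by byte.
count_ascii_begin is a made-up name, and a fixed 8-byte step is assumed
rather than the 4-or-8 selection mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* number of leading bytes below 0x80. checks 8 bytes per step while it
   can, then finishes one byte at a time. hypothetical helper. */
size_t count_ascii_begin (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (len - i >= 8) {
    uint64_t w;
    memcpy (&w, s + i, 8);          /* sidesteps unaligned-access trouble */
    if (w & UINT64_C (0x8080808080808080))
      break;                        /* some byte in this word is non-ascii */
    i += 8;
  }
  while (i < len && s[i] < 0x80)
    i++;
  return i;
}
```

the mask test does not depend on byte order, since it only asks whether
any byte in the word has its high bit set; the byte loop then finds the
exact position.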

Error: M-q is causing Invalid Regexp error on the above paragraph. It's
not working. I assume it's a side effect of the string stuff. VERIFY!
Write sledgehammer checks for strings. [DONE]

revamped the locale/init stuff so that it tries much harder to get things
right. should test a bit more. in particular, test out Describe Language
on the various created environments and make sure everything looks right.

should change the menus: move the submenus on Edit->Mule directly under
Edit. add a menu entry on File to say "Reload with specified encoding ->".
[DONE]

Also "Find File with specified encoding ->". Also an entry to change the
EOL settings for Unix, and implement it.

decode-coding-region isn't working because it needs to insert a binary
(char->byte) converter. [DONE]

chain should be rearranged to be in decoding order; similar for
source/sink-type, other things?

the detector should check for a magic cookie even without a seekable input.
(currently its input is not seekable, because it's hidden within a chain.
#### See what we can do about this.)

provide a way to display various settings, e.g. the current category
mappings and priority (see mule-diag; get this working so it's in the
path); also a way to print out the likeliness results from a detection,
perhaps a debug flag.

problem with `env', which causes path issues due to `env' in packages.
move env code to process, sync with fsf 21.0.105, check that the autoloads
in `env' don't cause problems. [DONE]

8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so
detected.

sep 25, 2001:

something else to do is review the font selection and fix it so that (e.g.)
JISX-0212 can be displayed.

also, text in widgets needs to be drawn by us so that the correct fonts
will be displayed even in multi-lingual text.

sep 24, 2001:

the detection system is now properly abstracted. the detectors have been
rewritten to include multiple levels of abstraction. now we just need
detectors for ascii, binary, and latin-x, as well as more sophisticated
detectors in general and further review of the general algorithm for doing
detection. (#### Is this written up anywhere?) after that, consider adding
error-checking to decoding (VERY IMPORTANT) and verifying the binary
correctness of things under unix no-mule.

sep 23, 2001:

began to fix the detection system -- adding multiple levels of likelihood
and properly abstracting the detectors. the system is in place except for
the abstraction of the detector-specific data out of the struct
detection_state. we should get things working first before tackling that
(which should not be too hard). i'm rewriting algorithms here rather than
just converting code, so it's harder. mostly done with everything, but i
need to review all detectors except iso2022 and make them properly follow
the new way. also write a no-conversion detector. also need to look into
the `recode' package and see how (if?) they handle detection, and maybe
copy some of the algorithms. also look at recent FSF 21.0 and see if their
algorithms have improved.

sep 22, 2001:

fixed gc bugs from yesterday.
fixed truename bug.
close/finalize stuff works.
eliminated notyet stuff in syswindows.h.
eliminated special code in tstr_to_c_string.
fixed pdump problems. (many of them, mostly latent bugs, ugh)
fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a
sscanf bug, but subtly different behavior w.r.t. whitespace in the format
string, combined with a debugger that sucks ROCKS!! and consistently
outputs garbage for variable values.)
main stuff to test is the handling of EOF recognition vs. binary
(i.e. check what the default settings are under Unix). then we may have
something that WORKS on all platforms!!! (Also need to test Windows
non-Mule)

sep 21, 2001:

finished redoing the close/finalize stuff in the lstream code. but i
encountered again the nasty bug mentioned on sep 15 that disappeared on its
own then. the problem seems to be that the finalize method of some of the
lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(),
which is a no-no when we're inside of garbage-collection and the object
passed to free_managed_lcrecord() is unmarked, and about to be released by
the gc mechanism -- the free lists will end up with xfree()d objects on
them, which is very bad. we need to modify free_managed_lcrecord() to
check if we're in gc and the object is unmarked, and ignore it rather than
move it to the free list. [DONE]

(#### What we really need to do is do what Java and C# do w.r.t. their
finalize methods: For objects with finalizers, when they're about to be
freed, leave them marked, run the finalizer, and set another bit on them
indicating that the finalizer has run. Next GC cycle, the objects will
again come up for freeing, and this time the sweeper notices that the
finalize method has already been called, and frees them for good (provided
that a finalize method didn't do something to make the object alive
again).)
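
a toy version of the two-cycle finalization scheme in the parenthetical
above, to pin down the bookkeeping. the object header, the sweep function
and the no-op finalizer are all invented for this sketch and have nothing
to do with the real lcrecord layout; "freed" is a flag standing in for
actually releasing memory.

```c
#include <stddef.h>

/* hypothetical object header; not the real lcrecord layout */
struct obj {
  int marked;                     /* set by the mark phase if reachable */
  int finalized;                  /* the finalizer has already run once */
  int freed;                      /* stands in for releasing the memory */
  void (*finalize) (struct obj *);
};

/* one sweep over n objects. returns how many were freed this cycle. */
int sweep (struct obj *objs, size_t n)
{
  int nfreed = 0;
  for (size_t i = 0; i < n; i++) {
    struct obj *o = &objs[i];
    if (o->freed)
      continue;
    if (!o->marked) {
      if (o->finalize && !o->finalized) {
        /* first time up for freeing: run the finalizer and let the
           object survive this cycle (it may get resurrected) */
        o->finalized = 1;
        o->finalize (o);
      } else {
        o->freed = 1;
        nfreed++;
      }
    }
    o->marked = 0;                /* reset for the next mark phase */
  }
  return nfreed;
}

/* placeholder finalizer so the sketch can be exercised */
static void noop_finalizer (struct obj *o)
{
  (void) o;
}
```

an unreachable object with a finalizer survives the first sweep with its
finalized bit set, and is only freed by a later sweep if nothing has made
it reachable again in between.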

sep 20, 2001:

redid the lstream code so there is only one coding stream. combined the
various doubled coding stream methods into one; i'm a little bit unsure of
this last part, though, as the results of combining the two together seem
unclean. got it to compile, but it crashes in loadup. need to go through
and rehash the close vs. finalize stuff, as the problem was stuff getting
freed too quickly, before the canonicalize-after-decoding was run. should
eliminate entirely CODING_STATE_END and use a different method (close
coding stream). rewrite to use these two. make sure they're called in the
right places. Lstream_close on a stream should *NOT* do finalizing.
finalize only on delete. [DONE]

in general i'd like to see the flags eliminated and converted to
bit-fields. also, rewriting the methods to take advantage of rejecting
should make it possible to eliminate much of the state in the various
methods, esp. including the flags. need to test this is working, though --
reduce the buffer size down very low and try files with only CRLF's in
them, with one offset by a byte from the other, and see if we correctly
handle rejection.
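
to illustrate what "rejecting" buys us, here is a stateless CRLF -> LF
converter that simply refuses to consume a CR sitting at the very end of
its input, so the caller hands it back with the next block. this is a
sketch under my own assumptions: crlf_to_lf, its signature, and the at_eof
parameter are made up and are not the actual coding stream method
interface.

```c
#include <stddef.h>

/* convert CRLF -> LF from src into dst (dst must hold at least srclen
   bytes). returns bytes written; *consumed says how much of src was used.
   a CR that is the last byte of src is rejected (left unconsumed) unless
   at_eof is set, so no "saw a CR" flag has to be kept between calls. */
size_t crlf_to_lf (const unsigned char *src, size_t srclen, int at_eof,
                   unsigned char *dst, size_t *consumed)
{
  size_t i = 0, o = 0;
  while (i < srclen) {
    if (src[i] == '\r') {
      if (i + 1 == srclen && !at_eof)
        break;                    /* can't decide yet: reject the CR */
      if (i + 1 < srclen && src[i + 1] == '\n') {
        dst[o++] = '\n';
        i += 2;
        continue;
      }
    }
    dst[o++] = src[i++];          /* ordinary byte, or a lone CR */
  }
  *consumed = i;
  return o;
}
```

feeding it "ab\r" and then "\r\ncd" (the rejected CR re-presented at the
front of the second block) yields "ab" followed by "\ncd", which is the
split-CRLF case the test above is meant to exercise.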

still have the problem with incorrectly truenaming files.

sep 19, 2001:

bug reported: crash while closing lstreams.

the lstream/coding system close code needs revamping. we need to document
that order of closing lstreams is very important, and make sure we're
consistent. furthermore, chain and undecided lstreams need to close their
underneath lstreams when they receive the EOF signal (there may be data in
the underneath streams waiting to come out), not when they themselves are
closed. [DONE]

(if only we had proper inheritance. i think in any case we should
simulate it for the chain coding stream -- write things in such a way that
undecided can use the chain coding stream and not have to duplicate
anything itself.)

in general we need to carefully think through the closing process to make
sure everything always works correctly and in the right order. also check
very carefully to make sure there are no dangling pointers to deleted
objects floating around.

move the docs for the lstream functions to the functions themselves, not
the header files. document more carefully what exactly Lstream_delete()
means and how it's used, what the connections are between Lstream_close(),
Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE]

additional error-checking: consider deadbeefing the memory in objects
stored in lcrecord free lists; furthermore, consider whether lifo or fifo
is correct; under error-checking, we should perhaps be doing fifo, and
setting a minimum number of objects on the lists that's quite large so that
it's highly likely that any erroneous accesses to freed objects will go
into such deadbeefed memory and cause crashes. also, at the earliest
available opportunity, go through all freed memory and check for any
consistency failures (overwrites of the deadbeef), crashing if so. perhaps
we could have some sort of id for each block, to easier trace where the
offending block came from. (all of these ideas are present in the debug
system malloc from VC++, plus more stuff.) there's similar code i wrote
sitting somewhere (in free-hook.c? doesn't appear so. we need to delete the
blocking stuff out of there!). also look into using the debug system
malloc from VC++, which has lots of cool stuff in it. we even have the
sources. that means compiling under pdump, which would be a good idea
anyway. set it as the default. (but then, we need to remove the
requirement that Xpm be a DLL, which is extremely annoying. look into
this.)
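
the fill-and-verify part of the deadbeef idea is small enough to sketch
here. the function names are made up, this is not the code that may or may
not be in free-hook.c, and the free-list management (fifo order, minimum
list length, block ids) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEADBEEF UINT32_C (0xDEADBEEF)

/* fill a freed block with the pattern. whole 32-bit words only; any
   trailing 1-3 bytes are left alone in this sketch. */
void deadbeef_fill (void *block, size_t size)
{
  unsigned char *p = block;
  uint32_t pat = DEADBEEF;
  for (size_t i = 0; i + 4 <= size; i += 4)
    memcpy (p + i, &pat, 4);
}

/* returns the byte offset of the first overwritten word, or -1 if the
   block is still intact. */
long deadbeef_check (const void *block, size_t size)
{
  const unsigned char *p = block;
  for (size_t i = 0; i + 4 <= size; i += 4) {
    uint32_t w;
    memcpy (&w, p + i, 4);
    if (w != DEADBEEF)
      return (long) i;
  }
  return -1;
}
```

under error-checking, the check would be run over every block on the free
lists at some convenient point (e.g. at gc), aborting on the first
non-negative result.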

test the windows code page coding systems recently created.

problems reading my mail files -- 1personal appears to hang, others come up
with lots of ^M's. investigate.

test the enum functions i just wrote, and finish them.

still pdump problems.

sep 18, 2001:

critical-quit broken sometime after aug 25.

-- fixed critical quit.
-- fixed process problems.
-- print routines work. (no routine for ccl, though)
-- can read and write unicode files, and they can still be read by some
   other program
-- defaults should come up correctly -- mswindows-multibyte is general.

still need to test matej's stuff.
seems ok with multibyte stuff but needs more testing.

sep 17, 2001:

!!!!! something broken with processes !!!!! cannot send mail anymore. must
investigate.

sep 17, 2001:

on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting
woozy and can't concentrate.

just finished getting assorted fixups to the main branch committed, so it
will compile under C++ (Andy committed some code that broke C++ builds).
cup'd the code into the fixtypes workspace, updated the tags appropriately.
i've created the appropriate log message, sitting in fixtypes.txt in
/src/xemacs; perhaps it should go into a README. now i just have to build
on everything (it's currently building), verify it's ok, run patcher-mail,
commit, send.

my mule ws is also very close. need to:

-- test the new print routines.
-- test it can read and write unicode files, and they can still be read by
   some other program.
-- try to see if unicode can be auto-detected properly.
-- test it can read and write multibyte files in a few different formats.
   currently can't recognize them, but if you set the cs right, it should
   work.
-- examine the test files sent by matej and see if we can handle them.
589
|
|
590 sep 15, 2001:
|
|
591
|
|
592 more eol fixing. this stuff is utter crap.
|
|
593
|
|
594 currently we wrap coding systems with convert-eol-autodetect when we create
|
|
595 them in make_coding_system_1. i had a feeling that this would be a
|
|
596 problem, and indeed it is -- when autodetecting with `undecided', for
|
|
597 example, we end up with multiple layers of eol conversion. to avoid this,
|
|
598 we need to do the eol wrapping *ONLY* when we actually retrieve a coding
|
|
599 system in places such as insert-file-contents. these places are
|
|
600 insert-file-contents, load, process input, call-process-internal,
|
|
601 encode/decode/detect-coding-region, database input, ...
|
|
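a rough sketch of "wrap only at retrieval time", to make the idea above concrete. everything here is a stand-in: `CodingSystem`, `REGISTRY` and `get_for_text_io` are hypothetical names, not the real C structures, and putting convert-eol first in the chain just follows the wording of these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingSystem:
    name: str
    chain: tuple = ()          # sub-systems of a chain; empty for a plain system


REGISTRY = {}                  # hypothetical stand-in for the coding-system table


def make_coding_system(name):
    # creation stores the bare system: no eol wrapping happens here
    cs = CodingSystem(name)
    REGISTRY[name] = cs
    return cs


def get_for_text_io(name):
    # retrieval point (insert-file-contents, load, process input, ...):
    # this is the only place where the eol layer gets added
    cs = REGISTRY[name]
    if cs.chain and cs.chain[0].name == "convert-eol-autodetect":
        return cs              # already wrapped, don't stack a second layer
    eol = CodingSystem("convert-eol-autodetect")
    return CodingSystem(name + "+eol", (eol, cs))
```

since the registry only ever holds unwrapped systems, something like `undecided' can hand back any registered system without picking up an extra eol layer along the way.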
602
|
|
603 (later) it's fixed, and things basically work. NOTE: for some reason,
|
|
604 adding code to wrap coding systems with convert-eol-lf when eol-type == lf
|
|
605 results in crashing during garbage collection in some pretty obscure place
|
|
606 -- an lstream is free when it shouldn't be. this is a bad sign. i guess
|
|
607 something might be getting initialized too early?
|
|
608
|
|
609 we still need to fix the canonicalization-after-decoding code to avoid
|
|
610 problems with coding systems like `internal-7' showing up. basically, when
|
|
611 eol==lf is detected, nil should be returned, and the callers should handle
|
|
612 it appropriately, eliding when necessary. chain needs to recognize when
|
|
613 it's got only one (or even 0) items in the chain, and elide out the chain.
|
|
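the eliding rule above is small enough to sketch. this assumes (my assumption, not the real code) that a step which turned out to be a no-op, such as eol detected as lf, is represented by `None`, and that a chain is just a tagged tuple.

```python
def canonicalize_chain(items):
    """Collapse a chain after decoding.

    `items` holds the canonicalized steps; None marks a step that
    turned out to be a no-op (for example eol detected as lf).
    """
    kept = [item for item in items if item is not None]
    if not kept:
        return None                    # nothing left: elide the whole chain
    if len(kept) == 1:
        return kept[0]                 # one item: return it directly, no chain
    return ("chain", tuple(kept))      # hypothetical representation of a chain
```

with this, a chain of (eol==lf, internal-7-ish thing) comes back as the single remaining system instead of a one-element chain.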
614
|
|
615 sep 11, 2001: the day that will live in infamy.
|
|
616
|
|
617 rewrite of sep 9 entry about formats:
|
|
618
|
|
619 when calling make-coding-system, the name can be a cons of (format1 .
|
|
620 format2), specifying that it decodes format1->format2 and encodes the other
|
|
621 way. if only one name is given, that is assumed to be format1, and the
|
|
622 other is either `external' or `internal' depending on the end type.
|
|
623 normally the user when decoding gives the decoding order in formats, but
|
|
624 can leave off the last one, `internal', which is assumed. a multichain
|
|
625 might look like gzip|multibyte|unicode, using the coding systems named
|
|
626 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
|
|
627 is by searching for gzip->multibyte; if not found, look for gzip->external
|
|
628 or gzip->internal. (In general we automatically do conversion between
|
|
629 internal and external as necessary: thus gzip|crlf does the expected, and
|
|
630 maps to gzip->external, external->internal, crlf->internal, which when
|
|
631 fully specified would be gzip|external:external|internal:crlf|internal --
|
|
632 see below.) To forcibly fit together two converters that have explicitly
|
|
633 specified and incompatible names (say you have unicode->multibyte and
|
|
634 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
|
|
635 case are compatible), you can force-cast using :, like this:
|
|
636 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
|
|
637 internal and external formats, the conversion happens automatically.)
|
|
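a minimal sketch of how the `|' and `:' notation above might be tokenized. it only handles fully specified strings; the implicit trailing `internal', the single-name shorthand and the lookup fallbacks (gzip->multibyte, then gzip->external or gzip->internal) are deliberately left out. the function names are made up for illustration.

```python
def parse_format_spec(spec):
    """Split 'a|b:c|d' into cast groups of format names: [['a', 'b'], ['c', 'd']]."""
    groups = []
    for part in spec.split(":"):           # ':' is the force-cast separator
        names = [name.strip() for name in part.split("|")]
        if any(not name for name in names):
            raise ValueError(f"empty format name in {spec!r}")
        groups.append(names)
    return groups


def conversion_steps(spec):
    """List the (from, to) conversions inside each cast group, in order."""
    steps = []
    for names in parse_format_spec(spec):
        steps.extend(zip(names, names[1:]))
    return steps
```

for the fully specified example in the entry, `conversion_steps("gzip|external:external|internal:crlf|internal")` gives the three conversions gzip->external, external->internal, crlf->internal, with the casts sitting between the groups.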
638
|
|
639
|
|
640 sep 10, 2001:
|
|
641
|
|
642 moved the autodetection stuff (both codesys and eol) into particular coding
|
|
643 systems -- `undecided' and `convert-eol' (type == `autodetect'). needs
|
|
644 lots of work. still need to search through the rest of the code and find
|
|
645 any remaining auto-detect code and move it into the undecided coding
|
|
646 system. need to modify make-coding-system so that it spits out
|
|
647 auto-detecting versions of all text-file coding systems unless we say not
|
|
648 to. need to eliminate entirely the EOF flag from both the stream info and the
|
|
649 coding system; have only the original-eof flag. in
|
|
650 coding_system_from_mask, need to check that the returned value is not of
|
|
651 type `undecided', falling back to no-conversion if so. also need to make
|
|
652 sure we wrap everything appropriate for text-files -- i removed the
|
|
653 wrapping on set-coding-category-list or whatever (need to check all those
|
|
654 files to make sure all wrapping is removed). need to review carefully the
|
|
655 new code in `undecided' to make sure it works and preserves the same logic
|
|
656 as previously. need to review the closing and rewinding behavior of chain
|
|
657 and undecided (same -- should really consolidate into helper routines, so
|
|
658 that any coding system can embed a chain in it) -- make sure the dynarr's
|
|
659 are getting their data flushed out as necessary, rewound/closed in the
|
|
660 right order, no missing steps, etc.
|
|
661
|
|
662 also split out mule stuff into mule-coding.c. work done on
|
|
663 configure/xemacs.mak/Makefiles not done yet. work on emacs.c/symsinit.h to
|
|
664 interface with the new init functions not done yet.
|
|
665
|
|
666 also put in a few declarations of the way i think the abstracted detection
|
|
667 stuff ought to go. DON'T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH
|
|
668 AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED.
|
|
669
|
|
670 really need a version of cvs-mods that reports only the current directory.
|
|
671 WRITE THIS! use it to implement a better cvs-checkin.
|
|
672
|
|
673 sep 9, 2001:
|
|
674
|
|
675 implemented a gzip coding system. unfortunately, doesn't quite work right
|
|
676 because it doesn't handle the gzip headers -- it just reads and writes raw
|
|
677 zlib data. there's no function in the library to skip past the header, but
|
|
678 we do have some code out of the library that we can snarf that implements
|
|
679 header parsing. we need to snarf that, store it, and output it again at
|
|
680 the beginning when encoding. in the process, we should create a "get next
|
|
681 byte" macro that bails out when there are no more. using this, we set up a
|
|
682 nice way of doing most stuff statelessly -- if we have to bail, we reject
|
|
683 everything back to the sync point. also need to fix up the autodetection
|
|
684 of zlib in configure.in.
|
|
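here is a sketch of the "get next byte, bail out when there are no more" idea applied to the gzip header. the header layout is the standard one from RFC 1952; the names `NeedMoreData`, `next_byte` and `parse_gzip_header` are mine, and the real thing would be a C macro rather than a closure. on a bail nothing is consumed, i.e. everything is rejected back to the sync point (the start of the header).

```python
class NeedMoreData(Exception):
    """Raised by the byte reader when the buffer runs out."""


def parse_gzip_header(buf):
    """Return the header length in bytes, or None if buf is too short."""
    pos = 0

    def next_byte():
        nonlocal pos
        if pos >= len(buf):
            raise NeedMoreData        # bail: caller retries with more data
        byte = buf[pos]
        pos += 1
        return byte

    try:
        if next_byte() != 0x1F or next_byte() != 0x8B:
            raise ValueError("not a gzip stream")
        if next_byte() != 8:
            raise ValueError("unknown compression method")
        flags = next_byte()
        for _ in range(6):            # MTIME (4 bytes), XFL, OS
            next_byte()
        if flags & 0x04:              # FEXTRA: 2-byte little-endian length + data
            xlen = next_byte() | (next_byte() << 8)
            for _ in range(xlen):
                next_byte()
        if flags & 0x08:              # FNAME: zero-terminated
            while next_byte() != 0:
                pass
        if flags & 0x10:              # FCOMMENT: zero-terminated
            while next_byte() != 0:
                pass
        if flags & 0x02:              # FHCRC: 2 bytes
            next_byte()
            next_byte()
    except NeedMoreData:
        return None                   # reject everything back to the sync point
    return pos
```

the parse keeps no state between calls, which is the point: if it returns None, the caller just calls it again from the start once more bytes have arrived. to write the header back out when encoding, the caller would keep `buf[:pos]`.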
685
|
|
686 BIG problems with eol. finished up everything i thought i would need to
|
|
687 get eol stuff working, but no -- when you have mswindows-unicode, with its
|
|
688 eol set to autodetect, the detection routines themselves do the autodetect
|
|
689 (first), and fail (they report CR on CRLF because of the NULL byte between
|
|
690 the CR and the LF) since they're not looking at ascii data. with a chain
|
|
691 it's similarly bad. for mswindows-multibyte, for example, which is a chain
|
|
692 unicode->unicode-to-multibyte, autodetection happens inside of the chain,
|
|
693 both when unicode and unicode-to-multibyte are active. we could twiddle
|
|
694 around with the eol flags to try to deal with this, but it's gonna be a big
|
|
695 mess, which is exactly what we're trying to avoid. what we basically want
|
|
696 is to entirely rip out all EOL settings from either the coding system or
|
|
697 the stream (yes, there are two! one might say autodetect, and then the
|
|
698 stream contains the actual detected value). instead, we simply create an
|
|
699 eol-autodetect coding system -- or rather, it's part of the convert-eol
|
|
700 coding system. convert-eol, type = autodetect, does autodetection the
|
|
701 first time it gets data sent to it to decode, and thereafter sets a stream
|
|
702 parameter indicating the actual eol type for this stream. this means that
|
|
703 all autodetect coding systems, as created by `make-coding-system', really
|
|
704 are chains with a convert-eol at the beginning. only subsidiary xxx-unix
|
|
705 has no wrapping at all. this should allow eol detection of gzip, unicode,
|
|
706 etc. for that matter, general autodetection should be entirely
|
|
707 encapsulated inside of the `autodetect' coding system, with no
|
|
708 eol-autodetection -- the chain becomes convert-eol (autodetect) ->
|
|
709 autodetect or perhaps backwards. the generic autodetect similarly has a
|
|
710 coding-system in its stream methods, and needs somehow or other to insert
|
|
711 the detected coding-system into the chain. either it contains a chain
|
|
712 inside of it (perhaps it *IS* a chain), or there's some magic involving
|
|
713 canonicalization-type switcherooing in the middle of a decode. either way,
|
|
714 once everything is good and done and we want to save the coding system so
|
|
715 it can be used later, we need to do another sort of canonicalization --
|
|
716 converting auto-detect-type coding systems into the detected systems.
|
|
717 again, a coding-system method, with some magic currently so that
|
|
718 subsidiaries get properly used rather than something that's new but
|
|
719 equivalent to subsidiaries. (#### perhaps we could use a hash table to
|
|
720 avoid recreating coding systems when not necessary. but that would require
|
|
721 that coding systems be immutable from external, and i'm not sure that's the
|
|
722 case.)
|
|
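a small sketch of both halves of the problem above: why byte-level eol detection misfires on two-byte unicode data, and what a convert-eol of type autodetect could look like when it runs on already-decoded text. `detect_eol` and `ConvertEolAutodetect` are hypothetical names; holding back a trailing CR across chunk boundaries is my own addition, not something these notes specify.

```python
def detect_eol(text):
    """Return 'lf', 'crlf' or 'cr' from the first line ending, or None if undecided."""
    for i, ch in enumerate(text):
        if ch == "\n":
            return "lf"
        if ch == "\r":
            if i + 1 == len(text):
                return None           # can't tell CR from CRLF yet
            return "crlf" if text[i + 1] == "\n" else "cr"
    return None


class ConvertEolAutodetect:
    """Detects the eol type on the first data it decodes, then sticks with it."""

    def __init__(self):
        self.eol = None               # the per-stream "actual eol type"
        self._pending = ""

    def decode(self, text):
        text = self._pending + text
        self._pending = ""
        if text.endswith("\r"):       # might be the first half of a CRLF
            text, self._pending = text[:-1], "\r"
        if self.eol is None:
            self.eol = detect_eol(text)
        if self.eol == "crlf":
            return text.replace("\r\n", "\n")
        if self.eol == "cr":
            return text.replace("\r", "\n")
        return text

    def close(self):
        tail, self._pending = self._pending, ""
        return "\n" if tail and self.eol in ("cr", None) else tail
```

feeding the raw bytes of little-endian two-byte text to `detect_eol` (viewed one byte per character) reports `cr` for a CRLF file, because a zero byte sits between the CR and the LF; running the same detection after the unicode decoding step sees CR immediately followed by LF.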
723
|
|
724 i really think, after all, that i should reverse the naming of everything
|
|
725 in chain and source-sink-type -- they should be decoding-centric. later
|
|
726 on, if/when we come up with the proper way to make it totally symmetrical,
|
|
727 we'll be fine whether before then we were encoding or decoding centric.
|
|
728
|
|
729
|
|
730 sep 9, 2001:
|
|
731
|
|
732 investigated eol parameter.
|
|
733 implemented handling in make-coding-system of eol-cr and eol-crlf.
|
|
734 fixed calls everywhere to Fget_coding_system / Ffind_coding_system to
|
|
735 reject non-char->byte coding systems.
|
|
736
|
|
737 still need to handle "query eol type using coding-system-property" so it
|
|
738 magically returns the right type by parsing the chain.
|
|
739
|
|
740 no work done on formats, as mentioned below. we should consider using :
|
|
741 instead of || to indicate casting.
|
|
742
|
|
743 early sep 9, 2001:
|
|
744
|
|
745 renamed some codesys properties: `list' in chain -> chain; `subtype' in
|
|
746 unicode -> type. everything compiles again and sort of works; some CRLF
|
|
747 problems that may resolve themselves when i finish the convert-eol stuff.
|
|
748 the stuff to create subsidiaries has been rewritten to use chains; but i
|
|
749 still need to investigate how the EOL type parameter is used. also, still
|
|
750 need to implement this: when a coding system is created, and its eol type
|
|
751 is not autodetect or lf, a chain needs to be created and returned. i think
|
|
752 that what needs to happen is that the eol type can only be set to
|
|
753 autodetect or lf; later on this should be changed to simply be either
|
|
754 autodetect or not (but that would require ripping out the eol converting
|
|
755 stuff in the various coding systems), and eventually we will do the work on
|
|
756 the detection mechanism so it can do chain detection; then we won't need an
|
|
757 eol autodetect setting at all. i think there's a way to query the eol type
|
|
758 of a coding system; this should check to see if the coding system is a
|
|
759 chain and there's a convert-eol at the front; if so, the eol type comes
|
|
760 from the type of the convert-eol.
|
|
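the eol-type query described above, sketched with toy types. `ConvertEol`, `Chain` and `eol_type` are illustrative names only; returning "lf" for anything without a convert-eol at the front is an assumption on my part, based on unwrapped systems doing no eol conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvertEol:
    type: str                  # "lf", "crlf", "cr" or "autodetect"


@dataclass(frozen=True)
class Chain:
    items: tuple               # the coding systems making up the chain


def eol_type(coding_system):
    """Report the eol type by looking for a convert-eol at the front of a chain."""
    if isinstance(coding_system, Chain) and coding_system.items:
        first = coding_system.items[0]
        if isinstance(first, ConvertEol):
            return first.type
    return "lf"                # assumption: no convert-eol means no conversion
```

this keeps the eol type out of the coding system proper: the property is computed from the chain rather than stored.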
761
|
|
762 also check out everywhere that Fget_coding_system or Ffind_coding_system is
|
|
763 called, and see whether anything but a char->byte system can be tolerated.
|
|
764 create a new function for all the places that only want char->byte,
|
|
765 something like get_coding_system_char_to_byte_only.
|
|
766
|
|
767 think about specifying formats in make-coding-system. perhaps the name can
|
|
768 be a cons of (format1, format2), specifying that it encodes
|
|
769 format1->format2 and decodes the other way. if only one name is given,
|
|
770 that is assumed to be format2, and the other is either `byte' or `char'
|
|
771 depending on the end type. normally the user when decoding gives the
|
|
772 decoding order in formats, but can leave off the last one, `char', which is
|
|
773 assumed. perhaps we should say `internal' instead of `char' and `external'
|
|
774 instead of byte. a multichain might look like gzip|multibyte|unicode,
|
|
775 using the coding systems named `gzip', `(unicode . multibyte)' and
|
|
776 `unicode'. we would have to allow something where one format is given only
|
|
777 as generic byte/char or internal/external to fit with any of the same
|
|
778 byte/char type. when forcibly fitting together two converters that have
|
|
779 explicitly specified and incompatible names (say you have
|
|
780 unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte
|
|
781 and iso8859-1 in this case are compatible), you can force-cast using ||,
|
|
782 like this: ebcdic|iso8859-1||multibyte|unicode. this will also force
|
|
783 external->internal translation as necessary:
|
|
784 unicode|multibyte||crlf|internal does unicode->multibyte,
|
|
785 external->internal, crlf->internal. perhaps you'd need to put in the
|
|
786 internal translation, like this: unicode|multibyte|internal||crlf|internal,
|
|
787 which means unicode->multibyte, external->internal (multibyte is compatible
|
|
788 with external); force-cast to crlf format and convert crlf->internal.
|
|
789
|
|
790 even later: Sep 8, 2001:
|
|
791
|
|
792 chain doesn't need to set character mode, that happens automatically when
|
|
793 the coding systems are created. fixed chain to return correct source/sink
|
|
794 type for itself and to check the compatibility of source/sink types in its
|
|
795 chain. fixed decode/encode-coding-region to check the source and sink
|
|
796 types of the coding system performing the conversion and insert appropriate
|
|
797 byte->char/char->byte converters (aka "binary" coding system). fixed
|
|
798 set-coding-category-system to only accept the traditional
|
|
799 encode-char-to-byte types of coding systems.
|
|
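a sketch of the source/sink checking described above, written from the decoding side to keep it readable (the real enum was encoding-centric at this point). `Codec`, `chain_end_types` and `wrap_for_region` are hypothetical names, and "binary" / "binary-reversed" stand in for the byte<->char converters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    decode_from: str           # "byte" or "char": what decoding consumes
    decode_to: str             # "byte" or "char": what decoding produces


def chain_end_types(chain):
    """Check that neighbouring ends match; return the chain's own end types."""
    if not chain:
        raise ValueError("empty chain")
    for left, right in zip(chain, chain[1:]):
        if left.decode_to != right.decode_from:
            raise ValueError(
                f"{left.name} produces {left.decode_to} "
                f"but {right.name} expects {right.decode_from}"
            )
    return chain[0].decode_from, chain[-1].decode_to


BYTES_TO_CHARS = Codec("binary", "byte", "char")
CHARS_TO_BYTES = Codec("binary-reversed", "char", "byte")


def wrap_for_region(chain):
    """Pad a chain so both ends are char, as a buffer region needs."""
    source, sink = chain_end_types(chain)
    padded = list(chain)
    if source == "byte":
        padded.insert(0, CHARS_TO_BYTES)
    if sink == "byte":
        padded.append(BYTES_TO_CHARS)
    return padded
```

a byte->byte system like gzip gets a converter on both ends, while a traditional byte->char system only needs one on the input side.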
800
|
|
801 still need to extend chain to specify the parameters mentioned below,
|
|
802 esp. "reverse". also need to extend the print mechanism for chain so it
|
|
803 prints out the chain. probably this should be general: have a new method
|
|
804 to return all properties, and output those properties. you could also
|
|
805 implement a read syntax for coding systems this way.
|
|
806
|
|
807 still need to implement convert-eol and finish up the rest of the eol stuff
|
|
808 mentioned below.
|
|
809
|
|
810 later September 7, 2001: (more like Sep 8)
|
|
811
|
|
812 moved many Lisp_Coding_System * params to Lisp_Object. In general this is
|
|
813 the way to go, and if we ever implement a copying GC, we will never want to
|
|
814 be passing direct pointers around. With no error-checking, we lose no
|
|
815 cycles using Lisp_Objects in place of pointers -- the Lisp_Object itself is
|
|
816 nothing but a pointer, and so all the casts and "dereferences" boil down to
|
|
817 nothing.
|
|
818
|
|
819 Clarified and cleaned up the "character mode" on streams, and documented
|
|
820 who (caller or object itself) has the right to be setting character mode on
|
|
821 a stream, depending on whether it's a read or write stream. changed
|
|
822 conversion_end_type method and enum source_sink_type to return
|
|
823 encoding-centric values, rather than decoding-centric. for the moment,
|
|
824 we're going to be entirely encoding-centric in everything; we can rethink
|
|
825 later. fixed coding systems so that the decode and encode methods are
|
|
826 guaranteed to receive only full characters, if that's the source type of
|
|
827 the data, as per conversion_end_type.
|
|
828
|
|
829 still need to fix the chain method so that it correctly sets the character
|
|
830 mode on all the lstreams in it and checks the source/sink types to be
|
|
831 compatible. also fix decode-coding-string and friends to put the
|
|
832 appropriate byte->character (i.e. no-conversion) coding systems on the ends
|
|
833 as necessary so that the final ends are both character. also add to chain
|
|
834 a parameter giving the ability to switch the direction of conversion of any
|
|
835 particular item in the chain (i.e. swap encoding and decoding). i think
|
|
836 what we really want to do is allow for arbitrary parameters to be put onto
|
|
837 a particular coding system in the chain, of which the only one so far is
|
|
838 swap-encode-decode. don't need too much codage here for that, but make the
|
|
839 design extendable.
|
|
840
|
|
841
|
|
842
|
|
843 September 7, 2001:
|
|
844
|
|
845 just added a return value from the decode and encode methods of a coding
|
|
846 system, so that some of the data can get rejected. fixed the calling
|
|
847 routines to handle this. need to investigate when and whether the coding
|
|
848 lstream is set to character mode, so that the decode/encode methods only
|
|
849 get whole characters. if not, we should do so, according to the source
|
|
850 type of these methods. also need to implement the convert_eol coding
|
|
851 system, and fix the subsidiary coding systems (and in general, any coding
|
|
852 system where the eol type is specified and is not LF) to be chains
|
|
853 involving convert_eol.
|
|
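to illustrate the "return value so some data can get rejected" idea, here is a sketch using UTF-8 as the example encoding (chosen only because the rule for a whole character is easy to state). the decode method reports how many bytes it consumed; the caller keeps the rest and offers it again with the next chunk. all names are hypothetical.

```python
def complete_prefix_len(buf):
    """Length of the longest prefix of buf that ends on a character boundary."""
    n = len(buf)
    for back in range(1, min(4, n) + 1):
        byte = buf[n - back]
        if byte & 0xC0 == 0x80:
            continue                   # continuation byte: keep looking for the lead
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1
        return n if back >= need else n - back
    return n


def decode_utf8(buf):
    """Decode only whole characters; return (text, bytes_consumed)."""
    used = complete_prefix_len(buf)
    return buf[:used].decode("utf-8", errors="replace"), used


def decode_stream(chunks):
    """Caller side: hold on to rejected bytes and resubmit them."""
    pending = b""
    out = []
    for chunk in chunks:
        pending += chunk
        text, used = decode_utf8(pending)
        out.append(text)
        pending = pending[used:]       # rejected tail, offered again next time
    return "".join(out), pending
```

the decode method itself stays stateless: it never has to remember half a character, because the caller does the remembering.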
854
|
|
855 after everything is working, need to remove eol handling from encode/decode
|
|
856 methods and eventually consider rewriting (simplifying) them given the
|
|
857 reject ability.
|
|
858
|
|
859 September 5, 2001:
|
|
860
|
|
861 -- need to organize this. get everything below into the TODO list.
|
|
862 CVS the TODO list frequently so i can delete old stuff. prioritize
|
|
863 it!!!!!!!!!
|
|
864
|
|
865 -- move README.ben-mule... to STATUS.ben-mule...; use README for
|
|
866 intro, overview of what's new, what's broken, how to use the
|
|
867 features, etc.
|
|
868
|
|
869 -- need a global and local coding-category-precedence list, which get
|
|
870 merged.
|
|
871
|
|
872 -- finished the BOM support. also finished something not listed
|
|
873 below, expansion to the auto-generator of Unicode-encapsulation to
|
|
874 support bracketing code with #if ... #endif, for Cygwin and MINGW
|
|
875 problems, e.g. This is tested; appears to work.
|
|
876
|
|
877 -- need to add more multibyte coding systems now that we have various
|
|
878 properties to specify them. need to add DEFUN's for mac-code-page
|
|
879 and ebcdic-code-page for completeness. need to rethink the whole
|
|
880 way that the priority list works. it will continue to be total
|
|
881 junk until multiple levels of likeliness get implemented.
|
|
882
|
|
883 -- need to finish up the stuff about the various defaults. [need to
|
|
884 investigate more generally where all the different default values
|
|
885 are that control encoding. (there are six places or so.) need to
|
|
886 list them in make-coding-system docs and put pointers
|
|
887 elsewhere. [[[[#### what interface to specify that this default
|
|
888 should be unicode? a "Unicode" language environment seems too
|
|
889 drastic, as the language environment controls much more.]]]] even
|
|
890 skipping the Unicode stuff here, we need to survey and list the
|
|
891 variables that control coding page behavior and determine how they
|
|
892 need to be set for various possible scenarios:
|
|
893
|
|
894 -- total binary: no detection at all.
|
|
895 -- raw-text only: wants only autodetection of line endings, nothing else.
|
|
896 -- "standard Windows environment": tries for Unicode, falls back on
|
|
897 code page encoding.
|
|
898 -- some sort of East European environment, and Russian.
|
|
899 -- some sort of standard Japanese Windows environment.
|
|
900 -- standard Chinese Windows environments (traditional and simplified)
|
|
901 -- various Unix environments (European, Japanese, Russian, etc.)
|
|
902 -- Unicode support in all of these when it's reasonable
|
|
903
|
|
904 These really require multiple likelihood levels to be fully
|
|
905 implementable. We should see what can be done ("gracefully fall
|
|
906 back") with single likelihood level. need lots of testing.
|
|
907
|
|
908 -- need to fix the truename problem.
|
|
909
|
|
910 -- lots of testing: need to test all of the stuff above and below that's recently been implemented.
|
|
911
|
|
912
|
|
913
|
|
914 September 4, 2001:
|
|
915
|
|
916 mostly everything compiles. currently there is a crash in
|
|
917 parse-unicode-translation-table, and Cygwin/Mule won't run. it may
|
|
918 well be a bug in the sscanf() in Cygwin.
|
|
919
|
|
920 working on today:
|
|
921
|
|
922 -- adding BOM support for Unicode coding systems. mostly there, but
|
|
923 need to finish adding BOM support to the detection routines. then test.
|
|
924 -- adding properties to unicode-to-multibyte to specify the coding
|
|
925 system in various flexible ways, e.g. directly specified code page
|
|
926 or ansi or oem code page of specified locale, current locale,
|
|
927 user-default or system-default locale. need to test.
|
|
928 -- creating a `multibyte' coding system, with the same parameters as
|
|
929 unicode-to-multibyte and which resolves at coding-system-creation
|
|
930 time to the appropriate chain. creating the underlying mechanism
|
|
931 to allow such under-the-scenes switcheroo. need to test.
|
|
932 -- set default-value of buffer-file-coding-system to
|
|
933 mswindows-multibyte, as Matej said it should be. need to test.
|
|
934 need to investigate more generally where all the different default
|
|
935 values are that control encoding. (there are six places or so.)
|
|
936 need to list them in make-coding-system docs and put pointers
|
|
937 elsewhere. #### what interface to specify that this default should
|
|
938 be unicode? a "Unicode" language environment seems too drastic, as
|
|
939 the language environment controls much more.
|
|
940 -- thinking about adding multiple levels of certainty to the detection
|
|
941 schemes, instead of just a mask. eventually, we need to totally
|
|
942 abstract things, but that can more easily be done in many steps. (we
|
|
943 need multiple levels of likelihood to more reasonably support a
|
|
944 Windows environment with code-page type files. currently, in order
|
|
945 to get them detected, we have to put them first, because they can
|
|
946 look like lots of other things; but then, other encodings don't get
|
|
947 detected. with multiple levels of likelihood, we still put the
|
|
948 code-page categories first, but they will return low levels of
|
|
949 likelihood. Lower-down encodings may be able to return higher
|
|
950 levels of likelihood, and will get taken preferentially.)
|
|
951 -- making it so you cannot disable file-coding, but you get an
|
|
952 equivalent default on Unix non-Mule systems where all defaults are
|
|
953 `binary'. need to test!!!!!!!!!
|
|
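the multiple-levels-of-likelihood idea from the list above, as a toy. the detectors and their numeric levels (0 = impossible up to 3 = near certain) are invented for illustration and are far cruder than real detection; the point is only that code-page categories can stay first in the priority list and still lose to a lower-down detector that is more confident.

```python
import codecs


def detect_utf16(data):
    # a BOM is about as certain as detection gets
    return 3 if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)) else 0


def detect_utf8(data):
    if data.startswith(codecs.BOM_UTF8):
        return 3
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return 0
    # valid multi-byte sequences are decent evidence; pure ASCII says little
    return 2 if any(byte >= 0x80 for byte in data) else 1


def detect_code_page(data):
    return 1                   # almost anything looks like a code-page file


# priority order: code-page first, as in the Windows setup described above
PRIORITY = [
    ("code-page", detect_code_page),
    ("utf-8", detect_utf8),
    ("utf-16", detect_utf16),
]


def detect(data):
    best_name, best_level = None, 0
    for name, detector in PRIORITY:
        level = detector(data)
        if level > best_level:         # strict '>': earlier entries win ties
            best_name, best_level = name, level
    return best_name
```

with a plain mask, putting code-page first would swallow everything; with levels, it only wins when nothing else is more confident.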
954
|
|
955 Matej (mostly, + some others) notes the following problems, and here
|
|
956 are possible solutions:
|
|
957
|
|
958 -- he wants the defaults to work right. [figure out what those
|
|
959 defaults are. i presume they are auto-detection of data in current
|
|
960 code page and in unicode, and new files have current code page set
|
|
961 as their output encoding.]
|
|
962
|
|
963 -- too easy to lose data with incorrect encodings. [need to set up an
|
|
964 error system for encoding/decoding. extremely important but a
|
|
965 little tricky to implement so let's deal with other issues now.]
|
|
966
|
|
967 -- EOL isn't always detected correctly. [#### ?? need examples]
|
|
968
|
|
969 -- truename isn't working: c:\t.txt and c:\tmp.txt have the same truename.
|
|
970 [should be easy to fix]
|
|
971
|
|
972 -- unicode files lose the BOM mark. [working on this]
|
|
973
|
|
974 -- command-line utilities use OEM. [actually it seems more
|
|
975 complicated. it seems they use the codepage of the console. we
|
|
976 may be able to set that, e.g. to UTF8, before we invoke a command.
|
|
977 need to investigate.]
|
|
978
|
|
979 -- no way to handle unicode characters not recognized as charsets. [we
|
|
980 need to create something like 8 private 2-dimensional charsets to
|
|
981 handle all BMP Unicode chars. Obviously this is a stopgap
|
|
982 solution. Switching to Unicode internal will ultimately make life
|
|
983 far easier and remove the BMP limitation. but for now it will
|
|
984 work. we translate all characters where we have charsets into
|
|
985 chars in those charsets, and the remainder in a unicode charset.
|
|
986 that way we can save them out again and guarantee no data loss with
|
|
987 unicode. this creates font problems, though ...]
|
|
988
|
|
989 -- problems with xemacs font handling. [xemacs font handling is not
|
|
990 sophisticated enough. it goes on a charset granularity basis and
|
|
991 only looks for a font whose name contains the corresponding windows
|
|
992 charset in it. with unicode this fails in various ways. for one
|
|
993 the granularity needs to be single character, so that those unicode
|
|
994 charsets mentioned above work; and it needs to query the font to
|
|
995 see what unicode ranges it supports, rather than just looking at
|
|
996 the charset ending.]
|
|
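the arithmetic behind "something like 8 private 2-dimensional charsets" in the list above can be sketched as follows. this assumes 96x96 charsets with position codes 32..127, which is my assumption for the sake of the example; 96 * 96 = 9216 cells per charset, and 65536 / 9216 is a bit over 7, hence 8 charsets. the function names are hypothetical.

```python
CELLS_PER_CHARSET = 96 * 96    # assumed size of one 2-dimensional charset


def to_private(code_point):
    """Map a BMP code point to (charset index, row code, column code)."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only BMP code points are covered")
    index, offset = divmod(code_point, CELLS_PER_CHARSET)
    row, col = divmod(offset, 96)
    return index, row + 32, col + 32


def from_private(index, row, col):
    """Inverse of to_private: recover the original code point."""
    return index * CELLS_PER_CHARSET + (row - 32) * 96 + (col - 32)
```

the mapping is a plain bijection, so characters with no real charset can be written back out as the same unicode code points they came in as.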
997
|
|
998
|
|
999
|
|
1000 August 28, 2001:
|
|
1001
|
|
1002 working on getting everything to compile again: Cygwin, non-MULE,
|
|
1003 pdump. not there yet.
|
|
1004
|
|
1005 mswindows-multibyte is now defined using chain, and works. removed
|
|
1006 most vestiges of the mswindows-multibyte coding system type.
|
|
1007
|
|
1008 file-coding is on by default; should default to binary only on Unix.
|
|
1009 Need to test. (Needs to compile first :-)
|
|
1010
|
|
1011 August 26, 2001:
|
|
1012
|
|
1013 I've fixed the issue of inputting non-ASCII text under -nuni, and done
|
|
1014 some of the work on the Russian C-x problem -- we now compute the
|
|
1015 other possibilities. We still need to fix the key-lookup code,
|
|
1016 though, and that code is unfortunately a bit ugly. the best way, it
|
|
1017 seems, is to expand the command-builder structure so you can specify
|
|
1018 different interpretations for keys. (if we do find an alternative
|
|
1019 binding, though, we need to mess with both the command builder and
|
|
1020 this-command-keys, as does the function-key stuff. probably need to
|
|
1021 abstract that munging code.)
|
|
1022
|
|
1023 high-priority:
|
|
1024
|
|
1025 [currently doing]
|
|
1026
|
|
1027 -- support for WM_IME_CHAR. IME input can work under -nuni if we use
|
|
1028 WM_IME_CHAR. probably we should always be using this, instead of
|
|
1029 snarfing input using WM_IME_COMPOSITION. i'll check this out.
|
|
1030 -- Russian C-x problem. see above.
|
|
1031
|
|
1032 [clean-up]
|
|
1033
|
|
1034 -- make sure it compiles and runs under non-mule. remember that some
|
|
1035 code needs the unicode support, or at least a simple version of it.
|
|
1036 -- make sure it compiles and runs under pdump. see below.
|
|
1037 -- clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE]
|
|
1038 -- eliminate last vestiges of codepage<->charset conversion and similar stuff.
|
|
1039
|
|
1040 [other]
|
|
1041 -- cut and paste. see below.
|
|
1042 -- misc issues with handling lang environments. see also August 25,
|
|
1043 "finally: working on the C-x in ...".
|
|
1044 -- when switching lang env, needs to set keyboard layout.
|
|
1045 -- user var to control whether, when moving into text of a
|
|
1046 particular language, we set the appropriate keyboard layout. we
|
|
1047 would need to have a lisp api for retrieving and setting the
|
|
1048 keyboard layout, set text properties to indicate the layout of
|
|
1049 text, and have a way of dealing with text with no property on
|
|
1050 it. (e.g. saved text has no text properties on it.) basically,
|
|
1051 we need to get a keyboard layout from a charset; getting a
|
|
1052 language would do. Perhaps we need a table that maps charsets
|
|
1053 to language environments.
|
|
1054 -- test that the lang env is properly set at startup. test that
|
|
1055 switching the lang env properly sets the C locale (call
|
|
1056 setlocale(), set LANG, etc.) -- a spawned subprogram should have
|
|
1057 the new locale in its environment.
|
|
1058 -- look through everything below and see if anything is missed in this
|
|
1059 priority list, and if so add it. create a separate file for the
|
|
1060 priority list, so it can be updated as appropriate.
|
|
1061
|
|
1062
|
|
1063 mid-priority:
|
|
1064
|
|
1065 -- clean up the chain coding system. its list should specify decode
|
|
1066 order, not encode; i now think this way is more logical. it should
|
|
1067 check the endpoints to make sure they make sense. it should also
|
|
1068 allow for the specification of "reverse-direction coding systems":
|
|
1069 use the specified coding system, but invert the sense of decode and
|
|
1070 encode.
|
|
1071
|
|
1072 -- along with that, places that take an arbitrary coding system and
|
|
1073 expect the ends to be anything specific need to check this, and add
|
|
1074 the appropriate conversions from byte->char or char->byte.
|
|
1075
|
|
1076 -- get some support for arabic, thai, vietnamese, japanese jisx 0212:
|
|
1077 at least get the unicode information in place and make sure we have
|
|
1078 things tied together so that we can display them. worry about r2l
|
|
1079 some other time.
|
|
1080
|
|
1081 August 25, 2001:
|
|
1082
|
|
1083 There is actually more non-Unicode-ized stuff, but it's basically
|
|
1084 inconsequential. (See previous note.) You can check using the file
|
|
1085 nmkun.txt (#### RENAME), which is just a list of all the routines that
|
|
1086 have been split. (It was generated from the output of `nmake
|
|
1087 unicode-encapsulate', after removing everything from the output but
|
|
1088 the function names.) Use something like
|
|
1089
|
|
1090 fgrep -f ../nmkun.txt -w [a-hj-z]*.[ch] |m
|
|
1091
|
|
1092 in the source directory, which does a word match and skips
|
|
1093 intl-unicode-win32.[ch] and intl-win32.[ch], which have a whole lot of
|
|
1094 references to these, unavoidably. It effectively detects what needs
|
|
1095 to be changed because changed versions either begin qxe... or end with
|
|
1096 A or W, and in each case there's no whole-word match.
|
|
1097
|
|
1098 The nasty bug has been fixed below. The -nuni option now works -- all
|
|
1099 specially-written code to handle the encapsulation has been tested by
|
|
1100 some operation (fonts by loadup and checking the output of (list-fonts
|
|
1101 ""); devmode by printing; dragdrop tests other stuff).
|
|
1102
|
|
1103 NOTE: for -nuni (Win 95), these areas need work:
|
|
1104
|
|
1105 -- cut and paste. we should be able to receive Unicode text if it's
|
|
1106 there, and we should be able to receive it even in Win 95 or -nuni.
|
|
1107 we should just check in all circumstances. also, under 95, when we
|
|
1108 put some text in the clipboard, it may or may not also be
|
|
1109 automatically enumerated as unicode. we need to test this out
|
|
1110 and/or just go ahead and manually do the unicode enumeration.
|
|
1111
|
|
1112 -- receiving keyboard input. we get only a single byte, but we should
|
|
1113 be able to correlate the language of the keyboard layout to a
|
|
1114 particular code page, so we can then decode it correctly.
|
|
1115
|
|
1116 -- mswindows-multibyte. still implemented as its own thing. should
|
|
1117 be done as a chain of (encoding) unicode | unicode-to-multibyte.
|
|
1118 need to turn this on, get it working, and look into optimizations
|
|
1119 in the dfc stuff. (#### perhaps there's a general way to do these
|
|
1120 optimizations??? something like having a method on a coding system
|
|
1121 that can specify whether a pure-ASCII string gets rendered as
|
|
1122 pure-ASCII bytes and vice-versa.)
|
|
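for the keyboard-input item in the list above, a sketch of correlating the layout's language with a code page so the single byte can be decoded. the table is a tiny hand-written sample (US English, Russian, Greek language ids with their usual ANSI code pages), and `decode_key_byte` is a made-up name; a real version would ask the system for the code page instead of hard-coding it.

```python
# hypothetical sample table: keyboard layout language id -> Python codec name
LANGID_TO_CODEC = {
    0x0409: "cp1252",          # English (United States)
    0x0419: "cp1251",          # Russian
    0x0408: "cp1253",          # Greek
}


def decode_key_byte(byte_value, layout_langid, default="cp1252"):
    """Decode one input byte using the code page tied to the active layout."""
    codec = LANGID_TO_CODEC.get(layout_langid, default)
    return bytes([byte_value]).decode(codec, errors="replace")
```

the same byte means different characters under different layouts, which is why the layout has to be consulted at the moment the key arrives.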
1123
|
|
1124
|
|
1125 ALSO:
|
|
1126
|
|
1127 -- we have special macros TSTR_TO_C_STRING and such because formerly
|
|
1128 the DFC macros didn't know about external stuff that was Unicode
|
|
1129 encoded and would call strlen() on them. this is fixed, so now we
|
|
1130 should undo the special macros, make em normal, remove the
|
|
1131 comments about this, and make sure it works. [DONE]
|
|
1132
|
|
1133
|
|
1134 -- finally: working on the C-x in Russian key layout problem. in the
|
|
1135 process will probably end up doing work on cleaning up the handling
|
|
1136 of keyboard layouts, integrating or deleting the FSF stuff, adding
|
|
1137 code to change the keyboard layout as we move in and out of text in
|
|
1138 different languages (implemented as a post-command-hook; we need
|
|
1139 something like internal-post-command-hook if not already there, for
|
|
1140 internal stuff that doesn't want to get mixed up with the regular
|
|
1141 post-command-hook; similar for pre-command-hook). also, when
|
|
1142 langenv changes, ways to set the keyboard layout appropriately.
|
|
1143
|
|
1144 -- i think the stuff above is higher priority than the other stuff
|
|
1145 mentioned below. what i'm aiming for is to be able to input and
|
|
1146 work with multiple languages without weird glitches, both under 95
|
|
1147 and NT. the problems above are all basic impediments to such work.
|
|
1148 we assume for the moment that the user can make use of the existing
|
|
1149 file i/o conversion stuff, and put that lower in priority, after
|
|
1150 the basic input is working.
|
|
1151
|
|
1152 -- i should get my modem connected and write up what's going on and
|
|
1153 send it to the lists; also cvs commit my workspaces and get more
|
|
1154 testers.
|
|
1155
|
|
1156 August 24, 2001:
|
|
1157
|
|
1158 All code has been Unicode-ized except for some stuff in console-msw.c
|
|
1159 that deals with console output. Much of the Unicode-encapsulation
|
|
1160 stuff, particularly the hand-written stuff, really needs testing. I
|
|
1161 added a new command-line option, -nuni, to force use of all ANSI calls
|
|
1162 -- XE_UNICODEP evaluates to false in this case.
|
|
1163
|
|
1164 There is a nasty bug that appeared recently, probably when the event
|
|
1165 code got Unicode-ized -- bad interactions with OS sticky modifiers.
|
|
1166 Hold the shift key down and release it, then instead of affecting the
|
|
1167 next char only, it gets permanently stuck on (until you do a regular
|
|
1168 shift+char stroke). This needs to be debugged.
|
|
1169
|
|
1170 Other things on agenda:
|
|
1171
|
|
1172 -- go through and prioritize what's listed below.
|
|
1173
|
|
1174 -- make sure the pdump code can compile and work. for the moment we
|
|
1175 just don't try to dump any Unicode tables and load them up each
|
|
1176 time. this is certainly fast but ...
|
|
1177
|
|
1178 -- there's the problem that XEmacs can't be run in a directory with
|
|
1179 non-ASCII/Latin-1 chars in it, since it will be doing Unicode
|
|
1180 processing before we've had a chance to load the tables. In fact,
|
|
1181 even finding the tables in such a situation is problematic using
|
|
1182 the normal commands. my idea is to eventually load the stuff
|
|
1183 extremely extremely early, at the same time as the pdump data gets
|
|
1184 loaded. in fact, the unicode table data (stored in an efficient
|
|
1185 binary format) can even be stuck into the pdump file (which would
|
|
1186 mean as a resource to the executable, for windows). we'd need to
|
|
1187 extend pdump a bit: to allow for attaching extra data to the pdump
|
|
1188 file. (something like pdump_attach_extra_data (addr, length)
|
|
1189 returns a number of some sort, an index into the file, which you
|
|
1190 can then retrieve with pdump_load_extra_data(), which returns an
|
|
1191 addr (mmap()ed or loaded), and later you pdump_unload_extra_data()
|
|
1192 when finished. we'd probably also need
|
|
1193 pdump_attach_extra_data_append(), which appends data to the data
|
|
1194 just written out with pdump_attach_extra_data(). this way,
|
|
1195 multiple tables in memory can be written out into one contiguous
|
|
1196 table. (we'd use the tar-like trick of allowing new blocks to be
|
|
1197 written without going back to change the old blocks -- we just rely
|
|
1198 on the end of file/end of memory.) this same mechanism could be
|
|
1199 extracted out of pdump and used to handle the non-pdump situation
|
|
1200 (or alternatively, we could just dump either the memory image of
|
|
1201 the tables themselves or the compressed binary version). in the
|
|
1202 case of extra unicode tables not known about at compile time that
|
|
1203 get loaded before dumping, we either just dump them into the image
|
|
1204 (pdump and all) or extract them into the compressed binary format,
|
|
1205 free the original tables, and treat them like all other tables.
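
To make the append-only idea concrete, here is a rough sketch of an "extra data" area where new chunks are only ever written at the end and nothing earlier is touched. It is an illustration under assumptions, not the pdump design: the chunk header layout, the in-memory `struct extra_area` standing in for the tail of the dump file, and the function names `extra_attach`, `extra_attach_append` and `extra_load` are all invented here. An item is simply every chunk that carries the same id, so appending to the item just written means writing one more chunk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXTRA_AREA_SIZE 4096

/* Stand-in for the tail of the dump file.  Must start zeroed. */
struct extra_area
{
  unsigned char bytes[EXTRA_AREA_SIZE];
  size_t used;       /* plays the role of "end of file" */
  uint32_t next_id;  /* id the next attached item will get */
};

struct chunk_header
{
  uint32_t id;      /* which item this chunk belongs to */
  uint32_t length;  /* payload bytes following the header */
};

/* Append one chunk at the end of the area; never rewrites old data. */
static int
extra_write_chunk (struct extra_area *a, uint32_t id,
                   const void *data, uint32_t length)
{
  struct chunk_header h;
  if (sizeof h + length > EXTRA_AREA_SIZE - a->used)
    return -1;
  h.id = id;
  h.length = length;
  memcpy (a->bytes + a->used, &h, sizeof h);
  memcpy (a->bytes + a->used + sizeof h, data, length);
  a->used += sizeof h + length;
  return 0;
}

/* Start a new item.  Returns its id, or -1 if out of space. */
static int
extra_attach (struct extra_area *a, const void *data, uint32_t length)
{
  uint32_t id = a->next_id;
  if (extra_write_chunk (a, id, data, length) < 0)
    return -1;
  a->next_id++;
  return (int) id;
}

/* Add more data to the item most recently attached. */
static int
extra_attach_append (struct extra_area *a, const void *data, uint32_t length)
{
  if (a->next_id == 0)
    return -1;
  return extra_write_chunk (a, a->next_id - 1, data, length);
}

/* Gather all chunks of item ID into OUT.  Returns the total length,
   or -1 if the item does not exist or OUT is too small. */
static long
extra_load (const struct extra_area *a, uint32_t id,
            void *out, size_t outsize)
{
  size_t pos = 0, total = 0;
  int found = 0;
  while (pos + sizeof (struct chunk_header) <= a->used)
    {
      struct chunk_header h;
      memcpy (&h, a->bytes + pos, sizeof h);
      pos += sizeof h;
      if (h.id == id)
        {
          if (total + h.length > outsize)
            return -1;
          memcpy ((unsigned char *) out + total, a->bytes + pos, h.length);
          total += h.length;
          found = 1;
        }
      pos += h.length;
    }
  return found ? (long) total : -1;
}
```

A real version would hand back an mmap()ed address instead of copying, which argues for keeping each item's payload contiguous; the per-chunk headers here trade that away for simplicity.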
|
|
1206
|
|
1207 -- `C-x b' when using a Russian keyboard layout. XEmacs currently
|
|
1208 tries to interpret C+cyrillic char, which causes an error. We want
|
|
1209 C-x b to still work even when the keyboard normally generates
|
|
1210 Cyrillic. What we should do is expand the keyboard event structure
|
|
1211 so that it contains not only the actual char, but what the char
|
|
1212 would have been in various other keyboard layouts, and in contexts
|
|
1213 where only certain keystrokes make sense (creating control chars,
|
|
1214 and looking up in keymaps), we proceed in order, processing each of
|
|
1215 them until we get something. order should be something like:
|
|
1216 current keyboard layout; layout of the current language
|
|
1217 environment; layout of the user's default language; layout of the
|
|
1218 system default language; layout of US English.
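
The fallback order above can be sketched as a small lookup loop. Everything here is hypothetical (the `key_event` layout, the slot enum, the predicate); it only shows the shape of "try each layout's character in order until one is usable in this context".

```c
#include <stddef.h>

/* Fallback order proposed in the note above. */
enum layout_slot
{
  LAYOUT_CURRENT,
  LAYOUT_LANGENV,
  LAYOUT_USER_DEFAULT,
  LAYOUT_SYSTEM_DEFAULT,
  LAYOUT_US_ENGLISH,
  LAYOUT_SLOT_COUNT
};

/* Hypothetical expanded key event: the character the key produced,
   plus what the same physical key would produce under each fallback
   layout.  0 means "not known". */
struct key_event
{
  unsigned int chars[LAYOUT_SLOT_COUNT];
};

/* Example context predicate: can CH be turned into a control char?
   (Simplified to the ASCII ranges that have control equivalents.) */
static int
can_make_control_char (unsigned int ch)
{
  return (ch >= '@' && ch <= '_') || (ch >= 'a' && ch <= 'z');
}

/* Walk the layouts in order and return the first character that the
   context accepts, or 0 if none does. */
static unsigned int
resolve_key_char (const struct key_event *ev, int (*usable) (unsigned int))
{
  int i;
  for (i = 0; i < LAYOUT_SLOT_COUNT; i++)
    if (ev->chars[i] != 0 && usable (ev->chars[i]))
      return ev->chars[i];
  return 0;
}
```

With a Russian layout active, the key that is `x' on a US keyboard yields U+0447; the loop skips it and the langenv copy and lands on the US English slot, so C-x still resolves. For keymap lookup the predicate would instead ask whether the candidate key has a binding.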
|
|
1219
|
|
1220 -- reading and writing Unicode files. multiple problems:
|
|
1221
|
|
1222 -- EOL's aren't handled right. for the moment, just fix the
|
|
1223 Unicode coding systems; later on, create EOL-only coding
|
|
1224 systems:
|
|
1225
|
|
1226 1. they would be character->character and operate next to the
|
|
1227 internal data; this means that coding systems need to be able
|
|
1228 to handle ends of lines that are either CR, LF, or CRLF.
|
|
1229 usually this isn't a problem, as they are just characters
|
|
1230 like any other and get encoded appropriately. however,
|
|
1231 coding systems that are line-oriented need to recognize any
|
|
1232 of the three as line endings.
|
|
1233
|
|
1234 2. we'd also have to complete the stuff that handles coding
|
|
1235 systems where either end can be byte or char (four
|
|
1236 possibilities total; use a single enum such as
|
|
1237 ENCODES_CHAR_TO_BYTE, ENCODES_BYTE_TO_BYTE, etc.).
|
|
1238
|
|
1239 3. we'd need ways of specifying the chaining of coding systems.
|
|
1240 e.g. when reading a coding system, a user can specify more
|
|
1241 than one with a | symbol between them. when a context calls
|
|
1242 for a coding system and a chain is needed, the `chain' coding
|
|
1243 system is useful; but we should really expand the contexts
|
|
1244 where a list of coding systems can be given, and whenever
|
|
1245 possible try to inline the chain instead of using a
|
|
1246 surrounding `chain' coding system.
|
|
1247
|
|
1248 4. the `chain' needs some work so that it passes all sorts of
|
|
1249 lstream commands down to the chain inside it -- it should be
|
|
1250 entirely transparent and the fact that there's actually a
|
|
1251 surrounding coding system should be invisible. more general
|
|
1252 coding system methods might need to be created.
|
|
1253
|
|
1254 5. important: we need a way of specifying how detecting works
|
|
1255 when we have more than one coding system. we might need more
|
|
1256 than a single priority list. need to think about this.
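
Points 2 and 3 fit together: once each coding system declares what it has on either end, a chain can be checked mechanically. A minimal sketch follows, assuming the enum from point 2. Only ENCODES_CHAR_TO_BYTE and ENCODES_BYTE_TO_BYTE are named in these notes; the other two enumerators, `struct cs_desc` and `chain_ends` are guesses made for illustration. The chain is read left to right in the encoding direction.

```c
#include <stddef.h>

enum conversion_ends
{
  ENCODES_CHAR_TO_BYTE,   /* e.g. a Unicode coding system */
  ENCODES_BYTE_TO_BYTE,   /* e.g. gzip */
  ENCODES_CHAR_TO_CHAR,   /* assumed name; e.g. an EOL-only system */
  ENCODES_BYTE_TO_CHAR    /* assumed name */
};

enum data_kind { KIND_CHAR, KIND_BYTE };

/* Hypothetical descriptor: just a name and the end types. */
struct cs_desc
{
  const char *name;
  enum conversion_ends ends;
};

static enum data_kind
input_kind (enum conversion_ends e)
{
  return (e == ENCODES_CHAR_TO_BYTE || e == ENCODES_CHAR_TO_CHAR)
    ? KIND_CHAR : KIND_BYTE;
}

static enum data_kind
output_kind (enum conversion_ends e)
{
  return (e == ENCODES_CHAR_TO_CHAR || e == ENCODES_BYTE_TO_CHAR)
    ? KIND_CHAR : KIND_BYTE;
}

/* Check that each link's output matches the next link's input.  On
   success return 1 and store the end types of the chain as a whole in
   *OVERALL; otherwise return 0. */
static int
chain_ends (const struct cs_desc *chain, size_t n,
            enum conversion_ends *overall)
{
  size_t i;
  enum data_kind in, out;

  if (n == 0)
    return 0;
  for (i = 0; i + 1 < n; i++)
    if (output_kind (chain[i].ends) != input_kind (chain[i + 1].ends))
      return 0;

  in = input_kind (chain[0].ends);
  out = output_kind (chain[n - 1].ends);
  if (in == KIND_CHAR)
    *overall = (out == KIND_CHAR) ? ENCODES_CHAR_TO_CHAR : ENCODES_CHAR_TO_BYTE;
  else
    *overall = (out == KIND_CHAR) ? ENCODES_BYTE_TO_CHAR : ENCODES_BYTE_TO_BYTE;
  return 1;
}
```

So an EOL system, then a Unicode system, then gzip is a valid chain that encodes chars to bytes overall, while gzip followed by a Unicode system is rejected because bytes are fed to something expecting chars.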
|
|
1257
|
|
1258 -- Unicode files beginning with the BOM are not recognized as such.
|
|
1259 we need to fix this; but to make things sensible, we really need
|
|
1260 to add the idea of different levels of confidence regarding
|
|
1261 what's detected. otherwise, Unicode says "yes this is me" but
|
|
1262 others higher up do too. in the process we should probably
|
|
1263 finish abstracting the detection system and fix up some
|
|
1264 stupidities in it.
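
A sketch of what graded detection might look like, under assumptions: the confidence enum, the `struct detection` result and the coding-system label strings are invented for this example and are not the detection code in file-coding.c. The byte signatures themselves are the standard BOM encodings. The point is that a detector that saw a BOM can report a stronger answer than one that merely found nothing objectionable, and the caller picks by confidence first and priority order second.

```c
#include <stddef.h>

/* Hypothetical confidence levels, weakest to strongest. */
enum detect_confidence
{
  DET_NO = 0,
  DET_POSSIBLE,
  DET_LIKELY,
  DET_NEAR_CERTAIN
};

struct detection
{
  const char *coding_system;  /* illustrative label, NULL if none */
  enum detect_confidence confidence;
};

/* Look for a byte-order mark at the start of BUF. */
static struct detection
detect_unicode_bom (const unsigned char *buf, size_t len)
{
  struct detection d = { NULL, DET_NO };

  if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00
      && buf[2] == 0xFE && buf[3] == 0xFF)
    {
      d.coding_system = "ucs-4";
      d.confidence = DET_NEAR_CERTAIN;
    }
  else if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
           && buf[2] == 0x00 && buf[3] == 0x00)
    {
      /* Could also be little-endian UTF-16 text starting with NUL,
         so claim a bit less. */
      d.coding_system = "ucs-4-little-endian";
      d.confidence = DET_LIKELY;
    }
  else if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    {
      d.coding_system = "utf-8";
      d.confidence = DET_NEAR_CERTAIN;
    }
  else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
    {
      d.coding_system = "utf-16";
      d.confidence = DET_NEAR_CERTAIN;
    }
  else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
    {
      d.coding_system = "utf-16-little-endian";
      d.confidence = DET_NEAR_CERTAIN;
    }
  return d;
}

/* RESULTS is in priority order.  Pick the highest confidence; on a
   tie the earlier (higher-priority) entry wins.  NULL if all say no. */
static const struct detection *
pick_best (const struct detection *results, size_t n)
{
  const struct detection *best = NULL;
  size_t i;
  for (i = 0; i < n; i++)
    if (results[i].confidence != DET_NO
        && (best == NULL || results[i].confidence > best->confidence))
      best = &results[i];
  return best;
}
```

With a single yes/no answer, an 8-bit coding system higher in the priority list would claim a BOM-prefixed file first; with levels, its "possible" loses to the Unicode detector's stronger claim.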
|
|
1265
|
|
1266 -- When writing a file, we need error detection; otherwise somebody
|
|
1267 will create a Unicode file without realizing the coding system
|
|
1268 of the buffer is Raw, and then lose all the non-ASCII/Latin-1
|
|
1269 text when it's written out. We need two levels:
|
|
1270
|
|
1271 1. first, a "safe-charset" level that checks before any actual
|
|
1272 encoding to see if all characters in the document can safely
|
|
1273 be represented using the given coding system. FSF has a
|
|
1274 "safe-charset" property of coding systems, but it's stupid
|
|
1275 because this information can be automatically derived from
|
|
1276 the coding system, at least the vast majority of the time.
|
|
1277 What we need is some sort of
|
|
1278 alternative-coding-system-precedence-list, langenv-specific,
|
|
1279 where everything on it can be checked for safe charsets and
|
|
1280 then the user given a list of possibilities. When the user
|
|
1281 does "save with specified encoding", they should see the same
|
|
1282 precedence list. Again like with other precedence lists,
|
|
1283 there's also a global one, and presumably all coding systems
|
|
1284 not on the other lists get appended to the end (and perhaps not
|
|
1285 checked at all when doing safe-checking?). safe-checking
|
|
1286 should work something like this: compile a list of all
|
|
1287 charsets used in the buffer, along with a count of chars
|
|
1288 used. that way, "slightly unsafe" charsets can perhaps be
|
|
1289 presented at the end, which will lose only a few characters
|
|
1290 and are perhaps what the users were looking for.
|
|
1291
|
|
1292 2. when actually writing out, we need error checking in case an
|
|
1293 individual char in a charset can't be written even though the
|
|
1294 charsets are safe. again, the user gets the choice of other
|
|
1295 reasonable coding systems.
|
|
1296
|
|
1297 3. same thing (error checking, list of alternatives, etc.) needs
|
|
1298 to happen when reading! all of this will be a lot of work!
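
The safe-checking pass in point 1 can be sketched as follows. This is an illustration under assumptions: charsets are reduced to small integer ids, each candidate coding system carries a bitmask of the charsets it can represent, and all names (`charset_census`, `candidate`, `rank_candidates`) are invented. The census records how many characters of each charset the buffer uses; each candidate is then scored by how many characters it would lose, and candidates are ordered so that fully safe ones come first and "slightly unsafe" ones follow.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHARSETS 32

/* How many characters of each charset the buffer contains. */
struct charset_census
{
  unsigned long count[MAX_CHARSETS];
};

/* Hypothetical candidate from the precedence list. */
struct candidate
{
  const char *name;
  uint32_t safe_charsets;  /* bit i set: charset i is representable */
  unsigned long lost;      /* filled in by rank_candidates */
};

static void
census_add (struct charset_census *c, int charset_id, unsigned long n)
{
  if (charset_id >= 0 && charset_id < MAX_CHARSETS)
    c->count[charset_id] += n;
}

/* Number of characters that fall outside the SAFE set. */
static unsigned long
chars_lost (const struct charset_census *c, uint32_t safe)
{
  unsigned long lost = 0;
  int i;
  for (i = 0; i < MAX_CHARSETS; i++)
    if (!(safe & ((uint32_t) 1 << i)))
      lost += c->count[i];
  return lost;
}

/* Score every candidate, then sort by characters lost.  Insertion
   sort is stable, so candidates with equal loss keep their order
   from the precedence list. */
static void
rank_candidates (struct candidate *cands, size_t n,
                 const struct charset_census *c)
{
  size_t i, j;
  for (i = 0; i < n; i++)
    cands[i].lost = chars_lost (c, cands[i].safe_charsets);
  for (i = 1; i < n; i++)
    {
      struct candidate tmp = cands[i];
      for (j = i; j > 0 && cands[j - 1].lost > tmp.lost; j--)
        cands[j] = cands[j - 1];
      cands[j] = tmp;
    }
}
```

This only covers the charset-level check; the per-character failures in point 2 still need to be caught during the actual write.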
|
|
1299
|
|
1300
|
|
1301
|
|
1302 Announcement, August 20, 2001:
|
|
1303
|
|
1304 I'm looking for testers. There is a complete and fast implementation
|
|
1305 in C of Unicode conversion, translations for almost all of the
|
|
1306 standardly-defined charsets that load up automatically and
|
|
1307 instantaneously at runtime, coding systems supporting the common
|
|
1308 external representations of Unicode [utf-16, ucs-4, utf-8,
|
|
1309 little-endian versions of utf-16 and ucs-4; utf-7 is sitting there
|
|
1310 with abort[]s where the coding routines should go, just waiting for
|
|
1311 somebody to implement], and a nice set of primitives for translating
|
|
1312 characters<->codepoints and setting the priority lists used to control
|
|
1313 codepoint->char lookup.
|
|
1314
|
|
1315 It's so far hooked into one place: the Windows IME. Currently I can
|
|
1316 select the Japanese IME from the thing on my tray pad in the lower
|
|
1317 right corner of the screen, and type Japanese into XEmacs, and you get
|
|
1318 Japanese in XEmacs -- regardless of whether you set either your
|
|
1319 current or global system locale to Japanese, and regardless of whether
|
|
1320 you set your XEmacs lang env as Japanese. This should work for many
|
|
1321 other languages, too -- Cyrillic, Chinese either Traditional or
|
|
1322 Simplified, and many others, but YMMV. There may be some lurking
|
|
1323 bugs (hardly surprising for something so raw).
|
|
1324
|
|
1325 To get at this, checkout using `ben-mule-21-5', NOT the simpler
|
|
1326 *`mule-21-5'. For example
|
|
1327
|
|
1328 cvs -d :pserver:xemacs@cvs.xemacs.org:/usr/CVSroot checkout -r ben-mule-21-5 xemacs
|
|
1329
|
|
1330 or you get the idea. the `-r ben-mule-21-5' is important.
|
|
1331
|
|
1332 I keep track of my progress in a file called README.ben-mule-21-5 in
|
|
1333 the root directory of the source tree.
|
|
1334
|
|
1335 WARNING: Pdump might not work. Will be fixed rsn.
|
|
1336
|
|
1337 August 20, 2001:
|
|
1338
|
|
1339 -- still need to sort out demand loading, binary format, etc. figure
|
|
1340 out what the goals are and how we're going to achieve them. for
|
|
1341 the moment let's just say that running XEmacs in a directory with
|
|
1342 Japanese or other weird characters in the name is likely to cause
|
|
1343 problems under MS Windows, but once XEmacs is initialized (and
|
|
1344 before processing init files), all Unicode support is there.
|
|
1345
|
|
1346 -- wrote the size computation routines, although not yet tested.
|
|
1347
|
|
1348 -- lots more abstraction of coding systems; almost done.
|
|
1349
|
|
1350 -- UNICODE WORKS!!!!!
|
|
1351
|
|
1352
|
|
1353 August 19, 2001:
|
|
1354
|
|
1355 Still needed on the Unicode support:
|
|
1356
|
|
1357 -- demand loading: load the Unicode table data the first time a
|
|
1358 conversion needs to be done.
|
|
1359
|
|
1360 -- maybe: table size computation: figure out how big the in-memory
|
|
1361 tables actually are.
|
|
1362
|
|
1363 -- maybe: create a space-efficient binary format for the data, and a
|
|
1364 way to dump out an existing charset's data into this binary format.
|
|
1365 it should allow for many such groups of data to be appended
|
|
1366 together in one file, such that you can just append the new data
|
|
1367 onto the end and not have to go back and modify anything
|
|
1368 previously. (like how tar archives work, and how the UDF format for
|
|
1369 CD-R's and CD-RW's works.)
|
|
1370
|
|
1371 -- maybe: figure out how to be able to access the Unicode tables at
|
|
1372 init_intl() time, before we know how to get at data-directory; that
|
|
1373 way we can handle the need for unicode conversions that come up
|
|
1374 very early, for example if XEmacs is run from a directory
|
|
1375 containing Japanese in it. Presumably we'd want to generalize the
|
|
1376 stuff in pdump.c that deals with the dumper file, so that it can
|
|
1377 handle other files -- putting the file either in the directory of
|
|
1378 the executable or in a resource, maybe actually attached to the
|
|
1379 pdump file itself -- or maybe we just dump the data into the actual
|
|
1380 executable. With pdump we could extend pdump to allow for data
|
|
1381 that's in the pdump file but not actually mapped at startup,
|
|
1382 separate from the data that does get mapped -- and then at runtime
|
|
1383 the pointer gets restored not with a real pointer but an offset
|
|
1384 into the file; another pdump call and we get some way to access the
|
|
1385 data. (tricky because it might be in a resource, not a file. we
|
|
1386 might have to just tell pdump to mmap or whatever the data in, and
|
|
1387 then tell pdump to release it.)
|
|
1388
|
|
1389 -- fix multibyte to use unicode. at first, just reverse
|
|
1390 mswindows-multibyte-to-unicode to be unicode-to-multibyte; later
|
|
1391 implement something in chain to allow for reversal, for declaring
|
|
1392 the ends of the coding systems, etc.
|
|
1393
|
|
1394 -- actually make sure that the IME stuff is working!!!
|
|
1395
|
|
1396 Other things before announcing:
|
|
1397
|
|
1398 -- change so that the Unicode tables are not pdumped. This means we
|
|
1399 need to free any table data out there. Make sure that pdump
|
|
1400 compiles and try to finish the pretty-much-already-done stuff
|
|
1401 with XD_STRUCT_ARRAY and dynamic size computation; just
|
|
1402 need to see what's going on with LO_LINK.
|
|
1403
|
|
1404 August 14, 2001:
|
|
1405
|
|
1406 To do a diff between this workspace and the mainline, use the most recent sync tags, currently:
|
|
1407
|
|
1408 cvs diff -r main-branch-ben-mule-21-5-aug-11-2001-sync -r ben-mule-21-5-post-aug-11-2001-sync
|
|
1409
|
|
1410 Unicode support:
|
|
1411
|
|
1412 Unicode support is important for supporting many languages under
|
|
1413 Windows, such as Cyrillic, without resorting to translation tables for
|
|
1414 particular Windows-specific code pages. Internally, all characters in
|
|
1415 Windows can be represented in two encodings: code pages and Unicode.
|
|
1416 With Unicode support, we can seamlessly support all Windows
|
|
1417 characters. Currently, the test in the drive to support Unicode is if
|
|
1418 IME input works properly, since it is being converted from Unicode.
|
|
1419
|
|
1420 Unicode support also requires that the various Windows API's be
|
|
1421 "Unicode-encapsulated", so that they automatically call the ANSI or
|
|
1422 Unicode version of the API call appropriately and handle the size
|
|
1423 differences in structures. What this means is:
|
|
1424
|
|
1425 -- first, note that Windows already provides a sort of encapsulation
|
|
1426 of all API's that deal with text. All such API's are underlyingly
|
|
1427 provided in two versions, with an A or W suffix (ANSI or "wide"
|
|
1428 i.e. Unicode), and the compile-time constant UNICODE controls which
|
|
1429 is selected by the unsuffixed API. Same thing happens with
|
|
1430 structures. Unfortunately, this is compile-time only, not
|
|
1431 run-time, so not sufficient. (Creating the necessary run-time
|
|
1432 encapsulation is not conceptually difficult, but very time-consuming to
|
|
1433 write. It adds no significant overhead, and the only reason it's
|
|
1434 not standard in Windows is conscious marketing attempts by
|
|
1435 Microsoft to cripple Windows 95. FUCK MICROSOFT! They even
|
|
1436 describe in a KnowledgeBase article exactly how to create such an
|
|
1437 API [although we don't exactly follow their procedure], and point
|
|
1438 out its usefulness; the procedure is also described more generally
|
|
1439 in Nadine Kano's book on Win32 internationalization -- written SIX
|
|
1440 YEARS AGO! Obviously Microsoft has such an API available
|
|
1441 internally.)
|
|
1442
|
|
1443 -- what we do is provide an encapsulation of each standard Windows API
|
|
1444 call that is split into A and W versions. current theory is to
|
|
1445 avoid all preprocessor games; so we name the function with a prefix
|
|
1446 -- "qxe" currently -- and require callers to use the prefixed name.
|
|
1447 Callers need to explicitly use the W version of all structures, and
|
|
1448 convert text themselves using Qmswindows_tstr. the qxe
|
|
1449 encapsulated version will automatically call the appropriate A or W
|
|
1450 version depending on whether we're running on 9x or NT, and copy
|
|
1451 data between W and A versions of the structures as necessary.
|
|
1452
|
|
1453 -- We require the caller to handle the actual translation of text to
|
|
1454 avoid possible overflow when dealing with fixed-size Windows
|
|
1455 structures. There are no such problems when copying data between
|
|
1456 the A and W versions because ANSI text is never larger than its
|
|
1457 equivalent Unicode representation.
|
|
1458
|
|
1459 -- We allow for incremental creation of the encapsulated routines by
|
|
1460 using the coding system Qmswindows_tstr_notyet. This is an alias
|
|
1461 for Qmswindows_multibyte, i.e. it always converts to ANSI; but it
|
|
1462 indicates that it will be changed to Qmswindows_tstr when we have a
|
|
1463 qxe version of the API call that the data is being passed to and
|
|
1464 change the code to use the new function.
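
The shape of one encapsulated call can be shown with a mock. Nothing below is a real Win32 function or structure: `mock_font_a`/`mock_font_w` stand in for an A/W structure pair with a fixed-size text field, `mock_create_font_a`/`mock_create_font_w` stand in for the A/W API pair, and `xe_unicodep` is a plain variable standing in for XE_UNICODEP. The sketch follows the reading of the notes above that the caller always fills in the W structure and has already converted the text for the current mode, so in ANSI mode the text field holds ANSI bytes and only needs copying into the A structure.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define FACE_LEN 32

/* Mock A/W structure pair with a fixed-size text field. */
struct mock_font_a { int height; char face[FACE_LEN]; };
struct mock_font_w { int height; wchar_t face[FACE_LEN]; };

static int xe_unicodep = 1;  /* stand-in for XE_UNICODEP */
static char last_call;       /* records which variant ran */

/* Mock API pair; each returns the length of the face name it saw. */
static int
mock_create_font_a (const struct mock_font_a *f)
{
  last_call = 'A';
  return (int) strlen (f->face);
}

static int
mock_create_font_w (const struct mock_font_w *f)
{
  last_call = 'W';
  return (int) wcslen (f->face);
}

/* Encapsulated version: callers always pass the W structure. */
static int
qxe_mock_create_font (const struct mock_font_w *f)
{
  struct mock_font_a a;

  if (xe_unicodep)
    return mock_create_font_w (f);

  /* ANSI mode: copy field by field into the A structure.  The text
     field already holds ANSI bytes; the A field is no larger in bytes
     than the W field, so the copy stays in bounds on both sides. */
  a.height = f->height;
  memcpy (a.face, f->face, FACE_LEN - 1);
  a.face[FACE_LEN - 1] = '\0';
  return mock_create_font_a (&a);
}
```

Results that come back in a structure would need the reverse copy from A to W after the call; that direction is left out here.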
|
|
1465
|
|
1466 Besides creating the encapsulation, the following needs to be done for
|
|
1467 Unicode support:
|
|
1468
|
|
1469 -- No actual translation tables are fed into XEmacs. We need to
|
|
1470 provide glue code to read the tables in etc/unicode. See
|
|
1471 etc/unicode/README for the interface to implement.
|
|
1472
|
|
1473 -- Fix pdump. The translation tables for Unicode characters function
|
|
1474 as unions of structures with different numbers of indirection
|
|
1475 levels, in order to be efficient. pdump doesn't yet support such
|
|
1476 unions. charset.h has a general description of how the translation
|
|
1477 tables work, and the pdump code has constants added for the new
|
|
1478 required data types, and descriptions of how these should work.
|
|
1479
|
|
1480 -- ultimately, there's no end to additional work (composition, bidi
|
|
1481 reordering, glyph shaping/ordering, etc.), but the above is enough
|
|
1482 to get basic translation working.
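
To show why pdump needs to understand these unions, here is a toy table with either one or two levels of indirection. It is an illustration only, not the layout described in charset.h: the structure names, the 256-entry fan-out and the 16-bit codepoints are all assumptions. What matters is that the `levels` field decides which union member is live, and a dumper has to consult it to know what the pointer points at.

```c
#include <stddef.h>
#include <stdlib.h>

struct level1 { unsigned short cp[256]; };    /* leaf: codepoints */
struct level2 { struct level1 *leaf[256]; };  /* one more indirection */

struct unicode_table
{
  int levels;  /* discriminator: 1 or 2 */
  union
  {
    struct level1 *one;  /* valid when levels == 1 */
    struct level2 *two;  /* valid when levels == 2 */
  } u;
};

/* Look up the entry for bytes B1 (and B2 for two-level tables).
   Returns -1 if the needed sub-table was never allocated; entries
   that were never stored read back as 0. */
static long
table_lookup (const struct unicode_table *t,
              unsigned char b1, unsigned char b2)
{
  const struct level1 *leaf;

  if (t->levels == 1)
    leaf = t->u.one;
  else
    leaf = t->u.two ? t->u.two->leaf[b1] : NULL;
  if (!leaf)
    return -1;
  return leaf->cp[t->levels == 1 ? b1 : b2];
}

/* Store CP, allocating sub-tables on demand.  Returns 0, or -1 on
   allocation failure.  (calloc is assumed to yield null pointers.) */
static int
table_store (struct unicode_table *t, unsigned char b1, unsigned char b2,
             unsigned short cp)
{
  struct level1 **slot;

  if (t->levels == 1)
    slot = &t->u.one;
  else
    {
      if (!t->u.two && !(t->u.two = calloc (1, sizeof *t->u.two)))
        return -1;
      slot = &t->u.two->leaf[b1];
    }
  if (!*slot && !(*slot = calloc (1, sizeof **slot)))
    return -1;
  (*slot)->cp[t->levels == 1 ? b1 : b2] = cp;
  return 0;
}
```

A pdump description for `u` cannot be a fixed pointer type; it has to say "look at `levels`, then treat the pointer accordingly", which is the union support this item asks for.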
|
|
1483
|
|
1484 Merging this workspace into the trunk requires some work. ChangeLogs
|
|
1485 have not yet been created. Also, there is a lot of additional code in
|
|
1486 this workspace other than just Windows and Unicode stuff. Some of the
|
|
1487 changes have been somewhat disruptive to the code base, in particular:
|
|
1488
|
|
1489 -- the code that handles the details of processing multilingual text
|
|
1490 has been consolidated to make it easier to extend it. it has been
|
|
1491 yanked out of various files (buffer.h, mule-charset.h, lisp.h,
|
|
1492 insdel.c, fns.c, file-coding.c, etc.) and put into text.c and
|
|
1493 text.h. mule-charset.h has also been renamed charset.h. all long
|
|
1494 comments concerning the representations and their processing have
|
|
1495 been consolidated into text.c.
|
|
1496
|
|
1497 -- nt/config.h has been eliminated and everything in it merged into
|
|
1498 config.h.in and s/windowsnt.h. see config.h.in for more info.
|
|
1499
|
|
1500 -- s/windowsnt.h has been completely rewritten, and s/cygwin32.h and
|
|
1501 s/mingw32.h have been largely rewritten. tons of dead weight has
|
|
1502 been removed, and stuff common to more than one file has been
|
|
1503 isolated into s/win32-common.h and s/win32-native.h, similar to
|
|
1504 what's already done for usg variants.
|
|
1505
|
|
1506 -- large amounts of code throughout the code base have been Mule-ized,
|
|
1507 not just Windows code.
|
|
1508
|
|
1509 -- file-coding.c/.h have been largely rewritten (although still mostly
|
|
1510 syncable); see below.
|
|
1511
|
|
1512
|
|
1513
|
|
1514 June 26, 2001:
|
|
1515
|
|
1516 -- ben-mule-21-5
|
|
1517
|
|
1518 this contains all the mule work i've been doing. this includes mostly
|
|
1519 work done to get mule working under ms windows, but in the process
|
|
1520 i've [of course] fixed a whole lot of other things as well, mostly
|
|
1521 mule issues. the specifics:
|
|
1522
|
|
1523 - it compiles and runs under windows and should basically work. the
|
|
1524 stuff remaining to do is (a) improved unicode support (see below)
|
|
1525 and (b) smarter handling of keyboard layouts. in particular, it
|
|
1526 should (1) set the right keyboard layout when you change your
|
|
1527 language environment; (2) optionally (a user var) set the
|
|
1528 appropriate keyboard layout as you move the cursor into text in a
|
|
1529 particular language.
|
|
1530
|
|
1531 - i added a bunch of code to better support OS locales. it tries to
|
|
1532 notice your locale at startup and set the language environment
|
|
1533 accordingly (this more or less works), and call setlocale() and set
|
|
1534 LANG when you change the language environment (may or may not work).
|
|
1535
|
|
1536 - major rewriting of file-coding. it's mostly abstracted into coding
|
|
1537 systems that are defined by methods (similar to devices and
|
|
1538 specifiers), with the ultimate aim being to allow non-i18n coding
|
|
1539 systems such as gzip. there is a "chain" coding system that allows
|
|
1540 multiple coding systems to be chained together. (it doesn't yet
|
|
1541 have the concept that either end of a coding system can be bytes or
|
|
1542 chars; this needs to be added.)
|
|
1543
|
|
1544 - unicode support. very raw. a few days ago i wrote a complete and
|
|
1545 efficient implementation of unicode translation. it should be very
|
|
1546 fast, and fairly memory-efficient in its tables. it allows for
|
|
1547 charset priority lists, which should be language-environment
|
|
1548 specific (but i haven't yet written the glue code). it works in
|
|
1549 preliminary testing, but obviously needs more testing and work.
|
|
1550 as of yet there is no translation data added for the standard charsets.
|
|
1551 the tables are in etc/unicode, and all we need is a bit of glue code
|
|
1552 to process them. see etc/unicode/README for the interface to
|
|
1553 implement.
|
|
1554
|
|
1555 - support for unicode in windows is partly there. this will work even
|
|
1556 on windows 95. the basic model is implemented but it needs finishing
|
|
1557 up.
|
|
1558
|
|
1559 - there is a preliminary implementation of windows ime support courtesy
|
|
1560 of ikeyama.
|
|
1561
|
|
1562 - if you want to get cyrillic working under windows (it appears to "work"
|
|
1563 but the wrong chars currently appear), the best way is to add unicode
|
|
1564 support for iso-8859-5 and use it in redisplay-msw.c. we are already
|
|
1565 passing unicode codepoints to the text-draw routine (ExtTextOutW).
|
|
1566 (ExtTextOutW and GetTextExtentPoint32W are implemented on both 95 and NT.)
|
|
1567
|
|
1568 - i fixed the iso2022 handling so it will correctly read in files
|
|
1569 containing unknown charsets, creating a "temporary" charset which
|
|
1570 can later be overwritten by the real charset when it's defined.
|
|
1571 this allows iso2022 elisp files with literals in strange languages
|
|
1572 to compile correctly under mule. i also added a hack that will
|
|
1573 correctly read in and write out the emacs-specific "composition"
|
|
1574 escape sequences, i.e. ESC 0 through ESC 4. this means that my
|
|
1575 workspace correctly compiles the new file devanagari.el that i added
|
|
1576 (see below).
|
|
1577
|
|
1578 - i copied the remaining language-specific files from fsf. i made
|
|
1579 some minor changes in certain cases but for the most part the stuff
|
|
1580 was just copied and may not work.
|
|
1581
|
|
1582 - i fixed post-read-conversion in coding systems to follow fsf
|
|
1583 conventions. (i also support our convention, for the moment. a
|
|
1584 kludge, of course.)
|
|
1585
|
|
1586 - make-coding-system accepts (but ignores) the additional properties
|
|
1587 present in the fsf version, for compatibility.
|