comparison README.ben-mule-21-5 @ 771:943eaba38521
[xemacs-hg @ 2002-03-13 08:51:24 by ben]
The big ben-mule-21-5 check-in!
Various files were added and deleted. See CHANGES-ben-mule.
There are still some test suite failures. No crashes, though.
Many of the failures have to do with problems in the test suite itself
rather than in the actual code. I'll be addressing these in the next
day or so -- none of the test suite failures are at all critical.
Meanwhile I'll be trying to address the biggest issues -- i.e. build
or run failures, which will almost certainly happen on various platforms.
All comments should be sent to ben@xemacs.org -- use a Cc: if necessary
when sending to mailing lists. There will be pre- and post- tags,
something like
pre-ben-mule-21-5-merge-in, and
post-ben-mule-21-5-merge-in.
author | ben |
---|---|
date | Wed, 13 Mar 2002 08:54:06 +0000 |
parents | |
children |
770:336a418893b5 | 771:943eaba38521 |
---|---|
1 oct 27, 2001: | |
2 | |
3 -------- proposal for better buffer-switching commands: | |
4 | |
5 implement what VC++ currently has. you have a single "switch" command like | |
6 CTRL-TAB, which as long as you hold the CTRL button down, brings successive | |
7 buffers that are "next in line" into the current position, bumping the rest | |
8 forward. once you release the CTRL key, the chain is broken, and further | |
9 CTRL-TABs will start from the beginning again. this way, frequently used | |
10 buffers naturally move toward the front of the chain, and you can switch | |
11 back and forth between two buffers using CTRL-TAB. the only thing about | |
12 CTRL-TAB is it's a bit awkward. the way to implement is to have | |
13 modifier-up strokes fire off a hook, like modifier-up-hook. this is driven | |
14 by event dispatch, so there are no synchronization issues. when C-tab is | |
15 pressed, the binding function does something like set a one-shot handler on | |
16 the modifier-up-hook (perhaps separate hooks for separate modifiers?). | |
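A minimal sketch of the switching behavior described above, under one possible reading of it: buffers are kept most-recently-used first, each CTRL-TAB steps one further down the chain, and releasing the modifier (what the notes call modifier-up-hook) commits the choice by moving it to the front. The class and method names here are hypothetical, not existing XEmacs functions.

```python
class BufferSwitcher:
    """Toy model of a CTRL-TAB chain over a most-recently-used buffer list."""

    def __init__(self, buffers):
        self.buffers = list(buffers)  # front of the list = current buffer
        self.chain_pos = 0            # how far the current chain has stepped

    @property
    def current(self):
        return self.buffers[self.chain_pos]

    def ctrl_tab(self):
        # Each press while the modifier is held shows the next buffer in line.
        self.chain_pos = (self.chain_pos + 1) % len(self.buffers)
        return self.current

    def modifier_up(self):
        # One-shot handler: the chain is broken, the chosen buffer goes to
        # the head of the line, and the next CTRL-TAB starts over.
        chosen = self.buffers.pop(self.chain_pos)
        self.buffers.insert(0, chosen)
        self.chain_pos = 0
```

With this model, two presses separated by a release toggle between the two most recent buffers, which is the back-and-forth behavior the proposal asks for.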
17 | |
18 to do this, we'd also want to change the buffer tabs so that they maintain | |
19 their own order. in particular, they start out synched to the regular | |
20 order, but as you make changes, you don't want the tabs to change | |
21 order. (in fact, they may already do this.) selecting a particular buffer | |
22 from the buffer tabs DOES make the buffer go to the head of the line. the | |
23 invariant is that if the tabs are displaying X items, those X items are the | |
24 first X items in the standard buffer list, but may be in a different | |
25 order. (it looks like the tabs may already implement all of this.) | |
26 | |
27 oct 26, 2001: | |
28 | |
29 necessary testing/changes: | |
30 | |
31 - test all eol detection stuff under windows w/ and w/o mule, unix w/ and | |
32 w/o mule. (test configure flag, command-line flag, menu option) may need | |
33 a way of pretending to be unix under cygwin. | |
34 - test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x | |
35 windows w/ and w/o mule. | |
36 - test undecided-dos/unix/mac. | |
37 - check ESC ESC works as isearch-quit under TTY's. | |
38 - test coding-system-base and all its uses (grep for them). | |
39 - menu item to revert to most recent auto save. | |
40 - consider renaming build_string -> build_intstring and build_c_string to | |
41 build_string. (consistent with build_msg_string et al; many more | |
42 build_c_string than build_string) | |
43 | |
44 oct 20, 2001: | |
45 | |
46 fixed problem causing crash due to invalid internal-format data, fixed an | |
47 existing bug in valid_char_p, and added checks to more quickly catch when | |
48 invalid chars are generated. still need to investigate why | |
49 mswindows-multibyte is being detected. | |
50 | |
51 i now see why -- we only process 65536 bytes due to a constant | |
52 MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as | |
53 long as we have a seekable stream. we also need to write | |
54 stderr_out_lisp(), used in the debug info routines i wrote. | |
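A rough sketch of the "no limit as long as we have a seekable stream" idea, using Python file objects as stand-ins for lstreams. Only the constant name comes from the notes; `bytes_for_detection` is a hypothetical helper.

```python
import io

MAX_BYTES_PROCESSED_FOR_DETECTION = 65536  # the constant named in the notes


def bytes_for_detection(stream):
    """Return the bytes a detector would be allowed to look at."""
    if stream.seekable():
        # Seekable: examine everything, then rewind so decoding starts
        # from the original position.
        start = stream.tell()
        data = stream.read()
        stream.seek(start)
        return data
    # Not seekable: whatever is read here is consumed, so keep the cap.
    return stream.read(MAX_BYTES_PROCESSED_FOR_DETECTION)
```

A real implementation would presumably read incrementally and stop once the detectors are confident, rather than slurping the whole stream.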
55 | |
56 check once more about DEBUG_XEMACS. i think debugging info should be | |
57 ON by default. make sure it is. check that nothing untoward will result | |
58 in a production system, e.g. presumably assert()s should not really abort(). | |
59 (!! Actually, this should be runtime settable! Use a variable for this, and | |
60 it can be set using the same XEMACSDEBUG method. In fact, now that I think | |
61 of it, I'm sure that debugging info should be on always, with runtime ways | |
62 of turning on or off any funny behavior.) | |
63 | |
64 oct 19, 2001: | |
65 | |
66 fixed various bugs preventing packages from being able to be built. still | |
67 another bug, with psgml/etc/cdtd/docbook, which contains some strange | |
68 characters starting around char pos 110,000. It gets detected as | |
69 mswindows-multibyte (wrong! why?) and then invalid internal-format data is | |
70 generated. need to fix mswindows-multibyte (and possibly add something | |
71 that signals an error as well; need to work on this error-signalling | |
72 mechanism) and figure out why it's getting detected as such. what i should | |
73 do is add a debug var that outputs blow-by-blow info of the detection | |
74 process. | |
75 | |
76 oct 9, 2001: | |
77 | |
78 the stuff with global-window-system-map doesn't appear to work. in any | |
79 case it needs better documentation. [DONE] | |
80 | |
81 M-home, M-end do work, but cause cl-macs to get loaded. why? | |
82 | |
83 oct 8, 2001: | |
84 | |
85 finished the coding system changes and they finally work! | |
86 | |
87 need to implement undecided-unix/dos/mac. they should be easy to do; it | |
88 should be enough to specify an eol-type but not do-eol, but check this. | |
89 | |
90 consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as | |
91 aliases. | |
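A small sketch of how the alias scheme might resolve names, assuming the lf/crlf/cr suffixes become canonical and unix/dos/mac map onto them. The function name and table are illustrative only.

```python
# Old-style suffix -> proposed canonical suffix.
EOL_SUFFIX_ALIASES = {"unix": "lf", "dos": "crlf", "mac": "cr"}


def canonical_name(name):
    """Map e.g. 'undecided-dos' to 'undecided-crlf'; leave other names alone."""
    base, sep, suffix = name.rpartition("-")
    if sep and suffix in EOL_SUFFIX_ALIASES:
        return f"{base}-{EOL_SUFFIX_ALIASES[suffix]}"
    return name
```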
92 | |
93 print methods for coding systems should include some of the generic | |
94 properties. (also then fix print_..._within_print_method). [DONE] | |
95 | |
96 in a little while, go back and delete the text-file-wrapper-coding-system | |
97 code. (it'll be in CVS if necessary to get at it.) [DONE] | |
98 | |
99 need to verify at some point that non-text-file coding systems work | |
100 properly when specified. when gzip is working, this would be a good test | |
101 case. (and consider creating base64 as well!) | |
102 | |
103 remove extra crap from coding-system-category that checks for chain coding | |
104 systems. [DONE] | |
105 | |
106 perhaps make a primitive that gets at coding-system-canonical. [DONE] | |
107 | |
108 need to test cygwin, compiling the mule packages, get unix-eol stuff | |
109 working. frank from germany says he doesn't see a lisp backtrace when he | |
110 gets an error during temacs? verify that this actually gets outputted. | |
111 | |
112 consider putting the current language on the modeline, mousable so it can | |
113 be switched. also consider making the coding system be mousable and the | |
114 line number (pick a line) and the percentage (pick a percentage). | |
115 | |
116 oct 6, 2001: | |
117 | |
118 added code so that debug_print() will output a newline to the mswindows | |
119 debugging output, not just the console. need to test. [DONE] | |
120 | |
121 working on problem where all files are being detected as binary. the | |
122 problem may be that the undecided coding system is getting wrapped with an | |
123 auto-eol coding system, which it shouldn't be -- but even in this | |
124 situation, we should get the right results! check the | |
125 canonicalize-after-coding methods. also, determine_real_coding_system | |
126 appears to be getting called even when we're not detecting encoding. also, | |
127 undecided needs a print method to show its params, and chain needs to be | |
128 updated to show canonicalize_after_coding. check others as well. [DONE] | |
129 | |
130 oct 5, 2001: | |
131 | |
132 finished up coding system changes, testing. | |
133 | |
134 errors byte-compiling files in iso-2022-7-bit. perhaps it's not correctly | |
135 detecting the encoding? | |
136 | |
137 noticed a problem in the dfc macros: we call | |
138 get_coding_system_for_text_file with eol_wrap == 1, to allow for | |
139 auto-detection of the eol type; but this defeats the check and | |
140 short-circuit for unicode. | |
141 | |
142 still need to implement calling determine_real_coding_system() for | |
143 non-seekable streams. to implement correctly, we need to do our own | |
144 buffering. [DONE, BUT WITHOUT BUFFERING] | |
145 | |
146 oct 4, 2001: | |
147 | |
148 implemented most stuff below. | |
149 | |
150 need to finish up changes to make_coding_system_1. (i changed the way | |
151 internal coding systems were handled; i need to create subsidiaries for all | |
152 types of coding systems, not just text ones.) there's a nasty xfree() crash | |
153 i was hitting; perhaps it'll go away once all stuff has been rewritten. | |
154 | |
155 check under cygwin to make sure that when an error occurs during loadup, a | |
156 backtrace is output. | |
157 | |
158 as soon as andy releases his new setup, we should put it onto various | |
159 standard windows software repositories. | |
160 | |
161 oct 3, 2001: | |
162 | |
163 added global-tty-map and global-window-system-map. add some stuff to the | |
164 maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY's, and of course ESC | |
165 ESC on window systems vs. ESC ESC ESC on TTY's. [TEST] | |
166 | |
167 was working on integrating the two help-for-tutorial versions (mule, | |
168 non-mule). [DONE, but test under non-Mule] | |
169 | |
170 was working on the file-coding changes. need to think more about | |
171 text-file-wrapper. conclusion i think is that | |
172 get_coding_system_for_text_file should wrap using a special coding system | |
173 type called a text-file-wrapper, which inherits from chain, and implements | |
174 canonicalize-after-decoding to just return the unwrapped coding system. We | |
175 need to implement inheritance of coding systems, which will certainly come | |
176 in extremely useful when coding systems get implemented in Lisp, which | |
177 should happen at some point. (see existing docs about this.) essentially, | |
178 we have a way of declaring that we inherit from some system, and the | |
179 appropriate data structures get created, perhaps just an extra inheritance | |
180 pointer. but when we create the coding system, the extra data needs to be | |
181 a stretchy array of offsets, pointing to the type-specific data for the | |
182 coding system type and all its parents. that means that in the methods | |
183 structure for a coding system (which perhaps should be expanded beyond | |
184 method, it's just a "class structure") is the index in these arrays of | |
185 offsets. CODING_SYSTEM_DATA() can take any of the coding system classes | |
186 (rename type to class!) that make up this class. similarly, a coding | |
187 system class inherits its methods from the class above unless specifying | |
188 its own method, and can call the superclass method at any point by either | |
189 just invoking its name, or conceivably by some macro like | |
190 | |
191 CALL_SUPER (method, (args)) | |
192 | |
193 similar mods would have to be made to coding stream structures. | |
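To make the inheritance idea concrete, here is a sketch in Python rather than C: each coding system class holds a parent pointer and a method table, lookup walks up the parents, and a `call_super` helper plays the role of the CALL_SUPER macro. All names are hypothetical, and the per-class data offsets described above are left out.

```python
class CodingSystemClass:
    """A 'class structure': a name, an optional parent, and a method table."""

    def __init__(self, name, parent=None, **methods):
        self.name = name
        self.parent = parent
        self.methods = methods

    def find_method(self, method_name):
        # Walk up the inheritance chain until some class defines the method.
        cls = self
        while cls is not None:
            if method_name in cls.methods:
                return cls, cls.methods[method_name]
            cls = cls.parent
        raise AttributeError(method_name)


def call_method(cls, method_name, *args):
    owner, fn = cls.find_method(method_name)
    return fn(owner, *args)  # owner = the class that actually defines fn


def call_super(owner, method_name, *args):
    # Stand-in for CALL_SUPER (method, (args)): resume lookup at the parent.
    return call_method(owner.parent, method_name, *args)


chain = CodingSystemClass(
    "chain",
    canonicalize_after_decoding=lambda owner, cs: cs,
    describe=lambda owner, cs: f"chain({cs['name']})",
)
text_file_wrapper = CodingSystemClass(
    "text-file-wrapper",
    parent=chain,
    # Override: just return the unwrapped coding system.
    canonicalize_after_decoding=lambda owner, cs: cs["wrapped"],
    describe=lambda owner, cs: "wrapper around "
    + call_super(owner, "describe", cs),
)
```

Passing the defining class as `owner` is what lets a superclass call work from several levels down without each method hard-coding its parent.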
194 | |
195 perhaps for the immediate we can just sort of fake things like we currently | |
196 do with undecided calling some stuff from chain. | |
197 | |
198 oct 2, 2001: | |
199 | |
200 need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol. | |
201 figure out how to fall back to iso-8859-1 as necessary. | |
202 | |
203 leave the current bindings the way they are for the moment, but bump off | |
204 M-home and M-end (hardly used), and substitute my buffer movement stuff | |
205 there. [DONE, but test] | |
206 | |
207 there's something to be said for combining block of 6 and paragraph, | |
208 esp. if we make the definition of "paragraph" be so that it skips by 6 when | |
209 within code. hmm. | |
210 | |
211 eliminate advertised-undo crap, and similar hacks. [DONE] | |
212 | |
213 think about obsolete stuff to be eliminated. think about eliminating or | |
214 dimming obsolete items from hyper-apropos and something similar in | |
215 completion buffers. | |
216 | |
217 sep 30, 2001: | |
218 | |
219 synched up the tutorials with FSF 21.0.105. was rewriting them to favor | |
220 the cursor keys over the older C-p, etc. keys. | |
221 | |
222 Got thinking about key bindings again. | |
223 | |
224 (1) I think that M-up/down and M-C-up/down should be reversed. I use | |
225 scroll-up/down much more often than motion by paragraph. | |
226 | |
227 (2) Should we eliminate move by block (of 6) and substitute it for | |
228 paragraph? This would have the advantage that I could make bindings | |
229 for buffer change (forward/back buffer, perhaps M-C-up/down; with | |
230 shift, M-C-S-up/down only goes within the same type (C files, etc.)). | |
231 alternatively, just bump off beginning-of-defun from C-M-home, since | |
232 it's on C-M-a already. | |
233 | |
234 need someone to go over the other tutorials (five new ones, from FSF | |
235 21.0.105) and fix them up to correspond to the english one. | |
236 | |
237 shouldn't shift-motion work with C-a and such as well as arrows? | |
238 | |
239 sep 29, 2001: | |
240 | |
241 charcount_to_bytecount can also be made to scream -- as can scan_buffer, | |
242 buffer_mule_signal_inserted_region, others? we should start profiling | |
243 though before going too far down this line. | |
244 | |
245 Debug code that causes no slowdown should in general remain in the | |
246 executable even in the release version because it may be useful (e.g. for | |
247 people to see the event output). so DEBUG_XEMACS should be rethought. | |
248 things like use of msvcrtd.dll should be controlled by error_checking on. | |
249 maybe DEBUG_XEMACS controls general debug code (e.g. use of msvcrtd.dll, | |
250 asserts abort, error checking), and the actual debugging code should remain | |
251 always, or be conditionalized on something else | |
252 (e.g. DEBUGGING_FUNS_PRESENT). | |
253 | |
254 doc strings in dumped files are displayed with an extra blank line between | |
255 each line. presumably this is recent? i assume either the change to | |
256 detect-coding-region or the double-wrapping mentioned below. | |
257 | |
258 error with coding-system-property on iso-2022-jp-dos. problem is that that | |
259 coding system is wrapped, so its type shows up as chain, not iso-2022. | |
260 this is a general problem, and i think the way to fix it is to in essence | |
261 do late canonicalization -- similar in spirit to what was done long ago, | |
262 canonicalize_when_code, except that the new coding system (the wrapper) is | |
263 created only once, either when the original cs is created or when first | |
264 needed. this way, operations on the coding system work as expected, and | |
265 you get the same results as currently when decoding/encoding. the only | |
266 thing tricky is handling canonicalize-after-coding and the ever-tricky | |
267 double-wrapping problem mentioned below. i think the proper solution is to | |
268 move the autodetection of eol into the main autodetect type. it can be | |
269 asked to autodetect eol, coding, or both. for just coding, it does like it | |
270 currently does. for just eol, it does similar to what it currently does | |
271 but runs the detection code that convert-eol currently does, and selects | |
272 the appropriate convert-eol system. when it does both eol and coding, it | |
273 does something on the order of creating two more autodetect coding systems, | |
274 one for eol only and one for coding only, and chains them together. when | |
275 each has detected the appropriate value, the results are combined. this | |
276 automatically eliminates the double-wrapping problem, removes the need for | |
277 complicated canonicalize-after-coding stuff in chain, and fixes the problem | |
278 of autodetect not having a seekable stream because hidden inside of a | |
279 chain. (we presume that in the both-eol-and-coding case, the various | |
280 autodetect coding streams can communicate with each other appropriately.) | |
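A toy illustration of the "eol only, coding only, or both" split. The detectors here are deliberately simplistic placeholders (counting line endings, checking for pure ASCII), not the algorithms in the actual code; the point is only that the two detections are independent and their results get combined at the end.

```python
def detect_eol(sample):
    """Guess 'crlf', 'cr' or 'lf' from raw bytes; None if no line ending seen."""
    crlf = sample.count(b"\r\n")
    counts = {
        "crlf": crlf,
        "cr": sample.count(b"\r") - crlf,  # bare CRs only
        "lf": sample.count(b"\n") - crlf,  # bare LFs only
    }
    best = max(counts, key=counts.get)     # ties resolve in dict order
    return best if counts[best] > 0 else None


def detect_coding(sample):
    """Placeholder coding detector: only recognizes pure ASCII."""
    try:
        sample.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return None


def autodetect(sample, want_coding=True, want_eol=True):
    """Run whichever detections were asked for and combine the results."""
    coding = detect_coding(sample) if want_coding else None
    eol = detect_eol(sample) if want_eol else None
    return coding, eol
```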
281 | |
282 also, we should solve the problem of internal coding systems floating | |
283 around and clogging up the list simply by having an "internal" property on | |
284 cs's and an internal param to coding-system-list (optional; if not given, | |
285 you don't get the internal ones). [DONE] | |
286 | |
287 we should try to reduce the size of the from-unicode tables (the dominant | |
288 memory hog in the tables). one obvious thing is to not store a whole | |
289 emchar as the mapped-to value, but a short that encodes the octets. [DONE] | |
290 | |
291 sep 28, 2001: | |
292 | |
293 need to merge up to latest in trunk. | |
294 | |
295 add unicode charsets for all non-translatable unicode chars; probably want | |
296 to extend the concept of charsets to allow for dimension 3 and dimension 4 | |
297 charsets. for the moment we should stick with just dimension 3 charsets; | |
298 otherwise we run past the current maximum of 4 bytes per emchar. (most code | |
299 would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in | |
300 certain code that has intimate knowledge of the representation. | |
301 e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4, | |
302 and has special ways of handling each number. with 5 or 6 bytes per char, | |
303 we'd have to change that code in various ways.) 96x96x96 = 884,000 or so, | |
304 so with two 96x96x96 charsets, we could tackle all Unicode values | |
305 representable by UTF-16 and then some -- and only these codepoints will | |
306 ever have assigned chars, as far as we know. | |
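The arithmetic behind that estimate, spelled out. The only input beyond the notes is the size of the UTF-16-reachable code space, U+0000 through U+10FFFF.

```python
CELLS_PER_DIMENSION = 96

# Size of one dimension-3 charset: 96 x 96 x 96 positions.
dim3_charset_size = CELLS_PER_DIMENSION ** 3

# Code points representable in UTF-16: U+0000..U+10FFFF.
utf16_code_points = 0x10FFFF + 1

# Ceiling division: how many dimension-3 charsets cover that range.
charsets_needed = -(-utf16_code_points // dim3_charset_size)

# Positions left over when two such charsets are used ("and then some").
spare_positions = 2 * dim3_charset_size - utf16_code_points

print(dim3_charset_size, utf16_code_points, charsets_needed, spare_positions)
```

So one charset holds 884,736 positions, and two of them cover all 1,114,112 code points with 655,360 positions to spare.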
307 | |
308 need an easy way of showing the current language environment. some menus | |
309 need to have the current one checked or whatever. [DONE] | |
310 | |
311 implement unicode surrogates. | |
312 | |
313 implement buffer-file-coding-system-when-loaded -- make sure find-file, | |
314 revert-file, etc. set the coding system [DONE] | |
315 | |
316 verify all the menu stuff [DONE] | |
317 | |
318 implemented the entirely-ascii check in buffers. not sure how much gain | |
319 it'll get us as we already have a known range inside of which is constant | |
320 time, and with pure-ascii files the known range spans the whole buffer. | |
321 improved the comment about how bufpos-to-bytind and vice-versa work. [DONE] | |
322 | |
323 fix double-wrapping of convert-eol: when undecided converts itself to | |
324 something with a non-autodetect eol, it needs to tell the adjacent | |
325 convert-eol to reduce itself to nothing. | |
326 | |
327 need menu item for find file with specified encoding. [DONE] | |
328 | |
329 renamed coding systems mswindows-### to windows-### to follow the standard | |
330 in rfc1345. [DONE] | |
331 | |
332 implemented coding-system-subsidiary-parent [DONE] | |
333 HAVE_MULE -> MULE in files in nt/ so that depend checking works [DONE] | |
334 | |
335 need to take the smarter search-all-files-in-dir stuff from my sample init | |
336 file and put it on the grep menu [DONE] | |
337 | |
338 added item for revert w/specified encoding; mostly works, but needs fixes. | |
339 in particular, you get the correct results, but buffer-file-coding-system | |
340 does not reflect things right. also, there are too many entries. need to | |
341 split into submenus. there is already split code out there; see if it's | |
342 generalized and if not make it so. it should only split when there's more | |
343 than a specified number, and when splitting, split into groups of a | |
344 specified size, not into a specified number of groups. [DONE] | |
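The splitting rule in that last sentence could look something like this; the function and parameter names are made up for illustration.

```python
def split_into_submenus(items, max_unsplit, group_size):
    """Split only when there are more than max_unsplit items, and then
    into groups of group_size (not into a fixed number of groups)."""
    items = list(items)
    if len(items) <= max_unsplit:
        return [items]  # short enough: a single, unsplit menu
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]
```

The last group may come out shorter than the others; a real menu splitter might prefer to balance that.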
345 | |
346 too many entries in the langenv menus; need to split. [DONE] | |
347 | |
348 sep 27, 2001: | |
349 | |
350 NOTE: M-x grep for make-string causes crash now. something definitely to | |
351 do with string changes. check very carefully the diffs and put in those | |
352 sledgehammer checks. [DONE] | |
353 | |
354 fix font-lock bug i introduced. [DONE] | |
355 | |
356 added optimization to strings (keeps track of # of bytes of ascii at the | |
357 beginning of a string). perhaps should also keep an all-ascii flag to deal | |
358 with really large (> 2 MB) strings. rewrite code to count ascii-begin to | |
359 use the 4-or-8-at-a-time stuff in bytecount_to_charcount. | |
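A sketch of counting the leading ASCII run a word at a time, on the assumption that "4-or-8-at-a-time" refers to the usual trick of testing the high bit of every byte in a machine word with one mask. Python integers stand in for the C word here; `count_ascii_begin` is a hypothetical name.

```python
HIGH_BITS = 0x8080808080808080  # high bit of each of 8 bytes


def count_ascii_begin(data):
    """Number of bytes at the start of data that are all ASCII (< 0x80)."""
    n = len(data)
    i = 0
    # Fast path: skip 8 bytes at a time while none has its high bit set.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            break
        i += 8
    # Slow path: finish (or locate the first non-ASCII byte) bytewise.
    while i < n and data[i] < 0x80:
        i += 1
    return i
```

An all-ascii flag, as suggested above, would just be the case where the count equals the string length.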
360 | |
361 Error: M-q is causing an Invalid Regexp error on the above paragraph. It's | |
362 not working. I assume it's a side effect of the string stuff. VERIFY! | |
363 Write sledgehammer checks for strings. [DONE] | |
364 | |
365 revamped the locale/init stuff so that it tries much harder to get things | |
366 right. should test a bit more. in particular, test out Describe Language | |
367 on the various created environments and make sure everything looks right. | |
368 | |
369 should change the menus: move the submenus on Edit->Mule directly under | |
370 Edit. add a menu entry on File to say "Reload with specified encoding ->". | |
371 [DONE] | |
372 | |
373 Also add "Find File with specified encoding ->". Also add an entry to | |
374 change the EOL settings for Unix, and implement it. | |
375 | |
376 decode-coding-region isn't working because it needs to insert a binary | |
377 (char->byte) converter. [DONE] | |
378 | |
379 chain should be rearranged to be in decoding order; similar for | |
380 source/sink-type, other things? | |
381 | |
382 the detector should check for a magic cookie even without a seekable input. | |
383 (currently its input is not seekable, because it's hidden within a chain. | |
384 #### See what we can do about this.) | |
385 | |
386 provide a way to display various settings, e.g. the current category | |
387 mappings and priority (see mule-diag; get this working so it's in the | |
388 path); also a way to print out the likeliness results from a detection, | |
389 perhaps a debug flag. | |
390 | |
391 problem with `env', which causes path issues due to `env' in packages. | |
392 move env code to process, sync with fsf 21.0.105, check that the autoloads | |
393 in `env' don't cause problems. [DONE] | |
394 | |
395 8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so | |
396 detected. | |
397 | |
398 sep 25, 2001: | |
399 | |
400 something else to do is review the font selection and fix it so that (e.g.) | |
401 JISX-0212 can be displayed. | |
402 | |
403 also, text in widgets needs to be drawn by us so that the correct fonts | |
404 will be displayed even in multi-lingual text. | |
405 | |
406 sep 24, 2001: | |
407 | |
408 the detection system is now properly abstracted. the detectors have been | |
409 rewritten to include multiple levels of abstraction. now we just need | |
410 detectors for ascii, binary, and latin-x, as well as more sophisticated | |
411 detectors in general and further review of the general algorithm for doing | |
412 detection. (#### Is this written up anywhere?) after that, consider adding | |
413 error-checking to decoding (VERY IMPORTANT) and verifying the binary | |
414 correctness of things under unix no-mule. | |
415 | |
416 sep 23, 2001: | |
417 | |
418 began to fix the detection system -- adding multiple levels of likelihood | |
419 and properly abstracting the detectors. the system is in place except for | |
420 the abstraction of the detector-specific data out of the struct | |
421 detection_state. we should get things working first before tackling that | |
422 (which should not be too hard). i'm rewriting algorithms here rather than | |
423 just converting code, so it's harder. mostly done with everything, but i | |
424 need to review all detectors except iso2022 and make them properly follow | |
425 the new way. also write a no-conversion detector. also need to look into | |
426 the `recode' package and see how (if?) they handle detection, and maybe | |
427 copy some of the algorithms. also look at recent FSF 21.0 and see if their | |
428 algorithms have improved. | |
429 | |
430 sep 22, 2001: | |
431 | |
432 fixed gc bugs from yesterday. | |
433 fixed truename bug. | |
434 close/finalize stuff works. | |
435 eliminated notyet stuff in syswindows.h. | |
436 eliminated special code in tstr_to_c_string. | |
437 fixed pdump problems. (many of them, mostly latent bugs, ugh) | |
438 fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a | |
439 sscanf bug, but subtly different behavior w.r.t. whitespace in the format | |
440 string, combined with a debugger that sucks ROCKS!! and consistently | |
441 outputs garbage for variable values.) | |
442 main stuff to test is the handling of EOF recognition vs. binary | |
443 (i.e. check what the default settings are under Unix). then we may have | |
444 something that WORKS on all platforms!!! (Also need to test Windows | |
445 non-Mule) | |
446 | |
447 sep 21, 2001: | |
448 | |
449 finished redoing the close/finalize stuff in the lstream code. but i | |
450 encountered again the nasty bug mentioned on sep 15 that disappeared on its | |
451 own then. the problem seems to be that the finalize method of some of the | |
452 lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(), | |
453 which is a no-no when we're inside of garbage-collection and the object | |
454 passed to free_managed_lcrecord() is unmarked, and about to be released by | |
455 the gc mechanism -- the free lists will end up with xfree()d objects on | |
456 them, which is very bad. we need to modify free_managed_lcrecord() to | |
457 check if we're in gc and the object is unmarked, and ignore it rather than | |
458 move it to the free list. [DONE] | |
459 | |
460 (#### What we really need to do is do what Java and C# do w.r.t. their | |
461 finalize methods: For objects with finalizers, when they're about to be | |
462 freed, leave them marked, run the finalizer, and set another bit on them | |
463 indicating that the finalizer has run. Next GC cycle, the objects will | |
464 again come up for freeing, and this time the sweeper notices that the | |
465 finalize method has already been called, and frees them for good (provided | |
466 that a finalize method didn't do something to make the object alive | |
467 again).) | |
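A toy model of the two-cycle finalization scheme described above. It only models the sweep phase (marks are set by hand), and the names are illustrative; nothing here corresponds to the actual XEmacs GC code.

```python
class Obj:
    def __init__(self, name, finalizer=None):
        self.name = name
        self.finalizer = finalizer
        self.marked = False     # set by the (omitted) mark phase
        self.finalized = False  # the extra "finalizer has run" bit


def sweep(heap):
    """One sweep pass; returns the objects that survive it."""
    survivors = []
    for obj in heap:
        if obj.marked:
            survivors.append(obj)  # still reachable
        elif obj.finalizer is not None and not obj.finalized:
            # First time up for freeing: run the finalizer and keep the
            # object for one more cycle instead of freeing it now.
            obj.finalizer(obj)
            obj.finalized = True
            survivors.append(obj)
        # Otherwise: unreachable with nothing left to run, so it is freed.
    for obj in survivors:
        obj.marked = False  # clear marks for the next cycle
    return survivors
```

If a finalizer makes its object reachable again, the next mark phase marks it and it simply survives; the finalizer is not run a second time.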
468 | |
469 sep 20, 2001: | |
470 | |
471 redid the lstream code so there is only one coding stream. combined the | |
472 various doubled coding stream methods into one; i'm a little bit unsure of | |
473 this last part, though, as the results of combining the two together seem | |
474 unclean. got it to compile, but it crashes in loadup. need to go through | |
475 and rehash the close vs. finalize stuff, as the problem was stuff getting | |
476 freed too quickly, before the canonicalize-after-decoding was run. should | |
477 eliminate entirely CODING_STATE_END and use a different method (close | |
478 coding stream). rewrite to use these two. make sure they're called in the | |
479 right places. Lstream_close on a stream should *NOT* do finalizing. | |
480 finalize only on delete. [DONE] | |
481 | |
482 in general i'd like to see the flags eliminated and converted to | |
483 bit-fields. also, rewriting the methods to take advantage of rejecting | |
484 should make it possible to eliminate much of the state in the various | |
485 methods, esp. including the flags. need to test this is working, though -- | |
486 reduce the buffer size down very low and try files with only CRLF's in | |
487 them, with one offset by a byte from the other, and see if we correctly | |
488 handle rejection. | |
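The test described above can be mocked up as follows. The converter keeps no state between calls: when a buffer ends in a CR it cannot yet classify, it rejects that byte (reports it as unconsumed) and the caller presents it again with the next chunk. This is a sketch of the idea only; `decode_crlf` and `run_in_chunks` are hypothetical names, not the lstream API.

```python
def decode_crlf(buf, at_eof=False):
    """Convert CRLF to LF. Returns (output, bytes_consumed).

    A trailing CR is rejected (left unconsumed) unless at_eof is true."""
    out = bytearray()
    i, n = 0, len(buf)
    while i < n:
        if buf[i] == 0x0D:
            if i + 1 < n:
                if buf[i + 1] == 0x0A:
                    out.append(0x0A)  # CRLF -> LF
                    i += 2
                else:
                    out.append(0x0D)  # lone CR passes through
                    i += 1
            elif at_eof:
                out.append(0x0D)      # final lone CR
                i += 1
            else:
                break                 # reject: need the next byte to decide
        else:
            out.append(buf[i])
            i += 1
    return bytes(out), i


def run_in_chunks(data, chunk_size):
    """Drive decode_crlf with a tiny buffer, re-feeding rejected bytes."""
    out = bytearray()
    carry = b""
    for start in range(0, len(data), chunk_size):
        buf = carry + data[start:start + chunk_size]
        converted, used = decode_crlf(buf)
        out += converted
        carry = buf[used:]
    converted, _ = decode_crlf(carry, at_eof=True)
    out += converted
    return bytes(out)
```

Running the same CRLF-only data with and without a one-byte prefix, at chunk sizes of 1 and 2, exercises every alignment of the CR/LF pair against the buffer boundary.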
489 | |
490 still have the problem with incorrectly truenaming files. | |
491 | |
492 | |
493 sep 19, 2001: | |
494 | |
495 bug reported: crash while closing lstreams. | |
496 | |
497 the lstream/coding system close code needs revamping. we need to document | |
498 that order of closing lstreams is very important, and make sure we're | |
499 consistent. furthermore, chain and undecided lstreams need to close their | |
500 underneath lstreams when they receive the EOF signal (there may be data in | |
501 the underneath streams waiting to come out), not when they themselves are | |
502 closed. [DONE] | |
503 | |
504 (if only we had proper inheritance. i think in any case we should | |
505 simulate it for the chain coding stream -- write things in such a way that | |
506 undecided can use the chain coding stream and not have to duplicate | |
507 anything itself.) | |
508 | |
509 in general we need to carefully think through the closing process to make | |
510 sure everything always works correctly and in the right order. also check | |
511 very carefully to make sure there are no dangling pointers to deleted | |
512 objects floating around. | |
513 | |
514 move the docs for the lstream functions to the functions themselves, not | |
515 the header files. document more carefully what exactly Lstream_delete() | |
516 means and how it's used, what the connections are between Lstream_close(), | |
517 Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE] | |
518 | |
519 additional error-checking: consider deadbeefing the memory in objects | |
520 stored in lcrecord free lists; furthermore, consider whether lifo or fifo | |
521 is correct; under error-checking, we should perhaps be doing fifo, and | |
522 setting a minimum number of objects on the lists that's quite large so that | |
523 it's highly likely that any erroneous accesses to freed objects will go | |
524 into such deadbeefed memory and cause crashes. also, at the earliest | |
525 available opportunity, go through all freed memory and check for any | |
526 consistency failures (overwrites of the deadbeef), crashing if so. perhaps | |
527 we could have some sort of id for each block, to more easily trace where the | |
528 offending block came from. (all of these ideas are present in the debug | |
529 system malloc from VC++, plus more stuff.) there's similar code i wrote | |
530 sitting somewhere (in free-hook.c? doesn't appear so. we need to delete the | |
531 blocking stuff out of there!). also look into using the debug system | |
532 malloc from VC++, which has lots of cool stuff in it. we even have the | |
533 sources. that means compiling under pdump, which would be a good idea | |
534 anyway. set it as the default. (but then, we need to remove the | |
535 requirement that Xpm be a DLL, which is extremely annoying. look into | |
536 this.) | |
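A small model of the deadbeef-and-FIFO idea, with bytearrays standing in for freed lcrecords. It is only meant to show the mechanics (fill on free, hold a minimum number of blocks, reuse oldest first, scan for overwrites); the class and its methods are invented for the example.

```python
from collections import deque

DEADBEEF = bytes.fromhex("deadbeef")


def _pattern(size):
    """The fill pattern for a block of the given size."""
    return (DEADBEEF * (size // 4 + 1))[:size]


class FreeList:
    def __init__(self, min_held):
        self.min_held = min_held  # keep at least this many blocks unused
        self.blocks = deque()     # oldest freed block at the left

    def free(self, block):
        block[:] = _pattern(len(block))  # deadbeef the freed memory
        self.blocks.append(block)

    def allocate(self, size):
        # FIFO reuse, and only once the list is longer than min_held, so
        # freed blocks sit untouched for as long as possible.
        if len(self.blocks) > self.min_held and len(self.blocks[0]) == size:
            return self.blocks.popleft()
        return bytearray(size)

    def find_overwrites(self):
        """Indices of freed blocks whose fill pattern has been disturbed."""
        return [i for i, block in enumerate(self.blocks)
                if bytes(block) != _pattern(len(block))]
```

A caller running under error-checking would abort as soon as `find_overwrites()` returns anything.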
537 | |
538 test the windows code page coding systems recently created. | |
539 | |
540 problems reading my mail files -- 1personal appears to hang, others come up | |
541 with lots of ^M's. investigate. | |
542 | |
543 test the enum functions i just wrote, and finish them. | |
544 | |
545 still pdump problems. | |
546 | |
547 sep 18, 2001: | |
548 | |
549 critical-quit broken sometime after aug 25. | |
550 | |
551 -- fixed critical quit. | |
552 -- fixed process problems. | |
553 -- print routines work. (no routine for ccl, though) | |
554 -- can read and write unicode files, and they can still be read by some | |
555 other program | |
556 -- defaults should come up correctly -- mswindows-multibyte is general. | |
557 | |
558 still need to test matej's stuff. | |
559 seems ok with multibyte stuff but needs more testing. | |
560 | |
561 sep 17, 2001: | |
562 | |
563 !!!!! something broken with processes !!!!! cannot send mail anymore. must | |
564 investigate. | |
565 | |
566 sep 17, 2001: | |
567 | |
568 on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting | |
569 woozy and can't concentrate. | |
570 | |
571 just finished getting assorted fixups to the main branch committed, so it | |
572 will compile under C++ (Andy committed some code that broke C++ builds). | |
573 cup'd the code into the fixtypes workspace, updated the tags appropriately. | |
574 i've created the appropriate log message, sitting in fixtypes.txt in | |
575 /src/xemacs; perhaps it should go into a README. now i just have to build | |
576 on everything (it's currently building), verify it's ok, run patcher-mail, | |
577 commit, send. | |
578 | |
579 my mule ws is also very close. need to: | |
580 | |
581 -- test the new print routines. | |
582 -- test it can read and write unicode files, and they can still be read by | |
583 some other program. | |
584 -- try to see if unicode can be auto-detected properly. | |
585 -- test it can read and write multibyte files in a few different formats. | |
586 currently can't recognize them, but if you set the cs right, it should | |
587 work. | |
588 -- examine the test files sent by matej and see if we can handle them. | |
589 | |
590 sep 15, 2001: | |
591 | |
592 more eol fixing. this stuff is utter crap. | |
593 | |
594 currently we wrap coding systems with convert-eol-autodetect when we create | |
595 them in make_coding_system_1. i had a feeling that this would be a | |
596 problem, and indeed it is -- when autodetecting with `undecided', for | |
597 example, we end up with multiple layers of eol conversion. to avoid this, | |
598 we need to do the eol wrapping *ONLY* when we actually retrieve a coding | |
599 system in places such as insert-file-contents. these places are | |
600 insert-file-contents, load, process input, call-process-internal, | |
601 encode/decode/detect-coding-region, database input, ... | |
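
The layering problem described above can be modelled in a few lines. This is only a sketch: `CodingSystem`, `wrap_at_creation` and `wrap_at_retrieval` are hypothetical names, not XEmacs functions, and `undecided` is modelled as a plain chain that holds whatever it detected.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CodingSystem:
    name: str
    chain: Tuple["CodingSystem", ...] = ()   # non-empty only for chains


EOL_AUTODETECT = CodingSystem("convert-eol-autodetect")


def count_eol_layers(cs):
    """Count convert-eol-autodetect converters anywhere inside cs."""
    if cs == EOL_AUTODETECT:
        return 1
    return sum(count_eol_layers(item) for item in cs.chain)


def wrap_at_creation(cs):
    """The problematic approach: every created system gets its own layer."""
    return CodingSystem(cs.name, (EOL_AUTODETECT, cs))


def wrap_at_retrieval(cs):
    """Wrap only when fetching a system for I/O, and never twice."""
    if count_eol_layers(cs) > 0:
        return cs
    return CodingSystem(cs.name, (EOL_AUTODETECT, cs))


latin1 = CodingSystem("iso-8859-1")
# Creation-time wrapping: `undecided' is wrapped, and so is what it detects.
stacked = wrap_at_creation(CodingSystem("undecided", (wrap_at_creation(latin1),)))
# Retrieval-time wrapping: nothing is wrapped until it is fetched.
single = wrap_at_retrieval(CodingSystem("undecided", (latin1,)))
```

Here `stacked` ends up with two eol layers and `single` with one, which is the difference the note is after.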
602 | |
603 (later) it's fixed, and things basically work. NOTE: for some reason, | |
604 adding code to wrap coding systems with convert-eol-lf when eol-type == lf | |
605 results in crashing during garbage collection in some pretty obscure place | |
606 -- an lstream is free when it shouldn't be. this is a bad sign. i guess | |
607 something might be getting initialized too early? | |
608 | |
609 we still need to fix the canonicalization-after-decoding code to avoid | |
610 problems with coding systems like `internal-7' showing up. basically, when | |
611 eol==lf is detected, nil should be returned, and the callers should handle | |
612 it appropriately, eliding when necessary. chain needs to recognize when | |
613 it's got only one (or even 0) items in the chain, and elide out the chain. | |
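
A rough sketch of the eliding rule, assuming a coding system is represented either by its name or by a `("chain", items)` tuple. Both the representation and the helper names are made up for illustration.

```python
def eol_converter(eol_type):
    """Return None when eol == lf: no conversion is needed at all."""
    return None if eol_type == "lf" else "convert-eol-" + eol_type


def canonicalize_chain(items):
    """Drop None entries, then elide chains with zero or one item."""
    kept = [item for item in items if item is not None]
    if not kept:
        return None              # empty chain: nothing to do
    if len(kept) == 1:
        return kept[0]           # one item: the chain adds nothing
    return ("chain", tuple(kept))
```

With this, a detected `lf` plus `utf-8` canonicalizes to plain `utf-8`, so no placeholder system such as `internal-7` has to be exposed.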
614 | |
615 sep 11, 2001: the day that will live in infamy. | |
616 | |
617 rewrite of sep 9 entry about formats: | |
618 | |
619 when calling make-coding-system, the name can be a cons of (format1 . | |
620 format2), specifying that it decodes format1->format2 and encodes the other | |
621 way. if only one name is given, that is assumed to be format1, and the | |
622 other is either `external' or `internal' depending on the end type. | |
623 normally the user when decoding gives the decoding order in formats, but | |
624 can leave off the last one, `internal', which is assumed. a multichain | |
625 might look like gzip|multibyte|unicode, using the coding systems named | |
626 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works | |
627 is by searching for gzip->multibyte; if not found, look for gzip->external | |
628 or gzip->internal. (In general we automatically do conversion between | |
629 internal and external as necessary: thus gzip|crlf does the expected, and | |
630 maps to gzip->external, external->internal, crlf->internal, which when | |
631 fully specified would be gzip|external:external|internal:crlf|internal -- | |
632 see below.) To forcibly fit together two converters that have explicitly | |
633 specified and incompatible names (say you have unicode->multibyte and | |
634 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this | |
635 case are compatible), you can force-cast using :, like this: | |
636 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between | |
637 internal and external formats, the conversion happens automatically.) | |
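
To make the notation concrete, here is a small parser for the `|` and `:` syntax. It is a sketch only: `parse_format_spec` is a hypothetical helper, it does not model the automatic internal/external conversion, and it does not decide in which direction each link's converter runs.

```python
def parse_format_spec(spec):
    """Split a spec such as 'a|b:c|d' into links and force-casts.

    Each 'x|y' pair becomes ("link", x, y): some converter must join x and y.
    Each ':' becomes ("cast", x, y): x and y are declared compatible.
    """
    steps = []
    groups = [group.split("|") for group in spec.split(":")]
    for index, formats in enumerate(groups):
        if index > 0:
            # ':' casts the end of the previous group to the start of this one
            steps.append(("cast", groups[index - 1][-1], formats[0]))
        for left, right in zip(formats, formats[1:]):
            steps.append(("link", left, right))
    return steps
```

For `ebcdic|iso8859-1:multibyte|unicode` this yields a link ebcdic/iso8859-1, a cast iso8859-1/multibyte, and a link multibyte/unicode.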
638 | |
639 | |
640 sep 10, 2001: | |
641 | |
642 moved the autodetection stuff (both codesys and eol) into particular coding | |
643 systems -- `undecided' and `convert-eol' (type == `autodetect'). needs | |
644 lots of work. still need to search through the rest of the code and find | |
645 any remaining auto-detect code and move it into the undecided coding | |
646 system. need to modify make-coding-system so that it spits out | |
647 auto-detecting versions of all text-file coding systems unless we say not | |
648 to. need to eliminate entirely the EOF flag from both the stream info and the | |
649 coding system; have only the original-eof flag. in | |
650 coding_system_from_mask, need to check that the returned value is not of | |
651 type `undecided', falling back to no-conversion if so. also need to make | |
652 sure we wrap everything appropriate for text-files -- i removed the | |
653 wrapping on set-coding-category-list or whatever (need to check all those | |
654 files to make sure all wrapping is removed). need to review carefully the | |
655 new code in `undecided' to make sure it works and preserves the same logic | |
656 as previously. need to review the closing and rewinding behavior of chain | |
657 and undecided (same -- should really consolidate into helper routines, so | |
658 that any coding system can embed a chain in it) -- make sure the dynarr's | |
659 are getting their data flushed out as necessary, rewound/closed in the | |
660 right order, no missing steps, etc. | |
661 | |
662 also split out mule stuff into mule-coding.c. work done on | |
663 configure/xemacs.mak/Makefiles not done yet. work on emacs.c/symsinit.h to | |
664 interface with the new init functions not done yet. | |
665 | |
666 also put in a few declarations of the way i think the abstracted detection | |
667 stuff ought to go. DON'T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH | |
668 AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED. | |
669 | |
670 really need a version of cvs-mods that reports only the current directory. | |
671 WRITE THIS! use it to implement a better cvs-checkin. | |
672 | |
673 sep 9, 2001: | |
674 | |
675 implemented a gzip coding system. unfortunately, doesn't quite work right | |
676 because it doesn't handle the gzip headers -- it just reads and writes raw | |
677 zlib data. there's no function in the library to skip past the header, but | |
678 we do have some code out of the library that we can snarf that implements | |
679 header parsing. we need to snarf that, store it, and output it again at | |
680 the beginning when encoding. in the process, we should create a "get next | |
681 byte" macro that bails out when there are no more. using this, we set up a | |
682 nice way of doing most stuff statelessly -- if we have to bail, we reject | |
683 everything back to the sync point. also need to fix up the autodetection | |
684 of zlib in configure.in. | |
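
The "get next byte, bail out when there are no more" idea can be sketched as follows. The header layout follows RFC 1952; the function name and the convention of returning 0 to mean "rejected, resend from the sync point" are assumptions made for this example, not the actual lstream interface.

```python
class NeedMoreData(Exception):
    """Raised when the input runs out before the header is complete."""


FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10   # FLG bits, RFC 1952


def parse_gzip_header(data):
    """Return the header length in bytes, or 0 if more input is needed.

    Returning 0 rejects everything back to the sync point (the start of
    the header), so no parser state has to survive between calls.
    """
    pos = 0

    def next_byte():
        nonlocal pos
        if pos >= len(data):
            raise NeedMoreData
        value = data[pos]
        pos += 1
        return value

    try:
        if next_byte() != 0x1F or next_byte() != 0x8B:
            raise ValueError("not a gzip stream")
        next_byte()                        # CM (compression method)
        flags = next_byte()                # FLG
        for _ in range(6):                 # MTIME (4 bytes), XFL, OS
            next_byte()
        if flags & FEXTRA:
            extra_len = next_byte() | (next_byte() << 8)   # little-endian
            for _ in range(extra_len):
                next_byte()
        for bit in (FNAME, FCOMMENT):      # zero-terminated strings
            if flags & bit:
                while next_byte() != 0:
                    pass
        if flags & FHCRC:
            next_byte()
            next_byte()
    except NeedMoreData:
        return 0
    return pos
```

The caller can keep `data[:n]` as the stored header to write back out when encoding, and hand `data[n:]` to zlib as raw deflate data.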
685 | |
686 BIG problems with eol. finished up everything i thought i would need to | |
687 get eol stuff working, but no -- when you have mswindows-unicode, with its | |
688 eol set to autodetect, the detection routines themselves do the autodetect | |
689 (first), and fail (they report CR on CRLF because of the NULL byte between | |
690 the CR and the LF) since they're not looking at ascii data. with a chain | |
691 it's similarly bad. for mswindows-multibyte, for example, which is a chain | |
692 unicode->unicode-to-multibyte, autodetection happens inside of the chain, | |
693 both when unicode and unicode-to-multibyte are active. we could twiddle | |
694 around with the eol flags to try to deal with this, but it's gonna be a big | |
695 mess, which is exactly what we're trying to avoid. what we basically want | |
696 is to entirely rip out all EOL settings from either the coding system or | |
697 the stream (yes, there are two! one might say autodetect, and then the | |
698 stream contains the actual detected value). instead, we simply create an | |
699 eol-autodetect coding system -- or rather, it's part of the convert-eol | |
700 coding system. convert-eol, type = autodetect, does autodetection the | |
701 first time it gets data sent to it to decode, and thereafter sets a stream | |
702 parameter indicating the actual eol type for this stream. this means that | |
703 all autodetect coding systems, as created by `make-coding-system', really | |
704 are chains with a convert-eol at the beginning. only subsidiary xxx-unix | |
705 has no wrapping at all. this should allow eol detection of gzip, unicode, | |
706 etc. for that matter, general autodetection should be entirely | |
707 encapsulated inside of the `autodetect' coding system, with no | |
708 eol-autodetection -- the chain becomes convert-eol (autodetect) -> | |
709 autodetect or perhaps backwards. the generic autodetect similarly has a | |
710 coding-system in its stream methods, and needs somehow or other to insert | |
711 the detected coding-system into the chain. either it contains a chain | |
712 inside of it (perhaps it *IS* a chain), or there's some magic involving | |
713 canonicalization-type switcherooing in the middle of a decode. either way, | |
714 once everything is good and done and we want to save the coding system so | |
715 it can be used later, we need to do another sort of canonicalization -- | |
716 converting auto-detect-type coding systems into the detected systems. | |
717 again, a coding-system method, with some magic currently so that | |
718 subsidiaries get properly used rather than something that's new but | |
719 equivalent to subsidiaries. (#### perhaps we could use a hash table to | |
720 avoid recreating coding systems when not necessary. but that would require | |
721 that coding systems be immutable from external, and i'm not sure that's the | |
722 case.) | |
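
The CR-versus-CRLF misdetection is easy to reproduce. The detector below is a deliberately naive byte-level one written for this example (it is not the XEmacs detection routine); it shows why eol detection has to run on decoded data rather than on raw UTF-16 bytes.

```python
def detect_eol(data):
    """Classify the first line ending found in a byte string."""
    for i, byte in enumerate(data):
        if byte == 0x0A:
            return "lf"
        if byte == 0x0D:
            # CRLF only if the very next byte is LF
            return "crlf" if data[i + 1:i + 2] == b"\n" else "cr"
    return None


raw = "one\r\ntwo\r\n".encode("utf-16-le")

# On the raw bytes, CR is followed by a NUL byte, not by LF.
on_raw = detect_eol(raw)

# After the unicode decoding step (re-encoded as UTF-8 here to stand in for
# the internal format), CR and LF are adjacent again.
on_decoded = detect_eol(raw.decode("utf-16-le").encode("utf-8"))
```

`on_raw` comes out as `"cr"` and `on_decoded` as `"crlf"`, which is the argument for making convert-eol a separate step in the chain.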
723 | |
724 i really think, after all, that i should reverse the naming of everything | |
725 in chain and source-sink-type -- they should be decoding-centric. later | |
726 on, if/when we come up with the proper way to make it totally symmetrical, | |
727 we'll be fine whether before then we were encoding or decoding centric. | |
728 | |
729 | |
730 sep 9, 2001: | |
731 | |
732 investigated eol parameter. | |
733 implemented handling in make-coding-system of eol-cr and eol-crlf. | |
734 fixed calls everywhere to Fget_coding_system / Ffind_coding_system to | |
735 reject non-char->byte coding systems. | |
736 | |
737 still need to handle "query eol type using coding-system-property" so it | |
738 magically returns the right type by parsing the chain. | |
739 | |
740 no work done on formats, as mentioned below. we should consider using : | |
741 instead of || to indicate casting. | |
742 | |
743 early sep 9, 2001: | |
744 | |
745 renamed some codesys properties: `list' in chain -> chain; `subtype' in | |
746 unicode -> type. everything compiles again and sort of works; some CRLF | |
747 problems that may resolve themselves when i finish the convert-eol stuff. | |
748 the stuff to create subsidiaries has been rewritten to use chains; but i | |
749 still need to investigate how the EOL type parameter is used. also, still | |
750 need to implement this: when a coding system is created, and its eol type | |
751 is not autodetect or lf, a chain needs to be created and returned. i think | |
752 that what needs to happen is that the eol type can only be set to | |
753 autodetect or lf; later on this should be changed to simply be either | |
754 autodetect or not (but that would require ripping out the eol converting | |
755 stuff in the various coding systems), and eventually we will do the work on | |
756 the detection mechanism so it can do chain detection; then we won't need an | |
757 eol autodetect setting at all. i think there's a way to query the eol type | |
758 of a coding system; this should check to see if the coding system is a | |
759 chain and there's a convert-eol at the front; if so, the eol type comes | |
760 from the type of the convert-eol. | |
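
A sketch of that query, assuming a coding system is either a name or a `("chain", items)` tuple, and that a coding system with no convert-eol at the front counts as `lf`. The representation and the `convert-eol-<type>` naming are assumptions for the example.

```python
PREFIX = "convert-eol-"


def eol_type_of(coding_system):
    """Return the eol type implied by a convert-eol at the front of a chain."""
    if isinstance(coding_system, tuple) and coding_system[0] == "chain":
        items = coding_system[1]
        first = items[0] if items else None
        if isinstance(first, str) and first.startswith(PREFIX):
            return first[len(PREFIX):]
    return "lf"
```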
761 | |
762 also check out everywhere that Fget_coding_system or Ffind_coding_system is | |
763 called, and see whether anything but a char->byte system can be tolerated. | |
764 create a new function for all the places that only want char->byte, | |
765 something like get_coding_system_char_to_byte_only. | |
766 | |
767 think about specifying formats in make-coding-system. perhaps the name can | |
768 be a cons of (format1, format2), specifying that it encodes | |
769 format1->format2 and decodes the other way. if only one name is given, | |
770 that is assumed to be format2, and the other is either `byte' or `char' | |
771 depending on the end type. normally the user when decoding gives the | |
772 decoding order in formats, but can leave off the last one, `char', which is | |
773 assumed. perhaps we should say `internal' instead of `char' and `external' | |
774 instead of byte. a multichain might look like gzip|multibyte|unicode, | |
775 using the coding systems named `gzip', `(unicode . multibyte)' and | |
776 `unicode'. we would have to allow something where one format is given only | |
777 as generic byte/char or internal/external to fit with any of the same | |
778 byte/char type. when forcibly fitting together two converters that have | |
779 explicitly specified and incompatible names (say you have | |
780 unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte | |
781 and iso8859-1 in this case are compatible), you can force-cast using ||, | |
782 like this: ebcdic|iso8859-1||multibyte|unicode. this will also force | |
783 external->internal translation as necessary: | |
784 unicode|multibyte||crlf|internal does unicode->multibyte, | |
785 external->internal, crlf->internal. perhaps you'd need to put in the | |
786 internal translation, like this: unicode|multibyte|internal||crlf|internal, | |
787 which means unicode->multibyte, external->internal (multibyte is compatible | |
788 with external); force-cast to crlf format and convert crlf->internal. | |
789 | |
790 even later: Sep 8, 2001: | |
791 | |
792 chain doesn't need to set character mode, that happens automatically when | |
793 the coding systems are created. fixed chain to return correct source/sink | |
794 type for itself and to check the compatibility of source/sink types in its | |
795 chain. fixed decode/encode-coding-region to check the source and sink | |
796 types of the coding system performing the conversion and insert appropriate | |
797 byte->char/char->byte converters (aka "binary" coding system). fixed | |
798 set-coding-category-system to only accept the traditional | |
799 encode-char-to-byte types of coding systems. | |
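
The source/sink checks can be sketched like this. `Converter`, `check_chain` and `fit_for_region` are hypothetical names; `source` and `sink` describe what each converter consumes and produces in the direction it is being run, and the two `binary` converters stand in for the byte<->char coding system mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    name: str
    source: str   # "char" or "byte"
    sink: str     # "char" or "byte"


CHAR_TO_BYTE = Converter("binary", "char", "byte")
BYTE_TO_CHAR = Converter("binary-reversed", "byte", "char")


def check_chain(chain):
    """Each converter's sink must match the next converter's source."""
    for left, right in zip(chain, chain[1:]):
        if left.sink != right.source:
            raise TypeError(
                f"{left.name} produces {left.sink}, "
                f"but {right.name} expects {right.source}"
            )


def fit_for_region(chain):
    """A region conversion needs characters at both ends of the chain."""
    if not chain:
        return []
    check_chain(chain)
    fitted = list(chain)
    if fitted[0].source == "byte":
        fitted.insert(0, CHAR_TO_BYTE)
    if fitted[-1].sink == "byte":
        fitted.append(BYTE_TO_CHAR)
    return fitted
```

A byte->byte converter such as gzip gets a binary converter on each side; a char->byte converter gets one only on its output.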
800 | |
801 still need to extend chain to specify the parameters mentioned below, | |
802 esp. "reverse". also need to extend the print mechanism for chain so it | |
803 prints out the chain. probably this should be general: have a new method | |
804 to return all properties, and output those properties. you could also | |
805 implement a read syntax for coding systems this way. | |
806 | |
807 still need to implement convert-eol and finish up the rest of the eol stuff | |
808 mentioned below. | |
809 | |
810 later September 7, 2001: (more like Sep 8) | |
811 | |
812 moved many Lisp_Coding_System * params to Lisp_Object. In general this is | |
813 the way to go, and if we ever implement a copying GC, we will never want to | |
814 be passing direct pointers around. With no error-checking, we lose no | |
815 cycles using Lisp_Objects in place of pointers -- the Lisp_Object itself is | |
816 nothing but a pointer, and so all the casts and "dereferences" boil down to | |
817 nothing. | |
818 | |
819 Clarified and cleaned up the "character mode" on streams, and documented | |
820 who (caller or object itself) has the right to be setting character mode on | |
821 a stream, depending on whether it's a read or write stream. changed | |
822 conversion_end_type method and enum source_sink_type to return | |
823 encoding-centric values, rather than decoding-centric. for the moment, | |
824 we're going to be entirely encoding-centric in everything; we can rethink | |
825 later. fixed coding systems so that the decode and encode methods are | |
826 guaranteed to receive only full characters, if that's the source type of | |
827 the data, as per conversion_end_type. | |
828 | |
829 still need to fix the chain method so that it correctly sets the character | |
830 mode on all the lstreams in it and checks the source/sink types to be | |
831 compatible. also fix decode-coding-string and friends to put the | |
832 appropriate byte->character (i.e. no-conversion) coding systems on the ends | |
833 as necessary so that the final ends are both character. also add to chain | |
834 a parameter giving the ability to switch the direction of conversion of any | |
835 particular item in the chain (i.e. swap encoding and decoding). i think | |
836 what we really want to do is allow for arbitrary parameters to be put onto | |
837 a particular coding system in the chain, of which the only one so far is | |
838 swap-encode-decode. don't need too much codage here for that, but make the | |
839 design extendable. | |
840 | |
841 | |
842 | |
843 September 7, 2001: | |
844 | |
845 just added a return value from the decode and encode methods of a coding | |
846 system, so that some of the data can get rejected. fixed the calling | |
847 routines to handle this. need to investigate when and whether the coding | |
848 lstream is set to character mode, so that the decode/encode methods only | |
849 get whole characters. if not, we should do so, according to the source | |
850 type of these methods. also need to implement the convert_eol coding | |
851 system, and fix the subsidiary coding systems (and in general, any coding | |
852 system where the eol type is specified and is not LF) to be chains | |
853 involving convert_eol. | |
854 | |
855 after everything is working, need to remove eol handling from encode/decode | |
856 methods and eventually consider rewriting (simplifying) them given the | |
857 reject ability. | |
858 | |
859 September 5, 2001: | |
860 | |
861 -- need to organize this. get everything below into the TODO list. | |
862 CVS the TODO list frequently so i can delete old stuff. prioritize | |
863 it!!!!!!!!! | |
864 | |
865 -- move README.ben-mule... to STATUS.ben-mule...; use README for | |
866 intro, overview of what's new, what's broken, how to use the | |
867 features, etc. | |
868 | |
869 -- need a global and local coding-category-precedence list, which get | |
870 merged. | |
871 | |
872 -- finished the BOM support. also finished something not listed | |
873 below, expansion to the auto-generator of Unicode-encapsulation to | |
874 support bracketing code with #if ... #endif, for Cygwin and MINGW | |
875 problems, e.g. This is tested; appears to work. | |
876 | |
877 -- need to add more multibyte coding systems now that we have various | |
878 properties to specify them. need to add DEFUN's for mac-code-page | |
879 and ebcdic-code-page for completeness. need to rethink the whole | |
880 way that the priority list works. it will continue to be total | |
881 junk until multiple levels of likeliness get implemented. | |
882 | |
883 -- need to finish up the stuff about the various defaults. [need to | |
884 investigate more generally where all the different default values | |
885 are that control encoding. (there are six places or so.) need to | |
886 list them in make-coding-system docs and put pointers | |
887 elsewhere. [[[[#### what interface to specify that this default | |
888 should be unicode? a "Unicode" language environment seems too | |
889 drastic, as the language environment controls much more.]]]] even | |
890 skipping the Unicode stuff here, we need to survey and list the | |
891 variables that control coding page behavior and determine how they | |
892 need to be set for various possible scenarios: | |
893 | |
894 -- total binary: no detection at all. | |
895 -- raw-text only: wants only autodetection of line endings, nothing else. | |
896 -- "standard Windows environment": tries for Unicode, falls back on | |
897 code page encoding. | |
898 -- some sort of East European environment, and Russian. | |
899 -- some sort of standard Japanese Windows environment. | |
900 -- standard Chinese Windows environments (traditional and simplified) | |
901 -- various Unix environments (European, Japanese, Russian, etc.) | |
902 -- Unicode support in all of these when it's reasonable | |
903 | |
904 These really require multiple likelihood levels to be fully | |
905 implementable. We should see what can be done ("gracefully fall | |
906 back") with single likelihood level. need lots of testing. | |
907 | |
908 -- need to fix the truename problem. | |
909 | |
910 -- lots of testing: need to test all of the stuff above and below that's recently been implemented. | |
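
For the BOM item, one way to keep the mark across a read/write cycle is to split it off when decoding and remember it. This is a sketch with made-up helper names, not the Unicode coding system code; UTF-32 is left out because its little-endian BOM begins with the UTF-16 one, and the `latin-1` fallback is arbitrary.

```python
import codecs

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data):
    """Return (encoding, bom, rest); encoding is None when there is no BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, bom, data[len(bom):]
    return None, b"", data


def decode_keeping_bom(data, default="latin-1"):
    """Decode, returning the text plus what is needed to write it back."""
    encoding, bom, rest = split_bom(data)
    encoding = encoding or default
    return rest.decode(encoding), encoding, bom


def encode_restoring_bom(text, encoding, bom):
    """Encode and put back the BOM that was seen on input, if any."""
    return bom + text.encode(encoding)
```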
911 | |
912 | |
913 | |
914 September 4, 2001: | |
915 | |
916 mostly everything compiles. currently there is a crash in | |
917 parse-unicode-translation-table, and Cygwin/Mule won't run. it may | |
918 well be a bug in the sscanf() in Cygwin. | |
919 | |
920 working on today: | |
921 | |
922 -- adding BOM support for Unicode coding systems. mostly there, but | |
923 need to finish adding BOM support to the detection routines. then test. | |
924 -- adding properties to unicode-to-multibyte to specify the coding | |
925 system in various flexible ways, e.g. directly specified code page | |
926 or ansi or oem code page of specified locale, current locale, | |
927 user-default or system-default locale. need to test. | |
928 -- creating a `multibyte' coding system, with the same parameters as | |
929 unicode-to-multibyte and which resolves at coding-system-creation | |
930 time to the appropriate chain. creating the underlying mechanism | |
931 to allow such under-the-scenes switcheroo. need to test. | |
932 -- set default-value of buffer-file-coding-system to | |
933 mswindows-multibyte, as Matej said it should be. need to test. | |
934 need to investigate more generally where all the different default | |
935 values are that control encoding. (there are six places or so.) | |
936 need to list them in make-coding-system docs and put pointers | |
937 elsewhere. #### what interface to specify that this default should | |
938 be unicode? a "Unicode" language environment seems too drastic, as | |
939 the language environment controls much more. | |
940 -- thinking about adding multiple levels of certainty to the detection | |
941 schemes, instead of just a mask. eventually, we need to totally | |
942 abstract things, but that can more easily be done in many steps. (we | |
943 need multiple levels of likelihood to more reasonably support a | |
944 Windows environment with code-page type files. currently, in order | |
945 to get them detected, we have to put them first, because they can | |
946 look like lots of other things; but then, other encodings don't get | |
947 detected. with multiple levels of likelihood, we still put the | |
948 code-page categories first, but they will return low levels of | |
949 likelihood. Lower-down encodings may be able to return higher | |
950 levels of likelihood, and will get taken preferentially.) | |
951 -- making it so you cannot disable file-coding, but you get an | |
952 equivalent default on Unix non-Mule systems where all defaults are | |
953 `binary'. need to test!!!!!!!!! | |
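
The difference between a mask and multiple likelihood levels can be shown with three toy detectors. Everything here is illustrative: the level values, the detector functions and the category names are assumptions, not the real detection code.

```python
def looks_like_code_page(data):
    return 1          # nearly any byte sequence is valid: always a weak guess


def looks_like_utf8(data):
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return 1 if data.isascii() else 2   # real multibyte sequences are stronger evidence


def looks_like_utf16_le(data):
    return 3 if data.startswith(b"\xff\xfe") else 0   # BOM: near certain


# Priority order, code-page category first as described above.
DETECTORS = [
    ("code-page", looks_like_code_page),
    ("utf-8", looks_like_utf8),
    ("utf-16-le", looks_like_utf16_le),
]


def detect_first_match(data):
    """Mask-style detection: the first category that matches at all wins."""
    for name, detector in DETECTORS:
        if detector(data) > 0:
            return name
    return None


def detect_by_likelihood(data):
    """Highest likelihood wins; priority order only breaks ties."""
    best_name, best_level = None, 0
    for name, detector in DETECTORS:
        level = detector(data)
        if level > best_level:   # strict '>' keeps the earlier entry on a tie
            best_name, best_level = name, level
    return best_name
```

With the mask approach the code-page category shadows everything behind it; with levels it still wins for plain or ambiguous data but loses to stronger evidence.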
954 | |
955 Matej (mostly, + some others) notes the following problems, and here | |
956 are possible solutions: | |
957 | |
958 -- he wants the defaults to work right. [figure out what those | |
959 defaults are. i presume they are auto-detection of data in current | |
960 code page and in unicode, and new files have current code page set | |
961 as their output encoding.] | |
962 | |
963 -- too easy to lose data with incorrect encodings. [need to set up an | |
964 error system for encoding/decoding. extremely important but a | |
965 little tricky to implement so let's deal with other issues now.] | |
966 | |
967 -- EOL isn't always detected correctly. [#### ?? need examples] | |
968 | |
969 -- truename isn't working: c:\t.txt and c:\tmp.txt have the same truename. | |
970 [should be easy to fix] | |
971 | |
972 -- unicode files lose the BOM mark. [working on this] | |
973 | |
974 -- command-line utilities use OEM. [actually it seems more | |
975 complicated. it seems they use the codepage of the console. we | |
976 may be able to set that, e.g. to UTF8, before we invoke a command. | |
977 need to investigate.] | |
978 | |
979 -- no way to handle unicode characters not recognized as charsets. [we | |
980 need to create something like 8 private 2-dimensional charsets to | |
981 handle all BMP Unicode chars. Obviously this is a stopgap | |
982 solution. Switching to Unicode internal will ultimately make life | |
983 far easier and remove the BMP limitation. but for now it will | |
984 work. we translate all characters where we have charsets into | |
985 chars in those charsets, and the remainder in a unicode charset. | |
986 that way we can save them out again and guarantee no data loss with | |
987 unicode. this creates font problems, though ...] | |
988 | |
989 -- problems with xemacs font handling. [xemacs font handling is not | |
990 sophisticated enough. it goes on a charset granularity basis and | |
991 only looks for a font whose name contains the corresponding windows | |
992 charset in it. with unicode this fails in various ways. for one | |
993 the granularity needs to be single character, so that those unicode | |
994 charsets mentioned above work; and it needs to query the font to | |
995 see what unicode ranges it supports, rather than just looking at | |
996 the charset ending.] | |
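
The figure of about 8 private charsets can be checked with a little arithmetic. The sketch assumes each 2-dimensional charset has 96 positions per dimension (9216 characters); 94 per dimension also works out to 8. The mapping functions are hypothetical, just one obvious way to lay BMP code points out over such charsets.

```python
CHARSET_DIM = 96                              # assumed positions per dimension
CHARSET_SIZE = CHARSET_DIM * CHARSET_DIM      # 9216 characters per charset
BMP_SIZE = 0x10000                            # 65536 code points


def charsets_needed(size=CHARSET_SIZE):
    """Ceiling of BMP_SIZE / size."""
    return -(-BMP_SIZE // size)


def to_private(code_point):
    """Map a BMP code point to (charset index, row, column)."""
    if not 0 <= code_point < BMP_SIZE:
        raise ValueError("outside the BMP")
    index, offset = divmod(code_point, CHARSET_SIZE)
    row, col = divmod(offset, CHARSET_DIM)
    return index, row, col


def from_private(index, row, col):
    """Inverse of to_private, so characters can be saved out losslessly."""
    return index * CHARSET_SIZE + row * CHARSET_DIM + col
```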
997 | |
998 | |
999 | |
1000 August 28, 2001: | |
1001 | |
1002 working on getting everything to compile again: Cygwin, non-MULE, | |
1003 pdump. not there yet. | |
1004 | |
1005 mswindows-multibyte is now defined using chain, and works. removed | |
1006 most vestiges of the mswindows-multibyte coding system type. | |
1007 | |
1008 file-coding is on by default; should default to binary only on Unix. | |
1009 Need to test. (Needs to compile first :-) | |
1010 | |
1011 August 26, 2001: | |
1012 | |
1013 I've fixed the issue of inputting non-ASCII text under -nuni, and done | |
1014 some of the work on the Russian C-x problem -- we now compute the | |
1015 other possibilities. We still need to fix the key-lookup code, | |
1016 though, and that code is unfortunately a bit ugly. the best way, it | |
1017 seems, is to expand the command-builder structure so you can specify | |
1018 different interpretations for keys. (if we do find an alternative | |
1019 binding, though, we need to mess with both the command builder and | |
1020 this-command-keys, as does the function-key stuff. probably need to | |
1021 abstract that munging code.) | |
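
A minimal model of "computing the other possibilities" for a key and trying them during lookup. The keymap, the partial Russian-to-US table and the function names are invented for the example; the real command builder is more involved.

```python
# Character typed on a Russian layout -> character the same physical key
# produces on a US layout (partial table).
RUSSIAN_TO_US = {"ч": "x", "с": "c", "а": "f", "ы": "s"}

KEYMAP = {
    ("control", "x"): "Control-X-prefix",
    ("control", "f"): "forward-char",
}


def interpretations(modifier, char):
    """The key as typed, followed by alternative readings of it."""
    result = [(modifier, char)]
    if char in RUSSIAN_TO_US:
        result.append((modifier, RUSSIAN_TO_US[char]))
    return result


def lookup_key(modifier, char, keymap=KEYMAP):
    """Return (key actually used, binding), or None if nothing is bound."""
    for candidate in interpretations(modifier, char):
        if candidate in keymap:
            return candidate, keymap[candidate]
    return None
```

Returning the key that actually matched matters because, as noted above, the command builder and this-command-keys both have to be rewritten when an alternative binding is used.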
1022 | |
1023 high-priority: | |
1024 | |
1025 [currently doing] | |
1026 | |
1027 -- support for WM_IME_CHAR. IME input can work under -nuni if we use | |
1028 WM_IME_CHAR. probably we should always be using this, instead of | |
1029 snarfing input using WM_IME_COMPOSITION. i'll check this out. | |
1030 -- Russian C-x problem. see above. | |
1031 | |
1032 [clean-up] | |
1033 | |
1034 -- make sure it compiles and runs under non-mule. remember that some | |
1035 code needs the unicode support, or at least a simple version of it. | |
1036 -- make sure it compiles and runs under pdump. see below. | |
1037 -- clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE] | |
1038 -- eliminate last vestiges of codepage<->charset conversion and similar stuff. | |
1039 | |
1040 [other] | |
1041 -- cut and paste. see below. | |
1042 -- misc issues with handling lang environments. see also August 25, | |
1043 "finally: working on the C-x in ...". | |
1044 -- when switching lang env, needs to set keyboard layout. | |
1045 -- user var to control whether, when moving into text of a | |
1046 particular language, we set the appropriate keyboard layout. we | |
1047 would need to have a lisp api for retrieving and setting the | |
1048 keyboard layout, set text properties to indicate the layout of | |
1049 text, and have a way of dealing with text with no property on | |
1050 it. (e.g. saved text has no text properties on it.) basically, | |
1051 we need to get a keyboard layout from a charset; getting a | |
1052 language would do. Perhaps we need a table that maps charsets | |
1053 to language environments. | |
1054 -- test that the lang env is properly set at startup. test that | |
1055 switching the lang env properly sets the C locale (call | |
1056 setlocale(), set LANG, etc.) -- a spawned subprogram should have | |
1057 the new locale in its environment. | |
1058 -- look through everything below and see if anything is missed in this | |
1059 priority list, and if so add it. create a separate file for the | |
1060 priority list, so it can be updated as appropriate. | |
1061 | |
1062 | |
1063 mid-priority: | |
1064 | |
1065 -- clean up the chain coding system. its list should specify decode | |
1066 order, not encode; i now think this way is more logical. it should | |
1067 check the endpoints to make sure they make sense. it should also | |
1068 allow for the specification of "reverse-direction coding systems": | |
1069 use the specified coding system, but invert the sense of decode and | |
1070 encode. | |
1071 | |
1072 -- along with that, places that take an arbitrary coding system and | |
1073 expect the ends to be anything specific need to check this, and add | |
1074 the appropriate conversions from byte->char or char->byte. | |
1075 | |
1076 -- get some support for arabic, thai, vietnamese, japanese jisx 0212: | |
1077 at least get the unicode information in place and make sure we have | |
1078 things tied together so that we can display them. worry about r2l | |
1079 some other time. | |
1080 | |
1081 August 25, 2001: | |
1082 | |
1083 There is actually more non-Unicode-ized stuff, but it's basically | |
1084 inconsequential. (See previous note.) You can check using the file | |
1085 nmkun.txt (#### RENAME), which is just a list of all the routines that | |
1086 have been split. (It was generated from the output of `nmake | |
1087 unicode-encapsulate', after removing everything from the output but | |
1088 the function names.) Use something like | |
1089 | |
1090 fgrep -f ../nmkun.txt -w [a-hj-z]*.[ch] |m | |
1091 | |
1092 in the source directory, which does a word match and skips | |
1093 intl-unicode-win32.[ch] and intl-win32.[ch], which have a whole lot of | |
1094 references to these, unavoidably. It effectively detects what needs | |
1095 to be changed because changed versions either begin qxe... or end with | |
1096 A or W, and in each case there's no whole-word match. | |
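
The two properties that command relies on can be checked in isolation. `CreateFile` is used here only as an example of a split routine, and the helper names are made up. Note that the glob skips every file whose name begins with `i`, not just the four intl files.

```python
import re
from fnmatch import fnmatchcase

PATTERN = "[a-hj-z]*.[ch]"


def is_scanned(filename):
    """True if the shell glob used above would pass this file to fgrep."""
    return fnmatchcase(filename, PATTERN)


def has_whole_word(line, name):
    """Approximate fgrep -w: name must not touch other word characters."""
    return re.search(r"\b" + re.escape(name) + r"\b", line) is not None
```

A line calling `CreateFile (...)` is reported, while `qxeCreateFile`, `CreateFileA` and `CreateFileW` are not, since none of them is a whole-word match.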
1097 | |
1098 The nasty bug has been fixed below. The -nuni option now works -- all | |
1099 specially-written code to handle the encapsulation has been tested by | |
1100 some operation (fonts by loadup and checking the output of (list-fonts | |
1101 ""); devmode by printing; dragdrop tests other stuff). | |
1102 | |
1103 NOTE: for -nuni (Win 95), these areas need work: | |
1104 | |
1105 -- cut and paste. we should be able to receive Unicode text if it's | |
1106 there, and we should be able to receive it even in Win 95 or -nuni. | |
1107 we should just check in all circumstances. also, under 95, when we | |
1108 put some text in the clipboard, it may or may not also be | |
1109 automatically enumerated as unicode. we need to test this out | |
1110 and/or just go ahead and manually do the unicode enumeration. | |
1111 | |
1112 -- receiving keyboard input. we get only a single byte, but we should | |
1113 be able to correlate the language of the keyboard layout to a | |
1114 particular code page, so we can then decode it correctly. | |
1115 | |
1116 -- mswindows-multibyte. still implemented as its own thing. should | |
1117 be done as a chain of (encoding) unicode | unicode-to-multibyte. | |
1118 need to turn this on, get it working, and look into optimizations | |
1119 in the dfc stuff. (#### perhaps there's a general way to do these | |
1120 optimizations??? something like having a method on a coding system | |
1121 that can specify whether a pure-ASCII string gets rendered as | |
1122 pure-ASCII bytes and vice-versa.) | |
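
The general optimization floated in the last item above could take the form of a per-coding-system property that is consulted before any real conversion. The following is only a sketch under stated assumptions: the struct, the ascii_passthrough field, and the functions are all hypothetical and are not the existing coding-system or dfc interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified coding system; not the real XEmacs structure. */
typedef struct {
    const char *name;
    /* Nonzero if pure-ASCII text converts to the identical bytes end to
       end.  This is a property of the whole coding system: a chain can
       have it even when an intermediate step (e.g. UTF-16) does not. */
    int ascii_passthrough;
    /* Full conversion; returns the number of bytes written to DST. */
    size_t (*convert)(const unsigned char *src, size_t len,
                      unsigned char *dst);
} coding_system_sketch;

static int slow_conversions;   /* counts slow-path calls, for the demo only */

/* Stand-in for a real converter: copies the bytes and counts the call. */
size_t demo_slow_convert(const unsigned char *src, size_t len,
                         unsigned char *dst)
{
    slow_conversions++;
    memcpy(dst, src, len);
    return len;
}

/* Return 1 if every byte of S is 7-bit ASCII. */
static int is_pure_ascii(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}

/* Convert SRC into DST, skipping the converter when the coding system
   says pure ASCII passes through unchanged.  DST must be large enough. */
size_t convert_text(const coding_system_sketch *cs,
                    const unsigned char *src, size_t len,
                    unsigned char *dst)
{
    assert(cs != NULL && src != NULL && dst != NULL);
    if (cs->ascii_passthrough && is_pure_ascii(src, len)) {
        memcpy(dst, src, len);          /* fast path: plain copy */
        return len;
    }
    return cs->convert(src, len, dst);  /* slow path: real conversion */
}
```

The design choice here is that the property is declared for the coding system as a whole rather than derived link by link, since a chain like the one proposed above is not ASCII-compatible at every intermediate stage.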
1123 | |
1124 | |
1125 ALSO: | |
1126 | |
1127 -- we have special macros TSTR_TO_C_STRING and such because formerly | |
1128 the DFC macros didn't know about external stuff that was Unicode | |
1129 encoded and would call strlen() on them. this is fixed, so now we | |
1130 should undo the special macros, make them normal, remove the | |
1131 comments about this, and make sure it works. [DONE] | |
1132 | |
1133 | |
1134 -- finally: working on the C-x in Russian key layout problem. in the | |
1135 process will probably end up doing work on cleaning up the handling | |
1136 of keyboard layouts, integrating or deleting the FSF stuff, adding | |
1137 code to change the keyboard layout as we move in and out of text in | |
1138 different languages (implemented as a post-command-hook; we need | |
1139 something like internal-post-command-hook if not already there, for | |
1140 internal stuff that doesn't want to get mixed up with the regular | |
1141 post-command-hook; similar for pre-command-hook). also, when | |
1142 langenv changes, ways to set the keyboard layout appropriately. | |
1143 | |
1144 -- i think the stuff above is higher priority than the other stuff | |
1145 mentioned below. what i'm aiming for is to be able to input and | |
1146 work with multiple languages without weird glitches, both under 95 | |
1147 and NT. the problems above are all basic impediments to such work. | |
1148 we assume for the moment that the user can make use of the existing | |
1149 file i/o conversion stuff, and put that lower in priority, after | |
1150 the basic input is working. | |
1151 | |
1152 -- i should get my modem connected and write up what's going on and | |
1153 send it to the lists; also cvs commit my workspaces and get more | |
1154 testers. | |
1155 | |
1156 August 24, 2001: | |
1157 | |
1158 All code has been Unicode-ized except for some stuff in console-msw.c | |
1159 that deals with console output. Much of the Unicode-encapsulation | |
1160 stuff, particularly the hand-written stuff, really needs testing. I | |
1161 added a new command-line option, -nuni, to force use of all ANSI calls | |
1162 -- XE_UNICODEP evaluates to false in this case. | |
1163 | |
1164 There is a nasty bug that appeared recently, probably when the event | |
1165 code got Unicode-ized -- bad interactions with OS sticky modifiers. | |
1166 Hold the shift key down and release it, then instead of affecting the | |
1167 next char only, it gets permanently stuck on (until you do a regular | |
1168 shift+char stroke). This needs to be debugged. | |
1169 | |
1170 Other things on agenda: | |
1171 | |
1172 -- go through and prioritize what's listed below. | |
1173 | |
1174 -- make sure the pdump code can compile and work. for the moment we | |
1175 just don't try to dump any Unicode tables and load them up each | |
1176 time. this is certainly fast but ... | |
1177 | |
1178 -- there's the problem that XEmacs can't be run in a directory with | |
1179 non-ASCII/Latin-1 chars in it, since it will be doing Unicode | |
1180 processing before we've had a chance to load the tables. In fact, | |
1181 even finding the tables in such a situation is problematic using | |
1182 the normal commands. my idea is to eventually load the stuff | |
1183 extremely extremely early, at the same time as the pdump data gets | |
1184 loaded. in fact, the unicode table data (stored in an efficient | |
1185 binary format) can even be stuck into the pdump file (which would | |
1186 mean as a resource to the executable, for windows). we'd need to | |
1187 extend pdump a bit: to allow for attaching extra data to the pdump | |
1188 file. (something like pdump_attach_extra_data (addr, length) | |
1189 returns a number of some sort, an index into the file, which you | |
1190 can then retrieve with pdump_load_extra_data(), which returns an | |
1191 addr (mmap()ed or loaded), and later you pdump_unload_extra_data() | |
1192 when finished. we'd probably also need | |
1193 pdump_attach_extra_data_append(), which appends data to the data | |
1194 just written out with pdump_attach_extra_data(). this way, | |
1195 multiple tables in memory can be written out into one contiguous | |
1196 table. (we'd use the tar-like trick of allowing new blocks to be | |
1197 written without going back to change the old blocks -- we just rely | |
1198 on the end of file/end of memory.) this same mechanism could be | |
1199 extracted out of pdump and used to handle the non-pdump situation | |
1200 (or alternatively, we could just dump either the memory image of | |
1201 the tables themselves or the compressed binary version). in the | |
1202 case of extra unicode tables not known about at compile time that | |
1203 get loaded before dumping, we either just dump them into the image | |
1204 (pdump and all) or extract them into the compressed binary format, | |
1205 free the original tables, and treat them like all other tables. | |
1206 | |
1207 -- `C-x b' when using a Russian keyboard layout. XEmacs currently | |
1208 tries to interpret C+cyrillic char, which causes an error. We want | |
1209 C-x b to still work even when the keyboard normally generates | |
1210 Cyrillic. What we should do is expand the keyboard event structure | |
1211 so that it contains not only the actual char, but what the char | |
1212 would have been in various other keyboard layouts, and in contexts | |
1213 where only certain keystrokes make sense (creating control chars, | |
1214 and looking up in keymaps), we proceed in order, processing each of | |
1215 them until we get something. order should be something like: | |
1216 current keyboard layout; layout of the current language | |
1217 environment; layout of the user's default language; layout of the | |
1218 system default language; layout of US English. | |
1219 | |
1220 -- reading and writing Unicode files. multiple problems: | |
1221 | |
1222 -- EOL's aren't handled right. for the moment, just fix the | |
1223 Unicode coding systems; later on, create EOL-only coding | |
1224 systems: | |
1225 | |
1226 1. they would be character->character and operate next to the | |
1227 internal data; this means that coding systems need to be able | |
1228 to handle ends of lines that are either CR, LF, or CRLF. | |
1229 usually this isn't a problem, as they are just characters | |
1230 like any other and get encoded appropriately. however, | |
1231 coding systems that are line-oriented need to recognize any | |
1232 of the three as line endings. | |
1233 | |
1234 2. we'd also have to complete the stuff that handles coding | |
1235 systems where either end can be byte or char (four | |
1236 possibilities total; use a single enum such as | |
1237 ENCODES_CHAR_TO_BYTE, ENCODES_BYTE_TO_BYTE, etc.). | |
1238 | |
1239 3. we'd need ways of specifying the chaining of coding systems. | |
1240 e.g. when reading a coding system, a user can specify more | |
1241 than one with a | symbol between them. when a context calls | |
1242 for a coding system and a chain is needed, the `chain' coding | |
1243 system is useful; but we should really expand the contexts | |
1244 where a list of coding systems can be given, and whenever | |
1245 possible try to inline the chain instead of using a | |
1246 surrounding `chain' coding system. | |
1247 | |
1248 4. the `chain' needs some work so that it passes all sorts of | |
1249 lstream commands down to the chain inside it -- it should be | |
1250 entirely transparent and the fact that there's actually a | |
1251 surrounding coding system should be invisible. more general | |
1252 coding system methods might need to be created. | |
1253 | |
1254 5. important: we need a way of specifying how detecting works | |
1255 when we have more than one coding system. we might need more | |
1256 than a single priority list. need to think about this. | |
1257 | |
1258 -- Unicode files beginning with the BOM are not recognized as such. | |
1259 we need to fix this; but to make things sensible, we really need | |
1260 to add the idea of different levels of confidence regarding | |
1261 what's detected. otherwise, Unicode says "yes this is me" but | |
1262 others higher up do too. in the process we should probably | |
1263 finish abstracting the detection system and fix up some | |
1264 stupidities in it. | |
1265 | |
1266 -- When writing a file, we need error detection; otherwise somebody | |
1267 will create a Unicode file without realizing the coding system | |
1268 of the buffer is Raw, and then lose all the non-ASCII/Latin-1 | |
1269 text when it's written out. We need two levels: | |
1270 | |
1271 1. first, a "safe-charset" level that checks before any actual | |
1272 encoding to see if all characters in the document can safely | |
1273 be represented using the given coding system. FSF has a | |
1274 "safe-charset" property of coding systems, but it's stupid | |
1275 because this information can be automatically derived from | |
1276 the coding system, at least the vast majority of the time. | |
1277 What we need is some sort of | |
1278 alternative-coding-system-precedence-list, langenv-specific, | |
1279 where everything on it can be checked for safe charsets and | |
1280 then the user given a list of possibilities. When the user | |
1281 does "save with specified encoding", they should see the same | |
1282 precedence list. Again, as with other precedence lists, | |
1283 there's also a global one, and presumably all coding systems | |
1284 not on either list get appended to the end (and perhaps not | |
1285 checked at all when doing safe-checking?). safe-checking | |
1286 should work something like this: compile a list of all | |
1287 charsets used in the buffer, along with a count of chars | |
1288 used. that way, "slightly unsafe" charsets can perhaps be | |
1289 presented at the end, which will lose only a few characters | |
1290 and are perhaps what the users were looking for. | |
1291 | |
1292 2. when actually writing out, we need error checking in case an | |
1293 individual char in a charset can't be written even though the | |
1294 charsets are safe. again, the user gets the choice of other | |
1295 reasonable coding systems. | |
1296 | |
1297 3. same thing (error checking, list of alternatives, etc.) needs | |
1298 to happen when reading! all of this will be a lot of work! | |
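
The BOM item in the list above asks for levels of confidence in detection, so that a Unicode signature can beat a higher-priority coding system that merely says "could be me". A rough C sketch of that idea follows. The enum, the detector struct, the coding-system names, and the BOM-only detectors are all assumptions made for illustration; they are not the existing detection code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Confidence levels, weakest to strongest (names are illustrative). */
typedef enum { DET_NO = 0, DET_SLIGHT, DET_LIKELY, DET_SIGNATURE } det_level;

typedef struct {
    const char *name;
    det_level (*detect)(const unsigned char *buf, size_t len);
} detector;

static int has_prefix(const unsigned char *buf, size_t len,
                      const char *sig, size_t siglen)
{
    return len >= siglen && memcmp(buf, sig, siglen) == 0;
}

/* Any byte sequence is valid Latin-1, so this can never be confident. */
static det_level detect_latin1(const unsigned char *buf, size_t len)
{
    (void) buf;
    (void) len;
    return DET_SLIGHT;
}

/* BOM-only detectors; real ones would also inspect the text itself. */
static det_level detect_utf8(const unsigned char *buf, size_t len)
{
    return has_prefix(buf, len, "\xEF\xBB\xBF", 3) ? DET_SIGNATURE : DET_NO;
}

static det_level detect_utf16_le(const unsigned char *buf, size_t len)
{
    return has_prefix(buf, len, "\xFF\xFE", 2) ? DET_SIGNATURE : DET_NO;
}

static det_level detect_utf16_be(const unsigned char *buf, size_t len)
{
    return has_prefix(buf, len, "\xFE\xFF", 2) ? DET_SIGNATURE : DET_NO;
}

/* Priority order: earlier entries win ties. */
static const detector priority_list[] = {
    { "iso-8859-1", detect_latin1 },
    { "utf-8",      detect_utf8 },
    { "utf-16-le",  detect_utf16_le },
    { "utf-16-be",  detect_utf16_be },
};

/* Pick the most confident detector; priority only breaks ties. */
const char *detect_coding_system(const unsigned char *buf, size_t len)
{
    const char *best = NULL;
    det_level best_level = DET_NO;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < sizeof priority_list / sizeof priority_list[0]; i++) {
        det_level level = priority_list[i].detect(buf, len);
        if (level > best_level) {       /* strict: ties keep the earlier entry */
            best = priority_list[i].name;
            best_level = level;
        }
    }
    return best;
}
```

With plain yes/no answers the first entry would always win; with levels, a file starting with FF FE is reported as little-endian UTF-16 even though Latin-1 is higher in the list.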
1299 | |
1300 | |
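
The "safe-charset" check described in the list above (count the characters of each charset in the buffer, then see which coding systems can represent them, with "slightly unsafe" ones presented last) could be sketched like this. Everything here is hypothetical: the charset ids, the bitmask representation, and the function names are assumptions made for the example, not existing XEmacs interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy charset ids; a real buffer would use actual charset objects. */
enum { CS_ASCII, CS_LATIN1, CS_CYRILLIC, CS_JAPANESE, CS_COUNT };

typedef struct {
    const char *name;
    unsigned charset_mask;   /* bit N set => charset N is representable */
    size_t chars_lost;       /* filled in by rank_coding_systems() */
} candidate;

/* Count how many characters of each charset the buffer contains.
   CHARSET_OF_CHAR[i] is the charset id of the i-th character. */
void count_charsets(const int *charset_of_char, size_t nchars,
                    size_t counts[CS_COUNT])
{
    size_t i;
    for (i = 0; i < CS_COUNT; i++)
        counts[i] = 0;
    for (i = 0; i < nchars; i++) {
        assert(charset_of_char[i] >= 0 && charset_of_char[i] < CS_COUNT);
        counts[charset_of_char[i]]++;
    }
}

/* Compute how many characters each candidate would lose, then sort so
   that safe candidates (0 lost) come first and slightly unsafe ones
   follow.  The sort is stable, so precedence order breaks ties. */
void rank_coding_systems(candidate *cands, size_t ncands,
                         const size_t counts[CS_COUNT])
{
    size_t i, j;
    int cs;

    for (i = 0; i < ncands; i++) {
        cands[i].chars_lost = 0;
        for (cs = 0; cs < CS_COUNT; cs++)
            if (!(cands[i].charset_mask & (1u << cs)))
                cands[i].chars_lost += counts[cs];
    }
    for (i = 1; i < ncands; i++) {          /* insertion sort */
        candidate key = cands[i];
        j = i;
        while (j > 0 && cands[j - 1].chars_lost > key.chars_lost) {
            cands[j] = cands[j - 1];
            j--;
        }
        cands[j] = key;
    }
}
```

For a buffer of mostly ASCII with a few Cyrillic characters and one Japanese character, a Cyrillic-only coding system would rank after a Unicode one but ahead of Latin-1, and the user could be shown the number of characters each choice would lose.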
1301 | |
1302 Announcement, August 20, 2001: | |
1303 | |
1304 I'm looking for testers. There is a complete and fast implementation | |
1305 in C of Unicode conversion, translations for almost all of the | |
1306 standardly-defined charsets that load up automatically and | |
1307 instantaneously at runtime, coding systems supporting the common | |
1308 external representations of Unicode [utf-16, ucs-4, utf-8, | |
1309 little-endian versions of utf-16 and ucs-4; utf-7 is sitting there | |
1310 with abort()s where the coding routines should go, just waiting for | |
1311 somebody to implement], and a nice set of primitives for translating | |
1312 characters<->codepoints and setting the priority lists used to control | |
1313 codepoint->char lookup. | |
1314 | |
1315 It's so far hooked into one place: the Windows IME. Currently I can | |
1316 select the Japanese IME from the thing on my tray pad in the lower | |
1317 right corner of the screen, and type Japanese into XEmacs, and you get | |
1318 Japanese in XEmacs -- regardless of whether you set either your | |
1319 current or global system locale to Japanese, and regardless of whether | |
1320 you set your XEmacs lang env as Japanese. This should work for many | |
1321 other languages, too -- Cyrillic, Chinese either Traditional or | |
1322 Simplified, and many others, but YMMV. There may be some lurking | |
1323 bugs (hardly surprising for something so raw). | |
1324 | |
1325 To get at this, checkout using `ben-mule-21-5', NOT the simpler | |
1326 `mule-21-5'. For example: | |
1327 | |
1328 cvs -d :pserver:xemacs@cvs.xemacs.org:/usr/CVSroot checkout -r ben-mule-21-5 xemacs | |
1329 | |
1330 or you get the idea. the `-r ben-mule-21-5' is important. | |
1331 | |
1332 I keep track of my progress in a file called README.ben-mule-21-5 in | |
1333 the root directory of the source tree. | |
1334 | |
1335 WARNING: Pdump might not work. Will be fixed rsn. | |
1336 | |
1337 August 20, 2001: | |
1338 | |
1339 -- still need to sort out demand loading, binary format, etc. figure | |
1340 out what the goals are and how we're going to achieve them. for | |
1341 the moment let's just say that running XEmacs in a directory with | |
1342 Japanese or other weird characters in the name is likely to cause | |
1343 problems under MS Windows, but once XEmacs is initialized (and | |
1344 before processing init files), all Unicode support is there. | |
1345 | |
1346 -- wrote the size computation routines, although not yet tested. | |
1347 | |
1348 -- lots more abstraction of coding systems; almost done. | |
1349 | |
1350 -- UNICODE WORKS!!!!! | |
1351 | |
1352 | |
1353 August 19, 2001: | |
1354 | |
1355 Still needed on the Unicode support: | |
1356 | |
1357 -- demand loading: load the Unicode table data the first time a | |
1358 conversion needs to be done. | |
1359 | |
1360 -- maybe: table size computation: figure out how big the in-memory | |
1361 tables actually are. | |
1362 | |
1363 -- maybe: create a space-efficient binary format for the data, and a | |
1364 way to dump out an existing charset's data into this binary format. | |
1365 it should allow for many such groups of data to be appended | |
1366 together in one file, such that you can just append the new data | |
1367 onto the end and not have to go back and modify anything | |
1368 previously. (like how tar archives work, and how UDF for | |
1369 CD-R's and CD-RW's works.) | |
1370 | |
1371 -- maybe: figure out how to be able to access the Unicode tables at | |
1372 init_intl() time, before we know how to get at data-directory; that | |
1373 way we can handle the need for unicode conversions that come up | |
1374 very early, for example if XEmacs is run from a directory | |
1375 containing Japanese in it. Presumably we'd want to generalize the | |
1376 stuff in pdump.c that deals with the dumper file, so that it can | |
1377 handle other files -- putting the file either in the directory of | |
1378 the executable or in a resource, maybe actually attached to the | |
1379 pdump file itself -- or maybe we just dump the data into the actual | |
1380 executable. With pdump we could extend pdump to allow for data | |
1381 that's in the pdump file but not actually mapped at startup, | |
1382 separate from the data that does get mapped -- and then at runtime | |
1383 the pointer gets restored not with a real pointer but an offset | |
1384 into the file; another pdump call and we get some way to access the | |
1385 data. (tricky because it might be in a resource, not a file. we | |
1386 might have to just tell pdump to mmap or whatever the data in, and | |
1387 then tell pdump to release it.) | |
1388 | |
1389 -- fix multibyte to use unicode. at first, just reverse | |
1390 mswindows-multibyte-to-unicode to be unicode-to-multibyte; later | |
1391 implement something in chain to allow for reversal, for declaring | |
1392 the ends of the coding systems, etc. | |
1393 | |
1394 -- actually make sure that the IME stuff is working!!! | |
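
The append-only binary format mentioned in the list above (new groups of data are added at the end without touching anything written earlier, as with tar) could look roughly like the following. This is a sketch under assumptions: the record layout (4-byte id, 4-byte length, payload, both integers little-endian) and the function names are invented for the example and say nothing about the format that will actually be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_HEADER_SIZE 8u   /* 4-byte id + 4-byte payload length */

static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) (v & 0xFF);
    p[1] = (unsigned char) ((v >> 8) & 0xFF);
    p[2] = (unsigned char) ((v >> 16) & 0xFF);
    p[3] = (unsigned char) ((v >> 24) & 0xFF);
}

static uint32_t get_u32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
         | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Append one record after the *USED bytes already in BUF (capacity CAP).
   Earlier bytes are never modified.  Returns 0, or -1 if it won't fit. */
int rec_append(unsigned char *buf, size_t cap, size_t *used,
               uint32_t id, const void *data, uint32_t len)
{
    assert(buf != NULL && used != NULL && *used <= cap);
    if (cap - *used < REC_HEADER_SIZE
        || cap - *used - REC_HEADER_SIZE < len)
        return -1;
    put_u32(buf + *used, id);
    put_u32(buf + *used + 4, len);
    if (len > 0)
        memcpy(buf + *used + REC_HEADER_SIZE, data, len);
    *used += REC_HEADER_SIZE + len;
    return 0;
}

/* Walk the records from the start; the end of the data is simply the
   end of the buffer.  Returns the payload of the first record with ID
   (storing its length in *LEN_OUT), or NULL if there is none. */
const unsigned char *rec_find(const unsigned char *buf, size_t used,
                              uint32_t id, uint32_t *len_out)
{
    size_t off = 0;

    while (used - off >= REC_HEADER_SIZE) {
        uint32_t rec_id = get_u32(buf + off);
        uint32_t rec_len = get_u32(buf + off + 4);
        if (used - off - REC_HEADER_SIZE < rec_len)
            break;                          /* truncated final record */
        if (rec_id == id) {
            *len_out = rec_len;
            return buf + off + REC_HEADER_SIZE;
        }
        off += REC_HEADER_SIZE + rec_len;
    }
    return NULL;
}
```

Because the reader relies only on the end of the buffer to know where the records stop, there is no directory or record count that would have to be rewritten when a new table is appended.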
1395 | |
1396 Other things before announcing: | |
1397 | |
1398 -- change so that the Unicode tables are not pdumped. This means we | |
1399 need to free any table data out there. Make sure that pdump | |
1400 compiles and try to finish the pretty-much-already-done stuff | |
1401 already with XD_STRUCT_ARRAY and dynamic size computation; just | |
1402 need to see what's going on with LO_LINK. | |
1403 | |
1404 August 14, 2001: | |
1405 | |
1406 To do a diff between this workspace and the mainline, use the most recent sync tags, currently: | |
1407 | |
1408 cvs diff -r main-branch-ben-mule-21-5-aug-11-2001-sync -r ben-mule-21-5-post-aug-11-2001-sync | |
1409 | |
1410 Unicode support: | |
1411 | |
1412 Unicode support is important for supporting many languages under | |
1413 Windows, such as Cyrillic, without resorting to translation tables for | |
1414 particular Windows-specific code pages. Internally, all characters in | |
1415 Windows can be represented in two encodings: code pages and Unicode. | |
1416 With Unicode support, we can seamlessly support all Windows | |
1417 characters. Currently, the test in the drive to support Unicode is if | |
1418 IME input works properly, since it is being converted from Unicode. | |
1419 | |
1420 Unicode support also requires that the various Windows API's be | |
1421 "Unicode-encapsulated", so that they automatically call the ANSI or | |
1422 Unicode version of the API call appropriately and handle the size | |
1423 differences in structures. What this means is: | |
1424 | |
1425 -- first, note that Windows already provides a sort of encapsulation | |
1426 of all API's that deal with text. All such API's are underlyingly | |
1427 provided in two versions, with an A or W suffix (ANSI or "wide" | |
1428 i.e. Unicode), and the compile-time constant UNICODE controls which | |
1429 is selected by the unsuffixed API. Same thing happens with | |
1430 structures. Unfortunately, this is compile-time only, not | |
1431 run-time, so not sufficient. (Creating the necessary run-time | |
1432 encapsulation is not conceptually difficult, but very time-consuming to | |
1433 write. It adds no significant overhead, and the only reason it's | |
1434 not standard in Windows is conscious marketing attempts by | |
1435 Microsoft to cripple Windows 95. FUCK MICROSOFT! They even | |
1436 describe in a KnowledgeBase article exactly how to create such an | |
1437 API [although we don't exactly follow their procedure], and point | |
1438 out its usefulness; the procedure is also described more generally | |
1439 in Nadine Kano's book on Win32 internationalization -- written SIX | |
1440 YEARS AGO! Obviously Microsoft has such an API available | |
1441 internally.) | |
1442 | |
1443 -- what we do is provide an encapsulation of each standard Windows API | |
1444 call that is split into A and W versions. current theory is to | |
1445 avoid all preprocessor games; so we name the function with a prefix | |
1446 -- "qxe" currently -- and require callers to use the prefixed name. | |
1447 Callers need to explicitly use the W version of all structures, and | |
1448 convert text themselves using Qmswindows_tstr. the qxe | |
1449 encapsulated version will automatically call the appropriate A or W | |
1450 version depending on whether we're running on 9x or NT, and copy | |
1451 data between W and A versions of the structures as necessary. | |
1452 | |
1453 -- We require the caller to handle the actual translation of text to | |
1454 avoid possible overflow when dealing with fixed-size Windows | |
1455 structures. There are no such problems when copying data between | |
1456 the A and W versions because ANSI text is never larger than its | |
1457 equivalent Unicode representation. | |
1458 | |
1459 -- We allow for incremental creation of the encapsulated routines by | |
1460 using the coding system Qmswindows_tstr_notyet. This is an alias | |
1461 for Qmswindows_multibyte, i.e. it always converts to ANSI; but it | |
1462 indicates that it will be changed to Qmswindows_tstr when we have a | |
1463 qxe version of the API call that the data is being passed to and | |
1464 change the code to use the new function. | |
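
The run-time dispatch described in the list above can be sketched in portable C. This is an illustration only: fake_TextLengthA and fake_TextLengthW are stand-ins for a real A/W pair of Windows calls, qxe_fake_TextLength is a made-up wrapper, and the variable xe_unicodep_sketch stands in for whatever XE_UNICODEP actually tests. The caller is assumed to have already converted the text to the right external format, as the notes require.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-in for the run-time test: nonzero means the Unicode (W)
   entry points are usable (NT), zero means ANSI only (9x or -nuni). */
int xe_unicodep_sketch = 1;

/* Stand-ins for the A and W versions of some split API call. */
static size_t fake_TextLengthA(const char *s)
{
    return strlen(s);
}

static size_t fake_TextLengthW(const wchar_t *s)
{
    return wcslen(s);
}

/* Encapsulated version: callers always use this name.  TSTR is text the
   caller has already converted to the external format that matches the
   platform (wide characters if Unicode is in use, ANSI bytes if not). */
size_t qxe_fake_TextLength(const void *tstr)
{
    assert(tstr != NULL);
    if (xe_unicodep_sketch)
        return fake_TextLengthW((const wchar_t *) tstr);
    return fake_TextLengthA((const char *) tstr);
}
```

The choice between A and W is made at each call rather than by the preprocessor, which is what lets one binary run on both 9x and NT; a wrapper for a call that takes a structure would additionally copy between the W and A layouts.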
1465 | |
1466 Besides creating the encapsulation, the following needs to be done for | |
1467 Unicode support: | |
1468 | |
1469 -- No actual translation tables are fed into XEmacs. We need to | |
1470 provide glue code to read the tables in etc/unicode. See | |
1471 etc/unicode/README for the interface to implement. | |
1472 | |
1473 -- Fix pdump. The translation tables for Unicode characters function | |
1474 as unions of structures with different numbers of indirection | |
1475 levels, in order to be efficient. pdump doesn't yet support such | |
1476 unions. charset.h has a general description of how the translation | |
1477 tables work, and the pdump code has constants added for the new | |
1478 required data types, and descriptions of how these should work. | |
1479 | |
1480 -- ultimately, there's no end to additional work (composition, bidi | |
1481 reordering, glyph shaping/ordering, etc.), but the above is enough | |
1482 to get basic translation working. | |
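
To make the pdump item in the list above more concrete, here is a small sketch of a table that is a union of layouts with different numbers of indirection levels. It is not the structure described in charset.h: the type, the field names, the 256-entry rows, and the UNMAPPED marker are all assumptions chosen to show why a dumper needs the level count to know which union arm is live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define UNMAPPED 0xFFFFu

typedef struct {
    int levels;                  /* 1: one row; 2: rows of rows */
    union {
        unsigned short *one;     /* one[low byte] */
        unsigned short **two;    /* two[high byte][low byte]; rows made lazily */
    } u;
} xlate_table;

static unsigned short *new_row(void)
{
    unsigned short *row = malloc(256 * sizeof *row);
    size_t i;
    if (row != NULL)
        for (i = 0; i < 256; i++)
            row[i] = UNMAPPED;
    return row;
}

/* Initialize T with the given number of levels; returns 0 or -1. */
int table_init(xlate_table *t, int levels)
{
    assert(t != NULL && (levels == 1 || levels == 2));
    t->levels = levels;
    if (levels == 1) {
        t->u.one = new_row();
        return t->u.one != NULL ? 0 : -1;
    }
    t->u.two = calloc(256, sizeof *t->u.two);   /* all rows start absent */
    return t->u.two != NULL ? 0 : -1;
}

/* Map CODE (0..0xFFFF) to VALUE; returns -1 if it does not fit. */
int table_set(xlate_table *t, unsigned code, unsigned short value)
{
    unsigned hi = (code >> 8) & 0xFF, lo = code & 0xFF;

    assert(code <= 0xFFFF);
    if (t->levels == 1) {
        if (hi != 0)
            return -1;           /* a one-level table covers 0..255 only */
        t->u.one[lo] = value;
        return 0;
    }
    if (t->u.two[hi] == NULL && (t->u.two[hi] = new_row()) == NULL)
        return -1;
    t->u.two[hi][lo] = value;
    return 0;
}

/* Look up CODE; returns UNMAPPED if nothing was stored for it. */
unsigned short table_get(const xlate_table *t, unsigned code)
{
    unsigned hi = (code >> 8) & 0xFF, lo = code & 0xFF;

    assert(code <= 0xFFFF);
    if (t->levels == 1)
        return hi == 0 ? t->u.one[lo] : UNMAPPED;
    return t->u.two[hi] != NULL ? t->u.two[hi][lo] : UNMAPPED;
}

void table_free(xlate_table *t)
{
    size_t i;
    if (t->levels == 1) {
        free(t->u.one);
        return;
    }
    if (t->u.two != NULL)
        for (i = 0; i < 256; i++)
            free(t->u.two[i]);
    free(t->u.two);
}
```

A small charset pays for a single row, while a sparse large one only allocates the rows it uses; the cost is that any code walking the structure, including a dumper, has to consult levels before following a pointer.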
1483 | |
1484 Merging this workspace into the trunk requires some work. ChangeLogs | |
1485 have not yet been created. Also, there is a lot of additional code in | |
1486 this workspace other than just Windows and Unicode stuff. Some of the | |
1487 changes have been somewhat disruptive to the code base, in particular: | |
1488 | |
1489 -- the code that handles the details of processing multilingual text | |
1490 has been consolidated to make it easier to extend it. it has been | |
1491 yanked out of various files (buffer.h, mule-charset.h, lisp.h, | |
1492 insdel.c, fns.c, file-coding.c, etc.) and put into text.c and | |
1493 text.h. mule-charset.h has also been renamed charset.h. all long | |
1494 comments concerning the representations and their processing have | |
1495 been consolidated into text.c. | |
1496 | |
1497 -- nt/config.h has been eliminated and everything in it merged into | |
1498 config.h.in and s/windowsnt.h. see config.h.in for more info. | |
1499 | |
1500 -- s/windowsnt.h has been completely rewritten, and s/cygwin32.h and | |
1501 s/mingw32.h have been largely rewritten. tons of dead weight has | |
1502 been removed, and stuff common to more than one file has been | |
1503 isolated into s/win32-common.h and s/win32-native.h, similar to | |
1504 what's already done for usg variants. | |
1505 | |
1506 -- large amounts of code throughout the code base have been Mule-ized, | |
1507 not just Windows code. | |
1508 | |
1509 -- file-coding.c/.h have been largely rewritten (although still mostly | |
1510 syncable); see below. | |
1511 | |
1512 | |
1513 | |
1514 June 26, 2001: | |
1515 | |
1516 -- ben-mule-21-5 | |
1517 | |
1518 this contains all the mule work i've been doing. this includes mostly | |
1519 work done to get mule working under ms windows, but in the process | |
1520 i've [of course] fixed a whole lot of other things as well, mostly | |
1521 mule issues. the specifics: | |
1522 | |
1523 - it compiles and runs under windows and should basically work. the | |
1524 stuff remaining to do is (a) improved unicode support (see below) | |
1525 and (b) smarter handling of keyboard layouts. in particular, it | |
1526 should (1) set the right keyboard layout when you change your | |
1527 language environment; (2) optionally (a user var) set the | |
1528 appropriate keyboard layout as you move the cursor into text in a | |
1529 particular language. | |
1530 | |
1531 - i added a bunch of code to better support OS locales. it tries to | |
1532 notice your locale at startup and set the language environment | |
1533 accordingly (this more or less works), and call setlocale() and set | |
1534 LANG when you change the language environment (may or may not work). | |
1535 | |
1536 - major rewriting of file-coding. it's mostly abstracted into coding | |
1537 systems that are defined by methods (similar to devices and | |
1538 specifiers), with the ultimate aim being to allow non-i18n coding | |
1539 systems such as gzip. there is a "chain" coding system that allows | |
1540 multiple coding systems to be chained together. (it doesn't yet | |
1541 have the concept that either end of a coding system can be bytes or | |
1542 chars; this needs to be added.) | |
1543 | |
1544 - unicode support. very raw. a few days ago i wrote a complete and | |
1545 efficient implementation of unicode translation. it should be very | |
1546 fast, and fairly memory-efficient in its tables. it allows for | |
1547 charset priority lists, which should be language-environment | |
1548 specific (but i haven't yet written the glue code). it works in | |
1549 preliminary testing, but obviously needs more testing and work. | |
1550 as of yet there is no translation data added for the standard charsets. | |
1551 the tables are in etc/unicode, and all we need is a bit of glue code | |
1552 to process them. see etc/unicode/README for the interface to | |
1553 implement. | |
1554 | |
1555 - support for unicode in windows is partly there. this will work even | |
1556 on windows 95. the basic model is implemented but it needs finishing | |
1557 up. | |
1558 | |
1559 - there is a preliminary implementation of windows ime support courtesy | |
1560 of ikeyama. | |
1561 | |
1562 - if you want to get cyrillic working under windows (it appears to "work" | |
1563 but the wrong chars currently appear), the best way is to add unicode | |
1564 support for iso-8859-5 and use it in redisplay-msw.c. we are already | |
1565 passing unicode codepoints to the text-draw routine (ExtTextOutW). | |
1566 (ExtTextOutW and GetTextExtentPoint32W are implemented on both 95 and NT.) | |
1567 | |
1568 - i fixed the iso2022 handling so it will correctly read in files | |
1569 containing unknown charsets, creating a "temporary" charset which | |
1570 can later be overwritten by the real charset when it's defined. | |
1571 this allows iso2022 elisp files with literals in strange languages | |
1572 to compile correctly under mule. i also added a hack that will | |
1573 correctly read in and write out the emacs-specific "composition" | |
1574 escape sequences, i.e. ESC 0 through ESC 4. this means that my | |
1575 workspace correctly compiles the new file devanagari.el that i added | |
1576 (see below). | |
1577 | |
1578 - i copied the remaining language-specific files from fsf. i made | |
1579 some minor changes in certain cases but for the most part the stuff | |
1580 was just copied and may not work. | |
1581 | |
1582 - i fixed post-read-conversion in coding systems to follow fsf | |
1583 conventions. (i also support our convention, for the moment. a | |
1584 kludge, of course.) | |
1585 | |
1586 - make-coding-system accepts (but ignores) the additional properties | |
1587 present in the fsf version, for compatibility. |