comparison README.ben-mule-21-5 @ 771:943eaba38521

[xemacs-hg @ 2002-03-13 08:51:24 by ben] The big ben-mule-21-5 check-in! Various files were added and deleted. See CHANGES-ben-mule. There are still some test suite failures. No crashes, though. Many of the failures have to do with problems in the test suite itself rather than in the actual code. I'll be addressing these in the next day or so -- none of the test suite failures are at all critical. Meanwhile I'll be trying to address the biggest issues -- i.e. build or run failures, which will almost certainly happen on various platforms. All comments should be sent to ben@xemacs.org -- use a Cc: if necessary when sending to mailing lists. There will be pre- and post- tags, something like pre-ben-mule-21-5-merge-in, and post-ben-mule-21-5-merge-in.
author ben
date Wed, 13 Mar 2002 08:54:06 +0000
770:336a418893b5 771:943eaba38521
1 oct 27, 2001:
2
3 -------- proposal for better buffer-switching commands:
4
5 implement what VC++ currently has. you have a single "switch" command like
6 CTRL-TAB, which as long as you hold the CTRL button down, brings successive
7 buffers that are "next in line" into the current position, bumping the rest
8 forward. once you release the CTRL key, the chain is broken, and further
9 CTRL-TABs will start from the beginning again. this way, frequently used
10 buffers naturally move toward the front of the chain, and you can switch
11 back and forth between two buffers using CTRL-TAB. the only thing about
12 CTRL-TAB is it's a bit awkward. the way to implement it is to have
13 modifier-up strokes fire off a hook, like modifier-up-hook. this is driven
14 by event dispatch, so there are no synchronization issues. when C-tab is
15 pressed, the binding function does something like set a one-shot handler on
16 the modifier-up-hook (perhaps separate hooks for separate modifiers?).
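
a minimal sketch of the chain behavior described above, in C. the names (`buffer_chain`, `chain_next`, `chain_commit`) and the integer buffer ids are hypothetical, not existing XEmacs code; `chain_commit` stands in for whatever a one-shot modifier-up handler would call.

```c
#include <string.h>

#define MAX_BUFFERS 16

/* Hypothetical model: buffers are small integer ids kept in
   most-recently-used order; order[0] is the current buffer.
   Assumes count > 0. */
struct buffer_chain {
  int order[MAX_BUFFERS];
  int count;
  int cursor;   /* position reached during the current CTRL-TAB run */
};

/* Called for each CTRL-TAB press while CTRL is held: step one further
   down the chain (wrapping around) and return the buffer to display. */
int chain_next(struct buffer_chain *c)
{
  c->cursor = (c->cursor + 1) % c->count;
  return c->order[c->cursor];
}

/* Called when CTRL is released: move the selected buffer to the front,
   bump the ones before it forward by one, and reset the run so the
   next CTRL-TAB starts from the beginning again. */
void chain_commit(struct buffer_chain *c)
{
  int chosen = c->order[c->cursor];
  memmove(&c->order[1], &c->order[0], (size_t)c->cursor * sizeof(int));
  c->order[0] = chosen;
  c->cursor = 0;
}
```

with this model, a single CTRL-TAB followed by a release swaps the two most recent buffers, which gives the back-and-forth toggle mentioned above.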
17
18 to do this, we'd also want to change the buffer tabs so that they maintain
19 their own order. in particular, they start out synched to the regular
20 order, but as you make changes, you don't want the tabs to change
21 order. (in fact, they may already do this.) selecting a particular buffer
22 from the buffer tabs DOES make the buffer go to the head of the line. the
23 invariant is that if the tabs are displaying X items, those X items are the
24 first X items in the standard buffer list, but may be in a different
25 order. (it looks like the tabs may already implement all of this.)
26
27 oct 26, 2001:
28
29 necessary testing/changes:
30
31 - test all eol detection stuff under windows w/ and w/o mule, unix w/ and
32 w/o mule. (test configure flag, command-line flag, menu option) may need
33 a way of pretending to be unix under cygwin.
34 - test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x
35 windows w/ and w/o mule.
36 - test undecided-dos/unix/mac.
37 - check ESC ESC works as isearch-quit under TTY's.
38 - test coding-system-base and all its uses (grep for them).
39 - menu item to revert to most recent auto save.
40 - consider renaming build_string -> build_intstring and build_c_string to
41 build_string. (consistent with build_msg_string et al; many more
42 build_c_string than build_string)
43
44 oct 20, 2001:
45
46 fixed problem causing crash due to invalid internal-format data, fixed an
47 existing bug in valid_char_p, and added checks to more quickly catch when
48 invalid chars are generated. still need to investigate why
49 mswindows-multibyte is being detected.
50
51 i now see why -- we only process 65536 bytes due to a constant
52 MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as
53 long as we have a seekable stream. we also need to write
54 stderr_out_lisp(), used in the debug info routines i wrote.
55
56 check once more about DEBUG_XEMACS. i think debugging info should be
57 ON by default. make sure it is. check that nothing untoward will result
58 in a production system, e.g. presumably assert()s should not really abort().
59 (!! Actually, this should be runtime settable! Use a variable for this, and
60 it can be set using the same XEMACSDEBUG method. In fact, now that I think
61 of it, I'm sure that debugging info should be on always, with runtime ways
62 of turning on or off any funny behavior.)
63
64 oct 19, 2001:
65
66 fixed various bugs preventing packages from being able to be built. still
67 another bug, with psgml/etc/cdtd/docbook, which contains some strange
68 characters starting around char pos 110,000. It gets detected as
69 mswindows-multibyte (wrong! why?) and then invalid internal-format data is
70 generated. need to fix mswindows-multibyte (and possibly add something
71 that signals an error as well; need to work on this error-signalling
72 mechanism) and figure out why it's getting detected as such. what i should
73 do is add a debug var that outputs blow-by-blow info of the detection
74 process.
75
76 oct 9, 2001:
77
78 the stuff with global-window-system-map doesn't appear to work. in any
79 case it needs better documentation. [DONE]
80
81 M-home, M-end do work, but cause cl-macs to get loaded. why?
82
83 oct 8, 2001:
84
85 finished the coding system changes and they finally work!
86
87 need to implement undecided-unix/dos/mac. they should be easy to do; it
88 should be enough to specify an eol-type but not do-eol, but check this.
89
90 consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as
91 aliases.
92
93 print methods for coding systems should include some of the generic
94 properties. (also then fix print_..._within_print_method). [DONE]
95
96 in a little while, go back and delete the text-file-wrapper-coding-system
97 code. (it'll be in CVS if necessary to get at it.) [DONE]
98
99 need to verify at some point that non-text-file coding systems work
100 properly when specified. when gzip is working, this would be a good test
101 case. (and consider creating base64 as well!)
102
103 remove extra crap from coding-system-category that checks for chain coding
104 systems. [DONE]
105
106 perhaps make a primitive that gets at coding-system-canonical. [DONE]
107
108 need to test cygwin, compiling the mule packages, get unix-eol stuff
109 working. frank from germany says he doesn't see a lisp backtrace when he
110 gets an error during temacs? verify that this actually gets output.
111
112 consider putting the current language on the modeline, mousable so it can
113 be switched. also consider making the coding system be mousable and the
114 line number (pick a line) and the percentage (pick a percentage).
115
116 oct 6, 2001:
117
118 added code so that debug_print() will output a newline to the mswindows
119 debugging output, not just the console. need to test. [DONE]
120
121 working on problem where all files are being detected as binary. the
122 problem may be that the undecided coding system is getting wrapped with an
123 auto-eol coding system, which it shouldn't be -- but even in this
124 situation, we should get the right results! check the
125 canonicalize-after-coding methods. also, determine_real_coding_system
126 appears to be getting called even when we're not detecting encoding. also,
127 undecided needs a print method to show its params, and chain needs to be
128 updated to show canonicalize_after_coding. check others as well. [DONE]
129
130 oct 5, 2001:
131
132 finished up coding system changes, testing.
133
134 errors byte-compiling files in iso-2022-7-bit. perhaps it's not correctly
135 detecting the encoding?
136
137 noticed a problem in the dfc macros: we call
138 get_coding_system_for_text_file with eol_wrap == 1, to allow for
139 auto-detection of the eol type; but this defeats the check and
140 short-circuit for unicode.
141
142 still need to implement calling determine_real_coding_system() for
143 non-seekable streams. to implement correctly, we need to do our own
144 buffering. [DONE, BUT WITHOUT BUFFERING]
145
146 oct 4, 2001:
147
148 implemented most stuff below.
149
150 need to finish up changes to make_coding_system_1. (i changed the way
151 internal coding systems were handled; i need to create subsidiaries for all
152 types of coding systems, not just text ones.) there's a nasty xfree() crash
153 i was hitting; perhaps it'll go away once all stuff has been rewritten.
154
155 check under cygwin to make sure that when an error occurs during loadup, a
156 backtrace is output.
157
158 as soon as andy releases his new setup, we should put it onto various
159 standard windows software repositories.
160
161 oct 3, 2001:
162
163 added global-tty-map and global-window-system-map. add some stuff to the
164 maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY's, and of course ESC
165 ESC on window systems vs. ESC ESC ESC on TTY's. [TEST]
166
167 was working on integrating the two help-for-tutorial versions (mule,
168 non-mule). [DONE, but test under non-Mule]
169
170 was working on the file-coding changes. need to think more about
171 text-file-wrapper. conclusion i think is that
172 get_coding_system_for_text_file should wrap using a special coding system
173 type called a text-file-wrapper, which inherits from chain, and implements
174 canonicalize-after-decoding to just return the unwrapped coding system. We
175 need to implement inheritance of coding systems, which will certainly come
176 in extremely useful when coding systems get implemented in Lisp, which
177 should happen at some point. (see existing docs about this.) essentially,
178 we have a way of declaring that we inherit from some system, and the
179 appropriate data structures get created, perhaps just an extra inheritance
180 pointer. but when we create the coding system, the extra data needs to be
181 a stretchy array of offsets, pointing to the type-specific data for the
182 coding system type and all its parents. that means that in the methods
183 structure for a coding system (which perhaps should be expanded beyond
184 method, it's just a "class structure") is the index in these arrays of
185 offsets. CODING_SYSTEM_DATA() can take any of the coding system classes
186 (rename type to class!) that make up this class. similarly, a coding
187 system class inherits its methods from the class above unless specifying
188 its own method, and can call the superclass method at any point by either
189 just invoking its name, or conceivably by some macro like
190
191 CALL_SUPER (method, (args))
192
193 similar mods would have to be made to coding stream structures.
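
a rough sketch of the method-inheritance part of this idea, in C. everything here is hypothetical (`cs_class`, `resolve_canonicalize`, `call_super_canonicalize`, and the three sample classes); it only shows a class falling back to its parent's method and a CALL_SUPER-style dispatch, and leaves out the per-class data offsets.

```c
#include <stddef.h>

/* Hypothetical class structure; all names are illustrative. */
struct cs_class {
  const char *name;
  const struct cs_class *parent;     /* NULL for a root class */
  /* canonicalize-after-decoding; NULL means "inherit from parent" */
  const char *(*canonicalize)(const char *wrapped);
};

/* Find the nearest class, starting at cls, that supplies the method. */
const struct cs_class *resolve_canonicalize(const struct cs_class *cls)
{
  while (cls && !cls->canonicalize)
    cls = cls->parent;
  return cls;
}

/* Rough equivalent of CALL_SUPER: skip cls itself and dispatch to the
   first ancestor that supplies the method.  Returns NULL if none does. */
const char *call_super_canonicalize(const struct cs_class *cls,
                                    const char *wrapped)
{
  const struct cs_class *super = resolve_canonicalize(cls->parent);
  return super ? super->canonicalize(wrapped) : NULL;
}

/* Sample methods: chain reports itself, the wrapper returns what it wraps. */
static const char *chain_after_decoding(const char *wrapped)
{
  (void)wrapped;
  return "chain";
}

static const char *wrapper_after_decoding(const char *wrapped)
{
  return wrapped;
}

const struct cs_class chain_class =
  { "chain", NULL, chain_after_decoding };
const struct cs_class undecided_class =
  { "undecided", &chain_class, NULL };               /* inherits */
const struct cs_class wrapper_class =
  { "text-file-wrapper", &chain_class, wrapper_after_decoding };
```

here `undecided_class` gets chain's method for free, which is the "fake things like we currently do" case made explicit.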
194
195 perhaps for the immediate we can just sort of fake things like we currently
196 do with undecided calling some stuff from chain.
197
198 oct 2, 2001:
199
200 need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol.
201 figure out how to fall back to iso-8859-1 as necessary.
202
203 leave the current bindings the way they are for the moment, but bump off
204 M-home and M-end (hardly used), and substitute my buffer movement stuff
205 there. [DONE, but test]
206
207 there's something to be said for combining block of 6 and paragraph,
208 esp. if we make the definition of "paragraph" be so that it skips by 6 when
209 within code. hmm.
210
211 eliminate advertised-undo crap, and similar hacks. [DONE]
212
213 think about obsolete stuff to be eliminated. think about eliminating or
214 dimming obsolete items from hyper-apropos and something similar in
215 completion buffers.
216
217 sep 30, 2001:
218
219 synched up the tutorials with FSF 21.0.105. was rewriting them to favor
220 the cursor keys over the older C-p, etc. keys.
221
222 Got thinking about key bindings again.
223
224 (1) I think that M-up/down and M-C-up/down should be reversed. I use
225 scroll-up/down much more often than motion by paragraph.
226
227 (2) Should we eliminate move by block (of 6) and substitute it for
228 paragraph? This would have the advantage that I could make bindings
229 for buffer change (forward/back buffer, perhaps M-C-up/down. with
230 shift, M-C-S-up/down only goes within the same type (C files, etc.)).
231 alternatively, just bump off beginning-of-defun from C-M-home, since
232 it's on C-M-a already.
233
234 need someone to go over the other tutorials (five new ones, from FSF
235 21.0.105) and fix them up to correspond to the english one.
236
237 shouldn't shift-motion work with C-a and such as well as arrows?
238
239 sep 29, 2001:
240
241 charcount_to_bytecount can also be made to scream -- as can scan_buffer,
242 buffer_mule_signal_inserted_region, others? we should start profiling
243 though before going too far down this line.
244
245 Debug code that causes no slowdown should in general remain in the
246 executable even in the release version because it may be useful (e.g. for
247 people to see the event output). so DEBUG_XEMACS should be rethought.
248 things like use of msvcrtd.dll should be controlled by error_checking on.
249 maybe DEBUG_XEMACS controls general debug code (e.g. use of msvcrtd.dll,
250 asserts abort, error checking), and the actual debugging code should remain
251 always, or be conditionalized on something else
252 (e.g. DEBUGGING_FUNS_PRESENT).
253
254 doc strings in dumped files are displayed with an extra blank line between
255 each line. presumably this is recent? i assume either the change to
256 detect-coding-region or the double-wrapping mentioned below.
257
258 error with coding-system-property on iso-2022-jp-dos. problem is that that
259 coding system is wrapped, so its type shows up as chain, not iso-2022.
260 this is a general problem, and i think the way to fix it is to in essence
261 do late canonicalization -- similar in spirit to what was done long ago,
262 canonicalize_when_code, except that the new coding system (the wrapper) is
263 created only once, either when the original cs is created or when first
264 needed. this way, operations on the coding system work like expected, and
265 you get the same results as currently when decoding/encoding. the only
266 thing tricky is handling canonicalize-after-coding and the ever-tricky
267 double-wrapping problem mentioned below. i think the proper solution is to
268 move the autodetection of eol into the main autodetect type. it can be
269 asked to autodetect eol, coding, or both. for just coding, it does like it
270 currently does. for just eol, it does similar to what it currently does
271 but runs the detection code that convert-eol currently does, and selects
272 the appropriate convert-eol system. when it does both eol and coding, it
273 does something on the order of creating two more autodetect coding systems,
274 one for eol only and one for coding only, and chains them together. when
275 each has detected the appropriate value, the results are combined. this
276 automatically eliminates the double-wrapping problem, removes the need for
277 complicated canonicalize-after-coding stuff in chain, and fixes the problem
278 of autodetect not having a seekable stream because it's hidden inside of a
279 chain. (we presume that in the both-eol-and-coding case, the various
280 autodetect coding streams can communicate with each other appropriately.)
281
282 also, we should solve the problem of internal coding systems floating
283 around and clogging up the list simply by having an "internal" property on
284 cs's and an internal param to coding-system-list (optional; if not given,
285 you don't get the internal ones). [DONE]
286
287 we should try to reduce the size of the from-unicode tables (the dominant
288 memory hog in the tables). one obvious thing is to not store a whole
289 emchar as the mapped-to value, but a short that encodes the octets. [DONE]
290
291 sep 28, 2001:
292
293 need to merge up to latest in trunk.
294
295 add unicode charsets for all non-translatable unicode chars; probably want
296 to extend the concept of charsets to allow for dimension 3 and dimension 4
297 charsets. for the moment we should stick with just dimension 3 charsets;
298 otherwise we run past the current maximum of 4 bytes per emchar. (most code
299 would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in
300 certain code that has intimate knowledge of the representation.
301 e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4,
302 and has special ways of handling each number. with 5 or 6 bytes per char,
303 we'd have to change that code in various ways.) 96x96x96 = 884,736,
304 so with two 96x96x96 charsets, we could tackle all Unicode values
305 representable by UTF-16 and then some -- and only these codepoints will
306 ever have assigned chars, as far as we know.
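
the arithmetic can be checked with a small C sketch. `dim3_pos` and `unicode_to_dim3` are hypothetical names, and the mapping is deliberately naive: it sends every code point through the two charsets and leaves the octets as raw 0..95 values, whereas real code would presumably only handle the non-translatable chars this way and add the usual octet offset.

```c
#include <stdint.h>

#define CHARSET_SIZE   (96u * 96u * 96u)   /* 884,736 positions per charset */
#define UNICODE_LIMIT  0x110000u           /* 1,114,112 code points via UTF-16 */

struct dim3_pos {
  unsigned charset;     /* 0 or 1: which of the two hypothetical charsets */
  unsigned octet[3];    /* each in 0..95 */
};

/* Split a code point into (charset, three base-96 digits).
   Returns 0 on success, -1 if the value is out of range. */
int unicode_to_dim3(uint32_t cp, struct dim3_pos *out)
{
  if (cp >= UNICODE_LIMIT)
    return -1;
  out->charset = cp / CHARSET_SIZE;
  cp %= CHARSET_SIZE;
  out->octet[0] = cp / (96u * 96u);
  out->octet[1] = (cp / 96u) % 96u;
  out->octet[2] = cp % 96u;
  return 0;
}
```

two charsets give 1,769,472 positions, comfortably more than the 1,114,112 code points that UTF-16 can represent.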
307
308 need an easy way of showing the current language environment. some menus
309 need to have the current one checked or whatever. [DONE]
310
311 implement unicode surrogates.
312
313 implement buffer-file-coding-system-when-loaded -- make sure find-file,
314 revert-file, etc. set the coding system [DONE]
315
316 verify all the menu stuff [DONE]
317
318 implemented the entirely-ascii check in buffers. not sure how much gain
319 it'll get us as we already have a known range inside of which is constant
320 time, and with pure-ascii files the known range spans the whole buffer.
321 improved the comment about how bufpos-to-bytind and vice-versa work. [DONE]
322
323 fix double-wrapping of convert-eol: when undecided converts itself to
324 something with a non-autodetect eol, it needs to tell the adjacent
325 convert-eol to reduce itself to nothing.
326
327 need menu item for find file with specified encoding. [DONE]
328
329 renamed coding systems mswindows-### to windows-### to follow the standard
330 in rfc1345. [DONE]
331
332 implemented coding-system-subsidiary-parent [DONE]
333 HAVE_MULE -> MULE in files in nt/ so that depend checking works [DONE]
334
335 need to take the smarter search-all-files-in-dir stuff from my sample init
336 file and put it on the grep menu [DONE]
337
338 added item for revert w/specified encoding; mostly works, but needs fixes.
339 in particular, you get the correct results, but buffer-file-coding-system
340 does not reflect things right. also, there are too many entries. need to
341 split into submenus. there is already split code out there; see if it's
342 generalized and if not make it so. it should only split when there's more
343 than a specified number, and when splitting, split into groups of a
344 specified size, not into a specified number of groups. [DONE]
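
the splitting rule (split only past a threshold, and into groups of a fixed size rather than a fixed number of groups) could look roughly like this in C. `submenu_count` and `submenu_length` are hypothetical helpers, not the existing split code.

```c
/* Decide how many submenus to create for item_count entries.
   Returns 0 when the menu is short enough to leave unsplit. */
int submenu_count(int item_count, int split_threshold, int group_size)
{
  if (item_count <= split_threshold)
    return 0;
  return (item_count + group_size - 1) / group_size;   /* ceiling division */
}

/* Number of entries in submenu `group` (0-based); only the last
   submenu may hold fewer than group_size entries. */
int submenu_length(int item_count, int group_size, int group)
{
  int remaining = item_count - group * group_size;
  if (remaining <= 0)
    return 0;
  return remaining < group_size ? remaining : group_size;
}
```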
345
346 too many entries in the langenv menus; need to split. [DONE]
347
348 sep 27, 2001:
349
350 NOTE: M-x grep for make-string causes crash now. something definitely to
351 do with string changes. check very carefully the diffs and put in those
352 sledgehammer checks. [DONE]
353
354 fix font-lock bug i introduced. [DONE]
355
356 added optimization to strings (keeps track of # of bytes of ascii at the
357 beginning of a string). perhaps should also keep an all-ascii flag to deal
358 with really large (> 2 MB) strings. rewrite code to count ascii-begin to
359 use the 4-or-8-at-a-time stuff in bytecount_to_charcount.
360
361 Error: M-q is causing Invalid Regexp error on the above paragraph. It's
362 not working. I assume it's a side effect of the string stuff. VERIFY!
363 Write sledgehammer checks for strings. [DONE]
364
365 revamped the locale/init stuff so that it tries much harder to get things
366 right. should test a bit more. in particular, test out Describe Language
367 on the various created environments and make sure everything looks right.
368
369 should change the menus: move the submenus on Edit->Mule directly under
370 Edit. add a menu entry on File to say "Reload with specified encoding ->".
371 [DONE]
372
373 Also add "Find File with specified encoding ->". Also add an entry to
374 change the EOL settings for Unix, and implement it.
375
376 decode-coding-region isn't working because it needs to insert a binary
377 (char->byte) converter. [DONE]
378
379 chain should be rearranged to be in decoding order; similar for
380 source/sink-type, other things?
381
382 the detector should check for a magic cookie even without a seekable input.
383 (currently its input is not seekable, because it's hidden within a chain.
384 #### See what we can do about this.)
385
386 provide a way to display various settings, e.g. the current category
387 mappings and priority (see mule-diag; get this working so it's in the
388 path); also a way to print out the likeliness results from a detection,
389 perhaps a debug flag.
390
391 problem with `env', which causes path issues due to `env' in packages.
392 move env code to process, sync with fsf 21.0.105, check that the autoloads
393 in `env' don't cause problems. [DONE]
394
395 8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so
396 detected.
397
398 sep 25, 2001:
399
400 something else to do is review the font selection and fix it so that (e.g.)
401 JISX-0212 can be displayed.
402
403 also, text in widgets needs to be drawn by us so that the correct fonts
404 will be displayed even in multi-lingual text.
405
406 sep 24, 2001:
407
408 the detection system is now properly abstracted. the detectors have been
409 rewritten to include multiple levels of abstraction. now we just need
410 detectors for ascii, binary, and latin-x, as well as more sophisticated
411 detectors in general and further review of the general algorithm for doing
412 detection. (#### Is this written up anywhere?) after that, consider adding
413 error-checking to decoding (VERY IMPORTANT) and verifying the binary
414 correctness of things under unix no-mule.
415
416 sep 23, 2001:
417
418 began to fix the detection system -- adding multiple levels of likelihood
419 and properly abstracting the detectors. the system is in place except for
420 the abstraction of the detector-specific data out of the struct
421 detection_state. we should get things working first before tackling that
422 (which should not be too hard). i'm rewriting algorithms here rather than
423 just converting code, so it's harder. mostly done with everything, but i
424 need to review all detectors except iso2022 and make them properly follow
425 the new way. also write a no-conversion detector. also need to look into
426 the `recode' package and see how (if?) they handle detection, and maybe
427 copy some of the algorithms. also look at recent FSF 21.0 and see if their
428 algorithms have improved.
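
to illustrate what "multiple levels of likelihood" buys, here is a small C sketch. the enum values, `detector_result` and `best_category` are all hypothetical; the real level names and selection rule may well differ.

```c
#include <stddef.h>

/* Hypothetical likelihood scale, lowest to highest. */
enum likelihood {
  LIKELIHOOD_IMPOSSIBLE = 0,
  LIKELIHOOD_UNLIKELY,
  LIKELIHOOD_NEUTRAL,
  LIKELIHOOD_LIKELY,
  LIKELIHOOD_CERTAIN
};

struct detector_result {
  const char *category;
  enum likelihood level;
};

/* Pick the category with the highest likelihood.  Earlier entries win
   ties, so the caller can order the array by category priority.
   Returns NULL if every category was ruled out. */
const struct detector_result *
best_category(const struct detector_result *r, size_t n)
{
  const struct detector_result *best = NULL;
  for (size_t i = 0; i < n; i++)
    if (r[i].level > LIKELIHOOD_IMPOSSIBLE
        && (!best || r[i].level > best->level))
      best = &r[i];
  return best;
}
```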
429
430 sep 22, 2001:
431
432 fixed gc bugs from yesterday.
433 fixed truename bug.
434 close/finalize stuff works.
435 eliminated notyet stuff in syswindows.h.
436 eliminated special code in tstr_to_c_string.
437 fixed pdump problems. (many of them, mostly latent bugs, ugh)
438 fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a
439 sscanf bug, but subtly different behavior w.r.t. whitespace in the format
440 string, combined with a debugger that sucks ROCKS!! and consistently
441 outputs garbage for variable values.)
442 main stuff to test is the handling of EOF recognition vs. binary
443 (i.e. check what the default settings are under Unix). then we may have
444 something that WORKS on all platforms!!! (Also need to test Windows
445 non-Mule)
446
447 sep 21, 2001:
448
449 finished redoing the close/finalize stuff in the lstream code. but i
450 encountered again the nasty bug mentioned on sep 15 that disappeared on its
451 own then. the problem seems to be that the finalize method of some of the
452 lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(),
453 which is a no-no when we're inside of garbage-collection and the object
454 passed to free_managed_lcrecord() is unmarked, and about to be released by
455 the gc mechanism -- the free lists will end up with xfree()d objects on
456 them, which is very bad. we need to modify free_managed_lcrecord() to
457 check if we're in gc and the object is unmarked, and ignore it rather than
458 move it to the free list. [DONE]
459
460 (#### What we really need to do is do what Java and C# do w.r.t. their
461 finalize methods: For objects with finalizers, when they're about to be
462 freed, leave them marked, run the finalizer, and set another bit on them
463 indicating that the finalizer has run. Next GC cycle, the objects will
464 again come up for freeing, and this time the sweeper notices that the
465 finalize method has already been called, and frees them for good (provided
466 that a finalize method didn't do something to make the object alive
467 again).)
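
a toy C model of that two-cycle scheme. the `object` struct, `sweep`, and the counting finalizer are hypothetical and far simpler than real lcrecords; the `freed` flag stands in for actually releasing memory.

```c
#include <stdbool.h>
#include <stddef.h>

struct object {
  bool marked;         /* set by the mark phase when reachable */
  bool finalized;      /* the finalizer has already run once */
  bool freed;          /* stands in for releasing the memory */
  void (*finalize)(struct object *);   /* NULL if none */
};

/* One sweep over a heap of n objects. */
void sweep(struct object *heap, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    struct object *o = &heap[i];
    if (o->freed)
      continue;
    if (o->marked) {
      o->marked = false;        /* reachable: reset for the next cycle */
      continue;
    }
    if (o->finalize && !o->finalized) {
      o->finalize(o);           /* first time unreachable: finalize only */
      o->finalized = true;
      continue;                 /* survives until the next cycle */
    }
    o->freed = true;            /* no finalizer, or it already ran */
  }
}

/* Demo finalizer used to observe how often finalization happens. */
int finalize_calls = 0;
void count_finalize(struct object *o)
{
  (void)o;
  finalize_calls++;
}
```

if a finalizer makes the object reachable again, the next mark phase sets `marked` and the object is kept, which matches the proviso above.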
468
469 sep 20, 2001:
470
471 redid the lstream code so there is only one coding stream. combined the
472 various doubled coding stream methods into one; i'm a little bit unsure of
473 this last part, though, as the results of combining the two together seem
474 unclean. got it to compile, but it crashes in loadup. need to go through
475 and rehash the close vs. finalize stuff, as the problem was stuff getting
476 freed too quickly, before the canonicalize-after-decoding was run. should
477 eliminate entirely CODING_STATE_END and use a different method (close
478 coding stream). rewrite to use these two. make sure they're called in the
479 right places. Lstream_close on a stream should *NOT* do finalizing.
480 finalize only on delete. [DONE]
481
482 in general i'd like to see the flags eliminated and converted to
483 bit-fields. also, rewriting the methods to take advantage of rejecting
484 should make it possible to eliminate much of the state in the various
485 methods, esp. including the flags. need to test this is working, though --
486 reduce the buffer size down very low and try files with only CRLF's in
487 them, with one offset by a byte from the other, and see if we correctly
488 handle rejection.
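
a sketch of a stateless CRLF decoder that relies on rejection, in C. `crlf_decode` and its signature are hypothetical; the point is that a CR at the very end of a chunk is left unconsumed so the caller presents it again with the following data, instead of remembering it in a flag.

```c
#include <stddef.h>

/* Convert CRLF to LF; a lone CR passes through unchanged.
   If the chunk ends in CR and more input may follow (at_eof == 0), that
   byte is rejected: it is not consumed and must be re-presented at the
   start of the next chunk.  `out` needs room for len bytes.
   Returns bytes written; *consumed receives bytes taken from `in`. */
size_t crlf_decode(const char *in, size_t len, int at_eof,
                   char *out, size_t *consumed)
{
  size_t i = 0, n = 0;
  while (i < len) {
    if (in[i] == '\r') {
      if (i + 1 == len && !at_eof)
        break;                          /* reject: cannot decide yet */
      if (i + 1 < len && in[i + 1] == '\n') {
        out[n++] = '\n';
        i += 2;
        continue;
      }
    }
    out[n++] = in[i++];
  }
  *consumed = i;
  return n;
}
```

feeding a CRLF-only file through this with a very small chunk size, once at each byte offset, exercises exactly the boundary case described above. chunks must be at least 2 bytes, otherwise a rejected CR can never make progress.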
489
490 still have the problem with incorrectly truenaming files.
491
492
493 sep 19, 2001:
494
495 bug reported: crash while closing lstreams.
496
497 the lstream/coding system close code needs revamping. we need to document
498 that order of closing lstreams is very important, and make sure we're
499 consistent. furthermore, chain and undecided lstreams need to close their
500 underneath lstreams when they receive the EOF signal (there may be data in
501 the underneath streams waiting to come out), not when they themselves are
502 closed. [DONE]
503
504 (if only we had proper inheritance. i think in any case we should
505 simulate it for the chain coding stream -- write things in such a way that
506 undecided can use the chain coding stream and not have to duplicate
507 anything itself.)
508
509 in general we need to carefully think through the closing process to make
510 sure everything always works correctly and in the right order. also check
511 very carefully to make sure there are no dangling pointers to deleted
512 objects floating around.
513
514 move the docs for the lstream functions to the functions themselves, not
515 the header files. document more carefully what exactly Lstream_delete()
516 means and how it's used, what the connections are between Lstream_close(),
517 Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE]
518
519 additional error-checking: consider deadbeefing the memory in objects
520 stored in lcrecord free lists; furthermore, consider whether lifo or fifo
521 is correct; under error-checking, we should perhaps be doing fifo, and
522 setting a minimum number of objects on the lists that's quite large so that
523 it's highly likely that any erroneous accesses to freed objects will go
524 into such deadbeefed memory and cause crashes. also, at the earliest
525 available opportunity, go through all freed memory and check for any
526 consistency failures (overwrites of the deadbeef), crashing if so. perhaps
527 we could have some sort of id for each block, to easier trace where the
528 offending block came from. (all of these ideas are present in the debug
529 system malloc from VC++, plus more stuff.) there's similar code i wrote
530 sitting somewhere (in free-hook.c? doesn't appear so. we need to delete the
531 blocking stuff out of there!). also look into using the debug system
532 malloc from VC++, which has lots of cool stuff in it. we even have the
533 sources. that means compiling under pdump, which would be a good idea
534 anyway. set it as the default. (but then, we need to remove the
535 requirement that Xpm be a DLL, which is extremely annoying. look into
536 this.)
537
538 test the windows code page coding systems recently created.
539
540 problems reading my mail files -- 1personal appears to hang, others come up
541 with lots of ^M's. investigate.
542
543 test the enum functions i just wrote, and finish them.
544
545 still pdump problems.
546
547 sep 18, 2001:
548
549 critical-quit broken sometime after aug 25.
550
551 -- fixed critical quit.
552 -- fixed process problems.
553 -- print routines work. (no routine for ccl, though)
554 -- can read and write unicode files, and they can still be read by some
555 other program
556 -- defaults should come up correctly -- mswindows-multibyte is general.
557
558 still need to test matej's stuff.
559 seems ok with multibyte stuff but needs more testing.
560
561 sep 17, 2001:
562
563 !!!!! something broken with processes !!!!! cannot send mail anymore. must
564 investigate.
565
566 sep 17, 2001:
567
568 on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting
569 woozy and can't concentrate.
570
571 just finished getting assorted fixups to the main branch committed, so it
572 will compile under C++ (Andy committed some code that broke C++ builds).
573 cup'd the code into the fixtypes workspace, updated the tags appropriately.
574 i've created the appropriate log message, sitting in fixtypes.txt in
575 /src/xemacs; perhaps it should go into a README. now i just have to build
576 on everything (it's currently building), verify it's ok, run patcher-mail,
577 commit, send.
578
579 my mule ws is also very close. need to:
580
581 -- test the new print routines.
582 -- test it can read and write unicode files, and they can still be read by
583 some other program.
584 -- try to see if unicode can be auto-detected properly.
585 -- test it can read and write multibyte files in a few different formats.
586 currently can't recognize them, but if you set the cs right, it should
587 work.
588 -- examine the test files sent by matej and see if we can handle them.
589
590 sep 15, 2001:
591
592 more eol fixing. this stuff is utter crap.
593
594 currently we wrap coding systems with convert-eol-autodetect when we create
595 them in make_coding_system_1. i had a feeling that this would be a
596 problem, and indeed it is -- when autodetecting with `undecided', for
597 example, we end up with multiple layers of eol conversion. to avoid this,
598 we need to do the eol wrapping *ONLY* when we actually retrieve a coding
599 system in places such as insert-file-contents. these places are
600 insert-file-contents, load, process input, call-process-internal,
601 encode/decode/detect-coding-region, database input, ...
602
603 (later) it's fixed, and things basically work. NOTE: for some reason,
604 adding code to wrap coding systems with convert-eol-lf when eol-type == lf
605 results in crashing during garbage collection in some pretty obscure place
606 -- an lstream is free when it shouldn't be. this is a bad sign. i guess
607 something might be getting initialized too early?
608
609 we still need to fix the canonicalization-after-decoding code to avoid
610 problems with coding systems like `internal-7' showing up. basically, when
611 eol==lf is detected, nil should be returned, and the callers should handle
612 it appropriately, eliding when necessary. chain needs to recognize when
613 it's got only one (or even 0) items in the chain, and elide out the chain.
614
615 sep 11, 2001: the day that will live in infamy.
616
617 rewrite of sep 9 entry about formats:
618
619 when calling make-coding-system, the name can be a cons of (format1 .
620 format2), specifying that it decodes format1->format2 and encodes the other
621 way. if only one name is given, that is assumed to be format1, and the
622 other is either `external' or `internal' depending on the end type.
623 normally the user when decoding gives the decoding order in formats, but
624 can leave off the last one, `internal', which is assumed. a multichain
625 might look like gzip|multibyte|unicode, using the coding systems named
626 `gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
627 is by searching for gzip->multibyte; if not found, look for gzip->external
628 or gzip->internal. (In general we automatically do conversion between
629 internal and external as necessary: thus gzip|crlf does the expected, and
630 maps to gzip->external, external->internal, crlf->internal, which when
631 fully specified would be gzip|external:external|internal:crlf|internal --
632 see below.) To forcibly fit together two converters that have explicitly
633 specified and incompatible names (say you have unicode->multibyte and
634 iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
635 case are compatible), you can force-cast using :, like this:
636 ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
637 internal and external formats, the conversion happens automatically.)
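
A rough Python sketch of how such a spec string could be split into steps, assuming `|` separates formats within a segment (each adjacent pair being one converter) and `:` marks a force-cast between segments. It does not implement the fallback lookup (gzip->external, gzip->internal) or the automatic external/internal conversion; `parse_chain_spec` and the step tuples are hypothetical.

```python
def parse_chain_spec(spec):
    """Split e.g. 'ebcdic|iso8859-1:multibyte|unicode' into steps.

    Each step is ('convert', src, dst) for a converter to look up, or
    ('cast', src, dst) for a force-cast where nothing is looked up.
    """
    steps = []
    segments = spec.split(":")
    for index, segment in enumerate(segments):
        formats = segment.split("|")
        for src, dst in zip(formats, formats[1:]):
            steps.append(("convert", src, dst))
        if index + 1 < len(segments):
            next_first = segments[index + 1].split("|")[0]
            steps.append(("cast", formats[-1], next_first))
    return steps
```

With this reading, the fully specified form `gzip|external:external|internal:crlf|internal` yields the three conversions named in the text, with a cast between each pair of segments.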
638
639
640 sep 10, 2001:
641
642 moved the autodetection stuff (both codesys and eol) into particular coding
643 systems -- `undecided' and `convert-eol' (type == `autodetect'). needs
644 lots of work. still need to search through the rest of the code and find
645 any remaining auto-detect code and move it into the undecided coding
646 system. need to modify make-coding-system so that it spits out
647 auto-detecting versions of all text-file coding systems unless we say not
648 to. need to eliminate entirely the EOF flag from both the stream info and the
649 coding system; have only the original-eof flag. in
650 coding_system_from_mask, need to check that the returned value is not of
651 type `undecided', falling back to no-conversion if so. also need to make
652 sure we wrap everything appropriate for text-files -- i removed the
653 wrapping on set-coding-category-list or whatever (need to check all those
654 files to make sure all wrapping is removed). need to review carefully the
655 new code in `undecided' to make sure it works and preserves the same logic
656 as previously. need to review the closing and rewinding behavior of chain
657 and undecided (same -- should really consolidate into helper routines, so
658 that any coding system can embed a chain in it) -- make sure the dynarr's
659 are getting their data flushed out as necessary, rewound/closed in the
660 right order, no missing steps, etc.
661
662 also split out mule stuff into mule-coding.c. work done on
663 configure/xemacs.mak/Makefiles not done yet. work on emacs.c/symsinit.h to
664 interface with the new init functions not done yet.
665
666 also put in a few declarations of the way i think the abstracted detection
667 stuff ought to go. DON'T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH
668 AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED.
669
670 really need a version of cvs-mods that reports only the current directory.
671 WRITE THIS! use it to implement a better cvs-checkin.
672
673 sep 9, 2001:
674
675 implemented a gzip coding system. unfortunately, doesn't quite work right
676 because it doesn't handle the gzip headers -- it just reads and writes raw
677 zlib data. there's no function in the library to skip past the header, but
678 we do have some code out of the library that we can snarf that implements
679 header parsing. we need to snarf that, store it, and output it again at
680 the beginning when encoding. in the process, we should create a "get next
681 byte" macro that bails out when there are no more. using this, we set up a
682 nice way of doing most stuff statelessly -- if we have to bail, we reject
683 everything back to the sync point. also need to fix up the autodetection
684 of zlib in configure.in.
685
686 BIG problems with eol. finished up everything i thought i would need to
687 get eol stuff working, but no -- when you have mswindows-unicode, with its
688 eol set to autodetect, the detection routines themselves do the autodetect
689 (first), and fail (they report CR on CRLF because of the NUL byte between
690 the CR and the LF) since they're not looking at ascii data. with a chain
691 it's similarly bad. for mswindows-multibyte, for example, which is a chain
692 unicode->unicode-to-multibyte, autodetection happens inside of the chain,
693 both when unicode and unicode-to-multibyte are active. we could twiddle
694 around with the eol flags to try to deal with this, but it's gonna be a big
695 mess, which is exactly what we're trying to avoid. what we basically want
696 is to entirely rip out all EOL settings from either the coding system or
697 the stream (yes, there are two! one might say autodetect, and then the
698 stream contains the actual detected value). instead, we simply create an
699 eol-autodetect coding system -- or rather, it's part of the convert-eol
700 coding system. convert-eol, type = autodetect, does autodetection the
701 first time it gets data sent to it to decode, and thereafter sets a stream
702 parameter indicating the actual eol type for this stream. this means that
703 all autodetect coding systems, as created by `make-coding-system', really
704 are chains with a convert-eol at the beginning. only subsidiary xxx-unix
705 has no wrapping at all. this should allow eol detection of gzip, unicode,
706 etc. for that matter, general autodetection should be entirely
707 encapsulated inside of the `autodetect' coding system, with no
708 eol-autodetection -- the chain becomes convert-eol (autodetect) ->
709 autodetect or perhaps backwards. the generic autodetect similarly has a
710 coding-system in its stream methods, and needs somehow or other to insert
711 the detected coding-system into the chain. either it contains a chain
712 inside of it (perhaps it *IS* a chain), or there's some magic involving
713 canonicalization-type switcherooing in the middle of a decode. either way,
714 once everything is good and done and we want to save the coding system so
715 it can be used later, we need to do another sort of canonicalization --
716 converting auto-detect-type coding systems into the detected systems.
717 again, a coding-system method, with some magic currently so that
718 subsidiaries get properly used rather than something that's new but
719 equivalent to subsidiaries. (#### perhaps we could use a hash table to
720 avoid recreating coding systems when not necessary. but that would require
721 that coding systems be immutable from external, and i'm not sure that's the
722 case.)
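
To see why byte-level detection reports CR on CRLF for Unicode data, here is a small Python sketch. It is illustrative only; both detector functions are made up for this note and are not the XEmacs detection routines. In UTF-16LE the CR and LF are separated by a NUL byte, so a detector that checks the byte after CR never sees the LF, while the same check on decoded characters gives the right answer.

```python
def detect_eol_bytes(data: bytes) -> str:
    """Naive byte-level detector: look at the byte after the first CR."""
    cr = data.find(b"\r")
    if cr == -1:
        return "lf" if b"\n" in data else "unknown"
    return "crlf" if data[cr + 1:cr + 2] == b"\n" else "cr"


def detect_eol_text(text: str) -> str:
    """Same logic, but applied to already-decoded characters."""
    cr = text.find("\r")
    if cr == -1:
        return "lf" if "\n" in text else "unknown"
    return "crlf" if text[cr + 1:cr + 2] == "\n" else "cr"


raw = "one\r\ntwo\r\n".encode("utf-16-le")
wrong = detect_eol_bytes(raw)                      # sees CR, then NUL
right = detect_eol_text(raw.decode("utf-16-le"))   # sees CR, then LF
```

This is the argument for making eol detection a separate converter that runs on the character side of the chain rather than a flag inside every coding system.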
723
724 i really think, after all, that i should reverse the naming of everything
725 in chain and source-sink-type -- they should be decoding-centric. later
726 on, if/when we come up with the proper way to make it totally symmetrical,
727 we'll be fine whether before then we were encoding or decoding centric.
728
729
730 sep 9, 2001:
731
732 investigated eol parameter.
733 implemented handling in make-coding-system of eol-cr and eol-crlf.
734 fixed calls everywhere to Fget_coding_system / Ffind_coding_system to
735 reject non-char->byte coding systems.
736
737 still need to handle "query eol type using coding-system-property" so it
738 magically returns the right type by parsing the chain.
739
740 no work done on formats, as mentioned below. we should consider using :
741 instead of || to indicate casting.
742
743 early sep 9, 2001:
744
745 renamed some codesys properties: `list' in chain -> chain; `subtype' in
746 unicode -> type. everything compiles again and sort of works; some CRLF
747 problems that may resolve themselves when i finish the convert-eol stuff.
748 the stuff to create subsidiaries has been rewritten to use chains; but i
749 still need to investigate how the EOL type parameter is used. also, still
750 need to implement this: when a coding system is created, and its eol type
751 is not autodetect or lf, a chain needs to be created and returned. i think
752 that what needs to happen is that the eol type can only be set to
753 autodetect or lf; later on this should be changed to simply be either
754 autodetect or not (but that would require ripping out the eol converting
755 stuff in the various coding systems), and eventually we will do the work on
756 the detection mechanism so it can do chain detection; then we won't need an
757 eol autodetect setting at all. i think there's a way to query the eol type
758 of a coding system; this should check to see if the coding system is a
759 chain and there's a convert-eol at the front; if so, the eol type comes
760 from the type of the convert-eol.
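
A small Python sketch of that query, assuming a coding system is modelled as a dict with a `type`, a chain holds its items under `chain`, and a convert-eol item records its kind under `eol`. These keys and the function name are hypothetical stand-ins for the real properties.

```python
def coding_system_eol_type(cs):
    """Return the eol type, taking it from a leading convert-eol if present."""
    if cs.get("type") == "chain" and cs.get("chain"):
        first = cs["chain"][0]
        if first.get("type") == "convert-eol":
            return first["eol"]
    # Assumption: a system without a convert-eol wrapper reports its own
    # setting, which by the rule above is either autodetect or lf.
    return cs.get("eol", "lf")


dos_latin1 = {
    "type": "chain",
    "chain": [{"type": "convert-eol", "eol": "crlf"},
              {"type": "iso2022", "eol": "lf"}],
}
```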
761
762 also check out everywhere that Fget_coding_system or Ffind_coding_system is
763 called, and see whether anything but a char->byte system can be tolerated.
764 create a new function for all the places that only want char->byte,
765 something like get_coding_system_char_to_byte_only.
766
767 think about specifying formats in make-coding-system. perhaps the name can
768 be a cons of (format1, format2), specifying that it encodes
769 format1->format2 and decodes the other way. if only one name is given,
770 that is assumed to be format2, and the other is either `byte' or `char'
771 depending on the end type. normally the user when decoding gives the
772 decoding order in formats, but can leave off the last one, `char', which is
773 assumed. perhaps we should say `internal' instead of `char' and `external'
774 instead of byte. a multichain might look like gzip|multibyte|unicode,
775 using the coding systems named `gzip', `(unicode . multibyte)' and
776 `unicode'. we would have to allow something where one format is given only
777 as generic byte/char or internal/external to fit with any of the same
778 byte/char type. when forcibly fitting together two converters that have
779 explicitly specified and incompatible names (say you have
780 unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte
781 and iso8859-1 in this case are compatible), you can force-cast using ||,
782 like this: ebcdic|iso8859-1||multibyte|unicode. this will also force
783 external->internal translation as necessary:
784 unicode|multibyte||crlf|internal does unicode->multibyte,
785 external->internal, crlf->internal. perhaps you'd need to put in the
786 internal translation, like this: unicode|multibyte|internal||crlf|internal,
787 which means unicode->multibyte, external->internal (multibyte is compatible
788 with external); force-cast to crlf format and convert crlf->internal.
789
790 even later: Sep 8, 2001:
791
792 chain doesn't need to set character mode, that happens automatically when
793 the coding systems are created. fixed chain to return correct source/sink
794 type for itself and to check the compatibility of source/sink types in its
795 chain. fixed decode/encode-coding-region to check the source and sink
796 types of the coding system performing the conversion and insert appropriate
797 byte->char/char->byte converters (aka "binary" coding system). fixed
798 set-coding-category-system to only accept the traditional
799 encode-char-to-byte types of coding systems.
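
The source/sink check on a chain might look roughly like this in Python. It is a sketch under assumptions: items are listed in encode order as `(name, source, sink)` tuples with types `"char"` or `"byte"`, and `chain_end_types` is an invented name.

```python
def chain_end_types(chain):
    """Check adjacent items fit together; return the chain's own (source, sink)."""
    if not chain:
        raise ValueError("empty chain")
    for (name_a, _, sink_a), (name_b, source_b, _) in zip(chain, chain[1:]):
        if sink_a != source_b:
            raise ValueError(
                f"{name_a} produces {sink_a} but {name_b} expects {source_b}"
            )
    return chain[0][1], chain[-1][2]


# Encode order: characters -> unicode bytes -> multibyte bytes.
mswindows_multibyte = [
    ("unicode", "char", "byte"),
    ("unicode-to-multibyte", "byte", "byte"),
]
```

A caller such as decode-coding-region can then compare the returned end types with what it needs and add a byte<->char converter on whichever end does not match.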
800
801 still need to extend chain to specify the parameters mentioned below,
802 esp. "reverse". also need to extend the print mechanism for chain so it
803 prints out the chain. probably this should be general: have a new method
804 to return all properties, and output those properties. you could also
805 implement a read syntax for coding systems this way.
806
807 still need to implement convert-eol and finish up the rest of the eol stuff
808 mentioned below.
809
810 later September 7, 2001: (more like Sep 8)
811
812 moved many Lisp_Coding_System * params to Lisp_Object. In general this is
813 the way to go, and if we ever implement a copying GC, we will never want to
814 be passing direct pointers around. With no error-checking, we lose no
815 cycles using Lisp_Objects in place of pointers -- the Lisp_Object itself is
816 nothing but a pointer, and so all the casts and "dereferences" boil down to
817 nothing.
818
819 Clarified and cleaned up the "character mode" on streams, and documented
820 who (caller or object itself) has the right to be setting character mode on
821 a stream, depending on whether it's a read or write stream. changed
822 conversion_end_type method and enum source_sink_type to return
823 encoding-centric values, rather than decoding-centric. for the moment,
824 we're going to be entirely encoding-centric in everything; we can rethink
825 later. fixed coding systems so that the decode and encode methods are
826 guaranteed to receive only full characters, if that's the source type of
827 the data, as per conversion_end_type.
828
829 still need to fix the chain method so that it correctly sets the character
830 mode on all the lstreams in it and checks the source/sink types to be
831 compatible. also fix decode-coding-string and friends to put the
832 appropriate byte->character (i.e. no-conversion) coding systems on the ends
833 as necessary so that the final ends are both character. also add to chain
834 a parameter giving the ability to switch the direction of conversion of any
835 particular item in the chain (i.e. swap encoding and decoding). i think
836 what we really want to do is allow for arbitrary parameters to be put onto
837 a particular coding system in the chain, of which the only one so far is
838 swap-encode-decode. don't need too much codage here for that, but make the
839 design extendable.
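
One possible shape for per-item parameters, sketched in Python. Each chain item carries a dict of parameters, of which `swap-encode-decode` is the only one understood so far; everything else about the classes below is hypothetical, and base64 merely stands in for a real coding system.

```python
import base64
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Codec:
    name: str
    encode: Callable[[bytes], bytes]
    decode: Callable[[bytes], bytes]


@dataclass
class ChainItem:
    codec: Codec
    params: Dict[str, object] = field(default_factory=dict)

    def encoder(self):
        swap = self.params.get("swap-encode-decode", False)
        return self.codec.decode if swap else self.codec.encode

    def decoder(self):
        swap = self.params.get("swap-encode-decode", False)
        return self.codec.encode if swap else self.codec.decode


def chain_encode(items, data):
    for item in items:
        data = item.encoder()(data)
    return data


def chain_decode(items, data):
    for item in reversed(items):
        data = item.decoder()(data)
    return data


BASE64 = Codec("base64", base64.b64encode, base64.b64decode)
swapped = [ChainItem(BASE64, {"swap-encode-decode": True})]
```

Keeping the parameters in a dict means new per-item options can be added later without changing the chain's structure.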
840
841
842
843 September 7, 2001:
844
845 just added a return value from the decode and encode methods of a coding
846 system, so that some of the data can get rejected. fixed the calling
847 routines to handle this. need to investigate when and whether the coding
848 lstream is set to character mode, so that the decode/encode methods only
849 get whole characters. if not, we should do so, according to the source
850 type of these methods. also need to implement the convert_eol coding
851 system, and fix the subsidiary coding systems (and in general, any coding
852 system where the eol type is specified and is not LF) to be chains
853 involving convert_eol.
854
855 after everything is working, need to remove eol handling from encode/decode
856 methods and eventually consider rewriting (simplifying) them given the
857 reject ability.
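
A Python sketch of what the reject return value buys, using UTF-8 as a stand-in coding system. The decode step returns how many trailing bytes it could not use yet, and the caller feeds them back with the next chunk. Function names are invented, and the error handling is deliberately crude.

```python
def decode_utf8_chunk(data: bytes):
    """Decode as much as possible; return (text, rejected_byte_count)."""
    # A UTF-8 character is at most 4 bytes, so at most 3 trailing bytes
    # can belong to a character that is not complete yet.
    for rejected in range(0, min(3, len(data)) + 1):
        try:
            return data[:len(data) - rejected].decode("utf-8"), rejected
        except UnicodeDecodeError:
            continue
    raise ValueError("invalid UTF-8, not just a truncated character")


def decode_stream(chunks):
    """Caller side: keep rejected bytes and prepend them to the next chunk."""
    pending = b""
    out = []
    for chunk in chunks:
        data = pending + chunk
        text, rejected = decode_utf8_chunk(data)
        out.append(text)
        pending = data[len(data) - rejected:]
    return "".join(out), pending
```

With this in place the decode method itself needs no saved state for partial characters, which is the simplification hoped for above.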
858
859 September 5, 2001:
860
861 -- need to organize this. get everything below into the TODO list.
862 CVS the TODO list frequently so i can delete old stuff. prioritize
863 it!!!!!!!!!
864
865 -- move README.ben-mule... to STATUS.ben-mule...; use README for
866 intro, overview of what's new, what's broken, how to use the
867 features, etc.
868
869 -- need a global and local coding-category-precedence list, which get
870 merged.
871
872 -- finished the BOM support. also finished something not listed
873 below, expansion to the auto-generator of Unicode-encapsulation to
874 support bracketing code with #if ... #endif, for Cygwin and MINGW
875 problems, e.g. This is tested; appears to work.
876
877 -- need to add more multibyte coding systems now that we have various
878 properties to specify them. need to add DEFUN's for mac-code-page
879 and ebcdic-code-page for completeness. need to rethink the whole
880 way that the priority list works. it will continue to be total
881 junk until multiple levels of likeliness get implemented.
882
883 -- need to finish up the stuff about the various defaults. [need to
884 investigate more generally where all the different default values
885 are that control encoding. (there are six places or so.) need to
886 list them in make-coding-system docs and put pointers
887 elsewhere. [[[[#### what interface to specify that this default
888 should be unicode? a "Unicode" language environment seems too
889 drastic, as the language environment controls much more.]]]] even
890 skipping the Unicode stuff here, we need to survey and list the
891 variables that control coding page behavior and determine how they
892 need to be set for various possible scenarios:
893
894 -- total binary: no detection at all.
895 -- raw-text only: wants only autodetection of line endings, nothing else.
896 -- "standard Windows environment": tries for Unicode, falls back on
897 code page encoding.
898 -- some sort of East European environment, and Russian.
899 -- some sort of standard Japanese Windows environment.
900 -- standard Chinese Windows environments (traditional and simplified)
901 -- various Unix environments (European, Japanese, Russian, etc.)
902 -- Unicode support in all of these when it's reasonable
903
904 These really require multiple likelihood levels to be fully
905 implementable. We should see what can be done ("gracefully fall
906 back") with single likelihood level. need lots of testing.
907
908 -- need to fix the truename problem.
909
910 -- lots of testing: need to test all of the stuff above and below that's recently been implemented.
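
For the BOM support mentioned in the list above, here is a hedged Python sketch of the general technique, not the XEmacs implementation: sniff the BOM when reading, remember whether one was present, and write it back when saving so the file round-trips. Function names and the `fallback` parameter are assumptions.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM begins with
# the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback="utf-8"):
    encoding, bom_len = sniff_bom(data)
    had_bom = encoding is not None
    encoding = encoding or fallback
    return data[bom_len:].decode(encoding), encoding, had_bom


def encode_with_bom(text: str, encoding: str, had_bom: bool):
    bom = b""
    if had_bom:
        bom = next((b for b, name in BOMS if name == encoding), b"")
    return bom + text.encode(encoding)
```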
911
912
913
914 September 4, 2001:
915
916 mostly everything compiles. currently there is a crash in
917 parse-unicode-translation-table, and Cygwin/Mule won't run. it may
918 well be a bug in the sscanf() in Cygwin.
919
920 working on today:
921
922 -- adding BOM support for Unicode coding systems. mostly there, but
923 need to finish adding BOM support to the detection routines. then test.
924 -- adding properties to unicode-to-multibyte to specify the coding
925 system in various flexible ways, e.g. directly specified code page
926 or ansi or oem code page of specified locale, current locale,
927 user-default or system-default locale. need to test.
928 -- creating a `multibyte' coding system, with the same parameters as
929 unicode-to-multibyte and which resolves at coding-system-creation
930 time to the appropriate chain. creating the underlying mechanism
931 to allow such under-the-scenes switcheroo. need to test.
932 -- set default-value of buffer-file-coding-system to
933 mswindows-multibyte, as Matej said it should be. need to test.
934 need to investigate more generally where all the different default
935 values are that control encoding. (there are six places or so.)
936 need to list them in make-coding-system docs and put pointers
937 elsewhere. #### what interface to specify that this default should
938 be unicode? a "Unicode" language environment seems too drastic, as
939 the language environment controls much more.
940 -- thinking about adding multiple levels of certainty to the detection
941 schemes, instead of just a mask. eventually, we need to totally
942 abstract things, but that can more easily be done in many steps. (we
943 need multiple levels of likelihood to more reasonably support a
944 Windows environment with code-page type files. currently, in order
945 to get them detected, we have to put them first, because they can
946 look like lots of other things; but then, other encodings don't get
947 detected. with multiple levels of likelihood, we still put the
948 code-page categories first, but they will return low levels of
949 likelihood. Lower-down encodings may be able to return higher
950 levels of likelihood, and will get taken preferentially.)
951 -- making it so you cannot disable file-coding, but you get an
952 equivalent default on Unix non-Mule systems where all defaults are
953 `binary'. need to test!!!!!!!!!
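
The multiple-levels-of-likelihood idea from the list above can be sketched in Python as follows. The level names, the detector functions and the tie-breaking rule (earlier category wins) are all assumptions made for illustration; the real detection code works on a mask.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    IMPOSSIBLE = 0
    POSSIBLE = 1
    LIKELY = 2
    CERTAIN = 3


def detect_code_page(data: bytes) -> Likelihood:
    # A single-byte code page accepts almost anything, so it can only
    # ever claim a low level of likelihood.
    return Likelihood.POSSIBLE


def detect_utf8(data: bytes) -> Likelihood:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return Likelihood.IMPOSSIBLE
    has_high = any(byte >= 0x80 for byte in data)
    return Likelihood.LIKELY if has_high else Likelihood.POSSIBLE


def detect(data: bytes, categories):
    """categories is in priority order; the highest likelihood wins."""
    best_name, best_level = None, Likelihood.IMPOSSIBLE
    for name, detector in categories:
        level = detector(data)
        if level > best_level:  # ties stay with the earlier category
            best_name, best_level = name, level
    return best_name, best_level


CATEGORIES = [("code-page", detect_code_page), ("utf-8", detect_utf8)]
```

The code-page category still comes first, but it no longer shadows a later category that is more confident about the data.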
954
955 Matej (mostly, + some others) notes the following problems, and here
956 are possible solutions:
957
958 -- he wants the defaults to work right. [figure out what those
959 defaults are. i presume they are auto-detection of data in current
960 code page and in unicode, and new files have current code page set
961 as their output encoding.]
962
963 -- too easy to lose data with incorrect encodings. [need to set up an
964 error system for encoding/decoding. extremely important but a
965 little tricky to implement so let's deal with other issues now.]
966
967 -- EOL isn't always detected correctly. [#### ?? need examples]
968
969 -- truename isn't working: c:\t.txt and c:\tmp.txt have the same truename.
970 [should be easy to fix]
971
972 -- unicode files lose the BOM mark. [working on this]
973
974 -- command-line utilities use OEM. [actually it seems more
975 complicated. it seems they use the codepage of the console. we
976 may be able to set that, e.g. to UTF8, before we invoke a command.
977 need to investigate.]
978
979 -- no way to handle unicode characters not recognized as charsets. [we
980 need to create something like 8 private 2-dimensional charsets to
981 handle all BMP Unicode chars. Obviously this is a stopgap
982 solution. Switching to Unicode internal will ultimately make life
983 far easier and remove the BMP limitation. but for now it will
984 work. we translate all characters where we have charsets into
985 chars in those charsets, and the remainder in a unicode charset.
986 that way we can save them out again and guarantee no data loss with
987 unicode. this creates font problems, though ...]
988
989 -- problems with xemacs font handling. [xemacs font handling is not
990 sophisticated enough. it goes on a charset granularity basis and
991 only looks for a font whose name contains the corresponding windows
992 charset in it. with unicode this fails in various ways. for one
993 the granularity needs to be single character, so that those unicode
994 charsets mentioned above work; and it needs to query the font to
995 see what unicode ranges it supports, rather than just looking at
996 the charset ending.]
997
998
999
1000 August 28, 2001:
1001
1002 working on getting everything to compile again: Cygwin, non-MULE,
1003 pdump. not there yet.
1004
1005 mswindows-multibyte is now defined using chain, and works. removed
1006 most vestiges of the mswindows-multibyte coding system type.
1007
1008 file-coding is on by default; should default to binary only on Unix.
1009 Need to test. (Needs to compile first :-)
1010
1011 August 26, 2001:
1012
1013 I've fixed the issue of inputting non-ASCII text under -nuni, and done
1014 some of the work on the Russian C-x problem -- we now compute the
1015 other possibilities. We still need to fix the key-lookup code,
1016 though, and that code is unfortunately a bit ugly. the best way, it
1017 seems, is to expand the command-builder structure so you can specify
1018 different interpretations for keys. (if we do find an alternative
1019 binding, though, we need to mess with both the command builder and
1020 this-command-keys, as does the function-key stuff. probably need to
1021 abstract that munging code.)
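
A Python sketch of the lookup with alternative interpretations. It assumes each keystroke already carries the characters it would produce under the candidate layouts, in priority order (the ordering is discussed in the August 24 entry); the keymap representation and `lookup_key` are hypothetical.

```python
def lookup_key(keymap, modifiers, interpretations):
    """Try each interpretation of the keystroke until one has a binding.

    Returns (char_used, binding), or None if no interpretation is bound.
    """
    for char in interpretations:
        binding = keymap.get((modifiers, char))
        if binding is not None:
            return char, binding
    return None


keymap = {(("control",), "x"): "Control-X-prefix"}

# On a Russian layout the physical X key produces U+0447; the US English
# interpretation of the same key comes later in the list.
result = lookup_key(keymap, ("control",), ["\u0447", "x"])
```

Whichever interpretation is finally used would also have to be written back into the command builder and this-command-keys, which is the munging mentioned above.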
1022
1023 high-priority:
1024
1025 [currently doing]
1026
1027 -- support for WM_IME_CHAR. IME input can work under -nuni if we use
1028 WM_IME_CHAR. probably we should always be using this, instead of
1029 snarfing input using WM_IME_COMPOSITION. i'll check this out.
1030 -- Russian C-x problem. see above.
1031
1032 [clean-up]
1033
1034 -- make sure it compiles and runs under non-mule. remember that some
1035 code needs the unicode support, or at least a simple version of it.
1036 -- make sure it compiles and runs under pdump. see below.
1037 -- clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE]
1038 -- eliminate last vestiges of codepage<->charset conversion and similar stuff.
1039
1040 [other]
1041 -- cut and paste. see below.
1042 -- misc issues with handling lang environments. see also August 25,
1043 "finally: working on the C-x in ...".
1044 -- when switching lang env, needs to set keyboard layout.
1045 -- user var to control whether, when moving into text of a
1046 particular language, we set the appropriate keyboard layout. we
1047 would need to have a lisp api for retrieving and setting the
1048 keyboard layout, set text properties to indicate the layout of
1049 text, and have a way of dealing with text with no property on
1050 it. (e.g. saved text has no text properties on it.) basically,
1051 we need to get a keyboard layout from a charset; getting a
1052 language would do. Perhaps we need a table that maps charsets
1053 to language environments.
1054 -- test that the lang env is properly set at startup. test that
1055 switching the lang env properly sets the C locale (call
1056 setlocale(), set LANG, etc.) -- a spawned subprogram should have
1057 the new locale in its environment.
1058 -- look through everything below and see if anything is missed in this
1059 priority list, and if so add it. create a separate file for the
1060 priority list, so it can be updated as appropriate.
1061
1062
1063 mid-priority:
1064
1065 -- clean up the chain coding system. its list should specify decode
1066 order, not encode; i now think this way is more logical. it should
1067 check the endpoints to make sure they make sense. it should also
1068 allow for the specification of "reverse-direction coding systems":
1069 use the specified coding system, but invert the sense of decode and
1070 encode.
1071
1072 -- along with that, places that take an arbitrary coding system and
1073 expect the ends to be anything specific need to check this, and add
1074 the appropriate conversions from byte->char or char->byte.
1075
1076 -- get some support for arabic, thai, vietnamese, japanese jisx 0212:
1077 at least get the unicode information in place and make sure we have
1078 things tied together so that we can display them. worry about r2l
1079 some other time.
1080
1081 August 25, 2001:
1082
1083 There is actually more non-Unicode-ized stuff, but it's basically
1084 inconsequential. (See previous note.) You can check using the file
1085 nmkun.txt (#### RENAME), which is just a list of all the routines that
1086 have been split. (It was generated from the output of `nmake
1087 unicode-encapsulate', after removing everything from the output but
1088 the function names.) Use something like
1089
1090 fgrep -f ../nmkun.txt -w [a-hj-z]*.[ch] |m
1091
1092 in the source directory, which does a word match and skips
1093 intl-unicode-win32.[ch] and intl-win32.[ch], which have a whole lot of
1094 references to these, unavoidably. It effectively detects what needs
1095 to be changed because changed versions either begin qxe... or end with
1096 A or W, and in each case there's no whole-word match.
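
The reason the whole-word match works can be shown with a short Python sketch. It mimics the effect of `fgrep -w` with a regular expression and is only an approximation of that tool; `CreateFile` is used as an example of a split routine, and `unconverted_calls` is an invented helper.

```python
import re


def unconverted_calls(source: str, split_names):
    """Return (line_number, line) pairs that still use a bare split name."""
    alternatives = "|".join(re.escape(name) for name in split_names)
    pattern = re.compile(r"\b(?:%s)\b" % alternatives)
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), 1)
        if pattern.search(line)
    ]


sample = "\n".join([
    "h = CreateFile (name);",     # still needs converting
    "h = qxeCreateFile (name);",  # converted: prefix breaks the word match
    "h = CreateFileW (wname);",   # explicit W version: suffix breaks it
])
hits = unconverted_calls(sample, ["CreateFile"])
```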
1097
1098 The nasty bug has been fixed below. The -nuni option now works -- all
1099 specially-written code to handle the encapsulation has been tested by
1100 some operation (fonts by loadup and checking the output of (list-fonts
1101 ""); devmode by printing; dragdrop tests other stuff).
1102
1103 NOTE: for -nuni (Win 95), these areas need work:
1104
1105 -- cut and paste. we should be able to receive Unicode text if it's
1106 there, and we should be able to receive it even in Win 95 or -nuni.
1107 we should just check in all circumstances. also, under 95, when we
1108 put some text in the clipboard, it may or may not also be
1109 automatically enumerated as unicode. we need to test this out
1110 and/or just go ahead and manually do the unicode enumeration.
1111
1112 -- receiving keyboard input. we get only a single byte, but we should
1113 be able to correlate the language of the keyboard layout to a
1114 particular code page, so we can then decode it correctly.
1115
1116 -- mswindows-multibyte. still implemented as its own thing. should
1117 be done as a chain of (encoding) unicode | unicode-to-multibyte.
1118 need to turn this on, get it working, and look into optimizations
1119 in the dfc stuff. (#### perhaps there's a general way to do these
1120 optimizations??? something like having a method on a coding system
1121 that can specify whether a pure-ASCII string gets rendered as
1122 pure-ASCII bytes and vice-versa.)
1123
1124
1125 ALSO:
1126
1127 -- we have special macros TSTR_TO_C_STRING and such because formerly
1128 the DFC macros didn't know about external stuff that was Unicode
1129 encoded and would call strlen() on them. this is fixed, so now we
1130 should undo the special macros, make em normal, remove the
1131 comments about this, and make sure it works. [DONE]
1132
1133
1134 -- finally: working on the C-x in Russian key layout problem. in the
1135 process will probably end up doing work on cleaning up the handling
1136 of keyboard layouts, integrating or deleting the FSF stuff, adding
1137 code to change the keyboard layout as we move in and out of text in
1138 different languages (implemented as a post-command-hook; we need
1139 something like internal-post-command-hook if not already there, for
1140 internal stuff that doesn't want to get mixed up with the regular
1141 post-command-hook; similar for pre-command-hook). also, when
1142 langenv changes, ways to set the keyboard layout appropriately.
1143
1144 -- i think the stuff above is higher priority than the other stuff
1145 mentioned below. what i'm aiming for is to be able to input and
1146 work with multiple languages without weird glitches, both under 95
1147 and NT. the problems above are all basic impediments to such work.
1148 we assume for the moment that the user can make use of the existing
1149 file i/o conversion stuff, and put that lower in priority, after
1150 the basic input is working.
1151
1152 -- i should get my modem connected and write up what's going on and
1153 send it to the lists; also cvs commit my workspaces and get more
1154 testers.
1155
1156 August 24, 2001:
1157
1158 All code has been Unicode-ized except for some stuff in console-msw.c
1159 that deals with console output. Much of the Unicode-encapsulation
1160 stuff, particularly the hand-written stuff, really needs testing. I
1161 added a new command-line option, -nuni, to force use of all ANSI calls
1162 -- XE_UNICODEP evaluates to false in this case.
1163
1164 There is a nasty bug that appeared recently, probably when the event
1165 code got Unicode-ized -- bad interactions with OS sticky modifiers.
1166 Hold the shift key down and release it, then instead of affecting the
1167 next char only, it gets permanently stuck on (until you do a regular
1168 shift+char stroke). This needs to be debugged.
1169
1170 Other things on agenda:
1171
1172 -- go through and prioritize what's listed below.
1173
1174 -- make sure the pdump code can compile and work. for the moment we
1175 just don't try to dump any Unicode tables and load them up each
1176 time. this is certainly fast but ...
1177
1178 -- there's the problem that XEmacs can't be run in a directory with
1179 non-ASCII/Latin-1 chars in it, since it will be doing Unicode
1180 processing before we've had a chance to load the tables. In fact,
1181 even finding the tables in such a situation is problematic using
1182 the normal commands. my idea is to eventually load the stuff
1183 extremely extremely early, at the same time as the pdump data gets
1184 loaded. in fact, the unicode table data (stored in an efficient
1185 binary format) can even be stuck into the pdump file (which would
1186 mean as a resource to the executable, for windows). we'd need to
1187 extend pdump a bit: to allow for attaching extra data to the pdump
1188 file. (something like pdump_attach_extra_data (addr, length)
1189 returns a number of some sort, an index into the file, which you
1190 can then retrieve with pdump_load_extra_data(), which returns an
1191 addr (mmap()ed or loaded), and later you pdump_unload_extra_data()
1192 when finished. we'd probably also need
1193 pdump_attach_extra_data_append(), which appends data to the data
1194 just written out with pdump_attach_extra_data(). this way,
1195 multiple tables in memory can be written out into one contiguous
1196 table. (we'd use the tar-like trick of allowing new blocks to be
1197 written without going back to change the old blocks -- we just rely
1198 on the end of file/end of memory.) this same mechanism could be
1199 extracted out of pdump and used to handle the non-pdump situation
1200 (or alternatively, we could just dump either the memory image of
1201 the tables themselves or the compressed binary version). in the
1202 case of extra unicode tables not known about at compile time that
1203 get loaded before dumping, we either just dump them into the image
1204 (pdump and all) or extract them into the compressed binary format,
1205 free the original tables, and treat them like all other tables.
1206
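A rough sketch in C of the append-only layout described in the item above, assuming an in-memory buffer in place of the real dump file or Windows resource. The names extra_file, extra_attach, extra_attach_append and extra_next are hypothetical stand-ins for the proposed pdump_attach_extra_data() family; none of them exist in pdump. Each record is a length followed by its data, and a reader walks records until it reaches the end of the buffer, so nothing already written ever has to be patched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* In-memory stand-in for the extra-data area of the dump file. */
typedef struct {
  unsigned char *bytes;  /* contents written so far */
  size_t used;           /* current "end of file" */
} extra_file;

static void
extra_write (extra_file *f, const void *data, size_t len)
{
  unsigned char *p;
  if (len == 0)
    return;
  p = realloc (f->bytes, f->used + len);
  assert (p != NULL);
  memcpy (p + f->used, data, len);
  f->bytes = p;
  f->used += len;
}

/* Write one record: a length header followed by the data.  Returns the
   record's offset, which plays the role of the index handed back to the
   caller. */
size_t
extra_attach (extra_file *f, const void *addr, size_t length)
{
  size_t index = f->used;
  extra_write (f, &length, sizeof length);
  extra_write (f, addr, length);
  return index;
}

/* Appending is just another record at the end; earlier records are
   never modified (the tar-like trick). */
void
extra_attach_append (extra_file *f, const void *addr, size_t length)
{
  (void) extra_attach (f, addr, length);
}

/* Walk records starting at *offset.  Returns 1 and fills in DATA and
   LENGTH while records remain, 0 once the end of the buffer is reached. */
int
extra_next (const extra_file *f, size_t *offset,
            const unsigned char **data, size_t *length)
{
  if (*offset + sizeof *length > f->used)
    return 0;
  memcpy (length, f->bytes + *offset, sizeof *length);
  *data = f->bytes + *offset + sizeof *length;
  *offset += sizeof *length + *length;
  return 1;
}
```

A caller would keep the index returned by extra_attach() and later walk from that offset with extra_next(). In this simplified form, everything from the index to the end of the buffer is treated as one logical table.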
1207 -- `C-x b' when using a Russian keyboard layout. XEmacs currently
1208 tries to interpret C+cyrillic char, which causes an error. We want
1209 C-x b to still work even when the keyboard normally generates
1210 Cyrillic. What we should do is expand the keyboard event structure
1211 so that it contains not only the actual char, but what the char
1212 would have been in various other keyboard layouts, and in contexts
1213 where only certain keystrokes make sense (creating control chars,
1214 and looking up in keymaps), we proceed in order, processing each of
1215 them until we get something. order should be something like:
1216 current keyboard layout; layout of the current language
1217 environment; layout of the user's default language; layout of the
1218 system default language; layout of US English.
1219
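The fallback order in the item above could look roughly like the following C sketch. The key_event structure, the layout_slot enum and both function names are hypothetical and are not the actual XEmacs event structure; the only point is that the event carries one candidate character per layout and the lookup takes the first candidate that yields a control character.

```c
#include <assert.h>
#include <stddef.h>

/* Candidate layouts, in the order they should be tried. */
enum layout_slot {
  LAYOUT_CURRENT,         /* current keyboard layout */
  LAYOUT_LANG_ENV,        /* layout of the current language environment */
  LAYOUT_USER_DEFAULT,    /* layout of the user's default language */
  LAYOUT_SYSTEM_DEFAULT,  /* layout of the system default language */
  LAYOUT_US_ENGLISH,      /* layout of US English */
  LAYOUT_COUNT
};

/* What the physical key would have produced in each layout;
   0 means "no character in that layout". */
typedef struct {
  unsigned int alt_chars[LAYOUT_COUNT];
} key_event;

/* Return the control character for C, or -1 if C has none. */
static int
make_control_char (unsigned int c)
{
  if (c >= 'a' && c <= 'z')
    c -= 'a' - 'A';
  if (c >= '@' && c <= '_')
    return (int) (c - '@');
  return -1;
}

/* Try each layout in order until one gives a usable control char. */
int
event_to_control_char (const key_event *ev)
{
  int i;
  assert (ev != NULL);
  for (i = 0; i < LAYOUT_COUNT; i++)
    {
      int ctl = ev->alt_chars[i] ? make_control_char (ev->alt_chars[i]) : -1;
      if (ctl >= 0)
        return ctl;
    }
  return -1;
}
```

With a Cyrillic character in the first slots and `x' in the US English slot, the lookup falls through to the last slot and produces C-x. The same loop shape would apply to keymap lookup.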
1220 -- reading and writing Unicode files. multiple problems:
1221
1222 -- EOL's aren't handled right. for the moment, just fix the
1223 Unicode coding systems; later on, create EOL-only coding
1224 systems:
1225
1226 1. they would be character->character and operate next to the
1227 internal data; this means that coding systems need to be able
1228 to handle ends of lines that are either CR, LF, or CRLF.
1229 usually this isn't a problem, as they are just characters
1230 like any other and get encoded appropriately. however,
1231 coding systems that are line-oriented need to recognize any
1232 of the three as line endings.
1233
1234 2. we'd also have to complete the stuff that handles coding
1235 systems where either end can be byte or char (four
1236 possibilities total; use a single enum such as
1237 ENCODES_CHAR_TO_BYTE, ENCODES_BYTE_TO_BYTE, etc.).
1238
1239 3. we'd need ways of specifying the chaining of coding systems.
1240 e.g. when reading a coding system, a user can specify more
1241 than one with a | symbol between them. when a context calls
1242 for a coding system and a chain is needed, the `chain' coding
1243 system is useful; but we should really expand the contexts
1244 where a list of coding systems can be given, and whenever
1245 possible try to inline the chain instead of using a
1246 surrounding `chain' coding system.
1247
1248 4. the `chain' needs some work so that it passes all sorts of
1249 lstream commands down to the chain inside it -- it should be
1250 entirely transparent and the fact that there's actually a
1251 surrounding coding system should be invisible. more general
1252 coding system methods might need to be created.
1253
1254 5. important: we need a way of specifying how detecting works
1255 when we have more than one coding system. we might need more
1256 than a single priority list. need to think about this.
1257
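For points 2 and 3 above, a minimal C sketch of the single enum and of a consistency check for a chain, looking at the encoding direction only. ENCODES_CHAR_TO_BYTE and ENCODES_BYTE_TO_BYTE are the names suggested in the text; the other two enum names, the end_type enum and chain_is_consistent() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum end_type { END_BYTE, END_CHAR };

/* The four possibilities for what a coding system's two ends are,
   named for the encoding direction. */
enum encode_kind {
  ENCODES_CHAR_TO_BYTE,   /* ordinary text coding systems */
  ENCODES_BYTE_TO_BYTE,   /* e.g. something like gzip */
  ENCODES_CHAR_TO_CHAR,   /* e.g. an EOL-only coding system */
  ENCODES_BYTE_TO_CHAR
};

static enum end_type
input_end (enum encode_kind k)
{
  return (k == ENCODES_BYTE_TO_BYTE || k == ENCODES_BYTE_TO_CHAR)
    ? END_BYTE : END_CHAR;
}

static enum end_type
output_end (enum encode_kind k)
{
  return (k == ENCODES_CHAR_TO_BYTE || k == ENCODES_BYTE_TO_BYTE)
    ? END_BYTE : END_CHAR;
}

/* A chain a | b | c is usable for encoding only if each link's output
   type matches the next link's input type. */
int
chain_is_consistent (const enum encode_kind *links, size_t n)
{
  size_t i;
  assert (links != NULL || n == 0);
  for (i = 0; i + 1 < n; i++)
    if (output_end (links[i]) != input_end (links[i + 1]))
      return 0;
  return 1;
}
```

Under these assumptions, an EOL-only system followed by a text coding system followed by a byte-to-byte system is consistent, while two char-to-byte systems in a row are not.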
1258 -- Unicode files beginning with the BOM are not recognized as such.
1259 we need to fix this; but to make things sensible, we really need
1260 to add the idea of different levels of confidence regarding
1261 what's detected. otherwise, Unicode says "yes this is me" but
1262 others higher up do too. in the process we should probably
1263 finish abstracting the detection system and fix up some
1264 stupidities in it.
1265
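One way to picture the "levels of confidence" idea is the following C sketch. The confidence enum, the detector structure and all function names are hypothetical, and the Unicode detectors only look at the BOM. Each detector reports how sure it is, and the winner is the highest confidence, with ties going to whoever is earlier in the priority list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical confidence levels, lowest to highest. */
enum confidence { CONF_NONE, CONF_POSSIBLE, CONF_LIKELY, CONF_NEAR_CERTAIN };

typedef enum confidence (*detect_fn) (const unsigned char *buf, size_t len);

typedef struct {
  const char *name;
  detect_fn detect;
} detector;

static int
has_prefix (const unsigned char *buf, size_t len,
            const unsigned char *bom, size_t bom_len)
{
  return len >= bom_len && memcmp (buf, bom, bom_len) == 0;
}

/* A raw/binary detector accepts anything, but only weakly. */
enum confidence
detect_raw (const unsigned char *buf, size_t len)
{
  (void) buf; (void) len;
  return CONF_POSSIBLE;
}

enum confidence
detect_utf8_bom (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
  return has_prefix (buf, len, bom, sizeof bom) ? CONF_NEAR_CERTAIN : CONF_NONE;
}

enum confidence
detect_utf16_le_bom (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[] = { 0xFF, 0xFE };
  return has_prefix (buf, len, bom, sizeof bom) ? CONF_NEAR_CERTAIN : CONF_NONE;
}

enum confidence
detect_utf16_be_bom (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[] = { 0xFE, 0xFF };
  return has_prefix (buf, len, bom, sizeof bom) ? CONF_NEAR_CERTAIN : CONF_NONE;
}

/* Priority list with the raw detector deliberately first, to mimic a
   higher-priority system that also says "yes". */
const detector default_priority[] = {
  { "raw", detect_raw },
  { "utf-8", detect_utf8_bom },
  { "utf-16-le", detect_utf16_le_bom },
  { "utf-16-be", detect_utf16_be_bom }
};

/* Highest confidence wins; priority order only breaks ties. */
const char *
pick_coding_system (const detector *list, size_t n,
                    const unsigned char *buf, size_t len)
{
  const char *best = NULL;
  enum confidence best_conf = CONF_NONE;
  size_t i;
  assert (list != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      enum confidence c = list[i].detect (buf, len);
      if (c > best_conf)
        {
          best_conf = c;
          best = list[i].name;
        }
    }
  return best;
}
```

With a plain yes/no answer, the raw detector at the head of the list would always win; with confidence levels, a file starting with FF FE is picked up as utf-16-le even though raw comes first.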
1266 -- When writing a file, we need error detection; otherwise somebody
1267 will create a Unicode file without realizing the coding system
1268 of the buffer is Raw, and then lose all the non-ASCII/Latin-1
1269 text when it's written out. We need two levels of checking:
1270
1271 1. first, a "safe-charset" level that checks before any actual
1272 encoding to see if all characters in the document can safely
1273 be represented using the given coding system. FSF has a
1274 "safe-charset" property of coding systems, but it's stupid
1275 because this information can be automatically derived from
1276 the coding system, at least the vast majority of the time.
1277 What we need is some sort of
1278 alternative-coding-system-precedence-list, langenv-specific,
1279 where everything on it can be checked for safe charsets and
1280 then the user given a list of possibilities. When the user
1281 does "save with specified encoding", they should see the same
1282 precedence list. Again, as with other precedence lists,
1283 there's also a global one, and presumably all coding systems
1284 not on the other lists get appended to the end (and perhaps not
1285 checked at all when doing safe-checking?). safe-checking
1286 should work something like this: compile a list of all
1287 charsets used in the buffer, along with a count of chars
1288 used. that way, "slightly unsafe" charsets can perhaps be
1289 presented at the end, which will lose only a few characters
1290 and are perhaps what the users were looking for.
1291
1292 2. when actually writing out, we need error checking in case an
1293 individual char in a charset can't be written even though the
1294 charsets are safe. again, the user gets the choice of other
1295 reasonable coding systems.
1296
1297 3. same thing (error checking, list of alternatives, etc.) needs
1298 to happen when reading! all of this will be a lot of work!
1299
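The safe-checking pass from point 1 can be sketched in C as follows, under heavy simplification: the buffer is modeled as an array with one charset id per character, and each coding system carries a bitmask of the charsets it can represent. The charset_id enum, coding_system_info and both functions are hypothetical names, not existing XEmacs code.

```c
#include <assert.h>
#include <stddef.h>

enum charset_id { CS_ASCII, CS_LATIN1, CS_CYRILLIC, CS_JAPANESE, CS_COUNT };

typedef struct {
  const char *name;
  unsigned int charset_mask;  /* bit i set if charset i is representable */
} coding_system_info;

/* Compile the list of charsets used, along with a count of chars. */
void
count_charsets (const enum charset_id *chars, size_t n,
                size_t counts[CS_COUNT])
{
  size_t i;
  for (i = 0; i < CS_COUNT; i++)
    counts[i] = 0;
  for (i = 0; i < n; i++)
    {
      assert ((int) chars[i] >= 0 && (int) chars[i] < CS_COUNT);
      counts[chars[i]]++;
    }
}

/* Number of characters that could not be written with CS.  Zero means
   the coding system is safe; a small number means "slightly unsafe". */
size_t
chars_lost (const coding_system_info *cs, const size_t counts[CS_COUNT])
{
  size_t lost = 0;
  int i;
  for (i = 0; i < CS_COUNT; i++)
    if (!(cs->charset_mask & (1u << i)))
      lost += counts[i];
  return lost;
}
```

Sorting the candidates on a precedence list by chars_lost() would put the fully safe coding systems first and the slightly unsafe ones at the end. The per-character check from point 2 would still be needed at write time.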
1300
1301
1302 Announcement, August 20, 2001:
1303
1304 I'm looking for testers. There is a complete and fast implementation
1305 in C of Unicode conversion, translations for almost all of the
1306 standardly-defined charsets that load up automatically and
1307 instantaneously at runtime, coding systems supporting the common
1308 external representations of Unicode [utf-16, ucs-4, utf-8,
1309 little-endian versions of utf-16 and ucs-4; utf-7 is sitting there
1310 with abort()s where the coding routines should go, just waiting for
1311 somebody to implement], and a nice set of primitives for translating
1312 characters<->codepoints and setting the priority lists used to control
1313 codepoint->char lookup.
1314
1315 It's so far hooked into one place: the Windows IME. Currently I can
1316 select the Japanese IME from the thing on my tray pad in the lower
1317 right corner of the screen, and type Japanese into XEmacs, and you get
1318 Japanese in XEmacs -- regardless of whether you set either your
1319 current or global system locale to Japanese, and regardless of whether
1320 you set your XEmacs lang env as Japanese. This should work for many
1321 other languages, too -- Cyrillic, Chinese either Traditional or
1322 Simplified, and many others, but YMMV. There may be some lurking
1323 bugs (hardly surprising for something so raw).
1324
1325 To get at this, check out using `ben-mule-21-5', NOT the simpler
1326 `mule-21-5'. For example:
1327
1328 cvs -d :pserver:xemacs@cvs.xemacs.org:/usr/CVSroot checkout -r ben-mule-21-5 xemacs
1329
1330 or you get the idea. the `-r ben-mule-21-5' is important.
1331
1332 I keep track of my progress in a file called README.ben-mule-21-5 in
1333 the root directory of the source tree.
1334
1335 WARNING: Pdump might not work. Will be fixed rsn.
1336
1337 August 20, 2001:
1338
1339 -- still need to sort out demand loading, binary format, etc. figure
1340 out what the goals are and how we're going to achieve them. for
1341 the moment let's just say that running XEmacs in a directory with
1342 Japanese or other weird characters in the name is likely to cause
1343 problems under MS Windows, but once XEmacs is initialized (and
1344 before processing init files), all Unicode support is there.
1345
1346 -- wrote the size computation routines, although not yet tested.
1347
1348 -- lots more abstraction of coding systems; almost done.
1349
1350 -- UNICODE WORKS!!!!!
1351
1352
1353 August 19, 2001:
1354
1355 Still needed on the Unicode support:
1356
1357 -- demand loading: load the Unicode table data the first time a
1358 conversion needs to be done.
1359
1360 -- maybe: table size computation: figure out how big the in-memory
1361 tables actually are.
1362
1363 -- maybe: create a space-efficient binary format for the data, and a
1364 way to dump out an existing charset's data into this binary format.
1365 it should allow for many such groups of data to be appended
1366 together in one file, such that you can just append the new data
1367 onto the end and not have to go back and modify anything
1368 previously. (like how tar archives work, and how UDF for
1369 CD-R's and CD-RW's works.)
1370
1371 -- maybe: figure out how to be able to access the Unicode tables at
1372 init_intl() time, before we know how to get at data-directory; that
1373 way we can handle the need for unicode conversions that come up
1374 very early, for example if XEmacs is run from a directory
1375 containing Japanese in it. Presumably we'd want to generalize the
1376 stuff in pdump.c that deals with the dumper file, so that it can
1377 handle other files -- putting the file either in the directory of
1378 the executable or in a resource, maybe actually attached to the
1379 pdump file itself -- or maybe we just dump the data into the actual
1380 executable. With pdump we could extend pdump to allow for data
1381 that's in the pdump file but not actually mapped at startup,
1382 separate from the data that does get mapped -- and then at runtime
1383 the pointer gets restored not with a real pointer but an offset
1384 into the file; another pdump call and we get some way to access the
1385 data. (tricky because it might be in a resource, not a file. we
1386 might have to just tell pdump to mmap or whatever the data in, and
1387 then tell pdump to release it.)
1388
1389 -- fix multibyte to use unicode. at first, just reverse
1390 mswindows-multibyte-to-unicode to be unicode-to-multibyte; later
1391 implement something in chain to allow for reversal, for declaring
1392 the ends of the coding systems, etc.
1393
1394 -- actually make sure that the IME stuff is working!!!
1395
1396 Other things before announcing:
1397
1398 -- change so that the Unicode tables are not pdumped. This means we
1399 need to free any table data out there. Make sure that pdump
1400 compiles and try to finish the pretty-much-already-done stuff
1401 already with XD_STRUCT_ARRAY and dynamic size computation; just
1402 need to see what's going on with LO_LINK.
1403
1404 August 14, 2001:
1405
1406 To do a diff between this workspace and the mainline, use the most recent sync tags, currently:
1407
1408 cvs diff -r main-branch-ben-mule-21-5-aug-11-2001-sync -r ben-mule-21-5-post-aug-11-2001-sync
1409
1410 Unicode support:
1411
1412 Unicode support is important for supporting many languages under
1413 Windows, such as Cyrillic, without resorting to translation tables for
1414 particular Windows-specific code pages. Internally, all characters in
1415 Windows can be represented in two encodings: code pages and Unicode.
1416 With Unicode support, we can seamlessly support all Windows
1417 characters. Currently, the test case driving the Unicode work is
1418 whether IME input works properly, since it is converted from Unicode.
1419
1420 Unicode support also requires that the various Windows API's be
1421 "Unicode-encapsulated", so that they automatically call the ANSI or
1422 Unicode version of the API call appropriately and handle the size
1423 differences in structures. What this means is:
1424
1425 -- first, note that Windows already provides a sort of encapsulation
1426 of all API's that deal with text. All such API's are underlyingly
1427 provided in two versions, with an A or W suffix (ANSI or "wide"
1428 i.e. Unicode), and the compile-time constant UNICODE controls which
1429 is selected by the unsuffixed API. Same thing happens with
1430 structures. Unfortunately, this is compile-time only, not
1431 run-time, so not sufficient. (Creating the necessary run-time
1432 encapsulation is not conceptually difficult, but very time-consuming to
1433 write. It adds no significant overhead, and the only reason it's
1434 not standard in Windows is conscious marketing attempts by
1435 Microsoft to cripple Windows 95. FUCK MICROSOFT! They even
1436 describe in a KnowledgeBase article exactly how to create such an
1437 API [although we don't exactly follow their procedure], and point
1438 out its usefulness; the procedure is also described more generally
1439 in Nadine Kano's book on Win32 internationalization -- written SIX
1440 YEARS AGO! Obviously Microsoft has such an API available
1441 internally.)
1442
1443 -- what we do is provide an encapsulation of each standard Windows API
1444 call that is split into A and W versions. current theory is to
1445 avoid all preprocessor games; so we name the function with a prefix
1446 -- "qxe" currently -- and require callers to use the prefixed name.
1447 Callers need to explicitly use the W version of all structures, and
1448 convert text themselves using Qmswindows_tstr. the qxe
1449 encapsulated version will automatically call the appropriate A or W
1450 version depending on whether we're running on 9x or NT, and copy
1451 data between W and A versions of the structures as necessary.
1452
1453 -- We require the caller to handle the actual translation of text to
1454 avoid possible overflow when dealing with fixed-size Windows
1455 structures. There are no such problems when copying data between
1456 the A and W versions because ANSI text is never larger than its
1457 equivalent Unicode representation.
1458
1459 -- We allow for incremental creation of the encapsulated routines by
1460 using the coding system Qmswindows_tstr_notyet. This is an alias
1461 for Qmswindows_multibyte, i.e. it always converts to ANSI; but it
1462 indicates that it will be changed to Qmswindows_tstr when we have a
1463 qxe version of the API call that the data is being passed to and
1464 change the code to use the new function.
1465
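The shape of a qxe-style wrapper can be shown with a portable mock in C. Nothing below is a real Win32 call or real XEmacs code: mock_font_a, mock_font_w, the two mock_create_font functions, running_on_nt and last_call are all invented for illustration. The caller always fills in the W structure with text it has already converted, and the wrapper decides at run time which version to call, copying into the A structure when needed.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

#define FACE_LEN 32

/* Mock "A" and "W" versions of the same structure. */
typedef struct { int height; char    face[FACE_LEN]; } mock_font_a;
typedef struct { int height; wchar_t face[FACE_LEN]; } mock_font_w;

int running_on_nt;   /* would be decided once at startup */
char last_call;      /* 'A' or 'W': which mock API ran last */

/* Mock A and W entry points; each returns the length of the face name. */
static int
mock_create_font_a (const mock_font_a *f)
{
  last_call = 'A';
  return (int) strlen (f->face);
}

static int
mock_create_font_w (const mock_font_w *f)
{
  last_call = 'W';
  return (int) wcslen (f->face);
}

/* The encapsulated version: callers pass the W structure only. */
int
qxe_mock_create_font (const mock_font_w *f)
{
  assert (f != NULL);
  if (running_on_nt)
    return mock_create_font_w (f);
  else
    {
      /* The caller has stored narrow text in the W structure's buffer;
         copy the fields across into the A structure. */
      mock_font_a a;
      a.height = f->height;
      memcpy (a.face, f->face, sizeof a.face);
      a.face[sizeof a.face - 1] = '\0';
      return mock_create_font_a (&a);
    }
}
```

In the real scheme the caller's text would come from a conversion with Qmswindows_tstr rather than from a literal, and the choice between A and W would depend on whether the process is running on 9x or NT.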
1466 Besides creating the encapsulation, the following needs to be done for
1467 Unicode support:
1468
1469 -- No actual translation tables are fed into XEmacs. We need to
1470 provide glue code to read the tables in etc/unicode. See
1471 etc/unicode/README for the interface to implement.
1472
1473 -- Fix pdump. The translation tables for Unicode characters function
1474 as unions of structures with different numbers of indirection
1475 levels, in order to be efficient. pdump doesn't yet support such
1476 unions. charset.h has a general description of how the translation
1477 tables work, and the pdump code has constants added for the new
1478 required data types, and descriptions of how these should work.
1479
1480 -- ultimately, there's no end to additional work (composition, bidi
1481 reordering, glyph shaping/ordering, etc.), but the above is enough
1482 to get basic translation working.
1483
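To make the "unions of structures with different numbers of indirection levels" in the pdump item concrete, here is a small C sketch. It is only an assumption about the general shape (charset.h has the real description): the table starts with one level, grows to two levels when a code point above 0xFF is stored, and handles code points up to 0xFFFF only. The names to_charset_table, table_set and table_get are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define UNMAPPED 0

typedef struct {
  int levels;              /* 0 = empty, 1 or 2 = number of levels */
  union {
    unsigned short *l1;    /* levels == 1: indexed by the low byte */
    unsigned short **l2;   /* levels == 2: high byte, then low byte */
  } u;
} to_charset_table;

static unsigned short *
new_leaf (void)
{
  unsigned short *p = calloc (256, sizeof *p);
  assert (p != NULL);
  return p;
}

void
table_set (to_charset_table *t, unsigned int ucs, unsigned short val)
{
  assert (ucs <= 0xFFFF);
  if (t->levels == 0)
    {
      t->levels = 1;
      t->u.l1 = new_leaf ();
    }
  if (t->levels == 1 && ucs > 0xFF)
    {
      /* Grow: the existing leaf becomes entry 0 of a new top level. */
      unsigned short **top = calloc (256, sizeof *top);
      assert (top != NULL);
      top[0] = t->u.l1;
      t->u.l2 = top;
      t->levels = 2;
    }
  if (t->levels == 1)
    t->u.l1[ucs] = val;
  else
    {
      unsigned short **top = t->u.l2;
      if (top[ucs >> 8] == NULL)
        top[ucs >> 8] = new_leaf ();
      top[ucs >> 8][ucs & 0xFF] = val;
    }
}

unsigned short
table_get (const to_charset_table *t, unsigned int ucs)
{
  if (t->levels == 0 || ucs > 0xFFFF)
    return UNMAPPED;
  if (t->levels == 1)
    return ucs <= 0xFF ? t->u.l1[ucs] : UNMAPPED;
  else
    {
      const unsigned short *leaf = t->u.l2[ucs >> 8];
      return leaf ? leaf[ucs & 0xFF] : UNMAPPED;
    }
}
```

The point for pdump is that which union member is live depends on the levels field, so a dumper needs a description that can select the right member at dump time.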
1484 Merging this workspace into the trunk requires some work. ChangeLogs
1485 have not yet been created. Also, there is a lot of additional code in
1486 this workspace other than just Windows and Unicode stuff. Some of the
1487 changes have been somewhat disruptive to the code base, in particular:
1488
1489 -- the code that handles the details of processing multilingual text
1490 has been consolidated to make it easier to extend it. it has been
1491 yanked out of various files (buffer.h, mule-charset.h, lisp.h,
1492 insdel.c, fns.c, file-coding.c, etc.) and put into text.c and
1493 text.h. mule-charset.h has also been renamed charset.h. all long
1494 comments concerning the representations and their processing have
1495 been consolidated into text.c.
1496
1497 -- nt/config.h has been eliminated and everything in it merged into
1498 config.h.in and s/windowsnt.h. see config.h.in for more info.
1499
1500 -- s/windowsnt.h has been completely rewritten, and s/cygwin32.h and
1501 s/mingw32.h have been largely rewritten. tons of dead weight has
1502 been removed, and stuff common to more than one file has been
1503 isolated into s/win32-common.h and s/win32-native.h, similar to
1504 what's already done for usg variants.
1505
1506 -- large amounts of code throughout the code base have been Mule-ized,
1507 not just Windows code.
1508
1509 -- file-coding.c/.h have been largely rewritten (although still mostly
1510 syncable); see below.
1511
1512
1513
1514 June 26, 2001:
1515
1516 -- ben-mule-21-5
1517
1518 this contains all the mule work i've been doing. this includes mostly
1519 work done to get mule working under ms windows, but in the process
1520 i've [of course] fixed a whole lot of other things as well, mostly
1521 mule issues. the specifics:
1522
1523 - it compiles and runs under windows and should basically work. the
1524 stuff remaining to do is (a) improved unicode support (see below)
1525 and (b) smarter handling of keyboard layouts. in particular, it
1526 should (1) set the right keyboard layout when you change your
1527 language environment; (2) optionally (a user var) set the
1528 appropriate keyboard layout as you move the cursor into text in a
1529 particular language.
1530
1531 - i added a bunch of code to better support OS locales. it tries to
1532 notice your locale at startup and set the language environment
1533 accordingly (this more or less works), and call setlocale() and set
1534 LANG when you change the language environment (may or may not work).
1535
1536 - major rewriting of file-coding. it's mostly abstracted into coding
1537 systems that are defined by methods (similar to devices and
1538 specifiers), with the ultimate aim being to allow non-i18n coding
1539 systems such as gzip. there is a "chain" coding system that allows
1540 multiple coding systems to be chained together. (it doesn't yet
1541 have the concept that either end of a coding system can be bytes or
1542 chars; this needs to be added.)
1543
1544 - unicode support. very raw. a few days ago i wrote a complete and
1545 efficient implementation of unicode translation. it should be very
1546 fast, and fairly memory-efficient in its tables. it allows for
1547 charset priority lists, which should be language-environment
1548 specific (but i haven't yet written the glue code). it works in
1549 preliminary testing, but obviously needs more testing and work.
1550 as of yet there is no translation data added for the standard charsets.
1551 the tables are in etc/unicode, and all we need is a bit of glue code
1552 to process them. see etc/unicode/README for the interface to
1553 implement.
1554
1555 - support for unicode in windows is partly there. this will work even
1556 on windows 95. the basic model is implemented but it needs finishing
1557 up.
1558
1559 - there is a preliminary implementation of windows ime support courtesy
1560 of ikeyama.
1561
1562 - if you want to get cyrillic working under windows (it appears to "work"
1563 but the wrong chars currently appear), the best way is to add unicode
1564 support for iso-8859-5 and use it in redisplay-msw.c. we are already
1565 passing unicode codepoints to the text-draw routine (ExtTextOutW).
1566 (ExtTextOutW and GetTextExtentPoint32W are implemented on both 95 and NT.)
1567
1568 - i fixed the iso2022 handling so it will correctly read in files
1569 containing unknown charsets, creating a "temporary" charset which
1570 can later be overwritten by the real charset when it's defined.
1571 this allows iso2022 elisp files with literals in strange languages
1572 to compile correctly under mule. i also added a hack that will
1573 correctly read in and write out the emacs-specific "composition"
1574 escape sequences, i.e. ESC 0 through ESC 4. this means that my
1575 workspace correctly compiles the new file devanagari.el that i added
1576 (see below).
1577
1578 - i copied the remaining language-specific files from fsf. i made
1579 some minor changes in certain cases but for the most part the stuff
1580 was just copied and may not work.
1581
1582 - i fixed post-read-conversion in coding systems to follow fsf
1583 conventions. (i also support our convention, for the moment. a
1584 kludge, of course.)
1585
1586 - make-coding-system accepts (but ignores) the additional properties
1587 present in the fsf version, for compatibility.